Multimodal "Deepseek Moments" highlight the divide among major companies: ByteDance emphasizes "efficiency," Kuaishou focuses on "professionalism," and Alibaba concentrates on "e-commerce"!


The wave of multimodal updates at the beginning of the year has been quite rapid: on January 31, Kuaishou released Kling 3.0; on February 7, ByteDance announced Seedance 2.0; and on February 10, ByteDance’s Seedream 5.0 and Alibaba’s Qwen-Image-2.0 further enhanced the foundation for “text-to-image/image editing” capabilities.

Yao Lei of the Huachuang Securities Research Institute made a straightforward assessment in a report on the 12th: video generation is no longer just a flashy tech demo but is evolving into a tool that can be integrated into workflows. "AI video generation is transitioning from blind-box entertainment to precise industrialized production." The report attributes the core obstacle to commercialization to the uncontrollable marginal costs of "gacha" mechanics: the same demand requires repeated generation and rework, and high waste-footage rates consume both time and budget.

The focus of the upgrades in Kling 3.0 and Seedance 2.0 is not simply on improving image quality but on elevating controllability to a higher priority: ensuring subject consistency across shots, semantic adherence to complex instructions, and “editable after generation” capabilities—all aimed at reducing waste footage rates. The research report concludes that technological breakthroughs are laying the foundation for AI video to enter large-scale enterprise workflows, with e-commerce advertising and short/long-form drama production expected to feel the impact sooner.

Further analysis divides the influence into two layers: one is product differentiation—ByteDance more resembles “efficiency infrastructure,” while Kuaishou leans toward “professional storytelling”; the other is a supply-side revolution that redefines cost structures—marginal costs of content production increasingly resemble computing power costs. Corresponding investment clues point toward benefits in content IP, content copyrights, AI video tools/models, and the demand for cloud and platform inference services.

The real breakthrough is in controlling the uncontrollable costs caused by gacha mechanics

The report repeatedly emphasizes a logical chain: the past difficulty in commercializing AI video was not “inability to produce,” but “unstable output.” Using the same script, materials, and prompts, the quality of generated videos fluctuated greatly, forcing creators to generate multiple iterations to gamble on results, which caused marginal costs to spiral out of control.

The report argues that the significance of the new generation of models lies in deprioritizing raw generation capability and putting "controllability" first: through native multimodal architectures, instruction alignment, and reinforced subject consistency and semantic adherence, waste footage rates can be reduced, lowering overall video production costs. As a result, the threshold for commercialization is being redefined—from "can it be done" to "can it be delivered reliably."

Kling 3.0 emphasizes a “blockbuster feel”: prioritizing physical realism and long-form narrative

The key points of Kling 3.0 are summarized as two aspects: systematic upgrade of fundamental capabilities and integrated generation and editing (Omni).

On the video side, the upgrades mainly focus on: stronger subject consistency across multi-shot/continuous action scenes; more detailed parsing of complex text instructions; alleviating pronoun confusion in multi-person frames, with an emphasis on “precise mapping between text and visual characters” (including multilingual, dialectal voice acting, and natural lip-sync and expressions).

The Omni mode is another highlighted change: enabling localized controllable modifications on already generated content, reducing the need to “start over.” The report also mentions two more professional creation capabilities: one is the ability to create video subjects (extracting character features and original voice tones, enabling precise lip-sync and driving); the other is native custom scene creation, with single-generation durations increased to 15 seconds, allowing specification of shot duration, framing, perspective, narrative content, and camera movement at the shot level.

In terms of images, Kling Image 3.0 is also regarded as part of the “workflow completion”: supporting up to 10 reference images to lock in subject outlines, core elements, and tonal style; allowing free element addition, removal, and modification across multiple reference images; supporting batch output for storyboards/material packs; and enhancing high-definition output and detail rendering.

Seedance 2.0 transforms video into an “editable industrial tool”

The report positions Seedance 2.0 more as an “industrial standard”: emphasizing physical realism, natural motion, precise instruction understanding, and style consistency; highlighting three capabilities—consistency optimization (from faces to clothing, fonts, scene transitions); controllable recreation of complex camera movements and actions; and precise replication of creative templates and complex effects.

More critically, the interaction paradigm is emphasized. The report suggests that Seedance 2.0’s use of “@assetName” to specify the purpose of images/videos/audio effectively disassembles black-box generation into a controllable production process: models can extract @video camera movements, @image details, and @audio rhythm separately, significantly reducing waste footage.

The usage limits also read like production constraints: up to 9 image inputs; up to 3 video inputs totaling no more than 15 seconds; up to 3 MP3 audio uploads totaling under 15 seconds; a maximum of 12 mixed input files; generation durations of up to 15 seconds (selectable from 4 to 15 seconds); and built-in sound-effect/music output. Entry points include "start/end frames" and "all-purpose references," corresponding to different ways of organizing source material.

ByteDance focuses on “efficiency infrastructure,” Kuaishou on “professional storytelling,” and Alibaba on vertical e-commerce

The report’s view on the competitive landscape is less about “ranking” and more about strategic differentiation.

It summarizes ByteDance's approach as low-threshold, low-cost tooling with broad generalization, akin to an advanced "CapCut," aiming to reduce overall content production costs and support its ecosystem. Kuaishou's Kling emphasizes physical simulation, realistic complex scenes, and subject consistency, making it better suited to professional content such as film demos and narrative-driven productions. Alibaba's Qwen, in its high-fidelity image model updates, is more vertically oriented toward e-commerce, strengthening capabilities related to product digitization.

These three paths do not point to the same business model: one pursues large-scale throughput, one emphasizes high-quality narrative delivery, and one targets vertical industry “ready-to-produce” solutions.

Content supply-side revolution: marginal costs converge toward computing power costs, making IP even scarcer

In terms of commercialization, the report describes a “supply-side revolution” quite aggressively: with dual enhancements in image and video foundational capabilities, the marginal cost of content production will increasingly approach the cost of computing power.

In the short term, it foresees two main changes: increased efficiency for marketing/e-commerce content creators leading to higher gross margins; and potential capacity explosions in short and long-form drama industries. In the medium to long term, the focus shifts to IP—since content becomes easier to produce, its scarcity value will be more concentrated in IP assets: top-tier IP and derivatives will command higher value, and mid-tier IP may also see a valuation reset through AI videoization. Meanwhile, giants with strong cloud infrastructure and closed-loop traffic scenarios will benefit most directly from the frequent inference calls driven by these developments.

Risk warnings and disclaimers

Market risks exist; investments should be cautious. This article does not constitute personal investment advice and does not consider individual users’ specific investment goals, financial situations, or needs. Users should evaluate whether any opinions, viewpoints, or conclusions herein are suitable for their particular circumstances. Invest at your own risk.
