Original by BCC Global
October 11, 2024 | 9:30 AM

Introduction

ByteDance’s video generation model, officially launched in September, has sparked significant industry attention. In the highly competitive landscape of large AI models, video modality represents a crucial milestone for multimodal technology. ByteDance, known for its strength in video content creation, has positioned its new video model as a potential “Chinese version of Sora.” What pain points does this new model address, and can it stand out in the evolving field of video generation?

Doubao Video Model: A New Chapter for ByteDance

On September 24, ByteDance’s subsidiary, Volcano Engine, officially released its Doubao video generation models—PixelDance and Seaweed—during the AI Innovation Tour, marking the company’s foray into AI video generation. These models are now in private testing for the enterprise market, signaling ByteDance’s ambitions in the AI video space.

  1. PixelDance
  • Architecture: Based on the DiT (Diffusion Transformer) architecture, a mainstream technology route in AI video generation, known for its robust semantic understanding capabilities. This allows the model to effectively interpret input text or image data and convert it into high-quality video content.
  • Video Generation Capabilities: Currently, PixelDance v1.4 can generate videos ranging from 5 to 10 seconds in length, with a default resolution of 720p and a frame rate of approximately 25 fps. It supports multi-resolution output and adapts to both horizontal and vertical screen orientations. PixelDance excels in cinematic techniques, with capabilities such as zoom, panning, circling, tracking, and more, offering creators a wealth of possibilities. It is particularly suited for fields like film production and advertising, where creativity and video quality are critical.
  2. Seaweed
  • Architecture: Built on the Transformer structure, Seaweed leverages compressed temporal-spatial latent spaces for training. This architecture allows for more efficient handling of temporal and spatial data within videos, resulting in improved video generation quality and speed.
  • Resolution and Frame Rate: The model outputs videos in 720p at 24 fps by default, with a length of 5 seconds, though it can dynamically extend up to 20–30 seconds. This makes it ideal for most daily video creation needs as well as some basic commercial applications.
  • Output Flexibility: Seaweed supports multi-resolution generation and adapts to various screen ratios, such as horizontal and vertical formats. It can also adjust according to the resolution of the input images, providing more flexibility in how videos are displayed across different platforms and devices.
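The clip lengths and frame rates above translate directly into frame budgets, and under a DiT-style design the video is further carved into spatio-temporal patches that become transformer tokens. A minimal sketch of that arithmetic follows; the downsampling factor and patch sizes are illustrative assumptions, not published Doubao parameters.

```python
def frame_count(seconds: int, fps: int) -> int:
    """Number of raw frames in a clip of the given duration."""
    return seconds * fps


def dit_token_count(frames: int, height: int, width: int,
                    spatial_down: int = 8, t_patch: int = 2,
                    s_patch: int = 2) -> int:
    """Rough token count for a DiT-style video model: frames are encoded
    into a latent grid (spatially downsampled by `spatial_down`), then
    grouped into t_patch x s_patch x s_patch spatio-temporal patches.
    All downsample/patch values here are illustrative assumptions.
    """
    lat_h, lat_w = height // spatial_down, width // spatial_down
    return (frames // t_patch) * (lat_h // s_patch) * (lat_w // s_patch)


# PixelDance: 5-10 s at ~25 fps -> 125-250 raw frames
print(frame_count(5, 25), frame_count(10, 25))  # 125 250
# Seaweed: 5 s at 24 fps by default -> 120 raw frames
print(frame_count(5, 24))                       # 120
# Token budget for a hypothetical 128-frame 720p (1280x720) clip
print(dit_token_count(128, 720, 1280))          # 230400
```

Even a short 720p clip yields hundreds of thousands of tokens under these assumptions, which is why compressed latent spaces (as in Seaweed) matter so much for generation speed.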

Both models demonstrate exceptional video generation capabilities. From advanced semantic understanding to complex interactions between multiple subjects, as well as maintaining visual consistency across multiple camera angles, these models perform at an industry-leading level. Creators testing the Doubao models have noted their ability to follow intricate instructions, such as having different characters complete coordinated actions while keeping appearance, clothing details, and headgear consistent across different shots, delivering near-realistic results.

With further refinements through platforms like CapCut and Jiying AI, the Doubao models have achieved professional-grade lighting and color harmony, producing visually stunning, lifelike scenes. The deeply optimized Transformer structure significantly enhances the models’ generalization, supporting styles including 3D animation, 2D animation, traditional Chinese painting, black-and-white film, and thick-brush painting. This versatility makes the output suitable for display across film, television, computer, and mobile screens.

Currently, the new Doubao video generation models are undergoing small-scale testing in the beta version of Jiying AI, with plans to gradually open up to all users. This move promises to provide users with a more convenient and efficient video creation experience and presents new opportunities for the video generation AI model sector.

Image source: Doubao official website

Key Capabilities of Doubao Models:

  1. Accurate Semantic Understanding:
    Doubao’s video models surpass many in the market by handling complex prompts and generating multiple characters interacting in dynamic scenes. For example, users can input a prompt like “a brave knight battling a dragon in a forest, aided by a nearby sorcerer casting spells,” and the model will accurately generate this scene with detailed interactions between the knight and sorcerer.
  2. Overcoming Consistency Challenges:
    Thanks to an innovative diffusion-based training method, the Doubao models ensure consistency across multiple camera angles within the same video. This is crucial for storytelling, allowing users to create seamless transitions between scenes while maintaining character and stylistic consistency.
  3. Dynamic and Rich Video Creation:
    Doubao’s models offer professional-grade detail, color harmony, and lighting, producing visually stunning videos with rich detail, lifelike motion, and diverse camera techniques. Users can create immersive, cinematic-quality videos with zoom, tracking, and other advanced features typically seen in professional production.
  4. Multiple Styles and Formats:
    Doubao’s models support various artistic styles—3D animation, 2D animation, traditional Chinese painting, black-and-white, and more. They also offer flexibility in video ratios, such as 1:1, 16:9, and 21:9, catering to different platforms from social media to cinematic releases.
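The supported ratios can be turned into concrete output dimensions. The sketch below fixes the short side at 720 pixels and rounds each side down to a multiple of 16, a common alignment for video codecs and diffusion latents; the rounding rule is an assumption for illustration, not documented Doubao behavior.

```python
def dims_for_ratio(ratio_w: int, ratio_h: int, short_side: int = 720,
                   align: int = 16) -> tuple:
    """Return (width, height) for a landscape aspect ratio with the
    given short side, each rounded down to a multiple of `align`.
    The alignment rule is an illustrative assumption."""
    width = short_side * ratio_w // ratio_h
    w = width // align * align
    h = short_side // align * align
    return w, h


# The ratios mentioned in the article: 1:1, 16:9, 21:9
for r in [(1, 1), (16, 9), (21, 9)]:
    print(f"{r[0]}:{r[1]} ->", dims_for_ratio(*r))
# 1:1  -> (720, 720)
# 16:9 -> (1280, 720)
# 21:9 -> (1680, 720)
```

At a 720-pixel short side, all three ratios happen to land on codec-friendly dimensions without any rounding loss.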

Doubao’s Expanding Multimodal AI Model Family:

Beyond video, ByteDance’s Doubao suite includes notable models for music and real-time translation.

  1. Music Generation Model:
    Doubao’s music model can generate high-quality compositions, including melodies, lyrics, and vocals, based on simple text prompts or images. Supporting over 10 musical genres, including pop, rock, and traditional Chinese music, the model offers creators a wide range of options. It can even match vocal styles to song genres, creating professional-sounding tracks with dynamic vocal transitions and nuances.
  2. Simultaneous Translation Model:
    The Doubao real-time translation model offers ultra-low latency and high accuracy, reportedly rivaling human interpreters in specialized fields such as law, education, and business. It also features voice cloning, preserving the original speaker’s tone during translation, a significant step forward for international conferences and live broadcasts.

Model Updates and Future Prospects:

Doubao has also upgraded its general language, image, and voice models. These enhancements include improved reasoning and speed in the image generation model and enhanced mixing capabilities in the voice model, offering users greater flexibility in creating unique soundscapes.

Challenges and Future of Video Generation Models:

Though video generation models face technical challenges, their potential is vast. As AI advances, we expect even greater accuracy in semantic understanding and higher-quality video outputs. Future applications could include personalized e-commerce videos, educational content, and immersive virtual tourism experiences. Video models like Doubao will likely reduce production costs, making video creation more accessible while driving innovation across industries.

However, there are hurdles, including ensuring the models’ stability, managing ethical concerns related to video misuse, and addressing potential impacts on traditional industries.

Disclaimer: This article is for informational purposes and does not constitute investment advice. All information is based on publicly available sources, expert opinions, and BCC research. No liability will be assumed for losses arising from the use of this information. Investing involves risks; proceed with caution.