
I2VGEN-XL: Image + Text to Video Diffusion Model

I2VGen-XL is a cascaded diffusion model for generating videos from image and text inputs. Its hierarchical, two-stage architecture leverages diffusion processes to produce high-quality, temporally coherent video. This repository provides an easy-to-use Jupyter Notebook to get started with I2VGen-XL.


Features

  • Multimodal Inputs: Combines images and text to create dynamic video outputs.

  • Cascaded Architecture: Utilizes a hierarchical approach to ensure progressively refined video quality.

  • Lightweight Setup: Run directly in Jupyter Notebook with minimal setup requirements.

Model Architecture

[Figure: I2VGen-XL model architecture]

Description: The I2VGen-XL framework consists of two main stages. In the base stage, hierarchical encoders work together to extract both high-level semantics and fine-grained details from input images, ensuring realistic motion while maintaining the content and structure of the images. In the refinement stage, a dedicated diffusion model enhances resolution and significantly improves temporal consistency by refining finer details. The "D.Enc." and "G.Enc." components represent the detail encoder and global encoder, respectively.
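
The two stages can be sketched in code. The snippet below is a structural illustration only: every name in it (detail_encoder, global_encoder, base_diffusion, refine_diffusion) is a placeholder mirroring the figure, not this repository's actual API.

    import torch
    import torch.nn as nn

    # Placeholder encoders standing in for the real networks (names mirror
    # the figure, not this repository's code).
    detail_encoder = nn.Identity()   # D.Enc.: fine-grained content/structure
    global_encoder = nn.Identity()   # G.Enc.: high-level semantics

    def base_diffusion(detail_feats, global_feats, text_emb):
        # Stand-in for the base video diffusion model: conditioned on both
        # image feature sets and the text embedding, it denoises a
        # low-resolution, temporally coherent clip.
        return torch.zeros(1, 16, 3, 64, 64)  # (batch, frames, C, H, W)

    def refine_diffusion(low_res_video, text_emb):
        # Stand-in for the refinement diffusion model, which upsamples the
        # clip while sharpening detail and temporal consistency.
        frames = low_res_video.flatten(0, 1)              # merge batch/time
        frames = nn.functional.interpolate(frames, scale_factor=2)
        return frames.unflatten(0, low_res_video.shape[:2])

    def generate_video(image, text_emb):
        detail_feats = detail_encoder(image)   # content and structure
        global_feats = global_encoder(image)   # high-level semantics
        low_res = base_diffusion(detail_feats, global_feats, text_emb)
        return refine_diffusion(low_res, text_emb)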

Getting Started

Requirements

  • A Kaggle account for running the notebook on Kaggle’s GPU-enabled environment.
  • Basic familiarity with Python and Jupyter Notebooks.

Steps

  1. Download the Notebook:

    • Clone the repository or download the IVGEN-XL.ipynb file directly.
      git clone https://github.com/MOSTAFA1172m/Image-text-video-I2VGENXL.git
      cd Image-text-video-I2VGENXL
  2. Upload the Notebook to Kaggle:

    • Go to Kaggle.
    • Create a new notebook and upload the IVGEN-XL.ipynb file.
  3. Enable GPU:

    • In your Kaggle notebook, navigate to Settings > Accelerator and select GPU.
  4. Install Dependencies:

    • Run the following command in a code cell:
      !pip install -r requirements.txt
  5. Run the Notebook:

    • Follow the step-by-step instructions provided in the notebook to input an image and a text prompt and generate a video. A minimal sketch of the core generation step follows this list.
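
If you want to reproduce the core generation step outside the notebook, the cell below is a minimal sketch using the I2VGenXLPipeline from Hugging Face diffusers (a published port of I2VGen-XL); it assumes diffusers is installed and is not necessarily the exact code used in IVGEN-XL.ipynb. The image path input.png is a hypothetical example.

    import torch
    from diffusers import I2VGenXLPipeline
    from diffusers.utils import load_image, export_to_gif

    # Load the published I2VGen-XL weights in half precision to fit a
    # Kaggle GPU; CPU offload trades some speed for lower VRAM use.
    pipe = I2VGenXLPipeline.from_pretrained(
        "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.enable_model_cpu_offload()

    image = load_image("input.png").convert("RGB")  # hypothetical input image
    frames = pipe(
        prompt="Clouds moving across the sky over the mountains.",
        image=image,
        num_inference_steps=50,
        guidance_scale=9.0,
        generator=torch.manual_seed(0),
    ).frames[0]
    export_to_gif(frames, "output.gif")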

Results

Example Outputs

Here are some example outputs generated by the model. The results showcase how the model processes various inputs:

Image + Text Input Examples:

  • Example 1: Newton portrait. Description: Newton smiling and waving.
  • Example 2: Mona Lisa painting by Leonardo da Vinci. Description: Mona Lisa laughing.
  • Example 3: Car. Description: Car driving on the road.
  • Example 4: Sunset sea. Description: Sunset over the sea.
  • Example 5: Skyward clouds. Description: Clouds moving across the sky over the mountains.

Generation Speed

Generation speed depends on your hardware. On Kaggle, enable the GPU accelerator; diffusion sampling runs substantially faster on GPU than on CPU.
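
Before launching a long generation run, a quick sanity check using standard PyTorch calls confirms the accelerator is actually visible:

    import torch

    # Diffusion sampling on CPU is orders of magnitude slower, so verify
    # the Kaggle GPU is visible before generating.
    if torch.cuda.is_available():
        print("Using GPU:", torch.cuda.get_device_name(0))
    else:
        print("No GPU detected - enable it under Settings > Accelerator.")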

Feel free to experiment with different inputs and see how the model generates videos in response.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, feel free to reach out on LinkedIn.


Note: Ensure the GPU runtime is enabled before running the notebook to avoid performance issues.
