
I2VGEN-XL: Image + Text to Video Diffusion Model

I2VGen-XL is a cascaded diffusion model for generating videos from image and text inputs. Its hierarchical, two-stage architecture leverages diffusion processes to produce high-quality, temporally coherent video. This repository provides an easy-to-use Jupyter Notebook to get started with I2VGen-XL.


Features

  • Multimodal Inputs: Combines images and text to create dynamic video outputs.

  • Cascaded Architecture: Utilizes a hierarchical approach to ensure progressively refined video quality.

  • Lightweight Setup: Run directly in Jupyter Notebook with minimal setup requirements.

Model Architecture

[Figure: I2VGen-XL model architecture]

Description: The I2VGen-XL framework consists of two main stages. In the base stage, hierarchical encoders work together to extract both high-level semantics and fine-grained details from input images, ensuring realistic motion while maintaining the content and structure of the images. In the refinement stage, a dedicated diffusion model enhances resolution and significantly improves temporal consistency by refining finer details. The "D.Enc." and "G.Enc." components represent the detail encoder and global encoder, respectively.
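
The two stages can be sketched in code. The snippet below is a structural illustration only: every name in it (detail_encoder, global_encoder, base_diffusion, refine_diffusion) is a placeholder mirroring the figure, not this repository's actual API.

    import torch
    import torch.nn as nn

    # Placeholder encoders standing in for the real networks (names mirror
    # the figure, not this repository's code).
    detail_encoder = nn.Identity()   # D.Enc.: fine-grained content/structure
    global_encoder = nn.Identity()   # G.Enc.: high-level semantics

    def base_diffusion(detail_feats, global_feats, text_emb):
        # Stand-in for the base video diffusion model: conditioned on both
        # image feature sets and the text embedding, it denoises a
        # low-resolution, temporally coherent clip.
        return torch.zeros(1, 16, 3, 64, 64)  # (batch, frames, C, H, W)

    def refine_diffusion(low_res_video, text_emb):
        # Stand-in for the refinement diffusion model, which upsamples the
        # clip while sharpening detail and temporal consistency.
        frames = low_res_video.flatten(0, 1)              # merge batch/time
        frames = nn.functional.interpolate(frames, scale_factor=2)
        return frames.unflatten(0, low_res_video.shape[:2])

    def generate_video(image, text_emb):
        detail_feats = detail_encoder(image)   # content and structure
        global_feats = global_encoder(image)   # high-level semantics
        low_res = base_diffusion(detail_feats, global_feats, text_emb)
        return refine_diffusion(low_res, text_emb)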

Getting Started

Requirements

  • A Kaggle account for running the notebook on Kaggle’s GPU-enabled environment.
  • Basic familiarity with Python and Jupyter Notebooks.

Steps

  1. Download the Notebook:

    • Clone the repository or download the IVGEN-XL.ipynb file directly.
      git clone https://github.com/MOSTAFA1172m/Image-text-video-I2VGENXL.git
      cd Image-text-video-I2VGENXL
  2. Upload the Notebook to Kaggle:

    • Go to Kaggle.
    • Create a new notebook and upload the IVGEN-XL.ipynb file.
  3. Enable GPU:

    • In your Kaggle notebook, navigate to Settings > Accelerator and select GPU.
  4. Install Dependencies:

    • Run the following command in a code cell:
      !pip install -r requirements.txt
  5. Run the Notebook:

    • Follow the step-by-step instructions provided in the notebook to input an image and a text prompt and generate a video. A minimal sketch of the core generation step follows this list.
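
If you want to reproduce the core generation step outside the notebook, the cell below is a minimal sketch using the I2VGenXLPipeline from Hugging Face diffusers (a published port of I2VGen-XL); it assumes diffusers is installed and is not necessarily the exact code used in IVGEN-XL.ipynb. The image path input.png is a hypothetical example.

    import torch
    from diffusers import I2VGenXLPipeline
    from diffusers.utils import load_image, export_to_gif

    # Load the published I2VGen-XL weights in half precision to fit a
    # Kaggle GPU; CPU offload trades some speed for lower VRAM use.
    pipe = I2VGenXLPipeline.from_pretrained(
        "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.enable_model_cpu_offload()

    image = load_image("input.png").convert("RGB")  # hypothetical input image
    frames = pipe(
        prompt="Clouds moving across the sky over the mountains.",
        image=image,
        num_inference_steps=50,
        guidance_scale=9.0,
        generator=torch.manual_seed(0),
    ).frames[0]
    export_to_gif(frames, "output.gif")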

Results

Example Outputs

Here are some example outputs generated by the model. The results showcase how the model processes various inputs:

Image + Text Input Examples:

  • Example 1: Newton portrait. Description: Newton smiling and waving.
  • Example 2: Mona Lisa painting by Leonardo da Vinci. Description: Mona Lisa laughing.
  • Example 3: Car. Description: Car driving on the road.
  • Example 4: Sunset sea. Description: Sunset over the sea.
  • Example 5: Skyward clouds. Description: Clouds moving across the sky over the mountains.

Generation Speed

Generation speed depends on your hardware. On Kaggle, enable the GPU accelerator; diffusion sampling runs substantially faster on GPU than on CPU.
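
Before launching a long generation run, a quick sanity check using standard PyTorch calls confirms the accelerator is actually visible:

    import torch

    # Diffusion sampling on CPU is orders of magnitude slower, so verify
    # the Kaggle GPU is visible before generating.
    if torch.cuda.is_available():
        print("Using GPU:", torch.cuda.get_device_name(0))
    else:
        print("No GPU detected - enable it under Settings > Accelerator.")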

Feel free to experiment with different inputs and see how the model generates videos in response.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, feel free to reach out on LinkedIn.


Note: Ensure the GPU runtime is enabled before running the notebook to avoid performance issues.
