Reconstructive Visual Instruction Tuning by Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang.
TL;DR: We propose reconstructive visual instruction tuning, a vision-centric supervision signal that enhances fine-grained comprehension capabilities and reduces hallucinations.
Abstract. This paper introduces reconstructive visual instruction tuning (Ross), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, Ross prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, Ross employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, Ross consistently brings significant improvements across different visual encoders and language models. Compared with state-of-the-art alternatives that rely on extrinsic assistance by aggregating multiple visual experts, Ross delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
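For intuition, here is a minimal, hypothetical PyTorch sketch of such a vision-centric objective: latent tokens of the input image (produced by a frozen VAE tokenizer) are noised, and a small denoiser conditioned on the LMM's hidden states at the image-token positions learns to predict that noise, alongside the usual next-token text loss. All names and the noise schedule below are illustrative assumptions, not the actual Ross implementation.

```python
# Illustrative sketch of a Ross-style reconstructive (denoising) objective; not the official code.
import torch
import torch.nn.functional as F

def reconstructive_loss(visual_states, vae_latents, denoiser, num_steps=1000):
    # visual_states: [B, M, D_llm] LMM hidden states at the image-token positions
    # vae_latents:   [B, N, D] latent tokens of the input image from a frozen VAE tokenizer
    b = vae_latents.size(0)
    t = torch.randint(0, num_steps, (b,), device=vae_latents.device)                  # random timesteps
    noise = torch.randn_like(vae_latents)                                              # denoising target
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps).view(b, 1, 1) ** 2   # toy cosine schedule
    noisy = alpha_bar.sqrt() * vae_latents + (1.0 - alpha_bar).sqrt() * noise          # forward diffusion
    pred_noise = denoiser(noisy, t, cond=visual_states)                                # condition on LMM states
    return F.mse_loss(pred_noise, noise)                                               # denoising objective

# total_loss = text_cross_entropy_loss + lambda_recon * reconstructive_loss(...)
```

Because the reconstruction term only acts as auxiliary supervision, a denoiser of this kind would only be needed during training and can be dropped at inference.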
- [2024/12/31] 🔥 All codes and checkpoints of Ross have been released.
- [2024/10/12] 🔥 Ross has been released. Check out the paper for details.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the ross folder
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```
- Install Package
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
git pull
pip install -e .
# if you see some import errors when you upgrade,
# please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
| Method | LLM | Checkpoint |
|---|---|---|
| Ross-7B | Qwen2-7B-Instruct | HF |
| Method | POPE | HBench | MMB-EN | MMB-CN | SEED-I | MMMU-V | MMVP | GQA | AI2D |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 86.2 | 47.5 | 65.5 | 58.5 | 66.0 | 34.4 | 20.0 | 62.0 | 55.4 |
| LLaVA-v1.6-7B | 86.5 | 35.8 | 67.4 | 60.1 | 70.2 | 35.8 | 37.8 | 64.2 | 67.1 |
| Cambrian-1-8B | 87.4 | 48.7 | 75.9 | 68.9 | 74.7 | 42.7 | 51.3 | 64.6 | 73.0 |
| Ross-7B | 88.3 | 57.1 | 79.1 | 77.1 | 73.6 | 46.6 | 56.7 | 65.5 | 79.3 |
| LLaVA-v1.5-13B | 82.5 | 44.9 | 68.8 | 63.6 | 68.2 | 36.6 | 31.9 | 63.3 | 60.8 |
| LLaVA-v1.6-13B | 86.2 | 36.7 | 70.0 | 64.1 | 71.9 | 36.2 | 35.6 | 65.4 | 72.4 |
| Cambrian-1-13B | 85.7 | 54.0 | 75.7 | 65.9 | 74.4 | 40.0 | 41.3 | 64.3 | 73.6 |
| Ross-13B | 88.7 | 56.4 | 73.6 | 67.4 | 71.1 | 41.3 | 44.7 | 65.2 | 73.8 |
Ross training consists of two stages: (1) feature alignment stage to connect a frozen vision encoder to a frozen LLM; (2) instruction tuning stage to teach the model to follow multimodal instructions.
Ross is trained on 8 A100 GPUs with 80GB memory.
To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
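For example, a quick sanity check of this arithmetic (the numbers here are illustrative, not the values from the provided scripts):

```python
# Keep the global batch size constant when changing the number of GPUs.
# The names mirror the training arguments above; the numbers are examples only.
def global_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# 8 GPUs, batch size 16, no accumulation == 2 GPUs, batch size 32, 2 accumulation steps
assert global_batch_size(16, 1, 8) == global_batch_size(32, 2, 2) == 128
```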
Our base model takes the VAE from FLUX.1-dev as the fine-grained tokenizer. Download the checkpoint from this URL and put it into `./pretrained_vae`.
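As a hedged sketch, the VAE could be loaded with `diffusers` and used to encode an image into the latent space that Ross reconstructs. This assumes `./pretrained_vae` contains the downloaded VAE config and weights; the preprocessing shown (image size, normalization) is an assumption rather than the exact Ross pipeline:

```python
# Sketch: load the FLUX.1-dev VAE as the fine-grained visual tokenizer and encode an image.
# Assumes ./pretrained_vae contains the VAE's config.json and weights; preprocessing is illustrative.
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./pretrained_vae").eval()

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # map pixels to [-1, 1]
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # latent map, e.g. [1, C, 64, 64] for a 512x512 input
print(latents.shape)
```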
The training script with DeepSpeed ZeRO-2 can be found in `scripts/train_ross/pretrain_*.sh`. Relevant flags:
- `--mm_inv_projector_type denoiser_vit3x`: the architecture of the denoiser, which contains 3 transformer blocks by default (see the hypothetical sketch below).
- `--mm_pixel_decoder ./pretrained_vae`: the visual tokenizer.
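For concreteness, here is a hypothetical sketch of what a 3-block transformer denoiser head might look like; the class name, dimensions, and layer choices are assumptions and do not reflect the actual `denoiser_vit3x` implementation:

```python
# Hypothetical 3-block transformer denoiser head conditioned on LMM hidden states.
# Not the actual denoiser_vit3x; all dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=4096, hidden=1024,
                 num_blocks=3, num_heads=16, num_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden)     # project noisy latent tokens
        self.cond_proj = nn.Linear(cond_dim, hidden)     # project LMM hidden states
        self.t_embed = nn.Embedding(num_steps, hidden)   # timestep embedding
        layer = nn.TransformerDecoderLayer(hidden, num_heads, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_blocks)  # 3 blocks by default
        self.out_proj = nn.Linear(hidden, latent_dim)    # predict the noise added to the latents

    def forward(self, noisy_latents, timesteps, cond):
        # noisy_latents: [B, N, latent_dim], timesteps: [B], cond: [B, M, cond_dim]
        x = self.in_proj(noisy_latents) + self.t_embed(timesteps).unsqueeze(1)
        x = self.blocks(tgt=x, memory=self.cond_proj(cond))
        return self.out_proj(x)
```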
The training script with DeepSpeed ZeRO-3 can be found in `scripts/train_ross/finetune_*.sh`.
In Ross, we evaluate models on a diverse set of benchmarks using VLMEvalKit and lmms-eval. The evaluation on MMVP is implemented based on Cambrian-1.
See EVALUATION.md for details.
If you find Ross useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}
```
- LLaVA: the codebase we built upon and the dataset we utilized.
- Cambrian-1: the dataset we utilized.
- VLMEvalKit and lmms-eval: the evaluation codebases we built upon.
- Qwen and Vicuna: the base LLMs we utilized.