Reconstructive Visual Instruction Tuning by Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang.
TL;DR: We propose reconstructive visual instruction tuning, a vision-centric supervision signal that enhances fine-grained comprehension capabilities and reduces hallucinations.
Abstract. This paper introduces reconstructive visual instruction tuning (Ross), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, Ross prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, Ross employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, Ross consistently brings significant improvements across different visual encoders and language models. Compared with state-of-the-art alternatives that rely on extrinsic assistance by aggregating multiple visual experts, Ross delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
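For intuition, here is a minimal, hypothetical PyTorch sketch of such a vision-centric objective: latent tokens of the input image (produced by a frozen VAE tokenizer) are noised, and a small denoiser conditioned on the LMM's hidden states at the image-token positions learns to predict that noise, alongside the usual next-token text loss. All names and the noise schedule below are illustrative assumptions, not the actual Ross implementation.

```python
# Illustrative sketch of a Ross-style reconstructive (denoising) objective; not the official code.
import torch
import torch.nn.functional as F

def reconstructive_loss(visual_states, vae_latents, denoiser, num_steps=1000):
    # visual_states: [B, M, D_llm] LMM hidden states at the image-token positions
    # vae_latents:   [B, N, D] latent tokens of the input image from a frozen VAE tokenizer
    b = vae_latents.size(0)
    t = torch.randint(0, num_steps, (b,), device=vae_latents.device)                  # random timesteps
    noise = torch.randn_like(vae_latents)                                              # denoising target
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps).view(b, 1, 1) ** 2   # toy cosine schedule
    noisy = alpha_bar.sqrt() * vae_latents + (1.0 - alpha_bar).sqrt() * noise          # forward diffusion
    pred_noise = denoiser(noisy, t, cond=visual_states)                                # condition on LMM states
    return F.mse_loss(pred_noise, noise)                                               # denoising objective

# total_loss = text_cross_entropy_loss + lambda_recon * reconstructive_loss(...)
```

Because the reconstruction term only acts as auxiliary supervision, a denoiser of this kind would only be needed during training and can be dropped at inference.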
- [2024/12/31] 🔥 All codes and checkpoints of Ross have been released.
- [2024/10/12] 🔥 Ross has been released. Check out the paper for details.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the ross folder
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```
- Install Package
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
git pull
pip install -e .
# if you see some import errors when you upgrade,
# please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
| Method | LLM | Checkpoint |
|---|---|---|
| Ross-7B | Qwen2-7B-Instruct | HF |
| Method | POPE | HBench | MMB-EN | MMB-CN | SEED-I | MMMU-V | MMVP | GQA | AI2D |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 86.2 | 47.5 | 65.5 | 58.5 | 66.0 | 34.4 | 20.0 | 62.0 | 55.4 |
| LLaVA-v1.6-7B | 86.5 | 35.8 | 67.4 | 60.1 | 70.2 | 35.8 | 37.8 | 64.2 | 67.1 |
| Cambrian-1-8B | 87.4 | 48.7 | 75.9 | 68.9 | 74.7 | 42.7 | 51.3 | 64.6 | 73.0 |
| Ross-7B | 88.3 | 57.1 | 79.1 | 77.1 | 73.6 | 46.6 | 56.7 | 65.5 | 79.3 |
| LLaVA-v1.5-13B | 82.5 | 44.9 | 68.8 | 63.6 | 68.2 | 36.6 | 31.9 | 63.3 | 60.8 |
| LLaVA-v1.6-13B | 86.2 | 36.7 | 70.0 | 64.1 | 71.9 | 36.2 | 35.6 | 65.4 | 72.4 |
| Cambrian-1-13B | 85.7 | 54.0 | 75.7 | 65.9 | 74.4 | 40.0 | 41.3 | 64.3 | 73.6 |
| Ross-13B | 88.7 | 56.4 | 73.6 | 67.4 | 71.1 | 41.3 | 44.7 | 65.2 | 73.8 |
Ross training consists of two stages: (1) feature alignment stage to connect a frozen vision encoder to a frozen LLM; (2) instruction tuning stage to teach the model to follow multimodal instructions.
Ross is trained on 8 A100 GPUs with 80GB memory.
To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
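For example, a quick sanity check of this arithmetic (the numbers here are illustrative, not the values from the provided scripts):

```python
# Keep the global batch size constant when changing the number of GPUs.
# The names mirror the training arguments above; the numbers are examples only.
def global_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# 8 GPUs, batch size 16, no accumulation == 2 GPUs, batch size 32, 2 accumulation steps
assert global_batch_size(16, 1, 8) == global_batch_size(32, 2, 2) == 128
```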
Our base model takes the VAE from FLUX.1-dev as the fine-grained tokenizer. Download the checkpoint from this URL and put it into `./pretrained_vae`.
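As a hedged sketch, the VAE could be loaded with `diffusers` and used to encode an image into the latent space that Ross reconstructs. This assumes `./pretrained_vae` contains the downloaded VAE config and weights; the preprocessing shown (image size, normalization) is an assumption rather than the exact Ross pipeline:

```python
# Sketch: load the FLUX.1-dev VAE as the fine-grained visual tokenizer and encode an image.
# Assumes ./pretrained_vae contains the VAE's config.json and weights; preprocessing is illustrative.
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./pretrained_vae").eval()

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # map pixels to [-1, 1]
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # latent map, e.g. [1, C, 64, 64] for a 512x512 input
print(latents.shape)
```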
The training script with DeepSpeed ZeRO-2 can be found in `scripts/train_ross/pretrain_*.sh`. Relevant flags:
- `--mm_inv_projector_type denoiser_vit3x`: the architecture of the denoiser, which contains 3 transformer blocks by default (see the hypothetical sketch below).
- `--mm_pixel_decoder ./pretrained_vae`: the visual tokenizer.
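For concreteness, here is a hypothetical sketch of what a 3-block transformer denoiser head might look like; the class name, dimensions, and layer choices are assumptions and do not reflect the actual `denoiser_vit3x` implementation:

```python
# Hypothetical 3-block transformer denoiser head conditioned on LMM hidden states.
# Not the actual denoiser_vit3x; all dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=4096, hidden=1024,
                 num_blocks=3, num_heads=16, num_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden)     # project noisy latent tokens
        self.cond_proj = nn.Linear(cond_dim, hidden)     # project LMM hidden states
        self.t_embed = nn.Embedding(num_steps, hidden)   # timestep embedding
        layer = nn.TransformerDecoderLayer(hidden, num_heads, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_blocks)  # 3 blocks by default
        self.out_proj = nn.Linear(hidden, latent_dim)    # predict the noise added to the latents

    def forward(self, noisy_latents, timesteps, cond):
        # noisy_latents: [B, N, latent_dim], timesteps: [B], cond: [B, M, cond_dim]
        x = self.in_proj(noisy_latents) + self.t_embed(timesteps).unsqueeze(1)
        x = self.blocks(tgt=x, memory=self.cond_proj(cond))
        return self.out_proj(x)
```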
The training script with DeepSpeed ZeRO-3 can be found in `scripts/train_ross/finetune_*.sh`.
In Ross, we evaluate models on a diverse set of benchmarks using VLMEvalKit and lmms-eval. The evaluation on MMVP is implemented based on Cambrian-1.
See EVALUATION.md for details.
If you find Ross useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}
```
- LLaVA: the codebase we built upon and the dataset we utilized.
- Cambrian-1: the dataset we utilized.
- VLMEvalKit and lmms-eval: the evaluation codebases we built upon.
- Qwen and Vicuna: the base LLMs we utilized.