Shuaifeng Li1, Mao Ye1*, Lihua Zhou1, Nianxin Li1, Siying Xiao1, Song Tang2, Xiatian Zhu3
1University of Electronic Science and Technology of China
2University of Shanghai for Science and Technology, 3University of Surrey
Paper | Project | Slides | Poster | Blog | η₯δΉ | ε°ηΊ’δΉ¦
Welcome to my homepage: Shuaifeng Li.
Following the trend of the times, we explore an interesting and promising problem, Cloud Object Detector Adaptation (CODA), where the target domain leverages detections provided by a large vision-language cloud detector to build a target detector. Thanks to the large cloud model, open target scenarios and categories can be handled, so open-set adaptation is no longer a problem.
Please note that CODA does not restrict whether CLIP is used, even though our method COIN does use CLIP.
π― Our previous CVPR'22 ORAL work, Source-Free Object Detection by Learning to Overlook Domain Style
, investigates source-free domain adaptive object detection, which addresses privacy concerns by assuming that the source domain data is inaccessible. If you are interested, feel free to explore our Paper and Code.
Fortunately, during the paper review process, the successive releases of Grounding DINO 1.5, 1.6, and even DINO-X have provided a timely boost to our work. Moreover, IDEA-Research has officially opened access to the Grounding DINO 1.5 API, offering a more practical and robust application scenario for our paper.
To request an API key for Grounding DINO 1.5, please follow the steps outlined here and install the environment following this guide.
We provide an example for the Foggy-Cityscapes dataset. Write the obtained TOKEN into the bash files in the scripts/GDINO1.5API/ folder, after MODEL.TEACHER_CLOUD.TOKEN, and then run the following commands. Please see here for a detailed explanation.
conda activate coin3.9api
bash scripts/GDINO1.5API/test/GDINO1.5API.sh
bash scripts/GDINO1.5API/test/CLIP.sh
bash scripts/GDINO1.5API/pretrain/CLIPDET.sh
bash scripts/GDINO1.5API/final/targetDET.sh
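For orientation, the edited token line inside each script is expected to look roughly like the excerpt below. Only MODEL.TEACHER_CLOUD.TOKEN is documented here; the config path and other arguments are illustrative guesses, so defer to the actual files in scripts/GDINO1.5API/.

```bash
# Hypothetical excerpt of a scripts/GDINO1.5API/ script after inserting the token;
# everything except MODEL.TEACHER_CLOUD.TOKEN is illustrative.
python train_net.py \
    --num-gpus 1 \
    --config configs/coin/BASELINES/GDINO1.5API_foggy.yaml \
    --eval-only \
    MODEL.TEACHER_CLOUD.TOKEN your_api_token_here \
    OUTPUT_DIR output_GDINO1.5API/foggy/test_cloud
```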
For datasets other than the six used in the paper, please prepare the data in VOC format and register it by adding lines in coin/data/datasets/builtin.py.
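As a minimal sketch, assuming coin/data/datasets/builtin.py follows Detectron2's standard VOC registration utilities (the repository's own helpers may differ), registering a new VOC-format dataset could look like this; the dataset name, directory and class names below are hypothetical:

```python
# Hypothetical registration of a new VOC-format dataset using Detectron2's built-in helper.
from detectron2.data.datasets import register_pascal_voc

MY_CLASSES = ["car", "person"]  # replace with your target-domain categories

def register_my_dataset(root="datasets/my_dataset"):
    # expects root/Annotations, root/JPEGImages and root/ImageSets/Main/{split}.txt
    for split in ["train", "test"]:
        register_pascal_voc(f"my_dataset_{split}", root, split, year=2012, class_names=MY_CLASSES)

register_my_dataset()
```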
First, clone this repository: git clone https://github.com/Flashkong/COIN.git && cd COIN.
For environment setup, please refer to docs/Environment.md. For dataset preparation, please refer to docs/Datasets.md.
Then, execute the following command:
conda activate coin
rm -rf ./datasets # Please make sure you have completed all steps in 'docs/Datasets.md'
ln -s your_datasets_dir ./datasets
First, create a folder for cloud models: mkdir cloud_models.
Then, download the models from the links below or from their original GitHub repositories: Grounding DINO and GLIPv1.
- Grounding DINO - Swin B (Default): Github or Huggingface.
- Grounding DINO - Swin T: Github or Huggingface.
- GLIP - Swin L: Github
Finally, put all the cloud models in the cloud_models folder.
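The expected layout is sketched below; the checkpoint filenames follow the upstream releases and are listed here only as an assumption, so keep whatever names the COIN configs expect.

```
cloud_models/
├── groundingdino_swinb_cogcoor.pth   # Grounding DINO - Swin B (default)
├── groundingdino_swint_ogc.pth       # Grounding DINO - Swin T
└── glip_large_model.pth              # GLIP - Swin L
```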
bash scripts/GDINO/test/GDINO.sh
bash scripts/GLIP/test/GLIP.sh
bash scripts/GDINO/test/CLIP.sh
bash scripts/GLIP/test/CLIP.sh
If you don't want to pre-train the CLIP detector yourself, you can directly use our pre-trained CLIP detector for training. For details, please see here.
Execute the following commands to pre-train the CLIP detector. It will first collect the detection results of the cloud detector and CLIP and save them to GDINO_collect.pth and CLIP_-000001.pth respectively; it will then automatically pre-train the CLIP detector.
bash scripts/GDINO/pretrain/CLIPDET.sh
bash scripts/GLIP/pretrain/CLIPDET.sh
To resume training, run the following commands. Note that CLIP's detection results are already saved in the model's checkpoint, so there is no need to load them again. If you want to train from scratch without performing result collection again, please load CLIP_-000001.pth; a sketch follows the commands below.
# modify the value of MODEL.WEIGHTS e.g. output_GDINO/foggy/pretrain/CLIPDET/CLIP_0002999.pth
bash scripts/GDINO/pretrain/ResumeTrain.sh
bash scripts/GLIP/pretrain/ResumeTrain.sh
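As a hedged sketch (assuming ResumeTrain.sh reads MODEL.WEIGHTS the same way as the other scripts), both cases amount to pointing MODEL.WEIGHTS at the desired checkpoint before running the script:

```bash
# Hypothetical illustration; edit MODEL.WEIGHTS inside the script if it is hard-coded there.
# Resume from an intermediate checkpoint:
#   MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0002999.pth
# Train from scratch while reusing the already-collected detections:
#   MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_-000001.pth
bash scripts/GDINO/pretrain/ResumeTrain.sh
```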
Execute the following commands. You need to modify the value of MODEL.WEIGHTS: the first path points to the pre-trained CLIP detector, and the second path points to the detection results collected from the cloud detector, e.g. MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0044999.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth for Foggy-Cityscapes under GDINO. A full example invocation is sketched after the commands below.
You can also directly use our pre-trained CLIP detector for training. For details, please see here.
bash scripts/GDINO/final/targetDET.sh
bash scripts/GLIP/final/targetDET.sh
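For reference, this final training stage corresponds to an invocation of the following form (mirroring the example in the model zoo section later in this README; the script may set additional options):

```bash
# Foggy-Cityscapes under GDINO; the '+' joins the CLIP detector weights and the collected cloud detections.
python train_net.py \
    --num-gpus 1 \
    --config configs/coin/GDINO/foggy.yaml \
    MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0044999.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth \
    OUTPUT_DIR output_GDINO/foggy/gard/targetDet
```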
To resume training, run the following commands. Note that the detection results from the cloud detector are already saved in the model's checkpoint, so there is no need to load them again.
# modify the value of MODEL.WEIGHTS e.g. output_GDINO/foggy/gard/targetDet/model_0002999.pth
bash scripts/GDINO/final/ResumeTrain.sh
bash scripts/GLIP/final/ResumeTrain.sh
During training, the CLIP detector and target detector will be automatically tested. If you want to directly test a saved checkpoint, please run the following command:
# Using Foggy-Cityscapes under GDINO as an example
# Add one line: 'TEST.SAVE_DETECTION_PKLS True' to save the detection results to the 'detections.pckl' file
# Set '--test_model_role clipdet' to test CLIP detector
python train_net.py \
--num-gpus 1 \
--config configs/coin/GDINO/foggy.yaml \
--eval-only \
--test_model_role targetdet \
MODEL.WEIGHTS your_checkpoint_path \
OUTPUT_DIR output_GDINO/foggy/test_targetdet
Please run the commands in the scripts/GDINO/classonly folder. It contains all the training and testing commands.
All trained models are stored at huggingface.co/Flashkong/COIN.
Name | Cloud detector | Dataset | Backbone | mAP | Link |
---|---|---|---|---|---|
CLIPDET (pretrain) | GDINO | Foggy-Cityscapes | ResNet50 | 28.2 | model_zoo/GDINO/foggy/CLIPDET.pth |
targetDET | GDINO | Foggy-Cityscapes | ResNet50 | 39.0 | model_zoo/GDINO/foggy/targetDET.pth |
CLIPDET (pretrain) | GDINO | Cityscapes | ResNet50 | 35.7 | model_zoo/GDINO/cityscape/CLIPDET.pth |
targetDET | GDINO | Cityscapes | ResNet50 | 44.5 | model_zoo/GDINO/cityscape/targetDET.pth |
CLIPDET (pretrain) | GDINO | BDD100K | ResNet50 | 31.9 | model_zoo/GDINO/BDD100K/CLIPDET.pth |
targetDET | GDINO | BDD100K | ResNet50 | 39.7 | model_zoo/GDINO/BDD100K/targetDET.pth |
CLIPDET (pretrain) | GDINO | KITTI | ResNet50 | 79.9 | model_zoo/GDINO/KITTI/CLIPDET.pth |
targetDET | GDINO | KITTI | ResNet50 | 80.8 | model_zoo/GDINO/KITTI/targetDET.pth |
CLIPDET (pretrain) | GDINO | SIM | ResNet50 | 60.0 | model_zoo/GDINO/SIM/CLIPDET.pth |
targetDET | GDINO | SIM | ResNet50 | 62.4 | model_zoo/GDINO/SIM/targetDET.pth |
CLIPDET (pretrain) | GDINO | Clipart | ResNet50 | 46.2 | model_zoo/GDINO/clipart/CLIPDET.pth |
targetDET | GDINO | Clipart | ResNet101 | 68.5 | model_zoo/GDINO/clipart/targetDET.pth |
Name | Cloud detector | Dataset | Backbone | mAP | Link |
---|---|---|---|---|---|
CLIPDET (pretrain) | GLIP | Foggy-Cityscapes | ResNet50 | 25.0 | model_zoo/GLIP/foggy/CLIPDET.pth |
targetDET | GLIP | Foggy-Cityscapes | ResNet50 | 27.7 | model_zoo/GLIP/foggy/targetDET.pth |
CLIPDET (pretrain) | GLIP | Cityscapes | ResNet50 | 30.9 | model_zoo/GLIP/cityscape/CLIPDET.pth |
targetDET | GLIP | Cityscapes | ResNet50 | 33.5 | model_zoo/GLIP/cityscape/targetDET.pth |
CLIPDET (pretrain) | GLIP | BDD100K | ResNet50 | 29.1 | model_zoo/GLIP/BDD100K/CLIPDET.pth |
targetDET | GLIP | BDD100K | ResNet50 | 33.5 | model_zoo/GLIP/BDD100K/targetDET.pth |
CLIPDET (pretrain) | GLIP | KITTI | ResNet50 | 55.9 | model_zoo/GLIP/KITTI/CLIPDET.pth |
targetDET | GLIP | KITTI | ResNet50 | 56.8 | model_zoo/GLIP/KITTI/targetDET.pth |
CLIPDET (pretrain) | GLIP | SIM | ResNet50 | 35.8 | model_zoo/GLIP/SIM/CLIPDET.pth |
targetDET | GLIP | SIM | ResNet50 | 37.1 | model_zoo/GLIP/SIM/targetDET.pth |
To verify the above models, please run the following commands:
mkdir model_zoo
# Place the downloaded models according to the Hugging Face directory structure.
bash scripts/modelzoo/GDINO/CLIPDET.sh
bash scripts/modelzoo/GDINO/targetDET.sh
bash scripts/modelzoo/GLIP/CLIPDET.sh
bash scripts/modelzoo/GLIP/targetDET.sh
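If useful, one way to fetch a single checkpoint is via the Hugging Face CLI; the in-repo paths come from the tables above, while the exact CLI flags may vary across huggingface_hub versions:

```bash
pip install -U "huggingface_hub[cli]"
# Download one checkpoint, preserving the repository's directory structure under the current folder.
huggingface-cli download Flashkong/COIN model_zoo/GDINO/foggy/targetDET.pth --local-dir .
```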
Since pre-training a CLIP detector takes some time, you can directly use our pre-trained CLIPDET:
# Using Foggy-Cityscapes under GDINO as an example
# collect detection results
python train_net.py \
--num-gpus 1 \
--config configs/coin/PRETRAINS/CLIPDET_foggy.yaml \
SOLVER.MAX_ITER 0 \
OUTPUT_DIR output_GDINO/foggy/pretrain/CLIPDET
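# train the target detector using the downloaded CLIPDET weights and the collected cloud detections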
python train_net.py \
--num-gpus 1 \
--config configs/coin/GDINO/foggy.yaml \
MODEL.WEIGHTS model_zoo/GDINO/foggy/CLIPDET.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth \
OUTPUT_DIR output_GDINO/foggy/gard/targetDet
Configs (`configs/coin`):
- `BASELINES`: Configuration files for testing cloud models and CLIP.
- `PRETRAINS`: Configuration files for pre-training the CLIP detector.
- `GDINO` and `GLIP`: Configuration files for final training.
- `ORACLE`: Configuration files for training the oracle model.
Trainers (`coin/engine`):
- `test.py`: For testing cloud models and CLIP.
- `pre_train.py`: For pre-training the CLIP detector.
- `trainer.py`: For final training.
Models (`coin/modeling/meta_arch`):
- `gdino.py` and `glip.py`: Entry points for the cloud detectors.
- `gdino_processor.py` and `glip_processor.py`: Post-processing of cloud detection results, used to collect results when pre-training the CLIP detector.
- `gdino_collector.py`, `glip_collector.py` and `clip_collector.py`: Collectors for saving detection results, used to collect results when pre-training the CLIP detector.
- `clip_rcnn.py`: Contains two models: a modified CLIP that predicts probabilities using the boxes from the cloud detector, and OpenVocabularyRCNN, the shared architecture of the CLIP detector and the target detector, as shown in Fig. 2(a) of our paper.
CKG network:
- `coin/modeling/merge/ckg.py`: The architecture of the CKG network.
- `coin/modeling/roi_heads/fast_rcnn.py`: The file where the CKG network is used.
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@inproceedings{
li2024cloud,
title={Cloud Object Detector Adaptation by Integrating Different Source Knowledge},
author={Shuaifeng Li and Mao Ye and Lihua Zhou and Nianxin Li and Siying Xiao and Song Tang and Xiatian Zhu},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=S8SEjerTTg}
}
We would like to express our sincere gratitude to the following excellent projects and their contributors for their invaluable work.
- The cloud detectors are GroundingDINO and GLIPv1.
- Local knowledge comes from CLIP.
- The entire code framework is based on Detectron2.
- The implementation of the two-stage detector OpenVocabularyRCNN draws on RegionCLIP.
- Part of the code is borrowed from ProbabilisticTeacher.
We propose to explore an interesting and promising problem, Cloud Object Detector Adaptation (CODA), where the target domain leverages detections provided by a large cloud model to build a target detector. Despite its powerful generalization capability, the cloud model still cannot achieve error-free detection in a specific target domain. In this work, we present a novel Cloud Object detector adaptation method by Integrating different source kNowledge (COIN). The key idea is to incorporate a public vision-language model (CLIP) to distill positive knowledge while refining negative knowledge for adaptation by self-promotion gradient direction alignment. To that end, knowledge dissemination, separation, and distillation are carried out successively. Knowledge dissemination combines knowledge from the cloud detector and the CLIP model to initialize a target detector and a CLIP detector in the target domain. By matching the CLIP detector with the cloud detector, knowledge separation categorizes detections into three parts: consistent, inconsistent and private detections, so that a divide-and-conquer strategy can be used for knowledge distillation. Consistent and private detections are directly used to train the target detector, while inconsistent detections are fused by a consistent knowledge generation network, which is trained by aligning the gradient direction of inconsistent detections to that of consistent detections, because the latter provides a direction toward an optimal target detector. Experimental results demonstrate that the proposed COIN method achieves state-of-the-art performance.
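To make the gradient direction alignment idea concrete, here is a purely conceptual PyTorch sketch that aligns the gradient of a loss on inconsistent detections with that of a loss on consistent detections. It is not the COIN implementation (which trains the consistent knowledge generation network described above), and all names below are illustrative.

```python
# Conceptual sketch of gradient direction alignment, not the actual COIN code.
import torch
import torch.nn.functional as F

def gradient_alignment_loss(loss_consistent, loss_inconsistent, params):
    """Encourage the gradient of the inconsistent-detection loss to point in the
    same direction as the gradient of the consistent-detection loss."""
    g_c = torch.autograd.grad(loss_consistent, params, retain_graph=True,
                              create_graph=True, allow_unused=True)
    g_i = torch.autograd.grad(loss_inconsistent, params, retain_graph=True,
                              create_graph=True, allow_unused=True)
    g_c = torch.cat([g.flatten() for g in g_c if g is not None])
    g_i = torch.cat([g.flatten() for g in g_i if g is not None])
    # 1 - cosine similarity: zero when the two gradients point the same way.
    return 1.0 - F.cosine_similarity(g_c, g_i, dim=0)
```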