Shuaifeng Li1, Mao Ye1*, Lihua Zhou1, Nianxin Li1, Siying Xiao1, Song Tang2, Xiatian Zhu3
1University of Electronic Science and Technology of China
2University of Shanghai for Science and Technology, 3University of Surrey
Paper | Project | Slides | Poster | Blog | η₯δΉ | ε°ηΊ’δΉ¦
Welcome to my homepage: Shuaifeng Li.
Following the trend of the times, we explore an interesting and promising problem, Cloud Object Detector Adaptation (CODA), where the target domain leverages detections provided by a large vision-language cloud detector to build a target detector. Thanks to the large cloud model, open target scenarios and categories can be handled, so open-set adaptation is no longer a problem.
Please note that CODA does not restrict whether CLIP is used, even though our method COIN does use CLIP.
π― Our previous CVPR'22 ORAL work, Source-Free Object Detection by Learning to Overlook Domain Style
, investigates source-free domain adaptive object detection, which addresses privacy concerns by assuming that the source domain data is inaccessible. If you are interested, feel free to explore our Paper and Code.
Fortunately, during the paper review process, the successive releases of Grounding DINO 1.5, 1.6, and even DINO-X have provided a timely boost to our work. Moreover, IDEA-Research has officially opened access to the Grounding DINO 1.5 API, offering a more practical and robust application scenario for our paper.
To request an API key for Grounding DINO 1.5, please follow the steps outlined here and install the environment following this guide.
We provide an example for the Foggy-Cityscapes dataset. Write the obtained TOKEN into the bash files in the scripts/GDINO1.5API/ folder, after MODEL.TEACHER_CLOUD.TOKEN, and then run the following commands. Please see here for a detailed explanation.
conda activate coin3.9api
bash scripts/GDINO1.5API/test/GDINO1.5API.sh
bash scripts/GDINO1.5API/test/CLIP.sh
bash scripts/GDINO1.5API/pretrain/CLIPDET.sh
bash scripts/GDINO1.5API/final/targetDET.sh
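For orientation, the edited token line inside each script is expected to look roughly like the excerpt below. Only MODEL.TEACHER_CLOUD.TOKEN is documented here; the config path and other arguments are illustrative guesses, so defer to the actual files in scripts/GDINO1.5API/.

```bash
# Hypothetical excerpt of a scripts/GDINO1.5API/ script after inserting the token;
# everything except MODEL.TEACHER_CLOUD.TOKEN is illustrative.
python train_net.py \
    --num-gpus 1 \
    --config configs/coin/BASELINES/GDINO1.5API_foggy.yaml \
    --eval-only \
    MODEL.TEACHER_CLOUD.TOKEN your_api_token_here \
    OUTPUT_DIR output_GDINO1.5API/foggy/test_cloud
```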
For datasets other than the six used in the paper, please prepare the data in VOC format and register it by adding lines in coin/data/datasets/builtin.py.
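As a minimal sketch, assuming coin/data/datasets/builtin.py follows Detectron2's standard VOC registration utilities (the repository's own helpers may differ), registering a new VOC-format dataset could look like this; the dataset name, directory and class names below are hypothetical:

```python
# Hypothetical registration of a new VOC-format dataset using Detectron2's built-in helper.
from detectron2.data.datasets import register_pascal_voc

MY_CLASSES = ["car", "person"]  # replace with your target-domain categories

def register_my_dataset(root="datasets/my_dataset"):
    # expects root/Annotations, root/JPEGImages and root/ImageSets/Main/{split}.txt
    for split in ["train", "test"]:
        register_pascal_voc(f"my_dataset_{split}", root, split, year=2012, class_names=MY_CLASSES)

register_my_dataset()
```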
First, clone this repository: git clone https://github.com/Flashkong/COIN.git && cd COIN.
For environment setup, please refer to docs/Environment.md. For dataset preparation, please refer to docs/Datasets.md.
Then, execute the following command:
conda activate coin
rm -rf ./datasets # Please make sure you have completed all steps in 'docs/Datasets.md'
ln -s your_datasets_dir ./datasets
First, create a folder for cloud models: mkdir cloud_models.
Then, download the models from the links below or from their original GitHub repositories: Grounding DINO and GLIPv1.
- Grounding DINO - Swin B (Default): Github or Huggingface.
- Grounding DINO - Swin T: Github or Huggingface.
- GLIP - Swin L: Github
Finally, put all the cloud models in the cloud_models folder.
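The expected layout is sketched below; the checkpoint filenames follow the upstream releases and are listed here only as an assumption, so keep whatever names the COIN configs expect.

```
cloud_models/
├── groundingdino_swinb_cogcoor.pth   # Grounding DINO - Swin B (default)
├── groundingdino_swint_ogc.pth       # Grounding DINO - Swin T
└── glip_large_model.pth              # GLIP - Swin L
```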
bash scripts/GDINO/test/GDINO.sh
bash scripts/GLIP/test/GLIP.sh
bash scripts/GDINO/test/CLIP.sh
bash scripts/GLIP/test/CLIP.sh
If you don't want to pre-train the CLIP detector yourself, you can directly use our pre-trained CLIP detector for training. For details, please see here.
Execute the following commands to pre-train the CLIP detector. It will first collect the detection results of the cloud detector and CLIP and save them to GDINO_collect.pth and CLIP_-000001.pth respectively; it will then automatically pre-train the CLIP detector.
bash scripts/GDINO/pretrain/CLIPDET.sh
bash scripts/GLIP/pretrain/CLIPDET.sh
To resume training, run the following commands. Note that CLIP's detection results are already saved in the model's checkpoint, so there is no need to load them again. If you want to train from scratch without performing result collection again, please load CLIP_-000001.pth; a sketch follows the commands below.
# modify the value of MODEL.WEIGHTS e.g. output_GDINO/foggy/pretrain/CLIPDET/CLIP_0002999.pth
bash scripts/GDINO/pretrain/ResumeTrain.sh
bash scripts/GLIP/pretrain/ResumeTrain.sh
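As a hedged sketch (assuming ResumeTrain.sh reads MODEL.WEIGHTS the same way as the other scripts), both cases amount to pointing MODEL.WEIGHTS at the desired checkpoint before running the script:

```bash
# Hypothetical illustration; edit MODEL.WEIGHTS inside the script if it is hard-coded there.
# Resume from an intermediate checkpoint:
#   MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0002999.pth
# Train from scratch while reusing the already-collected detections:
#   MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_-000001.pth
bash scripts/GDINO/pretrain/ResumeTrain.sh
```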
Execute the following commands. You need to modify the value of MODEL.WEIGHTS: the first path points to the pre-trained CLIP detector, and the second path points to the detection results collected from the cloud detector, e.g. MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0044999.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth for Foggy-Cityscapes under GDINO. A full example invocation is sketched after the commands below.
You can also directly use our pre-trained CLIP detector for training. For details, please see here.
bash scripts/GDINO/final/targetDET.sh
bash scripts/GLIP/final/targetDET.sh
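For reference, this final training stage corresponds to an invocation of the following form (mirroring the example in the model zoo section later in this README; the script may set additional options):

```bash
# Foggy-Cityscapes under GDINO; the '+' joins the CLIP detector weights and the collected cloud detections.
python train_net.py \
    --num-gpus 1 \
    --config configs/coin/GDINO/foggy.yaml \
    MODEL.WEIGHTS output_GDINO/foggy/pretrain/CLIPDET/CLIP_0044999.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth \
    OUTPUT_DIR output_GDINO/foggy/gard/targetDet
```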
To resume training, run the following commands. Note that the detection results from the cloud detector are already saved in the model's checkpoint, so there is no need to load them again.
# modify the value of MODEL.WEIGHTS e.g. output_GDINO/foggy/gard/targetDet/model_0002999.pth
bash scripts/GDINO/final/ResumeTrain.sh
bash scripts/GLIP/final/ResumeTrain.sh
During training, the CLIP detector and target detector will be automatically tested. If you want to directly test a saved checkpoint, please run the following command:
# Using Foggy-Cityscapes under GDINO as an example
# Add one line: 'TEST.SAVE_DETECTION_PKLS True' to save the detection results to the 'detections.pckl' file
# Set '--test_model_role clipdet' to test CLIP detector
python train_net.py \
--num-gpus 1 \
--config configs/coin/GDINO/foggy.yaml \
--eval-only \
--test_model_role targetdet \
MODEL.WEIGHTS your_checkpoint_path \
OUTPUT_DIR output_GDINO/foggy/test_targetdet
Please run the commands in the scripts/GDINO/classonly folder. It contains all the training and testing commands.
All trained models are stored at huggingface.co/Flashkong/COIN.
Name | Cloud detector | Dataset | Backbone | mAP | Link |
---|---|---|---|---|---|
CLIPDET (pretrain) | GDINO | Foggy-Cityscapes | ResNet50 | 28.2 | model_zoo/GDINO/foggy/CLIPDET.pth |
targetDET | GDINO | Foggy-Cityscapes | ResNet50 | 39.0 | model_zoo/GDINO/foggy/targetDET.pth |
CLIPDET (pretrain) | GDINO | Cityscapes | ResNet50 | 35.7 | model_zoo/GDINO/cityscape/CLIPDET.pth |
targetDET | GDINO | Cityscapes | ResNet50 | 44.5 | model_zoo/GDINO/cityscape/targetDET.pth |
CLIPDET (pretrain) | GDINO | BDD100K | ResNet50 | 31.9 | model_zoo/GDINO/BDD100K/CLIPDET.pth |
targetDET | GDINO | BDD100K | ResNet50 | 39.7 | model_zoo/GDINO/BDD100K/targetDET.pth |
CLIPDET (pretrain) | GDINO | KITTI | ResNet50 | 79.9 | model_zoo/GDINO/KITTI/CLIPDET.pth |
targetDET | GDINO | KITTI | ResNet50 | 80.8 | model_zoo/GDINO/KITTI/targetDET.pth |
CLIPDET (pretrain) | GDINO | SIM | ResNet50 | 60.0 | model_zoo/GDINO/SIM/CLIPDET.pth |
targetDET | GDINO | SIM | ResNet50 | 62.4 | model_zoo/GDINO/SIM/targetDET.pth |
CLIPDET (pretrain) | GDINO | Clipart | ResNet50 | 46.2 | model_zoo/GDINO/clipart/CLIPDET.pth |
targetDET | GDINO | Clipart | ResNet101 | 68.5 | model_zoo/GDINO/clipart/targetDET.pth |
Name | Cloud detector | Dataset | Backbone | mAP | Link |
---|---|---|---|---|---|
CLIPDET (pretrain) | GLIP | Foggy-Cityscapes | ResNet50 | 25.0 | model_zoo/GLIP/foggy/CLIPDET.pth |
targetDET | GLIP | Foggy-Cityscapes | ResNet50 | 27.7 | model_zoo/GLIP/foggy/targetDET.pth |
CLIPDET (pretrain) | GLIP | Cityscapes | ResNet50 | 30.9 | model_zoo/GLIP/cityscape/CLIPDET.pth |
targetDET | GLIP | Cityscapes | ResNet50 | 33.5 | model_zoo/GLIP/cityscape/targetDET.pth |
CLIPDET (pretrain) | GLIP | BDD100K | ResNet50 | 29.1 | model_zoo/GLIP/BDD100K/CLIPDET.pth |
targetDET | GLIP | BDD100K | ResNet50 | 33.5 | model_zoo/GLIP/BDD100K/targetDET.pth |
CLIPDET (pretrain) | GLIP | KITTI | ResNet50 | 55.9 | model_zoo/GLIP/KITTI/CLIPDET.pth |
targetDET | GLIP | KITTI | ResNet50 | 56.8 | model_zoo/GLIP/KITTI/targetDET.pth |
CLIPDET (pretrain) | GLIP | SIM | ResNet50 | 35.8 | model_zoo/GLIP/SIM/CLIPDET.pth |
targetDET | GLIP | SIM | ResNet50 | 37.1 | model_zoo/GLIP/SIM/targetDET.pth |
To verify the above models, please run the following commands:
mkdir model_zoo
# Place the downloaded models according to the Hugging Face directory structure.
bash scripts/modelzoo/GDINO/CLIPDET.sh
bash scripts/modelzoo/GDINO/targetDET.sh
bash scripts/modelzoo/GLIP/CLIPDET.sh
bash scripts/modelzoo/GLIP/targetDET.sh
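If useful, one way to fetch a single checkpoint is via the Hugging Face CLI; the in-repo paths come from the tables above, while the exact CLI flags may vary across huggingface_hub versions:

```bash
pip install -U "huggingface_hub[cli]"
# Download one checkpoint, preserving the repository's directory structure under the current folder.
huggingface-cli download Flashkong/COIN model_zoo/GDINO/foggy/targetDET.pth --local-dir .
```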
Since pre-training a CLIP detector takes some time, you can directly use our pre-trained CLIPDET:
# Using Foggy-Cityscapes under GDINO as an example
# collect detection results
python train_net.py \
--num-gpus 1 \
--config configs/coin/PRETRAINS/CLIPDET_foggy.yaml \
SOLVER.MAX_ITER 0 \
OUTPUT_DIR output_GDINO/foggy/pretrain/CLIPDET
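# train the target detector using the downloaded CLIPDET weights and the collected cloud detections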
python train_net.py \
--num-gpus 1 \
--config configs/coin/GDINO/foggy.yaml \
MODEL.WEIGHTS model_zoo/GDINO/foggy/CLIPDET.pth+output_GDINO/foggy/pretrain/CLIPDET/GDINO_collect.pth \
OUTPUT_DIR output_GDINO/foggy/gard/targetDet
Configs (`configs/coin`):
- `BASELINES`: Configuration files for testing cloud models and CLIP.
- `PRETRAINS`: Configuration files for pre-training the CLIP detector.
- `GDINO` and `GLIP`: Configuration files for final training.
- `ORACLE`: Configuration files for training the oracle model.
Trainers (`coin/engine`):
- `test.py`: For testing cloud models and CLIP.
- `pre_train.py`: For pre-training the CLIP detector.
- `trainer.py`: For final training.
Models (`coin/modeling/meta_arch`):
- `gdino.py` and `glip.py`: Entry points for the cloud detectors.
- `gdino_processor.py` and `glip_processor.py`: Post-processing of cloud detection results, used to collect results when pre-training the CLIP detector.
- `gdino_collector.py`, `glip_collector.py` and `clip_collector.py`: Collectors for saving detection results, used to collect results when pre-training the CLIP detector.
- `clip_rcnn.py`: Contains two models: a modified CLIP that predicts probabilities using the boxes from the cloud detector, and OpenVocabularyRCNN, the shared architecture of the CLIP detector and the target detector, as shown in Fig. 2(a) of our paper.
CKG network:
- `coin/modeling/merge/ckg.py`: The architecture of the CKG network.
- `coin/modeling/roi_heads/fast_rcnn.py`: The file where the CKG network is used.
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@inproceedings{
li2024cloud,
title={Cloud Object Detector Adaptation by Integrating Different Source Knowledge},
author={Shuaifeng Li and Mao Ye and Lihua Zhou and Nianxin Li and Siying Xiao and Song Tang and Xiatian Zhu},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=S8SEjerTTg}
}
We would like to express our sincere gratitude to the following excellent projects and their contributors for their invaluable work.
- The cloud detectors are GroundingDINO and GLIPv1.
- Local knowledge comes from CLIP.
- The entire code framework is based on Detectron2.
- The implementation of the two-stage detector OpenVocabularyRCNN draws on RegionCLIP.
- Part of the code is borrowed from ProbabilisticTeacher.
We propose to explore an interesting and promising problem, Cloud Object Detector Adaptation (CODA), where the target domain leverages detections provided by a large cloud model to build a target detector. Despite its powerful generalization capability, the cloud model still cannot achieve error-free detection in a specific target domain. In this work, we present a novel Cloud Object detector adaptation method by Integrating different source kNowledge (COIN). The key idea is to incorporate a public vision-language model (CLIP) to distill positive knowledge while refining negative knowledge for adaptation by self-promotion gradient direction alignment. To that end, knowledge dissemination, separation, and distillation are carried out successively. Knowledge dissemination combines knowledge from the cloud detector and the CLIP model to initialize a target detector and a CLIP detector in the target domain. By matching the CLIP detector with the cloud detector, knowledge separation categorizes detections into three parts: consistent, inconsistent and private detections, so that a divide-and-conquer strategy can be used for knowledge distillation. Consistent and private detections are directly used to train the target detector, while inconsistent detections are fused by a consistent knowledge generation network, which is trained by aligning the gradient direction of inconsistent detections to that of consistent detections, because the latter provides a direction toward an optimal target detector. Experimental results demonstrate that the proposed COIN method achieves state-of-the-art performance.
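To make the gradient direction alignment idea concrete, here is a purely conceptual PyTorch sketch that aligns the gradient of a loss on inconsistent detections with that of a loss on consistent detections. It is not the COIN implementation (which trains the consistent knowledge generation network described above), and all names below are illustrative.

```python
# Conceptual sketch of gradient direction alignment, not the actual COIN code.
import torch
import torch.nn.functional as F

def gradient_alignment_loss(loss_consistent, loss_inconsistent, params):
    """Encourage the gradient of the inconsistent-detection loss to point in the
    same direction as the gradient of the consistent-detection loss."""
    g_c = torch.autograd.grad(loss_consistent, params, retain_graph=True,
                              create_graph=True, allow_unused=True)
    g_i = torch.autograd.grad(loss_inconsistent, params, retain_graph=True,
                              create_graph=True, allow_unused=True)
    g_c = torch.cat([g.flatten() for g in g_c if g is not None])
    g_i = torch.cat([g.flatten() for g in g_i if g is not None])
    # 1 - cosine similarity: zero when the two gradients point the same way.
    return 1.0 - F.cosine_similarity(g_c, g_i, dim=0)
```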