Zero-shot Unsupervised Transfer Instance Segmentation [Best paper at CVPR 2023 L3D-IVU workshop]
___ ___ ___
/ /\ /__/\ ___ ___ / /\
/ /::| \ \:\ / /\ / /\ / /:/_
/ /:/:| \ \:\ / /:/ / /:/ / /:/ /\
/ /:/|:|__ ___ \ \:\ / /:/ /__/::\ / /:/ /::\
/__/:/ |:| /\ /__/\ \__\:\ / /::\ \__\/\:\__ /__/:/ /:/\:\
\__\/ |:|/:/ \ \:\ / /:/ /__/:/\:\ \ \:\/\ \ \:\/:/ /:/
| |:/:/ \ \:\ /:/ \__\/ \:\ \__\::/ \ \::/ /:/
| |::/ \ \:\/:/ \ \:\ /__/:/ \__\/ /:/
| |:/ \ \::/ \__\/ \__\/ /__/:/
|__|/ \__\/ \__\/
Official PyTorch implementation of Zero-shot Unsupervised Transfer Instance Segmentation. Details can be found in the paper.
[paper
] [poster
]
[project page
]
Please download datasets of interest first by visiting the following links:
Note that COCO20K dataset is composed of 19,817 images from the COCO2014 train set. The list of file names can be found at this link.
Note that, we only consider object categories (and a background) for the COCO2017 dataset.
If you want to construct image archives yourself for your custom dataset or a specific set of categories, you may want to download the following datasets to use as an index dataset (details can be found in our paper):
We advise you to put the downloaded dataset(s) into the following directory structure for ease of implementation:
{your_dataset_directory}
├──coca
│ ├──binary
│ ├──image
├──coco # for COCO-20K
│ ├──train2014
│ ├──annotations
│ ├──coco_20k_filenames.txt
├──coco2017
│ ├──annotations
│ ├──train2017
│ ├──val2017
├──ImageNet2012 # for an index dataset
│ ├──train
│ ├──val
├──ImageNet-S
│ ├──ImageNetS919
├──index_dataset
├──pass # for an index dataset
│ ├──images
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c conda-forge tqdm
conda install -c conda-forge matplotlib
conda install -c anaconda ujson
conda install -c conda-forge pyyaml
conda install -c conda-forge pycocotools
conda install -c anaconda scipy
pip install opencv-python
pip install git+https://github.com/openai/CLIP.git
A required version of each package might vary depending on your local device.
Please change the following options to fit your directory structure before training/inference:
dir_ckpt: {your_dir_ckpt} # this should point to a checkpoint directory
dir_train_dataset:
[
"{your_dataset_dir}/ImageNet2012/train",
"{your_dataset_dir}/pass/images"
] # this points to a directory of an index dataset(s)
p_filename_to_image_embedding: [
"{your_dataset_dir}/ImageNet2012/filename_to_ViT_L_14_336px_train_img_embedding.pkl",
"{your_dataset_dir}/pass/filename_to_ViT_L_14_336px_img_embedding.pkl"
]
dir_val_dataset: "{your_dataset_dir}/{evaluation_benchmark}",
category_to_p_images_fp: "{your_dataset_dir}/index_dataset/{evaluation_benchmark}_category_to_p_images_n500.json"
You can download filename_to_image_embedding
and category_to_p_images
files below.
filename_to_image_embedding
:
CLIP
image embeddings for the ImageNet2012 training set (~4.1GB)CLIP
image embeddings for PASS (~4.6GB)
Once downloaded, please put the files into the corresponding dataset directory (i.e., ImageNet2012
and pass
directories shown in the recommended directory structure above). Note that, in both cases, the ViT-L/14@336px
CLIP
image encoder is used to extract the image embeddings.
category_to_p_images
:
Please put these files in the index_dataset
directory. In addition, you have to change the image paths in each file accordingly to your case.
ZUTIS
is trained with pseudo-labels from an unsupervised saliency detector (e.g., SelfMask
).
This involves two steps:
- Retrieving images for a list of categories of interest from index datasets using
CLIP
; - Generating pseudo-masks for the retrieved images by applying
SelfMask
to them.
For the first step to be successfully done, make sure you already downloaded CLIP
image embeddings for the images in the ImageNet2012 training set and in the PASS dataset as described in 0. Configuration file.
The pseudo-mask generation process will be automatically triggered when running a training script, e.g., for training a model with a set of categories in COCO2017:
bash coco2017_vit_b_16.sh
It is worth noting that, as mentioned in the paper, the training is done for both semantic segmentation and instance segmentation at once. I.e., in the above case for COCO2017, the model will be trained with images for 80 object categories in COCO2017 to do semantic and instance segmentations.
Semantic segmentation
To evaluate a model with pre-trained weights on a semantic segmentation benchmark, e.g., COCO2017, please run:
bash coco2017_vit_b_16.sh $PATH_TO_WEIGHTS
Instance segmentation
For an instance segmentation benchmark, run:
bash coco20k_vit_b_16.sh $PATH_TO_WEIGHTS
We provide the pre-trained weights of ZUTIS:
benchmark | backbone | APmk (%) | APmk50 (%) | APmk75 (%) | link |
---|---|---|---|---|---|
COCO-20K | ViT-B/16 | 5.7 | 11.0 | 5.4 | weights (~537.9 MB) |
benchmark | split | backbone | IoU (%) | pixel accuracy (%) | link |
---|---|---|---|---|---|
CoCA | - | ViT-B/16 | 32.7 | 80.7 | weights (~537.9 MB) |
COCO2017 | val | ViT-B/16 | 32.8 | 76.4 | weights (~537.9 MB) |
ImageNet-S919 | test | ViT-B/32 | 27.5 | - | weights (~544.6 MB) |
ImageNet-S919 | test | ViT-B/16 | 37.4 | - | weights (~537.9 MB) |
@inproceedings{shin2023zutis,
title = {Zero-shot Unsupervised Transfer Instance Segmentation},
author = {Shin, Gyungin and Albanie, Samuel and Xie, Weidi},
booktitle = {CVPRW},
year = {2023}
}
We borrowed code for CLIP
from https://github.com/openai/CLIP.
If you have any questions about our code/implementation, please contact us at gyungin [at] robots [dot] ox [dot] ac [dot] uk.