Top: Methods such as LOST (shown in the figure) and TokenCut identify and localize only the most salient foreground object, and hence can detect only one object per image.
Bottom: MOST is a simple yet effective method that localizes multiple objects per image without training. |
In this work, we tackle the challenging task of unsupervised object localization. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. We present Multiple Object localization with Self-supervised Transformers (MOST), which uses features of transformers trained with self-supervised learning to localize multiple objects in real-world images. MOST analyzes the similarity maps of the features using box counting, a fractal analysis tool, to identify tokens lying on foreground patches. The identified tokens are then clustered, and the tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image, and it outperforms SOTA algorithms on several object localization and discovery benchmarks on the PASCAL-VOC 07, PASCAL-VOC 12, and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pre-training of object detectors, and that it yields consistent improvements on fully supervised and semi-supervised object detection and on unsupervised region proposal generation. |
Similarity maps of tokens on the background and foreground for an image from the COCO dataset. The figure above shows three similarity maps, each for a token (shown in red) picked on the background (column 2) or on the foreground (columns 3 and 4). Tokens within foreground patches are more strongly correlated with one another than tokens on the background, so the similarity maps of foreground tokens are less spatially random than those of background tokens. The task then becomes analyzing the similarity maps and identifying the ones with less spatial randomness. |
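The similarity map of a token can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `similarity_maps`, the random stand-in features, and the row-major ordering of tokens on the patch grid are all assumptions made for the example.

```python
import numpy as np

def similarity_maps(features, grid_hw):
    """Return one 2D similarity map per token.

    features: (N, d) array of patch-token features, with N = h * w.
    grid_hw:  (h, w) patch grid; tokens are assumed to be in row-major order.
    """
    h, w = grid_hw
    if features.shape[0] != h * w:
        raise ValueError("number of tokens must equal h * w")
    # Entry (i, j) is the dot product between token i and token j.
    a = features @ features.T
    # Row i, reshaped to the grid, is the similarity map of token i.
    return a.reshape(h * w, h, w)

# Stand-in features; a real pipeline would use transformer patch tokens.
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 8))
maps = similarity_maps(feats, (3, 4))
```

Here `maps[i]` is the map one would visualize for the token at grid position `(i // w, i % w)`.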
MOST operates on features extracted from transformers trained using DINO. The features are used to compute the outer product A. Each row of A is analyzed by the entropy-based box analysis (EBA) module, which identifies tokens extracted from foreground patches. These tokens are clustered, using their spatial locations as features, to form pools. Each pool is then post-processed to generate a bounding box. |
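To make the idea of scoring spatial randomness with box counting concrete, here is a rough sketch. It is not the EBA module itself: the zero threshold, the box sizes, the use of Shannon entropy over per-box counts, and the name `box_count_entropy` are assumptions chosen for illustration.

```python
import numpy as np

def box_count_entropy(sim_map, box_sizes=(2, 4)):
    """Score the spatial randomness of a 2D similarity map (lower = more structured).

    Assumptions for this sketch: the map is binarized at zero, and for each
    box size the entropy of the per-box positive counts is computed.
    """
    binary = (sim_map > 0).astype(np.int64)
    h, w = binary.shape
    entropies = []
    for k in box_sizes:
        # Crop so that k x k boxes tile the map exactly.
        hh, ww = (h // k) * k, (w // k) * k
        boxes = binary[:hh, :ww].reshape(hh // k, k, ww // k, k).sum(axis=(1, 3))
        # Histogram of how many positive cells each box contains.
        hist = np.bincount(boxes.ravel(), minlength=k * k + 1).astype(float)
        p = hist[hist > 0] / hist.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))

# A compact blob (foreground-like) versus unstructured noise (background-like).
blob = -np.ones((16, 16))
blob[4:12, 4:12] = 1.0
noise = np.random.default_rng(0).normal(size=(16, 16))
blob_score = box_count_entropy(blob)
noise_score = box_count_entropy(noise)
```

In this toy setup the blob scores lower than the noise, because its boxes are almost all either empty or full. A token would be kept as foreground when its score falls below some threshold, which is left unspecified here.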
Qualitative results of MOST on the COCO20k (Top) and PASCAL VOC 2007+12 (Bottom) datasets: MOST can localize multiple objects per image without training. The localization ability of MOST is not limited by the biases of annotators, so it can also localize rocks, branches, water bodies, and similar regions. |
MOST can easily be extended to the task of unsupervised saliency detection. We choose the object identified by the largest cluster as the salient object and demonstrate results on the DUT-OMRON (Top), DUTS (Middle), and ECSSD (Bottom) datasets. Each row shows two examples of the input and output of MOST. In each example, the first image is the input and the second image is the mask generated using the largest cluster, i.e. the output. The third image is the output mask when all the clusters are used, and the fourth image is the ground truth. When only one salient object exists in the input (row 1), using all the clusters results in segmenting non-salient objects. In the presence of multiple instances of the salient object (row 2), picking the largest cluster results in segmenting only a single instance. Finally, row 3 shows some failure cases of MOST. Since all three datasets consist mostly of images with a single instance, we choose the mask generated from the largest cluster as our output. |
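The largest-cluster rule can be illustrated with a small sketch. The clustering step here is a stand-in: foreground tokens are grouped by 8-connectivity on the patch grid, which is an assumption for the example and not necessarily the clustering MOST uses. The helper names are hypothetical.

```python
from collections import deque
import numpy as np

def cluster_foreground(fg):
    """Label 8-connected groups of True cells; label 0 marks background."""
    h, w = fg.shape
    labels = np.zeros((h, w), dtype=np.int64)
    current = 0
    for r0, c0 in zip(*np.nonzero(fg)):
        if labels[r0, c0]:
            continue  # already assigned to a cluster
        current += 1
        labels[r0, c0] = current
        queue = deque([(r0, c0)])
        while queue:  # breadth-first flood fill over neighbouring cells
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and fg[rr, cc] and not labels[rr, cc]:
                        labels[rr, cc] = current
                        queue.append((rr, cc))
    return labels, current

def largest_cluster_mask(labels, n_clusters):
    """Boolean mask of the cluster with the most tokens (saliency output)."""
    if n_clusters == 0:
        return np.zeros(labels.shape, dtype=bool)
    sizes = np.bincount(labels.ravel(), minlength=n_clusters + 1)[1:]
    return labels == (int(np.argmax(sizes)) + 1)

def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) in patch coordinates."""
    rows, cols = np.nonzero(mask)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

# Toy foreground grid with a small group (4 tokens) and a large group (12 tokens).
fg = np.zeros((6, 8), dtype=bool)
fg[0:2, 0:2] = True
fg[3:6, 4:8] = True
labels, n = cluster_foreground(fg)
salient = largest_cluster_mask(labels, n)
box = bounding_box(salient)
```

Using `labels > 0` instead of `salient` corresponds to the all-clusters mask shown in the third image of each example.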
S. Rambhatla, I. Misra, R. Chellappa, A. Shrivastava. MOST: Multiple Object localization with Self-supervised Transformers for object discovery. ICCV, 2023. (Paper | Supplementary) |