iSeg: An Iterative Refinement-based Framework for Training-free Segmentation

1Tianjin University, 2Mohamed bin Zayed University of Artificial Intelligence, 3Chongqing University, 4Shanghai Artificial Intelligence Laboratory

iSeg produces increasingly accurate semantic segmentation results as the number of iterations increases.

Abstract

Stable diffusion has demonstrated a strong ability to synthesize images from text descriptions, suggesting that it contains strong semantic clues for grouping objects. Researchers have therefore explored employing stable diffusion for training-free segmentation. Most existing approaches refine the cross-attention map with the self-attention map only once, demonstrating that the self-attention map contains useful semantic information for improving segmentation.

To fully utilize the self-attention map, we present an in-depth experimental analysis of iteratively refining the cross-attention map with the self-attention map, and propose an effective iterative refinement framework for training-free segmentation, named iSeg. iSeg introduces an entropy-reduced self-attention module that uses a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses that correspond to irrelevant global information. Leveraging this module, iSeg stably improves the refined cross-attention map over successive iterations. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement.
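The two ideas in this paragraph, lowering the entropy of the self-attention map by gradient descent and then repeatedly propagating the cross-attention map through it, can be illustrated with a minimal NumPy sketch. This is not the released iSeg implementation: the function names, the choice to descend on softmax logits, the step size, the per-category min-max normalization, and the toy 16-pixel "image" are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def row_entropy(p, eps=1e-12):
    """Mean Shannon entropy over the rows of a row-stochastic matrix."""
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

def reduce_entropy(logits, step=0.2, n_steps=10, eps=1e-12):
    """Gradient descent on the row entropy H(softmax(z)) w.r.t. the logits z.

    Assumed update rule (not taken from the paper): for p = softmax(z),
    dH/dz_i = -p_i * (log p_i + H), so each step raises the logits of
    strong responses and lowers the logits of weak ones.
    """
    z = logits.copy()
    for _ in range(n_steps):
        p = softmax(z)
        log_p = np.log(p + eps)
        h = -(p * log_p).sum(axis=1, keepdims=True)
        grad = -p * (log_p + h)
        z = z - step * grad
    return softmax(z)

def iterative_refine(cross_attn, self_attn, n_iters=5, eps=1e-12):
    """Repeatedly propagate an (N, C) cross-attention map through an (N, N)
    self-attention map, min-max normalizing each category after every pass."""
    refined = cross_attn.astype(float)
    for _ in range(n_iters):
        refined = self_attn @ refined
        lo = refined.min(axis=0, keepdims=True)
        hi = refined.max(axis=0, keepdims=True)
        refined = (refined - lo) / (hi - lo + eps)
    return refined

# Toy example: 16 pixels, two regions (pixels 0-7 and 8-15) with distinct features.
feats = np.repeat(np.eye(2), 8, axis=0)      # (16, 2)
logits = 4.0 * feats @ feats.T               # (16, 16) pixel-to-pixel similarity
self_attn = softmax(logits)
sharp_attn = reduce_entropy(logits)

# Coarse seeds: each category initially covers only part of its region.
seed = np.zeros((16, 2))
seed[2:6, 0] = 1.0
seed[10:13, 1] = 1.0

refined = iterative_refine(seed, sharp_attn)
labels = refined.argmax(axis=1)              # per-pixel category index
```

In this toy setting the partial seeds spread to cover their whole regions, and `row_entropy(sharp_attn)` is lower than `row_entropy(self_attn)`. Real attention maps come from the diffusion model's layers and are far noisier, so the step size and iteration count here should not be read as recommended values.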

Extensive experiments across different datasets and diverse segmentation tasks (weakly-supervised semantic segmentation, open-vocabulary semantic segmentation, unsupervised segmentation, and mask generation on a synthetic dataset) demonstrate the merits of the proposed contributions, leading to promising performance on these tasks. For unsupervised semantic segmentation on Cityscapes, iSeg achieves an absolute gain of 3.8% in mIoU over the best existing training-free approach in the literature. Moreover, iSeg supports segmentation with different kinds of images and interactions.
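For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, over categories, the overlap between predicted and ground-truth regions divided by their union. The sketch below is a simplified illustration on a single pair of label arrays; the function name `mean_iou` is hypothetical, and benchmark protocols usually accumulate intersections and unions over a whole dataset rather than per array.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both arrays: skip it
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    # Return 0.0 when no class is present at all, to avoid averaging an empty list.
    return float(np.mean(ious)) if ious else 0.0

pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
score = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2
```

An "absolute gain" in mIoU, as reported above, is a difference in percentage points between two such scores, not a relative improvement.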

Framework

Natural Images Segmentation

Cross Domain Segmentation

Animation

We provide a cross-attention map animation that starts with a box containing a chair. As the number of iterations increases, the box region tightens into the mask of the chair.


Interaction

We also provide an interactive demo in which objects can be segmented using points, lines, or boxes.