StepX - Attention Explorer
Visualize how text and image attention evolve across Stable Diffusion denoising steps
Select a different model only if you want to switch from the default SD 1.5. The model loads automatically when you click Generate.
Text-to-Image Cross-Attention Maps
Visualize how text tokens attend to image regions during generation.
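As a rough illustration of what a cross-attention map is, the sketch below computes softmax(QK^T / sqrt(d)) for one attention head and reshapes the result into one spatial map per text token. The function name, the array shapes, and the single-head setup are assumptions made for this example; the app itself may aggregate over heads, layers, and denoising steps in ways not described here.

```python
import numpy as np

def cross_attention_maps(queries, keys):
    """Hypothetical helper: one attention map per text token.

    queries: (num_pixels, d) image-side features for a square latent grid
    keys:    (num_tokens, d) text-side features
    returns: (num_tokens, side, side), where side * side == num_pixels
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (num_pixels, num_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over tokens
    side = int(np.sqrt(queries.shape[0]))
    return probs.T.reshape(keys.shape[0], side, side)

rng = np.random.default_rng(0)
maps = cross_attention_maps(rng.normal(size=(16, 8)), rng.normal(size=(3, 8)))
print(maps.shape)
```

Each pixel distributes a total weight of 1 across the tokens, so summing the maps over the token axis gives an array of ones.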
Input
Generated Image
Depth Map (ZoeDepth)
Metric depth estimated from the generated image. Included in the Export All ZIP.
Attention Controls
GIF Animation
Export All
Attention Map
Note: Use the 'Download GIF' button to download the generated GIF. Use Export All to get a single ZIP containing an image for every step × token combination, a CSV, and per-token GIFs.
Image Self-Attention Maps
Visualize how image pixels attend to each other during generation.
Input
Generated Image
Compute heatmap for specific pixels
Pixel Heatmap shows which regions of the image are attended to by specific pixels.
How to Use:
- Click on Image: Click on any pixel in the generated image above. The system identifies the pixel and generates its heatmap automatically (recommended).
- Manual Input: You can also manually enter pixel IDs (row-major order, 0-4095 for SD v2-base):
  - Single pixel: 100
  - Range: 0-1023
  - List: 0,1,2,3
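The three manual input forms can be parsed with a few lines of Python. This is a minimal sketch, not the app's own parser: the function names are hypothetical, ranges are assumed to be inclusive, and the 64 x 64 grid and 512-pixel image size are assumptions taken from the 0-4095 ID range quoted above.

```python
def parse_pixel_ids(text, side=64):
    """Parse '100', '0-1023', or '0,1,2,3' into a sorted list of pixel IDs.

    Assumption: ranges are inclusive, so '0-1023' covers 1024 pixels.
    """
    total = side * side
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    out_of_range = [i for i in ids if not 0 <= i < total]
    if out_of_range:
        raise ValueError(f"pixel IDs must be in 0-{total - 1}")
    return sorted(ids)

def pixel_id_to_row_col(pixel_id, side=64):
    """Row-major order: ID = row * side + col."""
    return divmod(pixel_id, side)

def click_to_pixel_id(x, y, image_size=512, side=64):
    """Map a click in image coordinates to a latent pixel ID."""
    scale = image_size // side
    col = min(x // scale, side - 1)
    row = min(y // scale, side - 1)
    return row * side + col

print(parse_pixel_ids("0,1,2,3"))   # [0, 1, 2, 3]
print(pixel_id_to_row_col(100))     # (1, 36)
```

Under these assumptions, pixel ID 100 sits in row 1, column 36 of the latent grid, which corresponds to an 8 x 8 block of the 512-pixel image.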
Compute heatmap for bounding box
BBox Heatmap shows the average attention heatmap for all pixels within a bounding box.
How to Use:
- Click on Image: Click twice on the generated image above to define the bounding box (first click: top-left corner, second click: bottom-right corner)
- Manual Input: You can also manually enter coordinates in latent space (0-63 for SD v2-base).
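Averaging attention over a bounding box can be sketched as follows. The code assumes the self-attention is available as a row-normalized matrix of shape (side*side, side*side), where row i holds how much pixel i attends to every other pixel, and that box coordinates are inclusive latent coordinates. Both the function name and this data layout are assumptions for illustration.

```python
import numpy as np

def bbox_heatmap(attn, x0, y0, x1, y1, side=64):
    """Average the attention rows of all pixels inside an inclusive box.

    attn: (side*side, side*side) self-attention matrix (assumed layout)
    (x0, y0): top-left corner, (x1, y1): bottom-right corner, latent coords
    """
    if not (0 <= x0 <= x1 < side and 0 <= y0 <= y1 < side):
        raise ValueError("box must lie inside the latent grid")
    rows, cols = np.meshgrid(
        np.arange(y0, y1 + 1), np.arange(x0, x1 + 1), indexing="ij"
    )
    ids = (rows * side + cols).ravel()       # row-major pixel IDs in the box
    return attn[ids].mean(axis=0).reshape(side, side)

# Tiny 4 x 4 example with a random row-normalized attention matrix.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))
attn /= attn.sum(axis=1, keepdims=True)
heat = bbox_heatmap(attn, 1, 1, 2, 2, side=4)
print(heat.shape)
```

A box that covers a single pixel reduces to that pixel's own attention row, so this generalizes the per-pixel heatmap.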
Compute heatmap for polygon contour
Contour Heatmap uses polygon contours to define regions (supports irregular shapes).
How to Use:
- Click on Image: Click multiple times on the generated image above to add contour points (at least 3 points required)
- Manual Input: You can also manually enter contour points in image coordinates:
  - Format 1: x1,y1 x2,y2 x3,y3 ... (e.g., 0,0 256,0 256,256 0,256)
  - Format 2: x1 y1 x2 y2 x3 y3 ... (e.g., 0 0 256 0 256 256 0 256)
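Both contour formats reduce to a flat list of numbers once commas are treated as whitespace. The sketch below parses either format and rasterizes the polygon onto the latent grid with an even-odd ray-casting test on cell centers. The function names, the 512-pixel image size, and the 64 x 64 grid are assumptions for this example, not the app's actual code.

```python
import numpy as np

def parse_contour(text):
    """Parse 'x1,y1 x2,y2 ...' or 'x1 y1 x2 y2 ...' into (x, y) points."""
    numbers = [float(tok) for tok in text.replace(",", " ").split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    points = list(zip(numbers[0::2], numbers[1::2]))
    if len(points) < 3:
        raise ValueError("a contour needs at least 3 points")
    return points

def polygon_mask(points, image_size=512, side=64):
    """Boolean (side, side) mask of latent cells whose centers lie inside."""
    scale = image_size / side
    centers = (np.arange(side) + 0.5) * scale      # cell centers, image coords
    xs, ys = np.meshgrid(centers, centers)         # xs[row, col], ys[row, col]
    inside = np.zeros((side, side), dtype=bool)
    n = len(points)
    for i in range(n):
        (xa, ya), (xb, yb) = points[i], points[(i + 1) % n]
        if ya == yb:
            continue                               # horizontal edges never cross
        crosses = (ya > ys) != (yb > ys)           # edge spans the cell's y
        x_at_y = xa + (ys - ya) * (xb - xa) / (yb - ya)
        inside ^= crosses & (xs < x_at_y)          # even-odd rule
    return inside

square = parse_contour("0,0 256,0 256,256 0,256")
mask = polygon_mask(square)
print(mask.sum())   # 1024
```

The example square covers the top-left quarter of the image, which is 32 x 32 = 1024 latent cells.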
Compute diffused heatmap from a pixel
Pixel Diffused Heatmap starts from a single pixel and iteratively diffuses attention, gradually enhancing related regions.
It is useful for discovering regions that are semantically related to the selected pixel.
How to Use:
- Click on Image: Click on any pixel in the generated image above to select the starting pixel (Recommended)
- Manual Input: You can also manually enter the starting pixel ID.
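One plausible reading of "iteratively diffuses attention" is repeated propagation of a one-hot start vector through the self-attention matrix. The sketch below shows that interpretation only. The update rule, the step count, and the function name are assumptions, and the app's actual procedure may differ.

```python
import numpy as np

def diffused_heatmap(attn, start_id, steps=5, side=64):
    """Spread weight from one pixel along attention links (assumed scheme).

    attn: (side*side, side*side) row-normalized self-attention matrix
    """
    heat = np.zeros(side * side)
    heat[start_id] = 1.0
    for _ in range(steps):
        heat = heat @ attn        # each pixel hands its weight to what it attends to
        heat /= heat.sum()        # keep the map normalized
    return heat.reshape(side, side)

# Tiny 4 x 4 example with a random row-normalized attention matrix.
rng = np.random.default_rng(1)
attn = rng.random((16, 16))
attn /= attn.sum(axis=1, keepdims=True)
one_step = diffused_heatmap(attn, start_id=5, steps=1, side=4)
five_steps = diffused_heatmap(attn, start_id=5, steps=5, side=4)
```

With a single step the result equals the plain pixel heatmap (the attention row of the start pixel). Additional steps let weight reach pixels that are only indirectly related to the start pixel.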
Text-Guided Object Detection (TITAN Workflow)
Combines DAAM (Cross-Attention) and DAAM-I2I (Self-Attention) for text-guided object detection and segmentation.
Important: The detection result will be displayed on the DAAM-I2I generated image, not the DAAM image.
How It Works:
- DAAM tab: Generate an image with a prompt. This computes cross-attention (text-to-image attention) and creates word heatmaps.
- DAAM-I2I tab: Generate an image (can use the same or different prompt). This computes self-attention (image-to-image attention).
- Detection: Enter a word to detect. The system will:
  - Extract the word's heatmap from DAAM (based on DAAM prompt and image)
  - Use this heatmap as guidance for DAAM-I2I self-attention (based on DAAM-I2I prompt and image)
  - Display the detection result on the DAAM-I2I image
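The guidance step can be pictured as a weighted average of self-attention rows, with the word heatmap supplying the weights. The sketch below is one possible formulation under that assumption. The function name, the min-max normalization, and the 0.5 threshold are illustrative choices, not the documented TITAN procedure.

```python
import numpy as np

def guided_detection(word_heatmap, self_attn, threshold=0.5):
    """Combine a cross-attention word heatmap with self-attention (assumed scheme).

    word_heatmap: (side, side) non-negative map for the chosen word
    self_attn:    (side*side, side*side) row-normalized self-attention matrix
    returns: (normalized guided map, boolean detection mask)
    """
    side = word_heatmap.shape[0]
    weights = word_heatmap.ravel().astype(float)
    weights = weights / weights.sum()
    guided = (weights @ self_attn).reshape(side, side)   # weighted row average
    lo, hi = guided.min(), guided.max()
    norm = (guided - lo) / (hi - lo) if hi > lo else np.zeros_like(guided)
    return norm, norm >= threshold

# Toy 4 x 4 scene: pixels 0, 1, 4, 5 form an "object" that attends to itself.
obj = [0, 1, 4, 5]
rest = [i for i in range(16) if i not in obj]
attn = np.zeros((16, 16))
for i in obj:
    attn[i, obj] = 1 / len(obj)
for i in rest:
    attn[i, rest] = 1 / len(rest)
word = np.zeros((4, 4))
word[0, 0] = 1.0                      # the word heatmap hits one object pixel
norm, mask = guided_detection(word, attn)
print(mask.astype(int))
```

In the toy scene the word heatmap touches a single object pixel, yet the mask recovers the whole 2 x 2 object because the self-attention links its pixels together.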
Important Notes:
- Same prompts: If both tabs use the same prompt, detection works best as the word heatmap matches the image content.
- Different prompts: If prompts differ, the word you're detecting must exist in the DAAM prompt. The detection will still be shown on the DAAM-I2I image, but results may be unexpected if:
  - The word doesn't exist in the DAAM-I2I prompt
  - The word refers to different objects in the two prompts
  - The images are completely different
Note: You need to generate images in both the DAAM and DAAM-I2I tabs before using this feature.
TITAN: Large-Scale Visual Object Discovery
Extracts objects from prompts, generates images, and annotates them automatically with bounding boxes and segmentation masks.
TITAN uses DAAM heatmaps to automatically detect and annotate objects in images, generating COCO-format datasets.
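To show how a heatmap might become a COCO-style annotation, the sketch below thresholds a heatmap and derives a bounding box in COCO's [x, y, width, height] convention. The function name and the 0.5 threshold are assumptions. The heatmap is assumed to be at image resolution already, and the segmentation field is left out for brevity.

```python
import numpy as np

def heatmap_to_coco_annotation(heatmap, image_id, category_id, ann_id,
                               threshold=0.5):
    """Threshold a heatmap and describe the hot region as a COCO-style dict."""
    heat = np.asarray(heatmap, dtype=float)
    lo, hi = heat.min(), heat.max()
    if hi <= lo:
        return None                                # flat map: nothing to annotate
    mask = (heat - lo) / (hi - lo) >= threshold
    ys, xs = np.nonzero(mask)
    x0, y0 = int(xs.min()), int(ys.min())
    width = int(xs.max()) - x0 + 1
    height = int(ys.max()) - y0 + 1
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x0, y0, width, height],           # COCO order: x, y, w, h
        "area": int(mask.sum()),                   # pixel count of the mask
        "iscrowd": 0,
    }

heat = np.zeros((8, 8))
heat[2:5, 3:7] = 1.0                               # rows 2-4, columns 3-6
ann = heatmap_to_coco_annotation(heat, image_id=1, category_id=1, ann_id=1)
print(ann["bbox"], ann["area"])   # [3, 2, 4, 3] 12
```

A full dataset file would also need the top-level images and categories lists, plus a polygon or RLE segmentation for each mask.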