StepX — Attention Explorer

Inspectable generation traces for open diffusion models

StepX turns a Stable Diffusion run into a guided visual explanation: generate an image, inspect how prompt words connect to denoising-step images, and export the evidence. It explains the open pipeline used here, not private black-box image models.

Primary audience

AI image researchers and educators

Use StepX to demonstrate cross-attention and denoising behavior in an inspectable model.

Creative technologists and designers

Use it to debug prompt-image grounding when working with open diffusion workflows.

Students and reviewers

Use the guided outputs as evidence for presentations, reports, and critique sessions.

Recommended workflow

Generate

Start with one prompt and seed so the run is reproducible.

Inspect a word

Select Global or one prompt word to see where attention concentrates.

Move through steps

Compare early layout formation with later detail refinement.

Export evidence

Save a GIF or ZIP for analysis, teaching, or project documentation.

Main path

Generate & Inspect is the default route for token-level attention over denoising steps.

Follow-up analysis

Image Structure helps compare region-to-region relationships after generation.

Advanced tools

Object discovery and ZoeDepth are optional companions, not the primary explanation.

Who is StepX for?

StepX is for researchers, educators, creative technologists, and designers working with open diffusion models who need to explain what happened inside a generation run. It is not a general-purpose image generator, and it does not claim to explain private black-box models such as ChatGPT image generation.

Recommended path: generate an image in Generate & Inspect, choose one prompt word, move the Step slider, then export a GIF or ZIP if you want evidence for a presentation, paper, design review, or committee demo.

Generate an image, then inspect prompt-word attention

Main workflow. Use this first. StepX generates one Stable Diffusion image, saves the decoded denoising-step images, and overlays DAAM attention on the matching step image. Start with one prompt, then choose a word and move the Step slider.

📝 Input

Prompt

The text prompt used to generate the image and compute word-level attention maps.

Inference Steps

Number of denoising steps. More steps produce more time points to inspect, but take longer.

Range: 10 to 100

Seed

Controls reproducibility. Use the same seed to compare prompts more fairly.
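
As a rough illustration of why a fixed seed makes runs comparable, the sketch below draws starting noise from a seeded generator. It uses Python's standard random module as a stand-in; how StepX seeds its diffusion pipeline is not shown on this page, and the function name initial_noise is hypothetical.

```python
import random


def initial_noise(seed, size):
    """Draw `size` Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Same seed: identical starting noise, so prompt changes are the only variable.
same_a = initial_noise(seed=42, size=8)
same_b = initial_noise(seed=42, size=8)

# Different seed: different starting noise, so outputs are not directly comparable.
other = initial_noise(seed=43, size=8)
```

With the seed held fixed, any difference between two outputs can be attributed to the prompt or settings rather than to the starting noise.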

Status

🖼️ Generated Image

Output

🎛️ Attention Controls

Pick a word, move through denoising steps, and adjust opacity. Each heatmap is overlaid on the decoded image for the selected step when available.

Focus Word

Choose Global for aggregate attention, or choose a prompt word to inspect its spatial grounding.
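
One plausible way to build a Global view is to average the per-word heatmaps and rescale the result for display. The sketch below assumes a plain mean followed by min-max normalization; the aggregation StepX actually uses is not stated here, and global_map and normalize are hypothetical helper names.

```python
def global_map(token_maps):
    """Average per-token heatmaps (all the same shape) into one aggregate map."""
    maps = list(token_maps.values())
    if not maps:
        raise ValueError("need at least one token map")
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


def normalize(heatmap):
    """Rescale values to [0, 1]; a completely flat map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


# Toy 2x2 heatmaps for two prompt words (values are made up for illustration).
token_maps = {
    "cat": [[0.0, 2.0], [0.0, 0.0]],
    "hat": [[0.0, 0.0], [4.0, 0.0]],
}
agg = global_map(token_maps)
norm = normalize(agg)
```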

Step

Denoising time point. Early steps often reflect layout formation; later steps often refine detail.

Range: 0 to 50

Overlay Alpha

Heatmap opacity over the selected denoising-step image.

Range: 0.1 to 1
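
Overlay Alpha behaves like a standard alpha blend: each output pixel is a weighted mix of the step image and the heatmap color. The sketch below shows that formula on plain RGB tuples. It is an assumption that StepX blends linearly in this way, and blend_pixel and overlay are hypothetical names.

```python
def blend_pixel(base, heat, alpha):
    """Blend one RGB heatmap color over one RGB base color."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # out = (1 - alpha) * base + alpha * heat, per channel
    return tuple(round((1 - alpha) * b + alpha * h) for b, h in zip(base, heat))


def overlay(base_img, heat_img, alpha):
    """Blend two images given as equally sized nested lists of RGB tuples."""
    return [
        [blend_pixel(b, h, alpha) for b, h in zip(base_row, heat_row)]
        for base_row, heat_row in zip(base_img, heat_img)
    ]


gray = (100, 100, 100)
red = (255, 0, 0)
faint = blend_pixel(gray, red, 0.2)    # mostly the step image
opaque = blend_pixel(gray, red, 1.0)   # heatmap fully covers the image
blended = overlay([[gray, gray]], [[red, gray]], 0.2)
```

Low alpha keeps the denoising-step image readable; alpha near 1 shows the heatmap almost on its own.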

🎬 GIF Animation

Frame Duration (seconds)

Playback speed for the exported attention GIF.

Range: 0.1 to 2

GIF Status

Download GIF

📦 Export All

Export Status

Download ZIP

🔥 Attention Map

Attention Visualization

Export tip: use Download GIF to export the selected word's attention across denoising steps, or use Export All to download every step × token overlay, the denoising base images, the CSV values, and the per-token GIFs.
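
For readers who want to post-process an export, the sketch below shows one way a per-step, per-token CSV could be packed into a ZIP and read back. The file name attention_values.csv, the column names, and the mean_attention metric are all assumptions made for illustration; the real archive layout is not documented on this page, and images and GIFs are left out.

```python
import csv
import io
import zipfile


def build_export_zip(values):
    """Pack {(step, token): value} into an in-memory ZIP holding one CSV."""
    csv_buf = io.StringIO()
    writer = csv.writer(csv_buf)
    writer.writerow(["step", "token", "mean_attention"])  # hypothetical columns
    for (step, token), value in sorted(values.items()):
        writer.writerow([step, token, f"{value:.6f}"])

    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("attention_values.csv", csv_buf.getvalue())  # hypothetical name
    return zip_buf.getvalue()


# Made-up values for two steps of one token.
archive = build_export_zip({(0, "cat"): 0.12, (1, "cat"): 0.34})

# Read the archive back the way an analysis script might.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()
    rows = zf.read("attention_values.csv").decode().splitlines()
```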

Inspect region-to-region structure

Follow-up workflow. Use this after the main attention view when you want to ask how image regions relate to each other. DAAM-I2I shows self-attention from a selected pixel, box, contour, or diffused starting point.

📝 Input

Prompt

The text prompt used to generate the image whose self-attention will be inspected.

Inference Steps

Number of denoising steps used for the generated image.

Range: 10 to 100

Seed

Controls reproducibility for the DAAM-I2I generated image.

Status

🖼️ Generated Image

Output

Compute heatmap for specific pixels

Question answered: which regions are related to this selected pixel or pixel group? Clicking the generated image is the easiest workflow; you can also type row-major pixel IDs manually. Examples: "100" (a single pixel), "0-1023" (a range), "0,1,2,3" (a list).

Pixel IDs

Row-major latent pixel IDs. A range or comma-separated list averages the selected pixels.
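
The input format can be read as follows: a row-major ID counts pixels left to right, then top to bottom, so an ID maps to a (row, column) pair by integer division. The sketch below parses the three example formats and does that conversion. The grid width of 32 is only inferred from the "0-1023" example (1024 = 32 × 32) and is an assumption, as are the helper names and the choice to accept ranges and lists mixed in one string.

```python
def parse_pixel_ids(spec):
    """Parse input such as "100", "0-1023", or "0,1,2,3" into a list of IDs."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"range end before start: {part!r}")
            ids.extend(range(start, end + 1))  # ranges are inclusive
        else:
            ids.append(int(part))
    return ids


def id_to_row_col(pixel_id, width):
    """Convert a row-major ID to (row, col) for a grid `width` pixels wide."""
    return divmod(pixel_id, width)


GRID_WIDTH = 32  # assumption: 32x32 latent grid, inferred from the 0-1023 example

single = parse_pixel_ids("100")
listed = parse_pixel_ids("0,1,2,3")
full = parse_pixel_ids("0-1023")
position = id_to_row_col(100, GRID_WIDTH)
```

Under the 32-wide assumption, ID 100 sits at row 3, column 4, and "0-1023" selects every pixel in the grid.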

Click Status

Overlay Alpha

Heatmap opacity over the generated image.

Range: 0.1 to 1

Pixel Heatmap

StepX Function Guide

1. DAAM: Text-to-image cross-attention

Use this when you want to ask: Where does a prompt word appear to influence each denoising-stage image?

Enter a prompt and generate an image.
Choose Global to see overall attention, or choose a specific word to inspect token-region grounding.
Move Step to compare attention over the decoded image for that denoising step.
Use Overlay Alpha to make the heatmap more or less visible.
Use GIF Animation to export attention over the denoising image sequence for the selected word.
Use Export All to download all token/step maps, CSV values, and per-token GIFs.

2. Optional Scene Structure: ZoeDepth

Use this when you want a scene-structure companion to the attention map. The depth map is not an attention map; it is a separate estimate of near/far image structure, useful when comparing focus, figure-ground, or spatial hierarchy. It lives in an optional panel so the main workflow stays focused on denoising-step attention.

3. DAAM-I2I: Image self-attention

Use this when you want to ask: Which image regions are internally related to a selected pixel or region?

Pixel Heatmap: click or type a pixel/range to see related regions.
BBox Heatmap: define a rectangle and average attention over that region.
Contour Heatmap: define an irregular polygon region.
Pixel Diffused Heatmap: start from one pixel and iteratively expand related regions.
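
The BBox Heatmap idea of averaging attention over a region can be sketched as: collect the row-major IDs inside the rectangle, then average the self-attention rows of those pixels. The code below is an illustration under stated assumptions, not the DAAM-I2I implementation: the self-attention data is assumed to be a square table where self_attn[p][q] is how strongly source pixel p attends to pixel q, box bounds are treated as inclusive, and both function names are hypothetical.

```python
def bbox_pixel_ids(top, left, bottom, right, width):
    """Row-major IDs of every pixel inside an inclusive rectangle."""
    return [
        r * width + c
        for r in range(top, bottom + 1)
        for c in range(left, right + 1)
    ]


def average_maps(self_attn, pixel_ids):
    """Average the attention rows of the selected source pixels."""
    if not pixel_ids:
        raise ValueError("need at least one pixel")
    n = len(self_attn[pixel_ids[0]])
    return [
        sum(self_attn[p][q] for p in pixel_ids) / len(pixel_ids)
        for q in range(n)
    ]


# Toy 2x2 grid (4 pixels) where each pixel attends only to itself.
self_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
top_row = bbox_pixel_ids(top=0, left=0, bottom=0, right=1, width=2)
heat = average_maps(self_attn, top_row)
```

The averaged list is still flat and row-major; reshaping it to the grid width gives the heatmap to overlay.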

4. Text-Guided Detection

This combines DAAM word heatmaps with DAAM-I2I self-attention. It is useful for exploring how a word-level text signal can guide region discovery. It works best when the DAAM and DAAM-I2I prompts refer to similar image content.

5. TITAN

TITAN automatically extracts objects from a prompt and produces object annotations. This is an object discovery workflow, not a replacement for attention inspection.

Glossary

Cross-attention: a text-to-image signal. It estimates how strongly prompt tokens relate to image regions.
Self-attention: an image-to-image signal. It estimates how image regions relate to other image regions.
Denoising step: one stage in the diffusion generation process.
Global attention: an aggregate attention view across prompt tokens.
Token / focus word: a word or word piece from the prompt selected for inspection.
Overlay alpha: heatmap opacity over the selected denoising-step image.
Export evidence: downloadable maps, GIFs, and CSV files for documentation and analysis.
