StepX — Attention Explorer
Inspectable generation traces for open diffusion models
Who is StepX for?
StepX is for researchers, educators, creative technologists, and designers working with open diffusion models who need to explain what happened inside a generation run. It is not a general-purpose image generator, and it does not claim to explain closed, proprietary systems such as ChatGPT's image generation.
Recommended path: generate an image in Generate & Inspect, choose one prompt word, move the Step slider, then export a GIF or ZIP if you want evidence for a presentation, paper, design review, or committee demo.
The default model is SD 1.5 (Stable Diffusion 1.5); select a different model only if you need one. The selected model loads automatically when you click Generate.
The open diffusion model whose internal attention signals will be inspected.
Generate an image, then inspect prompt-word attention
📝 Input
🖼️ Generated Image
🎛️ Attention Controls
Choose Global for aggregate attention, or choose a prompt word to inspect its spatial grounding.
🎬 GIF Animation
📦 Export All
🔥 Attention Map
Inspect region-to-region structure
📝 Input
🖼️ Generated Image
Compute heatmap for specific pixels
Enter a single pixel index (e.g. 100), a range (e.g. 0-1023), or a comma-separated list (e.g. 0,1,2,3).
Compute heatmap for bounding box
Compute heatmap for polygon contour
Enter polygon vertices as x1,y1 x2,y2 x3,y3 or as x1 y1 x2 y2 x3 y3.
Compute diffused heatmap from a pixel
Text-Guided Object Detection (TITAN Workflow)
Detect and annotate objects
📝 Input
🎨 Visualization
🖼️ Generated Image
📦 Annotations
StepX Function Guide
1. DAAM: Text-to-image cross-attention
Use this when you want to ask: Where in the image does a prompt word appear to have influence at each denoising step?
- Enter a prompt and generate an image.
- Choose Global to see overall attention, or choose a specific word to inspect token-region grounding.
- Move the Step slider to compare attention, overlaid on the decoded image, from one denoising step to the next.
- Use Overlay Alpha to make the heatmap more or less visible.
- Use GIF Animation to export attention over the denoising image sequence for the selected word.
- Use Export All to download all token/step maps, CSV values, and per-token GIFs.
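
To make the Global view and the Overlay Alpha control more concrete, here is a minimal sketch in plain Python. It is not StepX's implementation. The function names (`aggregate_maps`, `normalize`, `overlay`) are hypothetical, the use of a mean for the global view is an assumption (a sum or max would be equally plausible), and the image is a toy grayscale grid with values in [0, 1].

```python
def aggregate_maps(token_maps):
    """Average per-token heatmaps into one 'global' map (assumed aggregation)."""
    n = len(token_maps)
    rows, cols = len(token_maps[0]), len(token_maps[0][0])
    return [[sum(m[r][c] for m in token_maps) / n for c in range(cols)]
            for r in range(rows)]


def normalize(heatmap):
    """Min-max scale a heatmap to [0, 1]; a flat map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


def overlay(gray_image, heatmap, alpha):
    """Blend a [0, 1] heatmap over a [0, 1] grayscale image with opacity alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [[(1 - alpha) * p + alpha * h for p, h in zip(img_row, heat_row)]
            for img_row, heat_row in zip(gray_image, heatmap)]


# Toy 2x2 example: two token maps and a uniform mid-gray image.
token_maps = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
global_map = normalize(aggregate_maps(token_maps))
image = [[0.5, 0.5], [0.5, 0.5]]
blended = overlay(image, global_map, alpha=0.5)
```

With `alpha=0` the result is the untouched image, and with `alpha=1` it is the heatmap alone, which matches the idea of alpha as heatmap opacity.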
2. Optional Scene Structure: ZoeDepth
Use this when you want a scene-structure companion to the attention map. The depth map is not an attention map; it is a separate estimate of near/far image structure, useful when comparing focus, figure-ground, or spatial hierarchy. It lives in an optional panel so the main workflow stays focused on denoising-step attention.
3. DAAM-I2I: Image self-attention
Use this when you want to ask: Which image regions are internally related to a selected pixel or region?
- Pixel Heatmap: click or type a pixel/range to see related regions.
- BBox Heatmap: define a rectangle and average attention over that region.
- Contour Heatmap: define an irregular polygon region.
- Pixel Diffused Heatmap: start from one pixel and iteratively expand related regions.
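
The pixel and region modes above can be illustrated with a short sketch: parse a pixel selection written in the formats the input hint shows (`100`, `0-1023`, `0,1,2,3`), then average the matching rows of a self-attention matrix. This is an illustration under stated assumptions, not the tool's code. `parse_pixel_spec` and `region_heatmap` are hypothetical names, ranges are assumed to be inclusive, and the matrix is a toy example in which row i holds how strongly pixel i attends to every pixel.

```python
def parse_pixel_spec(spec, num_pixels):
    """Parse '100', '0-1023', or '0,1,2,3' into a sorted list of pixel indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Assumption: a range such as '0-1023' includes both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            indices.update(range(start, end + 1))
        else:
            indices.add(int(part))
    if not indices:
        raise ValueError("no pixel indices given")
    bad = [i for i in indices if not 0 <= i < num_pixels]
    if bad:
        raise ValueError(f"indices out of range: {sorted(bad)}")
    return sorted(indices)


def region_heatmap(self_attention, pixels):
    """Average the self-attention rows of the selected pixels."""
    n = len(pixels)
    width = len(self_attention[0])
    return [sum(self_attention[p][j] for p in pixels) / n for j in range(width)]


# Toy 4-pixel attention matrix; row i = how pixel i attends to all pixels.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
selected = parse_pixel_spec("0-1, 3", num_pixels=4)
heat = region_heatmap(attn, selected)
```

A bounding box or polygon can be handled the same way once it has been converted to a list of flat pixel indices; only the selection step differs.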
4. Text-Guided Detection
This combines DAAM word heatmaps with DAAM-I2I self-attention. It is useful for exploring how a word-level text signal can guide region discovery. It works best when the DAAM and DAAM-I2I prompts refer to similar image content.
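
One plausible way to combine the two signals is sketched below. It is a guess at the general idea, not the method StepX uses: pixels where the word heatmap is strong act as seeds, and their self-attention rows are averaged with the word heatmap values as weights. The names `seed_pixels` and `guided_region`, the threshold of 0.5, and the weighting scheme are all assumptions made for illustration.

```python
def seed_pixels(word_heatmap, threshold):
    """Return flat pixel indices whose word-heatmap value is at least threshold."""
    return [i for i, v in enumerate(word_heatmap) if v >= threshold]


def guided_region(word_heatmap, self_attention, threshold=0.5):
    """Weighted average of the seeds' self-attention rows (weights = word heatmap)."""
    seeds = seed_pixels(word_heatmap, threshold)
    if not seeds:
        # No pixel is strongly tied to the word, so nothing is highlighted.
        return [0.0] * len(word_heatmap)
    total = sum(word_heatmap[i] for i in seeds)
    width = len(self_attention[0])
    return [sum(word_heatmap[i] * self_attention[i][j] for i in seeds) / total
            for j in range(width)]


# Toy example with 4 pixels stored as flat lists.
word_heatmap = [0.9, 0.2, 0.1, 0.6]
self_attention = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
seeds = seed_pixels(word_heatmap, 0.5)
region = guided_region(word_heatmap, self_attention)
```

In this toy case the word is tied most strongly to pixel 0, so the region it belongs to receives more weight than the region around pixel 3. If the two maps come from prompts that describe different content, the seeds land on unrelated regions, which is one way to read the advice about keeping the prompts similar.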
5. TITAN
TITAN automatically extracts objects from a prompt and produces object annotations. This is an object discovery workflow, not a replacement for attention inspection.
Glossary
- Cross-attention: a text-to-image signal. It estimates how strongly prompt tokens relate to image regions.
- Self-attention: an image-to-image signal. It estimates how image regions relate to other image regions.
- Denoising step: one stage in the diffusion generation process.
- Global attention: an aggregate attention view across prompt tokens.
- Token / focus word: a word or word piece from the prompt selected for inspection.
- Overlay alpha: heatmap opacity over the selected denoising-step image.
- Export evidence: downloadable maps, GIFs, and CSV files for documentation and analysis.