Benchmark for SageAttention kernels using real attention shapes logged from ComfyUI models (image / video / audio)

r/StableDiffusion
AI Research

What this is - and what it is not

This is not a benchmark of how fast a model generates an image or video. There are no model weights and no inference pipeline. The benchmark runs on randomly generated tensors that reproduce the exact attention shapes - (batch, heads, seq_len, head_dim, dtype) - that real models use during sampling inside ComfyUI. More precisely: it measures only the attention operation itself, one step inside the denoising loop. Everything else - VAE, CLIP, scheduler, ComfyUI overhead - is entirely out of scope.
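
To make the scope concrete, here is a minimal sketch of what "benchmarking only the attention op on random tensors" can look like. It is not the actual benchmark code: the names `AttentionShape`, `random_tensor`, `naive_attention` and `benchmark` are hypothetical, the kernel is a slow pure-Python reference standing in for a real attention kernel, and the dtype is carried only as a label. The shape in the usage line is a tiny made-up one, not a shape logged from a real model.

```python
import math
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """One attention call: (batch, heads, seq_len, head_dim, dtype)."""
    batch: int
    heads: int
    seq_len: int
    head_dim: int
    dtype: str  # label only; this pure-Python stand-in always uses floats


def random_tensor(shape, rng):
    """Random [batch][heads][seq_len][head_dim] nested list of floats."""
    return [[[[rng.gauss(0.0, 1.0) for _ in range(shape.head_dim)]
              for _ in range(shape.seq_len)]
             for _ in range(shape.heads)]
            for _ in range(shape.batch)]


def naive_attention(q, k, v):
    """Reference softmax(Q K^T / sqrt(d)) V; a stand-in for a real kernel."""
    out = []
    for qb, kb, vb in zip(q, k, v):
        out_b = []
        for qh, kh, vh in zip(qb, kb, vb):
            scale = 1.0 / math.sqrt(len(qh[0]))
            out_h = []
            for q_row in qh:
                scores = [scale * sum(a * b for a, b in zip(q_row, k_row))
                          for k_row in kh]
                m = max(scores)  # subtract the max for a stable softmax
                weights = [math.exp(s - m) for s in scores]
                total = sum(weights)
                out_h.append([
                    sum(w * v_row[d] for w, v_row in zip(weights, vh)) / total
                    for d in range(len(vh[0]))
                ])
            out_b.append(out_h)
        out.append(out_b)
    return out


def benchmark(kernel, shape, warmup=1, iters=3, seed=0):
    """Time kernel(q, k, v) on random inputs; return the median in seconds."""
    rng = random.Random(seed)
    q, k, v = (random_tensor(shape, rng) for _ in range(3))
    for _ in range(warmup):  # warm-up runs are not timed
        kernel(q, k, v)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        kernel(q, k, v)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]  # upper middle for even counts


if __name__ == "__main__":
    # Tiny illustrative shape, not one logged from a real model.
    shape = AttentionShape(batch=1, heads=2, seq_len=16, head_dim=8,
                           dtype="float16")
    print(f"{shape}: {benchmark(naive_attention, shape) * 1e3:.3f} ms")
```

Nothing here loads weights or runs a sampler: the only timed call is `kernel(q, k, v)`. On a GPU the same harness would also need to synchronize the device before reading the clock, since kernel launches are asynchronous.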