[Feature] Support DiffSpot (fine-grained visual change detection on web UIs) by banyinjushi · Pull Request #1568 · open-compass/VLMEvalKit

banyinjushi · 2026-06-04T08:16:35Z

What does this PR do?

Adds DiffSpot, a benchmark for fine-grained visual change detection on real-world web interfaces. Each example is a pair of near-identical webpage screenshots differing by a single programmatic CSS-level mutation; the model must describe what changed. Ground truth comes directly from the mutation, and an operator-aware LLM-as-judge scores open-ended predictions against the structured mutation log.

📄 Paper: https://arxiv.org/abs/2605.29615
🤗 Dataset: https://huggingface.co/datasets/tencent/DiffSpot — 4,400 pairs (3,900 has-diff over 13 CSS operators × 3 difficulty tiers + 500 no-diff controls)
🐙 Code: https://github.com/Tencent/DiffSpot

Changes

- New vlmeval/dataset/diffspot.py, defining DiffSpot(ImageBaseDataset):
  - load_data streams the public HF dataset tencent/DiffSpot directly (no extra hosting) and adapts each pair to a two-image (before→after) sample.
  - build_prompt builds the canonical spot-the-difference prompt with the two screenshots in before→after order.
  - evaluate runs the operator-aware LLM judge (default gpt-oss-120b, reasoning_effort=high) and reports per-tier recall (easy/medium/hard), no-diff specificity, and overall accuracy (TP+TN)/4400, plus per-operator recall.
- Registers DiffSpot in vlmeval/dataset/__init__.py.
- Judge model and endpoint come from the standard --judge / OPENAI_API_BASE / OPENAI_API_KEY mechanism, with no hardcoded credentials.
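
For reviewers who want to sanity-check the reported numbers, the metric definitions above can be sketched roughly as follows. This is an illustration only, not the code in diffspot.py: the function name `summarize` and the record keys (`has_diff`, `tier`, `operator`, `correct`) are assumptions made for this example, with `correct` standing in for the judge's per-sample verdict.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate judge verdicts into DiffSpot-style metrics.

    Each record is a dict with hypothetical keys:
      has_diff (bool), tier (str), operator (str), correct (bool).
    """
    tier_hits, tier_total = defaultdict(int), defaultdict(int)
    op_hits, op_total = defaultdict(int), defaultdict(int)
    tp = tn = n_nodiff = n = 0

    for r in records:
        n += 1
        if r["has_diff"]:
            # Has-diff pair: a correct verdict counts as a true positive.
            tier_total[r["tier"]] += 1
            op_total[r["operator"]] += 1
            if r["correct"]:
                tp += 1
                tier_hits[r["tier"]] += 1
                op_hits[r["operator"]] += 1
        else:
            # No-diff control: a correct verdict counts as a true negative.
            n_nodiff += 1
            if r["correct"]:
                tn += 1

    return {
        "tier_recall": {t: tier_hits[t] / tier_total[t] for t in tier_total},
        "operator_recall": {o: op_hits[o] / op_total[o] for o in op_total},
        "specificity": tn / n_nodiff if n_nodiff else None,
        "accuracy": (tp + tn) / n if n else None,
    }
```

On the full set, `n` is 4,400, so the accuracy term reduces to the (TP+TN)/4400 figure described above.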

How to run

# Evaluate any VLM configured in VLMEvalKit; judge via an OpenAI-compatible
# endpoint serving the judge model (set OPENAI_API_KEY / OPENAI_API_BASE):
python run.py --data DiffSpot --model <your_model> --judge gpt-oss-120b

# Fast end-to-end smoke test on a small subset:
DIFFSPOT_LIMIT=8 python run.py --data DiffSpot --model <your_model> --judge gpt-oss-120b
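
For context, an environment-variable limit like this is typically applied when the dataset rows are loaded. The sketch below shows one plausible way to do it; the helper name `apply_limit` and the "first N rows" behaviour are assumptions for illustration, and the PR description does not say how diffspot.py actually selects the subset.

```python
import os


def apply_limit(rows, env=os.environ, var="DIFFSPOT_LIMIT"):
    """Return all rows, or only the first N if the variable is set.

    Hypothetical helper: taking the first N rows is an assumption,
    not a description of the PR's implementation.
    """
    raw = env.get(var, "").strip()
    if not raw:
        return list(rows)  # variable unset or empty: keep everything
    limit = int(raw)
    if limit <= 0:
        raise ValueError(f"{var} must be a positive integer, got {raw!r}")
    return list(rows)[:limit]
```

A first-N cut is cheap but may not cover every operator or tier, so a smoke run of this kind checks the pipeline rather than the metrics.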

Validation

The dataset's prompt construction, judging, and metric computation were validated end-to-end on the full 4,400-pair set with Qwen3.5-VL-397B (judge gpt-oss-120b, reasoning_effort=high). Results reproduce the paper: the difficulty ordering (easy > medium > hard) holds, and no-diff specificity and overall accuracy match within LLM-judge absolute-score variance (the paper reports cross-judge ranking stability, Kendall τ = 1.00).
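
As a reminder of what the ranking-stability figure means: Kendall τ compares how two judges order the same set of models, and τ = 1.00 means the orderings agree on every pair even if the absolute scores differ. A minimal sketch of the tau-a variant follows; which variant the paper uses is not stated here, and the scores in the usage note are made-up placeholders, not results from the paper or this PR.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both judges order this pair the same way
        elif sign < 0:
            discordant += 1  # the judges disagree on this pair
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs
```

For example, with placeholder accuracies for three models under two judges, `kendall_tau([0.62, 0.48, 0.35], [0.70, 0.51, 0.40])` gives 1.0: the second judge scores every model higher, but the ranking is unchanged.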

Checklist

Follows repo style; new dataset registered, supported_datasets() returns DiffSpot
No hardcoded credentials / endpoints (all via env / --judge)
Smoke-tested path via DIFFSPOT_LIMIT

…eb UIs) DiffSpot probes whether VLMs can spot a single fine-grained CSS-level change between two near-identical webpage screenshots (before/after). 4,400 pairs (3,900 has-diff over 13 operators x 3 tiers + 500 no-diff controls), loaded from the public HF dataset tencent/DiffSpot. Scored by an operator-aware LLM-as-judge against the structured mutation log; metrics: per-tier recall, no-diff specificity, and overall accuracy. Paper: arxiv.org/abs/2605.29615

banyinjushi · 2026-06-05T02:40:35Z

Hi @kennymckormick @mzr1996, friendly ping — would you mind triggering CI / taking a look when you have a moment? Happy to address any feedback. Thanks!

banyinjushi wants to merge 1 commit into open-compass:main from banyinjushi:diffspot-pr
