
[Feature] Support DiffSpot (fine-grained visual change detection on web UIs)#1568

Open
banyinjushi wants to merge 1 commit into
open-compass:mainfrom
banyinjushi:diffspot-pr

@banyinjushi

What does this PR do?

Adds DiffSpot, a benchmark for fine-grained visual change detection on real-world web interfaces. Each example is a pair of near-identical webpage screenshots differing by a single programmatic CSS-level mutation; the model must describe what changed. Ground truth comes directly from the mutation, and an operator-aware LLM-as-judge scores open-ended predictions against the structured mutation log.

Changes

  • New vlmeval/dataset/diffspot.py defining DiffSpot(ImageBaseDataset):
    • load_data streams the public HF dataset tencent/DiffSpot directly (no extra hosting) and adapts each pair to a two-image (before→after) sample.
    • build_prompt builds the canonical spot-the-difference prompt with the two screenshots in before→after order.
    • evaluate runs the operator-aware LLM judge (default gpt-oss-120b, reasoning_effort=high) and reports per-tier recall (easy/medium/hard), no-diff specificity, and overall accuracy (TP+TN)/4400, plus per-operator recall.
  • Registers DiffSpot in vlmeval/dataset/__init__.py.
  • Judge model and endpoint come from the standard --judge / OPENAI_API_BASE / OPENAI_API_KEY mechanism — no hardcoded credentials.
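As a rough illustration of the metrics listed under evaluate, the sketch below aggregates per-pair judge verdicts into per-tier recall, no-diff specificity, overall accuracy, and per-operator recall. It is not the code in diffspot.py: the record layout (the keys tier, operator, correct), the tier label "no_diff", and the operator names in the usage example are all assumptions made for this example.

```python
from collections import Counter


def summarize(records):
    """Aggregate judge verdicts into DiffSpot-style metrics.

    Each record is a dict with hypothetical keys (not taken from the PR):
      tier     -- "easy", "medium", "hard", or "no_diff"
      operator -- mutation operator name (ignored for no-diff pairs)
      correct  -- True if the judge accepted the prediction
                  (a TP on a has-diff pair, a TN on a no-diff pair)
    """
    hits, totals = Counter(), Counter()
    op_hits, op_totals = Counter(), Counter()
    for r in records:
        totals[r["tier"]] += 1
        hits[r["tier"]] += int(r["correct"])
        if r["tier"] != "no_diff":
            op_totals[r["operator"]] += 1
            op_hits[r["operator"]] += int(r["correct"])

    report = {}
    # Recall per difficulty tier: accepted has-diff pairs / has-diff pairs.
    for tier in ("easy", "medium", "hard"):
        if totals[tier]:
            report[f"recall_{tier}"] = hits[tier] / totals[tier]
    # Specificity: no-diff pairs correctly reported as unchanged.
    if totals["no_diff"]:
        report["specificity"] = hits["no_diff"] / totals["no_diff"]
    # Overall accuracy: (TP + TN) / N, where N is 4400 on the full set.
    n = sum(totals.values())
    report["overall_acc"] = sum(hits.values()) / n if n else 0.0
    report["recall_by_operator"] = {
        op: op_hits[op] / op_totals[op] for op in op_totals
    }
    return report


if __name__ == "__main__":
    demo = [
        {"tier": "easy", "operator": "color", "correct": True},
        {"tier": "easy", "operator": "font_size", "correct": False},
        {"tier": "hard", "operator": "color", "correct": True},
        {"tier": "no_diff", "operator": None, "correct": True},
        {"tier": "no_diff", "operator": None, "correct": True},
    ]
    print(summarize(demo))
```

Dividing by the number of records actually scored, rather than a fixed 4400, keeps the same function usable on a DIFFSPOT_LIMIT subset.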

How to run

# Evaluate any VLM configured in VLMEvalKit; judge via an OpenAI-compatible
# endpoint serving the judge model (set OPENAI_API_KEY / OPENAI_API_BASE):
python run.py --data DiffSpot --model <your_model> --judge gpt-oss-120b

# Fast end-to-end smoke test on a small subset:
DIFFSPOT_LIMIT=8 python run.py --data DiffSpot --model <your_model> --judge gpt-oss-120b
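The PR text does not show how DIFFSPOT_LIMIT is consumed. One plausible way to honor such a variable is sketched below, purely as an assumption: the helper name apply_limit and its behavior (keep the first N rows, ignore unset or non-positive values) are hypothetical and may differ from the actual implementation.

```python
import os


def apply_limit(rows, env_var="DIFFSPOT_LIMIT"):
    """Return the first N rows when env_var holds a positive integer.

    Hypothetical helper: an unset, empty, or non-positive value leaves
    the rows untouched so the full dataset is evaluated by default.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return rows
    limit = int(raw)  # raises ValueError on a malformed value
    if limit <= 0:
        return rows
    return rows[:limit]
```

Truncating before prompt construction would keep a smoke run cheap for both the evaluated model and the judge.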

Validation

The dataset's prompt construction, judging, and metric computation were validated end-to-end on the full 4,400-pair set with Qwen3.5-VL-397B (judge gpt-oss-120b, reasoning_effort=high). Results reproduce the paper: the difficulty ordering (easy > medium > hard) holds, and no-diff specificity and overall accuracy match within LLM-judge absolute-score variance (the paper reports cross-judge ranking stability, Kendall τ = 1.00).

Checklist

  • Follows repo style; new dataset registered, supported_datasets() returns DiffSpot
  • No hardcoded credentials / endpoints (all via env / --judge)
  • Smoke-tested path via DIFFSPOT_LIMIT

…eb UIs)

DiffSpot probes whether VLMs can spot a single fine-grained CSS-level change
between two near-identical webpage screenshots (before/after). 4,400 pairs
(3,900 has-diff over 13 operators x 3 tiers + 500 no-diff controls), loaded
from the public HF dataset tencent/DiffSpot. Scored by an operator-aware
LLM-as-judge against the structured mutation log; metrics: per-tier recall,
no-diff specificity, and overall accuracy. Paper: arxiv.org/abs/2605.29615
@banyinjushi
Author

Hi @kennymckormick @mzr1996, friendly ping — would you mind triggering CI / taking a look when you have a moment? Happy to address any feedback. Thanks!
