[Feature] Support DiffSpot (fine-grained visual change detection on web UIs) #1568
Open
banyinjushi wants to merge 1 commit into
…eb UIs) DiffSpot probes whether VLMs can spot a single fine-grained CSS-level change between two near-identical webpage screenshots (before/after). 4,400 pairs (3,900 has-diff over 13 operators x 3 tiers + 500 no-diff controls), loaded from the public HF dataset tencent/DiffSpot. Scored by an operator-aware LLM-as-judge against the structured mutation log; metrics: per-tier recall, no-diff specificity, and overall accuracy. Paper: arxiv.org/abs/2605.29615
Hi @kennymckormick @mzr1996, friendly ping: would you mind triggering CI and taking a look when you have a moment? Happy to address any feedback. Thanks!
What does this PR do?
Adds DiffSpot, a benchmark for fine-grained visual change detection on real-world web interfaces. Each example is a pair of near-identical webpage screenshots differing by a single programmatic CSS-level mutation; the model must describe what changed. Ground truth comes directly from the mutation, and an operator-aware LLM-as-judge scores open-ended predictions against the structured mutation log.
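As a rough illustration of the two-image input described above, the sketch below assembles one before/after sample into a message. It assumes a list-of-dicts message layout with `type` and `value` keys; the function name `build_pair_message` and the prompt wording are hypothetical and are not taken from this PR.

```python
from typing import Dict, List

# Hypothetical prompt wording; the PR's canonical prompt is not reproduced here.
PROMPT = (
    "You are given two screenshots of the same webpage, taken before and after "
    "a possible change. Describe what changed, or say that nothing changed."
)


def build_pair_message(before_path: str, after_path: str) -> List[Dict[str, str]]:
    """Return a message with the before image, the after image, then the text prompt."""
    return [
        {"type": "image", "value": before_path},  # screenshot before the mutation
        {"type": "image", "value": after_path},   # screenshot after the mutation
        {"type": "text", "value": PROMPT},
    ]
```

Keeping the images in a fixed before-then-after order matters because the judge compares the model's description against a directional mutation log.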
Changes
- `vlmeval/dataset/diffspot.py`: adds `DiffSpot(ImageBaseDataset)`.
  - `load_data` streams the public HF dataset `tencent/DiffSpot` directly (no extra hosting) and adapts each pair to a two-image (before→after) sample.
  - `build_prompt` builds the canonical spot-the-difference prompt with the two screenshots in before→after order.
  - `evaluate` runs the operator-aware LLM judge (default `gpt-oss-120b`, `reasoning_effort=high`) and reports per-tier recall (easy/medium/hard), no-diff specificity, and overall accuracy `(TP+TN)/4400`, plus per-operator recall.
- Registers `DiffSpot` in `vlmeval/dataset/__init__.py`.
- The judge is configured through the `--judge` / `OPENAI_API_BASE` / `OPENAI_API_KEY` mechanism; no hardcoded credentials.

How to run
Validation
The dataset's prompt construction, judging, and metric computation were validated end-to-end on the full 4,400-pair set with Qwen3.5-VL-397B (judge `gpt-oss-120b`, `reasoning_effort=high`). Results reproduce the paper: the difficulty ordering (easy > medium > hard) holds, and no-diff specificity and overall accuracy match within LLM-judge absolute-score variance (the paper reports cross-judge ranking stability, Kendall τ = 1.00).

Checklist
- `supported_datasets()` returns `DiffSpot`
- … `--judge`)
- … `DIFFSPOT_LIMIT`
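For reviewers, the metrics this PR reports reduce to simple counting once the judge has marked each prediction correct or incorrect. The following is a minimal sketch under that assumption; the record layout (`tier`, `correct`) and the function name `summarize` are hypothetical and do not mirror the code in `diffspot.py`.

```python
from collections import defaultdict
from typing import Dict, Iterable


def summarize(records: Iterable[dict]) -> Dict[str, float]:
    """Aggregate judge verdicts into recall, specificity, and accuracy.

    Each record is assumed to have:
      'tier':    'easy' | 'medium' | 'hard', or None for a no-diff control
      'correct': bool verdict from the judge
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = r["tier"] if r["tier"] is not None else "no_diff"
        totals[key] += 1
        hits[key] += int(r["correct"])

    out: Dict[str, float] = {}
    # Recall per difficulty tier: correct has-diff answers / has-diff pairs in that tier.
    for tier in ("easy", "medium", "hard"):
        if totals[tier]:
            out[f"recall_{tier}"] = hits[tier] / totals[tier]
    # Specificity: no-diff controls correctly reported as unchanged.
    if totals["no_diff"]:
        out["specificity"] = hits["no_diff"] / totals["no_diff"]
    # Overall accuracy: (TP + TN) / N over all pairs.
    n = sum(totals.values())
    out["accuracy"] = sum(hits.values()) / n if n else 0.0
    return out
```

On the full set, N is 4,400, which gives the `(TP+TN)/4400` form quoted in the changes list.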