Vision-OPD

Vision-OPD: Learning to See Fine-Grained Details for Multimodal LLMs via On-Policy Self-Distillation

News

[2026.5.18] The paper is released on arXiv.
[2026.5.19] The training code and data is released.
Model release is under company review. Coming soon.

Overview

Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged regional perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass — without external teachers, ground-truth labels, reward verifiers, or inference-time tool use.

Average scores across fine-grained visual understanding benchmarks. Vision-OPD-4B/9B demonstrate superior performance compared with much larger open-source models (e.g., Qwen3.5-397B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-Pro).

Quick Start

1. Environment Setup

conda create -n vision-opd python=3.11
conda activate vision-opd
pip install -e .
pip install flash-attn --no-build-isolation
pip install flash-linear-attention==0.4.2
pip install flashinfer-python==0.6.6
pip install causal-conv1d==1.6.1
pip install xformers==0.0.32.post1 --no-deps

2. Prepare Training Data

Download and preprocess the Vision-OPD-6K dataset:

python scripts/prepare_data.py --data-dir ./data

This downloads images and metadata from HuggingFace, extracts archives, and converts train.jsonl to the parquet format expected by the training pipeline.

3. Training

Launch Vision-OPD training:

bash scripts/run_vision_opd.sh

Key hyperparameters can be edited at the top of the script. See the script for the full configuration.

4. Merge Checkpoints

After training, merge the FSDP-sharded checkpoint into a standard HuggingFace model:

bash scripts/merge_checkpoint.sh <path_to_checkpoint>

For example:

bash scripts/merge_checkpoint.sh ./checkpoints/Vision-OPD-Qwen3.5-4B/global_step_65/

This merges the FSDP actor shards, saves the model weights, config, tokenizer, and processor into the specified directory. The merged checkpoint can then be loaded directly with transformers or served with vLLM.

5. Deployment

Serve the merged checkpoint with vLLM:

vllm serve <path_to_merged_checkpoint> \
    --gpu-memory-utilization 0.85 \
    --tensor-parallel-size 8 \
    --served-model-name Vision-OPD-4B \
    --trust-remote-code

The server listens on port 8000 by default. You can then query the model via the OpenAI-compatible API at http://localhost:8000/v1/chat/completions.

Project Structure

Vision-OPD/
├── verl/                    # Modified verl framework with self-distillation support
├── scripts/
│   ├── run_vision_opd.sh    # Training launch script
│   ├── merge_checkpoint.sh  # FSDP checkpoint merger
│   └── prepare_data.py      # Data download & preprocessing
├── chat_templates/
│   └── perception_chat_template_qwen35.jinja
├── figures/                 # Paper figures
├── pyproject.toml
└── LICENSE

Citation

If you find Vision-OPD useful for your research, please consider citing:

@misc{yuan2026visionopdlearningfinedetails,
      title={Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation}, 
      author={Qianhao Yuan and Jie Lou and Xing Yu and Hongyu Lin and Le Sun and Xianpei Han and Yaojie Lu},
      year={2026},
      eprint={2605.18740},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18740}, 
}

License

Apache-2.0 License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vision-OPD

News

Overview

Quick Start

1. Environment Setup

2. Prepare Training Data

3. Training

4. Merge Checkpoints

5. Deployment

Project Structure

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
chat_templates		chat_templates
figures		figures
scripts		scripts
verl		verl
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Vision-OPD

News

Overview

Quick Start

1. Environment Setup

2. Prepare Training Data

3. Training

4. Merge Checkpoints

5. Deployment

Project Structure

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages