Vision-OPD: Learning to See Fine-Grained Details for Multimodal LLMs via On-Policy Self-Distillation
📃 Paper | 🤗 Training Dataset
- [2026.5.18] The paper is released on arXiv.
- [2026.5.19] The training code and data is released.
- Model release is under company review. Coming soon.
Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged regional perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass — without external teachers, ground-truth labels, reward verifiers, or inference-time tool use.
Average scores across fine-grained visual understanding benchmarks. Vision-OPD-4B/9B demonstrate superior performance compared with much larger open-source models (e.g., Qwen3.5-397B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-Pro).
conda create -n vision-opd python=3.11
conda activate vision-opd
pip install -e .
pip install flash-attn --no-build-isolation
pip install flash-linear-attention==0.4.2
pip install flashinfer-python==0.6.6
pip install causal-conv1d==1.6.1
pip install xformers==0.0.32.post1 --no-depsDownload and preprocess the Vision-OPD-6K dataset:
python scripts/prepare_data.py --data-dir ./dataThis downloads images and metadata from HuggingFace, extracts archives, and converts train.jsonl to the parquet format expected by the training pipeline.
Launch Vision-OPD training:
bash scripts/run_vision_opd.shKey hyperparameters can be edited at the top of the script. See the script for the full configuration.
After training, merge the FSDP-sharded checkpoint into a standard HuggingFace model:
bash scripts/merge_checkpoint.sh <path_to_checkpoint>For example:
bash scripts/merge_checkpoint.sh ./checkpoints/Vision-OPD-Qwen3.5-4B/global_step_65/This merges the FSDP actor shards, saves the model weights, config, tokenizer, and processor into the specified directory. The merged checkpoint can then be loaded directly with transformers or served with vLLM.
Serve the merged checkpoint with vLLM:
vllm serve <path_to_merged_checkpoint> \
--gpu-memory-utilization 0.85 \
--tensor-parallel-size 8 \
--served-model-name Vision-OPD-4B \
--trust-remote-codeThe server listens on port 8000 by default. You can then query the model via the OpenAI-compatible API at http://localhost:8000/v1/chat/completions.
Vision-OPD/
├── verl/ # Modified verl framework with self-distillation support
├── scripts/
│ ├── run_vision_opd.sh # Training launch script
│ ├── merge_checkpoint.sh # FSDP checkpoint merger
│ └── prepare_data.py # Data download & preprocessing
├── chat_templates/
│ └── perception_chat_template_qwen35.jinja
├── figures/ # Paper figures
├── pyproject.toml
└── LICENSE
If you find Vision-OPD useful for your research, please consider citing:
@misc{yuan2026visionopdlearningfinedetails,
title={Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation},
author={Qianhao Yuan and Jie Lou and Xing Yu and Hongyu Lin and Le Sun and Xianpei Han and Yaojie Lu},
year={2026},
eprint={2605.18740},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.18740},
}Apache-2.0 License
