close
Skip to content

VisionOPD/Vision-OPD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Vision-OPD

Vision-OPD: Learning to See Fine-Grained Details for Multimodal LLMs via On-Policy Self-Distillation

📃 Paper | 🤗 Training Dataset

News

  • [2026.5.18] The paper is released on arXiv.
  • [2026.5.19] The training code and data is released.
  • Model release is under company review. Coming soon.

Overview

Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged regional perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass — without external teachers, ground-truth labels, reward verifiers, or inference-time tool use.

Vision-OPD Average Scores

Average scores across fine-grained visual understanding benchmarks. Vision-OPD-4B/9B demonstrate superior performance compared with much larger open-source models (e.g., Qwen3.5-397B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-Pro).

Quick Start

1. Environment Setup

conda create -n vision-opd python=3.11
conda activate vision-opd
pip install -e .
pip install flash-attn --no-build-isolation
pip install flash-linear-attention==0.4.2
pip install flashinfer-python==0.6.6
pip install causal-conv1d==1.6.1
pip install xformers==0.0.32.post1 --no-deps

2. Prepare Training Data

Download and preprocess the Vision-OPD-6K dataset:

python scripts/prepare_data.py --data-dir ./data

This downloads images and metadata from HuggingFace, extracts archives, and converts train.jsonl to the parquet format expected by the training pipeline.

3. Training

Launch Vision-OPD training:

bash scripts/run_vision_opd.sh

Key hyperparameters can be edited at the top of the script. See the script for the full configuration.

4. Merge Checkpoints

After training, merge the FSDP-sharded checkpoint into a standard HuggingFace model:

bash scripts/merge_checkpoint.sh <path_to_checkpoint>

For example:

bash scripts/merge_checkpoint.sh ./checkpoints/Vision-OPD-Qwen3.5-4B/global_step_65/

This merges the FSDP actor shards, saves the model weights, config, tokenizer, and processor into the specified directory. The merged checkpoint can then be loaded directly with transformers or served with vLLM.

5. Deployment

Serve the merged checkpoint with vLLM:

vllm serve <path_to_merged_checkpoint> \
    --gpu-memory-utilization 0.85 \
    --tensor-parallel-size 8 \
    --served-model-name Vision-OPD-4B \
    --trust-remote-code

The server listens on port 8000 by default. You can then query the model via the OpenAI-compatible API at http://localhost:8000/v1/chat/completions.

Project Structure

Vision-OPD/
├── verl/                    # Modified verl framework with self-distillation support
├── scripts/
│   ├── run_vision_opd.sh    # Training launch script
│   ├── merge_checkpoint.sh  # FSDP checkpoint merger
│   └── prepare_data.py      # Data download & preprocessing
├── chat_templates/
│   └── perception_chat_template_qwen35.jinja
├── figures/                 # Paper figures
├── pyproject.toml
└── LICENSE

Citation

If you find Vision-OPD useful for your research, please consider citing:

@misc{yuan2026visionopdlearningfinedetails,
      title={Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation}, 
      author={Qianhao Yuan and Jie Lou and Xing Yu and Hongyu Lin and Le Sun and Xianpei Han and Yaojie Lu},
      year={2026},
      eprint={2605.18740},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18740}, 
}

License

Apache-2.0 License

About

Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged crop-conditioned perception to its full-image policy, enabling fine-grained visual understanding in a single forward pass without external teachers, labels, or verifiers.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors