Vista: Verifier-in-the-Loop Agentic RL for Quantum Program Synthesis

This repository contains the code and artifact materials for:

Vista: Verifier-in-the-Loop Agentic RL for Semantic Program Synthesis in Quantum Computing
ACM CAIS 2026 submission #217

Vista trains a language-model policy to generate OpenQASM 3.0 quantum circuits using staged verifier feedback. The verifier checks generated programs through progressively richer semantic stages: syntax/feasibility, behavior alignment, objective-value verification, and utility after downstream optimization. The repository includes the quantum training scripts, verifier/reward code, generation and evaluation scripts, table data, and figure-generation workflow.
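
The staged flow described above can be pictured as a short-circuiting pipeline: each stage runs only if the previous one passed, and the reward grows with the deepest stage reached. The following is a minimal sketch under assumptions. The stage names mirror the description above, but the `StageResult` type, the toy pass/fail checks, and the equal per-stage reward shares are hypothetical and are not taken from `quantum_reward_cal.py`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Stage names follow the README description; everything else is illustrative.
STAGES = (
    "syntax_feasibility",
    "behavior_alignment",
    "objective_value",
    "post_optimization_utility",
)


@dataclass
class StageResult:
    stage: str
    passed: bool


def staged_reward(
    program: str,
    checks: List[Tuple[str, Callable[[str], bool]]],
) -> Tuple[float, List[StageResult]]:
    """Run checks in order, stop at the first failure, and return
    (reward, per-stage results). Each passed stage adds an equal share
    of the reward (a hypothetical weighting)."""
    results: List[StageResult] = []
    for name, check in checks:
        passed = bool(check(program))
        results.append(StageResult(name, passed))
        if not passed:
            break
    passed_count = sum(r.passed for r in results)
    return passed_count / len(checks), results


if __name__ == "__main__":
    # Toy checks standing in for the real verifier stages.
    toy_checks = [
        (STAGES[0], lambda p: p.lstrip().startswith("OPENQASM 3")),
        (STAGES[1], lambda p: "measure" in p),
        (STAGES[2], lambda p: False),  # pretend the objective check fails
        (STAGES[3], lambda p: True),   # never reached in this example
    ]
    qasm = "OPENQASM 3.0;\nqubit[2] q;\nbit[2] c;\nc = measure q;\n"
    reward, trace = staged_reward(qasm, toy_checks)
    print(reward, [(r.stage, r.passed) for r in trace])
```

Stopping at the first failed stage keeps the expensive later checks off programs that cannot pass them; how the actual verifier weights each stage is defined in the reward code listed under Repository Layout.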

The public artifact URL is:

https://github.com/benyucong/rl-quantum

Public Model and Data

Vista RL checkpoint: Benyucong/rl_quantum_4b
SFT initialization checkpoint: Benyucong/sft_quantum_circuit_gen_4B
RL/evaluation dataset: Benyucong/graph-data-quantum-rl

Repository Layout

Path	Purpose
`artifact/`	CAIS artifact README, appendix PDF/TeX, extracted Tables, and reviewer-facing helper scripts.
`artifact/scripts/run_quantum_experiment.sh`	Runs model generation plus evaluation, or evaluates an existing raw generation JSON.
`artifact/scripts/draw_vista_figures.py`	Runs the `vista_draw/` plotting scripts in headless mode and can derive two plot tables from fresh evaluation output.
`artifact/scripts/build_artifact_report.py`	Builds one PDF containing extracted tables and regenerated plots.
`artifact/scripts/show_paper_tables.py`	Prints extracted Tables 1-3 as Markdown.
`examples/train/quantum/`	Vista/quantum GRPO training scripts and Slurm launchers.
`verl_tool/servers/tools/quantum_cpu.py`	Quantum verifier tool used during rollouts.
`verl_tool/servers/tools/utils/quantum_reward_cal.py`	Staged quantum reward calculation.
`verl_tool/workers/reward_manager/quantum.py`	Reward manager for quantum verifier observations.
`quantum-code-generation/code/generation/`	Hugging Face/vLLM scripts for generating OpenQASM samples from checkpoints.
`quantum-code-generation/code/evaluation/`	QASM parsing, simulation, and metric evaluation scripts.
`quantum-code-generation/code/data_generation/`	Quantum graph-optimization data-generation utilities and included problem-instance inputs.
`vista_draw/`	Plot scripts and plot-ready CSV/JSON inputs for Figures.

This repository is based on the VerlTool training framework, but the root README is intentionally focused on the Vista quantum artifact and reviewer workflow.

Supported Environments and Dependency Stacks

Use separate virtual environments for the CPU evaluator, GPU generation, and full training stacks. The dependency differences between the root training environment and quantum-code-generation/ are expected: these components were developed and run in different venvs. Do not install every requirements.txt file into one environment. The evaluator intentionally uses the PennyLane/Qiskit 1.x stack, while the GPU generation/training stacks use PyTorch/vLLM/CUDA packages that require different sympy, numpy, and CUDA library versions.

Supported reviewer-facing Python version:

Python 3.10.x, tested with Python 3.10.16.
Python 3.11.x is not supported for the PennyLane/Qiskit evaluator. Reviewers observed PennyLane==0.40.0 failures with newer autoray and dependency conflicts between torch>=2.6 and pennylane-qiskit==0.40.0.

Supported OS:

Linux x86_64 is the supported artifact platform. The checkpoint-based path was reproduced by reviewers on Ubuntu 22.04 with an NVIDIA P100 GPU and on Debian 12 with an NVIDIA A100 GPU.
Author-tested Linux distributions: Red Hat Enterprise Linux 9.6 (Plow) and SUSE Linux Enterprise Server 15 SP6.
CPU-only table, summary, evaluation, and plotting workflows should work on current Linux distributions with Python 3.10.
Windows and macOS are not tested for this artifact. The CUDA/NVIDIA generation, CUDA-Q/cuQuantum, and training paths require a Linux NVIDIA driver compatible with the CUDA 12.x wheel stack.

Key pinned stacks:

Workflow	Requirements file	GPU requirement	Key versions
Dependency-free table/summary checks	none	CPU-only	Python standard library.
Packaged figure rendering	`artifact/requirements-figures.txt`	CPU-only	`matplotlib>=3.7`, `numpy>=1.24`, `pandas>=2.0`, `pypdf>=4.0`.
Raw-generation metric evaluation	`quantum-code-generation/code/evaluation/requirements.txt`	CPU-only	`numpy==2.0.0`, `PennyLane==0.40.0`, `pennylane-qiskit==0.40.0`, `autoray==0.6.12`, `sympy==1.12.1`, `scipy==1.16.1`, `qiskit==1.2.4`, `qiskit-aer==0.16.1`, `qiskit-algorithms==0.3.1`, `qiskit-qasm3-import==0.5.1`.
Checkpoint generation from the 4B model	`quantum-code-generation/code/generation/requirements.txt`	NVIDIA/CUDA GPU	`torch==2.7.1`, `vllm==0.10.0`, `transformers==4.55.0`, `datasets==4.0.0`, `ray==2.48.0`, CUDA 12.6 wheel packages such as `nvidia-cuda-runtime-cu12==12.6.77` and `nvidia-cudnn-cu12==9.5.1.17`.
Full Vista/RL training and verifier stack	root `requirements.txt`	NVIDIA/CUDA GPU by default	`torch==2.6.0`, `vllm==0.8.5`, `ray==2.47.1`, `transformers==4.55.0`, `qiskit==2.1.1`, `qiskit-aer==0.17.1`, `qiskit-qasm3-import==0.6.0`, `cudaq==0.10.0`, `cuda-quantum-cu12==0.10.0`, `cuquantum-python-cu12==25.6.0`, `cudensitymat-cu12==0.2.0`, `custatevec-cu12==1.9.0.post0`, `cutensornet-cu12==2.8.0`, `cupy-cuda12x==13.5.1`, CUDA 12.4 wheel packages such as `nvidia-cuda-runtime-cu12==12.4.127` and `nvidia-cudnn-cu12==9.1.0.70`.
IBM/hardware plot reproduction from archived logs	`ibm/requirements.txt` or `artifact/requirements-figures.txt` for plots only	CPU-only for archived plots; IBM account/backend only for live QPU runs	`qiskit==1.2.4`, `qiskit-ibm-runtime==0.29.0`, `qiskit-ibm-provider==0.11.0`, `PennyLane==0.40.0`, `autoray==0.6.12`, `sympy==1.12.1`.

Expected cross-venv conflicts:

Keep quantum-code-generation/code/evaluation/requirements.txt in a fresh CPU environment. The apparent sympy conflict is normal if the training and evaluation stacks are mixed: torch>=2.6 expects sympy>=1.13.x, while pennylane-qiskit==0.40.0 requires sympy<1.13.
Keep autoray==0.6.12 with PennyLane==0.40.0; newer autoray releases can remove APIs used by that PennyLane version.
Do not upgrade numpy opportunistically. The supported evaluator pins numpy==2.0.0; the generation stack pins numpy==2.2.6.
The packaged GPU stacks are CUDA/NVIDIA stacks. AMD/ROCm systems require replacing PyTorch, vLLM, CUDA-Q, and cuQuantum packages with ROCm-compatible alternatives; that is outside the supported artifact path.
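
Before running the evaluator, it can help to confirm that the active environment really carries the evaluator pins rather than versions pulled in by a GPU stack. The sketch below uses only the standard library. The pin values are copied from the table above; the helper names `installed_version` and `pin_mismatches` are hypothetical and not part of this repository.

```python
from importlib import metadata
from typing import Callable, Dict, List, Optional

# Evaluator pins copied from the "Key pinned stacks" table.
EVALUATOR_PINS: Dict[str, str] = {
    "numpy": "2.0.0",
    "PennyLane": "0.40.0",
    "pennylane-qiskit": "0.40.0",
    "autoray": "0.6.12",
    "sympy": "1.12.1",
    "qiskit": "1.2.4",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> List[str]:
    """List human-readable mismatches between pins and the active environment."""
    problems = []
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            problems.append(
                f"{name}: expected {wanted}, found {found or 'not installed'}"
            )
    return problems


if __name__ == "__main__":
    for line in pin_mismatches(EVALUATOR_PINS) or ["all evaluator pins match"]:
        print(line)
```

Run inside `.venv-eval`, a non-empty report for `sympy` or `autoray` would suggest that packages from another stack were installed into the same environment.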

Quick Artifact Check

These commands do not require a GPU.

git clone https://github.com/benyucong/rl-quantum.git
cd rl-quantum

python3 artifact/scripts/show_paper_tables.py all
python3 artifact/scripts/summarize_eval_outputs.py quantum-code-generation/code/evaluation/out

Expected behavior:

show_paper_tables.py prints the extracted values for Tables 1-3.
summarize_eval_outputs.py prints a compact Markdown summary of included summary_stats_*.json files.
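
To see which summary files the second command would pick up, the following sketch locates files matching `summary_stats_*.json` and reports their top-level keys. It does not reproduce `summarize_eval_outputs.py`; the JSON schema is not described here, so the sketch deliberately reads nothing beyond the key names. The function name `list_summary_files` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def list_summary_files(out_dir: str) -> Dict[str, List[str]]:
    """Map each summary_stats_*.json file name to its sorted top-level keys."""
    found: Dict[str, List[str]] = {}
    for path in sorted(Path(out_dir).glob("summary_stats_*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Only dict-shaped summaries have keys; other shapes get an empty list.
        found[path.name] = sorted(data) if isinstance(data, dict) else []
    return found


if __name__ == "__main__":
    out_dir = "quantum-code-generation/code/evaluation/out"
    for name, keys in list_summary_files(out_dir).items():
        print(name, keys)
```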

The detailed artifact instructions are in artifact/README.md. The formal compiled artifact appendix is artifact/RL_Quantum_ACM_Journal/CAIS-26-AE/appendix.pdf.

Extracted Tables and Figure Map

The Table values and the figure-to-artifact map are stored as CSV files.

Display the extracted Tables as Markdown:

python3 artifact/scripts/show_paper_tables.py table1
python3 artifact/scripts/show_paper_tables.py table2
python3 artifact/scripts/show_paper_tables.py table3

Recompute Metrics From Raw Generations

Evaluation from existing raw generation JSONs is CPU-only, although it can take time depending on sample count. Use Python 3.10 and install only the quantum evaluation dependencies:

python3.10 -m venv .venv-eval
source .venv-eval/bin/activate
python -m pip install -r quantum-code-generation/code/evaluation/requirements.txt

Run the evaluator:

cd quantum-code-generation/code/evaluation
python3 src/evaluate_samples.py \
  ../generation/out/<RAW_GENERATION_JSON> \
  out \
  <MODEL_LABEL>

Expected outputs:

out/summary_stats_<MODEL_LABEL>.json
out/summary_<MODEL_LABEL>_raw_data.csv
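
A quick way to confirm that an evaluator run finished is to check for the two expected files. This is a minimal sketch; the file-name patterns come from the list above, while the helper name `missing_eval_outputs` and the example label are hypothetical.

```python
from pathlib import Path
from typing import List


def missing_eval_outputs(out_dir: str, model_label: str) -> List[Path]:
    """Return the expected evaluator output files that are absent from out_dir."""
    out = Path(out_dir)
    expected = [
        out / f"summary_stats_{model_label}.json",
        out / f"summary_{model_label}_raw_data.csv",
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "my_model" is a placeholder for the <MODEL_LABEL> passed to the evaluator.
    for path in missing_eval_outputs("out", "my_model"):
        print(f"missing: {path}")
```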

For low-friction artifact review, the raw generation JSONs used for Table 1 and Table 2 should be included in the repository. Without those JSONs, reviewers need GPU access to regenerate model outputs. The Table 1 API baselines used DeepSeek-V3, GPT-5, and GPT-4o; rerunning them requires provider API credentials and quota, while re-evaluating archived raw JSONs does not.

Generate New Model Outputs

New Table 1 data requires GPU inference because the 4B model must generate OpenQASM samples. This path requires an NVIDIA/CUDA GPU and a CUDA driver compatible with the CUDA 12.x wheel stack installed by the generation requirements. The helper script uses the public Vista checkpoint by default.

Use a generation-only environment and skip evaluation in that environment:

python3.10 -m venv .venv-gen
source .venv-gen/bin/activate
python -m pip install -r quantum-code-generation/code/generation/requirements.txt

bash artifact/scripts/run_quantum_experiment.sh \
  --model Benyucong/rl_quantum_4b \
  --dataset Benyucong/graph-data-quantum-rl \
  --n-samples 50 \
  --out-dir artifact_runs/reviewer_smoke \
  --skip-eval

The script writes:

artifact_runs/reviewer_smoke/generation/: raw generated-circuit JSON.
artifact_runs/reviewer_smoke/manifest.txt: run metadata.

Deactivate .venv-gen, activate .venv-eval, and run the CPU evaluator from the previous section on the generated JSON. Keeping generation and evaluation in separate environments avoids the known PyTorch/PennyLane sympy conflict.

The default generate_samples.py script produces one completion per selected sample. Pass@10 reproduction requires archived Pass@10 raw generations or a generation script configured to sample ten completions per prompt.
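
To illustrate why ten completions per prompt are needed: under the common definition, a prompt counts toward Pass@10 if at least one of its ten completions passes. The sketch below assumes that definition and a hypothetical mapping from prompt ID to per-completion pass flags; the metric code in the evaluation scripts may compute Pass@10 differently.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int = 10) -> float:
    """Fraction of prompts with at least one passing completion among the first k.

    Raises ValueError when a prompt has fewer than k completions, which is
    the situation produced by a one-completion-per-sample generation run.
    """
    if not results:
        raise ValueError("no prompts to score")
    solved = 0
    for prompt_id, outcomes in results.items():
        if len(outcomes) < k:
            raise ValueError(
                f"{prompt_id}: need {k} completions, got {len(outcomes)}"
            )
        solved += any(outcomes[:k])
    return solved / len(results)


if __name__ == "__main__":
    demo = {
        "graph_0": [False] * 9 + [True],  # solved by the last completion
        "graph_1": [False] * 10,          # never solved
    }
    print(pass_at_k(demo, k=10))
```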

Draw Figures

Install lightweight plotting dependencies:

python3.10 -m venv .venv-fig
source .venv-fig/bin/activate
python -m pip install -r artifact/requirements-figures.txt

One Command for All Packaged Plots

Render the full packaged figure set from vista_draw/ plot-ready data:

python3 artifact/scripts/draw_vista_figures.py \
  --input-dir vista_draw \
  --output-dir artifact_runs/paper_figures \
  --strict
python3 artifact/scripts/build_artifact_report.py \
  --figures-dir artifact_runs/paper_figures \
  --tables-dir artifact/tables \
  --output artifact_runs/paper_figures/all_figures_tables.pdf

The draw_vista_figures.py command expects:

vista_draw/dataset/*.csv
vista_draw/dataset/*.json
vista_draw/logs.csv
the plotting scripts in vista_draw/*.py
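
Checking these inputs up front can save a failed `--strict` run. The sketch below tests only for the four input groups listed above; the helper name `missing_plot_inputs` is hypothetical, and the check says nothing about whether the file contents are valid.

```python
from pathlib import Path
from typing import List


def missing_plot_inputs(input_dir: str = "vista_draw") -> List[str]:
    """Return a description of each expected input group that is missing."""
    root = Path(input_dir)
    problems = []
    if not any((root / "dataset").glob("*.csv")):
        problems.append("no CSV files in dataset/")
    if not any((root / "dataset").glob("*.json")):
        problems.append("no JSON files in dataset/")
    if not (root / "logs.csv").is_file():
        problems.append("logs.csv not found")
    if not any(root.glob("*.py")):
        problems.append("no plotting scripts (*.py)")
    return problems


if __name__ == "__main__":
    issues = missing_plot_inputs("vista_draw")
    print("inputs look complete" if not issues else "\n".join(issues))
```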

Together, the two commands write:

regenerated plot data and figures under artifact_runs/paper_figures/dataset/
training/log-derived figures under artifact_runs/paper_figures/figures/
artifact_runs/paper_figures/plot_status.json
artifact_runs/paper_figures/all_figures_tables.pdf

The verified all-plot run covers these tasks:

box, compilability, relative_entropy, scalability_qubits, scalability_gates_depth, per_primitive, training_dynamics, verifier_efficiency, training_logs, stage_cost, latency_breakdown, helmi_reward_stability, and real_device_tradeoff.

Plots From Checkpoint-Based Evaluation Outputs

Render the objective-gap and relative-entropy plots from a checkpoint-based evaluation directory:

python3 artifact/scripts/draw_vista_figures.py \
  --evaluation-dir artifact_runs/reviewer_smoke/evaluation \
  --only box,relative_entropy \
  --output-dir artifact_runs/reviewer_smoke/figures \
  --strict
python3 artifact/scripts/build_artifact_report.py \
  --figures-dir artifact_runs/reviewer_smoke/figures \
  --tables-dir artifact/tables \
  --output artifact_runs/reviewer_smoke/figures/all_figures_tables.pdf

Checkpoint-based evaluate_samples.py output directly supports the objective-gap box plot and the relative-entropy threshold plot. The scalability, per-primitive, training-dynamics, verifier-efficiency, and hardware plots require their corresponding aggregate CSV/JSON tables or logs in the vista_draw/ layout.
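
When assembling an `--only` value by hand, a typo in a task name is easy to make. The sketch below validates a comma-separated value against the task names from the verified all-plot run listed earlier. Whether `draw_vista_figures.py` accepts every one of these names through `--only` is an assumption; only `box` and `relative_entropy` are shown in the commands here, and `parse_only` is a hypothetical helper.

```python
from typing import List

# Task names copied from the verified all-plot run listed in this README.
KNOWN_TASKS = (
    "box",
    "compilability",
    "relative_entropy",
    "scalability_qubits",
    "scalability_gates_depth",
    "per_primitive",
    "training_dynamics",
    "verifier_efficiency",
    "training_logs",
    "stage_cost",
    "latency_breakdown",
    "helmi_reward_stability",
    "real_device_tradeoff",
)


def parse_only(value: str) -> List[str]:
    """Split a comma-separated --only value and reject unknown task names."""
    tasks = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [task for task in tasks if task not in KNOWN_TASKS]
    if unknown:
        raise ValueError(f"unknown plot task(s): {', '.join(unknown)}")
    return tasks


if __name__ == "__main__":
    print(parse_only("box,relative_entropy"))
```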

Train Vista

Full training is expensive and is not expected for a quick artifact review. The supported public training stack in this repository is the root CUDA/NVIDIA environment in requirements.txt. The paper-scale configuration fine-tunes a 4B model with GRPO and FSDP on 8 AMD MI250X GPUs or 8 NVIDIA H100 GPUs, but the checked-in package pins CUDA/NVIDIA wheels; AMD/ROCm reuse requires replacing those packages and adapting cluster launch scripts.

Install the root package in the training/verifier environment so python -m verl_tool.servers.serve works from scripts such as start_qasm_server.sh and examples/train/quantum/train_qwen_4B_quantum.sh:

python -m pip install -e .

Training entry points:

export VISTA_REWARD_ABLATION=full
bash examples/train/quantum/train_qwen_4B_quantum.sh

For the Table 2 reward ablation study, set VISTA_REWARD_ABLATION to one of full, no_ev, no_re, no_opt, or validity_only before launching training.
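
A launcher wrapper can reject a mistyped ablation name before any GPU time is spent. The sketch below only validates the variable against the five values named above. Treating an unset variable as `full` is an assumption made for this example, and `read_ablation` is a hypothetical helper rather than code from the training scripts.

```python
import os
from typing import Mapping

# Allowed values as listed for the Table 2 reward ablation study.
ALLOWED_ABLATIONS = ("full", "no_ev", "no_re", "no_opt", "validity_only")


def read_ablation(env: Mapping[str, str] = os.environ) -> str:
    """Return the requested ablation mode.

    An unset variable falls back to "full" (an assumed default).
    """
    mode = env.get("VISTA_REWARD_ABLATION", "full").strip()
    if mode not in ALLOWED_ABLATIONS:
        allowed = ", ".join(ALLOWED_ABLATIONS)
        raise ValueError(
            f"VISTA_REWARD_ABLATION={mode!r} is not one of: {allowed}"
        )
    return mode


if __name__ == "__main__":
    print(read_ablation({"VISTA_REWARD_ABLATION": "no_opt"}))
```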

Cluster launchers and variants:

examples/train/quantum/train_hpc.sh
examples/train/quantum/train_hpc_4gpus.sh
examples/train/quantum/train_mn5.sh

Before launching train_qwen_4B_quantum.sh, edit or provide the local data paths configured near the top of the script:

data/rl-qasm/graph-data-quantum-rl-linus/train.parquet
data/rl-qasm/graph-data-quantum-rl-linus/test.parquet

The script also starts a local quantum_cpu verifier service on a random port, uses vLLM rollouts, and logs to W&B by default through trainer.logger=['console','wandb']; change the logger or provide W&B credentials for non-interactive cluster runs.
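
For readers adapting the launcher, one common way to obtain a random free port is to bind to port 0 and let the operating system choose. This is a general-purpose sketch, not a description of how `train_qwen_4B_quantum.sh` selects its port; `pick_free_port` is a hypothetical helper.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(pick_free_port())
```

The port is released when the socket closes, so another process could claim it before the service starts; on a shared cluster node, a retry on bind failure is a sensible addition.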

The staged verifier and reward implementation used by these scripts are in:

verl_tool/servers/tools/quantum_cpu.py
verl_tool/servers/tools/utils/quantum_reward_cal.py
verl_tool/workers/reward_manager/quantum.py

IBM and Figure 10 Data

The hardware/QPU path is optional for artifact evaluation unless explicitly requested. Figure 10 can be regenerated from archived logs without live QPU access:

cd ibm
python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
./reproduce_plots.sh

Figure 10(a), optimality-gap agreement:

raw data: ibm/out_instance_eval/instance_*.json
script: ibm/src/analyze_instance_eval_results.py
regenerated output: ibm/out_instance_eval/plots/gap_scatter.pdf

Figure 10(b), scheduled execution duration:

raw data: ibm/out_speed/speed_benchmark_*.json
script: ibm/src/plot_execution_time_benchmark.py
regenerated output: ibm/out_speed/plots_time/scheduled_duration_boxplot.pdf

The original ibm_torino backend was retired on April 1, 2026. Live hardware demos should use any currently available IBM Quantum backend and may produce different noisy hardware results; the archived logs are the reproducible path.

Hardware Expectations

Task	GPU needed?	Minimum / recommended resources	Notes
Show extracted Tables	No	Any Python 3.10 environment.	Uses only Python standard library.
Summarize included evaluation JSONs	No	Any Python 3.10 environment.	Uses only Python standard library.
Recompute metrics from raw generation JSONs	No	4 CPU cores and 16 GB RAM for small smoke runs; more cores/RAM reduce runtime for full raw JSONs.	Requires the CPU evaluator environment.
Redraw figures from packaged plot data	No	2 CPU cores and 8 GB RAM.	Requires Matplotlib, NumPy, Pandas, and pypdf.
IBM/hardware figures from archived logs	No	2 CPU cores and 8 GB RAM.	Live QPU reruns require an IBM Quantum account, an active backend, and queue time; archived-log reproduction is CPU-only.
Generate new OpenQASM outputs from `Benyucong/rl_quantum_4b`	Yes, NVIDIA/CUDA	Minimum observed reviewer configuration: 1 NVIDIA P100 with 12 GB VRAM for the 50-sample checkpoint smoke path. Recommended: 1 NVIDIA GPU with at least 24 GB VRAM for fewer OOM retries; 40 GB+ is preferable for larger samples or longer contexts.	AMD/ROCm is not supported by the packaged CUDA dependency stack.
CUDA-Q/cuQuantum demos under `cuda-quantum/`	Yes, NVIDIA/CUDA	NVIDIA GPU plus Singularity/Apptainer or the root CUDA stack.	Optional path; not required for table/figure reproduction.
Full GRPO training	Yes	Paper-scale run used 8 AMD MI250X GPUs or 8 NVIDIA H100 GPUs for roughly two days. The checked-in CUDA environment is NVIDIA-oriented.	Not expected for kick-the-tires or checkpoint-based reproduction.

Cluster-Specific Assumptions

Portable reviewer workflows are the Python entry points shown above. The following scripts are included for transparency but contain local HPC assumptions that must be edited before reuse:

Files	Assumptions to change
`examples/train/quantum/train_hpc.sh`, `examples/train/quantum/train_hpc_4gpus.sh`, `demo.sh`, `convert-verl2huggingface.sh`	Slurm directives, Aalto-style GPU partition names such as `gpu-h200-141g-ellis`, GPU counts, memory limits, and local module names.
`examples/train/quantum/train_mn5.sh`, `examples/train/quantum/start-ray.sh`, `install_deps_offline.sh`	`/gpfs/projects/ehpc95` paths, `ehpc95` Slurm account/queue, Python 3.11.5 module, offline Hugging Face/cache settings, local `venv`, and hard-coded `PYTHONPATH`.
`quantum-code-generation/code/generation/.sh`, `quantum-code-generation/code/evaluation/_slurm.sh`, `quantum-code-generation/code/training/*_distributed.sh`, `quantum-code-generation/code/training/sft_single.sh`	Slurm partitions, mail addresses, `.venv` location, module names, old dataset/model names, and fixed sample counts.
`cuda-quantum/program.sh`	`/scratch/cs/adis/yuc10/...` work directory, Singularity image path, Slurm GPU allocation, and NVIDIA `--nv` container execution.
`pack_deps.sh`, `install_deps_offline.sh`	Maintainer-oriented wheel-cache creation/installation using local module names and directory layout. Prefer the pinned requirements files above for normal artifact evaluation.

When adapting these scripts, first replace absolute paths, Slurm #SBATCH settings, module loads, cache directories, W&B/Hugging Face settings, and GPU counts. The scripts should not be assumed portable as-is.
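
A rough scan can flag the lines that most likely need editing. The sketch below searches script text for a few of the assumption types named in the table above. The pattern set is illustrative and incomplete, and `find_site_assumptions` is a hypothetical helper, not a tool shipped with this repository.

```python
import re
from typing import Dict, List

# Patterns drawn from the assumptions table above; extend as needed.
SITE_PATTERNS: Dict[str, str] = {
    "slurm_directive": r"^\s*#SBATCH\b",
    "module_load": r"^\s*module\s+load\b",
    "absolute_hpc_path": r"/(gpfs|scratch)/\S+",
    "pythonpath_export": r"^\s*export\s+PYTHONPATH=",
}


def find_site_assumptions(script_text: str) -> List[str]:
    """Return 'line_no: label: line' entries for lines that look site-specific."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        for label, pattern in SITE_PATTERNS.items():
            if re.search(pattern, line):
                hits.append(f"{number}: {label}: {line.strip()}")
    return hits


if __name__ == "__main__":
    sample = "#!/bin/bash\n#SBATCH --gres=gpu:1\nmodule load cuda\n"
    print("\n".join(find_site_assumptions(sample)))
```

To scan a real launcher, pass the file contents, for example `find_site_assumptions(open("examples/train/quantum/train_hpc.sh").read())`.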

Artifact Badge Note

Target badges: Available, Functional, and Results Reproduced. Functional and Results Reproduced require the raw generation outputs, final evaluation summaries, plot-ready data, and hardware logs described in the artifact README. Available is supported by the Zenodo archive of the submitted snapshot: https://doi.org/10.5281/zenodo.19712131.

License and Upstream Base

This repository retains the upstream framework code and license information from VerlTool where applicable. The Vista-specific artifact code, quantum verifier/reward integration, generation/evaluation wrappers, and table/figure utilities are provided for the CAIS 2026 artifact evaluation workflow.
