This repository contains the code and artifact materials for:
Vista: Verifier-in-the-Loop Agentic RL for Semantic Program Synthesis in Quantum Computing
ACM CAIS 2026 submission #217
Vista trains a language-model policy to generate OpenQASM 3.0 quantum circuits using staged verifier feedback. The verifier checks generated programs through progressively richer semantic stages: syntax/feasibility, behavior alignment, objective-value verification, and utility after downstream optimization. The repository includes the quantum training scripts, verifier/reward code, generation and evaluation scripts, table data, and figure-generation workflow.
The public artifact URL is:
https://github.com/benyucong/rl-quantum
- Vista RL checkpoint: Benyucong/rl_quantum_4b
- SFT initialization checkpoint: Benyucong/sft_quantum_circuit_gen_4B
- RL/evaluation dataset: Benyucong/graph-data-quantum-rl
| Path | Purpose |
|---|---|
| `artifact/` | CAIS artifact README, appendix PDF/TeX, extracted Tables, and reviewer-facing helper scripts. |
| `artifact/scripts/run_quantum_experiment.sh` | Runs model generation plus evaluation, or evaluates an existing raw generation JSON. |
| `artifact/scripts/draw_vista_figures.py` | Runs the `vista_draw/` plotting scripts in headless mode and can derive two plot tables from fresh evaluation output. |
| `artifact/scripts/build_artifact_report.py` | Builds one PDF containing extracted tables and regenerated plots. |
| `artifact/scripts/show_paper_tables.py` | Prints extracted Tables 1-3 as Markdown. |
| `examples/train/quantum/` | Vista/quantum GRPO training scripts and Slurm launchers. |
| `verl_tool/servers/tools/quantum_cpu.py` | Quantum verifier tool used during rollouts. |
| `verl_tool/servers/tools/utils/quantum_reward_cal.py` | Staged quantum reward calculation. |
| `verl_tool/workers/reward_manager/quantum.py` | Reward manager for quantum verifier observations. |
| `quantum-code-generation/code/generation/` | Hugging Face/vLLM scripts for generating OpenQASM samples from checkpoints. |
| `quantum-code-generation/code/evaluation/` | QASM parsing, simulation, and metric evaluation scripts. |
| `quantum-code-generation/code/data_generation/` | Quantum graph-optimization data-generation utilities and included problem-instance inputs. |
| `vista_draw/` | Plot scripts and plot-ready CSV/JSON inputs for Figures. |
This repository is based on the VerlTool training framework, but the root README is intentionally focused on the Vista quantum artifact and reviewer workflow.
Use separate virtual environments for the CPU evaluator, GPU generation, and full training stacks. The dependency differences between the root training environment and quantum-code-generation/ are expected: these components were developed and run in different venvs. Do not install every requirements.txt file into one environment. The evaluator intentionally uses the PennyLane/Qiskit 1.x stack, while the GPU generation/training stacks use PyTorch/vLLM/CUDA packages that require different sympy, numpy, and CUDA library versions.
Supported reviewer-facing Python version:
- Python 3.10.x, tested with Python 3.10.16.
- Python 3.11.x is not supported for the PennyLane/Qiskit evaluator. Reviewers observed `PennyLane==0.40.0` failures with newer `autoray` and dependency conflicts between `torch>=2.6` and `pennylane-qiskit==0.40.0`.
Supported OS:
- Linux x86_64 is the supported artifact platform. The checkpoint-based path was reproduced by reviewers on Ubuntu 22.04 with an NVIDIA P100 GPU and on Debian 12 with an NVIDIA A100 GPU.
- Author-tested Linux distributions: Red Hat Enterprise Linux 9.6 (Plow) and SUSE Linux Enterprise Server 15 SP6.
- CPU-only table, summary, evaluation, and plotting workflows should work on current Linux distributions with Python 3.10.
- Windows and macOS are not tested for this artifact. The CUDA/NVIDIA generation, CUDA-Q/cuQuantum, and training paths require a Linux NVIDIA driver compatible with the CUDA 12.x wheel stack.
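A quick pre-flight check of the interpreter and platform can catch an unsupported setup before any dependencies are installed. The following is a minimal sketch using only the standard library; the function name `environment_warnings` is hypothetical and not part of this repository.

```python
import platform
import sys


def environment_warnings(version_info=None, system=None, machine=None):
    """Return warnings for settings outside the documented artifact platform."""
    # Fall back to the running interpreter and host when no values are passed.
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    warnings = []
    if tuple(version_info[:2]) != (3, 10):
        warnings.append(
            f"Python {version_info[0]}.{version_info[1]} is not the supported 3.10.x"
        )
    if system != "Linux":
        warnings.append(f"{system} is not tested; Linux is the supported platform")
    if machine != "x86_64":
        warnings.append(f"{machine} is not the supported x86_64 architecture")
    return warnings


if __name__ == "__main__":
    for message in environment_warnings():
        print("WARNING:", message)
```

An empty list means the interpreter and host match the documented Python 3.10 / Linux x86_64 combination; it says nothing about GPU drivers.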
Key pinned stacks:
| Workflow | Requirements file | GPU requirement | Key versions |
|---|---|---|---|
| Dependency-free table/summary checks | none | CPU-only | Python standard library. |
| Packaged figure rendering | `artifact/requirements-figures.txt` | CPU-only | matplotlib>=3.7, numpy>=1.24, pandas>=2.0, pypdf>=4.0. |
| Raw-generation metric evaluation | `quantum-code-generation/code/evaluation/requirements.txt` | CPU-only | numpy==2.0.0, PennyLane==0.40.0, pennylane-qiskit==0.40.0, autoray==0.6.12, sympy==1.12.1, scipy==1.16.1, qiskit==1.2.4, qiskit-aer==0.16.1, qiskit-algorithms==0.3.1, qiskit-qasm3-import==0.5.1. |
| Checkpoint generation from the 4B model | `quantum-code-generation/code/generation/requirements.txt` | NVIDIA/CUDA GPU | torch==2.7.1, vllm==0.10.0, transformers==4.55.0, datasets==4.0.0, ray==2.48.0, CUDA 12.6 wheel packages such as nvidia-cuda-runtime-cu12==12.6.77 and nvidia-cudnn-cu12==9.5.1.17. |
| Full Vista/RL training and verifier stack | root `requirements.txt` | NVIDIA/CUDA GPU by default | torch==2.6.0, vllm==0.8.5, ray==2.47.1, transformers==4.55.0, qiskit==2.1.1, qiskit-aer==0.17.1, qiskit-qasm3-import==0.6.0, cudaq==0.10.0, cuda-quantum-cu12==0.10.0, cuquantum-python-cu12==25.6.0, cudensitymat-cu12==0.2.0, custatevec-cu12==1.9.0.post0, cutensornet-cu12==2.8.0, cupy-cuda12x==13.5.1, CUDA 12.4 wheel packages such as nvidia-cuda-runtime-cu12==12.4.127 and nvidia-cudnn-cu12==9.1.0.70. |
| IBM/hardware plot reproduction from archived logs | `ibm/requirements.txt` or `artifact/requirements-figures.txt` for plots only | CPU-only for archived plots; IBM account/backend only for live QPU runs | qiskit==1.2.4, qiskit-ibm-runtime==0.29.0, qiskit-ibm-provider==0.11.0, PennyLane==0.40.0, autoray==0.6.12, sympy==1.12.1. |
Expected cross-venv conflicts:
- Keep `quantum-code-generation/code/evaluation/requirements.txt` in a fresh CPU environment. The apparent `sympy` conflict is normal if the training and evaluation stacks are mixed: `torch>=2.6` expects `sympy>=1.13.x`, while `pennylane-qiskit==0.40.0` requires `sympy<1.13`.
- Keep `autoray==0.6.12` with `PennyLane==0.40.0`; newer `autoray` releases can remove APIs used by that PennyLane version.
- Do not upgrade `numpy` opportunistically. The supported evaluator pins `numpy==2.0.0`; the generation stack pins `numpy==2.2.6`.
- The packaged GPU stacks are CUDA/NVIDIA stacks. AMD/ROCm systems require replacing PyTorch, vLLM, CUDA-Q, and cuQuantum packages with ROCm-compatible alternatives; that is outside the supported artifact path.
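To confirm that an evaluator environment has not drifted from its pins (for example after an accidental mixed install), the installed versions can be compared against the pinned ones. This is a minimal sketch, not a script shipped with the artifact: the helper names are hypothetical, and `EVALUATOR_PINS` copies only a subset of the evaluator pins listed in the table above.

```python
from importlib import metadata

# Subset of the evaluator pins from the "Key pinned stacks" table.
EVALUATOR_PINS = {
    "numpy": "2.0.0",
    "PennyLane": "0.40.0",
    "pennylane-qiskit": "0.40.0",
    "autoray": "0.6.12",
    "sympy": "1.12.1",
    "qiskit": "1.2.4",
}


def installed_versions(names):
    """Return {name: version or None} for the active environment."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pins, installed):
    """List (name, pinned, installed) for every package that deviates from its pin."""
    return [
        (name, pinned, installed.get(name))
        for name, pinned in pins.items()
        if installed.get(name) != pinned
    ]


if __name__ == "__main__":
    current = installed_versions(EVALUATOR_PINS)
    for name, pinned, got in find_mismatches(EVALUATOR_PINS, current):
        print(f"{name}: pinned {pinned}, found {got or 'not installed'}")
```

Run inside `.venv-eval`, no output means the listed packages match their pins. A `sympy` line reporting 1.13 or newer would suggest that a PyTorch-based stack was installed into the same environment.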
These commands do not require a GPU.

    git clone https://github.com/benyucong/rl-quantum.git
    cd rl-quantum
    python3 artifact/scripts/show_paper_tables.py all
    python3 artifact/scripts/summarize_eval_outputs.py quantum-code-generation/code/evaluation/out

Expected behavior:

- `show_paper_tables.py` prints the extracted values for Tables 1-3.
- `summarize_eval_outputs.py` prints a compact Markdown summary of included `summary_stats_*.json` files.
The detailed artifact instructions are in artifact/README.md. The formal compiled artifact appendix is artifact/RL_Quantum_ACM_Journal/CAIS-26-AE/appendix.pdf.
The table values and the figure-to-artifact map are stored as CSVs:
- artifact/tables/table1_passk_comparison.csv
- artifact/tables/table2_reward_ablation.csv
- artifact/tables/table3_training_settings.csv
- artifact/tables/figure_artifact_map.csv
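The CSVs listed above can also be inspected without the helper script. The sketch below renders any CSV text as a Markdown table using only the standard library; it makes no assumption about the column names in the artifact files and is not the implementation of `show_paper_tables.py`. The function name `csv_to_markdown` is hypothetical.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in body:
        # Pad short rows so every line has the same number of cells.
        cells = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Example usage from the repository root:
#   from pathlib import Path
#   text = Path("artifact/tables/table1_passk_comparison.csv").read_text(encoding="utf-8")
#   print(csv_to_markdown(text))
```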
Display the extracted Tables as Markdown:

    python3 artifact/scripts/show_paper_tables.py table1
    python3 artifact/scripts/show_paper_tables.py table2
    python3 artifact/scripts/show_paper_tables.py table3

Evaluation from existing raw generation JSONs is CPU-only, although it can take time depending on sample count. Use Python 3.10 and install only the quantum evaluation dependencies:

    python3.10 -m venv .venv-eval
    source .venv-eval/bin/activate
    python -m pip install -r quantum-code-generation/code/evaluation/requirements.txt

Run the evaluator:

    cd quantum-code-generation/code/evaluation
    python3 src/evaluate_samples.py \
      ../generation/out/<RAW_GENERATION_JSON> \
      out \
      <MODEL_LABEL>

Expected outputs:

- `out/summary_stats_<MODEL_LABEL>.json`
- `out/summary_<MODEL_LABEL>_raw_data.csv`
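After an evaluator run, it can help to confirm which summary files were written and what they contain. The sketch below lists every `summary_stats_*.json` file in an output directory together with its top-level keys. The JSON schema is not documented in this README, so the code only assumes each file is valid JSON; the function name `summarize_dir` is hypothetical and this is not a replacement for `summarize_eval_outputs.py`.

```python
import json
from pathlib import Path


def summarize_dir(out_dir):
    """Map each summary_stats_*.json file name to its sorted top-level keys."""
    result = {}
    for path in sorted(Path(out_dir).glob("summary_stats_*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Only dict-shaped files have keys; anything else is reported by type name.
        result[path.name] = sorted(data) if isinstance(data, dict) else type(data).__name__
    return result


if __name__ == "__main__":
    for name, keys in summarize_dir("out").items():
        print(name, keys)
```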
For low-friction artifact review, the raw generation JSONs used for Table 1 and Table 2 should be included in the repository. Without those JSONs, reviewers need GPU access to regenerate model outputs. The Table 1 API baselines used DeepSeek-V3, GPT-5, and GPT-4o; rerunning them requires provider API credentials and quota, while re-evaluating archived raw JSONs does not.
New Table 1 data requires GPU inference because the 4B model must generate OpenQASM samples. This path requires an NVIDIA/CUDA GPU and a CUDA driver compatible with the CUDA 12.x wheel stack installed by the generation requirements. The helper script uses the public Vista checkpoint by default.
Use a generation-only environment and skip evaluation in that environment:

    python3.10 -m venv .venv-gen
    source .venv-gen/bin/activate
    python -m pip install -r quantum-code-generation/code/generation/requirements.txt
    bash artifact/scripts/run_quantum_experiment.sh \
      --model Benyucong/rl_quantum_4b \
      --dataset Benyucong/graph-data-quantum-rl \
      --n-samples 50 \
      --out-dir artifact_runs/reviewer_smoke \
      --skip-eval

The script writes:

- `artifact_runs/reviewer_smoke/generation/`: raw generated-circuit JSON.
- `artifact_runs/reviewer_smoke/manifest.txt`: run metadata.
Deactivate .venv-gen, activate .venv-eval, and run the CPU evaluator from the previous section on the generated JSON. Keeping generation and evaluation in separate environments avoids the known PyTorch/PennyLane sympy conflict.
The default generate_samples.py script produces one completion per selected sample. Pass@10 reproduction requires archived Pass@10 raw generations or a generation script configured to sample ten completions per prompt.
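To see why one completion per prompt cannot support Pass@10, consider the commonly used unbiased pass@k estimator, which needs at least k samples per prompt. The sketch below implements that standard estimator; the paper's exact Pass@10 definition is not restated in this README, so treat this as an illustration rather than the evaluation code used for Table 1.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With ten completions per prompt, pass@10 is 1.0 as soon as one is correct.
    print(pass_at_k(n=10, c=1, k=10))
    # With a single completion, only pass@1 is defined; k=10 raises ValueError.
    print(pass_at_k(n=1, c=1, k=1))
```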
Install lightweight plotting dependencies:

    python3.10 -m venv .venv-fig
    source .venv-fig/bin/activate
    python -m pip install -r artifact/requirements-figures.txt

Render the full packaged figure set from `vista_draw/` plot-ready data:

    python3 artifact/scripts/draw_vista_figures.py \
      --input-dir vista_draw \
      --output-dir artifact_runs/paper_figures \
      --strict
    python3 artifact/scripts/build_artifact_report.py \
      --figures-dir artifact_runs/paper_figures \
      --tables-dir artifact/tables \
      --output artifact_runs/paper_figures/all_figures_tables.pdf

The command expects:

- `vista_draw/dataset/*.csv`
- `vista_draw/dataset/*.json`
- `vista_draw/logs.csv`
- the plotting scripts in `vista_draw/*.py`

It writes:

- regenerated plot data and figures under `artifact_runs/paper_figures/dataset/`
- training/log-derived figures under `artifact_runs/paper_figures/figures/`
- `artifact_runs/paper_figures/plot_status.json`
- `artifact_runs/paper_figures/all_figures_tables.pdf`
The verified all-plot run covers these tasks:
box, compilability, relative_entropy, scalability_qubits, scalability_gates_depth, per_primitive, training_dynamics, verifier_efficiency, training_logs, stage_cost, latency_breakdown, helmi_reward_stability, and real_device_tradeoff.
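When scripting partial plot runs, a typo in a task name is easy to miss. The sketch below checks a comma-separated task selection against the task names listed above. It assumes that the `--only` option of `draw_vista_figures.py` accepts these same names, which this README demonstrates only for `box` and `relative_entropy`; the helper name `parse_only` is hypothetical.

```python
# Task names copied from the verified all-plot run listed above.
KNOWN_TASKS = (
    "box",
    "compilability",
    "relative_entropy",
    "scalability_qubits",
    "scalability_gates_depth",
    "per_primitive",
    "training_dynamics",
    "verifier_efficiency",
    "training_logs",
    "stage_cost",
    "latency_breakdown",
    "helmi_reward_stability",
    "real_device_tradeoff",
)


def parse_only(value):
    """Split a comma-separated task selection and reject unknown task names."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_TASKS]
    if unknown:
        raise ValueError(f"unknown plot task(s): {', '.join(unknown)}")
    return names
```

For example, `parse_only("box,relative_entropy")` returns `["box", "relative_entropy"]`, while a misspelled name raises `ValueError` before any plotting starts.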
Render the objective-gap and relative-entropy plots from a checkpoint-based evaluation directory:

    python3 artifact/scripts/draw_vista_figures.py \
      --evaluation-dir artifact_runs/reviewer_smoke/evaluation \
      --only box,relative_entropy \
      --output-dir artifact_runs/reviewer_smoke/figures \
      --strict
    python3 artifact/scripts/build_artifact_report.py \
      --figures-dir artifact_runs/reviewer_smoke/figures \
      --tables-dir artifact/tables \
      --output artifact_runs/reviewer_smoke/figures/all_figures_tables.pdf

Checkpoint-based `evaluate_samples.py` output directly supports the objective-gap box plot and the relative-entropy threshold plot. The scalability, per-primitive, training-dynamics, verifier-efficiency, and hardware plots require their corresponding aggregate CSV/JSON tables or logs in the `vista_draw/` layout.
Full training is expensive and is not expected for a quick artifact review. The supported public training stack in this repository is the root CUDA/NVIDIA environment in requirements.txt. The paper-scale configuration fine-tunes a 4B model with GRPO and FSDP on 8 AMD MI250X GPUs or 8 NVIDIA H100 GPUs, but the checked-in package pins CUDA/NVIDIA wheels; AMD/ROCm reuse requires replacing those packages and adapting cluster launch scripts.
Install the root package in the training/verifier environment so `python -m verl_tool.servers.serve` works from scripts such as `start_qasm_server.sh` and `examples/train/quantum/train_qwen_4B_quantum.sh`:

    python -m pip install -e .

Training entry points:

    export VISTA_REWARD_ABLATION=full
    bash examples/train/quantum/train_qwen_4B_quantum.sh

For the Table 2 reward ablation study, set `VISTA_REWARD_ABLATION` to one of `full`, `no_ev`, `no_re`, `no_opt`, or `validity_only` before launching training.
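Because a long training job is costly to restart, it may be worth validating the ablation setting before submitting it. The sketch below checks `VISTA_REWARD_ABLATION` against the five documented values. How the training script itself treats an unset or misspelled value is not described here, so this check is deliberately strict; the helper name `resolve_ablation` is hypothetical.

```python
import os

# Values documented for the Table 2 reward ablation study.
ABLATION_MODES = ("full", "no_ev", "no_re", "no_opt", "validity_only")


def resolve_ablation(env=None):
    """Return the requested ablation mode, or raise if it is missing or undocumented."""
    env = os.environ if env is None else env
    mode = env.get("VISTA_REWARD_ABLATION")
    if mode not in ABLATION_MODES:
        allowed = ", ".join(ABLATION_MODES)
        raise ValueError(f"VISTA_REWARD_ABLATION={mode!r}; expected one of: {allowed}")
    return mode
```

Calling `resolve_ablation()` in a launcher wrapper fails fast on a value such as `no-ev` instead of leaving the mistake to surface during training.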
Cluster launchers and variants:
- `examples/train/quantum/train_hpc.sh`
- `examples/train/quantum/train_hpc_4gpus.sh`
- `examples/train/quantum/train_mn5.sh`
Before launching train_qwen_4B_quantum.sh, edit or provide the local data paths configured near the top of the script:
- `data/rl-qasm/graph-data-quantum-rl-linus/train.parquet`
- `data/rl-qasm/graph-data-quantum-rl-linus/test.parquet`
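A missing parquet file is cheaper to catch before the training script starts its verifier service and rollouts. The sketch below reports which of the two default data paths are absent under a given repository root. It assumes the script's defaults have not been edited; the helper name `missing_data_files` is hypothetical.

```python
from pathlib import Path

# Default data paths named near the top of train_qwen_4B_quantum.sh.
DATA_FILES = (
    "data/rl-qasm/graph-data-quantum-rl-linus/train.parquet",
    "data/rl-qasm/graph-data-quantum-rl-linus/test.parquet",
)


def missing_data_files(repo_root="."):
    """Return the expected parquet paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in DATA_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_data_files():
        print("missing:", rel)
```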
The script also starts a local quantum_cpu verifier service on a random port, uses vLLM rollouts, and logs to W&B by default through trainer.logger=['console','wandb']; change the logger or provide W&B credentials for non-interactive cluster runs.
The staged verifier and reward implementation used by these scripts are in:
- `verl_tool/servers/tools/quantum_cpu.py`
- `verl_tool/servers/tools/utils/quantum_reward_cal.py`
- `verl_tool/workers/reward_manager/quantum.py`
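The files above hold the actual verifier and reward logic. As a purely illustrative sketch of the general idea of staged checking, the code below runs a list of checks in order and stops at the first failure. The early stopping, the equal credit per stage, and the toy checks are all assumptions made for illustration; they do not reproduce the reward in `quantum_reward_cal.py`.

```python
from typing import Callable, Sequence, Tuple

# A stage is a (name, check) pair; check returns True when the program passes.
Stage = Tuple[str, Callable[[str], bool]]


def staged_score(program: str, stages: Sequence[Stage]) -> Tuple[float, str]:
    """Run stages in order and stop at the first failure.

    Returns (fraction of stages passed, name of the last stage passed or "").
    """
    passed = 0
    last = ""
    for name, check in stages:
        if not check(program):
            break
        passed += 1
        last = name
    return (passed / len(stages) if stages else 0.0), last


# Hypothetical toy stages; real stages would parse and simulate the circuit.
TOY_STAGES = [
    ("syntax", lambda program: program.startswith("OPENQASM 3")),
    ("behavior", lambda program: "measure" in program),
]
```

With these toy stages, a program that passes the header check but contains no measurement earns half credit and reports `syntax` as the last stage passed.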
The hardware/QPU path is optional for artifact evaluation unless explicitly requested. Figure 10 can be regenerated from archived logs without live QPU access:

    cd ibm
    python3.10 -m venv .venv
    source .venv/bin/activate
    python -m pip install -r requirements.txt
    ./reproduce_plots.sh

Figure 10(a), optimality-gap agreement:

- raw data: `ibm/out_instance_eval/instance_*.json`
- script: `ibm/src/analyze_instance_eval_results.py`
- regenerated output: `ibm/out_instance_eval/plots/gap_scatter.pdf`

Figure 10(b), scheduled execution duration:

- raw data: `ibm/out_speed/speed_benchmark_*.json`
- script: `ibm/src/plot_execution_time_benchmark.py`
- regenerated output: `ibm/out_speed/plots_time/scheduled_duration_boxplot.pdf`
The original ibm_torino backend was retired on April 1, 2026. Live hardware demos should use any currently available IBM Quantum backend and may produce different noisy hardware results; the archived logs are the reproducible path.
| Task | GPU needed? | Minimum / recommended resources | Notes |
|---|---|---|---|
| Show extracted Tables | No | Any Python 3.10 environment. | Uses only Python standard library. |
| Summarize included evaluation JSONs | No | Any Python 3.10 environment. | Uses only Python standard library. |
| Recompute metrics from raw generation JSONs | No | 4 CPU cores and 16 GB RAM for small smoke runs; more cores/RAM reduce runtime for full raw JSONs. | Requires the CPU evaluator environment. |
| Redraw figures from packaged plot data | No | 2 CPU cores and 8 GB RAM. | Requires Matplotlib, NumPy, Pandas, and pypdf. |
| IBM/hardware figures from archived logs | No | 2 CPU cores and 8 GB RAM. | Live QPU reruns require an IBM Quantum account, an active backend, and queue time; archived-log reproduction is CPU-only. |
| Generate new OpenQASM outputs from `Benyucong/rl_quantum_4b` | Yes, NVIDIA/CUDA | Minimum observed reviewer configuration: 1 NVIDIA P100 with 12 GB VRAM for the 50-sample checkpoint smoke path. Recommended: 1 NVIDIA GPU with at least 24 GB VRAM for fewer OOM retries; 40 GB+ is preferable for larger samples or longer contexts. | AMD/ROCm is not supported by the packaged CUDA dependency stack. |
| CUDA-Q/cuQuantum demos under `cuda-quantum/` | Yes, NVIDIA/CUDA | NVIDIA GPU plus Singularity/Apptainer or the root CUDA stack. | Optional path; not required for table/figure reproduction. |
| Full GRPO training | Yes | Paper-scale run used 8 AMD MI250X GPUs or 8 NVIDIA H100 GPUs for roughly two days. The checked-in CUDA environment is NVIDIA-oriented. | Not expected for kick-the-tires or checkpoint-based reproduction. |
Portable reviewer workflows are the Python entry points shown above. The following scripts are included for transparency but contain local HPC assumptions that must be edited before reuse:
| Files | Assumptions to change |
|---|---|
| `examples/train/quantum/train_hpc.sh`, `examples/train/quantum/train_hpc_4gpus.sh`, `demo.sh`, `convert-verl2huggingface.sh` | Slurm directives, Aalto-style GPU partition names such as `gpu-h200-141g-ellis`, GPU counts, memory limits, and local module names. |
| `examples/train/quantum/train_mn5.sh`, `examples/train/quantum/start-ray.sh`, `install_deps_offline.sh` | `/gpfs/projects/ehpc95` paths, `ehpc95` Slurm account/queue, Python 3.11.5 module, offline Hugging Face/cache settings, local venv, and hard-coded `PYTHONPATH`. |
| `quantum-code-generation/code/generation/*.sh`, `quantum-code-generation/code/evaluation/*_slurm.sh`, `quantum-code-generation/code/training/*_distributed.sh`, `quantum-code-generation/code/training/sft_single.sh` | Slurm partitions, mail addresses, `.venv` location, module names, old dataset/model names, and fixed sample counts. |
| `cuda-quantum/program.sh` | `/scratch/cs/adis/yuc10/...` work directory, Singularity image path, Slurm GPU allocation, and NVIDIA `--nv` container execution. |
| `pack_deps.sh`, `install_deps_offline.sh` | Maintainer-oriented wheel-cache creation/installation using local module names and directory layout. Prefer the pinned requirements files above for normal artifact evaluation. |
When adapting these scripts, first replace absolute paths, Slurm #SBATCH settings, module loads, cache directories, W&B/Hugging Face settings, and GPU counts. The scripts should not be assumed portable as-is.
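A quick scan can show how much of a launcher is site-specific before it is adapted. The sketch below flags `#SBATCH` directives, `module` commands, and absolute paths under `/gpfs` or `/scratch` (the two prefixes named in the table above). The patterns are illustrative assumptions and will not catch every local setting, such as W&B or Hugging Face cache variables; the helper name `flag_site_specific` is hypothetical.

```python
import re

# Illustrative patterns for settings the README says must be replaced.
SITE_PATTERNS = {
    "slurm": re.compile(r"^\s*#SBATCH\b"),
    "module": re.compile(r"^\s*module\s+(?:load|use|purge)\b"),
    "abs_path": re.compile(r"/(?:gpfs|scratch)/\S*"),
}


def flag_site_specific(script_text):
    """Return (line_number, kind, line) for lines that likely need local edits."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        for kind, pattern in SITE_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, kind, line.strip()))
    return hits


# Example usage from the repository root:
#   from pathlib import Path
#   text = Path("examples/train/quantum/train_mn5.sh").read_text(encoding="utf-8")
#   for number, kind, line in flag_site_specific(text):
#       print(f"{number}: [{kind}] {line}")
```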
The targeted artifact badges are Available, Functional, and Results Reproduced. Functional and Results Reproduced require the raw generation outputs, final evaluation summaries, plot-ready data, and hardware logs described in the artifact README. Available is supported by the Zenodo archive for the submitted snapshot: https://doi.org/10.5281/zenodo.19712131.
This repository retains the upstream framework code and license information from VerlTool where applicable. The Vista-specific artifact code, quantum verifier/reward integration, generation/evaluation wrappers, and table/figure utilities are provided for the CAIS 2026 artifact evaluation workflow.