Testbed for evaluating GPU simulators with an emphasis on LLM/inference workloads.
- Provide a repeatable way to run and compare simulators on the same workloads/configs.
- Track setup notes, pitfalls, and reproducible scripts for each simulator.
- Keep results and configs versioned.
Initialize submodules:
`git submodule update --init --recursive`

Create the Pixi environment (Python + PyPI deps):
`pixi install`

Run commands inside the environment:
`pixi run python -c "import torch; print(torch.__version__)"`

Optional: configure `CODEX_HOME` and an HTTP proxy for tooling:
`./setup-envs.sh --proxy auto`

- Tutorial: `docs/tutorial/howto/tut-paper-fidelity-static-and-dynamic.md`
- Paper models sweep quickstart: `specs/003-paper-fidelity-more-models/quickstart.md`
- Sweep runner script: `scripts/paper_fidelity_sweep.sh`
- Sweep log runbook: `docs/runbooks/paper_fidelity_sweep.md`
Large machine-local assets are managed under models/ and datasets/ using an “external
reference” pattern (docs + bootstrap scripts are committed; the actual data is not).
- Bootstrap everything:
  - `bash models/bootstrap.sh`
  - `bash datasets/bootstrap.sh`
- Per-reference bootstraps:
  - Qwen3 model: `bash models/qwen3-0.6b/bootstrap.sh`
  - COCO 2017: `bash datasets/coco2017/bootstrap.sh`
Environment variables:
- Models: `GSIM_MODELS_ROOT` (legacy: `EXTERNAL_REF_ROOT`)
- Datasets: `GSIM_DATASETS_ROOT` (legacy: `EXTERNAL_REF_ROOT`)
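The variable lookup above could be sketched roughly as follows. This is a minimal sketch, not the logic of the repo's bootstrap scripts: the helper name `resolve_root` is hypothetical, and it assumes the new variable takes precedence over the legacy `EXTERNAL_REF_ROOT`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_root(primary: str,
                 legacy: str = "EXTERNAL_REF_ROOT",
                 env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the first non-empty root among `primary` and `legacy`, else None.

    Assumption: the new-style variable wins when both are set.
    """
    env = os.environ if env is None else env
    for name in (primary, legacy):
        value = env.get(name, "").strip()
        if value:
            return Path(value).expanduser()
    return None


if __name__ == "__main__":
    print("models root:", resolve_root("GSIM_MODELS_ROOT"))
    print("datasets root:", resolve_root("GSIM_DATASETS_ROOT"))
```

Passing an explicit `env` mapping keeps the helper easy to test without touching the real environment.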
The Vidur paper (MLSys'24; submodule at `extern/tracked/vidur`) evaluates the following LLMs, and their names/IDs show up in configs and notes in this repo:
| Paper name | Vidur/HF model id | Params | Layers | Embedding | Attn heads | Attention type | ModelScope |
|---|---|---|---|---|---|---|---|
| LLaMA2-7B | meta-llama/Llama-2-7b-hf | 7B | 32 | 4096 | 32 | Multi-Head Attention | https://modelscope.cn/models/meta-llama/Llama-2-7b-hf |
| LLaMA2-70B | meta-llama/Llama-2-70b-hf | 70B | 80 | 8192 | 64 | Grouped-Query Attention | https://modelscope.cn/models/meta-llama/Llama-2-70b-hf |
| InternLM-20B | internlm/internlm-20b | 20B | 60 | 5120 | 40 | Multi-Head Attention | https://modelscope.cn/models/internlm/internlm-20b |
| Qwen-72B | Qwen/Qwen-72B | 72B | 80 | 8192 | 64 | Multi-Head Attention | https://modelscope.cn/models/Qwen/Qwen-72B |
Stats source: Vidur’s config-explorer demo page (extern/tracked/vidur/vidur/config_optimizer/analyzer/dashboard/intro_page.py).
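For scripting against these models, the table can be mirrored as a small data structure. The sketch below is illustrative only: the `PaperModel` class and `PAPER_MODELS` dict are hypothetical names, not part of Vidur or this repo, and the derived `head_dim` assumes the embedding width divides evenly across attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperModel:
    model_id: str    # Vidur/HF model id from the table
    layers: int      # number of transformer layers
    embedding: int   # embedding (hidden) width
    attn_heads: int  # number of attention (query) heads

    @property
    def head_dim(self) -> int:
        # Per-head width; assumes the embedding splits evenly across heads.
        return self.embedding // self.attn_heads


# Values copied from the table above.
PAPER_MODELS = {
    "LLaMA2-7B": PaperModel("meta-llama/Llama-2-7b-hf", 32, 4096, 32),
    "LLaMA2-70B": PaperModel("meta-llama/Llama-2-70b-hf", 80, 8192, 64),
    "InternLM-20B": PaperModel("internlm/internlm-20b", 60, 5120, 40),
    "Qwen-72B": PaperModel("Qwen/Qwen-72B", 80, 8192, 64),
}

for name, model in PAPER_MODELS.items():
    print(f"{name}: {model.model_id} head_dim={model.head_dim}")
```

With the numbers in the table, every model works out to a per-head width of 128.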
ModelScope SDK supports downloading a full model repo snapshot via snapshot_download (official docs: https://modelscope.cn/docs/%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%8B%E8%BD%BD).
- Install the SDK: `pixi run python -m pip install -U modelscope`
- (Optional) Set an access token (needed for gated/private models): `export MODELSCOPE_API_TOKEN="..."`
  Token page: https://modelscope.cn/my/myaccesstoken
- Download a model into your external storage (pick a `local_dir` under your `GSIM_MODELS_ROOT` / scratch disk):
  `pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('Qwen/Qwen-72B', local_dir='PATH/TO/Qwen-72B'))"`

Examples (one per model):
- `pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('meta-llama/Llama-2-7b-hf', local_dir='PATH/TO/Llama-2-7b-hf'))"`
- `pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('meta-llama/Llama-2-70b-hf', local_dir='PATH/TO/Llama-2-70b-hf'))"`
- `pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('internlm/internlm-20b', local_dir='PATH/TO/internlm-20b'))"`
- `pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('Qwen/Qwen-72B', local_dir='PATH/TO/Qwen-72B'))"`

Notes:
- `snapshot_download(..., local_dir=...)` downloads the repo into that directory; `cache_dir=...` uses the ModelScope cache layout.
- You can filter downloads with `allow_patterns=` / `ignore_patterns=` (e.g., only `*.safetensors` + configs).
- LLaMA2 checkpoints are typically license-gated; you may need to accept the upstream license terms before downloads succeed.
- 70B/72B checkpoints are very large; plan disk space and download time accordingly.
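The one-liners above can also be wrapped in a small script. The following is a minimal sketch, assuming a recent `modelscope` release in which `snapshot_download` accepts `local_dir` and `allow_patterns` as described in the notes. The helper names `build_download_kwargs` and `download` are hypothetical, as is the choice of `*.json` to stand in for "configs"; some tokenizer files may need additional patterns.

```python
from pathlib import Path


def build_download_kwargs(model_id: str, models_root: str,
                          weights_only: bool = False) -> dict:
    """Build keyword arguments for snapshot_download (hypothetical helper).

    The target directory is <models_root>/<last path component of model_id>.
    """
    local_dir = Path(models_root) / model_id.split("/")[-1]
    kwargs = {"local_dir": str(local_dir)}
    if weights_only:
        # Assumption: safetensors weights plus JSON configs are sufficient.
        kwargs["allow_patterns"] = ["*.safetensors", "*.json"]
    return kwargs


def download(model_id: str, models_root: str, weights_only: bool = False) -> str:
    """Download a model snapshot and return the local path."""
    # Imported lazily so the helper above works without modelscope installed.
    from modelscope.hub.snapshot_download import snapshot_download

    return snapshot_download(
        model_id, **build_download_kwargs(model_id, models_root, weights_only)
    )
```

For example, `download("Qwen/Qwen-72B", "PATH/TO/models", weights_only=True)` would place the snapshot under `PATH/TO/models/Qwen-72B`.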
First simulator to try:
- Vidur (Microsoft): https://github.com/microsoft/vidur
Other candidates:
- RealLM (Bespoke Silicon Group): https://github.com/bespoke-silicon-group/reallm
- LLMCompass (Princeton University): https://github.com/PrincetonUniversity/LLMCompass
- Accel-Sim Framework: https://github.com/accel-sim/accel-sim-framework
A: vidur-sim runs on CPU. It uses a GPU-generated profiling bundle (e.g. A100 kernel timing + comm profiles) and a performance model to simulate/predict GPU execution time for a workload.
A: Yes — you should interpret the comparison as simulated GPU latency prediction vs measured GPU latency. It is meaningful, but it is not “two measurements of the same runtime”, so expect gaps:
- The simulator’s accuracy depends on how well the profiling/model matches your real inference stack (kernels, precision, scheduler/batching, KV-cache behavior, etc.).
- In this repo’s current paper-fidelity workflow, CPU overhead modeling is counted by default (`scenario.vidur.skip_cpu_overhead_modeling=false`); disable it only if you intentionally want to exclude CPU-side costs (the final `summary.md` will warn). The legacy `vidur-sim` path still disables CPU overhead modeling by default.
- Token-level latencies from `vidur-sim` are derived from request-level metrics (not measured token-by-token); see `src/gpu_simulate_test/vidur_ext/sim_runner.py`.
- `real-bench` replays requests sequentially; if your workload has `arrival_time_ns=0` for many requests, later requests’ `ttft_ns` will include queueing behind earlier ones, which can distort direct TTFT comparisons.
- For a “no queueing” baseline with `real-bench`, set `workload.arrival.inter_arrival_ns` large enough that each request completes before the next arrives (check `completion_time_ns[i] <= arrival_time_ns[i+1]` in the real run’s `request_metrics.csv`).
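The "no queueing" check in the last bullet can be automated along these lines. This is a minimal sketch, assuming `request_metrics.csv` has integer `arrival_time_ns` and `completion_time_ns` columns on the same clock; the function name `find_queueing_violations` is hypothetical, and sorting rows by arrival time is an added assumption in case the file is not already ordered.

```python
import csv
import io
from typing import List, TextIO


def find_queueing_violations(csv_file: TextIO) -> List[int]:
    """Return indices i (in arrival order) where request i completes
    after request i+1 has already arrived."""
    rows = sorted(
        (
            (int(row["arrival_time_ns"]), int(row["completion_time_ns"]))
            for row in csv.DictReader(csv_file)
        ),
        key=lambda pair: pair[0],  # order by arrival time
    )
    return [
        i for i in range(len(rows) - 1)
        if rows[i][1] > rows[i + 1][0]  # completion[i] > arrival[i+1]
    ]


# Tiny in-memory example; a real run would pass open("request_metrics.csv").
sample = io.StringIO(
    "arrival_time_ns,completion_time_ns\n"
    "0,100\n"
    "150,300\n"
    "250,400\n"
)
print(find_queueing_violations(sample))  # prints [1]
```

An empty result means every request finished before the next one arrived, so TTFT values are free of queueing delay under these assumptions.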
This repo’s current Vidur profiling flows do not generate network (collectives) profiles on the current host; they
only copy vendor-provided all_reduce.csv / send_recv.csv by hardware.network_device. For TP/PP>1, you must
manually profile and stage these CSVs into your profiling root.
- Howto: `context/summaries/vidur-kb/howto-profile-vidur-network-collectives.md`
- Issue: `context/issues/known/issue-vidur-network-profiling-not-generated-on-host.md`
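Before a TP/PP>1 run, it may help to confirm that the two network CSVs have actually been staged. The sketch below only checks that the files exist in a directory you point it at. The helper name `missing_network_profiles` is hypothetical, and the layout of the profiling root is not described here, so the caller must supply the right directory.

```python
from pathlib import Path
from typing import List

# File names taken from the text above.
REQUIRED_NETWORK_CSVS = ("all_reduce.csv", "send_recv.csv")


def missing_network_profiles(network_dir: str) -> List[str]:
    """Return the required network profile CSVs that are absent from
    `network_dir` (a caller-supplied directory; its layout is an assumption)."""
    base = Path(network_dir)
    return [name for name in REQUIRED_NETWORK_CSVS if not (base / name).is_file()]
```

A non-empty return value lists the files that still need to be profiled and staged manually.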
- `src/gpu_simulate_test/`: Python package code (src layout).
- `extern/`: third-party code (git submodules under `extern/tracked/`).
- `models/`: external model references (symlink-based, not committed).
- `datasets/`: external dataset references (symlink-based, not committed).
- `context/`: project notes, runbooks, and experiment/repro docs.
- `scripts/`: small helper scripts/entrypoints (repo-owned).
- `tests/`: test skeleton (`unit/`, `integration/`, `manual/`).
- `tmp/`: local scratch space (ignored by git).