gpu-simulate-test

Testbed for evaluating GPU simulators with an emphasis on LLM/inference workloads.

Goals

Provide a repeatable way to run and compare simulators on the same workloads/configs.
Track setup notes, pitfalls, and reproducible scripts for each simulator.
Keep results and configs versioned.

Quickstart

Initialize submodules:

git submodule update --init --recursive

Create the Pixi environment (Python + PyPI deps):

pixi install

Run commands inside the environment:

pixi run python -c "import torch; print(torch.__version__)"

Optional: configure CODEX_HOME and an HTTP proxy for tooling:

./setup-envs.sh --proxy auto

Paper-fidelity (Vidur sim vs Sarathi replay)

Tutorial: docs/tutorial/howto/tut-paper-fidelity-static-and-dynamic.md
Paper models sweep quickstart: specs/003-paper-fidelity-more-models/quickstart.md
Sweep runner script: scripts/paper_fidelity_sweep.sh
Sweep log runbook: docs/runbooks/paper_fidelity_sweep.md

External Assets (models/datasets)

Large machine-local assets are managed under models/ and datasets/ using an “external reference” pattern (docs + bootstrap scripts are committed; the actual data is not).

Bootstrap everything:
- bash models/bootstrap.sh
- bash datasets/bootstrap.sh
Per-reference bootstraps:
- Qwen3 model: bash models/qwen3-0.6b/bootstrap.sh
- COCO 2017: bash datasets/coco2017/bootstrap.sh

Environment variables:

Models: GSIM_MODELS_ROOT (legacy: EXTERNAL_REF_ROOT)
Datasets: GSIM_DATASETS_ROOT (legacy: EXTERNAL_REF_ROOT)
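
A minimal sketch of how a script might resolve these roots. It assumes the new variable wins over the legacy one when both are set; that precedence and the helper name resolve_root are illustrative assumptions, not confirmed behavior of the bootstrap scripts.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_root(
    primary: str,
    legacy: str = "EXTERNAL_REF_ROOT",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the first non-empty root among `primary` and `legacy`, else None.

    Assumption: the new variable takes precedence over the legacy one.
    """
    env = os.environ if env is None else env
    for name in (primary, legacy):
        value = env.get(name, "").strip()
        if value:
            return Path(value).expanduser()
    return None


if __name__ == "__main__":
    print("models root:", resolve_root("GSIM_MODELS_ROOT"))
    print("datasets root:", resolve_root("GSIM_DATASETS_ROOT"))
```

Passing an explicit env mapping keeps the helper easy to test without touching the real process environment.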

Vidur paper LLMs (reference + stats)

The Vidur paper (MLSys'24; code in the extern/tracked/vidur submodule) evaluates the LLMs below. Their names and IDs appear in configs and notes throughout this repo:

Paper name	Vidur/HF model id	Params	Layers	Embedding	Attn heads	Attention type	ModelScope
LLaMA2-7B	`meta-llama/Llama-2-7b-hf`	7B	32	4096	32	Multi-Head Attention	https://modelscope.cn/models/meta-llama/Llama-2-7b-hf
LLaMA2-70B	`meta-llama/Llama-2-70b-hf`	70B	80	8192	64	Grouped-Query Attention	https://modelscope.cn/models/meta-llama/Llama-2-70b-hf
InternLM-20B	`internlm/internlm-20b`	20B	60	5120	40	Multi-Head Attention	https://modelscope.cn/models/internlm/internlm-20b
Qwen-72B	`Qwen/Qwen-72B`	72B	80	8192	64	Multi-Head Attention	https://modelscope.cn/models/Qwen/Qwen-72B

Stats source: Vidur’s config-explorer demo page (extern/tracked/vidur/vidur/config_optimizer/analyzer/dashboard/intro_page.py).

Downloading with ModelScope

The ModelScope SDK can download a full model repo snapshot via snapshot_download (official docs: https://modelscope.cn/docs/%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%8B%E8%BD%BD).

Install the SDK:

pixi run python -m pip install -U modelscope

(Optional) set an access token (needed for gated/private models):

export MODELSCOPE_API_TOKEN="..."

Token page: https://modelscope.cn/my/myaccesstoken

Download a model into your external storage (pick a local_dir under your GSIM_MODELS_ROOT / scratch disk):

pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('Qwen/Qwen-72B', local_dir='PATH/TO/Qwen-72B'))"

Examples (one per model):

pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('meta-llama/Llama-2-7b-hf', local_dir='PATH/TO/Llama-2-7b-hf'))"
pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('meta-llama/Llama-2-70b-hf', local_dir='PATH/TO/Llama-2-70b-hf'))"
pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('internlm/internlm-20b', local_dir='PATH/TO/internlm-20b'))"
pixi run python -c "from modelscope.hub.snapshot_download import snapshot_download; print(snapshot_download('Qwen/Qwen-72B', local_dir='PATH/TO/Qwen-72B'))"

Notes:

snapshot_download(..., local_dir=...) downloads the repo into that directory; cache_dir=... uses the ModelScope cache layout.
You can filter downloads with allow_patterns= / ignore_patterns= (e.g., only *.safetensors + configs).
LLaMA2 checkpoints are typically license-gated; you may need to accept the upstream license terms before downloads succeed.
70B/72B checkpoints are very large; plan disk space and download time accordingly.
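
For disk planning, a rough lower bound on weight size is parameter count times bytes per parameter. The sketch below uses the nominal parameter counts from the table above and assumes 2 bytes per parameter (fp16/bf16). Actual repo snapshots can be larger, for example when a repo ships weights in more than one file format, so treat the numbers as estimates only.

```python
# Nominal parameter counts (in billions), taken from the table above.
NOMINAL_PARAMS_B = {
    "meta-llama/Llama-2-7b-hf": 7,
    "meta-llama/Llama-2-70b-hf": 70,
    "internlm/internlm-20b": 20,
    "Qwen/Qwen-72B": 72,
}


def estimate_weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough weight size in GB (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (fp16/bf16) unless stated otherwise.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9


for model_id, params_b in NOMINAL_PARAMS_B.items():
    print(f"{model_id}: ~{estimate_weights_gb(params_b):.0f} GB of weights")
```

Under these assumptions, the 70B and 72B checkpoints need roughly 140 GB and 144 GB for the weights alone.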

Simulators

First simulator to try:

Vidur (Microsoft): https://github.com/microsoft/vidur

Other candidates:

RealLM (Bespoke Silicon Group): https://github.com/bespoke-silicon-group/reallm
LLMCompass (Princeton University): https://github.com/PrincetonUniversity/LLMCompass
Accel-Sim Framework: https://github.com/accel-sim/accel-sim-framework

Q&A (Vidur simulation vs real GPU timing)

Q: Does `vidur-sim` run on CPU or GPU?

A: vidur-sim runs on CPU. It uses a GPU-generated profiling bundle (e.g. A100 kernel timing + comm profiles) and a performance model to simulate/predict GPU execution time for a workload.

Q: Is it meaningful to compare `vidur-sim` results with real GPU inference timing?

A: Yes, as long as you interpret the comparison as simulated GPU latency prediction vs measured GPU latency. It is meaningful, but it is not “two measurements of the same runtime”, so expect gaps:

The simulator’s accuracy depends on how well the profiling/model matches your real inference stack (kernels, precision, scheduler/batching, KV-cache behavior, etc.).
In this repo’s current paper-fidelity workflow, CPU overhead modeling is enabled by default (scenario.vidur.skip_cpu_overhead_modeling=false); set it to true only if you intentionally want to exclude CPU-side costs (the final summary.md will warn). The legacy vidur-sim path still disables CPU overhead modeling by default.
Token-level latencies from vidur-sim are derived from request-level metrics (not measured token-by-token); see src/gpu_simulate_test/vidur_ext/sim_runner.py.
real-bench replays requests sequentially; if your workload has arrival_time_ns=0 for many requests, later requests’ ttft_ns will include queueing behind earlier ones, which can distort direct TTFT comparisons.
For a “no queueing” baseline with real-bench, set workload.arrival.inter_arrival_ns large enough that each request completes before the next arrives (check completion_time_ns[i] <= arrival_time_ns[i+1] in the real run’s request_metrics.csv).
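
The no-queueing check can be sketched as a small script. It assumes request_metrics.csv has integer-valued arrival_time_ns and completion_time_ns columns; the function name find_queueing_violations and the sample rows are made up for illustration.

```python
import csv
import io
from typing import List, TextIO, Tuple


def find_queueing_violations(csv_file: TextIO) -> List[Tuple[int, int, int]]:
    """Return (index, completion_time_ns, next_arrival_time_ns) for each
    request that completes after the next request has already arrived.

    `index` refers to the position after sorting rows by arrival time.
    """
    rows = [
        (int(r["arrival_time_ns"]), int(r["completion_time_ns"]))
        for r in csv.DictReader(csv_file)
    ]
    rows.sort(key=lambda row: row[0])  # order requests by arrival time
    violations = []
    for i in range(len(rows) - 1):
        completion, next_arrival = rows[i][1], rows[i + 1][0]
        if completion > next_arrival:
            violations.append((i, completion, next_arrival))
    return violations


# Tiny in-memory example (made-up numbers, not real measurements).
sample = io.StringIO(
    "arrival_time_ns,completion_time_ns\n"
    "0,500\n"
    "1000,1800\n"
    "1500,2600\n"
)
print(find_queueing_violations(sample))  # → [(1, 1800, 1500)]
```

An empty result means every request finished before the next one arrived, so TTFT values from that run should not include queueing delay. To check a real run, pass an open file handle for its request_metrics.csv instead of the in-memory sample.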

Known issue: network profiling on host

This repo’s current Vidur profiling flows do not generate network (collectives) profiles on the current host; they only copy the vendor-provided all_reduce.csv / send_recv.csv selected by hardware.network_device. For TP/PP>1, you must profile these collectives manually and stage the resulting CSVs into your profiling root.

Howto: context/summaries/vidur-kb/howto-profile-vidur-network-collectives.md
Issue: context/issues/known/issue-vidur-network-profiling-not-generated-on-host.md

Repo Layout

src/gpu_simulate_test/: Python package code (src layout).
extern/: third-party code (git submodules under extern/tracked/).
models/: external model references (symlink-based, not committed).
datasets/: external dataset references (symlink-based, not committed).
context/: project notes, runbooks, and experiment/repro docs.
scripts/: small helper scripts/entrypoints (repo-owned).
tests/: test skeleton (unit/, integration/, manual/).
tmp/: local scratch space (ignored by git).
