[AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up #13792
Open
Enderfga wants to merge 6 commits into
Follow-up to huggingface#13745. Extracts FAR mask construction to a module-level helper and adds an `attention_mask` forward kwarg so `AnyFlowFARTransformer3DModel` can be wrapped in `torch.compile(fullgraph=True)`. The pipeline pre-builds the mask during KV-cache prefill so users get end-to-end fullgraph compile.

* Public method `AnyFlowFARTransformer3DModel.build_attention_mask(...)` (modes: "train", "cache") plus private module-level helper `_build_anyflow_far_causal_block_mask(...)`.
* `_build_freqs` cache lookup/write bypassed under `torch.compiler.is_compiling()` to avoid a Dynamo guard recompile on the second compiled call (applied in bidi source; synced to FAR via `# Copied from`).
* `TestAnyFlowFARTransformer3DCompile(TorchCompileTesterMixin)`: recompilation_and_graph_break, repeated_blocks, and group_offloading pass on H200; AOT is `@pytest.mark.skip`'d (torch.export rejects BlockMask as a pytree input).
* Base `get_dummy_inputs` omits `attention_mask` so every non-compile test class exercises the in-forward fallback; the compile class overrides it to inject a pre-built mask.
* Bit-exact: pre-built path vs internal-build fallback max|Δ|=0.0e+00.
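The two compile-related patterns in this commit (an optional pre-built mask with an in-forward fallback, and a cache that is bypassed while tracing) can be sketched in a toy setting. This is a minimal sketch, not the diffusers implementation: `ToyFARModel`, `build_chunk_causal_mask`, and `compute_freqs` are hypothetical names, the chunk-causal masking rule is an assumption, and plain Python lists stand in for tensors and `BlockMask`. Only `torch.compiler.is_compiling()` is a real PyTorch API, and the sketch falls back to a stub when PyTorch is not installed.

```python
from typing import Dict, List, Optional, Sequence, Tuple

try:
    # Real PyTorch 2.x API referenced in the commit message.
    from torch.compiler import is_compiling
except ImportError:
    # Fallback so the sketch still runs without PyTorch installed.
    def is_compiling() -> bool:
        return False

Mask = List[List[bool]]


def build_chunk_causal_mask(chunk_partition: Sequence[int]) -> Mask:
    """Hypothetical stand-in for the module-level mask helper.

    Assumed rule: a position attends to its own chunk and all earlier chunks.
    """
    chunk_ids = [c for c, size in enumerate(chunk_partition) for _ in range(size)]
    return [[kv <= q for kv in chunk_ids] for q in chunk_ids]


def compute_freqs(length: int) -> Tuple[float, ...]:
    """Toy frequency table; the real `_build_freqs` output is not shown in the PR."""
    return tuple(1.0 / (10000.0 ** (i / length)) for i in range(length))


class ToyFARModel:
    """Toy model (not the diffusers class) showing both compile-friendly patterns."""

    def __init__(self) -> None:
        self._freqs_cache: Dict[int, Tuple[float, ...]] = {}

    def build_freqs(self, length: int) -> Tuple[float, ...]:
        if is_compiling():
            # While tracing, skip the cache so module state is neither read nor mutated.
            return compute_freqs(length)
        if length not in self._freqs_cache:
            self._freqs_cache[length] = compute_freqs(length)
        return self._freqs_cache[length]

    def forward(
        self,
        values: Sequence[float],
        chunk_partition: Sequence[int],
        attention_mask: Optional[Mask] = None,
    ) -> List[float]:
        if attention_mask is None:
            # Fallback: build the mask inside forward when the caller passes none.
            attention_mask = build_chunk_causal_mask(chunk_partition)
        # Masked mean over the positions each query is allowed to see.
        outputs = []
        for row in attention_mask:
            visible = [v for v, keep in zip(values, row) if keep]
            outputs.append(sum(visible) / len(visible))
        return outputs


model = ToyFARModel()
partition = (1, 2, 2)
values = [1.0, 2.0, 3.0, 4.0, 5.0]

# Pre-built path (what a pipeline would do once, outside the compiled region).
prebuilt = build_chunk_causal_mask(partition)
out_prebuilt = model.forward(values, partition, attention_mask=prebuilt)
# Fallback path (mask built inside forward).
out_fallback = model.forward(values, partition)
max_abs_diff = max(abs(a - b) for a, b in zip(out_prebuilt, out_fallback))
```

Because both paths consume the same mask, the comparison mirrors the bit-exact check quoted in the commit: the caller that wants a traceable forward builds the mask up front, and every other caller keeps the old behaviour.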
…e page

* Full author list and NVIDIA → NUS → MIT institution order; TL;DR + abstract + Available Models bullets.
* Rewritten pipeline-selection tip describing both pipelines symmetrically.
* T2V / I2V / V2V examples now use the canonical 81-frame setup and the demo prompts / conditioning assets shipped under `NVlabs/AnyFlow/assets/evaluation/` (linked via raw.githubusercontent.com).
* Drop the inline "Optimizing Memory" and "torch.compile" sections; those notes will live in the NVlabs/AnyFlow repo's own performance guide rather than the diffusers pipeline reference.
* Sync zh user guide and the two model-API stubs.
- `AnyFlowFARTransformer3DModel.__init__` now accepts `chunk_partition` via `@register_to_config` (default `(1, 3, 3, 3, 3, 3, 3, 2)` for the released 81-frame checkpoints, matching the field on Hub).
- `AnyFlowFARPipeline.__call__` no longer requires `chunk_partition`; it defaults to `self.transformer.config.chunk_partition`. Per-call override still supported for V2V / non-default `num_frames`.
- Drop the `AnyFlowFARPipeline.default_chunk_partition` class attribute.
- Update docs (en pipelines/models, zh using-diffusers) and the conversion script to match.
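The default-resolution rule described above (per-call override first, transformer config otherwise) can be illustrated as follows. This is a sketch under stated assumptions: `resolve_chunk_partition` and `num_latent_frames` are hypothetical helpers, the length check is added for illustration and is not confirmed to exist in the pipeline, and the `(T - 1) // 4 + 1` frame-to-latent mapping assumes a Wan-style VAE with 4x temporal compression.

```python
from typing import Optional, Sequence, Tuple

# Default quoted in the commit for the released 81-frame checkpoints.
CONFIG_CHUNK_PARTITION: Tuple[int, ...] = (1, 3, 3, 3, 3, 3, 3, 2)


def num_latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    """Assumption: a Wan-style VAE maps T frames to (T - 1) // 4 + 1 latent frames."""
    return (num_frames - 1) // temporal_compression + 1


def resolve_chunk_partition(
    config_partition: Sequence[int],
    override: Optional[Sequence[int]] = None,
    num_frames: int = 81,
) -> Tuple[int, ...]:
    """Hypothetical helper: prefer the per-call override, else the config default."""
    partition = tuple(override) if override is not None else tuple(config_partition)
    expected = num_latent_frames(num_frames)
    # Illustrative sanity check (an assumption, not confirmed pipeline behaviour).
    if sum(partition) != expected:
        raise ValueError(
            f"chunk_partition sums to {sum(partition)}, expected {expected} latent frames"
        )
    return partition


# No override: the config default covers the 81-frame case.
default_partition = resolve_chunk_partition(CONFIG_CHUNK_PARTITION)
# Override for a shorter, non-default clip length.
short_partition = resolve_chunk_partition(
    CONFIG_CHUNK_PARTITION, override=(1, 3, 3, 3, 1), num_frames=41
)
```

Under that assumption the released default sums to 21 latent frames, which is what 81 input frames would produce, so a non-default `num_frames` is the case where a per-call override is needed.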
Inside the per-chunk rollout loop, the local variable `timesteps` was reassigned to `self.scheduler.timesteps` after `set_timesteps()`. On the next chunk iteration the same name was passed back into `set_timesteps(timesteps=...)`, where a non-None value enters the *custom-schedule* branch: `apply_shift` re-runs on already-shifted values, double-shifting the schedule for every chunk after the first.

Concretely, with `shift=5` and `num_inference_steps=4`:

- chunk 0 timesteps: [1000, 937.5, 833.3, 625] (correct)
- chunk 1+ timesteps: [1000, 986.8, 961.3, 892.9] (double-shifted)

The later steps drift toward `t=1000` instead of toward `t=0`, the flow-map model is conditioned on the wrong source sigma, and the chunk KV cache accumulates errors that show up as artifacts in later video frames.

Fix: rebind the cached schedule to a fresh local name (`scheduler_timesteps`) so the outer-scope `timesteps` kwarg (the user-provided custom schedule, when any) stays untouched across chunks.

Layer-by-layer verification against the NVlabs reference implementation on H200 (elephant prompt, seed 0, 4 NFE, 81 frames):

- chunk 0 inference: bit-exact (0.0 mean diff)
- chunk 1 step 0: 0.194 → 0.014 (-93%)
- chunk 7 last step: 0.564 → 0.274 (-51%)
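The shadowing mechanism can be reproduced with a toy scheduler. This is a minimal sketch, not the pipeline code: `ToyScheduler` and `rollout` are hypothetical, the base schedule is assumed to be evenly spaced sigmas `[1.0, 0.75, 0.5, 0.25]`, and `apply_shift` is assumed to be the common flow-matching shift `shift * sigma / (1 + (shift - 1) * sigma)`, which reproduces the chunk-0 values quoted above.

```python
from typing import List, Optional

NUM_TRAIN_TIMESTEPS = 1000


def apply_shift(sigma: float, shift: float) -> float:
    """Assumed shift formula (common in flow-matching schedulers)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


class ToyScheduler:
    """Hypothetical scheduler: both branches apply the shift, as described above."""

    def __init__(self, shift: float) -> None:
        self.shift = shift
        self.timesteps: List[float] = []

    def set_timesteps(
        self, num_inference_steps: int, timesteps: Optional[List[float]] = None
    ) -> None:
        if timesteps is None:
            # Default branch: evenly spaced sigmas in (0, 1].
            sigmas = [1.0 - i / num_inference_steps for i in range(num_inference_steps)]
        else:
            # Custom-schedule branch: treats the input as an unshifted schedule.
            sigmas = [t / NUM_TRAIN_TIMESTEPS for t in timesteps]
        self.timesteps = [NUM_TRAIN_TIMESTEPS * apply_shift(s, self.shift) for s in sigmas]


def rollout(num_chunks: int, timesteps: Optional[List[float]] = None, fixed: bool = True):
    """Return the rounded schedule actually used for each chunk."""
    scheduler = ToyScheduler(shift=5.0)
    per_chunk = []
    for _ in range(num_chunks):
        scheduler.set_timesteps(4, timesteps=timesteps)
        if fixed:
            # Fix: a fresh local name leaves the outer `timesteps` kwarg untouched.
            scheduler_timesteps = scheduler.timesteps
        else:
            # Bug: the kwarg is overwritten and fed back in on the next chunk.
            timesteps = scheduler.timesteps
            scheduler_timesteps = timesteps
        per_chunk.append([round(t, 1) for t in scheduler_timesteps])
    return per_chunk


buggy = rollout(num_chunks=2, fixed=False)
fixed = rollout(num_chunks=2, fixed=True)
```

In the buggy rollout the second chunk's interior timesteps all move toward 1000, while the fixed rollout yields the same schedule for both chunks. The toy only runs two chunks; how the real scheduler behaves on later chunks is not modelled here.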
Pure rewrap to satisfy `doc-builder style --max_len 119`. Two docstrings introduced in 96077b2 (the `chunk_partition` config arg on the FAR transformer + the matching pipeline kwarg) wrapped a few characters short of the line budget. No semantic change.
…pers, say chunk-wise

- Remove author-name attributions from the transformer / pipeline class docstrings and file-header comments; the paper-citation header on the doc page keeps the full author list, the in-code references just point at the [AnyFlow] / [FAR] papers.
- Link FAR via its Hugging Face papers page (https://huggingface.co/papers/2503.19325) instead of a raw arxiv.org URL, matching the AnyFlow reference style and the rest of the diffusers docs.
- Describe AnyFlow FAR generation as "chunk-wise autoregressive": the pipeline autoregresses over chunks (`chunk_partition`), not single frames.
Summary
Follow-up to #13745. Started as @dg845's `torch.compile(fullgraph=True)` ask (discussion_r3286032020); along the way we also (1) migrated `chunk_partition` from a pipeline class attribute into the transformer config (so the diffusers code matches the field already baked into the released checkpoints on the Hub), (2) refreshed the docs with the full author list and the upstream demo prompts/assets, and (3) fixed a per-chunk `timesteps` shadowing bug in the FAR pipeline rollout that was the actual root cause of the precision drift spotted in earlier FAR generations.
After the commits in this branch, diffusers code, the pushed Hub checkpoint configs, and the upstream `NVlabs/AnyFlow` reference are all in sync, and the released T2V / I2V / V2V demos in the doc page reproduce NVlabs-equivalent quality at the same seed.
What's in this branch
1. `torch.compile(fullgraph=True)` support (f4c7af8)

Direct response to dg845's quoted suggestion:

- `AnyFlowFARTransformer3DModel.build_attention_mask(...)`: new public method returning a `BlockMask` for a given chunk layout. Two modes: `"train"` (matches `_forward_train`) and `"cache"` (matches `_forward_cache`). The autoregressive `_forward_inference` path attends through the KV cache and doesn't consume a full mask, so it has no mode.
- `attention_mask: Optional[BlockMask] = None` kwarg on `forward()`, threaded into `_forward_train` and `_forward_cache`. When provided, the in-forward `create_block_mask(_compile=False)` call is skipped, making the forward graph-traceable under `torch.compile(fullgraph=True)`. The optional-with-fallback pattern matches LTX2's `prepare_video_coords` (`transformer_ltx2.py:1447-1450`). Not declared on `_forward_inference` (per `.ai/models.md`: don't declare a param you ignore).
- `_build_freqs` compile-safe: cache lookup/write is bypassed under `torch.compiler.is_compiling()` so mutating `self._freqs_cache` doesn't trip a Dynamo guard on the second compiled call. Eager behaviour unchanged. (Same edit in the bidi transformer via `# Copied from` sync.)
- `AnyFlowFARPipeline.encode_kv_cache` pre-builds the mask via `transformer.build_attention_mask(mode="cache", ...)` and passes it in, so users can wrap `pipe.transformer` in `torch.compile(fullgraph=True)` end-to-end.

2. Docs refresh (eb7c869)

- T2V / I2V / V2V examples use the canonical 81-frame setup and the demo prompts / conditioning assets shipped under `NVlabs/AnyFlow/assets/evaluation/`.

3. `chunk_partition` migrated into transformer config (96077b2)

Originally a `default_chunk_partition` class attribute on `AnyFlowFARPipeline`. The released Hub checkpoint configs already carried a `chunk_partition: [1, 3, 3, 3, 3, 3, 3, 2]` field, but the diffusers ctor didn't accept it, so it was silently dropped. Now:

- `AnyFlowFARTransformer3DModel.__init__(... chunk_partition: Tuple[int, ...] = (1, 3, 3, 3, 3, 3, 3, 2))` via `@register_to_config`.
- `AnyFlowFARPipeline.__call__`'s `chunk_partition` kwarg defaults to `self.transformer.config.chunk_partition` instead of a hard-coded class attribute. Per-call override still supported for V2V / non-default `num_frames`.

4. Bug fix: `timesteps` shadowing across chunks (1380957)

Inside `AnyFlowFARPipeline`'s per-chunk rollout, the outer-scope `timesteps` kwarg (user-supplied custom schedule, normally `None`) was being clobbered.

After chunk 0, the local `timesteps` held `self.scheduler.timesteps` (already shifted by `apply_shift`). The next chunk fed this back into `set_timesteps(timesteps=...)`, which enters the custom-schedule branch and re-applies the shift. For `shift=5`, `num_inference_steps=4`:

| Chunk | Expected timesteps | Timesteps before fix | |
|---|---|---|---|
| 0 | `[1000, 937.5, 833.3, 625]` | `[1000, 937.5, 833.3, 625]` | ✓ |
| 1+ | `[1000, 937.5, 833.3, 625]` | `[1000, 986.8, 961.3, 892.9]` | ✗ |

Chunks 1+ ran with the wrong source timestep, the flow-map model was conditioned on a sigma that didn't match the actual noise level, and KV-cache errors accumulated chunk-over-chunk. End result: visible artifacts in later video frames (elephant trunk fragmentation, color drift in the FAR T2V demo).

Layer-by-layer compare against NVlabs (elephant prompt, seed 0, 4 NFE, 81 frames) before/after the fix:

- `chunk_0_inference_step_0`, `chunk_0_inference_step_3`: bit-exact (0.0 mean diff)
- `chunk_1_inference_step_0`: 0.194 → 0.014 (-93%)
- `chunk_7_inference_step_3`: 0.564 → 0.274 (-51%)

5. doc-builder rewrap (1867e98)

Pure cosmetic: two `chunk_partition` docstrings introduced in (3) wrapped a few chars short of the 119-char budget. `doc-builder style --max_len 119` rewrap, no semantic change.

Verification (H200, torch 2.11.0+cu128)
Compile tests (`TorchCompileTesterMixin`): 3 passed, 2 skipped.

Bit-exact between pre-built-mask path and internal-build fallback: `max|Δ| = 0.000e+00`.

End-to-end demo regeneration: the T2V / I2V / V2V snippets shown in the new docs page were each re-run with seed 0 on H200 against `nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`; visual quality matches the upstream `NVlabs/AnyFlow` demo output frame-for-frame after the timesteps fix.

Code quality gates:

- `make fix-copies`: clean
- `make style` + `make quality`: clean
- `doc-builder style --max_len 119 --check_only`: clean
- `ruff check` / `ruff format --check`: clean

Hub checkpoint alignment
The four released checkpoints have been updated in-place so their `model_index.json` / `scheduler/scheduler_config.json` / `transformer/config.json` reference the diffusers-merged class names (`AnyFlowFARPipeline`, `AnyFlowFARTransformer3DModel`, `FlowMapEulerDiscreteScheduler`) and carry the `chunk_partition` config field consumed by this PR:

- `nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`
- `nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`
- `nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`
- `nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`

With this PR landed, diffusers code, Hub configs, and the upstream `NVlabs/AnyFlow` reference implementation are all in sync. Generation quality from `AnyFlowFARPipeline.from_pretrained("nvidia/AnyFlow-FAR-...")` is verified to match the upstream FAR pipeline output at matching seeds.

Compatibility
- `attention_mask` defaults to `None`; train / cache paths fall back to internal construction exactly as before, so existing training scripts and out-of-tree users of `AnyFlowFARTransformer3DModel.forward()` are unaffected.
- `chunk_partition` is now an optional ctor arg with a default matching the released checkpoints, so old `AnyFlowFARTransformer3DModel(...)` instantiations without it continue to work.
- `AnyFlowFARPipeline.__call__`'s `chunk_partition` kwarg is unchanged in signature; only the internal default source moved from a class attribute to `self.transformer.config.chunk_partition`.

Test plan
- `make fix-copies` clean
- `make style` + `make quality` clean
- `doc-builder style --max_len 119 --check_only` clean
- `TorchCompileTesterMixin`: 3 passed, 2 skipped
- Layer-by-layer compare against `NVlabs/AnyFlow` (post-fix: chunk 0 bit-exact, chunk N drift reduced 50–93%)
- End-to-end demo regeneration on `nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`: match upstream output

cc @dg845