
Reduce FLUX int8 test peak memory with sequential offload #13776

Open. jiqing-feng wants to merge 6 commits into huggingface:main from jiqing-feng:test_xpu


Conversation

@jiqing-feng (Contributor) commented May 21, 2026

Summary

Update the slow FLUX bitsandbytes int8 tests to use sequential CPU offload instead of model CPU offload.

enable_model_cpu_offload() can move an entire sub-model onto the GPU at once. For black-forest-labs/FLUX.1-dev, this can OOM on <=24 GB cards even when the T5 encoder and transformer are loaded from the pre-quantized int8 test checkpoint. Sequential CPU offload keeps peak memory lower by materializing one layer at a time, which lets the int8 FLUX tests run in more constrained environments.
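The difference in peak memory can be illustrated with a toy calculation. This is only a sketch, not how diffusers or accelerate account for memory: the component names and layer sizes are made-up placeholders, activations are ignored, and it assumes model offload keeps one whole sub-model resident while sequential offload keeps one layer resident.

```python
# Toy model of peak accelerator memory under the two offload strategies.
# All sizes are hypothetical placeholders in GiB, not measurements of FLUX.1-dev.
HYPOTHETICAL_PIPELINE = {
    "text_encoder": [0.5] * 8,
    "transformer": [0.75] * 16,
    "vae": [0.25] * 2,
}


def peak_model_offload(pipeline):
    """A whole sub-model is resident at once: peak is the largest sub-model."""
    return max(sum(layers) for layers in pipeline.values())


def peak_sequential_offload(pipeline):
    """One layer is resident at a time: peak is the largest single layer."""
    return max(max(layers) for layers in pipeline.values())


print("model offload peak (GiB):", peak_model_offload(HYPOTHETICAL_PIPELINE))
print("sequential offload peak (GiB):", peak_sequential_offload(HYPOTHETICAL_PIPELINE))
```

Under these assumptions the sequential peak is bounded by the largest layer rather than the largest sub-model, at the cost of more host-to-device transfers.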

The LoRA-loading assertion tolerance is also relaxed from 1e-3 to 2e-3 to account for small backend-specific numerical differences observed in the slow int8 path.
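To make the tolerance change concrete, here is a plain-Python stand-in for a cosine-distance check. It is a sketch only: the helper name `cosine_distance` and the two vectors are hypothetical and are not taken from the test suite, which may compute the metric differently.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical values standing in for an expected and an observed output slice.
expected = [0.5, 0.25, 0.75, 1.0]
observed = [0.5, 0.25, 0.75, 0.9]

distance = cosine_distance(expected, observed)
passes_old_tolerance = distance < 1e-3  # tolerance before this PR
passes_new_tolerance = distance < 2e-3  # relaxed tolerance
print(distance, passes_old_tolerance, passes_new_tolerance)
```

With these illustrative vectors the distance falls between the two thresholds, so the check fails at 1e-3 and passes at 2e-3.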

Changes

  • Switch SlowBnb8bitFluxTests setup from enable_model_cpu_offload() to enable_sequential_cpu_offload().
  • Document why sequential offload is needed for the FLUX int8 slow tests.
  • Relax the test_lora_loading cosine-distance tolerance to 2e-3.

Validation

Run the affected slow tests:

RUN_SLOW=1 python -m pytest \
  tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitFluxTests::test_quality \
  tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitFluxTests::test_lora_loading \
  -x -s

Observed result:

2 passed

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@github-actions bot added the tests and size/S (PR with diff < 50 LOC) labels May 21, 2026
@jiqing-feng changed the title from "Fix OOM on int8 tests" to "Reduce FLUX int8 test peak memory with sequential offload" May 21, 2026
@jiqing-feng (Contributor, Author) commented:

This PR requires huggingface/accelerate#4044, which has been merged.

@jiqing-feng (Contributor, Author) commented:

Hi @sayakpaul. Would you please review the PR? Thanks!

# enable_model_cpu_offload moves an entire sub-model to GPU at once, which OOMs on
# <=24 GB cards for FLUX.1-dev even with int8 quantization.
# This requires the bitsandbytes fix that preserves Int8Params.SCB across .to() calls.
self.pipeline_8bit.enable_sequential_cpu_offload()
Member commented:

Why do we keep making the same kind of change? If something fails in your particular environment, it is better to guard the change accordingly than to apply it unconditionally like this.

@jiqing-feng (Contributor, Author) replied:

Thanks for the feedback! Updated to guard by device memory instead of unconditionally switching:

_, total_mem = torch.accelerator.get_memory_info(0)
if total_mem <= 25 * (1024**3):
    self.pipeline_8bit.enable_sequential_cpu_offload()
else:
    self.pipeline_8bit.enable_model_cpu_offload()

This keeps the original enable_model_cpu_offload path on large-memory devices and only falls back to sequential offload on ≤24 GB cards. torch.accelerator works across CUDA/XPU/ROCm.
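The guard can also be written so that the decision is separate from the device query, which makes it testable without an accelerator. This is a sketch only: `choose_offload_method` and `apply_offload` are hypothetical helpers that are not part of this PR, and the 25 GiB default simply mirrors the threshold in the snippet above.

```python
GIB = 1024**3


def choose_offload_method(total_mem_bytes, threshold_bytes=25 * GIB):
    """Return the name of the pipeline offload method to call.

    Devices at or below the threshold get sequential offload;
    larger devices keep the original model offload path.
    """
    if total_mem_bytes <= threshold_bytes:
        return "enable_sequential_cpu_offload"
    return "enable_model_cpu_offload"


def apply_offload(pipeline, total_mem_bytes):
    """Look up the chosen method on the pipeline object and call it."""
    getattr(pipeline, choose_offload_method(total_mem_bytes))()
```

In the test setup, `total_mem_bytes` would come from the device memory query shown above, and `pipeline` would be `self.pipeline_8bit`.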

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>