fix(compactor): await HeartBeat goroutine to fix flaky TestBlocksCleaner on arm64#7566
Closed
sandy2008 wants to merge 1 commit into
Closed
fix(compactor): await HeartBeat goroutine to fix flaky TestBlocksCleaner on arm64#7566sandy2008 wants to merge 1 commit into
sandy2008 wants to merge 1 commit into
Conversation
BlocksCleaner.cleanUpActiveUsers and cleanDeletedUsers each launched a VisitMarkerManager.HeartBeat goroutine and only signalled it via errChan <- nil before returning, never awaiting the goroutine. When HeartBeat received the nil it still ran a terminal MarkWithStatus( Completed) using the parent ctx followed by DeleteVisitMarker( context.Background()) against the bucket. In tests using a filesystem bucket, those writes raced with t.Cleanup's RemoveAll on the temp dir, producing the arm64-flaky failure: unlinkat .../visit-mark.json: directory not empty Wrap the heartbeat goroutine in a sync.WaitGroup so the per-user callback waits for HeartBeat to fully exit (including its terminal writes) before returning. This drains all bucket I/O before test teardown removes the directory, eliminating the race. Fixes cortexproject#7564 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Member
|
I think this PR duplicates #7386. |
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Wraps the
VisitMarkerManager.HeartBeatgoroutine inBlocksCleaner.cleanUpActiveUsersandBlocksCleaner.cleanDeletedUserswith async.WaitGroupso the per-user cleanup callback waits for the heartbeat goroutine to fully exit before returning, instead of just signalling it viaerrChan <- niland racing on.This is a test-flake fix only; the production behaviour change is that per-user cleanup now blocks until the heartbeat's terminal
MarkWithStatus(Completed)+DeleteVisitMarkerwrites finish, which is what the surrounding code already assumed.Why
TestBlocksCleanerflakes intermittently ontest-no-race (arm64)CI with:Observed in run 26555111751.
Root cause:
cleanUpActiveUsers/cleanDeletedUserslaunchedvisitMarkerManager.HeartBeatin a baregostatement and only signalled it viaerrChan <- nilin adefer, never awaiting it. On receipt of thenil,HeartBeatstill ran a terminalMarkWithStatus(Completed)using the parentctxfollowed byDeleteVisitMarker(context.Background())against the bucket. In the test, the bucket is a filesystem bucket backed by at.TempDir()whoset.Cleanuprunsos.RemoveAll; those late writes raced with the directory teardown, producing the arm64-flakyunlinkat: directory not emptyfailure.The arm64 scheduler exposes this race more often than amd64 does, but the bug is platform-independent — the goroutine leak / late-write window exists on every architecture.
How
pkg/compactor/blocks_cleaner.go(two call sites:cleanUpActiveUsersandcleanDeletedUsers). Replace:with:
The same change is applied symmetrically in both call sites.
syncis already imported in this file so no import diff is required. A CHANGELOG entry is added under master / unreleased referencing this PR.Why not other approaches
HeartBeatto await internally.HeartBeatis shared with other call sites (partition_compaction_planner.go,shuffle_sharding_planner.go) that intentionally do not wait; changing its lifetime contract risks breakage well beyond Flaky test: TestBlocksCleaner (arm64) —unlinkat: directory not empty#7564. The leak is owned by the caller; fix it at the caller.donechannel instead ofsync.WaitGroup. Functionally equivalent (this is what PR Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion #7386 proposed).WaitGroupreads naturally as "wait for this goroutine to exit"; either pattern works.errgroup. Overkill for a single goroutine with no error to propagate (the existingerrChanalready carries the only signal that matters). Would expand the diff without changing the correctness story.Which issue(s) this PR fixes
Fixes #7564
Note: PR #7386 by another contributor proposed a similar
doneChanfix that was marked CHANGES_REQUESTED by @friedrichg for missing the CHANGELOG entry. This PR is functionally equivalent (sync.WaitGroupvsdoneChan) and includes the missing CHANGELOG.Checklist
CHANGELOG.mdupdated under master / unreleasedTestBlocksCleaneris the regression test; runs 30/30 PASS in this worktree on amd64 (the arm64 race window is narrower so amd64 reproduction is unreliable, but the goroutine-leak window is closed regardless of platform)Test plan
gofmt -l pkg/compactor/blocks_cleaner.go— cleangoimports -local github.com/cortexproject/cortex -l pkg/compactor/blocks_cleaner.go— cleango vet -tags "netgo slicelabels" ./pkg/compactor/...— cleango build -tags "netgo slicelabels" ./pkg/compactor/...— cleango test -tags "netgo slicelabels" -timeout 600s -count=30 -run "^TestBlocksCleaner$/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s" ./pkg/compactor/...— 30/30 PASSgo test -race -tags "netgo slicelabels" -timeout 600s -run "^TestBlocksCleaner$" ./pkg/compactor/...— PASS🤖 Generated with Claude Code