Refactor: a2a3 swimlane AICore-self-identifying records (level=1 bypasses complete_task) by hw-native-sys-bot · Pull Request #974 · hw-native-sys/simpler

hw-native-sys-bot · 2026-06-02T13:28:51Z

Summary

Refactor a2a3 swimlane so AICORE_TIMING (level=1) skips the AICPU complete_task hot path entirely, and so the level≥2 path no longer carries fields it doesn't actually own. The AICore record is the single source of truth for task identity (task_token_raw) and kernel-execution timing (start_time, end_time); AICPU only contributes the two timestamps it alone can measure (dispatch_time, finish_time) plus the per-core join key (reg_task_id). Buffer rotation is decoupled from complete_task via AICPU's own dispatch-count tracker (no AICore-side signal, no scheduler-loop poll). dep_gen + a host-side core_types_ table cover the remaining identity gaps (func_id, core_type).

a2a3-only. a5's staging-ring model is different; out of scope here.

Commits

Add: dep_gen captures per-subtask kernel_ids — extends DepGenRecord with int32_t kernel_id[3] (steals from _pad0[32]; sizeof and tensors[] offset unchanged). Orchestrator passes {aic, aiv0, aiv1}_kernel_id to dep_gen_aicpu_record_submit; replay emits kernel_ids in deps.json.
Refactor: AICore buffer rotation via dispatch count — AICPU counts per-core dispatches in the dispatch path (scheduler_dispatch.cpp + host_build_graph/aicpu_executor.cpp) and rotates when the count is about to cross PLATFORM_AICORE_BUFFER_SIZE. No AICore-side signal cache line, no scheduler-loop poll. Race safety from the completion-before-dispatch invariant: AICore has FIN'd and dcci'd every record in the old buffer before AICPU's next dispatch. New hook: l2_swimlane_aicpu_on_aicore_dispatch. Drops L2SwimlaneAicoreSignal / poll_aicore_rotation / s_observed_aicore_buf_full_seq[]. total_record_count also bumped here so it stays accurate at all levels (including level=1 where complete_task never runs).
Refactor: AICore record self-identifies + level=1 bypass — L2SwimlaneAicoreTaskRecord carries 64-bit task_token_raw + per-core reg_task_id (still 32B half-cacheline). scheduler_completion + host_build_graph aicpu_executor gate entry into complete_task on level >= AICPU_TIMING. Host join key is reg_task_id (per-core dispatch-unique, the only key that works under SPMD block_num > num_cores and MIX cluster spread).
Add: level=1 host emit path (AICore-only synthesis) — set_core_types() on collector + sim/onboard device_runner wiring. export_swimlane_json synthesizes tasks[] from AICore records when collected_perf_records_ is empty and level == AICORE_TIMING; func_id emitted as -1 and joined from deps.json by swimlane_converter.py at post-process time (same model fanout already uses).
Doc updates — profiling_levels.md + docs/dfx/dep_gen.md.
Refactor: slim AICPU task record (64B → 32B) — L2SwimlaneAicpuTaskRecord drops task_id / func_id / core_type / start_time / end_time / duration (all redundant after the AICore-as-producer split); keeps dispatch_time / finish_time / reg_task_id only. complete_task signature shrinks 8 → 5 args; 5 call sites updated. AICPU writes a half-cacheline per task instead of a full one. Host: join_aicore_records() becomes build_aicore_lookup() (no record mutation); export pulls identity + AICore timing via reg_task_id lookup, core_type via core_types_ table, func_id via deps.json — unifies identity-resolution across levels=1 and ≥2.
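
The level gate described in the commit list above can be sketched as follows. This is a minimal illustration, not the PR's code: the enum values other than AICORE_TIMING = 1 and the helper name should_run_complete_task are assumptions, chosen to match the "level >= AICPU_TIMING" wording.

```cpp
#include <cstdint>

// Hypothetical level enum. Only AICORE_TIMING == 1 and the fact that
// AICPU_TIMING sits above it come from the PR text; the rest is assumed.
enum class L2SwimlaneLevel : int32_t {
    DISABLED = 0,
    AICORE_TIMING = 1,
    AICPU_TIMING = 2,
};

// Hypothetical helper: true when the AICPU completion path should record
// a task. At AICORE_TIMING (level=1) it returns false, so complete_task is
// never entered and the AICore record is the only per-task output.
inline bool should_run_complete_task(L2SwimlaneLevel level) {
    return static_cast<int32_t>(level) >=
           static_cast<int32_t>(L2SwimlaneLevel::AICPU_TIMING);
}
```

A caller would wrap each complete_task call site in this check, which is why rotation and total_record_count had to move out of complete_task first.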

Test plan

a2a3sim full dfx regression — pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx --platform a2a3sim 7/7 pass
a2a3sim level=1 JSON inspection — 5 tasks for 5-task vector_example, each with full PTO2 task_id, derived ring_id, correct core_type (aiv), valid timing, func_id=-1 (to be joined from deps.json)
a2a3sim level=4 default — passes; task_id is full PTO2, ring_id decoded, all fields populated
a2a3 onboard — attempted via task-submit --device auto --device-num 2; both attempts hit halMemCtl rc=13 (EACCES) during register init before swimlane code is reached (device contention, npu-smi info shows all devices near 100% from other users). Reviewers should re-run when devices free up.

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d905fe0-3876-47fa-bd3c-90b7aa916cf4


📝 Walkthrough

This PR refactors L2 swimlane profiling to encode full task tokens instead of low-word task IDs in AICore records, decouple buffer rotation from completion paths via explicit signaling and scheduler-loop polling, extend dependency records with per-subslot kernel IDs, and add host-side core-type mapping with a fallback JSON export path for minimal profiling levels.

Changes

Swimlane Task Identity, Signaling, and Rotation Refactor

Layer / File(s)	Summary
Shared-memory data layout updates `src/a2a3/platform/include/common/dep_gen.h`, `src/a2a3/platform/include/common/l2_swimlane_profiling.h`	`DepGenRecord` gains `kernel_id[3]` array with adjusted padding; `L2SwimlaneAicoreTaskRecord` stores `task_token_raw` (64-bit) replacing `task_id` (32-bit); new `L2SwimlaneAicoreSignal` cacheline struct introduced; `L2SwimlaneAicoreTaskPool` layout reorganized to place `aicore_sig` between `head` and `free_queue`, moving offsets and increasing total size.
AICore task recording with token and buffer-full signal `src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`	Function signature updated to accept `task_token_raw` (64-bit) instead of `task_id`; record write path now publishes buffer-full rotation signal to dedicated `L2SwimlaneAicoreSignal` cacheline when buffer reaches last slot via `dcci+dsb` instruction sequence.
Dependency record kernel ID capture `src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.h`, `src/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cpp`, `docs/dfx/dep_gen.md`	`dep_gen_aicpu_record_submit` extended with `kernel_ids[3]` parameter (AIC/AIV0/AIV1 subslot identifiers); record path writes kernel IDs to output record or `-1` for inactive subslots; documentation clarifies kernel-ID inclusion for post-processing task-to-kernel mapping.
Buffer rotation polling decoupled from completion `src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`	New `l2_swimlane_aicpu_poll_aicore_rotation()` function polls per-core `buf_full_seq` advances and triggers rotations independently; per-core observed-sequence cache tracks last seen value; `complete_task` no longer triggers rotation, only increments head counter; flush count calculation becomes level-dependent (formula-based at `AICPU_TIMING`, constant otherwise).
Host-side record joining and JSON export `src/a2a3/platform/include/host/l2_swimlane_collector.h`, `src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`, `src/a2a3/platform/sim/host/device_runner.cpp`	New `set_core_types()` API publishes per-core type table; `join_aicore_records()` join key changed from `reg_task_id` to `task_token_raw` hashmap; export validation extended to treat `collected_aicore_records_` as sufficient data at `AICORE_TIMING` level; fallback export path synthesizes `tasks[]` directly from AICore records when perf records are absent, using `core_types_` for metadata.
Scheduler loop level gating and polling `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp`, `src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`, `src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp`	Scheduler dispatch loop adds polling call to `l2_swimlane_aicpu_poll_aicore_rotation()` once per iteration; AICPU completion recording across multiple executor contexts gated on `l2_swimlane_level >= AICPU_TIMING` instead of just enabled flag, making Level 1 bypass `complete_task` entirely; AICore recording calls now pass full `task_token_raw` extracted from dispatch context.
Documentation and runtime call-site updates `src/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md`, `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp`	Profiling levels documentation updated to reflect Level 1 AICore-only timing with `task_token_raw`, host-side field derivation, and polling-based buffer rotation; weak stub and `submit_task_common` call site updated to capture and forward kernel IDs; `dep_gen_replay` extended to populate and emit `kernel_ids` JSON field.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/simpler#939: Both PRs modify the L2 swimlane AICore/AICPU profiling structures and buffer-rotation mechanisms; this PR's task-identity and buffer-full signaling builds on that PR's ActiveHead pool unification.
hw-native-sys/simpler#942: This PR's AICore swimlane task-recording signature and buffer-rotation refactoring build on that PR's earlier swimlane "rotation→active head" refactor that also modified l2_swimlane_aicore_record_task.

Poem

🐰 Task tokens grow to sixty-four bits wide,
Signal cachelines guide the rotation tide,
Scheduler polls while AICore writes far,
Kernel IDs map each subslot by char—
Host-side synthesis brings clarity's light,
One level cleaner, profiling done right! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly identifies the main refactoring work: decoupling AICore buffer rotation from complete_task and enabling level=1 to bypass AICPU processing entirely via self-identifying AICore records.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description clearly describes the refactoring of a2a3 swimlane, explaining the level=1 AICORE_TIMING bypass, AICore self-identification, buffer rotation decoupling, and related changes to the codebase.


gemini-code-assist

Code Review

This pull request optimizes and decouples the L2 swimlane profiling mechanism, particularly for AICORE_TIMING (level 1). It introduces a direct buffer-rotation signaling mechanism (L2SwimlaneAicoreSignal) from AICore to AICPU, allowing level 1 profiling to bypass the expensive AICPU complete_task hot path. It also updates schemas to carry the full 64-bit task_token_raw so that host-side post-processing can resolve task-to-kernel mappings. The reviewer identified two critical issues: a static array (s_observed_aicore_buf_full_seq) is not reset during initialization, which can cause incorrect buffer rotations on subsequent runs, and base_time_cycles remains uninitialized at level 1, leading to timestamp underflow and broken timeline alignments in the exported JSON.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp (1)
745-781: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add AICore rotation polling to host_build_graph at AICORE_TIMING (level=1)

In src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp, the gating blocks (~745-781, ~853-872, ~899-915) skip l2_swimlane_aicpu_complete_task when l2_swimlane_level < AICPU_TIMING, so level=1 bypasses it. However, AICore-side buffer rotation is decoupled from l2_swimlane_aicpu_complete_task and is driven by l2_swimlane_aicpu_poll_aicore_rotation, which reads aicore_sig.buf_full_seq and is the only caller of aicore_rotate. That poll is called from the tensormap_and_ringbuffer scheduler (scheduler_dispatch.cpp), but nothing in src/a2a3/runtime/host_build_graph/ calls it, so no rotations happen during the run; only the end-of-run l2_swimlane_aicpu_flush enqueues the current AICore buffer (capacity PLATFORM_AICORE_BUFFER_SIZE = 1024).
Add l2_swimlane_aicpu_poll_aicore_rotation(...) to the host_build_graph executor's main loop so level=1 still rotates buffers as AICore fills them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp` around lines 745
- 781, The host_build_graph AICPU executor loop is not calling
l2_swimlane_aicpu_poll_aicore_rotation, so AICore buffer rotations never occur
at L2SwimlaneLevel::AICORE_TIMING (level=1); add a call to
l2_swimlane_aicpu_poll_aicore_rotation(...) inside the executor main loop (near
the existing l2_swimlane_aicpu_complete_task calls in the AICPU handling blocks)
guarded by l2_swimlane_level >= L2SwimlaneLevel::AICORE_TIMING so rotations are
polled each iteration, including at level=1 (use the same
core_id/thread_idx/dispatch timestamp context as the surrounding code and ensure
dispatch_timestamps_ semantics are preserved).

🧹 Nitpick comments (1)

src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h (1)

138-144: 💤 Low value

Consider replacing magic number with sizeof(L2SwimlaneActiveHead).

The + 64 offset is correct per the static_assert ABI lock, but using sizeof(L2SwimlaneActiveHead) would make the relationship explicit and self-documenting if the struct ever changes (the static_assert would catch any mismatch).

♻️ Optional refactor

         __gm__ L2SwimlaneAicoreSignal *sig =
-            reinterpret_cast<__gm__ L2SwimlaneAicoreSignal *>(reinterpret_cast<__gm__ uint8_t *>(head) + 64);
+            reinterpret_cast<__gm__ L2SwimlaneAicoreSignal *>(reinterpret_cast<__gm__ uint8_t *>(head) + sizeof(L2SwimlaneActiveHead));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h` around lines
138 - 144, The code uses a magic offset "+ 64" to compute the
L2SwimlaneAicoreSignal pointer from head; replace that hard-coded 64 with
sizeof(L2SwimlaneActiveHead) to document the ABI relationship and keep the
static_assert meaningful—update the expression that builds sig (the
reinterpret_cast sequence referencing head) to add sizeof(L2SwimlaneActiveHead)
instead of 64, ensure L2SwimlaneActiveHead is visible/forward-declared in this
header so sizeof is available, and leave the existing static_assert and
assignment to sig->buf_full_seq (and subsequent dcci/dsb calls) unchanged.
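
As a rough sketch of the nitpick above, the offset can be tied to the struct size so that a static_assert documents the layout assumption. The types here (ActiveHead, AicoreSignal, TaskPool) are simplified stand-ins, not the real L2Swimlane definitions, and the `__gm__` address-space qualifier is omitted so the snippet compiles on a host compiler.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for L2SwimlaneActiveHead: one cache line.
struct alignas(64) ActiveHead {
    uint32_t total_record_count;
};
static_assert(sizeof(ActiveHead) == 64, "head must occupy exactly one cache line");

// Simplified stand-in for L2SwimlaneAicoreSignal: its own cache line.
struct alignas(64) AicoreSignal {
    uint32_t buf_full_seq;
};

// Simplified stand-in for the pool: the signal directly follows the head.
struct TaskPool {
    ActiveHead head;
    AicoreSignal aicore_sig;
};
static_assert(offsetof(TaskPool, aicore_sig) == sizeof(ActiveHead),
              "signal must directly follow the head");

// Derive the signal address from the head pointer via sizeof, not a literal 64.
inline AicoreSignal *signal_after_head(ActiveHead *head) {
    return reinterpret_cast<AicoreSignal *>(
        reinterpret_cast<uint8_t *>(head) + sizeof(ActiveHead));
}
```

If the head struct ever grows, the static_asserts fail at compile time instead of the pointer silently landing in the wrong place.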

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 900-935: The base_time_cycles fallback causes underflow because
collected AICore records aren't considered when l2_swimlane_level_ ==
L2SwimlaneLevel::AICORE_TIMING; update the base_time_cycles computation to also
scan collected_aicore_records_ and take the minimum start_time (same min logic
used for tagged_records/phase records) when in that mode. Specifically, after
the phase-record base_time calculation (the block that currently computes
base_time_cycles from tagged_records and phase records), add a loop over
collected_aicore_records_ (iterating core_id 0..num_aicore_-1 and each r in
collected_aicore_records_[core_id]) that skips r.start_time==0 and sets
base_time_cycles = std::min(base_time_cycles, r.start_time); this ensures
cycles_to_us(r.start_time - base_time_cycles) cannot underflow in the
AICORE_TIMING fallback.
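
The suggested base-time fix can be sketched as below. This is an illustration under assumptions: AicoreRecord is a reduced stand-in for the real record, and fold_aicore_base_time is a hypothetical helper name; only the "skip start_time == 0, take the minimum" logic comes from the review comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reduced stand-in for an AICore record: timing fields only.
struct AicoreRecord {
    uint64_t start_time;
    uint64_t end_time;
};

// Hypothetical helper: lower base_time_cycles to the earliest valid AICore
// start_time. Slots with start_time == 0 are treated as unused and skipped.
inline uint64_t fold_aicore_base_time(
    uint64_t base_time_cycles,
    const std::vector<std::vector<AicoreRecord>> &per_core_records) {
    for (const auto &records : per_core_records) {
        for (const AicoreRecord &r : records) {
            if (r.start_time == 0) continue;
            base_time_cycles = std::min(base_time_cycles, r.start_time);
        }
    }
    return base_time_cycles;
}
```

With the base folded this way, `r.start_time - base_time_cycles` is non-negative for every valid record, so the unsigned subtraction cannot wrap.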



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f4fb728-77e3-424a-b130-7e41019a25f3

📥 Commits

Reviewing files that changed from the base of the PR and between d61dee4 and 8a19d8a.

📒 Files selected for processing (19)

docs/dfx/dep_gen.md
src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
src/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.h
src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a2a3/platform/include/common/dep_gen.h
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/include/host/l2_swimlane_collector.h
src/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
src/a2a3/platform/sim/host/device_runner.cpp
src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp

Threads the orchestrator-side kernel_id triple into each DepGenRecord so the host post-processor can resolve (task_id -> kernel) offline. This is the identity-side groundwork for the upcoming swimlane refactor that moves AICore self-identification out of the AICPU complete_task hot path -- the AICore record will carry only timing + task_token_raw, and the viewer will join func_id from deps.json the same way fanout already does.

Schema change is in-place: 12B kernel_id[3] steals from _pad0[32], so sizeof(DepGenRecord) stays 2624B and tensors[] stays cache-line aligned at offset 576 (static_asserts unchanged). Replay emits kernel_ids in the JSON tasks[] entries; INVALID_KERNEL_ID (-1) marks inactive subslots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
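
The "steal from padding" schema change can be illustrated with a toy layout. The structs below are NOT the real DepGenRecord (which is 2624B with tensors[] at offset 576); they are a scaled-down assumption that keeps only the idea of carving a 12-byte kernel_id[3] out of a 32-byte pad. set_kernel_ids is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int32_t INVALID_KERNEL_ID = -1;  // marks an inactive subslot

// Toy "before" layout: a 32-byte pad ahead of a cache-line-aligned array.
struct RecordBefore {
    uint64_t task_id;
    uint8_t _pad0[32];
    alignas(64) uint8_t tensors[64];
};

// Toy "after" layout: kernel_id[3] takes 12 bytes out of the pad.
struct RecordAfter {
    uint64_t task_id;
    int32_t kernel_id[3];  // {aic, aiv0, aiv1}
    uint8_t _pad0[20];     // 32 - 12
    alignas(64) uint8_t tensors[64];
};

// The in-place change must not move anything that follows the pad.
static_assert(sizeof(RecordBefore) == sizeof(RecordAfter), "size unchanged");
static_assert(offsetof(RecordBefore, tensors) == offsetof(RecordAfter, tensors),
              "tensors[] offset unchanged");

// Hypothetical helper: store the per-subslot kernel ids into the record.
inline void set_kernel_ids(RecordAfter &rec, int32_t aic, int32_t aiv0, int32_t aiv1) {
    rec.kernel_id[0] = aic;
    rec.kernel_id[1] = aiv0;
    rec.kernel_id[2] = aiv1;
}
```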

Decouple AICore buffer rotation from the AICPU complete_task hot path.

Previously every complete_task bumped ac_state->head.total_record_count and, every PLATFORM_AICORE_BUFFER_SIZE-th completion, called aicore_rotate. This trigger required complete_task to run on every FIN, blocking the upcoming AICORE_TIMING (level=1) bypass.

New mechanism: AICPU drives rotation from its own per-core dispatch count. The dispatch path (scheduler_dispatch in tensormap_and_ringbuffer; aicpu_executor in host_build_graph) calls a new l2_swimlane_aicpu_on_aicore_dispatch hook right before write_reg(DATA_MAIN_BASE). The hook maintains the per-core dispatch count and rotates when it is about to cross a BUFFER_SIZE boundary -- strictly before the upcoming dispatch register write. AICPU has full dispatch visibility on its own, so no AICore-side signal cache line is needed.

Race safety: the runtime's completion-before-dispatch invariant (AICore per core is single-threaded, and AICPU does not dispatch task K+1 until K FIN'd) guarantees that at rotation time AICore has already finished writing -- and dcci'd out -- every record in the old buffer. AICPU can safely enqueue it to the ready queue.

Side effects vs the prior self-signal design:
- L2SwimlaneAicoreSignal struct removed; L2SwimlaneAicoreTaskPool shrinks from 256B back to 192B (head 64 + free_queue 128); static_asserts updated. Host allocator uses sizeof() so the new size propagates automatically.
- AICore record_task no longer writes a buf_full_seq at slot N-1; one fewer dcci OUT + dsb per BUFFER_SIZE tasks on the AICore hot path.
- AICPU complete_task no longer bumps the AICore-pool total_record_count; the dispatch hook bumps it instead. This makes the counter accurate at all levels (level=1 included, where complete_task is bypassed). The flush-path high-water-mark formula drops its level=1 fallback branch as a result.
- Scheduler main loop no longer polls every iteration -- the once-per-loop l2_swimlane_aicpu_poll_aicore_rotation call is gone. Trigger cost moves to a single per-dispatch check, paid only on cores that actually dispatched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
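
The dispatch-count trigger described in this commit message can be sketched roughly as follows. This is a simplified model, not the PR's hook: CoreDispatchTracker and this on_aicore_dispatch are hypothetical names, kBufferSize stands in for PLATFORM_AICORE_BUFFER_SIZE (set to 4 here to keep the example small), and the actual rotation (free_queue pop, ready_queue push, head update) is reduced to a counter.

```cpp
#include <cstdint>

// Stand-in for PLATFORM_AICORE_BUFFER_SIZE; tiny on purpose.
constexpr uint32_t kBufferSize = 4;

// Hypothetical per-core state kept on the AICPU side.
struct CoreDispatchTracker {
    uint64_t dispatch_count = 0;  // dispatches issued to this core so far
    uint32_t rotations = 0;       // buffer rotations performed so far
};

// Call right before the dispatch register write. Returns true if the buffer
// was rotated. Rotation happens when the upcoming dispatch would be the first
// record of a new buffer, i.e. the previous buffer is already full. By the
// completion-before-dispatch invariant, AICore has finished every record in
// the old buffer by this point.
inline bool on_aicore_dispatch(CoreDispatchTracker &t) {
    bool rotated = false;
    if (t.dispatch_count != 0 && t.dispatch_count % kBufferSize == 0) {
        ++t.rotations;  // real code would enqueue the full buffer here
        rotated = true;
    }
    ++t.dispatch_count;  // doubles as the total record count at every level
    return rotated;
}
```

Because the counter advances on dispatch rather than on completion, it stays accurate at level=1, where complete_task never runs.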

…es complete_task

Two coupled changes:

1) AICore record schema (L2SwimlaneAicoreTaskRecord) replaces the 32-bit `task_id` (low bits of the per-core reg dispatch token) with a 64-bit `task_token_raw` carrying the full PTO2 encoding (ring_id << 32 | local_id) for tensormap_and_ringbuffer, or the zero-extended task index for host_build_graph. Record stays 32B half-cache-line. The AICore helper reads `task_token` straight from `exec_payload->local_context.async_ctx.task_token.raw` -- already in AICore cache from the just-completed task, so the new identity comes at no extra GM load. Layout invariants (sizeof == 32) hold.

2) scheduler_completion + host_build_graph aicpu_executor: gate the entry into `l2_swimlane_aicpu_complete_task` on `l2_swimlane_level >= AICPU_TIMING`. At AICORE_TIMING (level=1) the AICore record alone carries identity; AICPU adds nothing useful and the per-completion hot path (counter inc, ring lookup, identity store, wmb) is elided entirely. Phase 3's signal-based AICore rotation makes this safe -- rotations no longer piggy-back on complete_task.

Host collector: join key moves from per-core `reg_task_id` (low 32) to the full PTO2 `task_id`. AICore record's `task_token_raw` matches the AICPU record's `task_id`; both are 64-bit, both are the canonical identity. The kDirectIndexCap vector fast path goes away -- PTO2 task ids are sparse high values, so unordered_map is the right structure.

Known follow-up (next commit in this PR): at level=1 there are no AICPU records, so the host emit loop currently produces zero tasks[] entries. The Phase 4 commit adds an AICore-records-only path that synthesizes JSON entries from {task_token_raw, start, end} + dep_gen kernel_ids[] + host-side core_id -> core_type derivation, making level=1 useful end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
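
The token encoding mentioned above (ring_id << 32 | local_id) can be sketched with a few helpers. The function names are hypothetical; only the bit layout is taken from the commit message.

```cpp
#include <cstdint>

// Pack ring_id into the high 32 bits and local_id into the low 32 bits.
constexpr uint64_t encode_task_token(uint32_t ring_id, uint32_t local_id) {
    return (static_cast<uint64_t>(ring_id) << 32) | local_id;
}

// Recover the ring id (high word) on the host side.
constexpr uint32_t token_ring_id(uint64_t raw) {
    return static_cast<uint32_t>(raw >> 32);
}

// Recover the ring-local id (low word) on the host side.
constexpr uint32_t token_local_id(uint64_t raw) {
    return static_cast<uint32_t>(raw & 0xFFFFFFFFu);
}

// Round-trip check, evaluated at compile time.
static_assert(token_ring_id(encode_task_token(3, 7)) == 3, "ring_id round-trips");
static_assert(token_local_id(encode_task_token(3, 7)) == 7, "local_id round-trips");
```

For host_build_graph, where the token is a zero-extended task index, the same decode yields ring_id 0 and local_id equal to the index.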

Make AICORE_TIMING (level=1) useful end-to-end by synthesizing JSON tasks[] entries from the AICore record stream alone -- no AICPU records exist at this level because complete_task is bypassed by Phase 1+2.

Three pieces:

1. set_core_types(types, n) on L2SwimlaneCollector + sim device_runner call. Sim populates from runtime.workers[i].core_type. Onboard not wired in this commit (its workers[].core_type is established via AICore handshake which races with init_l2_swimlane); the existing level>=2 path on onboard still works because AICPU records carry core_type directly. Onboard level=1 wiring is a follow-up.

2. AICPU flush path falls back to enqueueing the full buffer (count=PLATFORM_AICORE_BUFFER_SIZE) at level=1 instead of the total_record_count-based live count, since complete_task -- where total_record_count is bumped -- is bypassed at level=1. The host copy_aicore_buffer already skips trailing slots with start_time==0, so over-stating count costs only a scan pass, not spurious records.

3. export_swimlane_json: when collected_perf_records_ is empty AND level == AICORE_TIMING, synthesize one task[] entry per AICore record from {task_token_raw, start, end} + the published core_types_ table for core_type. func_id is emitted as -1; swimlane_converter.py joins it from deps.json by task_id (same pattern fanout already uses).

Verified: at level=1 the JSON has 5 tasks for the 5-task vector_example, each with the full PTO2 task_id, derived ring_id, correct core_type ("aiv"), and valid timing. level>=2 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
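
A rough model of the level=1 synthesis step is shown below. The types AicoreRecord and TaskEntry, the function synthesize_tasks, and the "unknown" fallback string are assumptions for illustration; the rules it applies (skip slots with start_time == 0, take core_type from a per-core table, emit func_id as -1) come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reduced stand-in for an AICore record.
struct AicoreRecord {
    uint64_t task_token_raw;
    uint64_t start_time;
    uint64_t end_time;
};

// Hypothetical in-memory form of one tasks[] entry before JSON emission.
struct TaskEntry {
    uint64_t task_id;
    uint32_t ring_id;
    int32_t core_id;
    std::string core_type;
    uint64_t start_time;
    uint64_t end_time;
    int32_t func_id;
};

// Build task entries from AICore records alone.
inline std::vector<TaskEntry> synthesize_tasks(
    const std::vector<std::vector<AicoreRecord>> &per_core,
    const std::vector<std::string> &core_types) {
    std::vector<TaskEntry> out;
    for (std::size_t core = 0; core < per_core.size(); ++core) {
        // core_type comes from the host-side table, not from the record.
        const std::string type = core < core_types.size() ? core_types[core] : "unknown";
        for (const AicoreRecord &r : per_core[core]) {
            if (r.start_time == 0) continue;  // unused slot
            out.push_back(TaskEntry{r.task_token_raw,
                                    static_cast<uint32_t>(r.task_token_raw >> 32),
                                    static_cast<int32_t>(core), type,
                                    r.start_time, r.end_time,
                                    -1});  // func_id is joined later from deps.json
        }
    }
    return out;
}
```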

…ctor

- profiling_levels.md: level 1 row now says "AICore timing only, complete_task bypassed"; new paragraph on how host derives func_id (deps.json kernel_ids[]) and core_type (L2SwimlaneCollector::set_core_types) at level 1; describes the buf_full_seq + scheduler poll rotation mechanism.
- dep_gen.md: capture call site row notes the new a2a3-only kernel_id[3] field and its role as the swimlane identity bridge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After the AICore-as-producer split, identity (task_token_raw) and AICore-side timing (start/end) live in L2SwimlaneAicoreTaskRecord; the AICPU side only adds AICPU-only timestamps. The L2SwimlaneAicpuTaskRecord schema still carried forward task_id / func_id / core_type / start_time / end_time / duration from the pre-refactor "AICPU is the producer" era, none of which the AICPU side actually owns anymore.

Trim it down to the minimum viable AICPU payload:

    struct L2SwimlaneAicpuTaskRecord {
        uint64_t dispatch_time;  // 8B
        uint64_t finish_time;    // 8B
        uint32_t reg_task_id;    // 4B (host join key vs AICore stream)
    } __attribute__((aligned(32)));  // sizeof == 32B (compiler trailing pad)

Hot-path:
- 64B per record -> 32B per record. AICPU buffer holds 2x records in the same shared-memory footprint (PLATFORM_PROF_BUFFER_SIZE unchanged).
- complete_task writes 3 fields instead of 9 -- drops a 24B chunk of zero-stores (start/end/duration that the pre-refactor design pre-zeroed for host-fill) plus the redundant task_id (8B) / func_id (4B) / core_type (4B) stores. One half-cache-line commit per task.
- complete_task signature shrinks from 8 to 5 args; callers in scheduler_completion.cpp and host_build_graph/aicpu_executor.cpp drop task->task_id.raw, task->kernel_id[slot], and h->core_type loads at every call site.

Host post-processing changes:
- join_aicore_records() (record patcher) becomes build_aicore_lookup() -- builds per-core reg_task_id -> {task_token_raw, start_time, end_time} maps and returns them. No mutation of collected_perf_records_.
- export_swimlane_json() walks AICPU records, joins each by reg_task_id against its core's lookup, and emits task entries using AICore-side identity + timing, AICPU-side dispatch/finish, core_type from the set_core_types() static table, and func_id = -1 (resolved post-process by swimlane_converter.py from deps.json's kernel_ids[]).
- Identity-resolution path is now unified across all levels: level=1's emit loop and level>=2's emit loop use the same core_types_ table + deps.json mechanism. Drops one "level=1 fallback" code branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
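
The slimmed record and the lookup-based join can be sketched as follows. The struct and function names mirror the commit message, but the bodies are assumptions: the real build_aicore_lookup and export path are not shown in this PR page, and JoinedTask and join_task are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// AICPU side: two timestamps plus the per-core join key, padded to 32B.
struct alignas(32) AicpuTaskRecord {
    uint64_t dispatch_time;
    uint64_t finish_time;
    uint32_t reg_task_id;
};
static_assert(sizeof(AicpuTaskRecord) == 32, "half a 64B cache line");

// AICore side: identity and kernel timing (reduced stand-in).
struct AicoreTaskRecord {
    uint64_t task_token_raw;
    uint64_t start_time;
    uint64_t end_time;
    uint32_t reg_task_id;
};

struct AicoreInfo {
    uint64_t task_token_raw;
    uint64_t start_time;
    uint64_t end_time;
};
using AicoreLookup = std::unordered_map<uint32_t, AicoreInfo>;

// Build one core's reg_task_id -> AICore info map; the input is not mutated.
inline AicoreLookup build_aicore_lookup(const std::vector<AicoreTaskRecord> &core_records) {
    AicoreLookup lookup;
    lookup.reserve(core_records.size());
    for (const AicoreTaskRecord &r : core_records) {
        lookup[r.reg_task_id] = AicoreInfo{r.task_token_raw, r.start_time, r.end_time};
    }
    return lookup;
}

// Hypothetical merged view used when emitting one task entry.
struct JoinedTask {
    uint64_t task_token_raw;
    uint64_t dispatch_time;
    uint64_t start_time;
    uint64_t end_time;
    uint64_t finish_time;
};

// Join an AICPU record against its core's lookup; false if no AICore match.
inline bool join_task(const AicpuTaskRecord &cpu, const AicoreLookup &lookup, JoinedTask &out) {
    auto it = lookup.find(cpu.reg_task_id);
    if (it == lookup.end()) return false;
    out = JoinedTask{it->second.task_token_raw, cpu.dispatch_time,
                     it->second.start_time, it->second.end_time, cpu.finish_time};
    return true;
}
```

Keeping the lookup per core matters because reg_task_id is only unique within a core's dispatch stream.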

Previously dispatch_time and finish_time were sampled at convenient code locations rather than at the actual AICPU register I/O boundaries, charging AICPU-internal cost to the "dispatch chain" and "FIN delivery" spans.

dispatch_time was sampled:
- tensormap_and_ringbuffer: inside dispatch_subtask_to_core, BEFORE on_aicore_dispatch (which on every BUFFER_SIZE-th call does a full rotation: free_queue pop + ready_queue push + head update + wmb) and BEFORE wmb + write_reg
- host_build_graph: at the top of try_dispatch_task, BEFORE LOG_INFO_V0 (a V0-level log, always on once that logger is enabled), BEFORE on_aicore_dispatch, BEFORE wmb + write_reg

Net effect: (dispatch_time → start_time) included a couple hundred ns of AICPU prep that has nothing to do with AICore latency, and a µs-scale spike every 1024 dispatches when rotation happened. Misleading for diagnosing dispatch-chain bottlenecks.

finish_time was sampled inside the level-gated complete_task path -- AFTER decide_slot_transition, AFTER LOG_INFO_V0, AFTER fanin processing preamble. Charged AICPU completion-processing cost to (end_time → finish_time).

This commit moves both timestamps to the actual I/O boundaries:
- dispatch_time = get_sys_cnt_aicpu() right after wmb, immediately before write_reg(DATA_MAIN_BASE).
- finish_time = get_sys_cnt_aicpu() right after rmb() following the register read that observed FIN, before LOG_INFO_V0 and any transition processing.

Additional cleanup in host_build_graph driven by the dispatch_time move:
- The post-completion `dispatch_timestamps_[core_id] = get_sys_cnt_aicpu()` pre-emptive writes at four sites are now dead -- try_dispatch_task sets the timestamp correctly at the next dispatch. Removed.
- The "Update timestamp if didn't dispatch" pre-emptive write at two sites is dead for the same reason. Removed.
- The `bool dispatched` variable in two case-1/case-3 blocks was only consumed by the removed pre-emptive write. Dropped (the try_dispatch_task calls themselves are preserved).

Sched/orch phase records and the existing per-task on_aicore_dispatch rotation hook are untouched -- this is purely a sampling-position fix for the existing dispatch_time / finish_time fields. No schema change, no new fields, no record-size change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 2, 2026

Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp (outdated)

Comment thread src/a2a3/platform/shared/host/l2_swimlane_collector.cpp (outdated)

coderabbitai Bot reviewed Jun 2, 2026

Comment thread src/a2a3/platform/shared/host/l2_swimlane_collector.cpp

hw-native-sys-bot force-pushed the swimlane-aicore-self-id branch 5 times, most recently from 4264f6a to f086334, on June 3, 2026 06:14

hw-native-sys-bot force-pushed the swimlane-aicore-self-id branch from f086334 to f3078e4 on June 3, 2026 08:02

ChaoWao and others added 4 commits June 3, 2026 17:10

hw-native-sys-bot force-pushed the swimlane-aicore-self-id branch from f3078e4 to a589570 on June 3, 2026 09:12

ChaoWao previously approved these changes Jun 3, 2026

ChaoWao dismissed their stale review via bc2166a June 3, 2026 11:25

ChaoWao approved these changes Jun 3, 2026

hw-native-sys-bot wants to merge 7 commits into hw-native-sys:main from hw-native-sys-bot:swimlane-aicore-self-id
