Finding deadlocks in CuTe kernels with SPIN

Using the SPIN model checker to statically find deadlocks in CuTe DSL kernels on NVIDIA B200, or to prove their absence, and presenting a proof-of-concept lowering from CuTe to Promela at github.com/cheshire/cute2promela.

Synchronization bugs in GPU kernels are hard to debug. When a barrier deadlocks, the hardware yields no stack trace and no error code; the benchmark simply runs until it times out. Each iteration of the debug loop can then cost tens of minutes.

While working on the FlashInfer MLSYS Challenge (our solution took 1st place in the mixture-of-experts track), we had to iterate on a persistent fused mixture-of-experts kernel for DeepSeek-V3, written in CUTLASS’s CuTe DSL for an NVIDIA B200 and stitched together from FF1, SwiGLU, and FF2 stages across clusters of CTAs. The pipeline is coordinated via mbarriers in shared memory, peer-CTA mbarriers reached via mapa-translated DSMEM pointers, and GMEM atomic counters across clusters. A bug in any of those can result in a deadlock, which is hard to debug, especially when GPUs are only available via Modal.

As my background is in formal verification, I wanted to instead encode the synchronization protocols in Promela and check them with the SPIN model checker. This would not only shorten the iteration cycle, but also deterministically either produce a counterexample to the desired property (e.g. a deadlocking interleaving) or prove that no such interleaving exists.

Short Primer: Blackwell Synchronization Primitives

Let’s start with an overview of Blackwell synchronization primitives we would have to model.

An mbarrier is a 64-bit hardware object that lives in shared memory. Its bits hold a current arrival count, a pre-declared expected count, and a single-bit phase. You allocate it like any SMEM variable, then have one thread initialize it with the expected count; from that point on, the hardware treats it as a small state machine with two transitions:

Arrive. Any thread can call mbarrier_arrive on the barrier. The hardware atomically increments the arrival count. If the new count equals expected, the barrier completes: the count resets, the phase bit flips. Otherwise the call returns and the thread continues.
Wait. A thread that calls mbarrier_wait(phase=P) stalls until the barrier’s phase differs from P. Since the phase only changes on completion, this means “wait until the barrier completes one more time after I last looked”.

The single-bit phase is what makes mbarriers reusable. The first time you arrive-and-wait, you wait on phase 0; after completion, the barrier is at phase 1 and ready for the next round; you wait on phase 1; and so on. A persistent kernel that does N iterations of arrive-then-wait has to flip its expected-phase tracking each iteration. Get the flipping wrong and an iteration deadlocks.
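
To make the state machine concrete, here is a minimal Python sketch of the count/phase behaviour described above. The class name MBarrier and the method wait_would_pass are made up for illustration (they are not CuTe or PTX APIs), and the sketch ignores everything about the real hardware except the arrival count and the phase bit.

```python
class MBarrier:
    """Toy software model of an mbarrier: arrival count, expected count, 1-bit phase."""

    def __init__(self, expected: int) -> None:
        self.expected = expected
        self.count = 0
        self.phase = 0

    def arrive(self) -> None:
        # One arrival; on reaching the expected count, reset it and flip the phase.
        self.count += 1
        if self.count == self.expected:
            self.count = 0
            self.phase ^= 1

    def wait_would_pass(self, phase: int) -> bool:
        # A wait on `phase` unblocks once the barrier's phase differs from it.
        return self.phase != phase


bar = MBarrier(expected=2)
tracked = 0                      # the waiter's own phase bookkeeping
history = []
for _ in range(3):               # three rounds of arrive-then-wait
    bar.arrive()
    bar.arrive()                 # the second arrival completes the round
    # (wait with the tracked phase, wait with a hardcoded phase 0)
    history.append((bar.wait_would_pass(tracked), bar.wait_would_pass(0)))
    tracked ^= 1                 # toggle for the next round
print(history)
```

The second element of each pair is what a wait hardcoded to phase 0 would see. It passes in the first and third rounds but not in the second, where the barrier has flipped back to phase 0.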

mbarrier in CuTe DSL

In CuTe DSL, every mbarrier op is one of a small handful of calls:

Initialization. Barriers are allocated in static SMEM and initialized with their expected arrival count. One elected thread per cluster does this, and a fence makes the initialization visible before anyone arrives.

with cute.arch.elect_one():
    cute.arch.mbarrier_init(gate_mbar, expected=256)  # 128 local + 128 peer
    cute.arch.mbarrier_init_fence()

Local arrive. A thread drops one count on a barrier in its own CTA’s SMEM.

cute.arch.mbarrier_arrive(gate_mbar)

TMA arrive-and-expect-tx. A variant used by TMA producers: the barrier accumulates bytes rather than arrivals, and completes when the expected byte count has landed. Typically followed by an async copy that names the barrier and a wait on the same barrier.

cute.arch.mbarrier_arrive_and_expect_tx(w2_mbar, w2_copy_bytes)
cute.copy(tma_atom_w2, ..., tma_bar_ptr=w2_mbar)
cute.arch.mbarrier_wait(w2_mbar, tma_phase_w2)
tma_phase_w2 = tma_phase_w2 ^ cutlass.Int32(1)

Phase wait. A consumer blocks until the barrier’s phase differs from the value it last saw.

cute.arch.mbarrier_wait(gate_mbar, sgate_phase)
sgate_phase = sgate_phase ^ cutlass.Int32(1)   # toggle for next iter

Cross-CTA arrive. The B200 cluster lets two CTAs share each other’s shared memory through a DSMEM window. To arrive on a peer CTA’s mbarrier, the local SMEM pointer is translated through mapa.shared::cluster to the peer’s view of the same allocation, and the arrive lands on that translated address.

def mbarrier_arrive_peer(mbar_ptr, peer_cta_rank):
    remote = cute.arch.mapa_shared_cluster(mbar_ptr, peer_cta_rank)
    cute.arch.mbarrier_arrive(remote, space="cluster")

From the peer’s perspective, that arrive lands in their count the same way a local arrive would. The hardware handles the address translation and the count update; no software protocol is involved.

Cross-cluster: GMEM atomic counter. Clusters can’t share SMEM, only GMEM. When a producer in one cluster needs to hand off to a consumer in another, the DeepGEMM idiom is an int32 GMEM counter: the producer does atomic_add(counter, 1, sem="release", scope="gpu") when it finishes, and the consumer spins on atomic_load(counter, sem="acquire", scope="gpu") until the count reaches expected. sem="acquire"/"release" is the C11 ordering pair: stores before the release are visible after the matching acquire. scope="gpu" is the part that’s easy to get wrong — on Blackwell, ordering and scope are orthogonal, and the scope decides how far the visibility actually propagates. cta, cluster, gpu, and sys form a hierarchy; pick one too small and the consumer reads stale GMEM even though the producer’s atomic completed. Case 2 below works through the bug class this opens up.
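
The shape of this handoff can be sketched on the CPU with Python threads. This is only an analogy: the Counter class and run_handoff are hypothetical names, a lock stands in for the release-add and acquire-load, and Python threads do not reproduce GPU memory scopes, so the sketch shows the counting protocol and nothing about visibility.

```python
import threading


class Counter:
    """Lock-protected integer standing in for the int32 GMEM counter."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def add(self, n: int) -> None:
        with self._lock:
            self._value += n

    def load(self) -> int:
        with self._lock:
            return self._value


def run_handoff(n_producers: int) -> int:
    payload = [None] * n_producers
    counter = Counter()
    result = []

    def producer(i: int) -> None:
        payload[i] = i * i                      # data is written first ...
        counter.add(1)                          # ... then published via the counter

    def consumer() -> None:
        while counter.load() < n_producers:     # spin until every producer arrived
            pass
        result.append(sum(payload))             # safe: all payload slots are filled

    threads = [threading.Thread(target=consumer)]
    threads += [threading.Thread(target=producer, args=(i,)) for i in range(n_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(run_handoff(4))
```

The ordering inside producer is the part that carries over to the GPU: the payload write comes before the counter increment, and the consumer touches the payload only after the counter has reached the expected value.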

Fences. A handful of variants order things that don’t order themselves. The one we would use is fence_view_async_shared: it makes synchronous SMEM stores visible to subsequent async consumers (in our case, the mbarrier wait that gates a downstream SMEM read). Other call sites use fence_proxy("async.shared", space="cta"), the inverse direction (async producers, sync consumers). Fences are cheap at runtime; forgetting them produces the same intermittently-stale reads that scope errors do, and the bug is invisible to SPIN unless the visibility constraint is encoded in the model.

A motivating kernel

For a running example we’ll use a two-CTA Jacobi smoother on a 256-cell array of float32. Each CTA owns 128 cells in its own SMEM and repeatedly applies the 3-point average

    new[i] = (old[i-1] + old[i] + old[i+1]) / 3

across the whole array. Interior cells only touch the owning CTA’s SMEM. The cells at the boundary (cell 127 of CTA 0 and cell 0 of CTA 1) need one value from the peer CTA, which is exchanged through cluster DSMEM once per iteration. Because the global endpoints clamp to themselves in the code below, the array relaxes toward a constant (the mean of its initial values) as the number of rounds grows, so the kernel does something visible. What we care about for the rest of the post is its synchronization skeleton, which is the same as the cross-CTA exchange in our MoE kernel.

def mbarrier_arrive_peer(mbar_ptr, peer_cta_rank):
    """Arrive on a peer CTA's mbarrier via mapa.shared::cluster."""
    remote = cute.arch.mapa_shared_cluster(mbar_ptr, peer_cta_rank)
    cute.arch.mbarrier_arrive(remote, space="cluster")
 
 
@cute.jit
def jacobi_smoother(arr_ptr, N_ITERS: cutlass.Constexpr):
    HALF = 128
    my_cta = cute.arch.cluster_rank_in_cluster()         # 0 or 1
    peer   = 1 - my_cta
    tid    = cute.arch.thread_idx_x()
 
    cur  = smem.alloc_array(float32, shape=HALF)         # this CTA's cells
    nxt  = smem.alloc_array(float32, shape=HALF)         # double buffer
    halo = smem.alloc_array(float32, shape=1)            # peer's edge cell
    gate = smem.alloc_mbarrier()
 
    # One elected thread per cluster initializes the mbarrier; fence
    # so the init is visible before any arrive.
    with cute.arch.elect_one():
        cute.arch.mbarrier_init(gate, expected=256)      # 128 local + 128 peer
        cute.arch.mbarrier_init_fence()
 
    cur[tid] = arr_ptr[my_cta * HALF + tid]              # initial load from GMEM
    phase = cutlass.Int32(0)                             # per-thread parity track
 
    for it in cutlass.range_constexpr(N_ITERS):
        # 1. One thread per CTA writes my edge cell into the peer's halo slot
        #    over cluster DSMEM.
        if tid == 0:
            my_edge = cur[HALF - 1] if my_cta == 0 else cur[0]
            peer_halo = cute.arch.mapa_shared_cluster(halo, peer)
            peer_halo[0] = my_edge
 
        # 2. Cluster sync: every thread in the cluster arrives locally and on
        #    the peer, then waits for the barrier to complete (256 arrivals).
        cute.arch.fence_view_async_shared()
        cute.arch.mbarrier_arrive(gate)
        mbarrier_arrive_peer(gate, peer)
        cute.arch.mbarrier_wait(gate, phase)
        phase = phase ^ cutlass.Int32(1)
 
        # 3. Apply the 3-point stencil. halo[0] supplies the missing neighbor
        #    at the CTA boundary; the global endpoints clamp to themselves.
        if my_cta == 0:
            left  = cur[tid]  if tid == 0          else cur[tid - 1]
            right = halo[0]   if tid == HALF - 1   else cur[tid + 1]
        else:
            left  = halo[0]   if tid == 0          else cur[tid - 1]
            right = cur[tid]  if tid == HALF - 1   else cur[tid + 1]
        nxt[tid] = (left + cur[tid] + right) / 3.0
 
        cur, nxt = nxt, cur                              # swap for next round
 
    arr_ptr[my_cta * HALF + tid] = cur[tid]              # write back to GMEM
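
As a sanity check on the data path (not on the synchronization), the stencil can be mirrored in plain Python. This is a sequential sketch under the assumption that each round's halo exchange completes before any cell is updated, which is exactly what the barrier is meant to guarantee; the function names are illustrative and the array is shrunk to 8 cells.

```python
def smooth_reference(cells, n_iters):
    """3-point average over the whole array, endpoints clamped to themselves."""
    cur = list(cells)
    n = len(cur)
    for _ in range(n_iters):
        cur = [
            (cur[max(i - 1, 0)] + cur[i] + cur[min(i + 1, n - 1)]) / 3.0
            for i in range(n)
        ]
    return cur


def smooth_two_halves(cells, n_iters):
    """Same stencil, split into two halves that exchange one halo value per round."""
    half = len(cells) // 2
    cur = [list(cells[:half]), list(cells[half:])]
    for _ in range(n_iters):
        # Halo exchange: halo[c] is the peer's edge cell as seen by half c.
        halo = [cur[1][0], cur[0][half - 1]]
        nxt = [[0.0] * half, [0.0] * half]
        for cta in (0, 1):
            for tid in range(half):
                if cta == 0:
                    left = cur[0][tid] if tid == 0 else cur[0][tid - 1]
                    right = halo[0] if tid == half - 1 else cur[0][tid + 1]
                else:
                    left = halo[1] if tid == 0 else cur[1][tid - 1]
                    right = cur[1][tid] if tid == half - 1 else cur[1][tid + 1]
                nxt[cta][tid] = (left + cur[cta][tid] + right) / 3.0
        cur = nxt                                # swap buffers for the next round
    return cur[0] + cur[1]


cells = [float(i * i % 7) for i in range(8)]
split = smooth_two_halves(cells, 5)
whole = smooth_reference(cells, 5)
print(split == whole, abs(sum(split) - sum(cells)) < 1e-9)
```

The split version matches the single-array version, and the sum of the array is preserved from round to round, which is what one expects from an averaging stencil with clamped endpoints.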

Each iteration brings the barrier’s arrival count to 256 (every one of the 128 threads in each CTA arrives once locally and once on its peer), at which point the hardware resets the count and flips the phase bit. A thread that called mbarrier_wait with the previous phase value unblocks, XORs its own tracked phase for the next round, and the next wait fires when the barrier’s phase flips again. That single-bit phase, threaded through a per-thread XOR each iteration, is the only thing keeping consecutive rounds distinguishable.

Whether this is correct is not obvious from reading. The phase tracking is one register; the XOR is one line; the expected count was set once at init. If we replace the phase argument on the mbarrier_wait line with a literal cutlass.Int32(0) (the bug we had in our MoE kernel!), the code still works with a single iteration, because the phase only starts to matter from the second iteration on, but deadlocks at larger counts:

$ modal run repro.py --mode correct --iters 30
=== mode=correct iters=30 ===
GPU: NVIDIA B200
Launching grid=(2,) num_ctas=2 threads/CTA=128 EXPECTED=256
COMPLETED in 0.507s. done_ctr=2 (expect 2)
PASS mode=correct iters=30
ROUND-TRIP: 5.3s wall (incl. Modal launch overhead)

$ modal run repro.py --mode buggy --iters 1
=== mode=buggy iters=1 ===
COMPLETED in 0.538s. done_ctr=2 (expect 2)
PASS mode=buggy iters=1

$ modal run repro.py --mode buggy --iters 5
=== mode=buggy iters=5 ===
Launching grid=(2,) num_ctas=2 threads/CTA=128 EXPECTED=256
(GPU never returns)
...
Task's current input hit its timeout of 300s
modal.exception.FunctionTimeoutError

The buggy variant completes at iters=1 because the phase only matters from the second iteration on. At iters ≥ 2 the GPU deadlocks.

The model-checking side: SPIN and Promela

SPIN is a model checker for concurrent systems, and Promela (Process Meta-Language) is the input DSL. You describe a finite set of communicating processes and the properties they should obey, and SPIN compiles a C verifier that exhaustively enumerates the reachable state space. If a property fails, it gives back a concrete interleaving as a counterexample. If the search completes, no interleaving within the model violates the property.

Processes and the `init` block

A proctype is a process template, and the run construct spawns an instance. Everything starts from the init block, which spawns the rest:

proctype Thread(int cta; int tid) {
  /* body */
}
 
init {
  int c, t;
  atomic {
    for (c : 0 .. NCTAS - 1) {
      for (t : 0 .. THREADS_PER_CTA - 1) {
        run Thread(c, t);
      }
    }
  }
}

With NCTAS = 2 and THREADS_PER_CTA = 4, this launches eight Thread processes. They share global state and interleave at the level of individual Promela statements. SPIN explores every legal interleaving.

`atomic { ... }`

An atomic block executes without interleaving. A single mbarrier arrive is one PTX instruction, so wrapping its count/phase update in atomic is exact:

inline do_arrive(cta) {
  atomic {
    mbar_count[cta] = mbar_count[cta] + 1;
    if
    :: mbar_count[cta] == EXPECTED_ARRIVALS ->
        mbar_count[cta] = 0;
        mbar_phase[cta] = 1 - mbar_phase[cta];
    :: else -> skip
    fi
  }
}

`inline`

An inline is a textual macro — like a C #define, but multi-statement and arity-checked. We use it for primitives like do_arrive and do_wait so the proctype body reads like the kernel.

`if`/`fi` with guards

Promela’s if is Dijkstra-style: a list of guarded statements, any one of which may fire if its guard is true. If multiple guards are open, SPIN picks among them non-deterministically, and the verifier explores every choice. The :: else branch fires only if no other guard is executable.

`do`/`od`

Same as if but loops, with break to exit:

do
:: iter < TILES_PER_CTA ->
     /* one iteration */
     iter = iter + 1;
:: else -> break
od

Guards as waits

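A false guard blocks the process until it becomes true. This holds for any Promela statement, inside an atomic or not: a condition used as a statement is executable only when it evaluates to true. Promela has no separate “wait” statement; you just write the condition: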

inline do_wait(cta, expected_phase) {
  atomic { mbar_phase[cta] != expected_phase -> skip; }
}

A thread that hits do_wait sits there until mbar_phase[cta] differs from expected_phase. Some other thread’s arrive flips the phase, the guard opens, and the proctype proceeds.

LTL: the property language

SPIN takes safety and liveness properties as Linear Temporal Logic formulas. The LTL subset we use is:

[] P “always P”. Safety: P holds in every reachable state. Use for invariants like “no data race”.
<> P “eventually P”. Liveness: P holds in some future state of every infinite execution. Use for “the protocol eventually completes”.
[]<> P “always eventually”, i.e. infinitely often. Use for fairness-style claims.
<>[] P “eventually permanently”. Use for “eventually the system stabilizes”.

Our standard deadlock-freedom claim:

ltl all_done {
  <> (iters_completed[0] == THREADS_PER_CTA
   && iters_completed[1] == THREADS_PER_CTA)
}

Internally, SPIN compiles an LTL property into a never claim, a Büchi automaton (an automaton over infinite words, accepting by visiting an accepting state infinitely often) that accepts exactly the executions that violate the property. If SPIN finds an accepting run of the never claim, that run is the counterexample. When the property is a safety invariant [] P, SPIN compiles it into the never-claim !(!P) and prints assertion violated !(!P) on a hit. There’s no actual assert() in the source; the doubly-negated form is the literal text of the never-claim whose witness SPIN found.
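
A counterexample to a liveness claim is an infinite run, which a finite-state checker can only hand back as a lasso: a finite prefix followed by a cycle that repeats forever. The following Python sketch evaluates <> P and [] P on such a lasso. The state layout (a dict with an iters_completed list) and the function names are assumptions made for this example; this is not how SPIN represents runs internally.

```python
def eventually(pred, prefix, cycle):
    """<> pred on the infinite run: prefix, then cycle repeated forever."""
    return any(pred(s) for s in prefix + cycle)


def always(pred, prefix, cycle):
    """[] pred on the same run."""
    return all(pred(s) for s in prefix + cycle)


THREADS_PER_CTA = 4


def all_done(state):
    return state["iters_completed"] == [THREADS_PER_CTA, THREADS_PER_CTA]


# A stuck run: three CTA 1 threads finish, then nothing ever changes again.
prefix = [{"iters_completed": [0, n]} for n in range(4)]
cycle = [{"iters_completed": [0, 3]}]

violates_liveness = not eventually(all_done, prefix, cycle)
cta0_never_finishes = always(lambda s: s["iters_completed"][0] == 0, prefix, cycle)
print(violates_liveness, cta0_never_finishes)
```

Because the cycle repeats forever, checking each state of the prefix and the cycle once is enough for these two operators.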

Running the verifier

The compile-and-run sequence is always three commands:

$ spin -a model.pml      # generate pan.c (the C verifier)
$ cc -O2 -o pan pan.c    # build it
$ ./pan -a               # run; -a enables acceptance-cycle / liveness search

A handful of flags matter. -a is what you almost always want: safety, assertions, and LTL liveness (including the acceptance cycles that show up as deadlocks). -l checks non-progress cycles instead and needs cc -DNP at build time.
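
When the three commands are run repeatedly, a small wrapper is convenient. The sketch below only assembles the command lines quoted above (spin -a, cc -O2, ./pan -a, and the -DNP / -l pair for non-progress cycles); the function names are hypothetical, and run_verifier assumes spin and cc are on the PATH.

```python
import subprocess


def verifier_commands(model, liveness=True, non_progress=False):
    """Build the spin / cc / pan command lines for one Promela model."""
    cc = ["cc", "-O2", "-o", "pan", "pan.c"]
    pan = ["./pan"]
    if non_progress:
        cc.insert(1, "-DNP")      # non-progress cycle search needs -DNP at build time
        pan.append("-l")
    elif liveness:
        pan.append("-a")          # acceptance-cycle (LTL liveness) search
    return [["spin", "-a", model], cc, pan]


def run_verifier(model, **kwargs):
    """Run the three steps in order and return pan's stdout; raises if a step fails."""
    output = ""
    for cmd in verifier_commands(model, **kwargs):
        output = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return output


for cmd in verifier_commands("model.pml"):
    print(" ".join(cmd))
```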

Lowering the kernel to Promela

Once you know what to keep and what to drop, the translation from the running example to Promela is fairly mechanical. The per-thread loop in the kernel maps directly to a proctype Thread in Promela:

CuTe DSL kernel (per thread)

# 256 threads per cluster, N_ITERS rounds
phase = 0
for tile in range(N_ITERS):
  # ... stash ...
  cute.arch.fence_view_async_shared()
  cute.arch.mbarrier_arrive(gate_mbar)
  mbarrier_arrive_peer(gate_mbar, peer)
  cute.arch.mbarrier_wait(gate_mbar, phase)
  phase ^= 1

Promela proctype

proctype Thread(int cta; int tid) {
  int my_phase = 0;
  int iter = 0;
  int peer = 1 - cta;
  do
  :: iter < ITERS ->
       do_arrive(cta);   /* local */
       do_arrive(peer);  /* peer */
       do_wait(cta, my_phase);
       my_phase = 1 - my_phase;
       iter = iter + 1;
  :: else -> break
  od;
  iters_completed[cta] =
    iters_completed[cta] + 1;
}

The substitutions, primitive by primitive:

Hardware mbarrier (64-bit word in SMEM packing arrival count, expected count, and phase) → two Promela ints, mbar_count[NCTAS] and mbar_phase[NCTAS], plus a constant EXPECTED_ARRIVALS. The hardware atomicity is preserved by wrapping every update in atomic { }.
mbarrier_arrive(ptr) → the do_arrive(cta) inline shown earlier. Increment the count atomically; if it hits expected, reset to 0 and flip phase.
mbarrier_wait(ptr, P) → do_wait(cta, P), i.e. atomic { mbar_phase[cta] != P -> skip; }. A guarded atomic that blocks the proctype until the phase differs.
Peer arrive via mapa_shared_cluster + mbarrier_arrive(remote, space="cluster") → the same do_arrive primitive, on the peer’s array slot. The model captures the protocol, not the address translation; mapa is a deterministic function that picks which CTA’s mbarrier gets incremented, so we encode it directly as do_arrive(peer).
128 threads per CTA → 4 proctypes per CTA. For this protocol the correctness doesn’t change with width; shrinking the count keeps the state space tractable.
~30 tile iterations → ITERS = 3. Three iterations take the tracked phase through a full cycle and back (iter 0 waits on phase 0, iter 1 on phase 1, iter 2 on phase 0 again). A model with a single iteration would pass over the hardcoded-phase bug without seeing it, because the first wait is on phase 0 either way.
Persistent loop → do :: iter < ITERS -> ...; iter++ :: else -> break od.

Note that the model drops most of the code: only synchronization primitives and computations that affect control flow influence liveness.

$ spin -a kernel_model.pml && cc -O2 -o pan pan.c && ./pan -a
ltl all_done: <> (((iters_completed[0]==4)) && ((iters_completed[1]==4)))

State-vector 248 byte, depth reached 387, errors: 0
   830628 states, stored (1.66126e+06 visited)
  3735761 states, matched
  5397017 transitions (= visited+matched)

pan: elapsed time 2.07 seconds
pan: rate 802539.13 states/second

~830k states, ~5.4M transitions, exhaustively verified that the correct phase tracking never deadlocks at this size.
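
To show what "exhaustively verified" means mechanically, here is a toy explicit-state search in Python over the same protocol: two CTAs, a count/phase pair per CTA, and threads that arrive locally, arrive on the peer, then wait. It is a sketch under simplifying assumptions (breadth-first search, no partial-order reduction, deadlock detected as a stuck state that is not finished) and is not a substitute for SPIN; all names are made up for the example.

```python
from collections import deque


def arrive(counts, phases, slot, expected):
    """Atomic arrive: bump the count; on completion reset it and flip the phase."""
    counts, phases = list(counts), list(phases)
    counts[slot] += 1
    if counts[slot] == expected:
        counts[slot] = 0
        phases[slot] ^= 1
    return tuple(counts), tuple(phases)


def successors(state, threads_per_cta, iters, buggy):
    """Yield every state reachable by letting one thread take one step."""
    counts, phases, threads = state
    expected = 2 * threads_per_cta
    for i, (pc, it, my_phase) in enumerate(threads):
        cta = i // threads_per_cta
        if pc == 0:                                   # local arrive
            c, p = arrive(counts, phases, cta, expected)
            new = (1, it, my_phase)
        elif pc == 1:                                 # peer arrive
            c, p = arrive(counts, phases, 1 - cta, expected)
            new = (2, it, my_phase)
        elif pc == 2:                                 # wait: a guard, may block
            waited_on = 0 if buggy else my_phase
            if phases[cta] == waited_on:
                continue
            c, p = counts, phases
            new = (0 if it + 1 < iters else 3, it + 1, my_phase ^ 1)
        else:                                         # pc == 3: finished
            continue
        yield c, p, threads[:i] + (new,) + threads[i + 1:]


def finds_deadlock(threads_per_cta, iters, buggy):
    """Search all interleavings; True if some unfinished state has no successor."""
    start = ((0, 0), (0, 0), ((0, 0, 0),) * (2 * threads_per_cta))
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        nexts = list(successors(state, threads_per_cta, iters, buggy))
        if not nexts and any(pc != 3 for pc, _, _ in state[2]):
            return True                               # blocked, not finished
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(finds_deadlock(2, 3, buggy=False), finds_deadlock(1, 2, buggy=True))
```

With tracked phases the search finds no stuck state at two threads per CTA and three iterations. With the wait hardcoded to phase 0 it finds one already at one thread per CTA and two iterations, while a single iteration passes.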

What SPIN can check

Safety: [] P. P never becomes false. Use for invariants like “no data race” or “the consumer never reads before the producer published”. SPIN proves these by depth-first search; on success the full reachable state space has been visited. On failure SPIN gives one offending state and the path to it.
Liveness: <> P. P eventually holds on every infinite execution. Use for “the protocol completes”. Failure looks different from safety: SPIN finds an acceptance cycle, an infinite loop in the state graph that never satisfies P, which would manifest as the deadlock.
Fairness: []<> P, with -f. P happens infinitely often, assuming the scheduler is weakly fair to every proctype. Worth reaching for when a counterexample involves a thread that’s simply never scheduled: fairness lets you distinguish a real protocol break from one that only fails under an adversarial scheduler.

How fast does this blow up?

State-space size is exponential in the protocol dimensions you choose. We swept the correct Promela model of the running example across iteration counts (TILES_PER_CTA, i.e. N_ITERS) and thread counts (THREADS_PER_CTA); the numbers below come from one laptop run, with the bottom row added from the standalone three-iteration baseline:

TILES   THREADS/CTA   states stored   transitions   wall
  1         2             1,137          5,476       <0.01s
  1         3            22,082        123,937        0.04s
  1         4           416,831      2,564,578        0.95s
  1         5         7,701,536     50,393,599         23s
  2         2             4,424         21,907        0.01s
  2         3           196,051      1,117,014        0.38s
  2         4         9,500,414     58,705,617         26s
  3         4        18,583,997    114,846,660         52s   (baseline)

In the TILES = 1 rows, adding a thread per CTA multiplies stored states by roughly 19×; in the TILES = 2 rows the factor is closer to 45×. The (3, 4) row is the full verification baseline used in the case studies and is included for context; the rows above it are a clean sweep at smaller sizes. The protocol’s shape (phase parity, count-to-expected, peer arrive via mapa) is exercised the same way at 2×4×3 as at 2×128×30: the bug class doesn’t need 128 threads to manifest. Width-invariance holds for the phase-flip family; for racier bug classes (such as the no_race property in Case 2) shrinking the thread count can hide the violation. When in doubt, scale up until the verifier struggles, then back off.
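
The per-thread growth factors can be recomputed directly from the sweep table. The snippet below copies the "states stored" column for the TILES = 1 and TILES = 2 rows and prints the ratio between consecutive thread counts; the dictionary layout and function name are just for this example.

```python
# (tiles, threads_per_cta) -> states stored, copied from the sweep table
states = {
    (1, 2): 1_137, (1, 3): 22_082, (1, 4): 416_831, (1, 5): 7_701_536,
    (2, 2): 4_424, (2, 3): 196_051, (2, 4): 9_500_414,
}


def growth_per_thread(tiles):
    """Ratio of stored states each time THREADS/CTA grows by one."""
    widths = sorted(t for (k, t) in states if k == tiles)
    return [states[tiles, b] / states[tiles, a] for a, b in zip(widths, widths[1:])]


for tiles in (1, 2):
    print(tiles, [round(r, 1) for r in growth_per_thread(tiles)])
```

The factor depends on the iteration count: it stays just under 20× for one tile and lands in the mid-to-high 40s for two, so extrapolating from the one-tile rows underestimates the cost of wider models at larger iteration counts.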

Case 1: the hardcoded-phase deadlock

In the running example, the variant we shipped first had the wait line written like this:

buggy — what we shipped

# wait phase hardcoded
cute.arch.mbarrier_wait(
    gate_mbar,
    cutlass.Int32(0),  # ← never toggles
)

correct — what we wanted

# wait phase tracked per iter
cute.arch.mbarrier_wait(
    gate_mbar,
    phase,
)
phase = phase ^ cutlass.Int32(1)

At small iteration counts the kernel only does one round per CTA, so the phase never matters and the buggy kernel works by luck. At larger counts (roughly 30 tile iterations per cluster in production) the mbarrier’s phase has flipped twice by the end of the second round and is back to 0, so the hardcoded wait(0) blocks forever.

In the corresponding Promela models, the buggy and correct variants differ in exactly one argument: do_wait(cta, 0) (buggy) vs do_wait(cta, my_phase) followed by my_phase = 1 - my_phase (correct). The lowering section above shows the full proctype; what we care about here is what the verifier reports.

What SPIN says

On the buggy model:

$ spin -a mbar_protocol_bug.pml && cc -O2 -o pan pan.c && ./pan -a
ltl all_done: <> (((iters_completed[0]==4)) && ((iters_completed[1]==4)))
pan:1: acceptance cycle (at depth 305)
pan: wrote mbar_protocol_bug.pml.trail

(Spin Version 6.5.2 -- 6 December 2019)
        + Partial Order Reduction

Full statespace search for:
        never claim         + (all_done)
        assertion violations + (if within scope of claim)
        acceptance   cycles + (fairness disabled)

State-vector 244 byte, depth reached 306, errors: 1
       99 states, stored
       99 transitions (= stored+matched)

pan: elapsed time 0 seconds

99 states, sub-second wall clock, one acceptance cycle that violates the liveness claim. The accompanying .trail file replays the exact interleaving.

Replaying the trail with spin -t -p mbar_protocol_bug.pml ends in the state that explains the deadlock:

$ spin -t -p mbar_protocol_bug.pml  (tail)
303:    proc  1 (Thread:1) line 26  [mbar_count[peer] = (mbar_count[peer]+1)]
304:    proc  1 (Thread:1) line 31  [else]
305:    proc  1 (Thread:1) line 31  [(1)]
  <<<<<START OF CYCLE>>>>>
spin: trail ends after 307 steps
#processes: 6
                mbar_count[0]   = 1
                mbar_count[1]   = 1
                mbar_phase[0]   = 0
                mbar_phase[1]   = 0
                iters_completed[0] = 0
                iters_completed[1] = 3
307:    proc  5 (Thread:1) line 36 (state 27)   // stuck in do_wait(cta, 0)
307:    proc  4 (Thread:1) line 36 (state 27)
307:    proc  3 (Thread:1) line 36 (state 27)
307:    proc  2 (Thread:1) line 36 (state 27)
307:    proc  1 (Thread:1) line 36 (state 27)

Three of CTA 1's four threads got through all iterations (lucky phase parity); CTA 0's threads are all wedged in do_wait(cta, 0) with mbar_phase[0] == 0. Same end state as the GPU hang, except here it's ~300 lines of trail to scroll through instead of silence. (The :1 in Thread:1 is SPIN's instance counter for the proctype template, not the cta argument; the actual cta value lives in the mbar_phase array index above.)

On the fixed model:

$ spin -a mbar_protocol.pml && cc -O2 -o pan pan.c && ./pan -a
ltl all_done: <> (((iters_completed[0]==4)) && ((iters_completed[1]==4)))

Full statespace search for:
        never claim         + (all_done)
        acceptance   cycles + (fairness disabled)

State-vector 272 byte, depth reached 547, errors: 0
 18583997 states, stored (3.7168e+07 visited)
 77678662 states, matched
1.1484666e+08 transitions (= visited+matched)
hash conflicts:  23239621 (resolved)

Stats on memory usage (in Megabytes):
 5594.538       total actual memory usage

pan: elapsed time 52.4 seconds
pan: rate 708771.82 states/second

~18M states stored, ~115M transitions, 52 seconds at 709k states/sec, no liveness violation. Deadlock-free at the modeled scale; the "forgot to toggle phase" bug class would have been caught on the first ./pan run.

Case 2: the missing-acquire race

Case 1 was within a single cluster of two CTAs, which share SMEM. A different protocol shows up when producer and consumer live in different clusters and coordinate through a GMEM atomic counter instead of an mbarrier.

To make this concrete, suppose we extend the Jacobi smoother to a small two-stage pipeline. A first set of producer clusters each smooths a separate array (one cluster per array) and writes the result to GMEM. A second consumer cluster on the same GPU reads the smoothed arrays and computes some summary (say, the sum of all of them). Logically the consumer runs after the producers, but on the hardware they execute concurrently, so the consumer must wait until each producer’s writes are visible before reading them.

The DeepGEMM idiom for this is one int32 counter per produced array. When a producer finishes its array, it does an atomic release-add of 1 to its counter. The consumer spins on an acquire-load of the same counter until the count reaches the expected value, then reads the array. In CuTe DSL the consumer side is a one-line acquire-load loop:

@cute.jit
def spin_until(counter_ptr, expected: cutlass.Constexpr):
    """Block until the GMEM counter at counter_ptr reaches `expected`.
 
    Producer side does
        cute.arch.atomic_add(counter_ptr, 1, sem="release", scope="gpu")
    so the consumer needs acquire+gpu semantics to observe cross-cluster writes.
    """
    while cute.arch.load(
        counter_ptr, cutlass.Int32,
        sem="acquire", scope="gpu",
    ) < cutlass.Int32(expected):
        pass

The failure mode here is a silent race if the consumer reads without waiting. In Promela this is a safety property: [] (safety_violated == 0), with safety_violated set when a consumer observes an array before the producer published it.

buggy — consumer reads without waiting

/* Producer: increment the per-array
 * counter when a tile is published. */
inline produce_tile(array_id) {
  atomic {
    counter[array_id] = counter[array_id] + 1;
  }
}
 
/* Consumer — BUG: read the counter
 * without spinning. If the array isn't
 * fully published yet, flag a violation. */
for (a : 0 .. N_ARRAYS - 1) {
  if
  :: counter[a] < TILES_PER_ARRAY ->
       safety_violated = 1
  :: counter[a] >= TILES_PER_ARRAY ->
       skip
  fi;
}

correct — consumer spins on the counter

/* Producer: same as the buggy side. */
inline produce_tile(array_id) {
  atomic {
    counter[array_id] = counter[array_id] + 1;
  }
}
 
/* Consumer: spin until the counter
 * reaches the expected tile count. */
inline wait_for_array(array_id) {
  atomic {
    counter[array_id] >= TILES_PER_ARRAY
        -> skip;
  }
}
 
for (a : 0 .. N_ARRAYS - 1) {
  wait_for_array(a);
}

The buggy version models "what if the consumer just reads the counter without spinning?". The correct one models the GPU's acquire-load spin as an atomic Promela guard on the per-array counter. The bug is consumer-side; the producer is the same in both files.
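
The same two variants can be explored with a few lines of Python. This sketch assumes one producer per array that publishes its tiles one atomic increment at a time and a single consumer that walks the arrays in order; like the Promela model, it assumes sequentially consistent memory, so it says nothing about acquire/release scopes. The function name and state layout are illustrative.

```python
from collections import deque


def race_reachable(n_arrays, tiles, consumer_waits):
    """Explore all producer/consumer interleavings; True if the consumer can
    observe an array whose counter is still below `tiles`."""
    # State: (per-array counters, index of the next array the consumer reads).
    start = ((0,) * n_arrays, 0)
    seen, queue = {start}, deque([start])
    while queue:
        counters, next_array = queue.popleft()
        nexts = []
        # Producer a publishes one more tile (an atomic increment).
        for a in range(n_arrays):
            if counters[a] < tiles:
                bumped = counters[:a] + (counters[a] + 1,) + counters[a + 1:]
                nexts.append((bumped, next_array))
        # Consumer step on the next array it has not consumed yet.
        if next_array < n_arrays:
            published = counters[next_array] >= tiles
            if not published and not consumer_waits:
                return True            # read before publish: safety violated
            if published:
                nexts.append((counters, next_array + 1))
            # otherwise the waiting consumer is blocked on its guard
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(race_reachable(2, 2, consumer_waits=False), race_reachable(2, 2, consumer_waits=True))
```

The non-waiting consumer can read in the very first state, before any producer has run, which is why the violation shows up after so few states.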

SPIN’s response, both directions:

$ ./pan -a   (buggy — consumer skips the wait)
ltl no_race: [] ((safety_violated==0))
pan:1: assertion violated  !( !((safety_violated==0))) (at depth 34)
pan: wrote pipeline_protocol_bug.pml.trail

State-vector 112 byte, depth reached 34, errors: 1
       12 states, stored
pan: elapsed time 0 seconds

The assertion violated !(!P) line is the print-quirk noted in the LTL primer: the verifier compiled the [] (safety_violated == 0) claim into a never-claim and found a witness for its negation. No actual assert() involved.

$ ./pan -a   (correct — consumer spins)
ltl all_consumed: <> ((consumed==2))

Full statespace search for:
        never claim         + (all_consumed)
        assertion violations + (if within scope of claim)
        acceptance   cycles + (fairness disabled)

State-vector 104 byte, depth reached 92, errors: 0
     1114 states, stored (2228 visited)
     2057 states, matched
     4285 transitions (= visited+matched)

pan: elapsed time 0 seconds

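12 states to flag the race; 1114 states to verify the fix. Both finish in well under a second of wall clock. Note that the two runs check different claims: no_race (safety) on the buggy model and all_consumed (liveness) on the fixed one.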

Towards automatic CuTe to Promela lowering

Every model shown so far was hand-written (well, Claude-written) from CuTe DSL source. A natural follow-up question is whether a tool could read the kernel and emit Promela. The translation rules for individual primitives are mechanical; the main challenge is picking the subset of the kernel language to translate. We have implemented a simple proof-of-concept conversion library in the github.com/cheshire/cute2promela repository.

The mechanical part: a pattern library

The CuTe DSL surface for synchronization is small. Roughly ten to fifteen call-site patterns cover everything in our codebase, and each maps to a fixed Promela fragment:

cute.arch.mbarrier_init(ptr, expected=N) → declare EXPECTED_ARRIVALS = N, initialize mbar_count[slot] = 0, mbar_phase[slot] = 0.
cute.arch.mbarrier_arrive(ptr) → do_arrive(slot_of(ptr)).
cute.arch.mapa_shared_cluster(ptr, peer) followed by mbarrier_arrive(remote, space="cluster") → do_arrive(peer_slot), with peer_slot derived from the same allocation as ptr.
cute.arch.mbarrier_wait(ptr, P) → do_wait(slot_of(ptr), P); toggle P per iteration as in the source.
cute.arch.atomic_add(ctr, 1, sem="release", scope="gpu") → atomic { counter[slot] = counter[slot] + 1 } (Promela has no += operator).
cute.arch.load(ctr, sem="acquire", scope="gpu") in a spin loop with bound N → atomic { counter[slot] >= N -> skip }.
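
A pattern library of this kind can be as simple as a table of (pattern, template) pairs. The sketch below is a line-based, regex-driven toy that covers three of the patterns above; it is not the implementation in the repository (which works on the AST), and the table contents and function name are hypothetical.

```python
import re

# Hypothetical pattern table: (regex over one source line, Promela template).
# {1} refers to the second captured argument of the call.
PATTERNS = [
    (re.compile(r"mbarrier_arrive_peer\((\w+),\s*(\w+)\)"), "do_arrive({1});"),
    (re.compile(r"cute\.arch\.mbarrier_arrive\((\w+)\)"), "do_arrive(cta);"),
    (re.compile(r"cute\.arch\.mbarrier_wait\((\w+),\s*(\w+)\)"), "do_wait(cta, {1});"),
]


def lower_line(line):
    """Return the Promela fragment for one source line, or None if it is datapath."""
    for pattern, template in PATTERNS:
        match = pattern.search(line)
        if match:
            return template.format(*match.groups())
    return None


source = [
    "cute.arch.fence_view_async_shared()",
    "cute.arch.mbarrier_arrive(gate)",
    "mbarrier_arrive_peer(gate, peer)",
    "cute.arch.mbarrier_wait(gate, phase)",
]
lowered = [lower_line(line) for line in source]
print(lowered)
```

Lines that match no pattern come back as None and are left for the slicing step to keep or drop.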

Using dataflow to pick the translated subset

A standard backward slice from the sync call sites picks out exactly the lines a Promela model needs. The recipe is:

Mark every statement whose op is a sync primitive (mbarrier_*, release/acquire atomic_*, fences) or a cluster-topology call (mapa_shared_cluster, cluster_rank_in_cluster, thread_idx_x, elect_one) as a slice seed.
Close transitively backward through data and control dependences: a statement is kept if any value it defines is read by an already-kept statement, or guards one.
Everything else is datapath. Drop it.
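
The recipe above fits in a few lines of Python. The sketch below runs it on a cut-down, hand-written IR in which each statement lists the names it defines, the names it reads, and the guard names it executes under. The tuple layout, the seed set, and the flow-insensitive treatment of definitions are simplifications chosen for this example rather than a description of the repository's slicer.

```python
def backward_slice(stmts, seeds):
    """Return indices of statements kept by reverse reachability from `seeds`.

    Each statement is (text, defs, uses, guards): names it defines, names it
    reads, and names of the conditions it executes under.
    """
    kept = set(seeds)
    while True:
        needed = set()
        for i in kept:
            _, _, uses, guards = stmts[i]
            needed.update(uses, guards)
        added = {
            i for i, (_, defs, _, _) in enumerate(stmts)
            if i not in kept and needed.intersection(defs)
        }
        if not added:
            return kept
        kept |= added


IR = [
    ("tid = thread_idx_x()",              ["tid"],       [],                []),
    ("cur = smem.alloc_array(...)",       ["cur"],       [],                []),
    ("gate = smem.alloc_mbarrier()",      ["gate"],      [],                []),
    ("phase = Int32(0)",                  ["phase"],     [],                []),
    ("is_tid0 = (tid == 0)",              ["is_tid0"],   ["tid"],           []),
    ("my_edge = cur[...]",                ["my_edge"],   ["cur"],           ["is_tid0"]),
    ("peer_halo = mapa_shared_cluster()", ["peer_halo"], [],                ["is_tid0"]),
    ("mbarrier_arrive(gate)",             [],            ["gate"],          []),
    ("mbarrier_wait(gate, phase)",        [],            ["gate", "phase"], []),
    ("phase = phase ^ Int32(1)",          ["phase"],     ["phase"],         []),
    ("nxt[tid] = avg(cur)",               ["nxt"],       ["cur", "tid"],    []),
]
SEEDS = {0, 6, 7, 8}          # topology and sync call sites, picked by hand here

kept = backward_slice(IR, SEEDS)
for i, (text, *_rest) in enumerate(IR):
    print(f"{i:2d}  {'yes' if i in kept else ' - '}  {text}")
```

On this toy IR the slice keeps the barrier allocation, the phase variable and its toggle, and the is_tid0 guard, and drops the tile allocation, the edge read, and the stencil write.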

To see what this produces on a kernel we’ve already discussed, we ran the slice on a hand-built IR for the Jacobi smoother from earlier. The IR has 29 statements; the slicer keeps 16 and drops 13. The complete output:

$ python scripts/slice_jacobi_to_promela.py
input statements : 29
slice kept       : 16
slice dropped    : 13

 id  kept  op
 --  ----  ----------------------------------------------------------------
  0   yes  my_cta = cluster_rank_in_cluster()
  1   yes  peer = 1 - my_cta
  2   yes  tid = thread_idx_x()
  3    -   cur = smem.alloc_array(...)
  4    -   nxt = smem.alloc_array(...)
  5   yes  halo = smem.alloc_array(...)
  6   yes  gate = smem.alloc_mbarrier()
  7   yes  elected = elect_one()
  8   yes  mbarrier_init(gate, expected=256)
  9   yes  mbarrier_init_fence()
 10    -   cur[tid] = arr_ptr[my_cta*HALF + tid]
 11   yes  phase = Int32(0)
 12   yes  is_tid0 = (tid == 0)
 13    -   my_edge = cur[HALF-1] if my_cta==0 else cur[0]
 14   yes  peer_halo = mapa_shared_cluster(halo, peer)
 15    -   peer_halo[0] = my_edge
 16   yes  fence_view_async_shared()
 17   yes  mbarrier_arrive(gate)
 18   yes  mbarrier_arrive_peer(gate, peer)
 19   yes  mbarrier_wait(gate, phase)
 20   yes  phase = phase ^ Int32(1)
 21    -   left  = ... cur[tid-1] / halo[0] / cur[tid] ...
 22    -   right = ... cur[tid+1] / halo[0] / cur[tid] ...
 23    -   sum1 = left + cur[tid]
 24    -   sum2 = sum1 + right
 25    -   avg = sum2 / 3.0
 26    -   nxt[tid] = avg
 27    -   cur, nxt = nxt, cur
 28    -   arr_ptr[my_cta*HALF + tid] = cur[tid]

The slice keeps the sync calls, the phase variable, the topology calls (cluster_rank_in_cluster, thread_idx_x, elect_one, mapa_shared_cluster), the mbarrier and halo allocations they point at, and the is_tid0 guard that the cross-CTA write hangs off. It drops the entire stencil computation, the two double-buffered tile allocations (cur/nxt), the GMEM load and store, and the per-iteration edge stash. The kept set is identical to what we wrote by hand in the lowering section above, modulo the data-vs-control variable that one of us would have inlined.

The slice doesn’t need any sync-specific reasoning: it is plain reverse reachability on a use-def graph, seeded from a list of names.

Two consequences worth naming:

Loop bounds tied to tensor dimensions don’t make it into the slice unless they gate sync. The persistent loop counter survives because mbarrier_wait sits inside it; the inner stencil bounds don’t because they only gate the arithmetic. The lowering still has to pick a small concrete value for the surviving iteration count (3, in our case), but the slice tells it exactly which loops require that choice.
Conditional sync is handled by the same mechanism. A warp-specialized region whose body contains an mbarrier_arrive drags the warp-index test into the slice through the control dependence; a warp-specialized region whose body is pure compute is dropped wholesale. The slicer doesn’t need to know what warp specialization means, just that an if warp_idx == 2 guard defines a control value used by a kept statement.
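
The conditional-sync case can be sketched by giving each statement an optional guard name and treating that guard as one more use. This is an illustration under assumptions, not the slicer from the repository: the tuple layout, the variable names is_w2 and is_w3, and the six-statement IR are invented for the example.

```python
# Each entry: (text, defs, uses, guard, is_seed).
# guard names the branch condition the statement sits under, or None.
IR = [
    ("gate = smem.alloc_mbarrier()",     {"gate"},     set(),            None,    False),
    ("warp_idx = cute.arch.warp_idx()",  {"warp_idx"}, set(),            None,    False),
    ("is_w2 = warp_idx == 2",            {"is_w2"},    {"warp_idx"},     None,    False),
    ("is_w3 = warp_idx == 3",            {"is_w3"},    {"warp_idx"},     None,    False),
    ("if is_w2: mbarrier_arrive(gate)",  set(),        {"gate"},         "is_w2", True),
    ("if is_w3: acc = acc * scale",      {"acc"},      {"acc", "scale"}, "is_w3", False),
]


def reads(stmt):
    """Names a statement depends on: its data uses plus its guard, if any."""
    _, _, uses, guard, _ = stmt
    return uses | ({guard} if guard else set())


def slice_with_guards(ir):
    """Backward slice in which control dependence is one more use edge."""
    kept = {i for i, s in enumerate(ir) if s[4]}
    needed = set()
    for i in kept:
        needed |= reads(ir[i])
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for i, s in enumerate(ir):
            if i not in kept and s[1] & needed:
                kept.add(i)
                needed |= reads(s)
                changed = True
    return sorted(kept)


if __name__ == "__main__":
    kept = slice_with_guards(IR)
    for i, s in enumerate(IR):
        print("keep" if i in kept else "drop", s[0])
```

The guarded arrive pulls in is_w2 and, through it, warp_idx; the guarded multiply and its is_w3 test are reached by nothing and fall out together, with no warp-specialization logic in the slicer itself.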

A working proof of concept of the conversion pipeline is at github.com/cheshire/cute2promela: an ast-based slicer and lowerer that takes the Jacobi smoother source from earlier as input, emits a Promela model, and round-trips it through SPIN. SPIN reports errors: 0 on the correct variant and an acceptance cycle on the hardcoded-phase variant. The tool supports a single protocol shape and isn’t a general-purpose CuTe analyzer; the “What this doesn’t do” section of the repo’s README is part of the design rather than a TODO.

Modeling rules

Model the protocol shape, not the dimensions. 2 CTAs × 4 threads × 3 iters exercises the same interleavings as 2 CTAs × 128 threads × 30 iters. The bug class doesn’t depend on width; making the model wider trades verification time for nothing.
Persistent kernels need at least three iterations. The hardcoded-phase bug only appears at iter 2, the third pass through the loop. Models with fewer iterations pass without ever exercising the bug.
Write the buggy twin alongside the correct model. Each *.pml file gets a *_bug.pml sibling that reintroduces the failure the model was built to rule out. It’s a regression test for the model itself: if SPIN stops finding the bug in the buggy file, the model has drifted and isn’t checking what we think it is.
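
The buggy-twin rule is easy to automate as a regression check. The sketch below assumes spin and a C compiler are on PATH, that each model carries its own LTL property, and that the verifier summary contains an errors: N field; the helper names and the exact flags are illustrative choices, not the commands used for the results in this post.

```python
import re
import subprocess
from pathlib import Path


def error_count(pan_output):
    """Extract N from the 'errors: N' field of the verifier summary."""
    m = re.search(r"errors:\s*(\d+)", pan_output)
    if m is None:
        raise ValueError("no 'errors: N' field in verifier output")
    return int(m.group(1))


def twin_ok(correct_output, bug_output):
    """True only if the correct model is clean AND the twin still fails."""
    return error_count(correct_output) == 0 and error_count(bug_output) > 0


def run_spin(model):
    """Generate, compile, and run the verifier for one .pml file."""
    workdir = model.parent
    subprocess.run(["spin", "-a", model.name], cwd=workdir, check=True)
    subprocess.run(["cc", "-o", "pan", "pan.c"], cwd=workdir, check=True)
    # -a asks pan to search for acceptance cycles (liveness violations).
    done = subprocess.run(["./pan", "-a"], cwd=workdir,
                          capture_output=True, text=True)
    return done.stdout


def check_pair(correct_model):
    """Check foo.pml against its sibling foo_bug.pml."""
    bug_model = correct_model.with_name(correct_model.stem + "_bug.pml")
    return twin_ok(run_spin(correct_model), run_spin(bug_model))
```

Calling check_pair(Path("models/jacobi.pml")) on a hypothetical models directory would run both files in turn. A False result means either the correct model regressed or the twin stopped failing, and the second case is the drift this rule is meant to catch.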

Why not `compute-sanitizer --tool synccheck`?

compute-sanitizer is a runtime tool: it instruments the kernel at launch and watches the execution that actually happens. It catches some real classes of warp-divergent barrier misuse, and when we’ve had a hang reproduce under it the diagnostic has been useful. It is not, however, a substitute for static checking: it requires hardware (and on many cloud and managed-GPU setups, additional permissions or a driver mode the platform doesn’t expose by default), only sees the one interleaving that the scheduler happened to take on that run, and finding no error under one launch is not evidence that other interleavings are safe.

References

Holzmann, G. J. (2003). The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley. — Canonical reference for SPIN and Promela; covers the language, the verifier internals, and the LTL fragment in detail.
Holzmann, G. J. (1997). “The Model Checker SPIN”. IEEE Transactions on Software Engineering 23(5): 279–295. PDF — The original journal paper; freely available, much shorter than the book.
NVIDIA. Parallel Thread Execution ISA, Parallel Synchronization and Communication Instructions. — Authoritative reference for mbarrier.init/arrive/wait/try_wait.parity semantics, including the count + expected + phase state machine we model.
NVIDIA. CUDA C++ Programming Guide, Thread Block Clusters. — Cluster semantics, DSMEM, mapa.shared::cluster, and the cluster-scoped variants of mbarrier ops.
DeepSeek-AI. DeepGEMM. — Source of the GMEM atomic-counter pattern used in Case 2 and discussed in the cross-cluster section.
NVIDIA. Compute Sanitizer User Manual. — Documentation for compute-sanitizer --tool synccheck and the other runtime tools discussed in the “Why not compute-sanitizer?” section.

Conclusion

We have shown how SPIN/Promela can be used for proving liveness properties, and presented a proof-of-concept tool for lowering from CuTe DSL to Promela automatically.
