Dispatch, all the way up: from vtables to effect handlers

Every call in a program is a small act of resolution. You name an operation, and something, somewhere, decides which code actually runs and how to reach it. Most of the time you don’t notice the decision because the compiler already made it. But the moment you allow the answer to depend on something the compiler can’t see, you’ve opened a door, and behind that door is a whole ladder of machinery built to answer the same question under progressively weaker assumptions.

The question never changes: given an operation, which code runs, and how do we get there? What changes is how late we’re willing to answer, and how much context we have to capture to answer it. Those are really two axes welded together. The first runs the length of the whole ladder; the second stays at zero for most of the climb and then, near the top, becomes the entire subject. I want to climb it in one go, because the rungs are usually taught as unrelated topics, and they aren’t. The whole thing collapses to five words: a fixed address, a table slot, a cached guess, a stack walk, a captured continuation.

A fixed address

The cheapest dispatch is the one that isn’t. When the target is known statically, the compiler emits a direct call to a fixed address, or it inlines the body and emits nothing at all. Intra-module targets are pinned at compile time; cross-module ones get fixed at link or load time, which is what the PLT and GOT are quietly doing on your behalf. There’s no decision left at runtime, so the branch predictor is never wrong and the call is as cheap as control flow gets. Everything above this rung is paying, in cycles or in code, for the flexibility of not knowing yet. Context captured: none.

A table slot

Relax the assumption that we know the receiver’s type, and we get virtual dispatch. Which method body runs now depends on the object’s runtime type, so each object carries a pointer to a vtable, and a virtual call is three steps:

; obj in rdi; call obj->method() which is vtable slot 2 (offset 16)
mov  rax, [rdi]        ; load vtable pointer (1st dependent load)
mov  rax, [rax + 16]   ; load method pointer from slot (2nd dependent load)
call rax               ; indirect branch to the resolved body

Two dependent loads and an indirect branch. Under single inheritance the slot index is a fixed compile-time offset, so the only real cost is that indirect branch, which leans on the branch-target predictor; a megamorphic site that sees many targets predicts poorly and stalls. Still no context captured at the call site, just one more level of pointer chasing than a fixed address.

The tidy fixed-offset story frays once you have more than one base. In the Itanium C++ ABI, an object with multiple or virtual bases carries several vtable pointers, one per polymorphic base subobject, and a call through a non-primary base may route through a thunk that fixes up this before jumping to the real body:

; thunk for Derived::f reached through a secondary base subobject
sub  rdi, 16           ; adjust `this` from base subobject to Derived*
jmp  Derived_f         ; tail-jump to the real method body

With non-virtual inheritance that adjustment is a hardcoded constant; with virtual inheritance the offset itself is read at runtime from vcall-offset slots. Interfaces force their own structures: Java uses itables, and Go uses itabs, a per (concrete type, interface type) table holding the two types and an ordered array of method pointers. A Go non-empty interface value is two words, the itab pointer and the data pointer, and the itab is built and cached at runtime, not known up front. The table slot is rarely just a slot.

A cached guess

If the type at a given call site tends to be the same one over and over, we can stop paying the table walk every time and instead memoize the answer right at the call site. That’s inline caching, from Deutsch and Schiffman’s 1984 Smalltalk-80 implementation, later generalized to polymorphic inline caches by Hölzle, Chambers, and Ungar in Self. Modern engines lean on it heavily: V8, SpiderMonkey, and HotSpot all cache by the receiver’s hidden class or shape (V8 calls them Maps, SpiderMonkey Shapes, JavaScriptCore Structures).

// load x.foo, cached for shape S at offset 24
if (x.shape == S)            // guard (one compare/branch)
    result = load(x, 24);    // cached fast path: direct field load
else
    result = ic_miss(x, "foo");  // patch the cache; maybe go polymorphic/megamorphic

Here is the first rung that captures anything: a shape guard. A site walks through states. Uninitialized, then monomorphic (one shape, and behind the guard it’s nearly a direct access), then polymorphic (a small set of shapes checked in sequence), then megamorphic once it’s seen too many; in V8 that’s more than four, at which point it gives up and falls back to a shared megamorphic cache (a global hash table keyed by map and name), with a full dictionary lookup only when the object itself is in dictionary mode. The deeper payoff is speculation: a monomorphic site lets the JIT inline the target behind a shape guard, with deoptimization as the escape hatch when the guard fails. The cache is no longer just an optimization of dispatch; it’s a guess the optimizer can build on.

A stack walk

The first three rungs all dispatched on the type of a receiver, under progressively later binding. Here the ladder turns a corner. Pattern matching steps sideways: it dispatches on a value’s discriminant instead. A match over a tagged union compiles either to a jump table when the tags are dense (one indexed indirect branch, O(1)) or to a balanced decision tree of comparisons when they’re sparse or structured. Maranget’s good decision trees (ML 2008) never test the same subterm twice and share common subtrees as a DAG, trading the risk of code-size blowup for speed; backtracking automata go the other way, smaller code at the cost of occasionally re-testing.

That lateral step reframes the whole question. We’re no longer asking how late we bind, but what we dispatch on. Once you ask that, type and data are just two answers, and the more interesting one is control.

Exceptions are the first rung where dispatch becomes a search over the call stack. A throw walks the stack outward until it finds the nearest dynamically enclosing handler whose type matches. Two implementation families dominate. Table-driven “zero-cost” unwinding (the Itanium two-phase scheme) consults per-function unwind tables and an LSDA through a language personality routine: a search phase locates the handler without unwinding any frames or running cleanups, then a cleanup phase unwinds frames and runs destructors down to it. It costs nothing on the non-throwing path and pays only on throw. The older setjmp/longjmp approach inverts that, paying on every entry to a protected region to register handler state.

The limitation is the point, because it’s exactly what the next rung removes. An exception only goes up. Handling it discards every frame between the throw and the handler; the interrupted computation is gone and cannot be resumed.

A captured continuation

Now keep the search and drop the discard. That one change is algebraic effect handlers. perform op does the same nearest-enclosing-handler search a throw does, but instead of throwing away the frames between the perform and the handler, the runtime reifies them as a delimited continuation k, the rest of the computation up to the handler, and hands k to the handler. The handler decides what runs, and whether and when control returns. This is the rung where “how much context to capture” finally switches on: it was effectively zero all the way up, and from here it is the whole story.

throw  : walk stack to nearest matching handler, DISCARD frames in between (upward only)
perform: walk stack to nearest matching handler, REIFY frames in between as continuation k,
         hand k to the handler, which may resume 0..n times

I think of perform as emitting an event and the handler as the listener that chooses the response, with one capability a normal event callback lacks: the handler can feed a value back into the suspended computation and let it keep going. That “the handler gets the resumption” is the entire difference between an exception and an effect, and it’s what turns restricted upward dispatch into general dynamic dispatch over control flow.

open Effect
open Effect.Deep

type _ Effect.t += Xchg : int -> int t

let comp () = perform (Xchg 0) + perform (Xchg 1)

let () =
  let r =
    match comp () with
    | v -> v
    | effect (Xchg n), k -> continue k (n + 1)
  in
  assert (r = 3)   (* Xchg 0 -> resume 1; Xchg 1 -> resume 2; 1 + 2 = 3 *)

The ladder has topped out here. The two sections that follow aren’t a sixth and seventh rung; they drill down into the final one, asking what it costs and where the handler actually lives.

How many times may the future run?

Here is the center of gravity, and it’s a cost question, not a semantic one: how many times may a continuation be resumed? That number sets the price.

One-shot continuations resume at most once. Because the captured stack slice never has to survive a second use, the runtime never has to copy it. This is the OCaml 5 design: continuations are one-shot by default, the restriction enforced by a dynamic check that raises if you resume twice. The program stack is a linked list of heap-allocated segments called fibers; a handler allocates a fresh fiber, and capturing a continuation copies nothing, because the fiber chunk is already on the heap. Resuming with continue or discontinue simply re-links that chunk onto the running stack, which gives genuinely cheap context switches entirely in userland. One-shot is enough for almost all concurrency: schedulers, async/await, generators, exceptions-plus-resume.

Multi-shot is where it gets expensive, and for a precise structural reason. If k may be invoked again, the captured fiber chunk must survive each use, so before every additional resumption you have to clone that chunk. Released OCaml 5 only ships one-shot continuations; multi-shot requires an external library, multicont, which clones the captured slice on demand:

(* one-shot continuations only resume once; multicont clones the chunk per resume *)
| effect Choose, k ->
    let r = Multicont.Deep.promote k in       (* make a multi-shot resumption *)
    let a = Multicont.Deep.resume r true in   (* explore one branch (clones, resumes) *)
    let b = Multicont.Deep.resume r false in  (* explore the other on a fresh clone *)
    a @ b

Multicont.Deep.resume clones the underlying continuation each time it runs, so the resumption survives repeated use. That clone is the whole cost of multi-shot, and multi-shot is exactly what powers backtracking search, nondeterminism, and probabilistic programming. The expressiveness and the expense arrive together: re-running the future means keeping a copy of it.

Where the handler actually lives

Underneath, handlers compile a few ways. A CPS or monadic translation turns perform into a call that passes the current continuation and turns handlers into ordinary functions; selective or segmented CPS converts only the effectful regions to keep the overhead off the rest. Native runtimes instead use segmented stacks or stack copying to capture the slice between perform and handler directly, no whole-program rewrite, which is what OCaml 5 fibers do. Both rest on delimited control:

handler   ~  reset / prompt   (installs a delimiter)
perform   ~  shift  / control (capture continuation up to nearest matching prompt)
nearest enclosing handler  ==  nearest enclosing prompt  ==  the dispatch decision

Multi-prompt delimited control matches multiple handlers, with the nearest matching prompt winning, which is the handler search again wearing different notation. Koka takes a fourth route, evidence passing (Xie and Leijen, ICFP 2021): handlers are threaded through the program as runtime evidence, so perform finds its handler via an evidence vector instead of walking the stack, monadically translated into plain lambda calculus and compiled to C with Perceus reference counting, needing no GC or special runtime at all. Notably, that scheme still supports multi-shot resumptions through yield bubbling, so one-shot is not a universal restriction; it’s a deliberate cost choice. The dispatch is still nearest-enclosing; it’s just precomputed and carried along rather than searched for.

The ladder in one sentence

Read the rungs back the way you climbed them and they’re all the same sentence with one more clause. A fixed address knows the answer. A table slot looks it up by type. A cached guess remembers it and bets on it. A decision tree branches on a value’s tag, turning the corner from type to data. A stack walk searches for it across frames. A captured continuation searches, then hands the searcher the rest of the program. Effect handlers are dynamic dispatch carried all the way to control flow, and the one place the runtime’s bill is itemized is the boundary between one-shot and multi-shot, where the cost is set by exactly one thing: how many times you might re-run the future.

If you’re reaching for effects in a new runtime, that’s the design knob worth holding onto. Default to one-shot and make the copy explicit, the way OCaml does. The moment a continuation can run twice, you’ve left the world of re-linking a single heap-allocated fiber and entered the world of cloning it before every extra resumption, and it’s worth knowing which side of that line your abstraction lives on before the profiler tells you.