I had a working order matching engine. 2 million orders. Sub-microsecond latency. It worked. But it was built wrong. Here’s how I found out, and what I did about it.


The Engine That Worked (But Shouldn’t Have)

A few weeks ago, I built a high-frequency trading matching engine in C++. The architecture was textbook:

  • Bids: std::map<price, std::deque<Order*>> (descending)
  • Asks: std::map<price, std::deque<Order*>> (ascending)
  • Lookups: std::unordered_map<id, Order*>

It had all the right features - price-time priority, partial fills, cancel/modify, custom memory pool. I benchmarked it at ~800K orders/sec. I posted it on LinkedIn. I was proud.

Then I started reading about how real exchanges work. And I realized my entire data structure layer was wrong.


The Problem: std::map Is a Cache Killer

std::map in C++ is a red-black tree. Each node is a separate heap allocation, scattered randomly in memory. Every time you traverse the tree - which happens on every insert and every match - you’re chasing pointers to random memory locations.

On modern CPUs, a cache miss costs ~100 nanoseconds. A cache hit costs ~1 nanosecond. That’s a 100x difference.

My “elegant” O(log n) tree was actually slower than a brute-force array scan, because every tree node was a cache miss.

std::map traversal:
  Node A (0x7f2a00) -> cache MISS
    -> Node B (0x7f5c80) -> cache MISS
      -> Node C (0x7f1240) -> cache MISS
        -> Found it! But we paid 300ns in cache misses.
 
Flat array lookup:
  levels[price - min_price] -> cache HIT, done. 1ns.

This is the thing nobody tells you in algorithms class: O(1) with bad cache behavior can be slower than O(log n) with good cache behavior. But O(1) with good cache behavior? Nothing beats it.


The Second Problem: O(n) Cancel

My cancel operation used std::deque:

// To cancel order #42, I had to scan the entire deque at that price level
dq.erase(std::remove(dq.begin(), dq.end(), order), dq.end());

That’s O(n) in the number of orders at that price level. In a real exchange, cancel is one of the most frequent operations. This was unacceptable.


The Third Problem: Heap Allocation on Every Match

std::vector<Trade> match() {
    std::vector<Trade> trades;  // <-- heap allocation EVERY CALL
    // ... matching logic
    return trades;
}

Every time the engine matched orders, it allocated a new std::vector on the heap. Over 2 million orders, that’s millions of malloc/free calls on the hot path. This is what causes p999 latency spikes - the allocator occasionally needs to request memory from the OS, and that’s catastrophically slow.


The Rebuild: v3.0

I rewrote the entire data structure layer. Here’s what changed:

Change 1: Flat Array Instead of Tree

Instead of a red-black tree, I use two flat arrays - one for bids, one for asks - indexed directly by price tick:

std::vector<PriceLevel> bid_levels_;  // bid_levels_[price - min_tick]
std::vector<PriceLevel> ask_levels_;  // ask_levels_[price - min_tick]

Insert is now O(1) - just index into the array. No tree traversal, no pointer chasing, no cache misses.

I track best_bid and best_ask as integers, updated lazily. BBO query is O(1).

Change 2: Intrusive Doubly-Linked Lists

I replaced std::deque with an intrusive doubly-linked list. Each Order struct carries its own prev and next pointers:

struct Order {
    uint64_t  id;
    int64_t   price_ticks;
    uint32_t  quantity;
    Side      side;
    OrderType type;
    Order*    prev = nullptr;  // Intrusive list
    Order*    next = nullptr;  // Intrusive list
};

Now cancel is O(1) - given a pointer to the order, just unlink it from the list. No search needed.

// O(1) cancel: just pointer swaps
void remove(Order* o) {
    if (o->prev) o->prev->next = o->next;
    else         head = o->next;
    if (o->next) o->next->prev = o->prev;
    else         tail = o->prev;
}

Change 3: Zero-Allocation Matching

Instead of returning a std::vector<Trade>, the match function takes a callback template:

template <typename Callback>
size_t submitOrder(Order* order, Callback&& on_trade) {
    // ... matching logic
    on_trade(Trade{id, buy_id, sell_id, price, qty});
    // No heap allocation. The compiler can even inline the callback.
}

Usage:

book.submitOrder(order, [](const Trade& t) {
    // Process trade - zero allocation
});

The compiler inlines the lambda, so this compiles down to a direct function call. Zero overhead. Zero allocation.


The Results

I benchmarked all three architectures on the exact same workload - 2 million orders, same RNG seed, same machine:

ArchitectureThroughputp50p99p999
A: Tree + Deque (v2)823K ops/s800 ns3,000 ns27,400 ns
B: Array + Deque (intermediate)1.27M ops/s500 ns2,200 ns10,800 ns
C: Array + Intrusive (v3)2.72M ops/s300 ns900 ns3,100 ns

Architecture C is 3.3x faster than Architecture A.

But look at the p999 column: from 27.4 microseconds down to 3.1 microseconds. That’s an 8.8x improvement in worst-case latency. In HFT, tail latency is everything - you’re only as fast as your slowest order.

What Caused the Speedup?

The speedup breaks down into two independent improvements:

  1. Tree → Flat Array (A → B): 1.55x - this is pure cache locality. Same algorithmic complexity for insert, but the flat array keeps data contiguous in memory.

  2. Deque → Intrusive List (B → C): 2.13x - this is O(n) cancel → O(1) cancel, plus eliminating std::deque’s internal heap allocations, plus zero-alloc matching via callback template.


What I Learned

  1. Cache locality > algorithmic complexity. A flat array with O(1) access that’s cache-friendly will demolish an O(log n) tree with scattered nodes in real-world performance.

  2. Intrusive data structures exist for a reason. They’re ugly compared to STL containers, but when you need O(1) removal and zero allocation, nothing else works.

  3. Templates are not just for generics. Using a callback template for matching eliminated heap allocation entirely. The compiler inlines it. This is the kind of C++ zero-cost abstraction that people talk about but rarely demonstrate.

  4. Measure everything. I would never have found these bottlenecks without benchmarking. “It works” and “it’s fast” are very different statements.


What’s Next

I’m currently writing a research paper comparing these three LOB architectures with hardware-level cache analysis (L1/L2/L3 miss rates using CPU performance counters). I’ll share it when it’s published.

The full source code is open: GitHub - Simple-HFT-Engine


If you found this useful, follow me for more deep dives into C++ systems programming, low-latency architecture, and high-performance computing.