SIMD Within a Register: How I Doubled Hash Table Lookup Performance
It started with a simple thought: four bytes in a hash table bucket look just like an integer. That one idea led to a deep dive into bit-twiddling and a 2x performance boost.
While working on a Cuckoo Filter implementation in C#, I created an array-like structure for the underlying hash table. I chose an 8-bit fingerprint: it aligns nicely on a byte boundary and still keeps the false-positive rate around 3%.
The layout looked straightforward—just a byte array where the start of each bucket is calculated as bucketIdx * bucketSize. Each bucket holds 4 slots, which is a solid choice for a Cuckoo Filter.
But those four bytes in a bucket reminded me of something. They feel like … an integer!
I wasn’t chasing ultra-low latency—after all, this is C#—but I couldn’t resist experimenting. Could I speed up lookups in my Cuckoo Filter by replacing the 4-byte bucket with a plain old 32-bit integer?
Time to find out. 💪
#A Flexible and Simple Implementation
Here’s the naïve storage I began with. To keep the rest of the code clean, all table logic lives in its own class, whose heart is a pre-allocated byte array:
private readonly byte[] _table;
Each bucket has 4 slots, so there was no need for a second dimension; mapping a key to its bucket is obvious:
var offset = bucketIdx * 4;
Checking whether a particular fingerprint (a single byte) sits in a bucket is just as obvious:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    var s = bucketIdx * 4;
    for (var i = s; i < s + BucketSize; i++)
    {
        if (_table[i] == fingerprint) return true;
    }
    return false;
}
(Please ignore the missing bounds checks—they’re not the point today.)
I’m no bit-twiddling wizard—I live in a cosy, GC-collected world—but those 4 bytes kept itching. A bucket lines up perfectly with a 32-bit uint; that opens the door to a future lock-free compare-and-swap. So I decided to play.
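Just as a teaser, once each bucket is a single uint, an atomic slot swap could look roughly like this. This is a hypothetical sketch, not the filter’s code: it assumes .NET 5+ (where Interlocked.CompareExchange gained a uint overload), and TryReplaceSlot is a name I made up:
// Hypothetical: atomically replace one fingerprint inside a uint bucket.
// Requires: using System.Threading; and .NET 5+ for the uint overload.
private bool TryReplaceSlot(uint bucketIdx, int slot, byte oldFp, byte newFp)
{
    var shift = slot * 8;
    var expected = _table[bucketIdx];
    // Bail out if the slot doesn’t currently hold the fingerprint we expect.
    if ((byte)(expected >> shift) != oldFp) return false;
    var desired = (expected & ~(0xFFu << shift)) | ((uint)newFp << shift);
    return Interlocked.CompareExchange(ref _table[bucketIdx], desired, expected) == expected;
}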
#Migrating to a uint Table
Switching the backing array is trivial because the hash table is still one-dimensional:
- private readonly byte[] _table;
+ private readonly uint[] _table;
Now let’s refactor the lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
-   var s = bucketIdx * 4;
+   var bucket = _table[bucketIdx];
-   for (var i = s; i < s + BucketSize; i++)
+   for (var i = 0; i < BucketSize; i++)
    {
-       if (_table[i] == fingerprint) return true;
+       var shift = i * 8;
+       var fp = (byte)(bucket >> shift);
+       if (fp == fingerprint) return true;
    }
    return false;
}
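Applying the diff, the whole method now reads:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    var bucket = _table[bucketIdx];
    for (var i = 0; i < BucketSize; i++)
    {
        // Shift the i-th byte down to the low 8 bits and compare.
        var shift = i * 8;
        var fp = (byte)(bucket >> shift);
        if (fp == fingerprint) return true;
    }
    return false;
}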
The bucket offset disappears because each uint is a bucket. But I’ve lost the luxury of direct indexing—unless I convert the integer to bytes first:
Span<byte> bucketBytes = stackalloc byte[sizeof(uint)];
BitConverter.TryWriteBytes(bucketBytes, _table[bucketIdx]);
for (var i = 0; i < BucketSize; i++)
{
    var fp = bucketBytes[i];
    if (fp == fingerprint) return true;
}
return false;
I ran a quick benchmark on both uint-based lookups, and the results were revealing. The shifting version gave a nice speed boost, about 35% faster than the original byte-array loop. I also tested an unrolled loop, but it showed no significant improvement because the compiler had already unrolled it.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 239.9 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 166.9 ns | 0.70 |
| ByteTable_NegativeLookups | 128 | 321.9 ns | 1.34 |
| IntTable_NegativeLookups | 128 | 203.9 ns | 0.85 |
| ByteTable_PositiveLookups | 1024 | 1,847.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,300.9 ns | 0.70 |
| ByteTable_NegativeLookups | 1024 | 2,523.7 ns | 1.37 |
| IntTable_NegativeLookups | 1024 | 1,597.2 ns | 0.86 |
| ByteTable_PositiveLookups | 1048576 | 1,907,503.2 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,762,474.3 ns | 0.92 |
| ByteTable_NegativeLookups | 1048576 | 2,575,301.5 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,637,653.6 ns | 0.86 |
The BitConverter approach, however, was a step backward. It was even slower than the original, likely due to the additional Span overhead. I’m not about to introduce complexity for negative gain, so the BitConverter version was a non-starter.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 239.7 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 366.8 ns | 1.53 |
| ByteTable_NegativeLookups | 128 | 322.6 ns | 1.35 |
| IntTable_NegativeLookups | 128 | 429.2 ns | 1.79 |
| ByteTable_PositiveLookups | 1024 | 1,850.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 2,877.2 ns | 1.56 |
| ByteTable_NegativeLookups | 1024 | 2,512.9 ns | 1.36 |
| IntTable_NegativeLookups | 1024 | 3,396.8 ns | 1.84 |
| ByteTable_PositiveLookups | 1048576 | 1,909,607.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 3,352,454.1 ns | 1.76 |
| ByteTable_NegativeLookups | 1048576 | 2,566,696.5 ns | 1.34 |
| IntTable_NegativeLookups | 1048576 | 3,536,050.6 ns | 1.85 |
The shifting version is already quite performant, but can we do better? What if we eliminate the loop entirely?
#Finding a Byte with Masking
Long ago I bookmarked Sean Anderson’s great Bit Twiddling Hacks. One gem there—Determine if a word has a zero byte—is exactly what I need. The C# version is nearly identical:
private static bool HasZero(uint v)
{
    return ((v - 0x01010101U) & ~v & 0x80808080U) != 0;
}
Admittedly opaque, so let’s unpack it.
The core of the trick is (v - 0x01010101U) & ~v. This expression has a special property:
- For any non-zero byte b, the most significant bit of (b - 1) & ~b will always be 0.
- For a zero byte b = 0x00, the expression becomes (0x00 - 1) & ~0x00, which is 0xFF & 0xFF = 0xFF. The most significant bit is 1.
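A quick brute-force check over all 256 byte values confirms the property—this is a throwaway sketch, not part of the filter:
// Check: the MSB of ((b - 1) & ~b) is set if and only if b == 0.
for (var b = 0; b < 256; b++)
{
    var msbSet = ((b - 1) & ~b & 0x80) != 0;
    if (msbSet != (b == 0))
        throw new InvalidOperationException($"Property fails for 0x{b:X2}");
}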
So, this operation creates a “marker” bit (it sets the most significant bit to 1) in any byte position that was originally 0x00.
Let’s apply it to our v, e.g., 0x4462002E.
First, we subtract 0x01010101U:
0x4462002E - 0x01010101 = 0x4360FF2D
Note that this is a single 32-bit subtraction, so borrows can cross byte boundaries. The 0x00 byte borrows from 0x62, turning the middle bytes 0x62 0x00 into 0x60 0xFF.
Second, we apply a bitwise NOT to the original value:
~0x4462002E = 0xBB9DFFD1
Now, & them together:
0x4360FF2D & 0xBB9DFFD1 = 0x0300FF01
Notice that the third byte from the left is 0xFF. Its most significant bit is 1, indicating that this was our zero-byte position.
The final & 0x80808080U is a mask that removes all other bits, leaving only the most significant bit of each byte:
0x0300FF01 & 0x80808080 = 0x00008000
The method returns the result of 0x00008000 != 0. Since 0x8000 is not zero, the expression is true, and the method correctly reports that the zero byte was found.
Great, now I can detect a zero byte without branches. All that remains is to turn the byte I’m searching for into zero.
#XOR to the Rescue
Let’s assume our integer, aka bucket, is 0x12345678 and I’m looking for byte 0x56. Without shifts, this seems tough. Luckily, all we need to do is transform the 0x56 byte to 0x00.
What happens to the other bytes is irrelevant, because our HasZero trick only cares if any byte is zero.
The XOR (^) operator has a useful property: A ^ B = 0 if and only if A == B. So, if we XOR the entire bucket with a repeating mask of our target byte, only the matching byte will become zero.
So the remaining piece of this puzzle is building the repeating mask, which is simple arithmetic: multiply the target byte by 0x01010101U:
0x56 * 0x01010101 = 0x56565656
XOR it with the bucket:
0x12345678 ^ 0x56565656 = 0x4462002E
That is exactly the value we walked through HasZero with above.
#What About Existing Zeros?
A careful reader might ask: what happens if the bucket already has a zero byte in it? Does that mess up the logic?
Let’s say our bucket is 0xAA00CCDD and we’re searching for a non-zero fingerprint like 0xBB. The XOR operation transforms the original zero into a non-zero value, so HasZero correctly returns false.
Now, what if we search for 0x00 itself (an empty slot in my case)? The mask is 0x00000000, so the XOR leaves the bucket unchanged. HasZero is then applied to the result, which correctly finds the pre-existing zero and returns true.
So no, nothing breaks; the algorithm is still robust: HasZero only gives a positive result if a zero byte exists after the XOR, which only happens if our target fingerprint was in the bucket to begin with.
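To make both cases concrete, here’s a throwaway check that feeds hand-computed values through the HasZero method from above:
uint bucket = 0xAA00CCDD;   // the second byte is a pre-existing zero (an empty slot)

// Searching for 0xBB: XOR with 0xBBBBBBBB turns the empty slot into 0xBB
// and leaves no zero byte, so HasZero reports false.
Console.WriteLine(HasZero(bucket ^ 0xBBBBBBBBu));   // False (result: 0x11BB7766)

// Searching for 0x00: the mask is all zeros, the bucket is unchanged,
// and HasZero finds the pre-existing zero byte.
Console.WriteLine(HasZero(bucket ^ 0x00000000u));   // True (result: 0xAA00CCDD)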
#Putting It All Together
Here’s the final, branch-free lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    uint bucket = _table[bucketIdx];
    uint mask = fingerprint * 0x01010101U;
    uint xored = bucket ^ mask;
    return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}
We XOR to zero-out matching bytes, then use the bit-twiddling trick to see if any byte is zero.
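For anyone who wants to poke at it, here’s a tiny standalone demo (top-level statements; the one-bucket table and the sample fingerprints are mine, not from the filter code):
// A one-bucket table holding the fingerprints 0x44, 0x62, 0x00, 0x2E.
uint[] table = { 0x4462002E };

bool Contains(byte fingerprint, uint bucketIdx)
{
    uint bucket = table[bucketIdx];
    uint mask = fingerprint * 0x01010101U;
    uint xored = bucket ^ mask;
    return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}

Console.WriteLine(Contains(0x62, 0)); // True:  0x62 sits in the second slot
Console.WriteLine(Contains(0x99, 0)); // False: 0x99 is nowhere in the bucket
Console.WriteLine(Contains(0x00, 0)); // True:  the empty slot matches 0x00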
The benchmarks confirmed this bit-twiddling exercise was well worth the effort. Positive lookups became over 60% faster, while negative lookups became more than twice as fast compared to the original byte-array implementation. It’s a significant leap over the shifting version, too. While readability certainly took a hit, the raw performance gain is a trade-off I’m OK with.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 245.2 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 147.7 ns | 0.60 |
| ByteTable_NegativeLookups | 128 | 324.1 ns | 1.32 |
| IntTable_NegativeLookups | 128 | 147.1 ns | 0.60 |
| ByteTable_PositiveLookups | 1024 | 1,845.6 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,139.4 ns | 0.62 |
| ByteTable_NegativeLookups | 1024 | 2,561.2 ns | 1.39 |
| IntTable_NegativeLookups | 1024 | 1,136.3 ns | 0.62 |
| ByteTable_PositiveLookups | 1048576 | 1,908,031.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,170,627.6 ns | 0.61 |
| ByteTable_NegativeLookups | 1048576 | 2,574,882.8 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,172,802.0 ns | 0.61 |
#Final Thoughts
I’m still not a huge fan of stuffing dense bit tricks into production C#—they can be hard to read and even harder to maintain if something goes wrong. But I think this little adventure has been worth it: the lookup path is now twice as fast, and the codebase is still compact enough to keep the trick well-commented. I hope these notes save someone else a detour—or at the very least that you enjoyed this little optimization trip with me.