SIMD Within a Register: How I Doubled Hash Table Lookup Performance
It started with a simple thought: four bytes in a hash table bucket look just like an integer. That one idea led to a deep dive into bit-twiddling and a 2x performance boost.
While working on a Cuckoo Filter implementation in C#, I created an array-like structure for the underlying hash table. I chose an 8-bit fingerprint: it aligns nicely on a byte boundary and still keeps the false-positive rate around 3%.
The layout looked straightforward—just a byte array where the start of each bucket is calculated as bucketIdx * bucketSize. Each bucket holds 4 slots, which is a solid choice for a Cuckoo Filter.
But those four bytes in a bucket reminded me of something. They feel like … an integer!
I wasn’t chasing ultra-low latency—after all, this is C#—but I couldn’t resist experimenting. Could I speed up lookups in my Cuckoo Filter by replacing the 4-byte bucket with a plain old 32-bit integer?
Time to find out. 💪
#A Flexible and Simple Implementation
Here’s the naïve storage I began with. To keep the rest of the code clean, all table logic lives in its own class, whose heart is a pre-allocated byte array:
private readonly byte[] _table;
Each bucket has 4 slots, so there was no need for a second dimension; mapping a key to its bucket is obvious:
var offset = bucketIdx * 4;
Checking whether a particular fingerprint (a single byte) sits in a bucket is just as obvious:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    var s = bucketIdx * 4;
    for (var i = s; i < s + BucketSize; i++)
    {
        if (_table[i] == fingerprint) return true;
    }
    return false;
}
(Please ignore the missing bounds checks—they’re not the point today.)
I’m no bit-twiddling wizard—I live in a cosy, GC-collected world—but those 4 bytes kept itching. A bucket lines up perfectly with a 32-bit uint; that opens the door to a future lock-free compare-and-swap. So I decided to play.
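Just as a teaser, once each bucket is a single uint, an atomic slot swap could look roughly like this. This is a hypothetical sketch, not the filter’s code: it assumes .NET 5+ (where Interlocked.CompareExchange gained a uint overload), and TryReplaceSlot is a name I made up:
// Hypothetical: atomically replace one fingerprint inside a uint bucket.
// Requires: using System.Threading; and .NET 5+ for the uint overload.
private bool TryReplaceSlot(uint bucketIdx, int slot, byte oldFp, byte newFp)
{
    var shift = slot * 8;
    var expected = _table[bucketIdx];
    // Bail out if the slot doesn’t currently hold the fingerprint we expect.
    if ((byte)(expected >> shift) != oldFp) return false;
    var desired = (expected & ~(0xFFu << shift)) | ((uint)newFp << shift);
    return Interlocked.CompareExchange(ref _table[bucketIdx], desired, expected) == expected;
}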
#Migrating to a uint Table
Switching the backing array is trivial because the hash table is still one-dimensional:
- private readonly byte[] _table;
+ private readonly uint[] _table;
Now let’s refactor the lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
-   var s = bucketIdx * 4;
+   var bucket = _table[bucketIdx];
-   for (var i = s; i < s + BucketSize; i++)
+   for (var i = 0; i < BucketSize; i++)
    {
-       if (_table[i] == fingerprint) return true;
+       var shift = i * 8;
+       var fp = (byte)(bucket >> shift);
+       if (fp == fingerprint) return true;
    }
    return false;
}
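Applying the diff, the whole method now reads:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    var bucket = _table[bucketIdx];
    for (var i = 0; i < BucketSize; i++)
    {
        // Shift the i-th byte down to the low 8 bits and compare.
        var shift = i * 8;
        var fp = (byte)(bucket >> shift);
        if (fp == fingerprint) return true;
    }
    return false;
}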
The bucket offset disappears because each uint is a bucket. But I’ve lost the luxury of direct indexing—unless I convert the integer to bytes first:
Span<byte> bucketBytes = stackalloc byte[sizeof(uint)];
BitConverter.TryWriteBytes(bucketBytes, _table[bucketIdx]);
for (var i = 0; i < BucketSize; i++)
{
    var fp = bucketBytes[i];
    if (fp == fingerprint) return true;
}
return false;
I ran a quick benchmark on both uint-based lookups, and the results were revealing. The shifting version gave a nice speed boost, about 35% faster than the original byte-array loop. I also tested an unrolled loop, but it showed no significant improvement because the compiler had already unrolled it.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 239.9 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 166.9 ns | 0.70 |
| ByteTable_NegativeLookups | 128 | 321.9 ns | 1.34 |
| IntTable_NegativeLookups | 128 | 203.9 ns | 0.85 |
| ByteTable_PositiveLookups | 1024 | 1,847.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,300.9 ns | 0.70 |
| ByteTable_NegativeLookups | 1024 | 2,523.7 ns | 1.37 |
| IntTable_NegativeLookups | 1024 | 1,597.2 ns | 0.86 |
| ByteTable_PositiveLookups | 1048576 | 1,907,503.2 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,762,474.3 ns | 0.92 |
| ByteTable_NegativeLookups | 1048576 | 2,575,301.5 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,637,653.6 ns | 0.86 |
The BitConverter approach, however, was a step backward. It was even slower than the original, likely due to the additional Span overhead. I’m not about to introduce complexity for negative gain, so the BitConverter version was a non-starter.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 239.7 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 366.8 ns | 1.53 |
| ByteTable_NegativeLookups | 128 | 322.6 ns | 1.35 |
| IntTable_NegativeLookups | 128 | 429.2 ns | 1.79 |
| ByteTable_PositiveLookups | 1024 | 1,850.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 2,877.2 ns | 1.56 |
| ByteTable_NegativeLookups | 1024 | 2,512.9 ns | 1.36 |
| IntTable_NegativeLookups | 1024 | 3,396.8 ns | 1.84 |
| ByteTable_PositiveLookups | 1048576 | 1,909,607.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 3,352,454.1 ns | 1.76 |
| ByteTable_NegativeLookups | 1048576 | 2,566,696.5 ns | 1.34 |
| IntTable_NegativeLookups | 1048576 | 3,536,050.6 ns | 1.85 |
The shifting version is already quite performant, but can we do better? What if we eliminate the loop entirely?
#Finding a Byte with Masking
Long ago I bookmarked Sean Anderson’s great Bit Twiddling Hacks. One gem there—Determine if a word has a zero byte—is exactly what I need. The C# version is nearly identical:
private static bool HasZero(uint v)
{
    return ((v - 0x01010101U) & ~v & 0x80808080U) != 0;
}
Admittedly opaque, so let’s unpack it.
The core of the trick is (v - 0x01010101U) & ~v. This expression has a special property:
- For any non-zero byte b, the most significant bit of (b - 1) & ~b will always be 0.
- For a zero byte b = 0x00, the expression becomes (0x00 - 1) & ~0x00, which is 0xFF & 0xFF = 0xFF. The most significant bit is 1.
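A quick brute-force check over all 256 byte values confirms the property—this is a throwaway sketch, not part of the filter:
// Check: the MSB of ((b - 1) & ~b) is set if and only if b == 0.
for (var b = 0; b < 256; b++)
{
    var msbSet = ((b - 1) & ~b & 0x80) != 0;
    if (msbSet != (b == 0))
        throw new InvalidOperationException($"Property fails for 0x{b:X2}");
}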
So, this operation creates a “marker” bit (it sets the most significant bit to 1) in any byte position that was originally 0x00.
Let’s apply it to our v, e.g., 0x4462002E.
First, we subtract 0x01010101U:
0x4462002E - 0x01010101 = 0x4360FF2D
Note that this is a single 32-bit subtraction, so borrows can cross byte boundaries. The 0x00 byte borrows from 0x62, turning the middle bytes 0x62 0x00 into 0x60 0xFF.
Second, we apply a bitwise NOT to the original value:
~0x4462002E = 0xBB9DFFD1
Now, & them together:
0x4360FF2D & 0xBB9DFFD1 = 0x0300FF01
Notice that the third byte from the left is 0xFF. Its most significant bit is 1, indicating that this was our zero-byte position.
The final & 0x80808080U is a mask that removes all other bits, leaving only the most significant bit of each byte:
0x0300FF01 & 0x80808080 = 0x00008000
The method returns the result of 0x00008000 != 0. Since 0x8000 is not zero, the expression is true, and the method correctly reports that the zero byte was found.
Great, now I can detect a zero byte without branches. All that remains is to turn the byte I’m searching for into zero.
#XOR to the Rescue
Let’s assume our integer, aka bucket, is 0x12345678 and I’m looking for byte 0x56. Without shifts, this seems tough. Luckily, all we need to do is transform the 0x56 byte to 0x00.
What happens to the other bytes is irrelevant, because our HasZero trick only cares if any byte is zero.
The XOR (^) operator has a useful property: A ^ B = 0 if and only if A == B. So, if we XOR the entire bucket with a repeating mask of our target byte, only the matching byte will become zero.
So the remaining piece of this puzzle is building the repeating mask, which is simple arithmetic: multiply the target byte by 0x01010101U:
0x56 * 0x01010101 = 0x56565656
XOR it with the bucket:
0x12345678 ^ 0x56565656 = 0x4462002E
That is exactly the value we walked through HasZero with above.
#What About Existing Zeros?
A careful reader might ask: what happens if the bucket already has a zero byte in it? Does that mess up the logic?
Let’s say our bucket is 0xAA00CCDD and we’re searching for a non-zero fingerprint like 0xBB. The XOR operation transforms the original zero into a non-zero value, so HasZero correctly returns false.
Now, what if we search for 0x00 itself (an empty slot in my case)? The mask is 0x00000000, so the XOR leaves the bucket unchanged. HasZero is then applied to the result, which correctly finds the pre-existing zero and returns true.
So no, nothing breaks; the algorithm is still robust: HasZero only gives a positive result if a zero byte exists after the XOR, which only happens if our target fingerprint was in the bucket to begin with.
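To make both cases concrete, here’s a throwaway check that feeds hand-computed values through the HasZero method from above:
uint bucket = 0xAA00CCDD;   // the second byte is a pre-existing zero (an empty slot)

// Searching for 0xBB: XOR with 0xBBBBBBBB turns the empty slot into 0xBB
// and leaves no zero byte, so HasZero reports false.
Console.WriteLine(HasZero(bucket ^ 0xBBBBBBBBu));   // False (result: 0x11BB7766)

// Searching for 0x00: the mask is all zeros, the bucket is unchanged,
// and HasZero finds the pre-existing zero byte.
Console.WriteLine(HasZero(bucket ^ 0x00000000u));   // True (result: 0xAA00CCDD)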
#Putting It All Together
Here’s the final, branch-free lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
    uint bucket = _table[bucketIdx];
    uint mask = fingerprint * 0x01010101U;
    uint xored = bucket ^ mask;
    return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}
We XOR to zero-out matching bytes, then use the bit-twiddling trick to see if any byte is zero.
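For anyone who wants to poke at it, here’s a tiny standalone demo (top-level statements; the one-bucket table and the sample fingerprints are mine, not from the filter code):
// A one-bucket table holding the fingerprints 0x44, 0x62, 0x00, 0x2E.
uint[] table = { 0x4462002E };

bool Contains(byte fingerprint, uint bucketIdx)
{
    uint bucket = table[bucketIdx];
    uint mask = fingerprint * 0x01010101U;
    uint xored = bucket ^ mask;
    return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}

Console.WriteLine(Contains(0x62, 0)); // True:  0x62 sits in the second slot
Console.WriteLine(Contains(0x99, 0)); // False: 0x99 is nowhere in the bucket
Console.WriteLine(Contains(0x00, 0)); // True:  the empty slot matches 0x00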
The benchmarks confirmed this bit-twiddling exercise was well worth the effort. Positive lookups became over 60% faster, while negative lookups became more than twice as fast compared to the original byte-array implementation. It’s a significant leap over the shifting version, too. While readability certainly took a hit, the raw performance gain is a trade-off I’m OK with.
| Method | Operations | Mean | Ratio |
|---|---|---|---|
| ByteTable_PositiveLookups | 128 | 245.2 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 147.7 ns | 0.60 |
| ByteTable_NegativeLookups | 128 | 324.1 ns | 1.32 |
| IntTable_NegativeLookups | 128 | 147.1 ns | 0.60 |
| ByteTable_PositiveLookups | 1024 | 1,845.6 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,139.4 ns | 0.62 |
| ByteTable_NegativeLookups | 1024 | 2,561.2 ns | 1.39 |
| IntTable_NegativeLookups | 1024 | 1,136.3 ns | 0.62 |
| ByteTable_PositiveLookups | 1048576 | 1,908,031.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,170,627.6 ns | 0.61 |
| ByteTable_NegativeLookups | 1048576 | 2,574,882.8 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,172,802.0 ns | 0.61 |
#Final Thoughts
I’m still not a huge fan of stuffing dense bit tricks into production C#—they can be hard to read and even harder to maintain if something goes wrong. But I think this little adventure has been worth it: the lookup path is now twice as fast, and the codebase is still compact enough to keep the trick well-commented. I hope these notes save someone else a detour—or at the very least that you enjoyed this little optimization trip with me.