Strategies for Fast Lexers
In this blog post I'll explain the strategies I used to make the purple-garden lexer really fast.
purple-garden is an s-expr based language I am currently developing for myself. It's my attempt at building a language I like, with a batteries-included approach, while designing it with performance in mind.
Not all of these approaches will be feasible for your use case, architecture and design. I tried to bring receipts for my performance claims, so watch out for these blocks at the end of chapters:
Info - Benchmark
Introduction to Lexing
A lexer (often also referred to as a tokeniser) is the easiest part of any compilation and language pipeline. The idea is to convert a list of characters into a list of tokens in which each token conveys some meaning. This list of tokens can then be used by the parser to generate an abstract syntax tree (AST), which the compiler consumes, converting it to bytecode, which the vm executes.
A compilation pipeline
As an overview:
┌───────────────────┐
│ │
│ Lexer │ <- we are here
│ │
└─────────┬─────────┘
│
│ Tokens <- and here
│
┌─────────▼─────────┐
│ │
│ Parser │
│ │
└─────────┬─────────┘
│
│ AST
│
┌─────────▼─────────┐
│ │
│ Compiler │
│ │
└─────────┬─────────┘
│
│ Bytecode
│
┌─────────▼─────────┐
│ │
│ Virtual Machine │
│ │
└───────────────────┘
For a list of characters, let's say (@std.fmt.println "my pi is: " 3.1415):

Input to the lexer:

```text
(@std.fmt.println "my pi is: " 3.1415)
```

As characters:

```text
( @ s t d . f m t // .... )
```

As pseudo tokens:

```text
( @std . fmt . println "my pi is: " 3.1415 )
```
The above is just an example and I’ll go into detail below:
Defining purple garden’s Tokens
A token is not only the set of characters it can be mapped to, it also holds:
- A token type, to easily distinguish between tokens
- Positional information:
- start point
- end or length
- line number
Purple garden keeps it as minimal as possible and just has a Str and a token type:

```c
typedef enum {
  // (
  T_DELIMITOR_LEFT = 1,
  // +
  T_PLUS = 2,
  // -
  T_MINUS = 3,
  // *
  T_ASTERISKS = 4,
  // /
  T_SLASH = 5,
  // =
  T_EQUAL = 6,
  // )
  T_DELIMITOR_RIGHT,
  // [
  T_BRAKET_LEFT,
  // ]
  T_BRAKET_RIGHT,
  // anything between ""
  T_STRING,
  // true
  T_TRUE,
  // false
  T_FALSE,
  // floating point numbers
  T_DOUBLE,
  // whole numbers
  T_INTEGER,
  // builtins in the format @
  T_BUILTIN,
  // any identifier
  T_IDENT,
  // end marker
  T_EOF,
} TokenType;

typedef struct {
  TokenType type;
  // stores all values for T_STRING,T_IDENT,T_INTEGER and T_DOUBLE
  Str string;
} Token;
```
Architecting a minimal Lexer
As explained in A compilation pipeline, the lexer iterates over the characters in the input and attempts to match them against known sets of characters, like keywords, numbers and symbols. For this we will need to keep some state, even though it's not much:

```c
typedef struct {
  Str input;  // <- String view over the input
  size_t pos; // <- current position in the input
} Lexer;
```
Str as an abstraction is introduced in Operating on zero copy, zero alloc string windows.
A naive approach would be:

```c
#define SINGLE_TOK(t) ((Token){.type = t})

Lexer Lexer_new(Str str) {
  return (Lexer){
      .input = str,
      .pos = 0,
  };
}

Token Lexer_next(Lexer *l) {
  if (l->input.len == 0 || l->pos >= l->input.len) {
    return SINGLE_TOK(T_EOF);
  }

  Token t;
  switch (Str_get(l->input, l->pos)) {
  case '+':
    t = SINGLE_TOK(T_PLUS);
    break;
  case '-':
    t = SINGLE_TOK(T_MINUS);
    break;
  case '*':
    t = SINGLE_TOK(T_ASTERISKS);
    break;
  case '/':
    t = SINGLE_TOK(T_SLASH);
    break;
  // ... things like strings, quoted symbols, numbers, comments, whitespace :^)
  default:
    t = SINGLE_TOK(T_EOF);
    break;
  }
  l->pos++;
  return t;
}
```
And one could consume the API as follows (even with a little type->string lookup for debugging):

```c
const char *TOKEN_TYPE_MAP[] = {
    [T_PLUS] = "T_PLUS",
    [T_MINUS] = "T_MINUS",
    [T_ASTERISKS] = "T_ASTERISKS",
    [T_SLASH] = "T_SLASH",
    [T_EOF] = "T_EOF",
};

Str s = STRING("+-*/");
Lexer l = Lexer_new(s);
while (l.pos < l.input.len) {
  Token t = Lexer_next(&l);
  puts(TOKEN_TYPE_MAP[t.type]);
}
```
Tip - Designated initializers
Designated initializers ([x] = y) are allowed by ISO C99, while designated initializers with ranges ([x ... y] = z) are a GNU extension, see 6.2.11 Designated Initializers.
Making it fast
Let's get into some optimisations and strategies for creating a clean, minimal and especially fast lexer.
Computed gotos, or: Threaded Lexing
Threaded Lexing is a reference to Threaded code, see wikipedia. The idea is to replace the switch statement with a jump to a block inside of the lexer. The easiest way I could think of implementing this was to:
Define a jump table mapping lexeme start characters to their respective blocks:

```c
static void *jump_table[256] = {
    // first bind all possible bytes to the unknown label, so there are no out
    // of bound reads
    [0 ... 255] = &&unknown,

    // replace all known bytes with correct jump labels
    ['('] = &&delimitor_left,
    [')'] = &&delimitor_right,
    ['['] = &&braket_left,
    [']'] = &&braket_right,
    ['@'] = &&builtin,
    ['+'] = &&plus,
    ['-'] = &&minus,
    ['/'] = &&slash,
    ['*'] = &&asterisks,
    ['='] = &&equal,
    [' '] = &&whitespace,
    ['\t'] = &&whitespace,
    ['\n'] = &&whitespace,
    [';'] = &&comment,
    ['.'] = &&number,
    ['0' ... '9'] = &&number,
    ['a' ... 'z'] = &&ident,
    ['A' ... 'Z'] = &&ident,
    ['\''] = &&quoted,
    ['_'] = &&ident,
    ['"'] = &&string,

    // handle the edgecase of being at the end of the input
    [0] = &&end,
};
```
Create a macro for jumping to the label:

```c
#define JUMP_TARGET goto *jump_table[(uint8_t)l->input.p[l->pos]]
```
At the start of the lexer, call the macro:

```c
JUMP_TARGET;
```
Putting it all together lets us build the following skeleton of a threaded lexer (each block will later consume its lexeme and advance l->pos before jumping to the next one):

```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
  ASSERT(out != NULL, "Failed to allocate token list");

  // empty input
  if (l->input.len == 0) {
    return 1;
  }

  static void *jump_table[256] = {
      [0 ... 255] = &&unknown,
      ['('] = &&delimitor_left,
      [')'] = &&delimitor_right,
      ['['] = &&braket_left,
      [']'] = &&braket_right,
      ['@'] = &&builtin,
      ['+'] = &&plus,
      ['-'] = &&minus,
      ['/'] = &&slash,
      ['*'] = &&asterisks,
      ['='] = &&equal,
      [' '] = &&whitespace,
      ['\t'] = &&whitespace,
      ['\n'] = &&whitespace,
      [';'] = &&comment,
      ['.'] = &&number,
      ['0' ... '9'] = &&number,
      ['a' ... 'z'] = &&ident,
      ['A' ... 'Z'] = &&ident,
      ['\''] = &&quoted,
      ['_'] = &&ident,
      ['"'] = &&string,
      [0] = &&end,
  };

#define JUMP_TARGET goto *jump_table[(uint8_t)l->input.p[l->pos]]

  JUMP_TARGET;

delimitor_left:
  JUMP_TARGET;

delimitor_right:
  JUMP_TARGET;

braket_left:
  JUMP_TARGET;

braket_right:
  JUMP_TARGET;

builtin:
  JUMP_TARGET;

plus:
  JUMP_TARGET;

minus:
  JUMP_TARGET;

slash:
  JUMP_TARGET;

equal:
  JUMP_TARGET;

asterisks:
  JUMP_TARGET;

number:
  JUMP_TARGET;

ident:
  JUMP_TARGET;

quoted:
  JUMP_TARGET;

string:
  JUMP_TARGET;

comment:
  JUMP_TARGET;

whitespace:
  JUMP_TARGET;

unknown:
  return 1;

end:
  return 0;
}
```
This replaces the switch with an array index and an indirect jump. The latter is significantly faster than the former due to:
- Better code density
- Fewer cache misses, better branch prediction
Warning
There are two downsides to this approach:
- It is not supported by the MSVC toolchain (gcc and clang only)
- It makes reading and debugging the lexer far more complicated
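If portability matters, one way around the MSVC limitation is to keep a plain comparison fallback behind a feature check. The sketch below is hypothetical (the names and the tiny three-class dispatch are illustrative, not purple-garden's actual code); on gcc/clang it dispatches via a jump table, elsewhere via ordinary branches:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { C_DELIM, C_DIGIT, C_OTHER } CharClass;

static CharClass classify(unsigned char c) {
#if defined(__GNUC__) || defined(__clang__)
  // computed-goto path: array index + indirect jump
  static void *table[256] = {
      [0 ... 255] = &&other,
      ['('] = &&delim,
      [')'] = &&delim,
      ['0' ... '9'] = &&digit,
  };
  goto *table[c];
delim:
  return C_DELIM;
digit:
  return C_DIGIT;
other:
  return C_OTHER;
#else
  // portable fallback, e.g. for MSVC
  if (c == '(' || c == ')')
    return C_DELIM;
  if (c >= '0' && c <= '9')
    return C_DIGIT;
  return C_OTHER;
#endif
}
```

Keeping both paths behind one function also gives you a convenient way to A/B the two dispatch strategies in benchmarks.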
Abstracting allocations via the Allocator interface
I want to allow the user of purple-garden to choose between allocation strategies like my garbage collectors and a bump allocator. For this I need an interface to marry several backing implementations into a single API:

```c
#ifndef MEM_H
#define MEM_H

#include <stddef.h>

#ifdef DEBUG
#if DEBUG
#define VERBOSE_ALLOCATOR 1
#endif
#else
#define VERBOSE_ALLOCATOR 0
#endif

// 1MB
#define GC_MIN_HEAP 1024 * 1024

typedef struct {
  size_t current;
  size_t allocated;
} Stats;

// CALL is used to emulate method calls by calling a METHOD on SELF with
// SELF->ctx and __VA_ARGS__, this is useful for interface interaction, such as
// Allocator, which reduces alloc_bump.request(alloc_bump.ctx, 64); to
// CALL(alloc_bump, request, 64), removing the need for passing the context in
// manually
#if VERBOSE_ALLOCATOR
#include <stdio.h>
#define CALL(SELF, METHOD, ...)                                                \
  (fprintf(stderr, "[ALLOCATOR] %s@%s::%d: %s->%s(%s)\n", __FILE__, __func__,  \
           __LINE__, #SELF, #METHOD, #__VA_ARGS__),                            \
   (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__))
#else
#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)
#endif

// Allocator defines an interface abstracting different allocators, so the
// runtime of the virtual machine does not need to know about implementation
// details, can be used like this:
//
//   #define ALLOC_HEAP_SIZE 1024
//   Allocator alloc_bump = bump_init(ALLOC_HEAP_SIZE);
//
//   size_t some_block_size = 16;
//   void *some_block = alloc_bump.request(alloc_bump.ctx, some_block_size);
//
//   alloc_bump.destroy(alloc_bump.ctx);
//
typedef struct {
  // Allocator::ctx refers to an internal allocator state and owned memory
  // areas, for instance, a bump allocator would attach its meta data (current
  // position, cap, etc) here
  void *ctx;

  // Allocator::stats is expected to return the current statistics of the
  // underlying allocator
  Stats (*stats)(void *ctx);
  // Allocator::request returns a handle to a block of memory of size `size`
  void *(*request)(void *ctx, size_t size);
  // Allocator::destroy cleans state up and deallocates any owned memory areas
  void (*destroy)(void *ctx);
} Allocator;

Allocator *bump_init(size_t size);
Allocator *xcgc_init(size_t size, void *vm);

#endif
```
For instance: this allocator is then passed to the virtual machine. The vm uses Allocator->request(Allocator->ctx, sizeof(Value)) to allocate a new runtime value.
I included a CALL macro for cleaning this duplicated context passing up:
```c
Allocator *a = bump_init(1024);

// before:
void *block = a->request(a->ctx, 512);
a->destroy(a->ctx);

// after:
void *block = CALL(a, request, 512);
CALL(a, destroy);
```
This macro also enables verbose allocation insights for debugging purposes:

```text
[ALLOCATOR] vm.c@freelist_preallocate::59: fl->alloc->request(sizeof(Frame))
[ALLOCATOR] main.c@main::245: vm.alloc->stats()
[ALLOCATOR] main.c@main::255: pipeline_allocator->destroy()
[ALLOCATOR] vm.c@Vm_destroy::283: vm->alloc->destroy()
```
The implementation of said interface is straightforward:

```c
// Bump allocator header
typedef struct {
  // points to the start of the allocated block from which Allocator::request
  // will hand out aligned chunks
  void *block;
  // the size of said allocated block
  size_t len;
  // the current amount of bytes in use
  size_t pos;
} BumpCtx;

void *bump_request(void *ctx, size_t size) {
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  size_t align = sizeof(void *);
  b_ctx->pos = (b_ctx->pos + align - 1) & ~(align - 1);
  ASSERT(b_ctx->pos + size <= b_ctx->len, "OOM :( with %zu", b_ctx->len);
  void *block_entry = (char *)b_ctx->block + b_ctx->pos;
  b_ctx->pos += size;
  return block_entry;
}

void bump_destroy(void *ctx) {
  ASSERT(ctx != NULL, "bump_destroy on already destroyed allocator");
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  // madvise(2):
  // The application no longer requires the pages in the range
  // specified by addr and len. The kernel can thus free these
  // pages, but the freeing could be delayed until memory pres-
  // sure occurs.
  madvise(b_ctx->block, b_ctx->len, MADV_FREE);
  int res = munmap(b_ctx->block, b_ctx->len);
  ASSERT(res == 0, "munmap failed");
  free(ctx);
}

Stats bump_stats(void *ctx) {
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  return (Stats){.allocated = b_ctx->len, .current = b_ctx->pos};
}

Allocator *bump_init(size_t size) {
  long page_size = sysconf(_SC_PAGESIZE);
  size = (size + page_size - 1) & ~(page_size - 1);

  void *b = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  ASSERT(b != MAP_FAILED, "failed to mmap allocator buffer");

  BumpCtx *ctx = malloc(sizeof(BumpCtx));
  ASSERT(ctx != NULL, "failed to alloc bump allocator context");
  ctx->len = size;
  ctx->pos = 0;
  ctx->block = b;

  Allocator *a = malloc(sizeof(Allocator));
  ASSERT(a != NULL, "failed to alloc bump allocator");
  a->ctx = (void *)ctx;
  a->destroy = bump_destroy;
  a->request = bump_request;
  a->stats = bump_stats;

  return a;
}
```
Info - Benchmarks
This reduced the total runtime by around 24ms, making it 1.58x faster, see 4251b6b.
The separate parsing stage was due to me experimenting with attaching the parser to the compiler, but this made the whole thing a bit slower than just finishing parsing before compiling:
```text
mem+main+cc: replace dynamic alloc with bump allocator (~1.6x faster)
- preallocated 4MB for bytecode and 256KB for the global pool to improve memory management.
- cc now takes an allocator to omit inline memory allocations for global
  pool entries and bytecode creation
- with previous changes shaving off 24ms

Old perf (before b59737a)

    [  0.0070ms] main::Args_parse: Parsed arguments
    [  0.0080ms] io::IO_read_file_to_string: mmaped input
    [ 41.2460ms] parser::Parser_run: Transformed source to AST
    [ 20.7330ms] cc::cc: Flattened AST to byte code
    [  0.9640ms] mem::Allocator::destroy: Deallocated AST memory space
    [  2.1270ms] vm::Vm_run: Walked and executed byte code
    [  0.5560ms] vm::Vm_destroy: Deallocated global pool and bytecode list

New:

    [  0.0050ms] main::Args_parse: Parsed arguments
    [  0.0120ms] io::IO_read_file_to_string: mmaped input
    [ 37.8600ms] cc::cc: Flattened AST to byte code
    [  2.3540ms] vm::Vm_run: Walked and executed byte code
    [  0.7380ms] mem::Allocator::destroy: Deallocated AST memory space

    $ hyperfine "./purple_garden examples/bench.garden" "../purple_garden_old/purple_garden examples/bench.garden"
    Benchmark 1: ./purple_garden examples/bench.garden
      Time (mean ± σ):      42.2 ms ±   0.7 ms    [User: 28.0 ms, System: 13.8 ms]
      Range (min … max):    41.3 ms …  45.8 ms    70 runs

    Benchmark 2: ../purple_garden_old/purple_garden examples/bench.garden
      Time (mean ± σ):      66.6 ms ±   1.1 ms    [User: 45.9 ms, System: 20.2 ms]
      Range (min … max):    64.8 ms …  69.8 ms    43 runs

    Summary
      ./purple_garden examples/bench.garden ran
        1.58 ± 0.04 times faster than ../purple_garden_old/purple_garden examples/bench.garden
```
Before that I also made the parser use the bump allocator, see 0324051:
```text
parser: allocate with bump alloc (2x faster parse, 26x faster clean)
I replaced the inline allocation logic for the growing of Node.children with
the bump allocator from the allocator abstraction in mem::Allocator
(mem::bump_*). This results in 2x faster parsing and 27x faster AST cleanup:

- [ 89.4830ms/41.7840ms=02.1417x faster] parser::Parser_run: Transformed
  source to AST
- [ 22.3900ms/00.8440ms=26.5284x faster] parser::Node_destroy: Deallocated
  AST Nodes renamed to mem::Allocator::destroy

Old:

    $ make bench PG=examples/bench.garden
    [  0.0050ms] main::Args_parse: Parsed arguments
    [  0.0110ms] io::IO_read_file_to_string: mmaped input
    [ 89.4830ms] parser::Parser_run: Transformed source to AST
    [ 18.8010ms] cc::cc: Flattened AST to byte code
    [ 22.3900ms] parser::Node_destroy: Deallocated AST Nodes
    [  2.3200ms] vm::Vm_run: Walked and executed byte code
    [  0.4670ms] vm::Vm_destroy: Deallocated global pool and bytecode list

New:

    $ make bench PG=examples/bench.garden
    [  0.0050ms] main::Args_parse: Parsed arguments
    [  0.0100ms] io::IO_read_file_to_string: mmaped input
    [ 41.7840ms] parser::Parser_run: Transformed source to AST
    [ 21.2160ms] cc::cc: Flattened AST to byte code
    [  0.8440ms] mem::Allocator::destroy: Deallocated AST memory space
    [  2.2510ms] vm::Vm_run: Walked and executed byte code
    [  0.7590ms] vm::Vm_destroy: Deallocated global pool and bytecode list
```
Operating on zero copy, zero alloc string windows
Our lexer doesn't require mutable strings, C does not come with a string abstraction out of the box, and I wanted to attach metadata to strings - so I came up with a small string abstraction:
```c
// str is a simple stack allocated wrapper around C char arrays, providing
// constant time length access and zero allocation+copy interactions for all
// methods except Str_to
typedef struct {
  // store the pointer to the underlying char
  const uint8_t *p;
  // hash of the input, do not expect it to be filled, has to be computed via
  // Str_hash or inline in the lexer
  uint64_t hash;
  // length of the input without a zero terminator
  size_t len;
} Str;
```
```c
#define STRING(str) ((Str){.len = sizeof(str) - 1, .p = (const uint8_t *)str})
#define STRING_EMPTY ((Str){.len = 0, .p = NULL})
```
Creating a Str from a C-style const char* can be done by passing it into the STRING macro; gcc can evaluate all operations inside of it at compile time. Since the view doesn't own its underlying data, it's cheap to copy and to create slices by just pointing new views into the underlying buffer.
I use this struct throughout the whole codebase, but specifically inside of the lexer to create views over the original input that point to the contents of some tokens:

```c
// example inside of handling strings in the lexer:

// dummy values for start and hash
size_t start = 3;
size_t hash = 0xAFFEDEAD;

Token *t = &(Token){};
t->type = T_STRING;
t->string = (Str){
    .p = l->input.p + start,
    .len = l->pos - start,
    .hash = hash,
};
```
Due to the window nature of this struct I had to reimplement some things myself, such as slicing, concatenation, equality checking, hashing and converting to int64_t and double:
```c
char Str_get(const Str *str, size_t index);
Str Str_from(const char *s);
Str Str_slice(const Str *str, size_t start, size_t end);
Str Str_concat(const Str *a, const Str *b, Allocator *alloc);
bool Str_eq(const Str *a, const Str *b);
void Str_debug(const Str *str);
size_t Str_hash(const Str *str);
int64_t Str_to_int64_t(const Str *str);
double Str_to_double(const Str *str);
```
Let's quickly go over the interesting ones, specifically slicing and converting to other data types. Slicing is super easy, since we just point the new slice at the new start and adjust the length:

```c
Str Str_slice(const Str *str, size_t start, size_t end) {
  ASSERT(end >= start, "Str_slice: Invalid slice range: end must be >= start");
  ASSERT(end <= str->len, "Str_slice: Slice range exceeds string length");

  return (Str){
      .p = str->p + start,
      .len = end - start,
  };
}
```
Converting to int64_t is also fairly uncomplicated, since the lexer is expected to make sure all characters are in the integer set. The algorithm converts the character representation of each digit into its literal value by subtracting '0' from it. The resulting value is added to the product of the previous iteration and 10, since we are working in the decimal system:

```c
int64_t Str_to_int64_t(const Str *str) {
  int64_t r = 0;
  ASSERT(str->len > 0, "Can't convert empty string into int64_t");

  for (size_t i = 0; i < str->len; i++) {
    int digit = str->p[i] - '0';
    ASSERT(r < (INT64_MAX - digit) / 10,
           "int64_t number space overflow: `%.*s`", (int)str->len, str->p);
    r = r * 10 + digit;
  }

  return r;
}
```
Doubles are represented differently, specifically by their mantissa and exponent, requiring a slightly more sophisticated conversion algorithm. In the same vein as Str_to_int64_t, validation is already done by the lexer, to the extent of only allowing any of .1234567890:

```c
double Str_to_double(const Str *str) {
  ASSERT(str->len > 0, "Can't convert empty string into double");

  const char *p = (const char *)str->p;
  size_t len = str->len;

  uint64_t mantissa = 0;
  int exponent = 0;
  bool seen_dot = false;
  bool has_digits = false;

  // we don't check that all chars are numbers here, since the lexer already
  // does that
  for (size_t i = 0; i < len; i++) {
    char c = p[i];

    if (c == '.') {
      seen_dot = true;
      continue;
    }

    has_digits = true;
    short digit = c - '0';
    ASSERT(mantissa <= (UINT64_MAX - digit) / 10, "Mantissa overflow");
    mantissa = mantissa * 10 + digit;
    if (seen_dot) {
      exponent -= 1;
    }
  }

  // if there were no digits after the '.'
  ASSERT(has_digits, "Can't parse `%.*s` into a double", (int)len, p);

  double result = (double)mantissa;
  // skip exponent computation for .0, since these are just the
  // mantissa
  if (exponent != 0) {
    result *= pow(10.0, exponent);
  }

  return result;
}
```
Info - Benchmarks
1.4x speedup by using the above abstraction in a non allocating way, see b19088a:
```text
common: make String methods 0 alloc (~1.4x faster)
- String_slice no longer requires a malloc, now just returns a window into its
  argument
- String_to no longer returns just the underlying pointer (String.p) but
  allocates a new char*
- new String_debug method to print the window
- lexer::num no longer allocates and frees for its string window, instead uses
  a stack buffer
- parser::Node_destroy no longer calls Token_destroy
- Token_destroy no longer needed, since the lexer no longer allocates

Main improvements in runtime while parsing (less allocations and frees for
ident, string and double handling) and in cleanup (no more deallocations for
tokens)

With 250k loc of "hello world":

- Parsing went from 20.9ms to 13.8ms => 1.51x faster
- Cleanup went from 3.6ms to 0.83ms => 4.34x faster

Old:

    [BENCH] (T-0.0060ms): parsed arguments
    [BENCH] (T-0.0420ms): read file to String
    [BENCH] (T-20.8780ms): parsed input
    [BENCH] (T-6.8080ms): compiled input
    [BENCH] (bc=500002|globals=250001)
    [BENCH] (T-0.3440ms): ran vm
    [BENCH] (T-3.5960ms): destroyed Nodes, vm and input

New:

    [BENCH] (T-0.0060ms): parsed arguments
    [BENCH] (T-0.0410ms): read file to String
    [BENCH] (T-13.8280ms): parsed input
    [BENCH] (T-7.9410ms): compiled input
    [BENCH] (bc=500002|globals=250001)
    [BENCH] (T-0.3490ms): ran vm
    [BENCH] (T-0.8280ms): destroyed Nodes, vm and input
```
Hashing everything
I want to distinguish atoms (strings, numbers, idents) from other atoms for interning purposes and faster comparisons in the pipeline. This can be done via hashes, especially since we already "visit" each character of said atoms while converting them to tokens. Hashing them is therefore just a matter of a few extra computations while walking the atoms' underlying bytes.
For instance strings: before this, the lexer just advanced until it hit the closing delimiter or EOF:

```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {

  // ...

string: {
  // skip "
  l->pos++;
  size_t start = l->pos;
  for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {}

  if (UNLIKELY(cur(l) != '"')) {
    Str slice = Str_slice(&l->input, l->pos, l->input.len);
    fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
            slice.p);
    out[count++] = INTERN_EOF;
  } else {
    Token *t = CALL(a, request, sizeof(Token));
    t->type = T_STRING;
    t->string = (Str){
        .p = l->input.p + start,
        .len = l->pos - start,
    };
    out[count++] = t;
    // skip "
    l->pos++;
  }
  JUMP_TARGET;
}

  // ...

}
```
Adding hashing to this is fairly easy:

```diff
diff --git a/lexer.c b/lexer.c
index 316a494..2280a6b 100644
--- a/lexer.c
+++ b/lexer.c
@@ -286,10 +286,7 @@ string: {
   // skip "
   l->pos++;
   size_t start = l->pos;
+  size_t hash = FNV_OFFSET_BASIS;
   for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
+    hash ^= cc;
+    hash *= FNV_PRIME;
   }

   if (UNLIKELY(cur(l) != '"')) {
@@ -303,7 +300,6 @@ string: {
     t->string = (Str){
         .p = l->input.p + start,
         .len = l->pos - start,
+        .hash = hash,
     };
     out[count++] = t;
     // skip "
```
Numbers, identifiers and builtins are also hashed; I omitted this here for clarity and since we will revisit this topic later in the article.
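The snippets above use FNV_OFFSET_BASIS and FNV_PRIME without showing them. Assuming the common 64-bit FNV-1a parameters (purple-garden's actual constants may differ), the standalone hash boils down to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// 64-bit FNV-1a parameters (assumed, not taken from purple-garden's source)
#define FNV_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV_PRIME 0x100000001b3ULL

static uint64_t fnv1a(const uint8_t *p, size_t len) {
  uint64_t hash = FNV_OFFSET_BASIS;
  for (size_t i = 0; i < len; i++) {
    hash ^= p[i]; // xor first, then multiply: that's the 1a variant
    hash *= FNV_PRIME;
  }
  return hash;
}
```

FNV-1a is a good fit here precisely because it consumes one byte per step, so it folds into the lexer's existing per-character loop for free.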
Interning Tokens
Most of the tokens we will encounter are constant, we know:
- their size
- their type
We can use this knowledge, statically allocate these tokens and use their pointers to reduce memory pressure:

```c
// we can "intern" these, since all of them are the same, regardless of position
Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
Token *INTERN_EOF = &SINGLE_TOK(T_EOF);

// size_t Lexer_all(Lexer *l, Allocator *a, Token **out)
```
SINGLE_TOK is just:

```c
#define SINGLE_TOK(t) ((Token){.type = t})
```
Prehashing keywords for comparisons
As introduced in the previous chapters, all identifiers are hashed, thus we can also hash the known keywords at startup and make comparing them very fast. Let's take a look at true and false: both are known keywords and we will need to compare found identifiers to them.

```c
ident: {
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  size_t len = l->pos - start;
  Token *t;

  // comparing to the keywords is now just a number comparison
  if (hash == true_hash) {
    t = INTERN_TRUE;
  } else if (hash == false_hash) {
    t = INTERN_FALSE;
  } else {
    t = CALL(a, request, sizeof(Token));
    t->type = T_IDENT;
    t->string = (Str){
        .p = l->input.p + start,
        .len = len,
        .hash = hash,
    };
  }
  out[count++] = t;
  JUMP_TARGET;
}
```
Both true_hash and false_hash are computed at the start of Lexer_all:

```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
  ASSERT(out != NULL, "Failed to allocate token list");

  // empty input
  if (l->input.len == 0) {
    out[0] = INTERN_EOF;
    return 1;
  }

  size_t true_hash = Str_hash(&STRING("true"));
  size_t false_hash = Str_hash(&STRING("false"));

  // [...]
}
```
Tip - is_alphanum performance deep dive
I know this function shouldn't be called is_alphanum, because it allows [A-Za-z0-9_-].
A naive check of is_alphanum can be:

```c
bool is_alphanum(char cc) {
  return (cc >= 'a' && cc <= 'z') || (cc >= 'A' && cc <= 'Z')
         || (cc >= '0' && cc <= '9') || cc == '_' || cc == '-';
}
```
We know we can omit the uppercase check by converting the character to its lowercase representation, so let's fold the character, since ASCII upper and lowercase characters only differ by a single bit:

```c
bool is_alphanum(char cc) {
  uint8_t lower = cc | 0x20;
  bool is_alpha = (lower >= 'a' && lower <= 'z');
  bool is_digit = (cc >= '0' && cc <= '9');
  return is_alpha || is_digit || cc == '_' || cc == '-';
}
```
In benchmarks I was able to measure that inline and the parameter type uint8_t have a reproducible impact of reducing the runtime by 1-5% for identifier heavy inputs, so I marked the function as "private" static inline:

```c
__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  uint8_t lower = cc | 0x20;
  bool is_alpha = (lower >= 'a' && lower <= 'z');
  bool is_digit = (cc >= '0' && cc <= '9');
  return is_alpha || is_digit || cc == '_' || cc == '-';
}
```
There are some other ways that could be more efficient, but I haven’t benchmarked these:
A statically allocated lookup table:

```c
static const bool is_alphanum_lookup[128] = {
    ['0' ... '9'] = true,
    ['A' ... 'Z'] = true,
    ['a' ... 'z'] = true,
    ['_'] = true,
    ['-'] = true,
};
__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  return cc < 128 && is_alphanum_lookup[cc];
}
```
Or weird bit sets (I don't fully understand this one and it sucks to read, so no thanks):

```c
// bits 48-57 of table1 correspond to '0'-'9'
static const uint64_t table1 = 0x03ff000000000000;
// bits 1-26 and 33-58 of table2 correspond to 'A'-'Z' and 'a'-'z'
// (note: unlike the variants above, these masks do not cover '_' and '-')
static const uint64_t table2 = 0x07fffffe07fffffe;

__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  if (cc >= 128)
    return false;
  return cc < 64 ? (table1 >> cc) & 1 : (table2 >> (cc - 64)) & 1;
}
```
On demand double and int64_t parsing
Let's revisit hashing numbers: as introduced before, all atoms are hashed, therefore I am able to use this hash for and while interning. This way the compiler converts all duplicated integers and doubles into their numerical representation only once (in the compiler). The lexer therefore only needs to store a window over the atom's input, while verifying that all characters in this window are valid arguments for Str_to_double and Str_to_int64_t. In a later chapter I'll show the compiler doing on demand parsing with the token we created here in the lexer:

```c
number: {
  size_t start = l->pos;
  size_t i = start;
  bool is_double = false;
  size_t hash = FNV_OFFSET_BASIS;
  for (; i < l->input.len; i++) {
    char cc = l->input.p[i];
    hash ^= cc;
    hash *= FNV_PRIME;
    if (cc >= '0' && cc <= '9')
      continue;
    if (cc == '.') {
      ASSERT(!is_double, "Two dots in double");
      is_double = true;
      continue;
    }
    break;
  }

  l->pos = i;
  Token *n = CALL(a, request, sizeof(Token));
  n->string = (Str){
      .p = l->input.p + start,
      .len = i - start,
      .hash = hash,
  };
  if (is_double) {
    n->type = T_DOUBLE;
  } else {
    n->type = T_INTEGER;
  }

  out[count++] = n;
  JUMP_TARGET;
}
```
Tip - Distinguishing between integers and doubles
At first, purple garden's runtime represented all numbers as doubles. After benchmarking, I found out that integer math is a lot faster, so it makes a lot more sense to store all non floating point numbers as integers. As general advice: always benchmark.
Extra: memory mapping the input
Normally one would consume a file in C by:
- opening a file descriptor: fopen
- freading the buffer into a malloced space
- zero terminating the buffer
- closing the file: fclose
```c
// https://stackoverflow.com/questions/14002954/c-how-to-read-an-entire-file-into-a-buffer
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  FILE *f = fopen("textfile.txt", "rb");
  fseek(f, 0, SEEK_END);
  long fsize = ftell(f);
  fseek(f, 0, SEEK_SET);

  char *string = malloc(fsize + 1);
  fread(string, fsize, 1, f);
  fclose(f);

  string[fsize] = 0;
  free(string);
  return EXIT_SUCCESS;
}
```
However, you can also do the whole thing a lot faster by instructing the kernel to dump the whole file into our virtual memory via mmap (just not walking the file twice is already faster).
After opening the file (opening a file with O_RDONLY and mapping it with PROT_READ can be faster than making it mutable), we need its type (we don't want to open or dump a directory) and its size (the API wants a mapping block size). fstat helps us by filling a struct with metadata containing exactly the info we need:
```c
// taken from https://www.commandlinux.com/man-page/man2/fstat.2.html
struct stat {
  dev_t st_dev;         /* ID of device containing file */
  ino_t st_ino;         /* inode number */
  mode_t st_mode;       /* protection */
  nlink_t st_nlink;     /* number of hard links */
  uid_t st_uid;         /* user ID of owner */
  gid_t st_gid;         /* group ID of owner */
  dev_t st_rdev;        /* device ID (if special file) */
  off_t st_size;        /* total size, in bytes */
  blksize_t st_blksize; /* blocksize for file system I/O */
  blkcnt_t st_blocks;   /* number of 512B blocks allocated */
  time_t st_atime;      /* time of last access */
  time_t st_mtime;      /* time of last modification */
  time_t st_ctime;      /* time of last status change */
};
```
I use S_ISREG to check that the handle refers to a regular file. After mmapping with the size stored in stat.st_size, I do some bookkeeping, see the combined snippet below:C
1#include <fcntl.h>
2#include <sys/mman.h>
3#include <sys/stat.h>
4#include <unistd.h>
5
6#include "common.h"
7#include "io.h"
8
9Str IO_read_file_to_string(char *path) {
10 ASSERT(path != NULL, "path was NULL");
11
12 int fd = open(path, O_RDONLY);
13 ASSERT(fd != -1, "failed to open input file");
14
15 struct stat s;
16 fstat(fd, &s);
17 ASSERT(S_ISREG(s.st_mode), "path is not a file");
18
19 long length = s.st_size;
20 if (length < 0) {
21 close(fd);
22 ASSERT(length > 0, "input is empty")
23 }
24
25 char *buffer = 0;
26 if (length != 0) {
27 buffer = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
28 }
29
30 ASSERT(close(fd) == 0, "failed to close file");
31 ASSERT(buffer != MAP_FAILED, "failed to mmap input")
32 return (Str){.len = length, .p = (const uint8_t *)buffer};
33}
Info - Benchmarks
In my benchmark this made the stage before lexing even starts 6x-35x faster, see a2cae88:
Please excuse the ugly debug info, I have reworked this since then. Also, lexing and parsing were a single stage back then.TEXT
1io: use mmap to make IO_read_file_to_string 6x-35x faster
2For 5k lines with 4 atoms each (20k atoms), the initial string read was
3reduced from 0.4390ms to 0.0730ms (6x faster):
4
5Old:
6 [BENCH] (T-0.1620ms): parsed arguments
7 [BENCH] (T-0.4390ms): read file to String
8 [BENCH] (T-10.2020ms): parsed input
9 [BENCH] (T-1.2820ms): compiled input
10 [BENCH] (bc=40008|globals=20004)
11 [BENCH] (T-0.1620ms): ran vm
12 [BENCH] (T-0.6190ms): destroyed Nodes, vm and input
13
14New:
15 [BENCH] (T-0.1510ms): parsed arguments
16 [BENCH] (T-0.0730ms): read file to String
17 [BENCH] (T-10.1350ms): parsed input
18 [BENCH] (T-1.3210ms): compiled input
19 [BENCH] (bc=40008|globals=20004)
20 [BENCH] (T-0.1710ms): ran vm
21 [BENCH] (T-0.6460ms): destroyed Nodes, vm and input
22
23For larger files, such as 250k lines with 4 atoms each (1mio atoms), the
24initial string read was reduced from 3.472ms to 0.0980ms (35x faster):
25
26Old:
27 [BENCH] (T-0.1430ms): parsed arguments
28 [BENCH] (T-3.4720ms): read file to String
29 [BENCH] (T-434.8770ms): parsed input
30 [BENCH] (T-30.7538ms): compiled input
31 [BENCH] (bc=2040408|globals=1020204)
32 [BENCH] (T-7.5610ms): ran vm
33 [BENCH] (T-37.2170ms): destroyed Nodes, vm and input
34
35New:
36 [BENCH] (T-0.1490ms): parsed arguments
37 [BENCH] (T-0.0980ms): read file to String
38 [BENCH] (T-437.4770ms): parsed input
39 [BENCH] (T-30.8820ms): compiled input
40 [BENCH] (bc=2040408|globals=1020204)
41 [BENCH] (T-7.4540ms): ran vm
42 [BENCH] (T-36.9500ms): destroyed Nodes, vm and input
Consuming numeric tokens in the compiler
As introduced in On demand double and int64_t parsing, the lexer does not perform string-to-numeric conversion, but rather stores a hash and a window of the string. The compiler converts tokens with a given hash only once and points any duplicates at the global pool index of that number.
The compiler itself will probably be the topic of a future blog article, but I have kept it simple for now.
token_to_value
is called for all unique (not encountered before and thus not interned) atoms:C
1// token_to_value converts tokens, such as strings, idents and numbers to
2// runtime values
3inline static Value *token_to_value(Token *t, Allocator *a) {
4 Value *v = CALL(a, request, sizeof(Value));
5 switch (t->type) {
6 case T_STRING:
7 case T_IDENT:
8 v->type = V_STR;
9 v->string = t->string;
10 break;
11 case T_INTEGER:
12 v->type = V_INT;
13 v->integer = Str_to_int64_t(&t->string);
14 break;
15 case T_DOUBLE:
16 v->type = V_DOUBLE;
17 v->floating = Str_to_double(&t->string);
18 break;
19 default:
20 ASSERT(0, "Unsupported value for token_to_value");
21 break;
22 }
23 return v;
24}
Note the missing cases for T_FALSE and T_TRUE? They are omitted because there are hard coded entries 0 and 1 in the global pool (@None is handled the same way, it's bound to index 2).
Info - Benchmarks
This resulted in a crazy 15ms / 64% faster total runtime for number- and duplicate-heavy test inputs, see a55a190.TEXT
1lexer+cc: move Str_to_(double|int64_t) parsing from lexer to cc
2This change omits all integer and number parsing from the pipeline but
3the first occurence of each unique integer or number by storing a hash
4of the string representation of said values. At the interning stage in
5the compiler only the first occurence of any hash of a double or integer
6is parsed via Str_to_int64_t or Str_to_double, which reduces the
7theoretically workload for any duplicated number of integers and doubles
8from N to 1.
9
10For a double and integer heavy benchmark (250k loc with 250k duplicated
11doubles and integers) results in:
12
13 - 15ms faster
14 - 64% faster
15 - ~2.8x faster
16
17Prev commit:
18 ./build/bench +V examples/bench.garden
19 [ 0.0000ms] main::Args_parse: Parsed arguments
20 [ 0.0120ms] io::IO_read_file_to_string: mmaped input of size=2500090B
21 [ 0.0050ms] mem::init: Allocated memory block of size=153092096B
22 [ 23.8300ms] lexer::Lexer_all: lexed tokens count=1000033
23 [ 12.5190ms] parser::Parser_next created AST with node_count=250003
24 [ 9.2090ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
25 [ 36.3060ms] vm::Vm_run: executed byte code
26 [ 0.3730ms] mem::Allocator::destroy: Deallocated memory space
27 [ 0.0410ms] vm::Vm_destroy: teared vm down
28 [ 0.0000ms] munmap: unmapped input
29
30New:
31 ./build/bench +V examples/bench.garden
32 [ 0.0000ms] main::Args_parse: Parsed arguments
33 [ 0.0170ms] io::IO_read_file_to_string: mmaped input of size=2500090B
34 [ 0.0060ms] mem::init: Allocated memory block of size=153092096B
35 [ 8.5270ms] lexer::Lexer_all: lexed tokens count=1000033
36 [ 12.2070ms] parser::Parser_next created AST with node_count=250003
37 [ 9.4020ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
38 [ 36.9900ms] vm::Vm_run: executed byte code
39 [ 0.3960ms] mem::Allocator::destroy: Deallocated memory space
40 [ 0.0480ms] vm::Vm_destroy: teared vm down
41 [ 0.0010ms] munmap: unmapped input
Benchmark
So I created what I would consider a fairly heavy lexer benchmark:SCHEME
1(@Some (@Some (@Some (@None))))
2true false true false
33.1415 22222222222 .12345
4"string me this, string me that"
5'quoted-strings-is-a-must-do
6(@let unquoted-strings-are-just-idents (@None))
7unquoted-strings-are-just-idents
8(@None) (+) (-) (*) (/) (=)
9;; COMMENT COMMENT COMMENT
10;; COMMENT COMMENT COMMENT
11;; COMMENT COMMENT COMMENT with whitespace for 3 lines
12
13
14
15;; whitespace end
And I typed VggyG66666p in vim to fill it up to 1mio lines (1000005 exactly).
On a Laptop
Terminal$ inxi -CMD
System:
Host: ************* Kernel: 6.11.0-28-generic arch: x86_64 bits: 64
Desktop: i3 v: 4.23 Distro: Ubuntu 24.04.2 LTS (Noble Numbat)
Machine:
Type: Laptop System: LENOVO product: 21F8002TGE v: ThinkPad T14s Gen 4
serial:
Mobo: LENOVO model: 21F8002TGE v: SDK0T76530 WIN
serial: UEFI: LENOVO v: R2EET41W (1.22 )
date: 09/22/2024
CPU:
Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
type: MT MCP cache: L2: 8 MiB
Speed (MHz): avg: 883 min/max: 400/5132 cores: 1: 1388 2: 400 3: 1396
4: 400 5: 400 6: 400 7: 1374 8: 400 9: 1331 10: 400 11: 1357 12: 400
13: 1357 14: 1346 15: 1393 16: 400
With the above components.Terminal
$ make bench
./build/bench +V examples/bench.garden
[ 0.0000ms] main::Args_parse: Parsed arguments
[ 0.0150ms] io::IO_read_file_to_string: mmaped input of size=25466794B
[ 0.0060ms] mem::init: Allocated memory block of size=929537981B
[ 43.9190ms] lexer::Lexer_all: lexed tokens count=3133350
[ 48.8460ms] parser::Parser_next created AST with node_count=1200006
[ 18.2070ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
[ 8.9970ms] vm::Vm_run: executed byte code
[ 26.7470ms] mem::Allocator::destroy: Deallocated memory space
[ 1.0180ms] vm::Vm_destroy: teared vm down
[ 0.0000ms] munmap: unmapped input
I can confidently say I do a million lines or 25,466,794 Bytes in 44ms. Let’s do some math:$$ \begin{align} 44 \textrm{ms} &\triangleq 25,466,794 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 578,790.77 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 578,790,770 \mathrm{B} \\ &= \underline{578.79 \mathrm{MB}/\textrm{s}} \end{align} $$
In tokens:$$ \begin{align} 44 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 71,212.5 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 71,212,500 \mathrm{T} \\ &= \underline{71,212,500 \mathrm{T}/\textrm{s}} \end{align} $$
That’s pretty fast, but SIMD can probably do a lot better at this point. However, I haven’t started that experiment yet.
On a Tower
Terminal$ inxi -CMD
System:
Host: comfyputer Kernel: 6.15.4-arch2-1 arch: x86_64
bits: 64
Desktop: i3 v: 4.24 Distro: Arch Linux
Machine:
Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS
v: Rev X.0x serial:
UEFI: American Megatrends v: 2008 date: 12/06/2019
CPU:
Info: 8-core model: AMD Ryzen 7 3700X bits: 64
type: MT MCP cache: L2: 4 MiB
Speed (MHz): avg: 4052 min/max: 2200/4979 cores:
1: 4052 2: 4052 3: 4052 4: 4052 5: 4052 6: 4052
7: 4052 8: 4052 9: 4052 10: 4052 11: 4052 12: 4052
13: 4052 14: 4052 15: 4052 16: 4052
So we are around 14ms faster on my tower.TEXT
1./build/bench +V examples/bench.garden
2[ 0.0000ms] main::Args_parse: Parsed arguments
3[ 0.0070ms] io::IO_read_file_to_string: mmaped input of size=25400127B
4[ 0.0030ms] mem::init: Allocated memory block of size=927104635B
5[ 30.9930ms] lexer::Lexer_all: lexed tokens count=3133350
6[ 22.6340ms] parser::Parser_next created AST with node_count=1200006
7[ 10.1480ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
8[ 7.4800ms] vm::Vm_run: executed byte code
9[ 0.7520ms] mem::Allocator::destroy: Deallocated memory space
10[ 0.0620ms] vm::Vm_destroy: teared vm down
11[ 0.0000ms] munmap: unmapped input
The same math as above, just with 30ms instead of 44ms:$$ \begin{align} 30 \textrm{ms} &\triangleq 25,466,794 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 848,893.13 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 848,893,130 \mathrm{B} \\ &= \underline{848.89 \mathrm{MB}/\textrm{s}} \end{align} $$
In tokens:$$ \begin{align} 30 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 104,445 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 104,445,000 \mathrm{T} \\ &= \underline{104,445,000 \mathrm{T}/\textrm{s}} \end{align} $$
Benchmark contexts
For a C input of 7.5 mio loc, which is of course more complex to tokenize than my language, see Some Strategies For Fast Lexical Analysis when Parsing Programming Languages. The following numbers are available there, and I added the performance the purple-garden lexer achieves for a 7.5 mio line lexer-heavy benchmark input.
| lexer | performance |
| --- | --- |
| flex (default) | 13.56 s |
| stb_lex (w/symbol hashing) | 4.84 s |
| stb_lex | 4.23 s |
| flex -F (fast) | 3.07 s |
| flex -f (full) | 2.92 s |
| handcoded | 2.45 s |
| handcoded mmap | 2.14 s |
| wc | 1.73 s |
| purple-garden (laptop) | 0.308 s |
| purple-garden (tower) | 0.150 s |
What next
A summary of what I implemented in this article:
- Jump table for direct threading
- 0 copy and window based tokens
- interned and stateless tokens
- bump allocator for unique tokens
- inline hashing for atoms that need it (strings, idents, numeric)
- fast paths for true and false
While 580-848 MB/s is already pretty fast, I want to go further; some things I have planned:
- use the absurd bit set based is_alphanum checks
- use SIMD for comments and whitespace
- use SIMD as a preprocessing step to find markers for tokens 16 bytes at a time
- replace FNV-1a with a faster hashing algorithm, something like xxHash
- prefetch some amount of bytes to reduce L1 & L2 latency
- mmap larger inputs with MAP_HUGETLB
- maybe align mmap to 64 byte boundaries for SIMD
Putting it all together
C 1#include "lexer.h"
2#include "common.h"
3#include "mem.h"
4#include "strings.h"
5#include <stdbool.h>
6#include <stddef.h>
7#include <stdint.h>
8#include <stdio.h>
9
10#define SINGLE_TOK(t) ((Token){.type = t})
11
12Str TOKEN_TYPE_MAP[] = {[T_DELIMITOR_LEFT] = STRING("T_DELIMITOR_LEFT"),
13 [T_DELIMITOR_RIGHT] = STRING("T_DELIMITOR_RIGHT"),
14 [T_BRAKET_LEFT] = STRING("T_BRAKET_LEFT"),
15 [T_BRAKET_RIGHT] = STRING("T_BRAKET_RIGHT"),
16 [T_STRING] = STRING("T_STRING"),
17 [T_TRUE] = STRING("T_TRUE"),
18 [T_FALSE] = STRING("T_FALSE"),
19 [T_DOUBLE] = STRING("T_DOUBLE"),
20 [T_INTEGER] = STRING("T_INTEGER"),
21 [T_BUILTIN] = STRING("T_BUILTIN"),
22 [T_IDENT] = STRING("T_IDENT"),
23 [T_PLUS] = STRING("T_PLUS"),
24 [T_MINUS] = STRING("T_MINUS"),
25 [T_ASTERISKS] = STRING("T_ASTERISKS"),
26 [T_SLASH] = STRING("T_SLASH"),
27 [T_EQUAL] = STRING("T_EQUAL"),
28 [T_EOF] = STRING("T_EOF")};
29
30Lexer Lexer_new(Str input) {
31 return (Lexer){
32 .input = input,
33 .pos = 0,
34 };
35}
36
37#define cur(L) (L->input.p[L->pos])
38
39__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
40 uint8_t lower = cc | 0x20;
41 bool is_alpha = (lower >= 'a' && lower <= 'z');
42 bool is_digit = (cc >= '0' && cc <= '9');
43 return is_alpha || is_digit || cc == '_' || cc == '-';
44}
45
46// we can "intern" these, since all of them are the same, regardless of position
47Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
48Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
49Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
50Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
51Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
52Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
53Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
54Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
55Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
56Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
57Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
58Token *INTERN_EOF = &SINGLE_TOK(T_EOF);
59
60size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
61 ASSERT(out != NULL, "Failed to allocate token list");
62
63 // empty input
64 if (l->input.len == 0) {
65 out[0] = INTERN_EOF;
66 return 1;
67 }
68
69 size_t true_hash = Str_hash(&STRING("true"));
70 size_t false_hash = Str_hash(&STRING("false"));
71
72 size_t count = 0;
73 static void *jump_table[256] = {
74 [0 ... 255] = &&unknown,
75 [' '] = &&whitespace,
76 ['\t'] = &&whitespace,
77 ['\n'] = &&whitespace,
78 [';'] = &&comment,
79 ['('] = &&delimitor_left,
80 [')'] = &&delimitor_right,
81 ['@'] = &&builtin,
82 ['.'] = &&number,
83 ['0' ... '9'] = &&number,
84 ['a' ... 'z'] = &&ident,
85 ['A' ... 'Z'] = &&ident,
86 ['_'] = &&ident,
87 ['\''] = &"ed,
88 ['"'] = &&string,
89 ['+'] = &&plus,
90 ['-'] = &&minus,
91 ['/'] = &&slash,
92 ['*'] = &&asterisks,
93 ['='] = &&equal,
94 ['['] = &&braket_left,
95 [']'] = &&braket_right,
96 [0] = &&end,
97 };
98
99#define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
100
101 JUMP_TARGET;
102
103delimitor_left:
104 out[count++] = INTERN_DELIMITOR_LEFT;
105 l->pos++;
106 JUMP_TARGET;
107
108delimitor_right:
109 out[count++] = INTERN_DELIMITOR_RIGHT;
110 l->pos++;
111 JUMP_TARGET;
112
113braket_left:
114 out[count++] = INTERN_BRAKET_LEFT;
115 l->pos++;
116 JUMP_TARGET;
117
118braket_right:
119 out[count++] = INTERN_BRAKET_RIGHT;
120 l->pos++;
121 JUMP_TARGET;
122
123builtin: {
124 l->pos++;
125 // not an ident after @, this is shit
126 if (!is_alphanum(cur(l))) {
127 out[count++] = INTERN_EOF;
128 }
129 size_t start = l->pos;
130 size_t hash = FNV_OFFSET_BASIS;
131 for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
132 hash ^= cc;
133 hash *= FNV_PRIME;
134 }
135
136 size_t len = l->pos - start;
137 Str s = (Str){
138 .p = l->input.p + start,
139 .len = len,
140 .hash = hash,
141 };
142 Token *b = CALL(a, request, sizeof(Token));
143 b->string = s;
144 b->type = T_BUILTIN;
145 out[count++] = b;
146 JUMP_TARGET;
147}
148
149plus:
150 out[count++] = INTERN_PLUS;
151 l->pos++;
152 JUMP_TARGET;
153
154minus:
155 out[count++] = INTERN_MINUS;
156 l->pos++;
157 JUMP_TARGET;
158
159slash:
160 out[count++] = INTERN_SLASH;
161 l->pos++;
162 JUMP_TARGET;
163
164equal:
165 out[count++] = INTERN_EQUAL;
166 l->pos++;
167 JUMP_TARGET;
168
169asterisks:
170 out[count++] = INTERN_ASTERISKS;
171 l->pos++;
172 JUMP_TARGET;
173
174number: {
175 size_t start = l->pos;
176 size_t i = start;
177 bool is_double = false;
178 size_t hash = FNV_OFFSET_BASIS;
179 for (; i < l->input.len; i++) {
180 char cc = l->input.p[i];
181 hash ^= cc;
182 hash *= FNV_PRIME;
183 if (cc >= '0' && cc <= '9')
184 continue;
185 if (cc == '.') {
186 ASSERT(!is_double, "Two dots in double");
187 is_double = true;
188 continue;
189 }
190 break;
191 }
192
193 l->pos = i;
194 Token *n = CALL(a, request, sizeof(Token));
195 n->string = (Str){
196 .p = l->input.p + start,
197 .len = i - start,
198 .hash = hash,
199 };
200 if (is_double) {
201 n->type = T_DOUBLE;
202 } else {
203 n->type = T_INTEGER;
204 }
205
206 out[count++] = n;
207 JUMP_TARGET;
208}
209
210ident: {
211 size_t start = l->pos;
212 size_t hash = FNV_OFFSET_BASIS;
213 for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
214 hash ^= cc;
215 hash *= FNV_PRIME;
216 }
217
218 size_t len = l->pos - start;
219 Token *t;
220 if (hash == true_hash) {
221 t = INTERN_TRUE;
222 } else if (hash == false_hash) {
223 t = INTERN_FALSE;
224 } else {
225 t = CALL(a, request, sizeof(Token));
226 t->type = T_IDENT;
227 t->string = (Str){
228 .p = l->input.p + start,
229 .len = len,
230 .hash = hash,
231 };
232 }
233 out[count++] = t;
234 JUMP_TARGET;
235}
236
237// same as string but only with leading '
238quoted: {
239 // skip '
240 l->pos++;
241 size_t start = l->pos;
242 size_t hash = FNV_OFFSET_BASIS;
243 for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
244 hash ^= cc;
245 hash *= FNV_PRIME;
246 }
247
248 size_t len = l->pos - start;
249 Token *t;
250 t = CALL(a, request, sizeof(Token));
251 t->type = T_STRING;
252 t->string = (Str){
253 .p = l->input.p + start,
254 .len = len,
255 .hash = hash,
256 };
257 out[count++] = t;
258 JUMP_TARGET;
259}
260
261string: {
262 // skip "
263 l->pos++;
264 size_t start = l->pos;
265 size_t hash = FNV_OFFSET_BASIS;
266 for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
267 hash ^= cc;
268 hash *= FNV_PRIME;
269 }
270
271 if (UNLIKELY(cur(l) != '"')) {
272 Str slice = Str_slice(&l->input, l->pos, l->input.len);
273 fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
274 slice.p);
275 out[count++] = INTERN_EOF;
276 } else {
277 Token *t = CALL(a, request, sizeof(Token));
278 t->type = T_STRING;
279 t->string = (Str){
280 .p = l->input.p + start,
281 .len = l->pos - start,
282 .hash = hash,
283 };
284 out[count++] = t;
285 // skip "
286 l->pos++;
287 }
288 JUMP_TARGET;
289}
290
291comment:
292 for (char cc = cur(l); cc > 0 && cc != '\n'; l->pos++, cc = cur(l)) {
293 }
294 JUMP_TARGET;
295
296whitespace:
297 l->pos++;
298 JUMP_TARGET;
299
300unknown: {
301 uint8_t c = cur(l);
302 ASSERT(0, "Unexpected byte '%c' (0x%X) in input", c, c)
303}
304
305end:
306 out[count++] = INTERN_EOF;
307 return count;
308}
309
310#undef SINGLE_TOK
Thank you for reading this far. If you have any suggestions or feedback, feel free to send me an email at [email protected].