I’ve recently found out about gperf, a GNU tool described as “a perfect hash function generator”.

Well, what do we use it for? Anytime you need to look up a string in a fixed dataset. It generates a “perfect” hash function that promises to do that lookup using at most one string comparison.

A good use for it is in lexers (call it a scanner or tokenizer if you want), where we need to distinguish keywords from identifiers. We scan an identifier, and before calling it an identifier we do comparisons to see if it’s actually a reserved/builtin keyword of the language.

The Problem

I’m going to use clox from Part III of the Crafting Interpreters book as a base to work on.

The book parses keywords like this:

static TokenType checkKeyword(int start, int length,
    const char* rest, TokenType type) {
  if (scanner.current - scanner.start == start + length &&
      memcmp(scanner.start + start, rest, length) == 0) {
    return type;
  }

  return TOKEN_IDENTIFIER;
}

static TokenType identifierType() {
  switch (scanner.start[0]) {
    case 'a': return checkKeyword(1, 2, "nd", TOKEN_AND);
    case 'c': return checkKeyword(1, 4, "lass", TOKEN_CLASS);
    case 'e': return checkKeyword(1, 3, "lse", TOKEN_ELSE);
    case 'f':
      if (scanner.current - scanner.start > 1) {
        switch (scanner.start[1]) {
          case 'a': return checkKeyword(2, 3, "lse", TOKEN_FALSE);
          case 'o': return checkKeyword(2, 1, "r", TOKEN_FOR);
          case 'u': return checkKeyword(2, 1, "n", TOKEN_FUN);
        }
      }
      break;
    case 'i': return checkKeyword(1, 1, "f", TOKEN_IF);
    case 'n': return checkKeyword(1, 2, "il", TOKEN_NIL);
    case 'o': return checkKeyword(1, 1, "r", TOKEN_OR);
    case 'p': return checkKeyword(1, 4, "rint", TOKEN_PRINT);
    case 'r': return checkKeyword(1, 5, "eturn", TOKEN_RETURN);
    case 's': return checkKeyword(1, 4, "uper", TOKEN_SUPER);
    case 't':
      if (scanner.current - scanner.start > 1) {
        switch (scanner.start[1]) {
          case 'h': return checkKeyword(2, 2, "is", TOKEN_THIS);
          case 'r': return checkKeyword(2, 2, "ue", TOKEN_TRUE);
        }
      }
      break;
    case 'v': return checkKeyword(1, 2, "ar", TOKEN_VAR);
    case 'w': return checkKeyword(1, 4, "hile", TOKEN_WHILE);
  }

  return TOKEN_IDENTIFIER;
}

static Token identifier() {
  while (isAlpha(peek()) || isDigit(peek())) advance();
  return makeToken(identifierType());
}

Basically, after scanning an identifier it starts peeking at its letters. E.g., if it starts with “c”, compare the rest of it to see if it could be “class”. Sometimes the first letters conflict and more peeking is needed: if we see a “t” it could be either “true” or “this”, so look at the second letter; if it’s an “h” compare the rest against “this”, and if it’s an “r” compare the rest against “true”.

This check happens on every identifier. The string comparisons aren’t too bad, since we peek at characters and avoid comparing against everything, but there is still a lot of branching going on.

And it gets annoying real fast: as you add more keywords, you find yourself nesting switches inside each other alongside the if statements.

I got tired of this, and at first I was willing to sacrifice the benefits of this approach for more comfort in extending the keyword set, so I defined a table:

typedef struct {
  const char* name;
  size_t length;
  TokenType token;
} Keyword;

static Keyword keywords[] = {
  { "this",  4, TOKEN_THIS     },
  { "true",  4, TOKEN_TRUE     },
  { "class", 5, TOKEN_CLASS    },
  { NULL,    0, (TokenType)(0) }, // sentinel
};

Then I wrote a loop over the table: if a keyword’s length matches our identifier’s length, compare them to see if it’s that keyword. This doesn’t look ahead at the characters, so more comparisons can happen. If I write “time”, it gets compared against both “this” and “true” (two string comparisons), whereas previously the scanner would check the second letter, see that it’s “i”, and do no string comparison at all. And there is still some branching overhead, as before.
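Roughly, the replacement for identifierType looked like this (a sketch, not necessarily the exact code I had):

static TokenType identifierType() {
  int length = scanner.current - scanner.start;

  // Walk the table until the NULL sentinel; only compare when the lengths match.
  for (Keyword* kw = keywords; kw->name != NULL; kw++) {
    if (kw->length == (size_t)length &&
        memcmp(scanner.start, kw->name, length) == 0) {
      return kw->token;
    }
  }

  return TOKEN_IDENTIFIER;
}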

I accepted this as my solution to make it easy to extend the keyword set, and figured I wouldn’t have to worry about scanner performance too much.

That is, until I stumbled across a file while reading the Ruby source code. It piqued my interest, so I looked further into gperf.

The Solution

gperf is amazing. It was easy to get started, there’s not much to learn, the manual is short, I could get up to speed with it quickly, and it works great.

So let’s fix the above with gperf.

Here’s a file called keywords:

struct keyword { const char* name; TokenType token; };
%%
and,    TOKEN_AND
class,  TOKEN_CLASS
else,   TOKEN_ELSE
false,  TOKEN_FALSE
for,    TOKEN_FOR
fun,    TOKEN_FUN
if,     TOKEN_IF
nil,    TOKEN_NIL
or,     TOKEN_OR
print,  TOKEN_PRINT
return, TOKEN_RETURN
super,  TOKEN_SUPER
this,   TOKEN_THIS
true,   TOKEN_TRUE
var,    TOKEN_VAR
while,  TOKEN_WHILE
%%

It pretty much feels like the table approach above, except I don’t have to specify the keyword lengths (they’re not needed), so it’s just as easy to extend. But we don’t suffer from the performance problem above, since gperf creates a “perfect hash function” for the set and resolves a lookup with a single hash and at most one string comparison.

Here’s the command for generating the code:

$ gperf -c -C -t keywords -N checkKeyword > lex.def

This will generate some C code into lex.def which will take care of the keyword lookup.

Here are some explanations of the flags to help you get up to speed:

  • -C Makes the generated tables read-only. We never need to change the keyword data, just look it up, so this marks it const.
  • -c Uses strncmp for comparisons instead of strcmp. This is required for us because scanner.start points into the source code and is not a null-terminated string. The default works fine if you have null-terminated strings.
  • -t Tells gperf that the input file starts with a struct declaration (the struct keyword above); the lookup function will return a pointer to that struct.
  • -N Changes the name of the lookup function. This is optional; the default name is in_word_set.

Note

You can name the output file whatever you want, I just picked lex.def because that’s what mruby did.
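For reference, the lookup function that ends up in lex.def has a prototype roughly like this (recent gperf releases take the length as size_t, older ones as unsigned int, so check your generated file). It returns a pointer to the matching entry, or NULL when the string isn’t a keyword:

const struct keyword* checkKeyword(const char* str, size_t len);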

Now in the scanner I simply #included the lex.def file and removed the old checkKeyword function, since we’ll use the newly generated one. Then replace identifierType with this:

static TokenType identifierType() {
  int length = scanner.current - scanner.start;

  const struct keyword *kw = checkKeyword(scanner.start, length);

  if (kw != NULL) return kw->token;
  return TOKEN_IDENTIFIER;
}

Simple, right? We just call the generated function to look up the keyword for us. If it exists, we return the keyword’s associated token; otherwise it’s an identifier.
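If you’re wondering why this costs at most one string comparison, here’s a hand-made toy version of the idea for a three-keyword set. It is not what gperf actually emits, just the shape of it: a cheap hash built from the length plus a per-character weight table sends every keyword to its own slot, and a single comparison confirms the match. The names (lookupKeyword, wordlist, MAX_HASH_VALUE) and the weights are made up for this illustration; lookupKeyword plays the role of the generated checkKeyword.

#include <stddef.h>
#include <string.h>

typedef enum { TOKEN_IF, TOKEN_OR, TOKEN_AND, TOKEN_IDENTIFIER } TokenType;

struct keyword { const char* name; TokenType token; };

#define MAX_HASH_VALUE 4

// Slot index == hash value: "if" -> 2, "or" -> 3, "and" -> 4, no collisions.
static const struct keyword wordlist[MAX_HASH_VALUE + 1] = {
  { "",    TOKEN_IDENTIFIER },
  { "",    TOKEN_IDENTIFIER },
  { "if",  TOKEN_IF  },
  { "or",  TOKEN_OR  },
  { "and", TOKEN_AND },
};

// hash = length + weight of the first character.
static unsigned int hash(const char* str, size_t len) {
  static const unsigned char weight[256] = { ['o'] = 1, ['a'] = 1 };
  return (unsigned int)len + weight[(unsigned char)str[0]];
}

static const struct keyword* lookupKeyword(const char* str, size_t len) {
  unsigned int key = hash(str, len);
  if (key > MAX_HASH_VALUE) return NULL;
  const struct keyword* kw = &wordlist[key];
  // At most one string comparison ever happens.
  if (strlen(kw->name) == len && memcmp(str, kw->name, len) == 0) return kw;
  return NULL;
}

The real generated code has a larger weight table tuned to the whole keyword set, but the structure is the same: hash, index, one comparison.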

I noticed a decent performance boost using gperf, and it’s still a piece of cake to extend the keyword set.

You can commit the generated code to your git repository so others don’t have to install gperf to build your code, unless they want to tweak the keyword set. The generated code is small, and the output isn’t restricted by any license, so you are free to do what you want with it.

With that said, this is a new tool I learned today and added to my arsenal, and I’d love to use it more often when writing compilers. I just had to share it with you because I want more people to know about this amazing tool. Hope you found it useful :)