Porting Re2/j to C#

For these last few weeks, I’ve been trying to grapple with the problem of p/invoke–the nasty but must-use feature in C#. While one could write these declarations out by hand, some libraries are too large, and change too often, so people use p/invoke generators, like SWIG. However, the is no generator that is easy to use or generates 100% correct C# declarations. So, as every software developer does, so I go to re-invent the wheel.

Part of my solution involves using regular expressions–which requires a recognizer. Rather than write my own regular expression engine–as I don’t want to spend all day writing one–I decided to look for one.

If you try to Google for “regular expression engine source code”, you will find a great many implementations. But, hardly any I would consider an easy read, in the evening beside your dog at the fireplace. Some say the implementation by Pike and Kernighan is a nice and clean implementation, but that code really isn’t a regular expression engine. It’s a glob pattern matcher. And, if you try to extend it to include regular expression operators like groups (parenthesized patterns), sets (denoted by square brackets), and Boolean Or operators, you end up with a big, fat, pile of crap.

Surprisingly, virtually all regular expression engines are implemented as NFA’s with backtracking. As I really didn’t want to go through Thompson’s Construction, and re-implement something I did years ago, I decided to try a nice implementation of Thompson’s original NFA algorithm by Cox. The original code is in C++, then ported to Java a few years ago. Starting with that, I ported it to C# over the last two days. Does it work? Yes!

How does it work?

RE2 works by constructing an NFA, then, as explained by Wikipedia, keeps a set data structure of all the possible states that the recognizer is in. (Surprisingly, Wiki does not describe the backtracking method, used by most RE recognizers!)

Let’s take Cox’s example (pattern = “abab|abbb”, text = “abbb”). Debugging the implementation, we see the NFA is implemented by the class Machine. Prog encodes the states and transitions of the NFA. The inst field of the class encodes the transitions organized by state, which is identified by a non-negative integer. The main loop of the recognizer is for-loop. Starting in state 1, the recognizer steps to {2} on ‘a’. Then, on ‘b’ the recognizer steps to {7, 3, 5}. Then, on ‘b’, the recognizer steps to {6}. Then, on ‘b’, the recognizer steps to {8}, which is the match state.

The Alternative

Note: The simplest implementation I’ve seen is just a recursive descent, which you can find here. I think I even had it on a quiz way back in compiler construction many years ago.


Posted in Tip