With a bit of hacking for the last month or two, I can finally see that I am making progress on Piggy, a new kind of p/invoke generator. Some might say “Why in the world are you wasting time writing a p/invoke generator? Aren’t there tools already that do this?” Yes, while there are generators, they aren’t that good! They offer obscure switches to change how something is converted to C#. You don’t know when or how they apply.

Piggy is different. You program the p/invoke generator for the exact AST matches you want to apply. There is no fishing around about what it does. The context is all right there. But, in order to write tree-matching patterns, you have to have a clear understanding of the abstract syntax tree for the input.

The representation of an AST in Piggy

Piggy reads C++ code using Clang. After the parse, Clang passes back an AST. Unfortunately, Clang doesn’t have a serializer for the AST, so I wrote one. The syntax of the serialized AST is a nested tree expression. It may be antiquated, as many people nowadays prefer to use tabs to denote levels in trees and code, e.g., JSON and Python, and even the output AST from Clang (clang -Xclang -ast-dump test.h). They argue that indentation mandates structure that is optional in free-format languages, so it is “better”. IMHO, this is nonsense! I find I like to use single lines for some tree patterns. And, I still like to write if-statements with a simple then on the same line.

With Piggy. use the option -ast to output the AST for the input source. A node in the AST is represented with the following prototype: (N A1 A2 ... ( C1 ...) ( C2 ...) ...). The type of the node is N. All attributes of the node, like Name or SrcLoc, follow the node type. Attributes of the node all follow the same syntax: name = value, and always precede any children of the node. Children of the node are further parenthesized expressions. For example, the following is a partial tree from the AST of the Clang header file CXErrorCode.h. The tree contains an enum declaration with comments. The location of the code is shown with SrcRange attributes.

Template for pattern matching

Patterns are themselves nested parentheses expressions. They can match any node in the tree in a DFS in-order traverse and multiple times. However, once a pattern matches a particular node, no other patterns match for that node or any nodes contained in that sub-tree. For further discussion, please refer to the EnumDecl example shown below.

Patterns generally specify a type of node to match, e.g., EnumDecl. However, there are times when you want to match any type. You can either leave it blank or use an asterisk. In any case, this sets up a context for matching any children of the node. Any sub-nodes in the pattern cannot match without the parent matching. Further, any children that do not match causes the entire pattern to not match.

A nested parentheses expression that begins with (* and ends with *) is for matching a subtree at any depth in the tree with the given context. With the enums pattern, (* EnumDecl ... *) matches an EnumDecl node at any depth within the selected node with SrcRange.

A nested parentheses expression which begins with (% and ends with %) is a grouping. It works just like that in regular expressions. In this example, it’s used in combination with the OR-operator (|) and the STAR-operator (*). Operators + and * are interpreted just as with regular expressions: + denotes one or more occurrences of a nested parentheses expression; * denotes zero or more occurrences of a nested parentheses expression. Intermingled with a nested parentheses expression attributes and children is template code, which begins with { and ends with }. It indicates the output generated for an AST node match and must be valid C# code. More on this later.

The expression ( SrcRange=".*\\clang-c\\.*" ) matches all nodes with any type, with SrcRange containing “clang-c” in the file name. Values in a pattern for an attribute can be any valid C# Regex.

Template matching engine

The grammar for a nested parenthesized AST expression results in one tree; the grammar for template pattern matching results in another tree. The engine that matches template with the AST is a recursive predictive parser. Matching proceeds in a DFS in-order traversal. When a pattern matches, the engine sets the variable Intercept<IParseTree, IParseTree> matches mapping an AST match with a node in the pattern. The engine also computes the variable Dictionary<IParseTree, IParseTree> parent as it is needed in the code output blocks, and the Antlr parser does not record this property.

Code output

In the template, three variables are passed to the code for the purpose of side effects: vars, result, and tree.

The variable vars is a C# Dictionary<String, object>. You can set values in one code section and pass it onto another code section in the template. Be warned, a get of the key must have a value, otherwise, the code will throw an exception.

The variable result is a C# StringBuffer. Use this to append output to the translation. For example, in the EnumDecl pattern, we output enum followed by the name of the type, then an open parenthesis and newline.

The variable tree is a “dom-like” object for referring to the AST matched. The type of the variable is Tree, which you can find here. Method Peek() returns a Tree relative to the current tree node. So, tree.Peek(0) denotes the current tree; tree.Peek(1) denotes the parent of the current node; etc. Attr() gets the value of the named attribute. If there is no attribute with the name, it returns the empty string.

Currently, there is a problem with code blocks that contain open and closing curly brackets. Unfortunately, the Antlr parser I wrote doesn’t correctly ignore curly brackets in strings. So, this is why you see the Unicode values used in the EnumDecl template.

Output engine

Output for the tool is generated using a non-recursive implementation for a DFS in-order tree walker of the AST. When an AST node is found that maps to a matched pattern node, output nodes of the pattern are interleaved with the children of the AST node. As the nodes are visited, output is generated for the code blocks. The implementation is here.

Output

Output from the tool is the accumulated output from the StringBuffer generated by the code blocks that match the pattern.

In this example, the output is

What are the problems I discovered in writing Piggy?

By no means is Piggy finished, let alone available for general use. There are still a number of problems to resolve.

When does an AST node “match” a pattern? And, when does matching stop?
How do patterns with * and + operators work?
Should patterns allow references to “Parent”?
I’m having problems with curly brackets in strings in code blocks. How do I write an Antlr grammar that balances curly brackets, yet ignores one contained in C# string literals?

Stay tuned for further developments.