This little piggy is not at the market yet
With a bit of hacking for the last month or two, I can finally see that I am making progress on Piggy, a new kind of p/invoke generator. Some might say “Why in the world are you wasting time writing a p/invoke generator? Aren’t there tools already that do this?” Yes, while there are generators, they aren’t that good! They offer obscure switches to change how something is converted to C#. You don’t know when or how they apply.
Piggy is different. You program the p/invoke generator for the exact AST matches you want to apply. There is no fishing around about what it does. The context is all right there. But, in order to write tree-matching patterns, you have to have a clear understanding of the abstract syntax tree for the input.
The representation of an AST in Piggy
Piggy reads C++ code using Clang. After the parse, Clang passes back an AST. Unfortunately, Clang doesn’t have a serializer for the AST, so I wrote one. The syntax of the serialized AST is a nested tree expression. It may be antiquated, as many people nowadays prefer to use tabs to denote levels in trees and code, e.g., JSON and Python, and even the output AST from Clang (clang -Xclang -ast-dump test.h). They argue that indentation mandates structure that is optional in free-format languages, so it is “better”. IMHO, this is nonsense! I find I like to use single lines for some tree patterns. And, I still like to write if-statements with a simple then on the same line.
With Piggy. use the option -ast
to output the AST for the input source. A node
in the AST is represented with the following prototype: (N A1 A2 ... ( C1 ...) ( C2 ...) ...)
.
The type of the node is N
. All attributes of the node, like Name or SrcLoc,
follow the node type. Attributes of the node all follow the same
syntax: name = value
, and always precede any children of the node.
Children of the node are further parenthesized expressions. For example,
the following is a partial tree from the AST of the Clang header
file CXErrorCode.h. The tree contains an enum declaration with comments. The location
of the code is shown with SrcRange attributes.
Template for pattern matching
Patterns are themselves nested parentheses expressions. They can match any node in the tree in a DFS in-order traverse and multiple times. However, once a pattern matches a particular node, no other patterns match for that node or any nodes contained in that sub-tree. For further discussion, please refer to the EnumDecl example shown below.
Patterns generally specify a type of node to match, e.g., EnumDecl
. However,
there are times when you want to match any type. You can either leave it blank
or use an asterisk. In any case, this sets up a context for matching any
children of the node. Any sub-nodes in the pattern cannot match without
the parent matching. Further, any children that do not match causes the
entire pattern to not match.
A nested parentheses expression that begins with (*
and ends with *)
is for
matching a subtree at any depth in the tree with the given context. With the
enums pattern, (* EnumDecl ... *)
matches an EnumDecl
node at any depth
within the selected node with SrcRange.
A nested parentheses expression which begins with (%
and ends with %)
is a
grouping. It works just like that in regular expressions. In this example,
it’s used in combination with the OR-operator (|
) and the STAR-operator
(*
). Operators +
and *
are interpreted just as with regular expressions:
+
denotes one or more occurrences of a nested parentheses expression;
*
denotes zero or more occurrences of a nested parentheses expression.
Intermingled with a nested parentheses expression attributes and children
is template code, which begins with {
and ends with }
. It indicates the
output generated for an AST node match and must be valid C# code.
More on this later.
The expression ( SrcRange=".*\\clang-c\\.*" )
matches all nodes with any type,
with SrcRange containing “clang-c” in the file name. Values in a pattern for an
attribute can be any valid C# Regex.
Template matching engine
The grammar for a nested parenthesized AST expression results in one tree;
the grammar for template pattern matching results in another tree.
The engine that matches template with the AST is a recursive predictive
parser. Matching proceeds in a DFS in-order traversal. When a pattern matches,
the engine sets the variable Intercept<IParseTree, IParseTree>
matches
mapping an AST match with a node in the pattern. The engine also computes
the variable Dictionary<IParseTree, IParseTree>
parent as it is needed
in the code output blocks, and the Antlr parser does not record this property.
Code output
In the template, three variables are passed to the code for the purpose of side effects: vars, result, and tree.
The variable vars is a C# Dictionary<String, object>
. You can set values
in one code section and pass it onto another code section in the template.
Be warned, a get of the key must have a value, otherwise, the code will throw
an exception.
The variable result is a C# StringBuffer
. Use this to append output to the
translation. For example, in the EnumDecl
pattern, we output enum followed
by the name of the type, then an open parenthesis and newline.
The variable tree is a “dom-like” object for referring to the AST matched.
The type of the variable is Tree, which you can find here. Method Peek()
returns a Tree relative to the current tree node. So, tree.Peek(0)
denotes
the current tree; tree.Peek(1)
denotes the parent of the current node; etc.
Attr() gets the value of the named attribute. If there is no attribute with
the name, it returns the empty string.
Currently, there is a problem with code blocks that contain open and closing
curly brackets. Unfortunately, the Antlr parser I wrote doesn’t correctly
ignore curly brackets in strings. So, this is why you see the Unicode
values used in the EnumDecl
template.
Output engine
Output for the tool is generated using a non-recursive implementation for a DFS in-order tree walker of the AST. When an AST node is found that maps to a matched pattern node, output nodes of the pattern are interleaved with the children of the AST node. As the nodes are visited, output is generated for the code blocks. The implementation is here.
Output
Output from the tool is the accumulated output from the StringBuffer generated by the code blocks that match the pattern.
In this example, the output is
What are the problems I discovered in writing Piggy?
By no means is Piggy finished, let alone available for general use. There are still a number of problems to resolve.
- When does an AST node “match” a pattern? And, when does matching stop?
- How do patterns with
*
and+
operators work? - Should patterns allow references to “Parent”?
- I’m having problems with curly brackets in strings in code blocks. How do I write an Antlr grammar that balances curly brackets, yet ignores one contained in C# string literals?
Stay tuned for further developments.