This little piggy is not at the market yet

With a bit of hacking for the last month or two, and I can finally see that I am making progress on Piggy, a new kind of p/invoke generator. Some might say “Why in the world are you wasting time writing a p/invoke generator? Aren’t there tools already that do this?” Well, yeah, there are other generators, but they all…how should I say…suck! I need a p/invoke generator for Campy, a compiler and runtime for C# for GPUs, which I am still working on, but had to place on the back burner to work on this. Campy uses LLVM and CUDA. Because these libraries are large and constantly changing, I have to have an automated way of handling new releases.

I first have to note that I’m really disappointed in the many developers out there who have written or contributed to the several existing p/invoke generators (SWIG, CppSharp, ClangSharp, P/Invoke Interop, xInterop, and others). Many of these developers are compiler writers, but they should know better what this problem is about. A p/invoke generator is a source-to-source translator that can be distilled into an L-attribute evaluator of the Abstract Syntax Tree (AST) of the original source. You don’t use command-line switches to change the behavior of the translation when you see a function isn’t being converted correctly. What is needed is a change in tree-matching output patterns! It is especially disappointing because I’ve been basically unemployed for many years, yet I have forgotten none of the theory and practice of compiler construction.

In order to write tree-matching output patterns for the L-attribute evaluator. users have to see the abstract syntax tree for some input. As far as I can tell, Piggy is the only p/invoke generator that shows–and operates on–an AST for the input you want to generate p/invoke decls. With the AST in hand, it’s relatively easy to start writing the translation template patterns. There is no need to fish around a manual to see how to special case the conversion of a method that uses a pointer, when one can simply write a special case pattern to output what one needs for that function.

The representation of an AST in Piggy

Piggy reads C++ code using Clang. After the parse, Clang passes back an AST. Unfortunately, Clang doesn’t have a serializer for the AST, so I wrote one. The syntax of the serialized AST is a nested tree expression. It may be antiquated, as many people nowadays prefer to use tabs to denote levels in trees and code, e.g., JSON and Python, and even the output AST from Clang (clang -Xclang -ast-dump test.h). They argue that indentation mandates structure that is optional in free-format languages, so it is “better”. IMHO, this is nonsense! I find I like to use single lines for some tree patterns. And, I still like to write if-statements with a simple then on the same line.

With Piggy. use the option ast to output the AST for the input source. A node in the AST is represented with the following prototype: (N A1 A2 …. ( C1 …) ( C2 …) ….). The type of the node is N. All attributes of the node, like Name or SrcLoc, follow the node type. Attributes of the node all follow the same syntax: name = value, and always precede any children of the node. Children of the node are further parenthesized expressions. For example, the following is a partial tree from the AST of the Clang header file CXErrorCode.h. The tree contains an enum declaration with comments. The location of the code is shown with SrcRange attributes.

  ( LinkageSpecDecl Pointer="0x2241c0a2360" SrcRange="C:\temp\include\clang-c\CXErrorCode.h:20:1, line:61:1"  SrcLoc="line:20:8" Linkage="C"
    ( EnumDecl Pointer="0x2241c0a23e0" SrcRange="line:29:1, line:58:1"  SrcLoc="line:29:6" Name="CXErrorCode"
      ( FullComment Pointer="0x2241c174de0" SrcRange="line:24:3, line:27:76"
        ( ParagraphComment Pointer="0x2241c174cc0" SrcRange="line:24:3, col:45"
          ( TextComment Pointer="0x2241c174c98" SrcRange="col:3, col:45" Text=" Error codes returned by libclang routines."
        ) )
        ( ParagraphComment Pointer="0x2241c174db0" SrcRange="line:26:3, line:27:76"
          ( TextComment Pointer="0x2241c174ce0" SrcRange="line:26:3, col:9" Text=" Zero ("
          ( InlineCommandComment Pointer="0x2241c174d30" SrcRange="col:10, col:11" Name="c" AdditionalInline="RenderMonospaced" Arg="[0] CXError_Success)"
          ( TextComment Pointer="0x2241c174d50" SrcRange="col:29, col:78" Text=" is the only error code indicating success.  Other"
          ( TextComment Pointer="0x2241c174d70" SrcRange="line:27:3, col:76" Text=" error codes, including not yet assigned non-zero values, indicate errors."
      ) ) )
      ( EnumConstantDecl Pointer="0x2241c0a24f8" SrcRange="line:33:3, col:21"  SrcLoc="col:3" Name="CXError_Success" Type="enum CXErrorCode"
        ( IntegerLiteral Pointer="0x2241c0a24d0" SrcRange="col:21" Type="int" Value="0"
        ( FullComment Pointer="0x2241c174e90" SrcRange="line:31:5, col:14"
          ( ParagraphComment Pointer="0x2241c174e68" SrcRange="col:5, col:14"
            ( TextComment Pointer="0x2241c174e40" SrcRange="col:5, col:14" Text=" No error."
      ) ) ) )
      ( EnumConstantDecl Pointer="0x2241c0a25a0" SrcRange="line:41:3, col:21"  SrcLoc="col:3" Name="CXError_Failure" Type="enum CXErrorCode"
        ( IntegerLiteral Pointer="0x2241c0a2578" SrcRange="col:21" Type="int" Value="1"
        ( FullComment Pointer="0x2241c174fb8" SrcRange="line:36:5, line:39:23"
          ( ParagraphComment Pointer="0x2241c174f18" SrcRange="line:36:5, col:60"
            ( TextComment Pointer="0x2241c174ef0" SrcRange="col:5, col:60" Text=" A generic error code, no further details are available."
          ) )
          ( ParagraphComment Pointer="0x2241c174f88" SrcRange="line:38:5, line:39:23"
            ( TextComment Pointer="0x2241c174f38" SrcRange="line:38:5, col:73" Text=" Errors of this kind can get their own specific error codes in future"
            ( TextComment Pointer="0x2241c174f58" SrcRange="line:39:5, col:23" Text=" libclang versions."
      ) ) ) )
      ( EnumConstantDecl Pointer="0x2241c0a2648" SrcRange="line:46:3, col:21"  SrcLoc="col:3" Name="CXError_Crashed" Type="enum CXErrorCode"
        ( IntegerLiteral Pointer="0x2241c0a2620" SrcRange="col:21" Type="int" Value="2"
        ( FullComment Pointer="0x2241c175068" SrcRange="line:44:5, col:63"
          ( ParagraphComment Pointer="0x2241c175040" SrcRange="col:5, col:63"
            ( TextComment Pointer="0x2241c175018" SrcRange="col:5, col:63" Text=" libclang crashed while performing the requested operation."
      ) ) ) )
      ( EnumConstantDecl Pointer="0x2241c0a26f0" SrcRange="line:52:3, col:30"  SrcLoc="col:3" Name="CXError_InvalidArguments" Type="enum CXErrorCode"
        ( IntegerLiteral Pointer="0x2241c0a26c8" SrcRange="col:30" Type="int" Value="3"
        ( FullComment Pointer="0x2241c175140" SrcRange="line:49:5, line:50:14"
          ( ParagraphComment Pointer="0x2241c175118" SrcRange="line:49:5, line:50:14"
            ( TextComment Pointer="0x2241c1750c8" SrcRange="line:49:5, col:66" Text=" The function detected that the arguments violate the function"
            ( TextComment Pointer="0x2241c1750e8" SrcRange="line:50:5, col:14" Text=" contract."
      ) ) ) )
      ( EnumConstantDecl Pointer="0x2241c0a2798" SrcRange="line:57:3, col:26"  SrcLoc="col:3" Name="CXError_ASTReadError" Type="enum CXErrorCode"
        ( IntegerLiteral Pointer="0x2241c0a2770" SrcRange="col:26" Type="int" Value="4"
        ( FullComment Pointer="0x2241c1751f0" SrcRange="line:55:5, col:47"
          ( ParagraphComment Pointer="0x2241c1751c8" SrcRange="col:5, col:47"
            ( TextComment Pointer="0x2241c1751a0" SrcRange="col:5, col:47" Text=" An AST deserialization error has occurred."
  ) ) ) ) ) )

Template for pattern matching

Patterns are themselves nested parentheses expressions. They can match any node in the tree in a DFS in-order traverse and multiple times. However, once a pattern matches a particular node, no other patterns match for that node or any nodes contained in that sub-tree. For further discussion, please refer to the EnumDecl example shown below.

Patterns generally specify a type of node to match, e.g., EnumDecl. However, there are times when you want to match any type. You can either leave it blank or use an asterisk. In any case, this sets up a context for matching any children of the node. Any sub-nodes in the pattern cannot match without the parent matching. Further, any children that do not match causes the entire pattern to not match.

A nested parentheses expression that begins with (* and ends with *) is for matching a subtree at any depth in the tree with the given context. With the enums pattern, (* EnumDecl …. *) matches an EnumDecl node at any depth within the selected node with SrcRange.

A nested parentheses expression which begins with (% and ends with %) is a grouping. It works just like that in regular expressions. In this example, it’s used in combination with the OR-operator (|) and the STAR-operator (*). Operators + and * are interpreted just as with regular expressions: + denotes one or more occurrences of a nested parentheses expression; * denotes zero or more occurrences of a nested parentheses expression. Intermingled with a nested parentheses expression attributes and children is template code, which begins with { and ends with }. It indicates the output generated for an AST node match and must be valid C# code. More on this later.

   ( SrcRange=".*\\clang-c\\.*"
      (* EnumDecl
             vars["first"] = true;
             result.Append("enum " + tree.Peek(0).Attr("Name") + " " + "\u007B" + Environment.NewLine);
            ( EnumConstantDecl
               ( IntegerLiteral
                     if ((bool)vars["first"])
                        vars["first"] = false;
                        result.Append(", ");
                     var tt = tree.Peek(1);
                     var na = tt.Attr("Name");
                     var t2 = tree.Peek(0);
                     var va = t2.Attr("Value");
                     result.Append(tree.Peek(1).Attr("Name") + " = " + tree.Peek(0).Attr("Value") + Environment.NewLine);
            ( EnumConstantDecl
                  if ((bool)vars["first"])
                     vars["first"] = false;
                     result.Append(", ");
                  result.Append(tree.Peek(0).Attr("Name") + Environment.NewLine);
            result.Append("\u007D");  // Closing curly.

The expression ( SrcRange=”.*\\clang-c\\.*” ) matches all nodes with any type, with SrcRange containing “clang-c” in the file name. Values in a pattern for an attribute can be any valid C# Regex.

Template matching engine

The grammar for a nested parenthesized AST expression results in one tree; the grammar for template pattern matching results in another tree. The engine that matches template with the AST is a recursive predictive parser. Matching proceeds in a DFS in-order traversal. When a pattern matches, the engine sets the variable Intercept<IParseTree, IParseTree> matches mapping an AST match with a node in the pattern. The engine also computes the variable Dictionary<IParseTree, IParseTree> parent as it is needed in the code output blocks, and the Antlr parser does not record this property.

Code output

In the template, three variables are passed to the code for the purpose of side effects: vars, result, and tree.

The variable vars is a C# Dictionary<String, object>. You can set values in one code section and pass it onto another code section in the template. Be warned, a get of the key must have a value, otherwise, the code will throw an exception.

The variable result is a C# StringBuffer. Use this to append output to the translation. For example, in the EnumDecl pattern, we output enum followed by the name of the type, then an open parenthesis and newline.

The variable tree is a “dom-like” object for referring to the AST matched. The type of the variable is Tree, which you can find here. Method Peek() returns a Tree relative to the current tree node. So, tree.Peek(0) denotes the current tree; tree.Peek(1) denotes the parent of the current node; etc. Attr() gets the value of the named attribute. If there is no attribute with the name, it returns the empty string.

Currently, there is a problem with code blocks that contain open and closing curly brackets. Unfortunately, the Antlr parser I wrote doesn’t correctly ignore curly brackets in strings. So, this is why you see the Unicode values used in the EnumDecl template.

Output engine

Output for the tool is generated using a non-recursive implementation for a DFS in-order tree walker of the AST. When an AST node is found that maps to a matched pattern node, output nodes of the pattern are interleaved with the children of the AST node. As the nodes are visited, output is generated for the code blocks. The implementation is here.


Output from the tool is the accumulated output from the StringBuffer generated by the code blocks that match the pattern.

In this example, the output is

enum CXErrorCode {
CXError_Success = 0
, CXError_Failure = 1
, CXError_Crashed = 2
, CXError_InvalidArguments = 3
, CXError_ASTReadError = 4

What are the problems I discovered in writing Piggy?

By no means is Piggy finished, let alone available for general use. There are still a number of problems to resolve.

  • When does an AST node “match” a pattern? And, when does matching stop?
  • How do patterns with “*’ and “+” operators work?
  • Should patterns allow references to “Parent”?
  • I’m having problems with curly brackets in strings in code blocks. How do I write an Antlr grammar that balances curly brackets, yet ignores one contained in C# string literals?

Stay tuned for further developments.

–Ken Domino


Posted in Tip