Tree transformations via XPath and S-expressions

I’ve finally have the right tools to now implement transforms over an Antlr parse tree.

The first part of a transform is identifying what nodes in the tree that are going to be replaced. It turns out that the best tool to do that is an XPath engine, which I’ve rewritten in C# from Java over the last month.

The other part of a transform–and far simpler to implement–is a way to express a tree that is created and spliced into an existing parse tree. Here, the work I did in Piggy for S-expressions comes in handy. For example, the expression “( ruleAltList ( labeledAlt ( alternative { child } )))” identifies the Antlr parse tree to create, and splices in a “child” node into a created “alternative” node. Note, this is a slightly different usage for “S-expressions” that you may be used to, in which a node is an unnamed pair, but conveys the same purpose.

With XPath and S-expressions, I can now rewrite all the transforms that I hardcoded in C# for Antlrvsix. The code implementing both parts is here, but I will be forking this code and placing it under in the Antlrvsix source until I see a need to place this code in a separate Nuget package.

At this moment, I’m not exactly sure what language and control structures to add on top of XPath and S-expressions. For now, these two tools should suffice along with C# to glue the pieces together in order to modify an Antlr parse tree. The other consideration is having intermediate results of an XPath expression. For example, I may want to get all ruleAltList’s but continue down the tree for a particular child. I’ve fixed a bug in the Eclipse XPath library to allow an intermediate result to be used as context for another XPath expression. But I might consider extending XPath to bind the results of an intermediate result into a C# variable.

One other note–Why am I not looking at term rewriting systems? I have. The problem is that they are not practical for two reasons: (1) integrating it with Antlr parse trees would not be easy; (2) most do not express manipulations directly on a tree. XSLT is one example. Here, the language isn’t specifying tree rewrites, but the construction of an entirely new tree. I also looked at TXL. Here, the language isn’t about trees, but term rewrites in the target language. I would need to convert the Antlr grammar into TXL grammar syntax. In all these systems, I would need to fit in Antlr parse trees into the framework. Again, all I want is to manipulate trees.

What of Piggy? Unfortunately, Piggy is not a tree editing library. While it recognizes tree nodes, the problem is that it then executes user code that performs an output on a DFS traversal of the tree. Again, all I want is to manipulate an existing tree, then do something later with that tree.

–Ken

Posted in Tip

Converting Bison precedence and associativity rules to Antlr

As noted in the Bison guide:

  • The associativity of an operator op determines how repeated uses of the operator nest: whether ‘x op y op z’ is parsed by grouping x with y first or by grouping y with z first. %left specifies left-associativity (grouping x with y first) and %right specifies right-associativity (grouping y with z first). %nonassoc specifies no associativity, which means that ‘x op y op z’ is considered a syntax error.%precedence gives only precedence to the symbols, and defines no associativity at all. Use this to define precedence only, and leave any potential conflict due to associativity enabled.
  • The precedence of an operator determines how it nests with other operators. All the tokens declared in a single precedence declaration have equal precedence and nest together according to their associativity. When two tokens declared in different precedence declarations associate, the one declared later has the higher precedence and is grouped first.

Associativity

By default, operators in Antlr are left-associative. So, %left associative operators do not require any additional declarations in Antlr.

%left op

=>

e : e op e

An operator declared as %right associative will need to be tagged in Antlr with <assoc=right>.

%right op

=>

e : <assoc=right> e op e

Antlr does not have a %nonassoc declaration. To disallow a non-associative operator in Antlr, a semantic predicate must be used.

%nonassoc op

=>

e : e op e { Input.LA(1) != op }?

Precedence

In Antlr, to make op1 higher precedence than op2, one will need to define the alternative containing op1 before the alternative for op2. In other words, the order of the alternatives of the operators in Antlr is the same as the order of appearance in the Bison grammar.

%left op1
%left op2

=>

e : e op1 e | e op2 e

What are precedence and associativity bound to?

As noted here: “A precedence and associativity is associated with each grammar rule. It is the precedence and associativity of the final token or literal in the body of the rule. If the %prec construction is used, it overrides this default value. Some grammar rules may have no precedence and associativity associated with them.”

Even though precedence and associativity are specified with an operator, the precedence and associativity are associated with a rule as well. Note, this makes sense from an implementation reference: when you see a shift/shift or shift/reduce conflict, you are deciding what to place in the parsing table for the action based on the rule and the lookahead that you see. Crucially, the %prec, %right, %left, %nonassoc clauses can appear within a rule, which overrides any clauses specified elsewhere.

–Ken

Posted in Tip

Finding direct left recursion in an Antlr grammar via XPath

“//parserRuleSpec[RULE_REF/text() = ruleBlock/ruleAltList/labeledAlt/alternative/[name()=’element’][1]/atom/ruleref/[1]/text()]”

This XPath expression is the first important rule I wrote for Antlr grammars, which finds all rules that have direct left recursion. Finding direct left recursion is an important step for removing indirect left recursion.

My efforts for getting XPath working with Antlr are starting to finally pay off.

–Ken

Posted in Tip

XPath 3.1 engine with Antlr parse trees — Update

It’s taken a few weeks, but the daily grind has resulted in a new grammar in Antlr for XPath version 3.1, and a translation of the Eclipse engine to search the Antlr parse trees (code). This will come in handy to partially replace the hardwired code in C# in Antlrvsix to perform grammar refactorings. There is still much work to be done:

  • Polishing the XPath engine to work well.
  • Create an expression language to tell the transformation system what nodes found in the parse tree (using the XPath engine) how to rewrite the tree nodes.
  • Rewrite the refactorings in Antlrvsix in this new language and engine.

–Ken, June 24, 2020

Posted in Tip

XPath and Piggy

I’ve made another release of Antlrvsix, version 7.3, last week and I’m almost ready to release 7.4 with a few more fixes. The changes I’ve been making since last year in August have greatly enhanced what Antlrvsix can do over version 1. But, there is more work to be done. I’m now working on updating and integrating Piggy into the extension.

Part of the problem with Piggy was that I lacked a clear vision of what I was trying to solve. I now know what I’m looking to solve because I have worked on the analysis and transformations for Antlr. And, I can see there are actually two levels to this system: (a) a high-level language, that works with snippets in the language that is going to be modified, and (b) a low-level language, that works with tree patterns and routines to manipulate those trees. This was what the authors of Coccinelle did, and it is the right idea because it’s hard writing tree patterns in a language like XPath, which is just a language that specifies a collection of tree nodes.

I will first start writing an interface for XPath because all the transforms use a basic tree node find and replace. I will first try porting Eclipse XPath written in Java to C# and see where that takes me. The port will be here.

–Ken

Posted in Tip

The next step

What is the next step in Antlrvsix to help improve a grammar? To answer that, I decided to go back and take a look at where this all began, with the comparison of the Java grammars in Antlr.

Java9 is derived from version 9 of the Java spec. Parr’s Java is derived from an unknown, earlier version of the spec, but it is much faster in parsing that the other grammar.

How can one compare these grammars? A line-by-line diff of the files could be done, but it is a terrible way to compare the grammars because they are very different: one is a partitioned grammar, the other a combined grammar; the formatting is different; the rules in each grammar are ordered differently; and, the rules differ considerably between the grammars. In order to compare the grammars, a much smarter diff is needed.

What I’m planning is to do a rule-by-rule comparison, working with rules in a DFS order from the start rule. I’ll need to know how the rules differ and why. Then, using Antlrvsix, apply a transformation to make the rules the same, and continue the diff with the next rule in the DFS ordering.

So, I first applied a DFS reorder of the grammars using the Antlrvsix extension, then a reformat (it turns out Codebuff has some bugs). Starting with the start rules, compilationUnit, the difference has already started: Parr decided to unfold ordinaryCompilation and remove the rule.

I could edit the grammar manually for the rule, but I decided to save time and implement the fold transformation right now. And, in order to make this reproducible, I want Antlrvsix to take a sequence of transformations from a file and apply them.

–Ken

Posted in Tip

Version 7 of Antlrvsix released

I’ve made a release of Antlrvsix yesterday which includes recursion removal for Antlr grammars. There are two types of recursion that I can work with: direct (aka immediate) and indirect. The algorithm that I use for direct and indirect recursion removal are the ones that Aho et al. (2006) describe. In addition, the extension provides a transform to rewrite direct recursion (left or right) into Kleene operator form, which helps clean up the grammar considerably. These transformations help fix grammar problems that you may run across. But, to use these transformations, you have to select the rule that contains or participates in the recursion.

In addition, the extension now provides an explicit mapping for colorizing the grammar. The problem with LSP and VS2019 is that neither supports languages like Antlr. So, language features like nonterminals and terminals are mapped to client-specific data types that have display properties. This mapping can be modified in the options file for Antlrvsix. In addition, I now tag almost all of the grammar.

What’s next up for Antlrvsix? The main problem I wanted to solve is the performance of the Java9 parser, which performs terribly for certain input. The issue seems to involve left factoring, but it’s not clear. I’ll now be working on this for the next month. I was hoping to add Piggy into the extension, but that will have to wait.

–Ken

Posted in Tip

Forgetting Flex for now

After a bit of fiddling around with what I thought would be a lexical analyzer for Flex, I now realize that the lexer isn’t defined by scan.l, but a combination of one generated from flex.skl, a file that is processed by m4, and the scanner generated by scan.l. Originally, flex.skl included some switches for C or C++, but it’s now m4 dependent. I’m not sure why #ifdefs wouldn’t have sufficed; it would have been far more portable. Further, after carefully inspecting the parser grammar, I now realize it is missing a ton of things from an input file because the lexer yanks so much information out of the input stream and never passes it onto the parser. For example, most of the options in a Flex file are simply swallowed by the scanner, only the ones with ‘=’ passed to the parser. (Bullshit.) Code blocks within %{ … %} or %top{ … } are swallowed up by the lexer, and never mentioned in the parser grammar. So, in order to really write an import of Flex, the lexer and scanner would have to be written from scratch. (Originally, I was toying with doing just that, after successfully converting Bison’s parse.y to Antlr, but chose to start with the Flex sources because I thought the grammars were complete. They just were complete garbage from the standpoint of an import.) Given the number of problems, I’m going to pass on Flex import for now. I will probably revisit Flex import but at a later date. For now, I’m ODed on a fruitless translation.

Fortunately, I don’t really need to implement any of this for Antlrvsix right now. I’m going to continue with the rest of the “to do’s” I have slated.

Posted in Tip

Flex vs Bison import

I’ve written a basic Bison import for Antlrvsix, and now I’m in the guts of writing a Flex import. While writing the Bison import, I soon realized a problem that I could not deal with immediately. Many people write Bison code that is very undisciplined. Often the code contains driver code (calc.y from the official Gnu repository), The static semantics is often wrapped up with the tree construction. While this is ok, this means the grammar contains target-language code, which is not a great way to write grammars. If you really want an abstract syntax tree instead of a parse tree for semantic checks, you should discipline the semantic checks outside of tree construction. For now, I do not copy code blocks to the generated Antlr grammar.

For Flex, the problem is that the embedded code exclusively performs semantic checks at the lexer phase. So this code must be included in a generated Antlr lexer grammar. The good news is that it’s not too difficult to convert the patterns in Flex into Antlr lexer rules and modes. You can then convert the code blocks in a Flex rule into your target language. Whether this works or not, I will find out–I’m translating line-by-line the Flex lexical grammar into an Antlr lexer grammar, and so far, it looks good.

Aside, I’ve been using Antlrvsix quite a bit myself to write the Flex grammars. I added a refactoring to sort lexer modes alphabetically since the original Flex “scan.l” code defined the start states in a haphazard order. And, since I’ve switch to the “Dark mode” theme in Visual Studio 2019, I added to the options color selections for grammar symbols so I can see the symbol text more clearly.

–Ken

Posted in Tip

Status of transforms for Antlr

I’ve been adding transform/refactoring operations to Antlrvsix for the last two weeks. So far, I have the following: (1) replace parser string literals; (2) remove useless parser rules; (3) move parser start rules to top; (4) reorder parser rules alphabetically; (5) reorder parser rules vis BFS; (6) reorder parser rules DFS; (7) split combined grammars; (8) merge split grammars. I’ll soon start the really important transforms (rewrite rules to remove excessive factoring, left recursion, etc). I will also add in a Flex/Bison import tool to convert these grammars to Antlr. I’m making a release of Antlrvsix v5.5 with these changes, but much more is coming.

–Ken

Posted in Tip