Over the last few months, I’ve been updating several of the operations in Trash to keep text, tokens, intervals/indices, and parse trees consisten and synchronized after edits. This is necessary because editing operations should be transactional, without having to reparse. In fact, many operations would make the text unparsible with the same grammar, so recomputing a new tree from the altered input would likely be out of the question.

Suppose we take the arithmetic grammar and the input a = (((-12 + 1) + 7) + 7) / u. The parse tree would be:

Each leaf node in the parse tree is a TerminalNodeImpl, which contains a pointer to the token, which would have a constant position in the token stream. For example, the a in the token stream would be at position 0, the = would be at position 1, and the first ( would be at position 2. Each token has a start and stop index in the character stream buffer. So, the a in the char stream would be at position 0, the = would be at position 2, and the ( would be at position 4 in the character stream. Under this scheme, when the token stream is altered, all the indices in both the character and token streams need to be updated to reflect the new positions. Associate with all parse nodes is an Interval, in which the start and stop values are the minimum and maximum token indices for the frontier of tokens.

The problem is that if we want to perform multiple edits to the parse tree, e.g., (1) replace + 7 with / 33, then (2) remove useless parentheses in sub-expressions, then for each edit we would need to change the text of the CharStream, the Tokens of the TokenStream rewritten, and the TerminalNodeImpl for the useless parentheses removed. For each change, the parse trees must be maintained consistent with the altered text in order to replace the correct characters all the way down into the CharStream, and in order to support a consistent rewrite over several operations.

Therefore, this representation is very slow–not to mention very tedious–to update the positional data of the tokens after each alteration. If one only uses a token stream rewriter, searching the intermediate alterations for changes would be difficult (e.g., when simplifying expressions through partial evaluation).

The parse tree representation in Trash was rewritten to avoid this problem. A parse tree now represents all token and char stream data on a parse tree node. Further, after parsing, the Antlr parse tree is converted to a representation that is now sub-classed from classes in the DOM object model used in the XPath implementation. This saves on the conversion overhead between tools, which was done previously.

A Big Performance Increase

Here is an example that scrapes the Python3 grammar from a grammar spec, and performs a few manipulations to bring it closer to Antlr4 format.

date
trparse ../examples/python-11.gram \
	| trxgrep ' //rule_' \
	| trdelete ' //attribute' \
	| trdelete ' //action' \
	| trdelete ' //attribute_name' \
	| trdelete ' //more_alts[preceding-sibling::indent/preceding-sibling::newline/preceding-sibling::COLON]/VBAR' \
	| trtext > new-python-11.gram
date

Although it is anecdotal evidence, the run time using the old Antlr4 parse tree representation is 1m 6s. With the new parse tree representation, it is now 5s, of which 2.6s for the initial parse alone.

Conclusions

Antlr has a long history, and the design has evolved with changes in requirements. The current representation for a Token is from Antlr3, where we find positional information for a token to the char stream.

Antlr’s representation of a parse tree places constraints on what can and cannot be done efficiently. The tree representation containing positional information into buffers is not a good representation for tree rewriting.

All this may be obvious to you. But, it was not for me. And, one should not judge that Antlr–or any parser generator–is poorly implemented, yet another reason why we should all write parsers from scratch by hand. One can point out hundreds of poorly implemented data structures in all software.

–Ken