Rewriting the pattern matching engine — part 2

Patterns in Piggy are regular expressions of parenthesized expressions that represent trees. The conversion of the regular expressions to a finite state automaton is easy (via Thompson’s Construction), but the resulting automaton for the pattern may not work for an important reason: patterns that specify attributes and children of an AST node have an implicit “…” between each attribute or child pattern. While it’s possible to introduce additional states to recognize “…” between each attribute or child node in the pattern, the resulting automaton is ambiguous. For every input AST node, any attribute or child AST node can mismatch a “sub-pattern”, but can still match the complete pattern.

To illustrate this issue, consider the following AST for an EnumDecl node, and the pattern “( EnumDecl ( EnumConstantDecl ) )”. The EnumDecl contains a FullComment child node and attributes for SrcLoc, Name, and Type.

( EnumDecl SrcLoc="..\src\environ.h:286:1"
( FullComment
( ParagraphComment
( TextComment Text=" Simple search state variables "
) ) )
( EnumConstantDecl Name="L_NOT_FOUND" Type="int"
( IntegerLiteral Type="int" Value="0"
) )
( EnumConstantDecl Name="L_FOUND" Type="int"
( IntegerLiteral Type="int" Value="1"
) ) )

This pattern should match this EnumDecl AST node, but it can’t if we use an automaton that looks for literally “(” “EnumDecl” “(” “EnumConstantDecl” “)” “)” in the token input stream. In fact, the automaton would fault on the input token “SrcLoc” because it would be expecting a “(” token. Even after removing the “SrcLoc=….” the automaton would still fail because it would be looking for “EnumConstantDecl” after recognizing “(“, finding instead “FullComment”.

In order to perform a lookahead and skip any attribute or child node not specified in the pattern, I extended the automaton with “subpatterns”, which are separate finite state automatons of the pattern used to match attributes or child nodes in a pattern.

So, for “( EnumDecl ( EnumConstantDecl ) )”, an NFA is first constructed for the regular expression pattern. In this automaton, edges can be an input symbol, or a subpattern.

A DFA-like automaton is constructed for the NFA using the usual epsilon-closure algorithm.

When parsing the input, the sub-automaton is used to recognize an attribute or child node. If it does not match, the “Any” sub-automaton is used to recognized the input and the top-level automaton recognition restarted with the next input attribute or child node.

Posted in Tip

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.