Parsing with Augmented Transition Networks and computing lookahead

The next step in the development of my LSP server for Antlr involves support for code completion. There is an API written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a bug in some situations. So, I’ll now describe the algorithms I wrote to compute the lookahead set for Antlr parsers. I’ll start first with a few definitions, then go over the three algorithms that do the computation. An implementation of this is available in C# in the Antlrvsix extension for code completion and another is provided by default in the Antlr Net Core templates I wrote for syntax error messages.

What is the lookahead set, and what is available currently to compute it?

The lookahead set is the set of token types that could be inserted at an index in the text that is being parsed. If you know this set, you can compute code completions in a text editor for the grammar or output meaningful error messages in a parse error.

Let’s take a simple Antlr grammar, the well-known Expr example.

grammar Expr;

expression: assignment | simpleExpression;

    : (VAR | LET) ID EQUAL simpleExpression

    : simpleExpression (PLUS | MINUS) simpleExpression
    | simpleExpression (MULTIPLY | DIVIDE) simpleExpression
    | variableRef
    | functionRef

    : ID


// Insert here...

VAR: [vV] [aA] [rR];
LET: [lL] [eE] [tT];

PLUS: '+';
MINUS: '-';
DIVIDE: '/';
EQUAL: '=';
OPEN_PAR: '(';
ID: [a-zA-Z] [a-zA-Z0-9_]*;
WS: [ \n\r\t] -> channel(HIDDEN);

Now consider what could be added at the line “// Insert here…”. We could add:

  • a new parser rule (in git), which starts with a RULE_REF;
  • a new lexer rule (in git), which starts with a TOKEN_REF;
  • a new parser rule with a leading DOC_COMMENT (in git);
  • a catch clause (in git), which starts with the keyword ‘catch’;
  • a finally clause (in git), which starts with the keyword ‘finally’;

There are, of course, other possibilities. If you try to place a “.” at the point of insertion, the Antlr Tool reports the following error message:

error(50): C:\Users\kenne\Documents\expr\E6.g4:26:0: syntax error: ‘.’ came as a complete surprise to me

This is fine, but it doesn’t give you an idea of what you should be adding at the insertion point. If you write a parser using the Antlr grammars itself and use the default error reporter in the Antlr runtime, you get these error messages:

line 26:0 mismatched input ‘.’ expecting {, ‘mode’}

This is a little weird because there is no symbol before the comma in the set described in the error message. Beyond the fact that you cannot use “modes” in a unified parser/lexer grammar, ‘mode’ is possible in a lexer grammar (lexer and parser grammars). To its credit, the documentation for the default error reporter in the Antlr runtime does note that this routine does not compute error messages correctly.

However, Lischke’s Code Completion Core should report the range of possibilities I describe above. But it instead reports that it expects input of ‘-2’ (epsilon), ‘catch’, and ‘finally’, which is only partly true.

I now describe a way to compute the lookahead.

Augmented Transition Network

An Augmented Transition Network (ATN) is defined as follows. Given predicated grammar G = (N, T, P, S, Π, M),

  • N is the set of nonterminals (rule names)
  • T is the set of terminals (tokens)
  • P is the set of productions
  • S ∈ N is the start symbol
  • Π is a set of side-effect-free semantic predicates
  • M is a set of actions (mutators)

the corresponding ATN MG = (Q, Σ, ∆, E, F) has elements:

  • Q is the set of states
  • Σ is the transition alphabet N ∪ T ∪ Π ∪ M
  • ∆ is the transition relation mapping Q × (Σ ∪ ε) → Q, denoted by the triple <qfrom, σ, qto>
  • E = {pA | A ∈ N} is the set of submachine entry states
  • F = {p’A | A ∈ N} is the set of submachine final states

An edge is a triple <u, σ , v> ∈ ∆. For an input string t0 t1 t2 … tn, a path is a sequence of edges [ <s0, σ0, s1> < s1, σ1, s2> < s2, σ2, s3> <s3, σ3, s4 > … ] where σ0 σ1 σ2 … = t0 t1 t2 … tn, and where the final edge of the path is not an epsilon transition. The input does not have to be a complete string for the grammar, so it is possible for the last state in the path to be a non-final state.

Parsing in an ATN

The ATN parse is computed using the parse() function described below. It requires the input string, the current position to consider in the input, and the edge that matched the last input. For a lack of information, I implemented a basic backtracking parser. The function calls itself recursively for each transition that it considers. The result of the parse() function is a set of paths that match the input from this point in the input and state, or null if there is no match. Each path in the returned set is an acceptable path through the ATN that matches the input string from the current position in the input. The parser consumes the entire input string, and if the input is unacceptable, null is returned.

// The parse of an input is a set of paths, where each path is a parse that consumes
// the entire input. More than one parse is possible. If no parse is valid, empty set is returned.
// Inputs:
//   input: sequence of tokens input, indexible via []-operator, zero based.
//   cursor: int, an index into token input sequence.
//   d: edge or null. Note, d = <r, σ, s> or null. If edge, the triple is 
//      the edge goes from state r to state s via σ.
//   V: set of <int, state> pair. It is used to note whether we have visited
//      a state with a given index in the input token sequence.
// Output: a set of sequence of edge
parse(input: seq of edge, cursor: int, d: edge <r, σ, s> or null, V: set of <int, state> pairs)
	Let state = s
	if d = null
		state ← entry state of rule being recognized // initial condition.
	if cursor >= input.Size
        return { d }
    Let i ← input[cursor]
	if <cursor, state> ∈ V
		return {}
	V ← V + <cursor, state>
	M’ ← {}
    for all edges e = <state, w, t> // i.e., all edges from state "state".
		if w = ε
			M ← parse(input, cursor, e, V)
		else if w = input[cursor]
			M ← parse(input, cursor+1, <state, input[cursor], t>, V)
		for each m in M
			m ← d ∙ m
			M’ ← M’ + m
	If M’ = {}
		return {}
	return M’

Closure of an ATN

After computing an ATN parse, parses are represented as a set of paths. Each path terminates in a final state. At this point, we are interested in what valid token could occur. The closure() function computes the set of states that are reachable via an ε-transition in the ATN. If the final state is a stop-state of the ATN, it’s possible that the parse could continue from a “follow” state of a rule transition in the path.

// The closure set of a state is the set of states that can be reached via epsilon transitions.
// For grammar symbol transitions, the closure includes submachine states. If the submachine
// closure includes the stop state of the submachine, the closure includes the closure of the
// follow state of the transition.
// Input: state p.
// Output: set of states V.
    let V ← {} 	-- V is a set of states
    let S ← {}	-- S is a stack of states
    Push(S, p)
    while S ≠ {}
        let s ← Pop(S)
        if s ∈ V continue	-- If state s was visited before, continue.
        V ← V + s
        For all edges <s, σ, t> from state s
	        If σ ∈ N		-- If transition is via grammar rule …
		        Let W ← closure(pσ)	-- Compute closure of submachine.
		        Let X ← {}
                For all w ∈ W
                    If w = p’σ	-- If state w is submachine σ stop state …
					    Let W2 ← closure(t)
					    X ← X + W2
		        For all w ∈ W + X
                    V ← V + w
	        else if σ = ε		-- If transition δ is via epsilon …
			    Push(S, t)
    return V

Computation of lookahead sets

Finally, we now define the function LASet() to compute the lookahead set for a particular closure of a path. The idea here is to find all non-ε-transitions in the ATN from the closure() of the last state of a path. If the closure includes the final state of the submachine, then go to the “calling atn submachine”, and compute the closure of the follow-state of the rule transition. The reason why to use the closure of the calling submachine is because the called submachine can return from a final state and continue the parse from the follow state of a rule transition. For an Antlr grammar, after the semicolon, a catch or finally clause could be parsed. But, because the catch and finally rules could be empty, the rule could end in the final state for the rule submachine.

	Let reverse ← Reverse(path)
	Let result_states ← { }
	Let current_state = First(reverse).to
		Let cl ← closure(current_state)
		Let do_continue ← false
		Let atn = current_state.atn
		Let start_state = atn.start_state
		Let stop_state = atn.stop_state
		For each c in cl
			result_states ← result_states + c
			If c == stop_state
				do_continue ← true
		if ! do_continue
			reverse ← Rest(reverse)
			if reverse.Count = 0
			Let last ← First(reverse)
			If last.from = start_state
			    reverse ← Rest(reverse) // Remove edge to submachine
			    current_state = First(reverse).follow // continue with follow state of rule transition
	Let result ← select from result_states all non-ε transition symbols
	Return result
Posted in Tip