Scraping the C++ grammar from the PDF Spec

This post is a draft. –Ken

This post describes some of my experiences in scraping the C++ grammar directly from the specification for C++ in PDF format.

“Where is the Spec?”

The spec can be found in the GitHub repo https://github.com/cplusplus/draft.git. The version I am working with is N4296.

Hand-scraping the text

If you are up for a challenge and have enough time, you can open the Spec in Adobe, and hand-transcribe Appendix A. But, most people can’t afford much time in doing so.

Instead, most will open the PDF in some program, select the text of Appendix A, copy and paste the text to a text file. Unfortunately, the problem is the quality of the OCRed text, and the way subscripted text is converted on a copy and paste.

For example, Adobe Reader is probably the best off-the-shelf tool. Another is the viewer in Google Chrome. A third is Microsoft Word. For a programming API to read PDF files, iText is available for C# and Java. But, Adobe has its own API. However, requiring a person to navigate open the PDF, navigate to Annex A, select, and copy then paste is time consuming and opens up some ambiguity as to which program one is using.

The quality of the text received in a copy/paste depends on the contents of the PDF. For example, the iText library returns in most of the time “identi?ier” instead of “identifier”. And, it often returns run-ins like ‘opt)‘, ‘opt=‘, ‘opt:‘.

Adobe Acrobat

new-expression:
:: optnew new-placementopt new-type-id new-initializeropt
:: optnew new-placementopt ( type-id ) new-initializeropt

Google Chrome

new-expression:
::optnew new-placementopt new-type-id new-initializeropt
::optnew new-placementopt( type-id ) new-initializeropt

iText

new-expression:
:: optnew new-placement opt new-type-id new-initializer opt
:: optnew new-placement opt ( type-id ) new-initializer opt

If you don’t specify which program you use, you can’t easily automate the correction process. If you then default to what programmers do normally, you correct the grammar by hand–which then means you can only do one or two versions of the spec. Some of the corrections are easily programmable, like optnew really should be opt new. Or should it? What if the spec actually contained an optnew non-terminal?

In order to minimize handling different run-ins, you have to choose one tool, and program against that. After hand-checking of the scraped grammar, you can program against the specific run-ins and correct them, counting the number of corrections you expect.

A different problem in copy/paste, which seems specific for iText, are spurious newlines. Let’s take the text result from iText copy/paste of the type-parameter rule in Annex A:

type-parameter:
type-parameter-key … opt identifier opt
type-parameter-key identifier opt= type-id
template < template-parameter-list > type-parameter-key …
opt
identifier
opt
template < template-parameter-list > type-parameter-key identifier opt= id-expression

The problem is that it should instead be

type-parameter:
type-parameter-key … opt identifier opt
type-parameter-key identifier opt = type-id
template < template-parameter-list > type-parameter-key … opt identifier opt
template < template-parameter-list > type-parameter-key identifier opt = id-expression

It is difficult to decide automatically what is correct or not. The scraped grammar must be tested and corrected, by hand or by program that recognizes this specific text.

Problems with an unmodified scraped grammar

After correcting all the problems with run-ins and spurious newlines in the text, one has to convert rules of the grammar to a legal EBNF.

Some rules in the Spec are in prose:

q-char:
any member of the source character set except new-line and "

For raw-string, the end delimiter must be handled with a semantic predicate.

r-char:
any member of the source character set, except
a right parenthesis ) followed by the initial d-char-sequence
(which may be empty) followed by a double quote ".

(The current Cpp grammar does not have the semantic predicate.)

Annex A of the Spec does not define the boundary between the lexer and parser very well.

literal:
integer-literal
character-literal
floating-literal
string-literal
boolean-literal
pointer-literal
user-defined-literal

Is literal itself a token from the lexer, or is literal a parser non-terminal? We know from the Antlr4 grammars-v4 cpp grammar where to draw the line, but it hasn’t been done for the preprocessor rules at all.

Optional left recursive rules are not handled by Antlr4

Some rules have to be modified because they are illegal in Antlr4. While left recursion is handled in parser rules in Antlr4, it is flagged as error easily with a few alterations such as parentheses or the ?-operator.

noptr_abstract_declarator : noptr_abstract_declarator ? parameters_and_qualifiers | noptr_abstract_declarator ? '[' constant_expression ? ']' attribute_specifier_seq ? | '(' ptr_abstract_declarator ')' ;

Instead, this must be written as

noptr_abstract_declarator : noptr_abstract_declarator parameters_and_qualifiers | noptr_abstract_declarator '[' constant_expression ? ']' attribute_specifier_seq ? | '(' ptr_abstract_declarator ')' | parameters_and_qualifiers | '[' constant_expression ? ']' attribute_specifier_seq ? ;

A similar problem occurs for attribute_specifier_seq.

Left recursion in lexer rules is not handled by Antlr4

But, after rewriting a rule to be lexer-specific, the rule must be rewritten because Antlr4 does not handle left-recursion in a lexer rule.

error(119): c_plus_plus_spec_draft.g4::: The following sets of rules are mutually left-recursive [Hexadecimal_escape_sequence] and [C_char_sequence] and [Digit_sequence]

Hexadecimal_escape_sequence : '\x' Hexadecimal_digit | Hexadecimal_escape_sequence Hexadecimal_digit ;

C_char_sequence : C_char | C_char_sequence C_char ;

Digit_sequence : Digit | Digit_sequence '\'' ? Digit ;

Antlr does not parse rules renamed to lexer names if RHS contains any parser names

Consider a: b ‘C’; b : ‘B’;

If I want to rename a to A, the resulting grammar will not compile. Renaming is order-dependent. The problem here is that the Antlr grammar differentiates between lexer and parser symbols at a lexer level, and that lexer symbols cannot contain parser symbols even if erroneous. This prevents editing the grammar using tools like Trash that use the official Antl4 grammar: RULE_REF symbols cannot be used on the RHS of of lexer rule since it won’t parse!

Reordering rule alts

This change affected how something is parsed.

Non-greedy decl-specifier-seq

See this change. declSpecifierSeq: declSpecifier+? attributeSpecifierSeq?;

Wrong parse tree after grouping refactoring

The statement rule was scraped from the specification and resulted in a rule with alts in one order, which happened to work. But, this rule was refactored with a grouping, and resulted in a misparse.

C++ and Preprocessor rules are shared

The start rule for the C++ parser is translation_unit, but for the preprocessor it is preprocessing_file. There are rules that are shared between the two start rules, but for Antlr, the lexer must be different. I can compensate by placing the rules for preprocessing in a separate mode, but most lexer rules have to be duplicated between the two modes.

Because they are shared, the lexer rules for the preprocessor need to return types that are identical to the C++ lexer.

C++ grammar is ambigous for ‘#’ non_directive

The grammar has an alt of group_part for ‘#’ non_directive. This alt slows the parser significantly. I have commented it out, but I could fix this rule with maybe an ‘~’-operator.

header_name is either a string literal with double quotes or angle brackets. It turns out that c++config.h in the GNU compiler has a constant_expression that passes an angle bracket to a function. The header_name needs to accept either as a String_literal.

Preprocessing

C++ is different from other languages in many ways. For C++, a preprocessor is part of the Spec. The Cpp grammar in the Antlr4 grammars repo handled preprocessing directives as though comments. But, some code in some of the GNU C++ headers require a preprocessor, otherwise many files fail to parse. The most basic, widely-used sequence that is not specified in the grammar is the backslash-newline sequence, which is specifically mentioned in Section 2.2, Paragraph 2 on how it is to be handled.

String concatenation is another, clearly defined by Spec, preprocessing issue. The grammar should not be changed to deal with this issue because it has introduced many issues in the past.

In the Cpp14.g4, several changes were made to handle this directly by the grammar, but it only added problems. First, primary-expression was changed:

primaryexpression : literal (literal)* | This | '(' expression ')' | idexpression | lambdaexpression ;

Later, it was changed again:

primaryExpression: literal+ | This | LeftParen expression RightParen | idExpression | lambdaExpression;

But, both “hacks” are wrong because it deviates from the spec And, it allows if (1 1) ... as valid input. The original hack was introduced here, and noted in this issue, and change here without explanation, with little review!

Some rules are incomplete. It says in the Spec that and can be used as an alternative to &&. But, logical-and-expression does not contain the alternative. (The Cpp14.g4 grammar was modified here to include the and alternative, but most of the other alternative operators in Table 6 of the Spec were not added.)

Adhoc, scantly-reviewed changes to Cpp14.g4

Over the years, there have been many changes to the grammar. These changes have been allowed for the most part with little review. And, consequently, problems have crept into the grammar.

‘/=’ removed from operators rule

When the grammar was split, the ‘/=’ operator was dropped from Spec rule operators (aka theOperator rule in the Antlr4 grammar). This was done here. You cannot split an Antlr4 grammar reliably by hand: you should use tools.

It was then dropped altogether here.

Duplicate GreaterAssign in theOperator rule added

In this change, GreaterAssign was added twice to the rule. It was subsequently removed here.

Regression of elaborated_type_specifier

In this change, elaborated_type_specifier was changed so that we can no longer parse “friend union” declarations.

union a {
    int b;
    float c;
};

class b
{
    friend union a;
    int d;
};

C++ Preprocessor

The preprocessor is complicated. It requires a bit of support code in the form of an Antlr visitor. The visitor interprets the parse tree from the preprocessor_file rule and converts it to the input for the main parser. The grammar uses C++ expressions, which can result in very large parse trees even for the simple example ‘#if 1‘, and requires a lot of methods in the visitor to be implemented.

The current implementation for the C++ grammar is here.

Changes to Cpp14.g4 over the years

Date Change How it fails
4 Jul 2015 Initial revision.  
1 Sep 2015 Fix ‘-‘ operator in ‘unaryoperators’ for Issue#212. This commit changed unaryoperator : ‘|’ | ‘*’ | ‘&’ | ‘+’ | ‘!’ | ‘~’ ; to unaryoperator : ‘|’ | ‘*’ | ‘&’ | ‘+’ | ‘!’ | ‘~’ | ‘-‘ ;, which added the ‘-‘ operator. While the commit fixes the ‘-‘ operator, the unary-operator rule still contains ‘|’. which is an error, from the initial revision, and it remains more or less to this day!
3 Oct 2015 Removed Java @header. This change removed a bogus @header from the grammar. @header makes the grammar target specific.  
3 Oct 2015 Fixed the pom.xml <packageName> attribute. Package names should be empty in order to be target-independent.  
29 Oct 2015 Issue#230. This issue is about whether the grammar handles preprocessing directives. It does not. In order to parse C++ files with preprocessor directives, the suggested solution is “cc -E t.c will get the actual C++ input :)”. The suggested solution uses the C preprocessor, not the C++ preprocessor. They two are not the same. See this explanation.
16 Dec 2015 Change ‘purespecifier’ rule; Issue#249. In the Spec, we have pure-specifier -> = 0. In the initial revision, this was implemented as  purespecifier : Assign val = Integerliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} because the string literal ‘0’ conflicted with Integerliteral. This commit modifies the rule to purespecifier : Assign val = Octalliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} in order to parse static member initializations. The commit breaks parsing pure virtual function declarations because Octalliteral never matches ‘0’: Octalliteral occurs after Integerliteral. The change has remained even though it is faulty.
24 Jun 2016 CPP14.g4 only compatible with Java ANTLR #415. The issue has never been fixed.
24 Jun 2016 “Issue with other Antlr targets and pure-specifier”; part of PR416, reported in #415. The commit only changes the comments in the grammar. The change does not fix target-independence.
14 Sep 2016 “Add lexer support for multiline macros in C++” with this hack. This commit adds the lexer rule MultiLineMacro : ‘#’ (.*? ‘\\’ [\r\n]+)+ ~[\r\n]+ -> skip;. Preprocessing is not really implemented by the grammar. But, in addition, multi-line macros are not ISO C++ 2014.
12 Nov 2016 CPP14.g4 bugfix – relational ops in template arguments. This change reorders alts from templateargument : constantexpression | typeid | idexpression; to templateargument : typeid | constantexpression | idexpression ;. in order to parse “void TestFunc(vector<ClassA>args, vector<ClassB> args2)” correctly.  
9 Dec 2016 “bugfix – multiline macro parse consumes all remaining input”. This commit changes MultiLineMacro : ‘#’ (.*? ‘\\’ [\r\n]+)+ ~[\r\n]+ -> skip; Directive : ‘#’ ~[\r\n]* -> skip; changed to MultiLineMacro : ‘#’ (~[\n]*? ‘\\’ ‘\r’? ‘\n’)+ ~[\n]+ -> channel(HIDDEN) ; Directive : ‘#’ ~[\n]* -> channel(HIDDEN) ;. On a Mac, the end of line is just a ‘\r’, so this change breaks parsing for Mac files.
23 Jun 2017 Fix Rule Names Conflicts. This commit is a symbol name change. E.g., typeid conflicts with one of the targets.  
26 Aug 2018 Fix duplicated Typeid() function declarations problem. This commit is a symbol name change, Typeid => typeidofthetypeid. Most name changes now just append an underscore ‘_’ to the end of the name. But, the Go target has been fixed, so the rename is no longer required.
26 Dec 2018 CPP grammar does not parse. This commit is a symbol name change for conflicts in the C++ target. Huge changes in the grammar were for mostly reformatting! Those changes should have gone into a separate commit. It’s hard to verify whether additional changes beyond renaming were made.
26 Dec 2018 partial fix for CPP14.g4 Stringliteral concatenation. This commit changes literal : … | Stringliteral | … to literal : … | Stringliteral+ | …. This change was probably ok, but it was later backed out without any explanation as part of a larger/unrelated commit. But, the Spec states clearly that string concatenation is performed by the preprocessor. This is the wrong place for implementing string concatenation.
Posted in Tip