Scraping the grammar for C++ from the ISO/IEC 14882 Specification

Draft. I am still working on the extraction tool. -Ken

Grammars are specifications of the formal syntax for a language. They are used by developers of programming languages to implement tools, e.g., compilers, interpreters, syntax-directed editors, documentation systems, source code query systems, and refactoring tools. Developers of program language tools rather use a grammar than an existing parser in order to check the correctness of the language. Using an existing parser forces the developer to adapt to the implementation language and API, often leaving him in a weaker position to solve problems after parsing because it does not plug-and-play well with other tools.

Some popular languages nowadays do not even use formal grammars to help specify the language, e.g., Julia or V. Fortunately, C++ is not one of these.

An official edition of the specification for C++ is ISO/IEC 14822, and is available at iso.org. Over the years, the C++ Standards Committee has released a new version of the specification every three years: C++14 (Fourth Edition), C++17 (Fifth Edition), and C++20 (Sixth Edition). Drafts of the Spec can also be found in the “Drafts GitHub repository”. While the drafts are public and free, the official editions are not, and the drafts differ from the official editions.

The main problem with the ISO C++ Specifications is extracting a working grammar from the document. This is what this note will discuss.

Extracting the formal grammar from the Specification

Scraping is the extraction of a machine-readable formal grammar from a specification. Scraping involves much more than just a copy/paste operation because it involves the identification, extraction, and correction of the grammar from a source document, and subject to all the usual problems:

typos;
bad OCR text;
missing tags in the meta for the document to identify the rules for the formal grammar;
different syntax for the formal grammar;
incorporation of natural language descriptions within the rules.

In addition, information on the extraction must be documented:

When was the spec fetched?
Where was the spec fetched?
What version is the spec?
What method was used for the extraction?
Who did the extraction?
What modifications were applied to the grammar?

Scraping involves the following steps:

Collect the specification.
Identify the boundaries of the grammar rules.
Extract the text of the formal grammar using a repeatable, automatic method.
Correct issues due to OCR problems, including run-ins, wrong font combinations.
- An example of a run-in is in the Grammar Zoo with alias-declaration/attribute-specifier-seq and opt.
- Correct rules in OCRed text.
Correct errors in the specification itself, such as:
- Removing duplicate rules.
- Adding rules that are missing.
- Correcting misspellings.
Convert the grammar into a common formal grammar syntax.
- Transform natural language descriptions into formal descriptions.
Transform the formal grammar to a particular parser generator.
Identify and implement the lexer/parser boundary.
Test the grammar.
Optimize the grammar for the parser generator.

A critical aspect of the extraction process is to implement an automated method of extraction and correction. The objective of using an automated tool is to boost the reliability and correctness of the extracted grammar. The C++ ISO Specifications are similar to one another, and an automated system should take into account these similarities.

Problems in the ISO C++ Specifications

Natural language descriptions in rules

q-char: any member of the source character set except new-line and “

For raw-string, the end delimiter must be handled with a semantic predicate.

r-char: any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote “.

Use of left recursion

noptr_abstract_declarator : noptr_abstract_declarator ? parameters_and_qualifiers | noptr_abstract_declarator ? ‘[’ constant_expression ? ‘]’ attribute_specifier_seq ? | ‘(‘ ptr_abstract_declarator ‘)’ ;

attribute_specifier_seq : attribute_specifier_seq ? attribute_specifier ;

Left recursion is handled by some parser generators, but it all really depends. While Antlr4 handles immediate left recursion in parser rules, it does not handle left recursion in lexer rules.

Restrictions in refactorings

How a system refactors the rules in a scraped grammar may be order dependent. Antlr4 does not parse rules renamed to lexer names if RHS contains any parser names. So, if given rules a: b 'C'; b : 'B';, one first renames a to A, the resulting grammar will not compile by the Antlr4 tool, even if it is followed by renaming b to B. Renaming is order-dependent.

Reordering rule alts

Order of the alternatives in a rule in the Specification may have additional meaning in a parser generator. Therefore, reordering the alternatives may be necessary. For example, this reordering was necessary for Antlr4.

Greedy vs. Non-greedy rules

This change was required for parsing declSpecifierSeq.

declSpecifierSeq: declSpecifier+? attributeSpecifierSeq?;

Parser-dependent lexing

The C++ Specification shares the C++ and Preprocessor rules. Unfortunately, the lexer is context-dependent on which parse one is performing.

C++ grammar is ambigous for ‘#’ non_directive

The grammar has an alt of group_part for ‘#’ non_directive.

header_name is either a string literal with double quotes or angle brackets. It turns out that c++config.h in the GNU compiler has a constant_expression that passes an angle bracket to a function. The header_name needs to accept either as a String_literal.

Grammars for C++

Ed Willink’s C++ grammar

This Yacc grammar and the accompanying lex grammar are mentioned specifically by the C++ Standards Committee. See this discussion from the “FAQ section”. The grammars are dated from 1999 and do not appear to have been updated recently.

Grammar Zoo C++

The “Grammar Zoo” is a repository of grammars collected by Vadim Zaytsev for a research project. The collected grammars for C++ include the ISO C++ Drafts and grammars for specific parser generators. The official editions of the Specification are not included. Unfortunately, the quality of the Draft grammars is very low because it is uncorrected, manually-obtained, copy/paste text.

The Antlr4 C++ grammar

This grammar for Antlr4 is for the ISO C++14 Standard. It does not contain a preprocessor, nor extensions to parse GNU, Clang, or MS C++. In fact, it was the motivation for this research. Over the years, the free-and-open-source grammar has been modified. Some of these modifications were correct, others were not. The following table describes some of the changes and the impact of those changes.

Date	Change	Impact
4 Jul 2015	Initial revision
1 Sep 2015	Fix ‘-‘ operator in ‘unaryoperators’ for Issue#212. This commit changed unaryoperator : ‘\|’ \| ‘∗’ \| ‘&’ \| ‘+’ \| ‘!’ \| ‘~’ ; to unaryoperator : ‘\|’ \| ‘∗’ \| ‘&’ \| ‘+’ \| ‘!’ \| ‘~’ \| ‘-‘ ;, which added the ‘-‘ operator.	While the commit fixes the ‘-‘ operator, the unary-operator rule still contains ‘\|’, which is an error, from the initial revision, and it remains more or less to this day!
3 Oct 2015	Removed Java @header. This change removed a bogus @header from the grammar. @header makes the grammar target specific.
3 Oct 2015	Fixed the pom.xml `<packageName>` attribute. Package names should be empty in order to be target-independent.
29 Oct 2015	Issue#230. This issue is about whether the grammar handles preprocessing directives. It does, but only on a very basic level. In order to parse C++ files with preprocessor directives, the suggested solution is “cc -E t.c” will get the actual C++ input u)”.	The suggested solution uses the C preprocessor, not the C++ preprocessor. They two are not the same. See this explanation.
16 Dec 2015	Change ‘purespecifier’ rule; Issue #249. In the Spec, we have `pure-specifier -> = 0`. In the initial revision, this was implemented as purespecifier : Assign val = Integerliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} because the string literal ‘0’ conflicted with Integerliteral. This commit modifies the rule to purespecifier : Assign val = Octalliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} in order to parse static member initializations.	The commit breaks parsing pure virtual function declarations because Octalliteral never matches ‘0’: Octalliteral occurs after Integerliteral. The change has remained even though it is faulty.
24 Jun 2016	CPP14.g4 only compatible with Java ANTLR #415.	The issue has never been fixed.
24 Jun 2016	“Issue with other Antlr targets and pure-specifier”; part of PR416, reported in #415. The commit only changes the comments in the grammar.	The change does not fix target-independence.
14 Sep 2016	“Add lexer support for multiline macros in C++” with this hack. This commit adds the lexer rule MultiLineMacro : ‘#’ (.*? ‘\’ [\r\n]+)+ ~[\r\n]+ -> skip;.	Preprocessing is not really implemented by the grammar. But, in addition, multi-line macros are not ISO C++ 2014.
12 Nov 2016	CPP14.g4 bugfix - relational ops in template arguments. This change reorders alts from templateargument : constantexpression \| typeid \| idexpression; to templateargument : typeid \| constantexpression \| idexpression ;. in order to parse `"void TestFunc(vector<ClassA>args, vector<ClassB> args2)"` correctly.
9 Dec 2016	“bugfix - multiline macro parse consumes all remaining input”. This commit changes MultiLineMacro : ‘#’ (.? ‘\’ [\r\n]+)+ ~[\r\n]+ -> skip; Directive : ‘#’ ~[\r\n] -> skip; changed to MultiLineMacro : ‘#’ (~[\n]? ‘\’ ‘\r’? ‘\n’)+ ~[\n]+ -> channel(HIDDEN) ; Directive : ‘#’ ~[\n] -> channel(HIDDEN) ;.	On a Mac, the end of line is just a ‘\r’, so this change breaks parsing for Mac files.
23 Jun 2017	Fix Rule Names Conflicts. This commit is a symbol name change. E.g., typeid conflicts with one of the targets.
26 Aug 2018	Fix duplicated Typeid() function declarations problem. This commit is a symbol name change, Typeid => typeidofthetypeid.	Most name changes now just append an underscore ‘_’ to the end of the name. But, the Go target has been fixed, so the rename is no longer required.
26 Dec 2018	CPP grammar does not parse. This commit is a symbol name change for conflicts in the C++ target.	Huge changes in the grammar were for mostly reformatting! Those changes should have gone into a separate commit. It’s hard to verify whether additional changes beyond renaming were made.
26 Dec 2018	Partial fix for CPP14.g4 Stringliteral concatenation. This commit changes literal : … \| Stringliteral \| … to literal : … \| Stringliteral+ \| …	This change was probably ok, but it was later backed out without any explanation as part of a larger/unrelated commit. But, the Spec states clearly that string concatenation is performed by the preprocessor. This is the wrong place for implementing string concatenation.

Tree Sitter C++ grammar

The Tree Sitter C++ grammar is a grammar implemented in JavaScript using functions, which makes it resemble a parse tree for a formal grammar in any other parser generator. This is converted into a JSON format grammar, which would be the best format for a converter to another EBNF syntax.

There is no attribution of the origins of the grammar, but this commit indicates it originated from a grammar for C99 in “Grammar Zoo”.

The readme.md for the repository gives two links to the syntax for C++ here and here. But, they have no relation to the grammar in this repository.

There is no formal refactorings employed, and no real discussion of what was done.

There are a total of four tests, two examples and two tests, in the repository. There are several other tests here. They check the parse tree produced for a small number of C++ snippets.

Lezer C++ grammar

The Lexer C++ grammar is based on the Tree Sitter C++ grammar.

While there is attribution of the origins of the grammar from the Tree Sitter grammar, there is no dicussion of the refactorings employed.

The test suite provided in the repository is similar to that for Tree Sitter.