Converting XText grammars into Antlr4

Recently, a developer asked a question about using an Antlr grammar generated from an XText grammar. The grammar, which was added to a Git version control repo, was generated from an XText file elsewhere in the Git repo. Besides the fact that the generated grammar was completely out of sync from the XText source, it seemed like it should compile using the Antlr3 tool, and generate a parser that should compile, but it didn’t. The Antlr3 grammar contained action code for tree construction using XText libraries.

But, could the Antlr3 grammar be converted to Antlr4 without the action code? Of course, with Trash, the answer is yes.

Christian Dietrich, the lead developer for Eclipse XText, suggested that one just generate a debug version of the grammar. This note explains how to generate an Antlr3 grammar from an XText file using Eclipse.

Generating an Antlr grammar from XText

According to Wikipedia,

Xtext is an open-source software framework for developing programming languages and domain-specific languages (DSLs). Unlike standard parser generators, Xtext generates not only a parser, but also a class model for the abstract syntax tree, as well as providing a fully featured, customizable Eclipse-based IDE.

Unfortunately, XText is tightly bound tp the Eclipse IDE, and requires the environment in order to do just about anything. If you don’t want to install yet another IDE, and want an absolute turn-key method to create an Antlr4 grammar, you’re SOL. If there is a way to generate an Antlr grammar without the IDE, I couldn’t figure it out.

To get started, you need to download the IDE, and install “Eclipse IDE for Java and DSL Developers”. Afterward, you can then create a sample XText, and generate an Antlr3 grammar from it. The “15 Minute Tutorial” is helpful, but you can do this even faster by accepting the defaults.

Create a fresh Workspace. For me, this was a stumbling block because I have no idea what the terminology is. A “workspace” is a directory where you are going to place all your code. To create a free Workspace, start Eclipse. This form will pop open where you can type in the path to a directory that you want to be created.

Afterwards, this window appears.

You can now do a File → New → Project…, where this windows opens, and you can then create a new XText project:

Creating an XText project involves inputing information for that project, including the name of the domain-specific language. You can just accept all the defaults.

After clicking on Finish, it takes you right back to the previous window, where you are thinking what just happened? Yes, it looks the same, but now close off the “Welcome” sub-window, and you have full view of the program and grammar.

Next, you will need to alter the properties for the project so that a trimmed-down version of the Antlr3 grammar is produced. In the Package Explorer, navigate to “org.xtext.example.mydsl” -> GenerateMyDsl.mwe2 file. In the file, add a few lines: “parserGenerator = { debugGrammar = true }” within the “language = StandardLanguage” section.

Finally, you are now in position to generate the Antlr3 grammar for the XText grammar. Right-click anywhere in the grammar, and select from the pop-up menu Run As → Generate Xtext Artifacts. After the code has been generated, you can fish around for the grammar using Bash and find.

The generated Antlr grammar is “DebugInternalMyDsl.g”.

It’s not exactly hard to convert an XText grammar to Antlr3, but for sure it’s not easy. And, although I haven’t checked, I suspect that naked XText grammars (the grammar plus any included grammars) can’t be converted in absence of the Eclipse project information. And, it is dog slow to start up an instance of Eclipse–20 seconds or so on my 8 core Ryzen system.

Generating an Antlr4 grammar from XText

Trash currently doesn’t convert XText grammars directly to Antlr4, but it can convert the Antlr3-generated grammar from Eclipse into Antlr4. But, the main problem with the Antlr3 conversion is that it doesn’t take care of the extra parentheses that XText defines for some basic rules like that for RULE_STRING and RULE_SL_COMMENT:

RULE_STRING : ('"' ('\' .|~(('\'|'"')))* '"'|'\'' ('\' .|~(('\'|'\'')))* '\'');
RULE_SL_COMMENT : '//' ~(('\n'|'\r'))* ('\r'? '\n')? {skip();};

Trash does have a command to remove useless parentheses, but it also has a bug in being a little too greedy in doing so. Consequently, after you generate an Antlr3 “debug” grammar, edit the grammar to remove the extra parentheses. Then, using Trash, you can convert the file to Antlr4:

trparse -t antlr3 InternalDebugFoobar.g | trconvert | trsponge

Conversion of XText grammars directly to Antlr4 should be possible, but it is currently not implemented. But, it should not be too hard to do so. A XText grammar would require changes in syntax to fix XText scoping, also known as cross-referencing. Instead of using a terminal or non-terminal in place on a right-hand side element for a rule, XText uses a reference to another symbol’s type.

Model: packages+=Pack*;
Pack: 'package' name=ID '{' (defs+=Def | calls+=Call)* '}';
Def: 'def' name=ID;
Call: 'call' ref=[Def]; // same as "ref=[Def|ID]"
// From https://goto40.github.io/self-dsl/xtext_scoping/

When Eclipse converts the grammar to Antlr3, it substitutes the reference “ref=[Def]” with the “name” attribute value “ID”.

I find the syntax that XText uses for scoping confusing. Worse, it embeds static semantics constructs in the grammar as opposed to placing the static semantics in a separate specification–a separate file. I will have more to say on this later.

Posted in Tip

Parser generators on the Web

In my implementation of Trash on the Web, I collected a list of some online parser generators. There aren’t many, and most have a restricted set of features. For example, REx is probably the best in conversion and parser generation, but it doesn’t have basic grammar analysis. Many of the sites do not offer EBNF input, and none of the sites tell the user what the input syntax is.

REx

  • Link
  • Computes conversions to REx EBNF for ABNF, Antlr3, Antlr4, Bison, GOLD, Instaparse, JavaCC, Jison, PEG.js, XText
  • Computes parsers for REx grammar input, which is a variant of W3C EBNF. LL(k), backtracking.
  • 18 grammars in REx EBNF for various languages.

Grammophone

  • Link
  • Previous version at University of Calgary Link
  • No online help pages.
  • Computes:
    • Sanity checks.
    • LL(1), LR(0), SLR(1), LALR(1), LR(1) automaton and table
  • Syntax:
    • LHS symbol can be upper or lower case alphabetic.
    • LHS/RHS separator is ‘->’.
    • Rule separator is ‘.’.
    • Rules can span multiple lines.
    • Does not accept BNF or EBNF.

Princeton Online LL(1) demo

  • Link
  • Based on SJMachines SourceForge online LL(1) parser generator.
  • Accepts basic CFG input. Does not accept grammar syntax >=BNF (‘|’ is considered a terminal).
  • Accepts basic CFG input. Does not accept BNF, EBNF. ‘|’ is considered a terminal.
  • Syntax:
    • LHS symbol can be upper or lower case alphabetic.
    • LHS/RHS separator is ‘::=’.
    • Newline is rule separator.
    • Entire rule must be on one line (no splitting of the rule across lines).
  • Computes FIRST, FOLLOW, Nullable property, LL(1) table.Computes FIRST, FOLLOW, Nullable property, LL(1) table.

SJMachines SourceForge online LL(1)

  • Link
  • Accepts basic CFG input. Does not accept BNF, EBNF. ‘|’ is considered a terminal.
  • Syntax:
    • LHS symbol can be upper or lower case alphabetic.
    • LHS/RHS separator is ‘->’.
    • Newline is rule separator.
    • Entire rule must be on one line (no splitting of the rule across lines).
  • Computes FIRST, FOLLOW, Nullable property, LL(1) table.

Online PEG.js

  • Link
  • Computes PEG parser for JavaScript only.
  • Inputs PEG.js.

Planetcalc online LL(1)

  • Link
  • Accepts custom EBNF: comma-separated symbols; semi-colon terminated rules; grouping; double-quote terminals.
  • Computes parser table (JSON), cycles, detects left recursion but does not generate a parser.

Kocman’s LL(k) parser generator

  • Link
  • Simplified YACC input grammar, and k.
  • Outputs LL(k) tables, with export to CSV.

HackingOff LL(1) parser generator

  • Link
  • Accepts BNF sytax. It does not take EBNF as it claims, since it does not accept Kleene closure (either post-fix plus, or curly brackets).
  • Computes FIRST, FOLLOW, LL(1) tables.

Trash on the web

  • Link
  • Currently I’m developing Trash on the Web. But, I plan to have conversion, table and parser generation, analysis, automaton output and display, LR(k), ALL(*), FIRST, FOLLOW, and transformations.

Posted in Tip

Writing a universal programming language extension for VSCode

There was an interesting question in Stackoverflow today: Is it possible to write an extension for VSCode that can accept any grammar and support semantic highlighting, defs/refs, and other language features?

I’ve always toyed with the idea of having a “universal language server” with the server dynamically reading a grammar then parsing the input.

So, I decided to write such an extension. It took me about four hours to get something working, but I would assume some could write something faster and cleaner.

Is such an extension possible? The answer is yes, but writing one depends on several factors: Have you written a VSCode extension before? Are you starting from an existing implementation? Are you a language tool expert? If your answer “no” to either of these questions, it could take a bit of time to write.

I started from the AntlrVSIX extension for VSCode that I already wrote a year or two ago. The server is in C# and client in Typescript.

The server code originally required a month or two to write because there weren’t any good examples to model. In addition, a bit of infrastructure need to be written because the server-side APIs for LSP did not implement version 3.16 of Language Server Protocol, which includes “semantic highlighting”.

For the grammar specification, one has to put a stake in the ground and decide what parser generator to use. There are many generators, but I used Antlr4 because that is what I am familiar with. And, while Antlr4 can target C#, the parser it outputs requires a driver, and the code compiled. Therefore, I assumed this would be done in a step prior to running VSCode. “Users” can write a new grammar, generate a parser, build, then tell the extension where the parser is located.

I’ve noted that other parser implementations exist. One “fast” implementation in JavaScript is Chevrotain. Unfortunately, the rules are specified in JavaScript, so while it would be easy to convert an EBNF-based grammar like Antlr or BNFC to Chevrotain, scraping the code to extract the grammar from Chevrotain syntax would be difficult. That’s because one can modify the rules on the fly.

The output of a parse is a parse tree. But a grammar in itself is insufficient to classify tokens in some file.

In addition to the grammar, you will need to declare classes of symbols. Here, I used XPath expressions, which is an expression to match a particular parse tree node. The reason for XPath expression is because we want the context of a token to be used for classification. For example, this grammar-based editor selects tokens from the lexer for a classification. But, this ignores the fact that in many languages, tokens have different meaning because they occur in different contexts in the parse. If you ignore the parse tree context that the token occurs within, you may as well just use Textmate.

Note, a Textmate language spec is an alternative solution, but Textmate grammars are difficult to implement. A Textmate pattern is either a single regular expression, or a “dual-level” regular expression pattern. As far as I know, there are no tools to convert a context-free grammar with class patterns into a Textmate specification. My gut feeling is that an XPath expression into a “dual-level” pattern, but not always.

Putting it all together: Java as an example

Let’s start with the Java grammar. To get this extension to work, you will need to clone the grammars-v4 repo, and build a parser for the Java grammar.

git clone https://github.com/antlr/grammars-v4
cd grammars-v4/java/java
trgen
dotnet build
cd ../..

Next, we need to set up the three ~/.grammar-* files. In ~/.grammar-location:

c:/users/foobar/documents/grammars-v4/java/java

In ~/.grammar-classes:

class
property
variable
method
keyword
string

Next, we need to clone the extension code, build it, then run VSCode.

In ~/.grammar-classifiers:

//classDeclaration/IDENTIFIER
//fieldDeclaration/variableDeclarators/variableDeclarator/variableDeclaratorId/IDENTIFIER
//variableDeclarator/variableDeclaratorId/IDENTIFIER
//methodDeclaration/IDENTIFIER
//(ABSTRACT | ASSERT | BOOLEAN | BREAK | BYTE | CASE | CHAR | CLASS | CONST | CONTINUE | DEFAULT | DO | DOUBLE | ELSE | ENUM | EXTENDS | FINAL | FINALLY | FLOAT | FOR | IF | GOTO | IMPLEMENTS | IMPORT | INSTANCEOF | INT | INTERFACE | LONG | NATIVE | NEW | PACKAGE | PRIVATE | PROTECTED | PUBLIC | SHORT | STATIC | STRICTFP | SUPER | SWITCH | SYNCHRONIZED | THIS | THROW | THROWS | TRANSIENT | TRY | VOID | VOLATILE | WHILE)
//(DECIMAL_LITERAL | HEX_LITERAL | OCT_LITERAL | BINARY_LITERAL | HEX_FLOAT_LITERAL | BOOL_LITERAL | CHAR_LITERAL | STRING_LITERAL | NULL_LITERAL)

Finally, we need to clone the extension code, build it, then run VSCode.

git clone https://github.com/kaby76/uni-vscode.git
cd uni-vscode
dotnet build
cd VsCode
bash install.sh
code .

Semantic highlighting

I tried it out for Java, and have five classes of symbols defined. Here’s how the editor displays a Java source file with the standard Java extension.

This is how the editor displays the same Java source file with the “universal language server”.

Hover, Select, Defs/Refs

This implementation does support mouse hover pop-ups. The information displayed is just the classes that the symbol belongs to.

Select (aka “TextDocument/DocumentHighlight”) only selects one symbol because there is no symbol table implemented. Similarly, the Defs/Refs of a symbol only fetch the one symbol selected.

Implementation problems

  • Tagging is slow: O(Number of tree nodes * number of Classifications). This is because each classifier is run separately with the XPath engine.
  • Defs/Refs/Select only handle one symbol.
  • The location of the grammar, classes, and classifiers should be in one file.
  • The server can only handle one grammar per VSCode session.

–Ken, July 9, 2021

Posted in Tip

Testing of Antlr language targets

A few weeks of work, and I finally have a driver generator that can test any grammar with any target.

alias make='mingw32-make.exe'
for i in CSharp Java JavaScript Dart Python3
do
    echo $i
    rm -rf Generated
    dotnet-antlr -t $i
    pushd Generated
    ls
    mingw32-make.exe
    mingw32-make.exe run RUNARGS="-input 1+2 -tree"
    mingw32-make.exe clean
    ls
    popd
done
Posted in Tip

Forking a Git repo more than once

Github does not support forking a repo more than once. If you try to do that, it’ll just redirect to your original fork. That could be a problem if you have an outstanding pull request on a branch you want to modify for an unrelated change. Rather than muck around with the forked repo, create a new for this way.

  • git clone https://github.com/antlr/antlr4.git antlr4-pr2
  • In Github, create a new repo antlr4-pr2.
  • git remote rename origin upstream
  • git remote add origin https://github.com/yourgitid/antlr4-pr2.git
  • git remote -v
  • git push -v origin master

The main problem with this approach is that Github imposes a lot of restrictions on what you can do. Github “pull requests” cannot be performed from repositories that haven’t been explicitly “forked” through the UI. It might be best to follow the normal workflow: fork the repo, clone it, “git branch foobar”, “git checkout foobar”, then make your changes and checkin and publish from the Github Desktop app.

Posted in Tip

System Service Exception BSOD

I’ve been seeing BSOD, and did a dism and sfc. Looks like there were some problems…

C:\WINDOWS\system32>dism /online /cleanup-image /restorehealth

Deployment Image Servicing and Management tool
Version: 10.0.19041.572

Image Version: 10.0.19042.685

[==========================100.0%==========================] The restore operation completed successfully.
The operation completed successfully.

C:\WINDOWS\system32>sfc /scannow

Beginning system scan. This process will take some time.

Beginning verification phase of system scan.
Verification 100% complete.

Windows Resource Protection found corrupt files and successfully repaired them.
For online repairs, details are included in the CBS log file located at
windir\Logs\CBS\CBS.log. For example C:\Windows\Logs\CBS\CBS.log. For offline
repairs, details are included in the log file provided by the /OFFLOGFILE flag.

C:\WINDOWS\system32>

Posted in Tip

Release 8.3 of Antlrvsix

After two months of work, I’m a few days away from making release 8.3 for Antlrvsix. Most of the changes actually pertain to Trash, the command-line interpreter shell contained within Antlrvsix, but there are a few other important changes to the extension itself.

Release 8.3 features an all-new input/output pipe implementation between commands in Trash. It uses a JSON serialization of parse trees between commands. The JSON serialization also contains the parser, lexer, token stream, file contents, and file location. Basically, it includes everything that one would need for a command to manipulate a parse tree. The purpose of the JSON file is so that each command can now be implemented “out-of-process”, meaning that Trash can now be replaced by Bash and each command implemented as an independent program in any language of choice. Nobody wants yet another monolithic program to implement programming language tools. This release works towards an open system. With this release, all the commands and Trash still are implemented in one program, but it will be switched over in a month or two.

I’ve also included a few new goodies:

  • Added a grammar diff program.
  • Added an ISO 14977 parser.
  • Added AGL (Automatic Graph Layout), XML, JSON output.
  • Added Cat, Echo, Wc commands.
  • Adding BNFC’s LBNF and other associated grammars. NB: This work is not yet complete, and only works for the simplest of grammars.

For ISO 14977, it was something that I decided to implement a while ago. But I didn’t know what I was getting into, and really should have read what D. Wheeler wrote about the spec. While it is now almost done, I learned along the way that the spec has several problems. One error is that the symbol meta identifier cannot contain spaces (meta identifier = letter, (letter | decimal digit);), yet throughout the spec–and meta identifier itself–meta identifier should allow spaces! And, as Wheeler pointed out, there are many other problems. Yes, grammars in Iso 14977 are very verbose…”a sea of commas”. But, it does have some interesting features, and so worth adding a parser for it.

The “diff” program I implemented with this release is interesting. I used the Zhang-Shasha tree-edit distance algorithm, extending it to record the actual tree edits that correspond to the minimum tree-edit distance. This algorithm is, unfortunately, for an ordered tree, so it works best for small differences. I will be trying to implement other algorithms in the next month or two. There is certainly a lot that could be done here. One important type of difference is to include no only simple single-node inserts and deletes, but more complex operations like fold and unfold.

In addition, with this release, I’m disabling semantic highlighting for VS2019. This is because it’s buggy and slow, and despite my warning people, they complain about it being buggy and slow! Use the extension for VSCode. It’s really very good. In the next release, I will try to fix Antlrvsix for VS2019, but you never know: Microsoft needs to implement semantic highlighting in its LSP client for VS2019.

–Ken

Posted in Tip

Antlrvsix and Trash release v8.2

Released yesterday is version 8.2 of Antlrvsix and Trash. This release enhances the Trash shell further, making it look more and more like a full-fledged analogy to the Bash shell. However, Trash uses the lingua-franca of parse trees and XPath. In this release, one can now pipe output–parse trees–between commands. The find command has been renamed to xgrep to further the analogy with grep. Various output commands have been added, such as st and text. File globbing for ls, cd, parse and other commands has been rewritten to be more like what you would see in Bash. I’ve added a run command to generate, build, and run parsers.

All in all, I think Trash is finally becoming the tool I’ve envisioned for language development.

Posted in Tip

Updates to Antlrvsix

Many months ago, I had VSCode working with Antlrvsix. Back then, rather than release what I had, I decided to put it off because I was concentrating my effort on getting the server capabilities expanded. But, the main problem why I didn’t release a VSCode extension was that I could not support “semantic highlighting” in my server in a standardized way because the API I was using did not support it. Since then, the server capabilities have been enhanced. But, more importantly, I changed the API to get semantic highlighting in Antlrvsix to work with VSCode. I have now released Antlrvsix for VSCode to the Microsoft Marketplace for VSCode.

To get semantic highlighting working with VSCode, I decided to write a drop-in replacement for Microsoft’s Microsoft.VisualStudio.LanguageServer.Protocol API. While Microsoft does make a release of the API every three months or so, there are many features missing that have been in the LSP spec for years. Semantic highlighting is a crucial new addition, but I have no confidence that Microsoft will ever implement it based on the changes I’ve seen over the last year. This drop-in replacement is the current version with additions for semantic highlighting.

On the grammar transforms, I have a script for the “Trash” command-line tool of Antlrvsix to optimize the Java9 grammar partially. The transforms for expressions aren’t yet there, but so far, the optimized grammar parser works several times faster.

Posted in Tip

AntlrTreeEditing library added

I’ve pulled all the Antlr parse tree editing routines into its own library yesterday. This library fills a gap between what is offered in the Antlr runtime (a psuedo XPath library, AddChild and Parent accessors for ParserRuleContext), and a full-blown transformation system like ASF+SDF.

This library contains:

  • a beefed-up XPath version 2 library;
  • a tree construction routine from an s-expression notation and an Antlr parser and lexer.
  • Antlr parse tree node replace and delete;
  • an exhanced parse tree node type that supports an observer pattern to keep data structures in sync when you modify a parse tree.

Right now it’s just in C#, but I plan at some point to translate it to Java because it is very useful.

–Ken

Posted in Tip