Writing a universal programming language extension for VSCode

There was an interesting question in Stackoverflow today: Is it possible to write an extension for VSCode that can accept any grammar and support semantic highlighting, defs/refs, and other language features?

I’ve always toyed with the idea of having a “universal language server” with the server dynamically reading a grammar then parsing the input.

So, I decided to write such an extension. It took me about four hours to get something working, but I would assume some could write something faster and cleaner.

Is such an extension possible? The answer is yes, but writing one depends on several factors: Have you written a VSCode extension before? Are you starting from an existing implementation? Are you a language tool expert? If your answer “no” to either of these questions, it could take a bit of time to write.

I started from the AntlrVSIX extension for VSCode that I already wrote a year or two ago. The server is in C# and client in Typescript.

The server code originally required a month or two to write because there weren’t any good examples to model. In addition, a bit of infrastructure need to be written because the server-side APIs for LSP did not implement version 3.16 of Language Server Protocol, which includes “semantic highlighting”.

For the grammar specification, one has to put a stake in the ground and decide what parser generator to use. There are many generators, but I used Antlr4 because that is what I am familiar with. And, while Antlr4 can target C#, the parser it outputs requires a driver, and the code compiled. Therefore, I assumed this would be done in a step prior to running VSCode. “Users” can write a new grammar, generate a parser, build, then tell the extension where the parser is located.

I’ve noted that other parser implementations exist. One “fast” implementation in JavaScript is Chevrotain. Unfortunately, the rules are specified in JavaScript, so while it would be easy to convert an EBNF-based grammar like Antlr or BNFC to Chevrotain, scraping the code to extract the grammar from Chevrotain syntax would be difficult. That’s because one can modify the rules on the fly.

The output of a parse is a parse tree. But a grammar in itself is insufficient to classify tokens in some file.

In addition to the grammar, you will need to declare classes of symbols. Here, I used XPath expressions, which is an expression to match a particular parse tree node. The reason for XPath expression is because we want the context of a token to be used for classification. For example, this grammar-based editor selects tokens from the lexer for a classification. But, this ignores the fact that in many languages, tokens have different meaning because they occur in different contexts in the parse. If you ignore the parse tree context that the token occurs within, you may as well just use Textmate.

Note, a Textmate language spec is an alternative solution, but Textmate grammars are difficult to implement. A Textmate pattern is either a single regular expression, or a “dual-level” regular expression pattern. As far as I know, there are no tools to convert a context-free grammar with class patterns into a Textmate specification. My gut feeling is that an XPath expression into a “dual-level” pattern, but not always.

Putting it all together: Java as an example

Let’s start with the Java grammar. To get this extension to work, you will need to clone the grammars-v4 repo, and build a parser for the Java grammar.

git clone https://github.com/antlr/grammars-v4
cd grammars-v4/java/java
trgen
dotnet build
cd ../..

Next, we need to set up the three ~/.grammar-* files. In ~/.grammar-location:

c:/users/foobar/documents/grammars-v4/java/java

In ~/.grammar-classes:

class
property
variable
method
keyword
string

Next, we need to clone the extension code, build it, then run VSCode.

In ~/.grammar-classifiers:

//classDeclaration/IDENTIFIER
//fieldDeclaration/variableDeclarators/variableDeclarator/variableDeclaratorId/IDENTIFIER
//variableDeclarator/variableDeclaratorId/IDENTIFIER
//methodDeclaration/IDENTIFIER
//(ABSTRACT | ASSERT | BOOLEAN | BREAK | BYTE | CASE | CHAR | CLASS | CONST | CONTINUE | DEFAULT | DO | DOUBLE | ELSE | ENUM | EXTENDS | FINAL | FINALLY | FLOAT | FOR | IF | GOTO | IMPLEMENTS | IMPORT | INSTANCEOF | INT | INTERFACE | LONG | NATIVE | NEW | PACKAGE | PRIVATE | PROTECTED | PUBLIC | SHORT | STATIC | STRICTFP | SUPER | SWITCH | SYNCHRONIZED | THIS | THROW | THROWS | TRANSIENT | TRY | VOID | VOLATILE | WHILE)
//(DECIMAL_LITERAL | HEX_LITERAL | OCT_LITERAL | BINARY_LITERAL | HEX_FLOAT_LITERAL | BOOL_LITERAL | CHAR_LITERAL | STRING_LITERAL | NULL_LITERAL)

Finally, we need to clone the extension code, build it, then run VSCode.

git clone https://github.com/kaby76/uni-vscode.git
cd uni-vscode
dotnet build
cd VsCode
bash install.sh
code .

Semantic highlighting

I tried it out for Java, and have five classes of symbols defined. Here’s how the editor displays a Java source file with the standard Java extension.

This is how the editor displays the same Java source file with the “universal language server”.

Hover, Select, Defs/Refs

This implementation does support mouse hover pop-ups. The information displayed is just the classes that the symbol belongs to.

Select (aka “TextDocument/DocumentHighlight”) only selects one symbol because there is no symbol table implemented. Similarly, the Defs/Refs of a symbol only fetch the one symbol selected.

Implementation problems

  • Tagging is slow: O(Number of tree nodes * number of Classifications). This is because each classifier is run separately with the XPath engine.
  • Defs/Refs/Select only handle one symbol.
  • The location of the grammar, classes, and classifiers should be in one file.
  • The server can only handle one grammar per VSCode session.

–Ken, July 9, 2021

Posted in Tip