At this point, I am working on the code for refactoring Antlr grammars in Antlrvsix. While I plan to use Piggy to implement these refactorings, I will first implement them in C# code so I can get a clear picture of what to do with Piggy, make any changes, then rewrite the refactorings using Piggy templates.
Here is a preliminary list of some transformations I plan to write:
- Replace string literals in the parser with lexer token types.
- Remove useless productions.
- Move start symbol to the top of parser grammar.
- Order parser rules (a) alphabetically, (b) DFS traversal, or (c) BFS traversal.
- Separate/combine lexer and parser grammars.
- Convert parser rules to fragment lexer rules and vice versa.
- Move lexer fragment rules to top/bottom.
- Order modes in alphabetic order.
- Remove left recursion.
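As an illustration of the first transformation, a quoted literal in a parser rule is replaced by the name of the lexer token that matches it. A toy sketch of the idea in Python (the real implementation works on the parse tree, not with a regex; the rule text and token names here are just examples):

```python
import re

def replace_literals(rule_text, literal_to_token):
    """Replace each quoted literal in a parser rule with its lexer token
    name, e.g. ';' -> SEMI, leaving unknown literals untouched."""
    def sub(m):
        lit = m.group(1)
        return literal_to_token.get(lit, m.group(0))
    return re.sub(r"'([^']*)'", sub, rule_text)
```

For example, `replace_literals("stat : expr ';' ;", {";": "SEMI"})` yields `"stat : expr SEMI ;"`.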
The first three transformations have already been added to Antlrvsix. Here are some clips on how they work.
Additionally, there are a number of papers to check out.
- Halupka, Ivan, and Ján Kollár. “Catalog of grammar refactoring patterns.” Central European Journal of Computer Science 4.4 (2014): 231-241.
- Kollár, Ján, et al. “pLERO: Language for grammar refactoring patterns.” 2013 Federated Conference on Computer Science and Information Systems. IEEE, 2013.
- Porubän, Jaroslav, and Milan Nosáľ. “Practical experience with task-driven case studies.” 2014 IEEE 12th IEEE International Conference on Emerging eLearning Technologies and Applications (ICETA). IEEE, 2014.
- Kraft, Nicholas A., Edward B. Duffy, and Brian A. Malloy. “Grammar recovery from parse trees and metrics-guided grammar refactoring.” IEEE Transactions on Software Engineering 35.6 (2009): 780-794.
- Sarbo, Janos J. “Grammar transformations for optimizing backtrack parsers.” Computer Languages 20.2 (1994): 89-100.
- Lior Zur-Lotan and Avi Hayoun, Lecture notes “Transforming grammars to LL(1)”.
- Soisalon-Soininen, Eljas, and Esko Ukkonen. “A method for transforming grammars into LL(k) form.” Acta Informatica 12.4 (1979): 339-369.
- Smith, James. “Eliminating Left Recursion without the Epsilon.” arXiv preprint arXiv:1908.10888 (2019).
- Johnson, Mark, and Brian Roark. “Compact non-left-recursive grammars using the selective left-corner transform and factoring.” arXiv preprint cs/0008021 (2000).
- Moore, Robert C. “Removing left recursion from context-free grammars.” Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference. Association for Computational Linguistics, 2000.
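Several of these papers deal with the last transformation in the list, left-recursion removal. The textbook rule they build on (A → Aα | β becomes A → βA′, A′ → αA′ | ε) can be sketched as follows; the list-of-symbols representation and the `_tail` naming convention are my own, not Antlrvsix's:

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """A -> A a | b  becomes  A -> b A' ; A' -> a A' | <empty>.
    Each alternative is a list of symbol names."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    other = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}
    tail = nt + "_tail"  # fresh nonterminal (hypothetical naming)
    return {
        nt: [alt + [tail] for alt in other],
        tail: [alt + [tail] for alt in recursive] + [[]],
    }
```

For `E -> E + T | T` this produces `E -> T E_tail` and `E_tail -> + T E_tail | <empty>`.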
This is just a note to myself on what procedures I take to clean up source code. This occurs because of the basic development cycle in trying to get something to work with complicated APIs, which goes something like this:
- I search the web to find something that supposedly works, i.e., it was written by someone authoritative, e.g., Microsoft, and checked into GitHub.
- I copy and paste snippets into my code and try it out.
- It usually doesn’t work because the APIs have changed or become obsolete, or I’m trying to add something that isn’t compatible with the rest of the code. If it doesn’t work, I go back and try something else.
- After a bunch of iterations of adding in useless code, I find something that works and check it in.
- I now have to clean up the code, which is full of garbage from the snippets.
To clean the code, I employ several tools: ReSharper, CodeMaid, and the Visual Studio IDE. I do not know what Visual Studio Code does, nor do I care, because it is such a poor development environment. I mentioned some of the problems in a previous post.
- The first step is to reorder all the garbage so that each class’s fields, properties, and methods are grouped, and the members within each group are ordered alphabetically. To reorganize, select the document, then right-click and select “CodeMaid->Reorganize Active Document”. Note, if you have preprocessor directives, CodeMaid will not observe them in the reorganization. You should push any pragmas into the project .csproj file.
- Use VS 2019 “Analyze => Code Cleanup”. This feature was just added, and contains numerous fix-ups: “Apply object/collection initialization preferences”; “Apply expression/block body preferences”; “Apply ‘this.’ qualification preferences”; “Sort usings”; “Remove unnecessary usings”; “Apply implicit/explicit type preferences”; “Remove unused variables”; “Remove unnecessary casts”; “Apply inline ‘out’ variable preferences”; “Add accessibility modifiers”; “Sort accessibility modifiers”; “Make private fields read-only when possible”; “Apply language/framework type preferences”; “Add/remove braces for single-line control statements”. You might want to check-in your code prior to these changes.
- Remove manually any .cs file that is unused.
- Test using “Find all references” and manually remove any unused fields, properties, methods, and types. Note, take care not to remove seemingly unused elements or classes that APIs expect to find.
- Set the tab width to 4, and replace tabs with spaces.
- I personally do not care for most code documentation comments and usually strip all comments from code. They obscure the meaning of the code. The code should speak for itself. If you can’t understand your code, you should not program.
- Strip all #regions. It is a stupid concept and a waste of a compiler directive. It is better to use a proper development environment that does this for you.
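On pushing pragmas into the project file (first step above): a warning that would otherwise be suppressed with `#pragma warning disable` in source can instead be disabled project-wide in the .csproj via MSBuild’s `NoWarn` property. The warning codes below are just placeholders:

```xml
<PropertyGroup>
  <!-- Suppress these warnings project-wide instead of per-file pragmas -->
  <NoWarn>$(NoWarn);CS0168;CS0219</NoWarn>
</PropertyGroup>
```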
I’ve started to write an Antlr grammar for Bison, with the goal of automatically converting the grammars to Antlr, or another parser generator for that matter. As it turns out, the “central dogma” of parsing (i.e., “you cannot use an LR grammar in an LL parser, and vice versa”) is untrue with the unlimited symbol lookahead parsers that are available nowadays. The major issue will be handling static semantics. Many Bison grammars embed tree construction into the grammar, as well as performing static semantic checks. All this needs to be ripped out and done in a clean manner.
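A sketch of what that clean-up looks like (the rule and function names here are illustrative, not taken from any particular grammar):

```
/* Bison: tree construction embedded in the rule as a semantic action */
expr : expr '+' expr   { $$ = make_add($1, $3); } ;

/* Antlr4 after stripping the action: the parse tree is built
   automatically, and the make_add logic moves into a listener/visitor */
expr : expr '+' expr ;
```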
I have a grammar for Bison that works pretty well on eleven different grammars of varying complexity. I am now looking into a GitHub scraper that searches for and collects all public Bison/Yacc grammars so I can place them in a test suite.
Update Feb 24 2020: I have a GitHub crawler that is now downloading 32K Yacc grammars. I plan to test each of them with this Bison parser. Here is the script (more or less; the chunking issues are not yet resolved).
# Libraries #
import csv
import math
import os
import sys
import time
from io import BytesIO

import pycurl
import simplejson
import wget

# Constants #
username = sys.argv[1]
password = sys.argv[2]
extension = sys.argv[3]
print("username " + username + " password " + password + " extension " + extension)
URL = "https://api.github.com/search/code?q="  # The basic URL to use the GitHub API
QUERY = "parser+extension:" + extension
PARAMETERS = "&per_page=100"  # Additional parameters for the query (by default 100 items per page)
DELAY_BETWEEN_QUERYS = 10  # The time to wait between different queries to GitHub (to avoid being banned)
OUTPUT_FOLDER = "./"  # Folder where files will be stored
OUTPUT_FILE = "./data.csv"  # Path to the CSV file generated as output

# Functions #
def getUrl(url):
    ''' Given a URL it returns its body '''
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.USERPWD, '%s:%s' % (username, password))
    c.setopt(c.WRITEDATA, buffer)
    c.perform()
    c.close()
    return buffer.getvalue()

# MAIN #
count = 0
csvfile = open(OUTPUT_FILE, 'w')
writer = csv.writer(csvfile, delimiter=',')

# Run queries to get information in json format and download the file for each hit
url = URL + QUERY + PARAMETERS
dataRead = simplejson.loads(getUrl(url))
numberOfPages = int(math.ceil(dataRead.get('total_count') / 100.0))
# Results are in different pages
for currentPage in range(1, numberOfPages + 1):
    print("Processing page " + str(currentPage) + " of " + str(numberOfPages) + " ...")
    url = URL + QUERY + PARAMETERS + "&page=" + str(currentPage)
    dataRead = simplejson.loads(getUrl(url))
    for entry in dataRead["items"]:
        name = entry["name"]
        more = simplejson.loads(getUrl(entry["url"]))
        download_url = more["download_url"]
        writer.writerow([str(count), str(name), str(download_url)])
        outdir = OUTPUT_FOLDER + str(count)
        os.makedirs(outdir, exist_ok=True)
        wget.download(download_url, out=outdir + "/" + name)
        count = count + 1
    time.sleep(DELAY_BETWEEN_QUERYS)
print("DONE! " + str(count) + " files have been downloaded.")
csvfile.close()
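The “chunking issues” mentioned above refer to the GitHub search API capping any single query at 1,000 results. A common workaround (a sketch of my own, not part of the script) is to split the search into file-size ranges with the `size:` qualifier so that each sub-query stays under the cap; the `step` and `max_size` values here are illustrative:

```python
def chunk_urls(query, step=1000, max_size=5000):
    """Split a GitHub code search into size ranges (in bytes) so that
    each sub-query stays under the API's 1,000-result cap."""
    urls = []
    for low in range(0, max_size, step):
        urls.append("https://api.github.com/search/code?q=" + query +
                    "+size:%d..%d" % (low, low + step - 1) + "&per_page=100")
    return urls
```

Each returned URL can then be fetched and paginated exactly as in the script.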
Having now implemented an LSP server and two different clients for Antlr, I can say I am more familiar with the LSP protocol. From these implementations, what can I say? LSP works well for single-client/single-server instances, but I am wondering how I can use LSP in a situation where the Antlr LSP server must modify the workspace and read and alter non-Antlr source code.
What I want is to add a “go to visitor method” feature for a grammar symbol to my extension. What I mean by this is that I want to find the C#/C++ method that corresponds to a non-terminal symbol (see this documentation for an explanation of a visitor/listener in the Antlr runtime). To do this, the Antlr LSP server needs to parse the grammar to get the name of the grammar symbol the cursor is currently on. But it needs to parse the C# (or C++, etc.) as well. Currently, what happens is that the Antlr server assumes the source has been written to disk and uses the Microsoft.CodeAnalysis APIs directly. The user opens a Solution or Project in VS2019, opens a window on an Antlr grammar source file, positions the cursor on a non-terminal, then asks for “go to visitor method”. The Antlr server is called with a custom message “go to visitor method” because the knowledge of Antlr grammars is in the server. The server parses every C# source file (though the target could be C++ or another language) to find the visitor method corresponding to the grammar non-terminal. If it exists, the server reports the location back to the client. If it doesn’t exist, the server may now create or modify a source file; it may need to add a class or a method corresponding to the grammar non-terminal. Note that the Antlr server has no knowledge of whether the client currently has an LSP server opened on the C# source. Nor does it know if the copy is currently being modified but not saved. It doesn’t modify the workspace and report the edit back to the client, but it should.
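The lookup behind that custom message hinges on Antlr’s code-generation convention: for a parser rule `expr`, the C# target generates `VisitExpr(...)` in the visitor interface and `EnterExpr`/`ExitExpr` in the listener. A sketch of the name mapping the server needs to perform (the function name here is my own):

```python
def visitor_method_name(rule_name, kind="Visit"):
    """Map a grammar nonterminal to the generated C# visitor/listener
    method name, e.g. 'expr' -> 'VisitExpr'."""
    return kind + rule_name[0].upper() + rule_name[1:]
```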
The problem with this scenario is that there is no shared LSP server for the C# code; it’s owned by the editor and/or the LSP server that parses the C# source code. However, I need to parse the source code in order to provide this navigational (and possibly source-code-generating) feature. Is there any coordination of services between servers?
This is a question I recently asked on a Gitter conversation about LSP and colorization. It’s worth copying it here because there are a number of problems and solutions.
After several months of work, I’ve finally released version 5 of the Visual Studio 2019 extension for Antlr, AntlrVSIX. This version is a re-architecture of the code using a Language Server Protocol implementation. The extension is slimmer because it now only focuses on support for Antlr grammars and removes some of the features that LSP does not support (e.g., “go to listener/visitor” and colorized tagging). However, I now have added a template for creating C++ Antlr programs (in addition to the familiar C# Antlr template). The templates use an updated version of the msbuild rules and tool in Antlr4BuildTasks that works for csproj and vcxproj files (C# and C++ projects, respectively). There is a Net Core template in Antlr4BuildTasks.Templates as well. –Ken
The next step in the development of my LSP server for Antlr involves support for code completion. There is an API written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a bug in some situations. So, I’ll now describe the algorithms I wrote to compute the lookahead set for Antlr parsers. I’ll start with a few definitions, then go over the three algorithms that do the computation. An implementation of this is available in C# in the Antlrvsix extension for code completion, and another is provided by default in the Antlr .NET Core templates I wrote for syntax error messages.
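As background for those algorithms, lookahead computation generalizes the classic FIRST-set construction. A minimal fixed-point sketch for a plain BNF grammar (my own illustration, not the Antlrvsix algorithm, which works from the generated parser tables):

```python
def first_sets(grammar):
    """Compute FIRST sets by fixed-point iteration.
    grammar: dict nonterminal -> list of alternatives (lists of symbols);
    symbols not in the dict are terminals; "" denotes epsilon."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                for sym in alt:
                    if sym not in grammar:          # terminal: add and stop
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        break
                    add = first[sym] - {""}         # nonterminal: add its FIRST
                    if not add <= first[nt]:
                        first[nt] |= add
                        changed = True
                    if "" not in first[sym]:        # not nullable: stop
                        break
                else:                               # every symbol was nullable
                    if "" not in first[nt]:
                        first[nt].add("")
                        changed = True
    return first
```

For the grammar `E -> T E2`, `E2 -> + T E2 | ε`, `T -> id`, this yields FIRST(E) = {id}, FIRST(E2) = {+, ε}, FIRST(T) = {id}.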
For those interested in creating an Antlr4 program using C#, I wrote a dotnet package and uploaded it to Nuget. There is similar functionality in the VS2019 extension AntlrVSIX, but I am starting to move towards a Language Server Protocol client/server implementation for Antlr. This package capitalizes on the work I did with Antlr4BuildTasks supporting MSBUILD builds using the Java Antlr tool v4.7.2.
Continuing with my work in making Antlrvsix a Language Server Protocol server implementation, I created another extension for Visual Studio 2019 that uses the client API Microsoft.VisualStudio.LanguageServer.Client, this time with the Clangd LSP server. The extension source code is here. But, there is a hitch…
Continuing with my work in making Antlrvsix a Language Server Protocol server implementation, I created an extension for Visual Studio 2019 that uses the client API Microsoft.VisualStudio.LanguageServer.Client with the Eclipse Java Language Server Protocol implementation. This extension follows the steps outlined in an old article on creating LSP clients in Visual Studio. The extension source code is here. The code has been updated so that the only requirement is that you have the Java runtime downloaded and installed, and JAVA_HOME set to the top-level directory of the Java runtime. The code will prompt you for the path and warn you that it isn’t set properly.
As noted even in the old MS documentation page, many of the client features are enabled, e.g., go to definition, find all references, reformat, hover, and typing completions. What is missing is building and debugging. But it is very usable.