Version 7 of Antlrvsix released

I’ve made a release of Antlrvsix yesterday which includes recursion removal for Antlr grammars. There are two types of recursion that I can work with: direct (aka immediate) and indirect. The algorithm that I use for direct and indirect recursion removal are the ones that Aho et al. (2006) describe. In addition, the extension provides a transform to rewrite direct recursion (left or right) into Kleene operator form, which helps clean up the grammar considerably. These transformations help fix grammar problems that you may run across. But, to use these transformations, you have to select the rule that contains or participates in the recursion.

In addition, the extension now provides an explicit mapping for colorizing the grammar. The problem with LSP and VS2019 is that neither supports languages like Antlr. So, language features like nonterminals and terminals are mapped to client-specific data types that have display properties. This mapping can be modified in the options file for Antlrvsix. In addition, I now tag almost all of the grammar.

What’s next up for Antlrvsix? The main problem I wanted to solve is the performance of the Java9 parser, which performs terribly for certain input. The issue seems to involve left factoring, but it’s not clear. I’ll now be working on this for the next month. I was hoping to add Piggy into the extension, but that will have to wait.


Posted in Tip

Forgetting Flex for now

After a bit of fiddling around with what I thought would be a lexical analyzer for Flex, I now realize that the lexer isn’t defined by scan.l, but a combination of one generated from flex.skl, a file that is processed by m4, and the scanner generated by scan.l. Originally, flex.skl included some switches for C or C++, but it’s now m4 dependent. I’m not sure why #ifdefs wouldn’t have sufficed; it would have been far more portable. Further, after carefully inspecting the parser grammar, I now realize it is missing a ton of things from an input file because the lexer yanks so much information out of the input stream and never passes it onto the parser. For example, most of the options in a Flex file are simply swallowed by the scanner, only the ones with ‘=’ passed to the parser. (Bullshit.) Code blocks within %{ … %} or %top{ … } are swallowed up by the lexer, and never mentioned in the parser grammar. So, in order to really write an import of Flex, the lexer and scanner would have to be written from scratch. (Originally, I was toying with doing just that, after successfully converting Bison’s parse.y to Antlr, but chose to start with the Flex sources because I thought the grammars were complete. They just were complete garbage from the standpoint of an import.) Given the number of problems, I’m going to pass on Flex import for now. I will probably revisit Flex import but at a later date. For now, I’m ODed on a fruitless translation.

Fortunately, I don’t really need to implement any of this for Antlrvsix right now. I’m going to continue with the rest of the “to do’s” I have slated.

Posted in Tip

Flex vs Bison import

I’ve written a basic Bison import for Antlrvsix, and now I’m in the guts of writing a Flex import. While writing the Bison import, I soon realized a problem that I could not deal with immediately. Many people write Bison code that is very undisciplined. Often the code contains driver code (calc.y from the official Gnu repository), The static semantics is often wrapped up with the tree construction. While this is ok, this means the grammar contains target-language code, which is not a great way to write grammars. If you really want an abstract syntax tree instead of a parse tree for semantic checks, you should discipline the semantic checks outside of tree construction. For now, I do not copy code blocks to the generated Antlr grammar.

For Flex, the problem is that the embedded code exclusively performs semantic checks at the lexer phase. So this code must be included in a generated Antlr lexer grammar. The good news is that it’s not too difficult to convert the patterns in Flex into Antlr lexer rules and modes. You can then convert the code blocks in a Flex rule into your target language. Whether this works or not, I will find out–I’m translating line-by-line the Flex lexical grammar into an Antlr lexer grammar, and so far, it looks good.

Aside, I’ve been using Antlrvsix quite a bit myself to write the Flex grammars. I added a refactoring to sort lexer modes alphabetically since the original Flex “scan.l” code defined the start states in a haphazard order. And, since I’ve switch to the “Dark mode” theme in Visual Studio 2019, I added to the options color selections for grammar symbols so I can see the symbol text more clearly.


Posted in Tip

Status of transforms for Antlr

I’ve been adding transform/refactoring operations to Antlrvsix for the last two weeks. So far, I have the following: (1) replace parser string literals; (2) remove useless parser rules; (3) move parser start rules to top; (4) reorder parser rules alphabetically; (5) reorder parser rules vis BFS; (6) reorder parser rules DFS; (7) split combined grammars; (8) merge split grammars. I’ll soon start the really important transforms (rewrite rules to remove excessive factoring, left recursion, etc). I will also add in a Flex/Bison import tool to convert these grammars to Antlr. I’m making a release of Antlrvsix v5.5 with these changes, but much more is coming.


Posted in Tip

Drivers for any Antlr grammar

This is a placeholder for something I’ve been wanting to do for a while. I’m constantly seeing Antlr grammars posted in Stackoverflow that I want to test out. So, I’d like Antlr4BuildTasks.Templates to include something where you can just copy the grammar files into a directory, then type “dotnet new antlr”, and the damn thing will create a driver for the grammars. Another thing I’d like is to type “dotnet new antlr –grammar Java9” and it fish out the grammar files from the Antlr Grammars-v4 Github repository, then create a damn driver. This should be automated.

Posted in Tip

Git commits to Antlr vs Bison

While I hunker down here for SARS-CoV-19 to burn itself out (I am in a very high-risk category), I decided to check up on the activity of commits between Antlr and Bison. This is because I read on the Internet that Bison isn’t being actively developed. It turns out, not only is this not true, Bison is being modified more often than Antlr. I wonder what changes are being made to Bison. And, I wonder why activity for Antlr is so low.

Activity on Antlr

Activity on Bison

Posted in Tip

Transformations for Antlr grammars

At this point, I am working on the code for refactoring Antlr grammars in Antlrvsix. While I plan to use Piggy to implement these refactorings, I will first implement them in C# code so I can get a clear picture of what to do with Piggy, make any changes, then rewrite the refactorings using Piggy templates.

Here is a preliminary list of some transformations I plan to write:

  • Convert string literals in parser with lexer token types.
  • Remove useless productions.
  • Move start symbol to the top of parser grammar.
  • Order parser rules (a) alphabetically, (b) DFS traversal, or (c) BFS traversal.
  • Separate/combine lexer and parser grammars.
  • Convert parser rules to fragment lexer rules and vice versa.
  • Move lexer fragment rules to top/bottom.
  • Order modes in alphabetic order.
  • Remove left recursion.

The first three transformations have already been added to Antlrvsix. Here are some clips on how they work.

Additionally, there are a number of papers to check out.


Posted in Tip

Notes on cleaning up shitty source code

This is just a note to myself on what procedures I take to clean up source code. This occurs because of the basic development cycle in trying to get something to work with complicated APIs, which goes something like this:

  1. I search the web to find something that supposedly works, i.e., it was written by someone of authority, e.g. Microsoft, and was checked into Github.
  2. I copy and paste snippets into my code and try it out.
  3. It usually doesn’t work because the APIs were changed or gone obsolete, or I’m trying to add in something that isn’t compatible with other code. If it doesn’t work, I go back and try something else.
  4. After a bunch of iterations of adding in useless code, I find something that works and check it in.
  5. I now have to clean up the code, full of garbage from snippets.

To clean the code, I employ several tools: Resharper, CodeMaid, and Visual Studio IDE. I do not know what Visual Studio Code does, nor do I care because it is such a poor development environment. I mentioned some of the problems in a previous post.

  • The first step is to reorder all the garbage so that classes have elements of fields, properties, and methods ordered, and each element is ordered alphabetically. To reorganize, select the document, then right-click and select “CodeMaid->Reorganize Active Document”. Note, if you have preprocessor directives, CodeMaid will not observe them in the reorganization. You should push any pragmas into the project .csproj file.
  • Use VS 2019 “Analyze => Code Cleanup”. This feature was just added, and contains numerous fix-ups: “Apply object/collection initialization preferences”; “Apply expression/block body preferences”; “Apply ‘this.’ qualification preferences”; “Sort usings”; “Remove unnecessary usings”; “Apply implicit/explicit type preferences”; “Remove unused variables”; “Remove unnecessary casts”; “Apply inline ‘out’ variable preferences”; “Add accessibility modifiers”; “Sort accessibility modifiers”; “Make private fields read-only when possible”; “Apply language/framework type preferences”; “Add/remove braces for single-line control statements”. You might want to check-in your code prior to these changes.
  • Remove manually any .cs file that is unused.
  • Test using “Find all references” and remove manually any unused fields, properties, methods, and types. Note, take care to not remove any unused elements or classes that APIs are expecting.
  • Set tabs to 4 spaces, and replace them with spaces.
  • I personally do not care for most code document comments and usually strip all comments from code. They clutter the entire meaning of the code. The code should speak for itself. If you can’t understand your code, you should not program.
  • Strip all #regions. It is a stupid concept and a waste of a compiler directive. It is better to use a proper development environment that does this for you.

Posted in Tip

Parsing Bison grammars with Antlr

I’ve started to write an Antlr grammar for Bison, with the goal of automatically converting the grammars to Antlr, or another parser generator for that matter. As it turns out, the “central dogma” of parsing (i.e., “you cannot use an LR grammar in an LL parser, and vice versa”) is untrue with the unlimited symbol lookahead parsers that are available nowadays. The major issue will be handling static semantics. Many Bison grammars embed tree construction into the grammar, as well as performing static semantic checks. All this needs to be ripped out and done in a clean manner.

I have a grammar for Bison that works pretty well on eleven different grammars of varying complexity. I am now looking into a Github scraper that searches for and collects all public Bison/Yacc grammars so I can place them in a test suite.


Update Feb 24 2020: I have a Github crawler that is now downloading 32K Yacc grammars. I plan to test each of them with this Bison parser. Here is the script (more or less without chunking issues resolved).


# Libraries #

import sys
import os
import wget
import time
import simplejson
import csv
import pycurl
import certifi
import math
    from BytesIO import BytesIO
except ImportError:
    from io import BytesIO

# Constants #

username = sys.argv[1]
password = sys.argv[2]
extension = sys.argv[3]
print("username " + username + " password " + password + " extension " + extension)

URL = "" #The basic URL to use the GitHub API
QUERY = "parser+extension:" + extension
PARAMETERS = "&per_page=100" #Additional parameters for the query (by default 100 items per page)
DELAY_BETWEEN_QUERYS = 10 #The time to wait between different queries to GitHub (to avoid be banned)
OUTPUT_FOLDER = "./" #Folder where files will be stored
OUTPUT_FILE = "./data.csv" #Path to the CSV file generated as output

# Functions #

def getUrl(url):
    ''' Given a URL it returns its body '''
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.CAINFO, certifi.where())
    c.setopt(c.URL, url)
    c.setopt(c.USERPWD, '%s:%s' %(username, password))
    c.setopt(c.WRITEDATA, buffer)
    body = buffer.getvalue()
    print("body " + str(body))
    return body

# MAIN #

count = 0
csvfile = open(OUTPUT_FILE, 'w')
writer = csv.writer(csvfile, delimiter=',')

def save(data):
    global count
    global writer
    items = data["items"]
    for entry in items:
        name = entry["name"]
        repo = entry["repository"]
        url = entry["url"]
        more = simplejson.loads(getUrl(url))
        download_url = more["download_url"]
        writer.writerow([str(count), str(name), str(download_url)]), out=OUTPUT_FOLDER + str(count) + "/" + name)
        count = count + 1

#Run queries to get information in json format and download ZIP file for each repository
print("Processing ...")

dataRead = simplejson.loads(getUrl(url))
numberOfPages = int(math.ceil(dataRead.get('total_count')/100.0))

#Results are in different pages
for currentPage in range(1, numberOfPages+1):
    print("Processing page " + str(currentPage) + " of " + str(numberOfPages) + " ...")
    url = URL + QUERY + PARAMETERS + "&page=" + str(currentPage)
    dataRead = simplejson.loads(getUrl(url))
print("DONE! " + str(countOfRepositories) + " repositories have been processed.")
Posted in Tip

Dynamic and Multi-server LSP applications?

Having now implemented an LSP server and two different clients for Antlr, I can say I am more familiar with the LSP protocol. From these implementations, what can I say? LSP works well for single client/single server instances, but I am wondering how I can use LSP in a situation that modifies a workspace, reads and alters non-Antlr source code from the Antlr LSP server.

What I want is to add a “go to visitor method” of a grammar symbol feature to my extension. What I mean by this is that I want to find a C#/C++ method that corresponds to the non-terminal symbol (see this documentation for an explanation of a visitor/listener in the Antlr runtime). To do this, the Antlr LSP server needs to parse the grammar to get the name of the grammar symbol the cursor is currently on. But, it needs to parse C# (or C++, etc) as well. Currently, what happens is it assumes the source has been written to disk and the Antlr server uses the Microsoft.CodeAnalysis APIs directly. The user opens a Solution or Project in VS2019, opens a window on an Antlr grammar source file, positions the cursor on a non-terminal, then asks for “go to visitor method”. The Antlr server is called with a custom message “go to visitor method” because the knowledge of Antlr grammars is in the server. The server parses every C# (but could be in C++ or another target language) to find the visitor method corresponding to the grammar non-terminal. If it exists, the server reports the location back to the client. If it doesn’t exist, the server now may create a source file, or modify a source file. It may need to add a class, or a method to correspond to the grammar non-terminal. Note that the Antlr server has no knowledge of whether the client currently has an LSP server opened on the C# source. Nor does it know if the copy is currently being modified but not saved out. It doesn’t modify the workspace and report it back to the client, but it should.

The problem with this scenario is that there is no shared LSP server for the C# code–it’s owned by the editor and/or the LSP server that parses the C# source code. However, I need to parse the source code in order to provide this navigational (and possible source code generation) feature. Is there any coordination of the services between servers?

Posted in Tip