Parsing Bison grammars with Antlr

I’ve started to write an Antlr grammar for Bison, with the goal of automatically converting the grammars to Antlr, or another parser generator for that matter. As it turns out, the “central dogma” of parsing (i.e., “you cannot use an LR grammar in an LL parser, and vice versa”) is untrue with the unlimited symbol lookahead parsers that are available nowadays. The major issue will be handling static semantics. Many Bison grammars embed tree construction into the grammar, as well as performing static semantic checks. All this needs to be ripped out and done in a clean manner.

I have a grammar for Bison that works pretty well on eleven different grammars of varying complexity. I am now looking into a Github scraper that searches for and collects all public Bison/Yacc grammars so I can place them in a test suite.

–Ken

Update Feb 24 2020: I have a Github crawler that is now downloading 32K Yacc grammars. I plan to test each of them with this Bison parser. Here is the script (more or less without chunking issues resolved).

#!/user/bin/python

#############
# Libraries #
#############

import sys
import os
import wget
import time
import simplejson
import csv
import pycurl
import certifi
import math
try:
    from BytesIO import BytesIO
except ImportError:
    from io import BytesIO

#############
# Constants #
#############

username = sys.argv[1]
password = sys.argv[2]
extension = sys.argv[3]
print("username " + username + " password " + password + " extension " + extension)

URL = "https://api.github.com/search/code?q=" #The basic URL to use the GitHub API
QUERY = "parser+extension:" + extension
PARAMETERS = "&per_page=100" #Additional parameters for the query (by default 100 items per page)
DELAY_BETWEEN_QUERYS = 10 #The time to wait between different queries to GitHub (to avoid be banned)
OUTPUT_FOLDER = "./" #Folder where files will be stored
OUTPUT_FILE = "./data.csv" #Path to the CSV file generated as output

#############
# Functions #
#############

def getUrl(url):
    ''' Given a URL it returns its body '''
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.CAINFO, certifi.where())
    c.setopt(c.URL, url)
    c.setopt(c.USERPWD, '%s:%s' %(username, password))
    c.setopt(c.WRITEDATA, buffer)
    c.perform()
    c.close()
    body = buffer.getvalue()
    print("body " + str(body))
    return body


########
# MAIN #
########

count = 0
csvfile = open(OUTPUT_FILE, 'w')
writer = csv.writer(csvfile, delimiter=',')

def save(data):
    global count
    global writer
    items = data["items"]
    for entry in items:
        print(entry["name"])
        name = entry["name"]
        repo = entry["repository"]
        url = entry["url"]
        more = simplejson.loads(getUrl(url))
        download_url = more["download_url"]
        os.mkdir(str(count))
        writer.writerow([str(count), str(name), str(download_url)])
        wget.download(download_url, out=OUTPUT_FOLDER + str(count) + "/" + name)
        count = count + 1
        time.sleep(DELAY_BETWEEN_QUERYS)


#Run queries to get information in json format and download ZIP file for each repository
print("Processing ...")

url = URL + QUERY + PARAMETERS
dataRead = simplejson.loads(getUrl(url))
save(dataRead)
numberOfPages = int(math.ceil(dataRead.get('total_count')/100.0))

#Results are in different pages
for currentPage in range(1, numberOfPages+1):
    print("Processing page " + str(currentPage) + " of " + str(numberOfPages) + " ...")
    url = URL + QUERY + PARAMETERS + "&page=" + str(currentPage)
    dataRead = simplejson.loads(getUrl(url))
    save(dataRead)
    time.sleep(DELAY_BETWEEN_QUERYS)
    
print("DONE! " + str(countOfRepositories) + " repositories have been processed.")
csvfile.close()
Posted in Tip