I’ve started to write an Antlr grammar for Bison, with the goal of automatically converting Bison grammars to Antlr, or to another parser generator for that matter. As it turns out, the “central dogma” of parsing (i.e., “you cannot use an LR grammar in an LL parser, and vice versa”) is untrue with the unlimited-lookahead parsers that are available nowadays. The major issue will be handling static semantics: many Bison grammars embed tree construction directly in their semantic actions and perform static semantic checks there as well. All of this needs to be ripped out and handled in a clean manner.
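For example, a rule in a typical Bison grammar often looks something like the hypothetical sketch below, where mk_node and check_types stand in for whatever helpers the grammar author uses; the converter has to separate the bare productions from everything inside the braces:

expr
    : expr '+' term   { $$ = mk_node(NODE_PLUS, $1, $3);  /* tree construction */
                        check_types($1, $3); }            /* static semantic check */
    | term            { $$ = $1; }
    ;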
I have a grammar for Bison that works pretty well on eleven different grammars of varying complexity. I am now looking into a GitHub scraper that searches for and collects all public Bison/Yacc grammars so I can place them in a test suite.
–Ken
Update Feb 24, 2020: I have a GitHub crawler that is now downloading 32K Yacc grammars. I plan to test each of them with this Bison parser. Here is the script (more or less as it stands; the chunking issues are not yet resolved).
#!/usr/bin/env python

#############
# Libraries #
#############

import sys
import os
import wget
import time
import simplejson
import csv
import pycurl
import certifi
import math
try:
    from BytesIO import BytesIO
except ImportError:
    from io import BytesIO

#############
# Constants #
#############

username = sys.argv[1]
password = sys.argv[2]
extension = sys.argv[3]
print("username " + username + " password " + password + " extension " + extension)

URL = "https://api.github.com/search/code?q="  # Base URL for the GitHub code search API
QUERY = "parser+extension:" + extension
PARAMETERS = "&per_page=100"                   # Additional query parameters (100 items per page)
DELAY_BETWEEN_QUERIES = 10                     # Time to wait between queries to GitHub (to avoid being banned)
OUTPUT_FOLDER = "./"                           # Folder where downloaded files will be stored
OUTPUT_FILE = "./data.csv"                     # Path to the CSV file generated as output

#############
# Functions #
#############

def getUrl(url):
    '''Given a URL, return its response body.'''
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.CAINFO, certifi.where())
    c.setopt(c.URL, url)
    c.setopt(c.USERPWD, '%s:%s' % (username, password))
    c.setopt(c.WRITEDATA, buffer)
    c.perform()
    c.close()
    body = buffer.getvalue()
    print("body " + str(body))
    return body

########
# MAIN #
########

count = 0
csvfile = open(OUTPUT_FILE, 'w')
writer = csv.writer(csvfile, delimiter=',')

def save(data):
    '''Download every item on one page of search results and record it in the CSV.'''
    global count
    global writer
    items = data["items"]
    for entry in items:
        print(entry["name"])
        name = entry["name"]
        repo = entry["repository"]
        url = entry["url"]
        more = simplejson.loads(getUrl(url))
        download_url = more["download_url"]
        os.mkdir(str(count))
        writer.writerow([str(count), str(name), str(download_url)])
        wget.download(download_url, out=OUTPUT_FOLDER + str(count) + "/" + name)
        count = count + 1
        time.sleep(DELAY_BETWEEN_QUERIES)

# Run the search query, then walk the remaining result pages and download each matching file
print("Processing ...")
url = URL + QUERY + PARAMETERS
dataRead = simplejson.loads(getUrl(url))
save(dataRead)
numberOfPages = int(math.ceil(dataRead.get('total_count') / 100.0))  # Results span several pages
for currentPage in range(2, numberOfPages + 1):
    print("Processing page " + str(currentPage) + " of " + str(numberOfPages) + " ...")
    url = URL + QUERY + PARAMETERS + "&page=" + str(currentPage)
    dataRead = simplejson.loads(getUrl(url))
    save(dataRead)
    time.sleep(DELAY_BETWEEN_QUERIES)
print("DONE! " + str(count) + " grammar files have been processed.")
csvfile.close()
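The script takes a GitHub username, a password (or personal access token), and the file extension to search for as command-line arguments; assuming it is saved as crawl.py, something like python crawl.py myuser mytoken y collects .y files, placing each one in a numbered subdirectory and recording its name and download URL in data.csv.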