A comparison of Antlr grammars for parsing Java

This is an article about my ongoing research regarding the state of Antlr grammars for parsing Java. Some of the tests are taking weeks of computing time, so the results are preliminary.

Antlr is a popular LL(*) parser generator for recognizers of C#, Java, and many other programming languages. For Java, there are three grammars available on the Antlr grammar website: Java, Java8, and Java9. If you are a developer who hasn’t followed the maintenance history, it is unclear which grammar one should choose. Some of the changes that have been made to one grammar have not been applied to the other grammars. The basis for all grammars, however, is The Java Language Specification. Unfortunately, the latest available now is version 13 making all of them out of date.

The Java grammar was written by Parr originally with left recursion and common left-factors removed. (For additional information on left recursion removal, see this blog series.) The code has not been updated in the last year and apparently does accept anything more recent than Java 8. After the release of Antlr version 4, left recursion and common left-factors were allowed. The Java8 code was written by Harwell and Parr with this in mind. It has been last updated a few months ago. The Java9 code was forked by Chan from the Java8 grammar and has been last updated two years ago. It apparently accepts Java 9 source code. Both Java8 and Java9 are very slow.

Which brings us to the following questions: How well do these grammars perform? How well do they accept currently available open-source code? Is it possible to improve the performance of the grammars? Which one would be best to choose to update to the current specification?

In order to answer these questions, we studied the Java and Java9 grammars for performance and acceptance of several open-source projects.

Methods

1. Three grammars for Java that were tested:

2. These were tested against two runtime targets

  • Java
  • C#

3. The Java source code that was used for the tests:

  • http://hg.openjdk.java.net/jdk/jdk
    • 17234 files; 4816078 lines. (Using the Java grammar, 25349955 tokens; 38019443 tree nodes.
  • )https://github.com/AndroidSDKSources/android-sdk-sources-for-api-level-5.git
    • 11101 files; 2803192 lines. (Using the Java grammar, 16567260 tokens; 25457752 tree nodes in all parse trees generated.)
  • https://github.com/AndroidSDKSources/android-sdk-sources-for-api-level-29.git
    • 11473 files; 4175779 lines. (Using the Java grammar, 23624317 tokens; 35668427 tree nodes.)

4. Source and scripts for testing are here.

5. Machine 1: AMD Ryzen 7 2700 eight-core processor, ASRock B450 Gaming-ITX/ac, 16 GB DDR4-2666 (1333 MHz) memory; Antlr 4.7.2 tool and runtime; Visual Studio 2019; Java SE 11.0.4.

6. Machine 2: AMD Ryzen 3 2200G four-core processor, Gigabyte Technology Co. Ltd. B450M DS3H-CF, 16 GB DDR4 (1066 MHz) memory; Antlr 4.7.2 tool and runtime; Visual Studio 2019; Java SE 11.0.4. Used to count nodes in parse tree.

Preliminary results:

  • In the Java grammar, white space is shunted into the HIDDEN channel. In the Java9 grammar, white space is thrown away via the “skip” Antlr keyword. After adjusting the grammars and token types, Antlr generates lexers that produce the same token input stream from anecdotal evidence of a few large test cases.
  • A Java source that ran particularly slow for the Java9 grammar compared to the Java grammar was android-sdk-sources-for-api-level-5/com/android/phone/BluetoothHandsfree.java. This file produced a large number of parse tree nodes associated with expressions, e.g., primary, assignmentExpression, expression, conditionalExpression, conditionalOrExpression, conditionalAndExpression, exclusiveOrExpression, inclusiveOrExpression, andExpression, equalityExpression, relationalExpression, shiftExpression, additiveExpression, multiplicativeExpression, postfixExpression, unaryExpressionNotPlusMinus, unaryExpression, identifier, etc. The Java9 grammar disambiguates expression using the usual refactoring, but notably includes left-recursive productions for andExpression, equalityExpression, relationalExpression, shiftExpression, additiveExpression, etc. Although Antlr handles these productions, it it very expensive to use.
  • Running a Net Core program through dotnet.exe against a stub input program has a minimum runtime of 1.2s.
  • For Net Core, the Java9 grammar ran on average 40x’s slower compared to the Java grammar for the Android 5 source code.
  • There were not any differences in the parse trees constructed between C# and Java for the Java grammar.
  • Out of 39673 Java source files, 134 parsed with errors with the Java grammar.
  • The file android-sdk-sources-for-api-level-29/com/android/server/pm/ActivityManagerService.java parsed particularly slow with Java9: 18 m 50 s with C#!
  • C# tests of all three libraries of code:
    • Java/ grammar
      • android-sdk-sources-for-api-level-5 – 50m 10s
        • All 11101 files parsed:
          • No exceptions
          • 1 file parsed with errors.
      • android-sdk-sources-for-api-level-29 – 54m 7s
        • All 11473 files parsed:
          • No exceptions.
          • 47 files parsed with errors.
      • jdk – 1h 19m 1s
        • All 17234 files parsed:
          • One file parsed with exception.
          • 86 files parsed with errors.
    • Java9/ grammar
      • android-sdk-sources-for-api-level-5 – 20h 53m 59s
        • All 11101 files parsed:
          • No exceptions
          • 5 files parsed with errors.
      • android-sdk-sources-for-api-level-29 – 49h 20m 35s
        • All 11473 files parsed:
          • No exceptions.
          • 3 files parsed with errors.
      • jdk – 29h 55m 13s
        • All 17234 files parsed:
          • One file parsed with exception.
          • 3 files parsed with errors.
    • Java8/ grammar:
      • android-sdk-sources-for-api-level-5 – 14h 37m 06s
      • android-sdk-sources-for-api-level-29 – 20h 42m 28s
      • jdk – 31h 55m 37s
  • Java tests of all three libraries of code:
    • Java/ grammar
      • android-sdk-sources-for-api-level-5 – 1h 14m 27s
        • All 11101 files parsed:
          • No files with exceptions.
          • 1 file parsed with errors.
      • android-sdk-sources-for-api-level-29 – 1h 20m 13s
        • All 11473 files parsed:
          • No files with exceptions.
          • 47 files parsed with errors.
      • jdk – 1h 56m 39s
        • All 17234 files parsed:
          • One file parsed with exception.
          • 86 files parsed with errors.
    • Java9/ grammar
      • android-sdk-sources-for-api-level-5 – 9h 06m 22s
        • 11098 files parsed:
          • 3 files caused a complete crash uncaught.
          • No files parsed with exception.
          • 5 files parsed with errors.
      • android-sdk-sources-for-api-level-29 – 23h 51m 24s
        • 11448 files parsed:
          • 25 files caused a complete crash uncaught.
          • No files parsed with exception.
          • 3 files parsed with errors.
      • jdk – 14h 47m 9s
        • 17220 files parsed:
          • 14 files caused a complete crash uncaught.
          • 1 file parsed with exception.
          • 2 files parsed with errors.
  • More files parsed with the Java9 grammar than the Java grammar.
  • For ActivityManagerService.java, is the runtime proportional to the number of nodes in the parse tree? No!
    • Java grammar– 1.58 s, 243452 nodes
      • Top nodes:
        • 8497 blockStatement
        • 9331 statement
        • 21677 primary
        • 37843 expression
        • 50202 HIDDEN
        • 107849 TOKEN
    • Java9 grammar– 27 m 8 s, 559781 nodes.
      • Top nodes:
        • 8491 blockStatement
        • 9018 primaryNoNewArray_lfno_primary
        • 9074 statement
        • 9086 primary
        • 11487 expressionName
        • 14756 assignmentExpression
        • 14761 expression
        • 14915 conditionalExpression
        • 15133 conditionalOrExpression
        • 15485 conditionalAndExpression
        • 15535 exclusiveOrExpression
        • 15535 inclusiveOrExpression
        • 15614 andExpression
        • 16821 equalityExpression
        • 17332 shiftExpression
        • 17355 relationalExpression
        • 18758 additiveExpression
        • 18890 multiplicativeExpression
        • 19145 postfixExpression
        • 19289 unaryExpressionNotPlusMinus
        • 19311 unaryExpression
        • 34858 identifier
        • 50202 HIDDEN
        • 107849 TOKEN
    • 1.58 / 243452 = 6.5 x 10^-6 s / node.
      • 1.58 / 85401 = 1.9 x 10^-5 s / node.
    • 1628 / 559781 = 2.9 x 10^-3 s / node.
      • 1628 / 401730 = 4.1 x 10^-3 s / node.

Update November 11, 2019

Posted in Tip