March madness…

For several months, I had been editing a new edition of a textbook (Atlas of the Canine Brain, ISBN 978-0-916182-17-5). This book was first published in Russian in 1959, then translated and published in English in 1964. Although the English book was for sale, the publishing company (NPP Books) had only a limited number of copies left. So, a Print-On-Demand (POD) version of the book was needed. Of course, in 1964 there were no personal computers. (Even in the early '70's, I was still using punched cards.) The book was written by typewritten on 8.5" by 11" paper, but the original manuscript, which also included the figures, was lost. Fortunately, the text and figures were recovered from the Russian and English books using a scanner and optical character recognition (OCR). Call me old fashioned, but it still seems quite remarkable that the technology exists to recover text from old books.

However, despite the help of these technologies, the results were far from having the text in a document markup language (LATEX), which is used to produce a typeset quality PDF file of the book. Current OCR technologies do not produce 100% accurate results, do not generally work well when text is underlined, is super- and subscripted, has language-dependent character accents, or text that is formatted in tables.

In addition, the bibliography was not in an acceptable style.  Citations throughout the book could not be traced easily to a particular reference in the bibliography. Consequently, the bibliography needed to be converted into BIBTEX format. Although I could have edited the large bibliography using Word or OpenOffice, and use Writer2BibTeX to output BIBTEX,  I decided to use Antlr, a compiler front-end generator, to parse and translate the bibliography from unformatted text into BIBTEX.

The problem

A program is to be written that will input bibliography entries that are in a standard bibliography format for a book.  These entries were produced through the process of scanning the images of the bibliography, converted into text using OCR, then converted to a TEX format, using OpenOffice Writer and the Write2Latex plugin.  The output of the program is a bibliography in BIBTEX format.

INPUT:

Abuladze, K.S., Zelikin, I.U., and Rosenthal, I.S.: The reactions of the
dog after the removal of the entire cortex of the right hemisphere and
the motor region of the left hemisphere. Archive of Biological
Sciences, 61, 3, 94,
1941.}

{\selectlanguage{english}\sffamily
Ivanov-Smolenskiy, A.G.: Outlines of patho-physiology of higher nervous
activity. M., 1949.}

{\selectlanguage{english}\sffamily
R\"{u}dinger, N.: licher die Hirne verschiedener Hunderassen. Sitzungsber.
d. Wissensch. zu M\"{u}nchen, 1894.}

{\selectlanguage{english}\sffamily
Clark le Gros, W.E.: The homologies of the pulvinar in mammals. Monit.
Zool. ital., 41, 1931.}

{\selectlanguage{english}\sffamily
Adrianov, O.S. and Mering, T.A.: Materials on the morphology and
physiology of
the cortical ends of the analysors of the dog. VIII All-Union convention
of physiologists, biochemists and pharmacologits. Theses of lectures.
M., 9,
1955.}

 

OUTPUT:

@article {
	testx1,
	author = { {K. S.} {Abuladze} AND {I. U.} {Zelikin} AND {I. S.} {Rosenthal} },
	title  = { The reactions of the  dog after the removal of the entire cortex of the right hemisphere and  the motor region of the left hemisphere. },
	journal = {  Archive of Biological  Sciences, 61, 3, 94. },
	year   = { 1941 }
}

@article {
	testx2,
	author = { {A. G.} {Ivanov-Smolenskiy} },
	title  = { Outlines of patho-physiology of higher nervous  activity. },
	journal = {  M. },
	year   = { 1949 }
}

@article {
	testx3,
	author = { {N.} {R\"{u}dinger} },
	title  = { licher die Hirne verschiedener Hunderassen. },
	journal = {  Sitzungsber.  d. Wissensch. zu M\"{u}nchen. },
	year   = { 1894 }
}

@article {
	testx4,
	author = { {W. E.} {Clark le Gros} },
	title  = { The homologies of the pulvinar in mammals. },
	journal = {  Monit.  Zool. ital., 41. },
	year   = { 1931 }
}

@article {
	testx5,
	author = { {O. S.} {Adrianov} AND {T. A.} {Mering} },
	title  = { Materials on the morphology and  physiology of  the cortical ends of the analysors of the dog. },
	journal = {  VIII All-Union convention  of physiologists, biochemists and pharmacologits. Theses of lectures.  M., 9. },
	year   = { 1955 }
}

The input format of each reference is relatively simple to state: it is a list of author(s), followed by a colon, the title, the journal or book, and finally, the year that it was published. There were several nuances in this simple definition: the text could contain LaTex formatting directives for generating special character accents, e.g., '\"{u}'; the authors are optionally separated by commas; and, last name may be complex, like "Clark le Gros". As in so many parsing problems, writing the grammar for this "simple" problem can be quite difficult because of these little nuances, and because Antlr is difficult to use.  The experience brought back many unpleasant memories of Antlr (because it is not a well-documented tool), and why I have many issues with much of the free software that developers use.  But, that is another story…

The Solution

Doit.g:

grammar Doit;

options {
    backtrack=true;
    output=AST;
}

tokens {
    ALLREFS;
    REF;
    NAME;
    FIRSTNAME;
    LASTNAME;
    DATE;
    TITLE;
}

@members{
    boolean matchdigits(TokenStream input)
    {
        // return true if the lookahead isn't the end of the reference.  This occurs when the input is "date . }".
        int p = 1;
        int t = input.LA(p);
        if (t != DIGITS)
        {
            return false;
        }
        while (t == DIGITS)
        {
            t = input.LA(++p);
        }
        if (! (t == PERIOD || t == MINUS))
        {
            return true;
        }
        // two cases: t was a period or t was a minus.
        if (t == MINUS)
        {
            t = input.LA(++p);
            if (t != DIGITS)
            {
                return true;
            }
            while (t == DIGITS)
            {
                t = input.LA(++p);
            }
            if (! (t == PERIOD))
            {
                return true;
            }
        }
        // skip past period.
        t = input.LA(++p);
        if (t != CC)
        {
            return true;
        }
        return false;
    }

    boolean matchlastname(TokenStream input)
    {
        // match long name.
        int p = 1;
        int t = input.LA(p);
        if (t != WORD)
        {
            return false;
        }
        while (t == WORD || t == OC || t == CC || t == WS || t == MINUS || t == TEX)
        {
            t = input.LA(++p);
            if (input.LT(1).getText().equals("and"))
                return true;
            if (input.LT(1).getText().equals("und"))
                return true;
            if (input.LT(1).getText().equals("et"))
                return true;
        }
        if (t == COMMA)
        {
            return true;
        }
        if (t == COLON)
        {
            return true;
        }
        if (t == PERIOD)
        {
            return false;
        }
        return false;
    }

}

prog:   ref+ WS? EOF -> ^(ALLREFS ref+)
    ;

ref:   block
    ;

block:
    WS!
    OC!
    stuff
    CC!
    ;

stuff:
    header
    authors
    COLON
    rest
    date
        -> ^(REF authors ^(TITLE rest) date)
    ;

header:
        HEADER WS
    ;

authors :
    name (
        COMMA! WS! (andauthor | name)
        | WS! andauthor
        )*
    ;

andauthor:
    { input.LT(1).getText().equals("and")
      || input.LT(1).getText().equals("und")
      || input.LT(1).getText().equals("et")
        }? WORD! WS! name
    ;

name:
    (
        lastname
        ((COMMA WS firstname)=> COMMA WS firstname)?
    ) -> ^(NAME (^(FIRSTNAME firstname))? ^(LASTNAME lastname))
    ;

lastname:
    { matchlastname(input) }? ( WORD ((MINUS WORD) | (OC WORD CC WORD?) | TEX | (WS WORD))* )
    ;

firstname:
    WORD PERIOD (WORD PERIOD (WORD PERIOD (WORD PERIOD)? )? )?
    ( { !(
        input.LT(1).getText().equals("and")
        || input.LT(1).getText().equals("und")
        || input.LT(1).getText().equals("et")
        ) }? WORD )?
    ;

rest:
    (WORD
        | PERIOD
        | COLON
        | COMMA
        | MINUS
        | { matchdigits(input) }?=> DIGITS
        | esc
        | WS
        | TEX
        )+
    ;

esc:
    OC (WORD | MINUS)+ CC
    ;

date:
        DIGITS (MINUS DIGITS)?  PERIOD -> ^(DATE DIGITS (MINUS DIGITS)?)
    ;

PERIOD :
    '.'
    ;

COLON :
    ':'
    ;

MINUS:
    '-'
    ;

DIGITS:
    ('0' .. '9')+
    ;

WORD  :
    ('a'..'z'
        | 'A'..'Z'
        | '"'
        | '\''
        | '`'
        | '?'
        | '/'
        | ';'
        | '('
        | ')'
        | '=')+
    ;

TEX:
    '\\' .
    ;

COMMA:
    ','
    ;

HEADER:
    '\\selectlanguage{english}\\sffamily'
    ;

OC :
    '{'
    ;

CC :
    '}'
    ;

WS  :
    (
        ' '
        | '\t'
        | '\r'? '\n'
        | '%' .* '\r'? '\n'
        )+
    ;

Doit.java:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import java.io.*;

public class Doit {

    int my_i = 0;

    String my_chop(String inp)
    {
        // remove any trailing blanks and comma.
        int last = inp.length() - 1;
        while (last > 0)
        {
            if (inp.charAt(last) == ' '
                  || inp.charAt(last) == ','
                  || inp.charAt(last) == '\n'
                  || inp.charAt(last) == '\r')
            {
                last--;
            }
            else
                break;
        }
        String tmp = inp.substring(0, last+1);
        // make sure it ends with a '.'.
        if (tmp.charAt(tmp.length()-1) != '.')
            tmp += '.';
        return tmp;
    }

    String my_year(String inp)
    {
        // remove any trailing .
        int last = inp.length() - 1;
        while (last > 0)
        {
            if (inp.charAt(last) == ' '
                  || inp.charAt(last) == '.'
                  || inp.charAt(last) == '\n'
                  || inp.charAt(last) == '\r')
            {
                last--;
            }
            else
                break;
        }
        return inp.substring(0,last+1);
    }

    String my_escape(String inp)
    {
        // escape any backslashes that don't make sense.
        String s = new String();
        int last = 0;
        s = "";
        while (last < inp.length())
        {
            if (inp.charAt(last) == '\\' && inp.charAt(last+1) == 't')
                s += "\\\\";
            else if (inp.charAt(last) == '\\' && inp.charAt(last+1) == 'd')
                s += "\\\\";
            else
                s += inp.charAt(last);
            last++;
        }
        return s;
    }

    String my_rmnewline(String inp)
    {
        // remove new lines.
        String s = new String();
        int last = 0;
        s = "";
        while (last < inp.length())
        {
            if (!(inp.charAt(last) == '\n' || inp.charAt(last) == '\r'))
                s += inp.charAt(last);
            else
                s += ' ';
            last++;
        }
        return s;
    }

    String walk(String id, Tree ast) {
        int i;
        String result = "";
        int t = ast.getType();
        switch(t) {
            case DoitParser.ALLREFS:
                for (i = 0; i < ast.getChildCount(); ++i)
                    result += walk(id, (Tree)ast.getChild(i));
                break;

            case DoitParser.REF:
            {
                result += "\n@article {\n";
                result += "     " + id + (++my_i) + ",\n";
                boolean first = true;
                for (i = 0; i < ast.getChildCount(); ++i)
                {
                    Tree c = (Tree)ast.getChild(i);
                    int ct = c.getType();
                    if (ct == DoitParser.NAME)
                    {
                        if (first)
                            result += " author = { ";
                        else
                            result += " AND ";
                        first = false;
                        result += walk(id, (Tree)ast.getChild(i));
                        continue;
                    } else if (ct == DoitParser.TITLE) {
                        result += " },\n";
                    }
                    result += walk(id, ast.getChild(i));
                }
                result += "}\n\n";
                break;
            }

            case DoitParser.NAME:
                for (i = 0; i < ast.getChildCount(); ++i)
                    result += walk(id, (Tree)ast.getChild(i));
                break;

            case DoitParser.FIRSTNAME:
                result += "{";
                for (i = 0; i < ast.getChildCount(); ++i)
                {
                    result += ast.getChild(i).getText();
                    if (ast.getChild(i).getType() == DoitParser.PERIOD
                          && i != (ast.getChildCount()-1))
                        result += ' ';
                }
                result += "}";
                result += " ";
                break;

            case DoitParser.LASTNAME:
                result += "{";
                for (i = 0; i < ast.getChildCount(); ++i)
                    result += ast.getChild(i).getText();
                result += "}";
                break;

            case DoitParser.TITLE:
            {
                String temp = " title  = {";
                boolean first = true;
                String last = null;
                String lastm1 = null;
                for (i = 0; i < ast.getChildCount(); ++i) {
                    temp += my_rmnewline(ast.getChild(i).getText());
                    // On first period of long word, make this the end
                    // of the title and start
                    if (ast.getChild(i).getType() == DoitParser.PERIOD && first
                          && last != null)
                    {
                        if (last.length() > 1
                              || (lastm1 != null && lastm1.compareTo(".") == 0)
                              || (last.compareTo("}") == 0))
                        {
                            temp += " },";
                            temp += '\n';
                            temp += "   journal = { ";
                            first = false;
                        }
                    }
                    lastm1 = last;
                    last = ast.getChild(i).getText();
                }
                result += my_chop(my_escape(temp));
                result += " },";
                break;
            }

            case DoitParser.DATE:
                result += "\n   year   = { ";
                for (i = 0; i < ast.getChildCount(); ++i)
                    result += ast.getChild(i).getText();
                result += " }\n";
                break;

            default:
                result += ast.getText();
                break;
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String s: args) {
            try {
                System.out.println("Input file is " + s);
                File inFile = new File(s);
                String inFileName = inFile.getName();
                int whereDot = inFileName.lastIndexOf('.');
                if (!(0 < whereDot && whereDot <= inFile.getName().length() - 2 ))
                {
                    System.out.println("Illegal file name.");
                    continue;
                }
                String prefix = inFileName.substring(0,whereDot) + "x";
                String outFileName = inFileName.substring(0,whereDot) + ".bib";
                System.out.println("Output file is " + outFileName);

                // Create output file.
                FileOutputStream out;
                PrintStream p;
                out = new FileOutputStream(outFileName);
                p = new PrintStream(out);

                System.setIn(new FileInputStream(s));
                ANTLRInputStream input = new ANTLRInputStream(System.in);
                DoitLexer lexer = new DoitLexer(input);
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                DoitParser parser = new DoitParser(tokens);
                DoitParser.prog_return result = parser.prog();
                Tree t = (Tree)result.getTree();
                Doit doit = new Doit();
                String translation = doit.walk(prefix, t);
                p.println(translation);
                out.close();

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

The grammar for the list of unformatted references starts on line 98 of Doit.g. This grammar produces an abstract syntax tree (AST), which is then walked by a simple tree walker (starting on line 87 of Doit.java) to print out the bibliography in BIBTEX format.

This grammar requires several syntactic and semantic predicates, which are hard to use and understand.  On line 141 of Doit.g, the syntactic predicate (COMMA WS firstname)=> is used in to recognize authors with a first name after looking ahead for the first name.  For example, when matching "Abuladze, K.S.", the comma is only recognized when the first name "K.S." is also recognized. Otherwise, the comma separates last names. On lines 132-135 of Doit.g, the semantic predicate

{ input.LT(1).getText().equals("and")
|| input.LT(1).getText().equals("und")
|| input.LT(1).getText().equals("et")
}?
gates the production (and alternatives that use the non-terminal andauthor), so that "and" (in English) starts another author.

Posted in Tip