Altgen — A lightweight tree generator

Overview

This is a very compact parser and abstract syntax tree generator for Java that

features a very terse, but readable, meta language for specifying the front-end of a compiler (up to the generation of abstract syntax trees), achieving conciseness by making heavy use of convention over markup;
has a fully declarative metalanguage;
can support different grammar notations; and
generates statically-type syntax trees.

The generator itself is currently under development, and will eventually be hosted at Google Code. The principles behind the framework are described in an upcoming paper in IASTED SEA 2009. The implementation is interesting because so much of the generated parser and tree builder is inferred from the input grammar, rather than configured.

Status

Grammmar annotation

Here is a look at the current state of the generator. One supplies an annotated grammar such as the following:

ID      => @[A-Za-z][A-Za-z0-9_]+
NUMLIT  => @\d+(\.\d+([Ee][+-]?\d+)?)?
STRLIT  => @"[^"\p{Cc}]*"
SKIP    => @\s+
Program => Block {Program block}
Block   => (decs*:Dec ";")* (stmts*:Stmt ";")+ {Block decs stmts}
Dec     => "var" ID ("=" Exp)? {Var:Dec id exp}
        |  "fun" ID "(" params:IdList? ")" "=" Exp
               {Fun:Dec id params exp}
Stmt    => ID "=" Exp {Assign:Stmt id exp}
        |  "read" IdList {Read:Stmt idList}
        |  "write" ExpList {Write:Stmt expList}
        |  "while" Exp "do" Block "end" {While:Stmt exp block}
IdList  => id*:ID ("," id*:ID)*
ExpList => exp*:Exp ("," exp*:Exp)*
Exp     => left:Term (op:("+" | "-") right:Term
               left:{Bin:Exp op left right}
           )*
Term    => left:Factor (op:("*" | "/") right:Factor
               left:{Bin:Exp op left right}
           )*
Factor  => val:NUMLIT {NumLit:Exp val}
        |  val:STRLIT {StrLit:Exp val}
        |  ID {Ref:Exp id}
        |  Call
        |  "(" e:Exp ")" ^e
Call    => ID "(" args:ExpList? ")" {Call:Exp id args}

The generator produces four artifacts:

A scanner (lexical analyzer) for the language described by this annotated grammar
A parser for the language
A collection of Java classes for the abstract syntax nodes of the specified language
A visitor framework, so compiler writers need only add per-node processing code to use the generated AST classes.

Scanner Generation

Altgen features a rather simple scanner generator; however, the scanner generation is of interest because it must infer the token set from the grammar. Particular attention has been given to making the generated scanner readable. For the grammar above, altgen produces:

package com.example;

import java.io.IOException;
import java.io.LineNumberReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Token {

    public static enum Type {
        TOKEN_4("\\Q;\\E"),
        TOKEN_5("var"),
        TOKEN_6("\\Q=\\E"),
        TOKEN_7("fun"),
        TOKEN_8("\\Q(\\E"),
        TOKEN_9("\\Q)\\E"),
        TOKEN_10("read"),
        TOKEN_11("write"),
        TOKEN_12("while"),
        TOKEN_13("do"),
        TOKEN_14("end"),
        TOKEN_15("\\Q,\\E"),
        TOKEN_16("\\Q+\\E"),
        TOKEN_17("\\Q-\\E"),
        TOKEN_18("\\Q*\\E"),
        TOKEN_19("\\Q/\\E"),
        ID("[A-Za-z][A-Za-z0-9_]+"),
        NUMLIT("\\d+(\\.\\d+([Ee][+-]?\\d+)?)?"),
        STRLIT("\"[^\"\\p{Cc}]*\""),
        SKIP("\\s+");

        Pattern pattern;

        private Type(String pattern) {
            this.pattern = Pattern.compile(pattern);
        }
    }

    final Type type;
    final int line;
    final int column;
    final String lexeme;

    public Token(Type type, int line, int column, String lexeme) {
        this.type = type;
        this.line = line;
        this.column = column;
        this.lexeme = lexeme;
    }

    public static List<Token> tokenize(LineNumberReader reader) throws IOException {
        List<Token> tokens = new ArrayList<Token>();
        while (true) {
            String source = reader.readLine();
            if (source == null) return tokens;
            int line = reader.getLineNumber();
            int column = 0;
            final int end = source.length();
            Matcher m = Pattern.compile("").matcher(source);
            m.useTransparentBounds(true).useAnchoringBounds(false);
            Tokenizing: while (column < end) {
                m.region(column, end);
                for (Type t : Type.values()) {
                    if (m.usePattern(t.pattern).lookingAt()) {
                        if (t != Type.SKIP) {
                            tokens.add(new Token(t, line, m.start() + 1,
                                    source.substring(m.start(), m.end())));
                        }
                        column = m.end();
                        continue Tokenizing;
                    }
                }

                // Text at current position doesn't match any token
                throw new RuntimeException("Lexical error at line " + line + ", column" + column);
            }
        }
    }
}

Parser Generation

Currently altgen generates only stubs for the parsing functions in a recursive descent parser. This means that it only accepts LL grammars. Note that helper methods allow for lookahead. For our running example, the following is produced:

package com.example;

import static com.example.Token.Type.ID
import static com.example.Token.Type.NUMLIT
import static com.example.Token.Type.STRLIT
import static com.example.Token.Type.SKIP
import static com.example.Token.Type.TOKEN_4
import static com.example.Token.Type.TOKEN_5
import static com.example.Token.Type.TOKEN_6
import static com.example.Token.Type.TOKEN_7
import static com.example.Token.Type.TOKEN_8
import static com.example.Token.Type.TOKEN_9
import static com.example.Token.Type.TOKEN_10
import static com.example.Token.Type.TOKEN_11
import static com.example.Token.Type.TOKEN_12
import static com.example.Token.Type.TOKEN_13
import static com.example.Token.Type.TOKEN_14
import static com.example.Token.Type.TOKEN_15
import static com.example.Token.Type.TOKEN_16
import static com.example.Token.Type.TOKEN_17
import static com.example.Token.Type.TOKEN_18
import static com.example.Token.Type.TOKEN_19

import java.util.ArrayList;
import java.util.List;


public class Parser {

    private List<Token> tokens;
    private int current = 0;

    public Parser(List<Token> tokens) {
        this.tokens = tokens;
    }

    // Returns whether the next tokens are the given sequence
    private boolean on(Token.Type... types) {
        for (int i = 0; i < types.length; i++) {
            if (current + i >= tokens.size()
                    || tokens.get(current + i).type != types[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns whether the current token is one of the given types
    private boolean onOneOf(Token.Type... types) {
        for (Token.Type type: types) {
            if (current < tokens.size() && tokens.get(current).type == type) {
                return true;
            }
        }
        return false;
    }

    // Returns the current token and advances the current pointer
    private Token match() {
        return tokens.get(current++);
    }

    // Matches the current token only if it has the expected type
    private Token match(Token.Type type) {
        if (on(type)) {
            return match();
        } else {
            throw syntaxError("Expected " + type);
        }
    }

    // Compose an exception with the current line/col and custom message
    private RuntimeException syntaxError(String message) {
        Token t = tokens.get(current);
        return new RuntimeException("Syntax error near line "
                + t.line + ", column " + t.column + ": " + message);
    }

    public Program parseProgram() {
        throw new UnsupportedOperationException();
    }

    public Block parseBlock() {
        throw new UnsupportedOperationException();
    }

    public Dec parseDec() {
        throw new UnsupportedOperationException();
    }

    public Stmt parseStmt() {
        throw new UnsupportedOperationException();
    }

    public List parseIdList() {
        throw new UnsupportedOperationException();
    }

    public List parseExpList() {
        throw new UnsupportedOperationException();
    }

    public Exp parseExp() {
        throw new UnsupportedOperationException();
    }

    public Exp parseTerm() {
        throw new UnsupportedOperationException();
    }

    public Exp parseFactor() {
        throw new UnsupportedOperationException();
    }

    public Exp parseCall() {
        throw new UnsupportedOperationException();
    }

}

AST Class Generation

A node class is generated for each of the non-terminals inferred from the grammar. Here are a couple examples:

public class While extends Stmt {

    private Exp exp;
    private Block block;

    public While(Exp exp, Block block) {
        this.exp = exp;
        this.block = block;
    }

    public Exp getExp() {
        return exp;
    }

    public Block getBlock() {
        return block;
    }

    public void accept(Visitor v) {
        v.visitWhile(this);
    }
}

public class Block {

    private List<Dec> decs;
    private List<Stmt> stmts;

    public Block(List<Dec> decs, List<Stmt> stmts) {
        this.decs = decs;
        this.stmts = stmts;
    }

    public List<Dec> getDecs() {
        return decs;
    }

    public List<Stmt> getStmts() {
        return stmts;
    }

    public void accept(Visitor v) {
        v.visitBlock(this);
    }
}

Vistor Framework

Currently altgen produces only an interface for the classic Visitor pattern; in the above running example this would be:

package com.example;
public interface Visitor {
    void visitRef(Ref node);
    void visitAssign(Assign node);
    void visitCall(Call node);
    void visitFun(Fun node);
    void visitStrLit(StrLit node);
    void visitBin(Bin node);
    void visitExp(Exp node);
    void visitNumLit(NumLit node);
    void visitWhile(While node);
    void visitWrite(Write node);
    void visitStmt(Stmt node);
    void visitDec(Dec node);
    void visitRead(Read node);
    void visitProgram(Program node);
    void visitVar(Var node);
    void visitBlock(Block node);
}

Users can produce several implementations of this interface for each kind of tree operation (e.g., writing as XML, semantic analysis, pretty printing, code generation, etc.)

In the future we will autogenerate other, more sophisticated navigation mechanisms, such as extrinsic visitors and walkabouts.