This is a very compact parser and abstract syntax tree generator for Java that
The generator itself is currently under development, and will eventually be hosted at Google Code. The principles behind the framework are described in an upcoming paper in IASTED SEA 2009. The implementation is interesting because so much of the generated parser and tree builder is inferred from the input grammar, rather than configured.
Here is a look at the current state of the generator. One supplies an annotated grammar such as the following:
ID => @[A-Za-z][A-Za-z0-9_]+ NUMLIT => @\d+(\.\d+([Ee][+-]?\d+)?)? STRLIT => @"[^"\p{Cc}]*" SKIP => @\s+ Program => Block {Program block} Block => (decs*:Dec ";")* (stmts*:Stmt ";")+ {Block decs stmts} Dec => "var" ID ("=" Exp)? {Var:Dec id exp} | "fun" ID "(" params:IdList? ")" "=" Exp {Fun:Dec id params exp} Stmt => ID "=" Exp {Assign:Stmt id exp} | "read" IdList {Read:Stmt idList} | "write" ExpList {Write:Stmt expList} | "while" Exp "do" Block "end" {While:Stmt exp block} IdList => id*:ID ("," id*:ID)* ExpList => exp*:Exp ("," exp*:Exp)* Exp => left:Term (op:("+" | "-") right:Term left:{Bin:Exp op left right} )* Term => left:Factor (op:("*" | "/") right:Factor left:{Bin:Exp op left right} )* Factor => val:NUMLIT {NumLit:Exp val} | val:STRLIT {StrLit:Exp val} | ID {Ref:Exp id} | Call | "(" e:Exp ")" ^e Call => ID "(" args:ExpList? ")" {Call:Exp id args}
The generator produces four artifacts:
Altgen features a rather simple scanner generator; however, the scanner generation is of interest because it must infer the token set from the grammar. Particular attention has been given to making the generated scanner readable. For the grammar above, altgen produces:
package com.example; import java.io.IOException; import java.io.LineNumberReader; import java.util.ArrayList; import java.util.List; import java.util.regex.Matcher; import java.util.regex.Pattern; public class Token { public static enum Type { TOKEN_4("\\Q;\\E"), TOKEN_5("var"), TOKEN_6("\\Q=\\E"), TOKEN_7("fun"), TOKEN_8("\\Q(\\E"), TOKEN_9("\\Q)\\E"), TOKEN_10("read"), TOKEN_11("write"), TOKEN_12("while"), TOKEN_13("do"), TOKEN_14("end"), TOKEN_15("\\Q,\\E"), TOKEN_16("\\Q+\\E"), TOKEN_17("\\Q-\\E"), TOKEN_18("\\Q*\\E"), TOKEN_19("\\Q/\\E"), ID("[A-Za-z][A-Za-z0-9_]+"), NUMLIT("\\d+(\\.\\d+([Ee][+-]?\\d+)?)?"), STRLIT("\"[^\"\\p{Cc}]*\""), SKIP("\\s+"); Pattern pattern; private Type(String pattern) { this.pattern = Pattern.compile(pattern); } } final Type type; final int line; final int column; final String lexeme; public Token(Type type, int line, int column, String lexeme) { this.type = type; this.line = line; this.column = column; this.lexeme = lexeme; } public static List<Token> tokenize(LineNumberReader reader) throws IOException { List<Token> tokens = new ArrayList<Token>(); while (true) { String source = reader.readLine(); if (source == null) return tokens; int line = reader.getLineNumber(); int column = 0; final int end = source.length(); Matcher m = Pattern.compile("").matcher(source); m.useTransparentBounds(true).useAnchoringBounds(false); Tokenizing: while (column < end) { m.region(column, end); for (Type t : Type.values()) { if (m.usePattern(t.pattern).lookingAt()) { if (t != Type.SKIP) { tokens.add(new Token(t, line, m.start() + 1, source.substring(m.start(), m.end()))); } column = m.end(); continue Tokenizing; } } // Text at current position doesn't match any token throw new RuntimeException("Lexical error at line " + line + ", column" + column); } } } }
Currently altgen generates only stubs for the parsing functions in a recursive descent parser. This means that it only accepts LL grammars. Note that helper methods allow for lookahead. For our running example, the following is produced:
package com.example; import static com.example.Token.Type.ID import static com.example.Token.Type.NUMLIT import static com.example.Token.Type.STRLIT import static com.example.Token.Type.SKIP import static com.example.Token.Type.TOKEN_4 import static com.example.Token.Type.TOKEN_5 import static com.example.Token.Type.TOKEN_6 import static com.example.Token.Type.TOKEN_7 import static com.example.Token.Type.TOKEN_8 import static com.example.Token.Type.TOKEN_9 import static com.example.Token.Type.TOKEN_10 import static com.example.Token.Type.TOKEN_11 import static com.example.Token.Type.TOKEN_12 import static com.example.Token.Type.TOKEN_13 import static com.example.Token.Type.TOKEN_14 import static com.example.Token.Type.TOKEN_15 import static com.example.Token.Type.TOKEN_16 import static com.example.Token.Type.TOKEN_17 import static com.example.Token.Type.TOKEN_18 import static com.example.Token.Type.TOKEN_19 import java.util.ArrayList; import java.util.List; public class Parser { private List<Token> tokens; private int current = 0; public Parser(List<Token> tokens) { this.tokens = tokens; } // Returns whether the next tokens are the given sequence private boolean on(Token.Type... types) { for (int i = 0; i < types.length; i++) { if (current + i >= tokens.size() || tokens.get(current + i).type != types[i]) { return false; } } return true; } // Returns whether the current token is one of the given types private boolean onOneOf(Token.Type... types) { for (Token.Type type: types) { if (current < tokens.size() && tokens.get(current).type == type) { return true; } } return false; } // Returns the current token and advances the current pointer private Token match() { return tokens.get(current++); } // Matches the current token only if it has the expected type private Token match(Token.Type type) { if (on(type)) { return match(); } else { throw syntaxError("Expected " + type); } } // Compose an exception with the current line/col and custom message private RuntimeException syntaxError(String message) { Token t = tokens.get(current); return new RuntimeException("Syntax error near line " + t.line + ", column " + t.column + ": " + message); } public Program parseProgram() { throw new UnsupportedOperationException(); } public Block parseBlock() { throw new UnsupportedOperationException(); } public Dec parseDec() { throw new UnsupportedOperationException(); } public Stmt parseStmt() { throw new UnsupportedOperationException(); } public ListparseIdList() { throw new UnsupportedOperationException(); } public List parseExpList() { throw new UnsupportedOperationException(); } public Exp parseExp() { throw new UnsupportedOperationException(); } public Exp parseTerm() { throw new UnsupportedOperationException(); } public Exp parseFactor() { throw new UnsupportedOperationException(); } public Exp parseCall() { throw new UnsupportedOperationException(); } }
A node class is generated for each of the non-terminals inferred from the grammar. Here are a couple examples:
public class While extends Stmt { private Exp exp; private Block block; public While(Exp exp, Block block) { this.exp = exp; this.block = block; } public Exp getExp() { return exp; } public Block getBlock() { return block; } public void accept(Visitor v) { v.visitWhile(this); } }
public class Block { private List<Dec> decs; private List<Stmt> stmts; public Block(List<Dec> decs, List<Stmt> stmts) { this.decs = decs; this.stmts = stmts; } public List<Dec> getDecs() { return decs; } public List<Stmt> getStmts() { return stmts; } public void accept(Visitor v) { v.visitBlock(this); } }
Currently altgen produces only an interface for the classic Visitor pattern; in the above running example this would be:
package com.example; public interface Visitor { void visitRef(Ref node); void visitAssign(Assign node); void visitCall(Call node); void visitFun(Fun node); void visitStrLit(StrLit node); void visitBin(Bin node); void visitExp(Exp node); void visitNumLit(NumLit node); void visitWhile(While node); void visitWrite(Write node); void visitStmt(Stmt node); void visitDec(Dec node); void visitRead(Read node); void visitProgram(Program node); void visitVar(Var node); void visitBlock(Block node); }
Users can produce several implementations of this interface for each kind of tree operation (e.g., writing as XML, semantic analysis, pretty printing, code generation, etc.)
In the future we will autogenerate other, more sophisticated navigation mechanisms, such as extrinsic visitors and walkabouts.