Table of Content

  1. Cepter, a Java parser generator

  2. Reference Manual

    1. Non-terminal syntax

    2. Rule syntax

    3. Argument operators

      1. Optional element (?)

      2. Repeated elements

      3. One or more sequence (+)

      4. Zero or more sequence (*)

      5. One or more delimited sequence (/+)

      6. Zero or more delimited sequence (/*)

    4. Short action specifications

      1. return

      2. new

      3. abstract

Cepter, a Java parser generator

Cepter is a Java LALR(k) parser generator. Due to its bottom-up parsing approach, it does support constructs which cannot be handled e.g. by Antlr. Due to configurable look-ahead k, it can also handle constructs that cannot be resolved by LALR(1) parsers like Bison. It has some unique features as well, which are documented in detail below. To give you a short list:

Method-like syntax
where each non-terminal is represented by a brace-enclosed group of rules, and each rule looks a lot like a method, with arguments given as a comma-separated list enclosed in parentheses, and the reduction code given in the brace-enclosed body. The types of the arguments will be grammar symbols, though, not Java types.
Subclassable
because reduction actions are realized through invocations of methods. Therefore you can derive your own variant of a parser by subclassing a generated parser class and overriding some of its methods.
Argument operators
allow the simple declaration of optional elements or sequences of elements. They look a lot like Kleene operators used in regular expressions.

Currently, Cepter is in the early stages of development. There are no releases available yet. This document here is currently in large parts more like a design blueprint than a documentation for a working implementation. If you like my ideas and want to help making them come true, please contact me about how you might contribute.

Reference Manual

Non-terminal syntax

The description of a non-terminal in Cepter has the following form:

Type name { rule… }
name
is the name of the non-terminal, i.e. an arbitrary identifier.
Type
is the Java type associated with each reduction of this non-terminal.
rule…
is a sequence of rules. Each rule indicates a possible production of the non-terminal.

Rule syntax

A typical rule will look somewhat like this:

methodName (symbol1, symbol2 a, "," . , symbol2 b) {
    return new SyntaxTreeObject(symbol1, a, b);
}
method name
denotes the name of the method that will contain the code for the reduction action of this rule. If the name is omitted, the name of the current non-terminal will be used instead. For multiple rules, this will lead to overloaded functions, but it may cause problems if different rules agree in the number and types of their arguments.
symbol
is the name of any (terminal or non-terminal) symbol which is required at this position. Instead of giving a name of a symbol, a string enclosed in double quotes may be used. A terminal lexer token matching this string will be accepted. A suitable rule will be added to the lexer implicitely unless there already exists an explicit lexer rule for just that string.
argument
names following the production symbols (like a or b in the example above) specify the name that should be used for the argument in question inside the code of the rule. Omitting the name will result in an argument named just like the symbol, which will cause problems if the same symbol is used repeatedly. Specifying a dot in the location of the argument name will cause the symbol to be omitted from the list of method arguments altogether.

Argument operators

The symbols of a rule might optionally be followed by one of the argument operators described below, in order to represent optional or repeated elements. The effect of these operators will be explained by giving equivalent constructs using additional non-terminals. Those equivalent constructs are only used internally, though, as the names they use for their non-terminals are not vaild non-terminal names in the regular grammar specification, and as the actions they execute are not represented as methods in the parser class.

Optional element (?)

An optional element may be denoted by following its symbol with a quotation mark. If the symbol is not present, it will be represented as null upon invocation of the reduction method. If the argument name of an optional element is omitted, it defaults to the name of the element symbol without the question mark. The operator implicitely adds the following nonterminal and set of rules:

TypeOfName name? {
    ()         { return null; }
    (name ret) { return ret;  }
}

Repeated elements

There are four different operators for sequences of repeated elements, depending on whether there must be at least one element in the sequence, and whether the elements are separated by some delimiter symbol. In each case, the sequence is passed to the reduction method as a list of the appropriate type. Delimiters will simply be dropped in the process, as they are not expected to carry any data of their own. If the argument name of such a sequence argument is omitted, it defaults to the name of the sequence element symbol, with an added letter 's'.

One or more sequence (+)

A sequence of one or more productions of the same symbol may be denoted by following the symbol with a plus sign. The operator implicitely adds the following nonterminal and set of rules:

List<TypeOfName> name+ {
    (name elt) {
        List<TypeOfName> lst = new ArrayList<TypeOfName>();
        lst.add(elt);
        return lst;
    }
    (name+ lst, name elt) {
        lst.add(elt);
        return lst;
    }
}

Zero or more sequence (*)

A sequence of zero or more productions of the same symbol may be denoted by following the symbol with an asterisk. The operator implicitely adds the following nonterminal and set of rules:

List<TypeOfName> name* {
    () {
        return new ArrayList<TypeOfName>();
    }
    (name* lst, name elt) {
        lst.add(elt);
        return lst;
    }
}

One or more delimited sequence (/+)

A sequence of one or more productions of the same symbol, separated by productions of a different symbol, may be denoted by following the symbol of the sequence element with a slash, the symbol of the separator, and finally a plus sign. Think of the slash as divided by. The operator implicitely adds the following nonterminal and set of rules:

List<TypeOfItem> item/separator+ {
    (item elt) {
        List<TypeOfItem> lst = new ArrayList<TypeOfItem>();
        lst.add(elt);
        return lst;
    }
    (item/separator+ lst, separator ., item elt) {
        lst.add(elt);
        return lst;
    }
}

Zero or more delimited sequence (/*)

A sequence of zero or more productions of the same symbol, separated by productions of a different symbol, may be denoted by following the symbol of the sequence element with a slash, the symbol of the separator, and fionally an asterisk. The operator implicitely adds the following nonterminal and set of rules:

List<TypeOfItem> item/separator* {
    () {
        return new ArrayList<TypeOfItem>();
    }
    (item/separator+ lst) {
        return lst;
    }
}

Short action specifications

Instead of giving a full reduction method body enclosed in curly braces, common actions may be shortened to a single keyword following the rule, without curly braces. In these cases, the rule must be terminated by a semicolon.

return

The return shortcut allows passing a single rule argument as the result of the rule. Any other arguments must be ignored (i.e. followed by a dot). The following two rules are equivalent:

(pre ., item itm, post .) { return itm; }
(pre ., item, post .) return;

new

The new shortcut allows passing all rule arguments directly to the constructor of some class. It may be followed by a class name, but if that is omitted, it defaults to the method name of the rule (which in turn defaults to the name of the non-terminal), but with the first character converted to upper case. If a class name is given, it may include generic type arguments. The following three rules are equivalent:

ruleName (foo a, sep ., bar b) { return new RuleName(a, b); }
ruleName (foo a, sep ., bar b) new RuleName;
ruleName (foo a, sep ., bar b) new;

abstract

Specifying the keyword abstract after a rule will suppress code generation altogether. Instead the corresponding method of the parser class as well (as the class as a whole) will be declared abstract. You'll have to implement it in a suitable subclass. A rule of the form

ruleName (foo, bar) abstract;

will result in the following method declaration in the parser class:

protected abstract TypeOfNonTerminal ruleName (TypeOfFoo foo, TypeOfBar bar);