Aurum is a Ruby-based LALR(n) parser generator that you can use to develop your own domain-specific languages, scripting languages and programming languages. Although it is yet another parser generator, Aurum differs from other widely used parser generators in several ways:
- One of the major goals of Aurum is to simplify external DSL development, especially Ruby external DSLs.
- Aurum uses an incremental LALR(n) algorithm instead of the commonly used LALR(1) or full LALR(n) algorithms. That means:
  - The user can express grammars in a more intuitive manner.
  - Complicated grammars are easier to handle, for example COBOL (LALR(2) or LALR(3)) or simplified natural language (LALR(3+)).
  - It is closer to Generalized LR in the languages it can recognize, but much faster.
  - Its parsing tables are smaller than those of full LALR(n)/LR(n) algorithms.
- Aurum supports grammar reuse, and will itself ship with some predefined common structures. One of the pain points of external DSLs is that you have to redefine lots of common structures, such as if statements and block structures. With Aurum, you can simply reuse them.
- Aurum uses a Ruby internal DSL as its meta-language, and provides a generic lexer/parser as well. You can test your grammar with the comprehensive testing libraries Ruby has (you can even develop your lexer/parser in TDD fashion).
- As the name suggests, Aurum, the Latin word for gold, is partially inspired by the GOLD Parsing System. A grammar you create with Aurum could be completely independent of any implementation language, even Ruby. (Not implemented yet :) )
OK, let's start with the 'Hello World' of compiler construction: expression evaluation.
require 'aurum'

class ExpressionGrammar < Aurum::Grammar
  tokens do
    ignore string(' ').one_or_more        # <= a
    _number range(?0, ?9).one_or_more     # <= b
  end

  precedences do                          # <= c
    left '*', '/'
    left '+', '-'
  end

  productions do                          # <= d
    expression expression, '+', expression {expression.value = expression1.value + expression2.value} # <= e
    expression expression, '-', expression {expression.value = expression1.value - expression2.value}
    expression expression, '*', expression {expression.value = expression1.value * expression2.value}
    expression expression, '/', expression {expression.value = expression1.value / expression2.value}
    expression '(', expression, ')' do expression.value = expression1.value end # <= f
    expression _number {expression.value = _number.value.to_i}
    expression '+', _number {expression.value = _number.value.to_i}
    expression '-', _number {expression.value = -_number.value.to_i}
  end
end
If you have any experience with other compiler compilers or parser generators, you can probably understand what happens above quite easily. Instead of explaining things like tokens, character classes and productions, I'd like to emphasise some Aurum conventions:
- At point a, we use the 'ignore' directive to declare patterns to be ignored, such as whitespace. 'string' is one of the helper methods (the others are 'enum', 'range' and 'concat') used to define lexical patterns; it creates a pattern that matches the given string exactly.
- At point b, we declare a lexical token named '_number'. In Aurum, lexical tokens, or terminals from the syntax perspective, always start with '_'. The expression '_token_name pattern' is equivalent to 'match pattern, :recognized => :_token_name'. The 'match' directive is the general way to associate a lexical action with a lexical pattern.
- At point c, we declare the operator precedences of the expression grammar. The earlier an operator is defined, the higher its precedence, so here '*' and '/' bind tighter than '+' and '-'.
- At point d, we declare the syntax rules of the expression grammar. According to the Aurum naming convention, all terminals start with '_' while all nonterminals start with a lower-case letter. String literals are interpreted as reserved words and added to the lexer automatically.
- At point e, we attach a semantic action to the addition rule. In a semantic action, you can access the objects on the value stack via the names of the corresponding symbols. If more than one symbol has the same name, you can tell them apart by the order in which they appear in the production (here 'expression' is the left-hand side, and 'expression1' and 'expression2' are the two operands).
- At point f, we use do..end instead of {..}, because a {..} block cannot attach to the trailing string literal ')'. Using a Ruby internal DSL as the meta-language is a double-edged sword: you have to bear its flaws while enjoying its benefits. There is no perfect world, is there?
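The reason behind point f is ordinary Ruby block binding rather than anything specific to Aurum: a {..} block attaches to the nearest method call, while do..end attaches to the outermost one. The sketch below uses two hypothetical methods, `rule` and `sym`, which are not part of Aurum, purely to show which call receives the block.

```ruby
# Hypothetical stand-ins, not Aurum methods: each reports whether it got a block.
def sym
  block_given? ? :sym_got_block : :sym_no_block
end

def rule(*parts)
  [block_given? ? :rule_got_block : :rule_no_block, *parts]
end

# Braces bind to the nearest call, so the block goes to `sym`.
with_braces = rule sym { }

# do..end binds to the outermost call, so the block goes to `rule`.
with_do_end = rule sym do
end

p with_braces  # prints [:rule_no_block, :sym_got_block]
p with_do_end  # prints [:rule_got_block, :sym_no_block]
```

When the last element of a rule is a string literal rather than a method call, a brace block has no call to attach to and Ruby rejects the line, which is why the parenthesised rule is written with do..end.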
Now, let's see how to use this expression grammar. You can use the helper method as below (it recalculates the lexical table and the parsing table on every call, so it can be quite slow):

puts ExpressionGrammar.parse_expression('1+1').value

or use the lexical table and parsing table to create your own lexer & parser:

lexer = Aurum::Engine::Lexer.new(ExpressionGrammar.lexical_table, '1+1')
parser = Aurum::Engine::Parser.new(ExpressionGrammar.parsing_table(:expression))
puts parser.parse(lexer).value

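Since the helper method rebuilds both tables on every call, one option when parsing many inputs is to build each table once and reuse it. The sketch below is a generic memoizing holder in plain Ruby. `TableCache` and the stand-in builder are hypothetical names, not part of Aurum, and whether `lexical_table` and `parsing_table` already cache their results is not stated in this post.

```ruby
# Hypothetical helper, not part of Aurum: builds each table at most once.
class TableCache
  def initialize
    @tables = {}
  end

  # Returns the cached table for `key`, running the block only on a miss.
  def fetch(key)
    @tables.fetch(key) { @tables[key] = yield }
  end
end

# Stand-in for an expensive table computation; counts how often it runs.
builds = 0
build_table = lambda do
  builds += 1
  { start: :expression }
end

cache = TableCache.new
first  = cache.fetch(:parsing) { build_table.call }
second = cache.fetch(:parsing) { build_table.call }

puts builds                # prints 1
puts first.equal?(second)  # prints true
```

With Aurum, the block passed to `fetch` would presumably wrap `ExpressionGrammar.lexical_table` or `ExpressionGrammar.parsing_table(:expression)`, with a fresh lexer created for each input string.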
To end this post, I'd like to give another grammar example, taken from Martin Fowler's HelloParserGenerator series:

require 'aurum'

Item = Struct.new(:name)

class Catalog < Aurum::Grammar
  tokens do
    ignore enum(" \r\n").one_or_more
    _item range(?a,?z).one_or_more
  end

  productions do
    configuration configuration, item {configuration.value = configuration1.value.merge({item.value.name => item.value})}
    configuration _ {configuration.value = {}}
    item 'item', _item {item.value = Item.new(_item.value)}
  end
end

config = Catalog.parse_configuration(<<EndOfDSL).value
item camera
item laser
EndOfDSL

puts config['camera'].name

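To make the semantic actions concrete, the plain-Ruby sketch below replays by hand what the two 'configuration' rules would compute for the input above, without using Aurum at all. It assumes the actions run in the usual bottom-up order for a left-recursive rule: the empty rule first, then one merge per item.

```ruby
Item = Struct.new(:name)

# Values the 'item' rule would produce for "item camera" and "item laser".
items = [Item.new('camera'), Item.new('laser')]

# The empty rule yields {}, then each reduction of
# 'configuration configuration, item' merges one more item in by name.
config = items.inject({}) do |configuration, item|
  configuration.merge({ item.name => item })
end

puts config['camera'].name  # prints "camera"
puts config.keys.inspect    # prints ["camera", "laser"]
```

Because `merge` returns a new hash each time, every reduction leaves the previous configuration value untouched.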
P.S.: This post is based on the development version of Aurum (0.2.0). You can get it from the svn repository.
P.P.S.: There is a more complicated example in the examples directory, a simple Smalltalk interpreter. Have fun :)