Thursday, March 06, 2008

Parser generators for PHP

A couple of months ago, I wrote a CSS parser package, and accompanying data structures class for blog.com. At the time, I used php_parsergenerator. While I did manage to get a successful result, I also hit just about every limit in php_parsergenerator: It generates LR(0) parsers, and provides no infrastructure for a proper (fast) tokenizer. Transforming grammars into LR(0) is gruesome work, and the end result is usually a very extense grammar, work made only worse by the very terse error reporting by php_parsergenerator.

So, a couple of weeks ago, I dove in, and re-learned compiler theory. The result is this set of packages, which I plan on proposing into PEAR:
Generated parsers are LALR(n). The generator is smart enough to generate the LR(0) state set and only add lookaheads when there are parsing conflicts. The result is a smaller parsing table than would be created if the generator used the classical approach of generating LR(n) tables and then merging or the other common alternative of lookahead propagation when creating the LR(0) state set.

There are three other important differences to php_parsergenerator:
  • Grammar definition is not part of the parser generator. This means that any grammar definition grammar can be used by implementing a parser that translates the definition into a Structures_Grammar instance. I've implemented BNF and will use it to write EBNF, but grammar parsers are easy to write.
  • The parser driver code is not part of the generated code. A php_parsergenerator generated parser looks like this. A parser generated by Text_Parser_Generator looks like this, because it delegates most of the work to the Text_Parser_LALR parent class.
  • Error reporting by the parser generator is much improved. Text_Parser_Generator outputs, in the parser comments, the parser states. When reporting a conflict, it explains which rules in which states caused the conflict, and includes expected input information. It's usually trivial to spot the error in the grammar definition.
All of this code is now alpha quality. It is lightly unit-tested, and is lacking some API documentation and all usage documentation. The current plan now is to write Text_Parser_EBNF and then directly use the W3C official grammar to rewrite CSS_Parser. By then, I'll forward all of this for PEAR approval.

Anyhow, the objective of this post is to draw more eyes to the code (release early, release often). If you plan on using text parsers on php, take a look at the unit tests on the code here, and test-drive the code. I thank all feedback, namely bug reports.

Now, go and rip my code apart :-)
Posted by K at 20:19:53 | Permanent Link | Comments (0) |
Comments
Write a comment