Parser generators for PHP
A couple of months ago, I wrote a CSS parser package, and an accompanying data structures class, for blog.com. At the time, I used php_parsergenerator. While I did manage to get a working result, I also hit just about every limit in php_parsergenerator: it generates LR(0) parsers, and provides no infrastructure for a proper (fast) tokenizer. Transforming grammars into LR(0) is gruesome work, and the end result is usually a very large grammar. The very terse error reporting by php_parsergenerator only makes that work worse.
So, a couple of weeks ago, I dove in and re-learned compiler theory. The result is this set of packages, which I plan on proposing to PEAR:
- Tokenizer infrastructure packages: Text_Tokenizer, Text_Tokenizer_Regex and Text_Tokenizer_Regex_Matcher_ext
- Parser generation: Text_Parser_Generator
- Parser proper: Text_Parser
- Grammar description: Structures_Grammar
- Grammar parsing: Text_Parser_BNF
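To give an idea of what a regex-driven tokenizer does, here is a minimal sketch in plain PHP. It is not the Text_Tokenizer_Regex API: the function name sketch_tokenize, the rule names and the $skip parameter are all invented for illustration, and the rule patterns are assumed to contain no capture groups and no "~" character.

```php
<?php
// Hypothetical sketch, NOT the Text_Tokenizer_Regex API.
// $rules maps a token name to a regex fragment; $skip lists token
// names (such as whitespace) that are matched but not returned.
function sketch_tokenize(string $input, array $rules, array $skip = ['WS']): array
{
    // Build one alternation with a named group per token type, so a
    // single preg_match() call tries every rule at the current offset.
    $parts = [];
    foreach ($rules as $name => $pattern) {
        $parts[] = "(?P<$name>$pattern)";
    }
    // The "A" modifier anchors the match at the given offset.
    $regex = '~' . implode('|', $parts) . '~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        if (!preg_match($regex, $input, $m, 0, $offset) || $m[0] === '') {
            throw new RuntimeException("Unexpected input at offset $offset");
        }
        // Find which named group actually matched.
        foreach ($rules as $name => $unused) {
            if (isset($m[$name]) && $m[$name] !== '') {
                if (!in_array($name, $skip, true)) {
                    $tokens[] = [$name, $m[$name]];
                }
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

// Example rules for a tiny CSS-like input (illustrative only).
$rules = [
    'IDENT'  => '[a-zA-Z-]+',
    'NUMBER' => '[0-9]+',
    'PUNCT'  => '[{}:;]',
    'WS'     => '\s+',
];
$tokens = sketch_tokenize('color: red;', $rules);
```

Each entry in $tokens is a [name, text] pair, which is the kind of stream a parser driver would consume one token at a time.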
There are three other important differences from php_parsergenerator:
- Grammar definition is not part of the parser generator. This means that any grammar definition language can be used, by implementing a parser that translates the definition into a Structures_Grammar instance. I've implemented BNF and will use it to write EBNF, but grammar parsers are easy to write.
- The parser driver code is not part of the generated code. A php_parsergenerator generated parser looks like this. A parser generated by Text_Parser_Generator looks like this, because it delegates most of the work to the Text_Parser_LALR parent class.
- Error reporting by the parser generator is much improved. Text_Parser_Generator writes the parser states into comments in the generated parser. When reporting a conflict, it explains which rules in which states caused the conflict, and includes the input expected at that point. It's usually trivial to spot the error in the grammar definition.
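The first point, keeping the grammar definition separate from the generator, can be sketched as a small front end that reads BNF-like text and fills a plain grammar container. This is only an illustration under stated assumptions: SketchGrammar and sketch_parse_bnf are hypothetical names, not the Structures_Grammar or Text_Parser_BNF APIs, and the input format is simplified to one rule per line with symbols separated by whitespace.

```php
<?php
// Hypothetical stand-in for a grammar container; NOT Structures_Grammar.
final class SketchGrammar
{
    /** Maps each left-hand symbol to a list of alternatives (symbol lists). */
    public array $rules = [];

    public function addRule(string $left, array $right): void
    {
        $this->rules[$left][] = $right;
    }
}

// Hypothetical BNF front end; NOT Text_Parser_BNF.
// Limitation of this sketch: a quoted "|" terminal would be split too.
function sketch_parse_bnf(string $bnf): SketchGrammar
{
    $grammar = new SketchGrammar();
    foreach (preg_split('/\R/', trim($bnf)) as $line) {
        if (trim($line) === '') {
            continue; // ignore blank lines
        }
        if (strpos($line, '::=') === false) {
            throw new InvalidArgumentException("Missing '::=' in: $line");
        }
        [$left, $right] = array_map('trim', explode('::=', $line, 2));
        // Each "|" separates one alternative of the same rule.
        foreach (explode('|', $right) as $alternative) {
            $symbols = preg_split('/\s+/', trim($alternative), -1, PREG_SPLIT_NO_EMPTY);
            $grammar->addRule($left, $symbols);
        }
    }
    return $grammar;
}

$grammar = sketch_parse_bnf(
    "<expr> ::= <expr> \"+\" <term> | <term>\n" .
    "<term> ::= NUMBER"
);
```

Because the generator would only ever see the container, a front end for another notation such as EBNF could fill the same structure without touching the generator.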
Anyhow, the objective of this post is to draw more eyes to the code (release early, release often). If you plan on using text parsers in PHP, take a look at the unit tests in the code here, and test-drive the code. I welcome all feedback, especially bug reports.
Now, go and rip my code apart :-)
