My Experience with Some Parser Generators
TOML is a new data format (and young) standard that can hold information similar to JSON and YAML. TOML wants to be simple to write, use and implement. Thanks to those goals and a need for a data format for my games, I thought I’d take a whack at implementing it myself in C.
My formal computer science education and my ever continuing interest in programming languages has taught me about language grammars and parsing. Though I’ve never used the ideas and theories actively in something complete and useful. I didn’t go as far as writing the parser from scratch. Not this time. Instead I ended up using tools called Lemon and re2c to make tomlc.
I didn’t intend to try multiple parser generators. At first look I thought ANTLR would do what I wanted and needed but due to issues I couldn’t make it work. Bison and Flex was my next shot but I found certain limitations to make it undesirable for this implementation. I finally landed with Lemon that while the grammar file, I think, is more complicated it works brilliantly. I found each of the generators to be great and given certain project details they could each be workable.
ANTLR is pretty fantastic. My suspicion is that it’ll generally work better with vm languages than it did for me with C. An external runtime library is needed for C for ANTLR3. I spent some hours tracking down documentation for ANTLR and the C library, some more trying to debug the seg faults that appeared in the antlr runtime library and some final time looking through the library’s source. In the end I figure I was mismatching the runtime library with the version of antlr to generate the C code. Just not making enough progress with the time I spent I moved on to the next parser generator.
While ANTLR didn’t work for me, it definitely has some cool benefits. Compared to the later generators the grammar was simpler to write. Other parser generators rely on a separate method to lexically scan tokens from the input, ANTLR has that built into the grammar files. If you are using the older version 3 than the more recent version 4, as I did since it does not yet support a C target, you can target a number of languages instead of version 4’s two current targets.
Here is what my resulting ANTLR grammar was before I moved to the next generator:
I can’t guarentee that this is an accurate grammar for TOML and it’s definitely incomplete, just that ANTLR3 builds target parsers with it.
ATF / Kyua
Since this was a project that was partly about trying things I hadn’t before experimented with unit test libraries. Seeing a lot of mentions for ATF, Automated Testing Framework, now known as Kyua, pronouced QA, I gave that a try. It has a lot of great features and an approach that could be useful for debugging users that made it very enticing. But I ended up running into a really big problem building Kyua’s developer tool. The library built easily but oddly the tool to run the built test suites created enough 450 meg core dumps as it built to fill all the free space on my computer’s drive. I succeeded at building it eventually but being frustrated at building Kyua and not wanting to explain how to work around the core dump issue in docs for my library, I gave up on it. The start of this project was rather frustrating.
Next on the list. Bison. An implementation of Yacc. I didn’t try it for as long as ANTLR. I made a simple TOML grammar and lexer with Flex that would compile. Poking through the generated source and figuring out how to build around it with the created APIs alerted me to the global variables Bison uses. Bison’s C++ version supports being executed in multiple threads but not the C output. Though it could be used fine, starting fresh with this parser, maybe I step further. So I started peaking around other generators that could support threaded parsing.
Oh yeah, the simple Bison grammar and Flex lexer.
Lemon, maintained by sqlite, builds on the ideas of Bison and Yacc. It definitely has some positives going for it. On the downside it uses a grammar style that can be harder to work with. While its documentation is shorter than the above mentioned generators, I found the content to bring me up to speed quicker than the others.
Lemon’s constraints and modifications grammatically really do make it simpler to use. Specifying names for a grammar rule’s components is really nice to have instead of counting which argument has the token you want to work on.
As designed, Lemon also doesn’t allow terminal tokens to be written in its grammar. Instead Lemon strictly has you provide a lexer to break string content into tokens for the generated parser to consume. Part of this clean separation of functionality also allows me to have lots of control in what to use to lexically analyze the input.
While ANTLR does everything itself, Yacc, Bison, and Lemon require a second tool to tokenize the input. Lemon’s provided freedom in this really leaves you open. Fortunately some other online posts mentioned re2c. As well both Lemon and re2c leave you to define how to communicate between the parts. At minimum you need to pass an integer to indicate the token type. But to be useful you’ll probably pass the string content of the token.
One thing odd about Lemon, it doesn’t have a way to write output files to a target directory from the tool.
To finish up about parser generators, my final Lemon grammar.
Unless I am forgetting something, this is a complete grammar for TOML. Paired with my re2c powered tokenizer, the above grammar can make sense of a TOML document.
Oh Yeah, Other Testing Stuff
I tried fctx and ended using libtap for testing. Compared to Kyua that are greatly lacking in features but I also found those attractive. However fctx and libtap are so simple it’s hard to screw up using or compiling them. libtap is a step simpler and writes test results in TAP. In the end, that last bit and being simple to use were my deciding factors for the test suite’s library.