MediaWiki markup is the dominant wiki syntax in the wild, largely due to the influence of Wikipedia; that mindshare is why it was chosen as the language for SUMO's knowledge base during our migration from TikiWiki. However, there is no good implementation of MediaWiki parsing in Python. The best available, py-wikimarkup, is an almost line-for-line port of the original PHP, to the extent that it still has dollar signs in the comments. As such, it is a multi-stage regex-find-and-replace formatter that never creates a parse tree and is thus both (1) practically impossible to extend and (2) hard-coupled to outputting HTML. Both of these hurt us:
- SUMO needed to extend the parser to support constructs like TikiWiki's "showfor", inclusions, and templates. Each of these had to be implemented with some degree of hackery, degrading the comprehensibility of the code and leaving lots of broken corner cases that regularly plague contributors.
- Right now, it's impossible to translate wiki snippets (forum posts, for example) to plain text for inclusion in emails.
We propose to implement a new MediaWiki parser using proper parsing techniques: generating a parse tree, manipulating it, and then outputting (at least initially) HTML. Erik Rose has already done some research toward this: see the continually developing design document and some initial code.
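To make the decoupling concrete, here is a minimal sketch of how a parse tree separates parsing from output. The node classes and method names are purely illustrative, not the project's actual design; the point is that a new output format (such as the plain-text emails mentioned above) becomes one new method per node rather than a new parser.

```python
import html

# Hypothetical tree nodes: each node knows how to render itself to
# HTML or to plain text, so output formats are pluggable.

class Text:
    def __init__(self, content):
        self.content = content

    def to_html(self):
        return html.escape(self.content)  # escape &, <, > for HTML

    def to_text(self):
        return self.content  # plain text needs no escaping

class Bold:
    def __init__(self, children):
        self.children = children

    def to_html(self):
        return "<b>%s</b>" % "".join(c.to_html() for c in self.children)

    def to_text(self):
        return "".join(c.to_text() for c in self.children)

# The same tree, two renderings:
tree = Bold([Text("fonts & features")])
print(tree.to_html())  # <b>fonts &amp; features</b>
print(tree.to_text())  # fonts & features
```

A regex-substitution formatter bakes the `<b>` tags directly into its replacement strings, which is exactly the hard coupling described above.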
Here's a mail I sent to a few clueful souls who mailed me. If you are clueful but didn't mail me, act as if I sent this to you, too!
Hi, all! You're receiving this because you have inquired about a Google Summer of Code project I'm mentoring: to create a more maintainable, extensible MediaWiki-syntax parser in Python. This is a bit of a "research" project in that I'm not entirely sure it's possible to create a comprehensible parser that still handles all the subtleties of the ill-designed MediaWiki language, but the plan is to at least have fun and learn something trying!
As you can see from https://wiki.mozilla.org/Community:SummerOfCode11:MediaWikiParser (which you all evidently already found), I've already made some progress—mostly in the form of educational false starts—in this project myself. We will probably throw away almost all of my existing code—but learn from its mistakes—in the course of the GSoC.
An aside about the specific difficulties with the current implementation:
It turns out that, among other productions, MediaWiki's external link syntax seems to require a lookahead of 3 or 4 tokens in the parser in order to distinguish it from simple text runs that happen to begin with "[[". LALR(1) parsers like yacc (a port of which is used in the current implementation) give you only 1 token of lookahead. There are certain tricks you can play to get around that, but they tend to hurt comprehensibility and introduce otherwise-unnecessary coupling between grammar productions, making future extension difficult. I have learned a lot of fascinating things about parsers while figuring this out. :-) A promising starting point for the GSoC work might be the Earley parser, which can parse any context-free grammar regardless of lookahead length, tolerates ambiguity, and has not-too-terrible run time.
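To illustrate why unbounded lookahead stops being a problem with Earley parsing, here is a toy recognizer (a sketch only, with no epsilon-production handling and no tree building) run against a deliberately simplified grammar. The grammar is illustrative and is not real MediaWiki syntax: a bracketed run is either a complete external link or a plain-text run, and the parser need not decide which until it has seen the whole input.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` derives from `start` under `grammar`.

    `grammar` maps each nonterminal to a list of right-hand sides
    (tuples of symbols); any symbol not in `grammar` is a terminal.
    Epsilon productions are omitted to keep the sketch short.
    """
    # A state is (lhs, rhs, dot, origin): a production with a parse
    # position `dot`, begun at input position `origin`.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:  # run predict/complete to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:  # PREDICT: expand a nonterminal
                        for prod in grammar[sym]:
                            if (sym, prod, 0, i) not in chart[i]:
                                chart[i].add((sym, prod, 0, i))
                                changed = True
                    elif i < len(tokens) and tokens[i] == sym:  # SCAN
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:  # COMPLETE: advance states waiting on `lhs`
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            if (l2, r2, d2 + 1, o2) not in chart[i]:
                                chart[i].add((l2, r2, d2 + 1, o2))
                                changed = True
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(tokens)])

# Toy grammar: an opening bracket might start an external link or
# might just be text; Earley keeps both hypotheses alive in the chart.
GRAMMAR = {
    "S": [("link",), ("text",)],
    "link": [("[", "url", "]")],
    "text": [("[", "word")],
}

print(earley_recognize(GRAMMAR, "S", ["[", "url", "]"]))  # True (a link)
print(earley_recognize(GRAMMAR, "S", ["[", "word"]))      # True (plain text)
```

An LALR(1) parser must commit to one production after seeing only "["; the Earley chart instead carries both candidate productions forward and lets whichever one completes win, which is what makes it attractive for an ambiguous, lookahead-hungry language like MediaWiki's.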
This is my first GSoC, so I'm figuring some of this out as I go along. Because parts of the Google Melange system have been offline for the past 5 days, I have been unable to register as an official mentor. I'm sure we'll get that ironed out. In the meantime, I don't think that should keep you from filing your student applications: http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#student_app. Let me know if you have trouble.
The rules seem to stipulate that I accept only one applicant, so here's what I'll be looking for:
- Enthusiasm and a playful attitude
- Ability to dig through CS literature and learn things about parsing algorithms
- Past experience or strong interest in parsing things (yielding real ASTs; regex soups don't count)
- A history of getting things done
Finally, even if I'm limited to accepting one applicant "officially", I will certainly not turn down unofficial collaboration once the project gets rolling. You might also choose to file multiple applications just to be safe; propose something cool, and it's likely we can find someone equally cool at Mozilla to mentor you.
Any questions—please ask. I look forward to working with you!
To get a sense of the project or start immersing yourself in its problem domain, you can start by reading this stuff:
- The grammar horrors and half-baked implementation attempts at http://www.mediawiki.org/wiki/Markup_spec. http://www.mediawiki.org/wiki/Markup_spec/BNF seems to be the most complete and helpful.
- A nice short review of various parsing techniques: http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt
- Good material on Earley parsers. Wikipedia's could be better. If you find something, please note it here. There are at least 2 implementations of Earley parsers in Python. One is SPARK (http://pages.cpsc.ucalgary.ca/~aycock/spark/doc.html and http://www.python.org/workshops/1998-11/proceedings/papers/aycock-little/aycock-little.html). The other is from NLTK. Chapter 8 of the NLTK book makes good recreational reading.