From MozillaWiki

Before May 25th

After establishing contact with my mentor, Erik Rose, we decided to complete a comparison of the existing parsers that he had already started. I created an account on GitHub to host a fork of Erik's mediawiki-parser project.

We needed three to four weeks to put the final list in place, testing and comparing the parsers on ease of use and the readability of their source code.

We finally decided that PEG parsers were probably the best choice and, among them, I felt that Pijnu was the most pleasant to use. I began to implement some parts of MediaWiki grammar in it as a proof of concept.
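To illustrate the PEG style that made Pijnu attractive (this is a hand-rolled sketch, not Pijnu's actual grammar syntax): a PEG rule tries its alternatives in order and commits to the first match, which resolves ambiguity deterministically.

```python
# Minimal PEG-style combinators: each parser takes (text, pos) and
# returns (value, new_pos) or None on failure. Ordered choice tries
# alternatives left to right and commits to the first that matches --
# the defining property of PEGs. (Illustrative sketch, not Pijnu.)

def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def choice(*parsers):
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

# "'''" must be tried before "''", or bold markup would parse as italic.
apostrophes = choice(literal("'''"), literal("''"))
```

`apostrophes("'''bold'''", 0)` returns `("'''", 3)`: the longer alternative wins because it is tried first, which is exactly how ordered choice disambiguates MediaWiki's apostrophe runs.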

However, Pijnu had some limitations that we wanted to correct (Unicode support, comparison in tests, tests failing unexpectedly). So, I contacted Pijnu's developer, Denis "Spir" Derman, and we discussed those questions.

After Spir solved the issue of failing tests (a memoization reset was needed), I added a lot of test cases for my parser and continued to develop it.
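The bug behind those failing tests is easy to reproduce in any packrat parser: memoized results are keyed by rule and position, so a cache that survives from one input to the next returns stale answers. A sketch of the failure mode (illustrative, not Pijnu's actual internals):

```python
# Sketch of why a packrat memo table must be reset between parses:
# results are cached by (rule, position), so stale entries from a
# previous input silently corrupt the next parse.

class TinyParser:
    def __init__(self):
        self.memo = {}  # (rule_name, position) -> cached result

    def reset(self):
        self.memo.clear()  # the fix: clear the cache for each new input

    def match_digit(self, text, pos):
        key = ('digit', pos)
        if key not in self.memo:
            self.memo[key] = pos < len(text) and text[pos].isdigit()
        return self.memo[key]

parser = TinyParser()
first = parser.match_digit('7', 0)   # True, and now cached
stale = parser.match_digit('x', 0)   # still True: stale cache hit!
parser.reset()
fixed = parser.match_digit('x', 0)   # False, as expected
```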

May 25th to May 31st

I spent a long time working on wikitable parsing. I added simple and semi-complex test cases and had them parsed correctly.
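For reference, the simple cases use MediaWiki's `{|` … `|}` markup, with `|-` separating rows, `|`/`||` separating cells, and `!`/`!!` for header cells. A rough sketch of how that markup decomposes into rows and cells (our parser builds a proper AST; this only illustrates the syntax):

```python
# Rough sketch of simple wikitable structure: {| opens, |} closes,
# |- separates rows, | and ! start data/header cells, with || and !!
# as inline cell separators. Captions, attributes and nesting ignored.

def split_wikitable(source):
    rows, current = [], []
    for line in source.strip().splitlines():
        line = line.strip()
        if line in ('{|', '|}'):               # table open / close
            continue
        if line.startswith('|-'):              # row separator
            if current:
                rows.append(current)
            current = []
        elif line.startswith('!'):             # header cells, '!!' inline
            current.extend(c.strip() for c in line[1:].split('!!'))
        elif line.startswith('|'):             # data cells, '||' inline
            current.extend(c.strip() for c in line[1:].split('||'))
    if current:
        rows.append(current)
    return rows

table = """{|
|-
! Name !! Age
|-
| Alice || 30
|}"""
```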

I also added a TODO file which gives the current status of the different parts of MediaWiki syntax for my parser.

Erik had a look at my commits and corrected a space/tabs indentation problem. We have planned to communicate with each other by audio.

June 1st to June 7th

I figured out how to use comparisons in Pijnu's tests and adapted my tests to do so.

I corrected the parsing of '=' in text lines, which led me to rewrite title parsing; titles now match MediaWiki's behavior. I added support for nested tables and nested templates.
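The title rule is subtle: a line is a heading only if it both starts and ends with `=`, the level is the shorter of the two runs (capped at six), and surplus `=` signs stay in the title text. A hedged sketch of that behavior (this approximates MediaWiki; degenerate cases such as all-`=` lines are ignored):

```python
# Sketch of the title rule: a heading needs '=' runs on both ends,
# its level is the shorter run (max 6), and surplus '=' characters
# remain as literal text. Approximates MediaWiki's behavior only.

import re

def parse_title(line):
    match = re.match(r'^(=+)(.+?)(=+)\s*$', line)
    if match is None:
        return None            # plain text line, not a title
    left, text, right = match.groups()
    level = min(len(left), len(right), 6)
    # put surplus '=' back into the text on the longer side
    text = '=' * (len(left) - level) + text + '=' * (len(right) - level)
    return level, text.strip()
```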

I tested pyParsing a little more. This confirmed my opinion that Pijnu was the best PEG parsing library we had tested so far, and much more practical and easier to use than pyParsing.

Spir informed me that he couldn't take on the development of Pijnu "in a foreseeable future"; after discussing this with him and with Erik, I decided to maintain it, keep the GPLv3 license, and make the changes our mediawiki parser needs to move forward (such as Unicode support).

I had a long and helpful discussion with Erik about what was urgent, less urgent and not needed. We agreed that the most urgent task was to get a clear view of the pre/main/post-processing steps. I will write something about that as soon as possible, drawing inspiration from how the MediaWiki parser works.

June 8th to June 14th

I improved Pijnu so that it can handle multiline tests. Then, I adapted the test file of our parser to use this functionality.

After a few email exchanges with Spir, he sent me the latest version of Pijnu and I (manually) merged it into our GitHub repository, which is now the official one for Pijnu. At the same time, I updated Pijnu's license information (adding myself as the current developer) and improved the code style to match, partially but better than before, the PEP 8 style guide.

June 15th to June 21st

I added basic Unicode support to Pijnu, so that our parser can parse Unicode strings.

Erik defined tickets from my TODO list and I added the PEG rules and Python tests for some of them (paragraphs, lists). I also rewrote part of the current syntax definition in order to parse escape characters (e.g. [ { | } ]) when they are not used in the definition of a link or a template.

June 22nd to June 28th

Erik validated my progress so far, and we decided to mark a first milestone by reorganizing the directory structure, putting Pijnu's generated parser at the center of our parser and working in branches from there. I completely rewrote the test architecture so that we can use nose for testing.

I added support for HTML tags and entities, along with tests.

June 29th to July 5th

I subscribed to the wikitext-l mailing list and introduced myself. I got a little feedback on my project, but not much.

I added support for HTML comments, added more tests for tag parsing, and improved the AST structure for wikitables.

I added support for parsing template parameters, with tests for them as well as for nowiki sections and horizontal rules (HR tags).

July 6th to July 12th

I began the postprocessing part, which is rendering the AST into HTML or text.
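The idea behind the postprocessor can be sketched with a hypothetical AST shape — nested `(tag, children)` tuples with plain strings as leaves. Our parser's real node classes differ; this only shows the walk-and-emit pattern:

```python
# Hypothetical sketch of the postprocessing idea: walk an AST and emit
# HTML. Nodes here are (tag, children) tuples with strings as leaves;
# the real parser's AST is structured differently.

from html import escape

def render(node):
    if isinstance(node, str):
        return escape(node)            # leaf: escaped text
    tag, children = node
    inner = ''.join(render(child) for child in children)
    return '<%s>%s</%s>' % (tag, inner, tag)

ast = ('p', ['Hello ', ('b', ['world']), '!'])
```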

Erik made some modifications to Pijnu to allow more than one instance of it to be created (having only one instance would cause memoization problems in our case), so I adapted our parser to take those changes into account.

July 13th to July 19th

I began with a big merge of the different branches created so far. Erik helped me a lot with this because it was kind of a mess...

After that, I began working on the preprocessor part. The preprocessor is responsible for substituting the templates and their parameters. I made a proof of concept for it and improved it for several corner cases I found.
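The core of that substitution step, reduced to a hedged sketch: real wikitext also has named parameters, defaults and nested templates, all of which this deliberately ignores, and the template store here is a toy stand-in.

```python
# Hedged sketch of the preprocessor's job: replace {{name|arg,...}}
# calls with the template body, substituting {{{1}}}, {{{2}}}, ...
# positional parameters. Named parameters, defaults and nesting are
# ignored; TEMPLATES is a toy stand-in for the real template source.

import re

TEMPLATES = {'greet': 'Hello, {{{1}}}!'}

def expand(text):
    def replace(match):
        parts = match.group(1).split('|')
        name, args = parts[0], parts[1:]
        body = TEMPLATES.get(name, match.group(0))  # unknown: keep as-is
        for i, arg in enumerate(args, start=1):
            body = body.replace('{{{%d}}}' % i, arg)
        return body
    return re.sub(r'\{\{([^{}]+)\}\}', replace, text)
```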

I finished the week by applying the Python naming convention (under_score instead of CamelCase) to our mediawiki.pijnu file, which required updating all the tests created so far.

July 20th to July 26th

I implemented and tested support for HTML entities and discussed this at length with Erik.

I then added the rendering (postprocessor) part for allowed and disallowed HTML tags and HTML attributes, with tests for it.

I finally did the same for HTML tables, templates, HR, and titles, adding the corresponding tests.

July 27th to August 2nd

I corrected the implementation for bold and italic and implemented the rendering for lists (hard!).
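What makes list rendering hard: consecutive wikitext lines carry prefixes whose depth drives the opening and closing of nested HTML lists. A simplified sketch handling `*` bullets only (real wikitext also mixes `#`, `;` and `:`, and proper HTML nests the inner list inside the parent `<li>`, which this skips):

```python
# Simplified sketch of nested list rendering ('*' bullets only).
# The prefix depth of each line decides how many <ul> levels to open
# or close relative to the previous line.

def render_bullets(lines):
    out, depth = [], 0
    for line in lines:
        stars = len(line) - len(line.lstrip('*'))  # prefix depth
        while depth < stars:                       # deeper: open lists
            out.append('<ul>')
            depth += 1
        while depth > stars:                       # shallower: close them
            out.append('</ul>')
            depth -= 1
        out.append('<li>%s</li>' % line[stars:].strip())
    out.extend('</ul>' for _ in range(depth))      # close what remains
    return ''.join(out)
```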

I implemented a function to check tags balancing, in order to obtain a clean output.
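The balancing check boils down to a stack of open tags. A minimal sketch of the idea, assuming bare `<tag>`/`</tag>` pairs (attributes and void tags like `<br />` are ignored here for brevity):

```python
# Minimal sketch of a tag-balancing check: push opening tags, pop on
# matching closers, and fail on any mismatch or leftover open tag.
# Attributes and void tags (<br />) are deliberately not handled.

import re

def tags_balanced(text):
    stack = []
    for closer, name in re.findall(r'<(/?)(\w+)>', text):
        if closer:
            if not stack or stack.pop() != name:
                return False   # closing tag with no matching opener
        else:
            stack.append(name)
    return not stack           # leftover open tags also fail
```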

I tried our parser on a real, randomly chosen Wikipedia article [1] and corrected some bugs that this test revealed. Parsing the article takes between 2 and 3 seconds on my machine; we plan to improve that later.

I finished the rendering part by adding the support for inline URLs, external links and internal links (including the categories and files).

I finally solved a few bugs discovered by Erik and me.