(Skeleton for testing linkage) |
(Add some general notes on expected content and the parsing grammar) |
||
Line 4: | Line 4: | ||
It is not enough to create a | It is not enough to create a | ||
[https://bugzilla.mozilla.org/showdependencytree.cgi?id=996570&hide_resolved=1 data store] for compatibility data. It also needs to be populated with structured data. We've decided to start with MDN data, rather than start from scratch or [https://github.com/Fyrd/caniuse an existing data source]. An MDN data importer is now part of the [https://github.com/jwhitlock/web-platform-compat web-platform-compat] project, and is live at https://browsercompat.herokuapp.com/importer/. | [https://bugzilla.mozilla.org/showdependencytree.cgi?id=996570&hide_resolved=1 data store] for compatibility data. It also needs to be populated with structured data. We've decided to start with MDN data, rather than start from scratch or [https://github.com/Fyrd/caniuse an existing data source]. An MDN data importer is now part of the [https://github.com/jwhitlock/web-platform-compat web-platform-compat] project, and is live at https://browsercompat.herokuapp.com/importer/. | ||
=== Expected MDN content === | |||
The importer works with the raw versions of pages, which contain HTML with KumaScript tags. For example, the MDN page about the HTML <p> element is: | |||
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p | |||
and the raw version of the page is: | |||
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p?raw | |||
You can also see the raw version by editing a page and selecting the "Source" button in the upper left corner. | |||
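The mapping from a page URL to its raw version is mechanical, so it can be sketched in a few lines. The helper name <code>raw_url</code> is hypothetical and is not taken from the importer's source; it only illustrates the <code>?raw</code> convention described above.

```python
from urllib.parse import urlsplit, urlunsplit


def raw_url(page_url):
    """Return the URL of the raw (KumaScript) version of an MDN page.

    Hypothetical helper: replaces any query string or fragment
    on the page URL with the bare "raw" flag.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "raw", ""))


print(raw_url("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p"))
```

Dropping the fragment is a design choice for this sketch: an anchor such as <code>#Specifications</code> has no meaning for the raw source.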
The importer is expecting a page that matches this pattern (most pages are more complex): | |||
<pre> | |||
<h2 id="Summary">Summary</h2> | |||
<!-- ... Other content .... --> | |||
<h2 id="Specifications" name="Specifications">Specifications</h2> | |||
<table class="standard-table"> | |||
<thead> | |||
<tr> | |||
<th scope="col">Specification</th> | |||
<th scope="col">Status</th> | |||
<th scope="col">Comment</th> | |||
</tr> | |||
</thead> | |||
<tbody> | |||
<tr> | |||
<td>{{SpecName('HTML WHATWG', 'grouping-content.html#the-p-element', '<p>')}}</td> | |||
<td>{{Spec2('HTML WHATWG')}}</td> | |||
<td> </td> | |||
</tr> | |||
</tbody> | |||
</table> | |||
<h2 id="Browser_compatibility" name="Browser_compatibility">Browser compatibility</h2> | |||
<div> | |||
{{CompatibilityTable}}</div> | |||
<div id="compat-desktop"> | |||
<table class="compat-table"> | |||
<tbody> | |||
<tr> | |||
<th>Feature</th> | |||
<th>Chrome</th> | |||
<th>Firefox (Gecko)</th> | |||
</tr> | |||
<tr> | |||
<td>Basic support</td> | |||
<td>1.0</td> | |||
<td>{{CompatGeckoDesktop("1.0")}} [1]</td> | |||
</tr> | |||
</tbody> | |||
</table> | |||
</div> | |||
<div id="compat-mobile"> | |||
<table class="compat-table"> | |||
<tbody> | |||
<tr> | |||
<th>Feature</th> | |||
<th>Android</th> | |||
<th>Firefox Mobile (Gecko)</th> | |||
</tr> | |||
<tr> | |||
<td>Basic support</td> | |||
<td>{{CompatVersionUnknown}}</td> | |||
<td>{{CompatGeckoMobile("1.0")}}</td> | |||
</tr> | |||
</tbody> | |||
</table> | |||
</div> | |||
<p>[1] This is a footnote</p> | |||
<h2 id="See_also">See also</h2> | |||
<!-- ... Rest of content ... --> | |||
</pre> | |||
The importer tolerates variations in whitespace and some common alternate MDN patterns, but each tolerated variation has to be coded into the importer explicitly. If the page uses valid but unexpected HTML, the importer will fail, usually with a "section_skipped" critical error. | |||
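To make the KumaScript tags in the expected pattern concrete, the following sketch pulls macro names and argument strings out of a fragment of raw page source. The regular expression and the <code>find_macros</code> helper are assumptions made for illustration; the importer itself uses a grammar rather than this regex.

```python
import re

# Hypothetical pattern: "{{", a macro name, an optional parenthesized
# argument list, then "}}". Whitespace around the parts is tolerated.
KUMASCRIPT = re.compile(
    r"\{\{\s*(?P<name>\w+)\s*(?:\((?P<args>.*?)\))?\s*\}\}",
    re.DOTALL,
)


def find_macros(raw_html):
    """Return (name, args) pairs; args is None when the macro takes none."""
    return [(m.group("name"), m.group("args"))
            for m in KUMASCRIPT.finditer(raw_html)]


row = '<td>{{CompatGeckoDesktop("1.0")}} [1]</td><td>{{CompatVersionUnknown}}</td>'
print(find_macros(row))
```

The argument list is returned as an unparsed string, so splitting it into individual arguments is left out of this sketch.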
=== The Parser === | |||
The importer uses a [https://github.com/erikrose/parsimonious Parsing Expression Grammar] (PEG) library to parse the raw MDN page. This can extract the useful data, as well as report the position and range of some unexpected content. The page grammar is in [https://github.com/mozilla/web-platform-compat/blob/master/mdn/scrape.py the source code], and, while precise, can be difficult to understand. | |||
For example, to parse a row in the specifications table: | |||
<pre> | |||
<tr> | |||
<td>{{SpecName('HTML WHATWG', 'grouping-content.html#the-p-element', '<p>')}}</td> | |||
<td>{{Spec2('HTML WHATWG')}}</td> | |||
<td> </td> | |||
</tr> | |||
</pre> | |||
the grammar specifies: | |||
<pre> | |||
spec_row = tr_open _ specname_td _ spec2_td _ specdesc_td _ "</tr>" _ | |||
specname_td = td_open _ kumascript "</td>" | |||
spec2_td = td_open _ kumascript "</td>" | |||
specdesc_td = td_open _ inner_td _ "</td>" | |||
inner_td = ~r"(?P<content>.*?(?=</td>))"s | |||
td_open = "<td" _ opt_attrs ">" | |||
... other rules ... | |||
_ = ~r"[ \t\r\n]*"s | |||
</pre> | |||
The first line (known as a "rule") says "Expect a <code>tr_open</code>, followed by optional whitespace (<code>_</code>), followed by a <code>specname_td</code>, ...", where the elements <code>tr_open</code> and <code>specname_td</code> are themselves rules defined elsewhere in the grammar. The PEG engine tries to match the MDN page against the grammar. If successful, the content defined by the elements can be extracted for further processing. If the grammar doesn't match, then the rule where matching stopped can be reported for a human to investigate. | |||
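The match-or-report behaviour can be illustrated without the PEG library. The sketch below is a hand-rolled stand-in, not the parsimonious API and not the importer's code: it walks the <code>spec_row</code> sequence with plain regular expressions and, on failure, reports the rule and position where matching stopped. The opening-tag patterns are simplified assumptions, and a real PEG engine also supports ordered choice and backtracking, which this omits.

```python
import re

WS = re.compile(r"[ \t\r\n]*")                     # the "_" rule
TR_OPEN = re.compile(r"<tr[^>]*>")                 # simplified tr_open
TD_OPEN = re.compile(r"<td[^>]*>")                 # simplified td_open
TD_CLOSE = re.compile(r"</td>")
TR_CLOSE = re.compile(r"</tr>")
KUMASCRIPT = re.compile(r"\{\{.*?\}\}", re.DOTALL)
INNER_TD = re.compile(r".*?(?=</td>)", re.DOTALL)  # the inner_td rule

# (rule name, pattern, capture key or None), in the order spec_row expects
SPEC_ROW = [
    ("tr_open", TR_OPEN, None), ("_", WS, None),
    # specname_td
    ("td_open", TD_OPEN, None), ("_", WS, None),
    ("kumascript", KUMASCRIPT, "specname"), ("</td>", TD_CLOSE, None),
    ("_", WS, None),
    # spec2_td
    ("td_open", TD_OPEN, None), ("_", WS, None),
    ("kumascript", KUMASCRIPT, "spec2"), ("</td>", TD_CLOSE, None),
    ("_", WS, None),
    # specdesc_td
    ("td_open", TD_OPEN, None), ("_", WS, None),
    ("inner_td", INNER_TD, "specdesc"), ("_", WS, None),
    ("</td>", TD_CLOSE, None), ("_", WS, None),
    ("</tr>", TR_CLOSE, None), ("_", WS, None),
]


def match_spec_row(text):
    """Match text against the sequence, or report where matching stopped."""
    pos = 0
    captured = {}
    for rule, pattern, key in SPEC_ROW:
        m = pattern.match(text, pos)
        if m is None:
            return {"ok": False, "failed_rule": rule, "position": pos}
        if key:
            captured[key] = m.group(0)
        pos = m.end()
    return {"ok": True, "data": captured, "end": pos}


row = """<tr>
  <td>{{SpecName('HTML WHATWG', 'grouping-content.html#the-p-element', '<p>')}}</td>
  <td>{{Spec2('HTML WHATWG')}}</td>
  <td> </td>
</tr>"""

print(match_spec_row(row))                    # successful match with captures
print(match_spec_row("<tr><th>x</th></tr>"))  # fails at the first td_open
```

The second call shows the failure report: a <code>th</code> cell is valid HTML but is not what the sequence expects, so matching stops at the first <code>td_open</code>.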
== The Issues == | == The Issues == |