Gaia/System/Keyboard/IME/Latin/Prediction & Auto Correction

From MozillaWiki
Revision as of 06:59, 27 January 2015 by Mnjul (talk | contribs) (→‎Notes)

(This page is still being populated by John Lu -- jlu@mozilla.com)

Data Structure

Algorithms

Notes

Same XML, Different Dict Blob

  • This was tested with Python 3 (required by the Makefile), version 3.4.0, on Ubuntu 14.04.
  • Two runs of the XML-to-dict conversion on the same wordlist XML can produce different dictionary binary blobs, which makes git report a modified *.dict even though the wordlist has not changed.

For example, after the following commands diff reports that the two files differ:

# rm en.dict if make doesn't want to make a new en.dict
$ make en.dict
$ mv en.dict en.dict.old
$ make en.dict
$ diff en.dict en.dict.old
Binary files en.dict and en.dict.old differ
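diff only reports that the two blobs differ, not where. As a rough sketch of how the differing bytes could be located programmatically, the helper below lists the contiguous offset ranges at which two blobs differ. The name differing_ranges is hypothetical (it is not part of the Gaia build scripts), and the sketch assumes both files have the same length.

```python
def differing_ranges(a: bytes, b: bytes):
    """Return inclusive (start, end) offset ranges where a and b differ.

    Assumption: both blobs have the same length; otherwise offsets past
    the first insertion or deletion would not be comparable.
    """
    if len(a) != len(b):
        raise ValueError("blobs must have the same length")
    ranges = []
    start = None  # start offset of the differing run currently open, if any
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = offset
        elif start is not None:
            # A run of differing bytes ended at the previous offset.
            ranges.append((start, offset - 1))
            start = None
    if start is not None:
        ranges.append((start, len(a) - 1))
    return ranges


# Small self-contained example: bytes 2 and 5 differ.
print(differing_ranges(b"abcdef", b"abXdeY"))  # [(2, 2), (5, 5)]
```

To apply it to the files above, read both with open(path, "rb").read() and print the ranges with hex() so that they can be matched against xxd offsets.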
  • The difference does not affect functionality in any way.
  • We investigate further by locating the difference:
$ xxd en.dict > en.dict.xxd
$ xxd en.dict.old en.dict.old.xxd
$ diff en.dict.xxd en.dict.old.xxd
27,29c27,29
< 00001a0: 0300 e000 0000 0300 2400 0000 0200 3300  ........$.....3.
< 00001b0: 0000 0200 e400 0000 0200 3200 0000 0100  ..........2.....
< 00001c0: e500 0000 0100 eb00 0000 0100 ee00 0000  ................
---
> 00001a0: 0300 e000 0000 0300 e400 0000 0200 3300  ..............3.
> 00001b0: 0000 0200 2400 0000 0200 3200 0000 0100  ....$.....2.....
> 00001c0: eb00 0000 0100 e500 0000 0100 ee00 0000  ................
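  • We first determine that the difference lies in the character frequency table:
    • Insert print(output.tell()) in the emit() function, before and after the first for-loop, to see where the table starts and ends in the file.
  • Now notice the actual difference: the bytes at {0x1a8 to 0x1ad} and the bytes at {0x1b4 to 0x1b9} are swapped between the two files: [2400 0000 0200] and [e400 0000 0200]. The same holds for the bytes at {0x1c0 to 0x1c5} and at {0x1c6 to 0x1cb}.
  • This is because characterFrequency.items() in emit() does not necessarily produce the same ordering on each run (Python 3.3 and later randomize string hashes per process by default, and a Python 3.4 dict does not preserve insertion order). That sorted() is stable does not help: a stable sort keeps entries with equal frequency in their input order, and the input order is exactly what varies.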
  • Let's insert from pprint import pprint and pprint(list(characterFrequency.items())) in the emit() function and run the conversion twice. The output of the first run is listed first, followed by the output of the second run:
[('é', 220),
('L', 2372),
('-', 277),
('j', 2151),
('M', 4099),
('å', 1),
('H', 2115),
('W', 1544),
('è', 29),
('G', 2251),
('f', 15984),
('a', 110508),
('z', 5884),
('â', 8),
('v', 12864),
('V', 924),
('Y', 325),
('à', 3),
('s', 131933),
('ü', 17),
('m', 35668),
('n', 94375),
('r', 98945),
('w', 11092),
('i', 113155),
('o', 85820),
('h', 30601),
('R', 2268),
('û', 1),
('S', 4682),
('t', 85186),
('ï', 4),
('e', 147827),
('q', 2139),
('I', 1213),
('ë', 1),
('X', 105),
('Q', 205),
('l', 71483),
('F', 1783),
('A', 3463),
('ö', 11),
('K', 1575),
('k', 13233),
('g', 34650),
('D', 2396),
('N', 1525),
('O', 1209),
('P', 2837),
('ñ', 21),
('d', 45495),
('ê', 18),
('B', 3676),
('2', 1),
('3', 2),
('Z', 314),
('u', 43362),
('É', 3),
('ç', 12),
('ó', 7),
('J', 1081),
("'", 30168),
('C', 4323),
('c', 50164),
('b', 24532),
('E', 1684),
('$', 2),
('x', 3605),
('î', 1),
('á', 10),
('p', 33944),
('ô', 6),
('U', 602),
('T', 2515),
('y', 22361),
('ä', 2)]

Second run:
[('î', 1),
('ó', 7),
('Y', 325),
('å', 1),
('c', 50164),
('ä', 2),
('J', 1081),
('F', 1783),
('l', 71483),
('é', 220),
('U', 602),
('s', 131933),
('x', 3605),
('ô', 6),
('-', 277),
('ë', 1),
('j', 2151),
('ê', 18),
('y', 22361),
('i', 113155),
('K', 1575),
('r', 98945),
('D', 2396),
('ö', 11),
('O', 1209),
('k', 13233),
('t', 85186),
('B', 3676),
('p', 33944),
('h', 30601),
('$', 2),
('É', 3),
('u', 43362),
('3', 2),
('A', 3463),
('Z', 314),
('H', 2115),
('f', 15984),
('T', 2515),
('I', 1213),
('n', 94375),
('e', 147827),
('â', 8),
('b', 24532),
('Q', 205),
('ç', 12),
('G', 2251),
('W', 1544),
('g', 34650),
('è', 29),
("'", 30168),
('R', 2268),
('S', 4682),
('P', 2837),
('2', 1),
('o', 85820),
('ñ', 21),
('z', 5884),
('w', 11092),
('X', 105),
('ï', 4),
('v', 12864),
('V', 924),
('N', 1525),
('á', 10),
('û', 1),
('m', 35668),
('M', 4099),
('C', 4323),
('à', 3),
('L', 2372),
('d', 45495),
('ü', 17),
('q', 2139),
('E', 1684),
('a', 110508)]
  • Look at the three tuples with frequency 2. In the first run they appear in the order ('3', 2), ('$', 2), ('ä', 2); in the second run the order is ('ä', 2), ('$', 2), ('3', 2). Thus, the written files are different.
  • Go back to the xxd diff result above and recall that the difference is the swapping of [2400 0000 0200] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Between them is [3300 0000 0200], which encodes ('3', 2). So the only change is the order of characters that share the same frequency. This does not affect any functionality, since the character frequency table does not expect any particular order among characters of the same frequency.
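
If byte-identical output is wanted, one option is to break frequency ties explicitly so that the result no longer depends on the iteration order of the dict. The following is a minimal sketch, not the actual emit() code: the names ordered_character_table and by_frequency_only are hypothetical, the sort by descending frequency is inferred from the dump above, and breaking ties by code point is an assumption (any fixed rule would do).

```python
def ordered_character_table(character_frequency):
    """Return (char, freq) pairs, most frequent first, ties by code point.

    Hypothetical helper: the tie-break rule is an assumption; any fixed
    rule would make the output reproducible.
    """
    return sorted(character_frequency.items(),
                  key=lambda item: (-item[1], item[0]))


def by_frequency_only(items):
    """Stable sort on frequency alone: ties keep their input order."""
    return sorted(items, key=lambda item: item[1], reverse=True)


# The frequency-2 entries in the two orders observed above (plus 'e').
first_items = [('3', 2), ('$', 2), ('ä', 2), ('e', 147827)]
second_items = [('ä', 2), ('$', 2), ('3', 2), ('e', 147827)]

# Input order leaks through the stable sort, so the two results differ.
print(by_frequency_only(first_items) == by_frequency_only(second_items))  # False

# With the explicit tie-break, both orderings give the same table.
print(ordered_character_table(dict(first_items))
      == ordered_character_table(dict(second_items)))  # True
```

Setting the PYTHONHASHSEED environment variable to a fixed value also makes string hashing repeatable across runs, but an explicit sort key does not depend on how the script is invoked.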