Gaia/System/Keyboard/IME/Latin/Prediction & Auto Correction
(This page is still being populated by John Lu -- jlu@mozilla.com)
Data Structure
Algorithms
Notes
Same XML, Different Dict Blob
- This is tested on Ubuntu 14.04's Python 3 (required by Makefile), version 3.4.0.
- Two runs of the XML-to-dict conversion on the same wordlist XML can produce different dictionary binary blobs, which makes git see a modified *.dict even though the wordlist has not changed.
For example, building the dictionary twice and comparing the two results makes diff report a difference:

      # rm en.dict if make doesn't want to make a new en.dict
      $ make en.dict
      $ mv en.dict en.dict.old
      $ make en.dict
      $ diff en.dict en.dict.old
      Binary files en.dict and en.dict.old differ
- The difference does not affect functionality in any way.
- We look further by tracking the difference:

      $ xxd en.dict > en.dict.xxd
      $ xxd en.dict.old en.dict.old.xxd
      $ diff en.dict.xxd en.dict.old.xxd
      27,29c27,29
      < 00001a0: 0300 e000 0000 0300 2400 0000 0200 3300  ........$.....3.
      < 00001b0: 0000 0200 e400 0000 0200 3200 0000 0100  ..........2.....
      < 00001c0: e500 0000 0100 eb00 0000 0100 ee00 0000  ................
      ---
      > 00001a0: 0300 e000 0000 0300 e400 0000 0200 3300  ..............3.
      > 00001b0: 0000 0200 2400 0000 0200 3200 0000 0100  ....$.....2.....
      > 00001c0: eb00 0000 0100 e500 0000 0100 ee00 0000  ................
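
The same comparison can be done without going through xxd. The following is a minimal sketch, not part of the build scripts: `differing_ranges` and `report` are hypothetical helper names, and the file names passed to `report` are assumed to be the two blobs produced above.

```python
def differing_ranges(a: bytes, b: bytes):
    """Return (start, end) offsets, end inclusive, of the runs where a and b differ.

    Only the common prefix length of the two inputs is compared.
    """
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differing bytes begins
        elif start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:          # the last run reaches the end of the data
        ranges.append((start, common - 1))
    return ranges


def report(path_a, path_b):
    """Print the differing offsets of two files in hexadecimal."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    if len(a) != len(b):
        print("sizes differ:", len(a), len(b))
    for start, end in differing_ranges(a, b):
        print(hex(start), "-", hex(end))
```

Calling `report("en.dict", "en.dict.old")` would list the offsets to inspect. Note that it reports only the bytes that actually differ, so two swapped entries show up as short runs rather than as whole entries.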
- We first determine that the difference is located in the character frequency table:
  - Insert `print(output.tell())` in the `emit()` function, before and after the first for-loop.
- Now notice the actual difference: the bytes from 0x1a8 to 0x1ad and the bytes from 0x1b4 to 0x1b9 are swapped between the two files: [2400 0000 0200] and [e400 0000 0200]. The same holds for the bytes from 0x1c0 to 0x1c5 and from 0x1c6 to 0x1cb.
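- This is because `characterFrequency.items()` in `emit()` does not necessarily produce the same ordering on each run (in Python 3.4 the iteration order of a dict depends on string hashes, which are randomized per process by default). It doesn't matter that `sorted()` does stable sorting: a stable sort keeps entries with equal keys in their input order, and the input order is exactly what varies.
- Let's insert `from pprint import pprint` and `pprint(list(characterFrequency.items()))` in the `emit()` function. Two runs give the following results: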
| First run | Second run |
|---|---|
| `[('é', 220),` | `[('î', 1),` |
| `('L', 2372),` | `('ó', 7),` |
| `('-', 277),` | `('Y', 325),` |
| `('j', 2151),` | `('å', 1),` |
| `('M', 4099),` | `('c', 50164),` |
| `('å', 1),` | `('ä', 2),` |
| `('H', 2115),` | `('J', 1081),` |
| `('W', 1544),` | `('F', 1783),` |
| `('è', 29),` | `('l', 71483),` |
| `('G', 2251),` | `('é', 220),` |
| `('f', 15984),` | `('U', 602),` |
| `('a', 110508),` | `('s', 131933),` |
| `('z', 5884),` | `('x', 3605),` |
| `('â', 8),` | `('ô', 6),` |
| `('v', 12864),` | `('-', 277),` |
| `('V', 924),` | `('ë', 1),` |
| `('Y', 325),` | `('j', 2151),` |
| `('à', 3),` | `('ê', 18),` |
| `('s', 131933),` | `('y', 22361),` |
| `('ü', 17),` | `('i', 113155),` |
| `('m', 35668),` | `('K', 1575),` |
| `('n', 94375),` | `('r', 98945),` |
| `('r', 98945),` | `('D', 2396),` |
| `('w', 11092),` | `('ö', 11),` |
| `('i', 113155),` | `('O', 1209),` |
| `('o', 85820),` | `('k', 13233),` |
| `('h', 30601),` | `('t', 85186),` |
| `('R', 2268),` | `('B', 3676),` |
| `('û', 1),` | `('p', 33944),` |
| `('S', 4682),` | `('h', 30601),` |
| `('t', 85186),` | `('$', 2),` |
| `('ï', 4),` | `('É', 3),` |
| `('e', 147827),` | `('u', 43362),` |
| `('q', 2139),` | `('3', 2),` |
| `('I', 1213),` | `('A', 3463),` |
| `('ë', 1),` | `('Z', 314),` |
| `('X', 105),` | `('H', 2115),` |
| `('Q', 205),` | `('f', 15984),` |
| `('l', 71483),` | `('T', 2515),` |
| `('F', 1783),` | `('I', 1213),` |
| `('A', 3463),` | `('n', 94375),` |
| `('ö', 11),` | `('e', 147827),` |
| `('K', 1575),` | `('â', 8),` |
| `('k', 13233),` | `('b', 24532),` |
| `('g', 34650),` | `('Q', 205),` |
| `('D', 2396),` | `('ç', 12),` |
| `('N', 1525),` | `('G', 2251),` |
| `('O', 1209),` | `('W', 1544),` |
| `('P', 2837),` | `('g', 34650),` |
| `('ñ', 21),` | `('è', 29),` |
| `('d', 45495),` | `("'", 30168),` |
| `('ê', 18),` | `('R', 2268),` |
| `('B', 3676),` | `('S', 4682),` |
| `('2', 1),` | `('P', 2837),` |
| `('3', 2),` | `('2', 1),` |
| `('Z', 314),` | `('o', 85820),` |
| `('u', 43362),` | `('ñ', 21),` |
| `('É', 3),` | `('z', 5884),` |
| `('ç', 12),` | `('w', 11092),` |
| `('ó', 7),` | `('X', 105),` |
| `('J', 1081),` | `('ï', 4),` |
| `("'", 30168),` | `('v', 12864),` |
| `('C', 4323),` | `('V', 924),` |
| `('c', 50164),` | `('N', 1525),` |
| `('b', 24532),` | `('á', 10),` |
| `('E', 1684),` | `('û', 1),` |
| `('$', 2),` | `('m', 35668),` |
| `('x', 3605),` | `('M', 4099),` |
| `('î', 1),` | `('C', 4323),` |
| `('á', 10),` | `('à', 3),` |
| `('p', 33944),` | `('L', 2372),` |
| `('ô', 6),` | `('d', 45495),` |
| `('U', 602),` | `('ü', 17),` |
| `('T', 2515),` | `('q', 2139),` |
| `('y', 22361),` | `('E', 1684),` |
| `('ä', 2)]` | `('a', 110508)]` |
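
The effect of the input order on a stable sort can be reproduced in isolation. The sketch below is an illustration only, not the code from the conversion script: `by_frequency` and `by_frequency_then_char` are hypothetical helpers, and sorting by descending frequency is an assumption about what `emit()` does. The sample data are the frequency-2 and frequency-3 entries `('3', 2)`, `('$', 2)`, `('ä', 2)` and `('à', 3)`, in the relative order each run printed them above.

```python
# Relative order of four entries in each run's listing.
first_run = [('à', 3), ('3', 2), ('$', 2), ('ä', 2)]
second_run = [('ä', 2), ('$', 2), ('3', 2), ('à', 3)]


def by_frequency(items):
    # Sort on frequency alone. sorted() is stable (also with reverse=True),
    # so entries with equal frequency keep the order they had in `items`.
    return sorted(items, key=lambda item: item[1], reverse=True)


def by_frequency_then_char(items):
    # Same primary key, with the character as a tie-breaker, so the result
    # no longer depends on the input order.
    return sorted(items, key=lambda item: (-item[1], item[0]))


print(by_frequency(first_run))
print(by_frequency(second_run))
print(by_frequency_then_char(first_run) == by_frequency_then_char(second_run))
```

The first two results agree on where `('à', 3)` goes but list the frequency-2 entries in different orders, which is the kind of difference seen in the blobs. The tie-breaking variant is shown only to make the point that the ordering of ties is the sole varying factor; since the difference is harmless, nothing here implies the script needs it.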
- Let's look at the three tuples with frequency = 2. In the first run they appear in the order ('3', 2), ('$', 2), ('ä', 2); in the second run the order is ('ä', 2), ('$', 2), ('3', 2). Thus, the written files are different.
- Let's go back to the xxd diff result above and recall that the difference is the swapping of [2400 0000 0200] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Between these addresses is [3300 0000 0200], encoding ('3', 2). So what changed is the order of characters with the same frequency, which does not affect any functionality, since the character frequency table is not expected to list characters of the same frequency in any particular order.
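
The dumped bytes can be decoded to check this reading. The sketch below assumes a layout inferred from the dump itself, not taken from the conversion script: each table entry is a 2-byte big-endian character code followed by a 4-byte big-endian count, and the entries start one byte into the dumped line (at 0x1a1), so they are shifted by one byte relative to the 16-bit groups xxd prints. `decode_entries` is a hypothetical helper name.

```python
import struct

# Bytes 0x1a0-0x1cf of each file, copied from the xxd diff above.
new_chunk = bytes.fromhex(
    "0300 e000 0000 0300 2400 0000 0200 3300"
    "0000 0200 e400 0000 0200 3200 0000 0100"
    "e500 0000 0100 eb00 0000 0100 ee00 0000"
)
old_chunk = bytes.fromhex(
    "0300 e000 0000 0300 e400 0000 0200 3300"
    "0000 0200 2400 0000 0200 3200 0000 0100"
    "eb00 0000 0100 e500 0000 0100 ee00 0000"
)


def decode_entries(data, skip=1):
    """Decode complete (character, count) entries from `data`.

    Assumed layout: 2-byte big-endian code + 4-byte big-endian count.
    `skip` drops the leading byte(s) belonging to the previous entry.
    """
    entries = []
    pos = skip
    while pos + 6 <= len(data):
        code, count = struct.unpack_from(">HI", data, pos)
        entries.append((chr(code), count))
        pos += 6
    return entries


new_entries = decode_entries(new_chunk)
old_entries = decode_entries(old_chunk)
print(new_entries)
print(old_entries)
# Same entries in both files, only in a different order.
print(sorted(new_entries) == sorted(old_entries))
```

Under that assumption both chunks decode to the same seven entries, with ('$', 2) and ('ä', 2) exchanged and ('å', 1) and ('ë', 1) exchanged, which agrees with the counts in the two listings.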