Gaia/System/Keyboard/IME/Latin/Prediction & Auto Correction: Difference between revisions

From MozillaWiki
< Gaia‎ | System‎ | Keyboard
Jump to navigation Jump to search
No edit summary
No edit summary
Line 4: Line 4:
The dictionary structure is stored as a [http://en.wikipedia.org/wiki/Ternary_search_tree Ternary Search Tree].
The dictionary structure is stored as a [http://en.wikipedia.org/wiki/Ternary_search_tree Ternary Search Tree].


The xml2dict.py script processes the word list and generate the TST along with supporting meta data and auxiliary data structures.
The [https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py xml2dict.py] script
processes the word list and generate the TST along with supporting meta data and auxiliary data structures.
The script balances the TST such that, at any node, the highest-frequency word is found by following the center pointer.
The script balances the TST such that, at any node, the highest-frequency word is found by following the center pointer.
As explained more detailedly in below, each center child node has a linked list, and the next most frequent word
As explained more detailedly in below, each center child node has a linked list, and the next most frequent word
Line 287: Line 288:
== Algorithms ==
== Algorithms ==


== Questions for Clarification ==
== Known Issues ==
* Is character frequency table's frequency information not being used by predictions.js?
 
== Notes ==
=== Same XML, Different Dict Blob ===
=== Same XML, Different Dict Blob ===
* This is tested on Ubuntu 14.04's Python 3 (required by Makefile), version 3.4.0.
* This is tested on Ubuntu 14.04's Python 3 (required by Makefile), version 3.4.0.
Line 481: Line 479:
* Let's look at the three tuples with frequency = 2, namely ('3', 2), ('$', 2), and ('ä', 2) (at first run). However, at second run, the ordering is ('ä', 2), ('$', 2), and ('3', 2). Thus, the written files are different.
* Let's look at the three tuples with frequency = 2, namely ('3', 2), ('$', 2), and ('ä', 2) (at first run). However, at second run, the ordering is ('ä', 2), ('$', 2), and ('3', 2). Thus, the written files are different.
* Let's go back to the xxd diff result above and recall that the difference is the interchanging of [2400 0000 0020] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Also, between these addresses is [3300 0000 0020], encoding ('3', 2). So, we have a change of order of characters ''of the same frequency'' -- which does not affect any functionality, since no particular order of characters of the same frequency is expected, at the character frequency table.
* Let's go back to the xxd diff result above and recall that the difference is the interchanging of [2400 0000 0020] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Also, between these addresses is [3300 0000 0020], encoding ('3', 2). So, we have a change of order of characters ''of the same frequency'' -- which does not affect any functionality, since no particular order of characters of the same frequency is expected, at the character frequency table.
== Unclarified Questions ==
* Is character frequency table's frequency information not being used by predictions.js?

Revision as of 07:09, 28 January 2015

(This page is still being populated by John Lu -- jlu@mozilla.com)

Dictionary Blob Data Structure

The dictionary structure is stored as a Ternary Search Tree.

The xml2dict.py script processes the word list and generate the TST along with supporting meta data and auxiliary data structures. The script balances the TST such that, at any node, the highest-frequency word is found by following the center pointer. As explained more detailedly in below, each center child node has a linked list, and the next most frequent word with the same parent node is found by following the "next" pointer.

Header

After building the TST data structure, the script serializes it into a compact binary file with variable length nodes. The file begins with the 8 ASCII characters "FxOSDICT" and four more bytes. Bytes 8, 9 and 10 are currently unused and byte 11 is a dictionary format version number, currently 1.

Byte 12 file specifies the length of the longest word in the dictionary.

Character Table

After these first 13 bytes, the file contains a character table that lists the characters used in the dictionary. This table begins with a two-byte big-endian integer that specifies the number of entries in the table. Table entries follow. Each table entry is a big-endian two-byte character code, followed by a big-endian 4-byte number that specifies the number of times the character appears in the dictionary.

Serialized Nodes

After the character table, starting at byte [15 + num_char_table_entries*6], serialized nodes are stored. Each node is between 1 byte and 6 bytes long, encoded as follows.

The first byte of each node is always an 8-bit bitfield: csnfffff

The c bit specifies whether this node represents a character. If c is 1, then a character code follows this first byte. If c is 0, then this is a terminal node that marks the end of the word. If the n bit is 0, then this node is only 1-byte long (= this very first byte). If the n bit is 1, then the "next" pointer follows, as described below, and this node is 4-byte long.

The s bit specifies the size of the character associated with this node. If s is 0, the character is one byte long. If s is 1, the character is a big-endian two byte value.

The n bit specifies whether this node includes a next pointer. If n is 1 then the character code is followed by a big-endian 3 byte number.

The fffff bits represent a number between 0 and 31 and provide a weight for the node. This is usually based on the word frequency data from the dictionary, though this may be tuned by adjusting frequency depending on word length, for example. At any node, these frequency bits represent the weight of the highest-frequency word that we can meet when we traverse along this node's center-pointer descendant.

If the c bit is set, the following one or two bytes (depending on the s bit) of the node are the Unicode character code that is stored in the node. The two-byte codes are big-endian.

If the n bit is set, the following 3 bytes are a big-endian 24-bit unsigned integer, representing the pointer to the "next" character node. The pointer contains the byte offset from the beginning of the serialized nodes, i.e. after the character table.

If the c bit was set, the node has a character, and this means that it also has a center child, as defined by TST. We serialize the tree so that in the binary blob, the center child node always follows the current node in the blob, so we don't need to store the pointer to the center child.

As it turns out, we never need left and right nodes during prediction, so they are not serialized into the blob.

Serialization Example

Let's serialize a TST of the words:

'zila', 'zilian', 'ziliac', 'zart', 'zalarm', 'zset'

And the frequency is:

Word Frequency
zila 255
zilian 175
ziliac 55
zart 215
zalarm 135
zset 95

The visualized TST looks like this:

[z]
 |
[i     ->     a     ->     s]
 |            |            |
[l]          [r  ->  l]   [e]
 |            |      |     |
[a  ->  i]   [t]    [a]   [t]
        |            |
       [a]          [r]
        |            |
       [n  ->  c]   [m]

The vertical up-to-down edge is a center pointer, while a horizontal left-to-right edge denotes the "next" pointer. (the spacing between the horizontal edge doesn't matter, of course). Note that the ordering of [i -> a -> s] is important, and cannot be something else like [a -> s -> i] for our case, as the words traversable (through center pointers) from 'i' have the highest frequency (zila = 255) than those from 'a' (zart = 215), and in turn, than those from 's' (zset = 95). Similarly, [a -> i] in z-i-l ancestry cannot be [i -> a] as 'zila' is more frequent than 'zilian' and 'ziliac'.

Let's talk about the actual serialization now. Suppose the normalized frequency (from 0 to 31, as described above) is:

Word Normalized Frequency
zila 31
zilian 22
ziliac 7
zart 27
zalarm 17
zset 12

The TST will be serialized as followed (after the character table):

(Again, note that offset 0 denotes the beginning of the tree, and is byte [15 + 11*6] from the beginning of the whole blob. 11 because we have 11 alphabets)

  • The position of the 'i' node, next-pointed from 'a'
Node Offset Byte(s) Description
'z' 0 0x9f
= 0b10011111
  • This node has character 'z', a ASCII-7 one-byte
  • No "next" pointer
  • The most-frequent word that can be traversed from this node has freq = 31, of 'zila'
1 0x7a
= chr('z')
'i' 2 0xbf
= 0b10011111
  • This node has character 'i' of the second level, as the center node from 'z' above
  • There is a "next" pointer, to 'a'
  • The most-frequent word that can be traversed from this node has freq = 31, of 'zila'
3 0x69
= chr('i')
4 5 6 0x00 0x00 0x1c
= hex(28)
  • The position of the 'a' node, next-pointed from 'i'
'l' 7 0x9f
= 0b10011111
  • This node has character 'l' of the third level, as the center node from 'i' above
  • No "next" pointer
  • The most-frequent word that can be traversed from this node has freq = 31, of 'zila'
8 0x6c
= chr('l')
'a' 9 0xbf
= 0b10011111
  • This node has character 'a' of the fourth level, as the center node from 'l' above
  • There is a "next" pointer, to 'i'
  • The most-frequent word that can be traversed from this node has freq = 31, of 'zila'
10 0x61
= chr('a')
11 12 13 0x00 0x00 0x0f
= hex(15)
terminal 14 0x1f
= 0b00011111
  • This is the terminal node for the 'zila' word. Frequency of 'zila' is 31
  • No "next" pointer
'i' 15 0x96
= 0b10010110
  • This is the 'i' node next-pointed from 'a', in the fourth level
  • No 'next' pointer
  • The most-frequent word that can be traversed from this node has freq = 22, of 'zilian'
16 0x69
= chr('i')
'a' 17 0x96
= 0b10010110
  • This node has character 'a' of the fifth level, as the center node from 'i' above
  • No 'next' pointer
  • The most-frequent word that can be traversed from this node has freq = 22, of 'zilian'
18 0x61
= chr('a')
'n' 19 0xb6
= 0b10110110
  • This node has character 'n' of the sixth level, as the center node from 'a' above
  • There is a "next" pointer, to 'c'
  • The most-frequent word that can be traversed from this node has freq = 22, of 'zilian'
20 0x6e
= chr('n')
21 22 23 0x00 0x00 0x19
= hex(25)
  • The position of the 'c' node, next-pointed from 'n'
terminal 24 0x16
= 0b00010110
  • This is the terminal node for the 'zilian' word. Frequency of 'zilian' is 22
  • No "next" pointer
'c' 25 0x87
= 0b1000111
  • This is the 'c' node next-pointed from 'n', in the sixth level
  • No 'next' pointer
  • The most-frequent word that can be traversed from this node has freq = 7, of 'ziliac'
26 0x63
= chr('c')
terminal 27 0x07
= 0b00000111
  • This is the terminal node for the 'ziliac' word. Frequency of 'zilian' is 7
  • No "next" pointer
'a' 28 0xbb
= 0b10111011
  • This is the 'a' node next-pointed from 'i', in the second level
  • There is a "next" pointer, to 's'
  • The most-frequent word that can be traversed from this node has freq = 27, of 'zart'
(and we encode character 'a')
(and the position of that "next" 's' node)
(...and the rest of the TST...)

Notes

Note that a terminal node (with c bit = 0) can still have a next pointer, as we always sort words by frequency, in the list linked by next pointers. For example, consider the word list:

Word Frequency
zilia 127
zilian 255
ziliac 31

For which the TST tree will be:

[z]
 |
[i]
 |
[l]
 |
[i]
 |
[a]
 |
[n -> EOW -> c]

where the EOW denotes the terminal node. We also need the terminal node to have frequency information, such that prediction engine knows the frequency of 'zilia', in the example.

It's worthwhile noting that in real-world scenarios, the concept of "root node" is also a list linked by next pointers -- indeed, not all English words begin with 'z'.

Algorithms

Known Issues

Same XML, Different Dict Blob

  • This is tested on Ubuntu 14.04's Python 3 (required by Makefile), version 3.4.0.
  • It is possible that two runs of XML-to-dict conversion on the same wordlist XML produce different dictionary binary blobs, thus triggering git to think there is a new revision of the *.dict.

For example, the following code will result in diff resulting in "different":

# rm en.dict if make doesn't want to make a new en.dict
$ make en.dict
$ mv en.dict en.dict.old
$ make en.dict
$ diff en.dict en.dict.old
Binary files en.dict and en.dict.old differ
  • The difference does not affect any functionality at any degree.
  • We look further by tracking the difference:
$ xxd en.dict > en.dict.xxd
$ xxd en.dict.old en.dict.old.xxd
$ diff en.dict.xxd en.dict.old.xxd
27,29c27,29
< 00001a0: 0300 e000 0000 0300 2400 0000 0200 3300  ........$.....3.
< 00001b0: 0000 0200 e400 0000 0200 3200 0000 0100  ..........2.....
< 00001c0: e500 0000 0100 eb00 0000 0100 ee00 0000  ................
---
> 00001a0: 0300 e000 0000 0300 e400 0000 0200 3300  ..............3.
> 00001b0: 0000 0200 2400 0000 0200 3200 0000 0100  ....$.....2.....
> 00001c0: eb00 0000 0100 e500 0000 0100 ee00 0000  ................
  • We first determine the location of the difference is at the character frequency table:
    • Insert print(output.tell()) in emit() function, before and after the first for-loop.
  • Now notice the actual difference: bytes from {0x1a8 to 0x1ad} and bytes from {0x1b4 to 0x1b9} are interchangeable: [2400 0000 0020] and [e400 0000 0200]. This is the same for bytes from {0x1c0 to 0x1c5} and from {0x1c6 to 0x1cb}.
  • This is because characterFrequency.items() at emit() does not necessarily produce the same ordering on each run. It doesn't matter sorted() does stable sorting.
  • Let's insert from pprint import pprint and pprint(list(characterFrequency.items())) at emit() function. Two runs and the results are:
First run Second run
[('é', 220),
('L', 2372),
('-', 277),
('j', 2151),
('M', 4099),
('å', 1),
('H', 2115),
('W', 1544),
('è', 29),
('G', 2251),
('f', 15984),
('a', 110508),
('z', 5884),
('â', 8),
('v', 12864),
('V', 924),
('Y', 325),
('à', 3),
('s', 131933),
('ü', 17),
('m', 35668),
('n', 94375),
('r', 98945),
('w', 11092),
('i', 113155),
('o', 85820),
('h', 30601),
('R', 2268),
('û', 1),
('S', 4682),
('t', 85186),
('ï', 4),
('e', 147827),
('q', 2139),
('I', 1213),
('ë', 1),
('X', 105),
('Q', 205),
('l', 71483),
('F', 1783),
('A', 3463),
('ö', 11),
('K', 1575),
('k', 13233),
('g', 34650),
('D', 2396),
('N', 1525),
('O', 1209),
('P', 2837),
('ñ', 21),
('d', 45495),
('ê', 18),
('B', 3676),
('2', 1),
('3', 2),
('Z', 314),
('u', 43362),
('É', 3),
('ç', 12),
('ó', 7),
('J', 1081),
("'", 30168),
('C', 4323),
('c', 50164),
('b', 24532),
('E', 1684),
('$', 2),
('x', 3605),
('î', 1),
('á', 10),
('p', 33944),
('ô', 6),
('U', 602),
('T', 2515),
('y', 22361),
('ä', 2)]
[('î', 1),
('ó', 7),
('Y', 325),
('å', 1),
('c', 50164),
('ä', 2),
('J', 1081),
('F', 1783),
('l', 71483),
('é', 220),
('U', 602),
('s', 131933),
('x', 3605),
('ô', 6),
('-', 277),
('ë', 1),
('j', 2151),
('ê', 18),
('y', 22361),
('i', 113155),
('K', 1575),
('r', 98945),
('D', 2396),
('ö', 11),
('O', 1209),
('k', 13233),
('t', 85186),
('B', 3676),
('p', 33944),
('h', 30601),
('$', 2),
('É', 3),
('u', 43362),
('3', 2),
('A', 3463),
('Z', 314),
('H', 2115),
('f', 15984),
('T', 2515),
('I', 1213),
('n', 94375),
('e', 147827),
('â', 8),
('b', 24532),
('Q', 205),
('ç', 12),
('G', 2251),
('W', 1544),
('g', 34650),
('è', 29),
("'", 30168),
('R', 2268),
('S', 4682),
('P', 2837),
('2', 1),
('o', 85820),
('ñ', 21),
('z', 5884),
('w', 11092),
('X', 105),
('ï', 4),
('v', 12864),
('V', 924),
('N', 1525),
('á', 10),
('û', 1),
('m', 35668),
('M', 4099),
('C', 4323),
('à', 3),
('L', 2372),
('d', 45495),
('ü', 17),
('q', 2139),
('E', 1684),
('a', 110508)]
  • Let's look at the three tuples with frequency = 2, namely ('3', 2), ('$', 2), and ('ä', 2) (at first run). However, at second run, the ordering is ('ä', 2), ('$', 2), and ('3', 2). Thus, the written files are different.
  • Let's go back to the xxd diff result above and recall that the difference is the interchanging of [2400 0000 0020] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Also, between these addresses is [3300 0000 0020], encoding ('3', 2). So, we have a change of order of characters of the same frequency -- which does not affect any functionality, since no particular order of characters of the same frequency is expected, at the character frequency table.

Unclarified Questions

  • Is character frequency table's frequency information not being used by predictions.js?