Gaia/System/Keyboard/IME/Latin/Prediction & Auto Correction
(This page is still being populated by John Lu -- jlu@mozilla.com)
Data Structures and Algorithms
Data Structures
To facilitate auto-correction and prediction, our dictionary is represented with a data structure based on the Ternary Search Tree (TST). When we generate a TST from a list of words and their frequencies, we balance the TST such that at any node, the highest-frequency word is found by following the center pointer. To be able to find the next-highest-frequency word with the same prefix (i.e. under the same subtree parent), the center children of a node form a linked list, linked by "next" pointers, which we may follow.
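As a rough illustration, here is a minimal in-memory sketch of such a TST in JavaScript. The field names (ch, freq, center, next) are hypothetical -- predictions.js actually works on the packed binary form described later in this page.

```javascript
// A minimal sketch of a TST node. Field names are hypothetical.
function node(ch, freq, center, next) {
  return { ch: ch, freq: freq, center: center || null, next: next || null };
}

// Because siblings are sorted by frequency, blindly following center
// pointers always reaches the highest-frequency word below a node.
function bestCompletion(n, prefix) {
  var word = prefix;
  while (n) {
    if (n.ch === null) return word; // terminal node: end of word
    word += n.ch;
    n = n.center;
  }
  return word;
}

// Tiny tree for 'at' (frequency 200) and 'an' (frequency 100); the
// 't' child comes before its lower-frequency 'n' sibling:
var a = node('a', 200,
             node('t', 200, node(null, 200), node('n', 100, node(null, 100))));
console.log(bestCompletion(a, '')); // 'at'
```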
Source Word List
Currently, the word lists come from the Android source and are stored in XML format. Each word entry includes a frequency, between 0 and 255, denoting how frequently the word appears in the language. As a special case, a frequency of 0 marks profanity -- words that should not be considered a typo, but that should never be suggested explicitly.
The word list is both case- and diacritic-sensitive, and so is the resulting TST blob.
Example TST
Let's take a list of words:
'zila', 'zilian', 'ziliac', 'zart', 'zalarm', 'zset'
And their frequencies are:
| Word | Frequency |
|---|---|
| zila | 255 |
| zilian | 175 |
| ziliac | 55 |
| zart | 215 |
| zalarm | 135 |
| zset | 95 |
The TST looks like this:
```
[z]
 |
[i  ->    a  ->    s]
 |        |        |
[l]      [r -> l] [e]
 |        |    |   |
[a -> i] [t]  [a] [t]
      |        |
     [a]      [r]
      |        |
     [n -> c] [m]
```
The vertical top-to-bottom edges are center pointers, while the horizontal left-to-right edges denote "next" pointers. (The spacing within a horizontal edge doesn't matter, of course.) Note that the ordering of [i -> a -> s] in the second level is important, and cannot be something else like [a -> s -> i] in our case: the words traversable (through center pointers) from 'i' have a higher top frequency (zila = 255) than those from 'a' (zart = 215), which in turn beat those from 's' (zset = 95). Similarly, [a -> i] under the z-i-l ancestry cannot be [i -> a], as 'zila' is more frequent than 'zilian' and 'ziliac'.
It's worth noting that in real-world scenarios, the "root node" is actually also a list of nodes linked by next pointers -- indeed, not all English words begin with 'z'.
The "Level" Concept
The idea of "center" and "next" pointers can be confusing at first glance. The concept of levels helps clarify it:
- When we follow the "center" pointer, we advance one level down.
- When we follow the "next" pointer, we stay at the same level.
- Level in TST is analogous to character position in a string.
This concept is particularly important because during prediction/auto-correction, we match the input string against the TST, from the beginning of the string and the root of the tree, character by character. As level is analogous to character position in the string, following the center pointer one level down means "We believe matching this letter at this character position is pretty good. Let's proceed with the next character.", while following the next pointer means "We believe matching this letter at this character position is not that good. Let's see if matching another letter, at this character position, would be good." Then, since the list of "next" nodes is sorted from highest frequency to lowest, we're more likely to hit the "pretty good" candidates before we try the "not that good" ones.
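To make the traversal concrete, here is a sketch of exact-prefix lookup under the same hypothetical node shape ({ ch, freq, center, next }): scanning the "next" list stays on a level (one character position), and following "center" advances one level down.

```javascript
// Hypothetical node shape, for illustration only.
function node(ch, freq, center, next) {
  return { ch: ch, freq: freq, center: center || null, next: next || null };
}

function findPrefix(root, prefix) {
  var n = root;
  for (var i = 0; i < prefix.length; i++) {
    while (n && n.ch !== prefix[i]) n = n.next; // stay on this level
    if (!n) return null;                        // prefix not in dictionary
    if (i < prefix.length - 1) n = n.center;    // go one level down
  }
  return n; // the node matching the last character of the prefix
}

// Words: 'hi' (200), 'ha' (150), 'by' (90)
var root =
  node('h', 200,
       node('i', 200, node(null, 200), node('a', 150, node(null, 150))),
       node('b', 90, node('y', 90, node(null, 90))));

console.log(findPrefix(root, 'ha').freq); // 150
console.log(findPrefix(root, 'x'));       // null
```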
Prediction Algorithms
The prediction algorithm is mostly implemented in predictions.js. However, latin.js does some post-processing on the suggestions returned from the prediction engine, in handleSuggestions(), and it's important to understand the dependencies the post-processing procedures may have on the prediction algorithms.
Big Picture & Terminology
Note: This sub-section omits many details in order to keep the introduction streamlined.
The prediction mechanism works by finding loose matches for the user's input. The hard part of this process is getting the "loose" part right -- we employ heuristics: if slight changes to the user's input can transform it into a known word, we say the known word is a loose match for the user input.
The goal of the prediction/matching process is to return a number of suggestions which are possible correct spellings of the user input. We generate such suggestions by growing candidates step by step. A candidate is a prefix of a dictionary word, and may also be regarded as what you get by traversing from the TST root node down a specific number of levels via center pointers, optionally moving horizontally with next pointers within a level. In the TST above, "zili", "zala", and "zse" are candidates.
At each step, we generate new candidates:
- The current candidate is represented by a TST node (again, by traversing from TST root node down)
- There are a few dictionary word prefixes, represented by TST nodes that may be reachable from the current candidate TST node.
- Let's colloquially call them "new prefixes".
- Keep in mind that "reachable" includes both the center and the next pointers (details in later sections).
- Comparing the "new prefix" with the user input up to a specific position N, we may find a way to mutate the user input, to match the "new prefix".
- The resulting candidate is usually one char longer than the previous candidate (thus in the next step, our prefix strings to compare are N+1 chars long).
In the TST example above, if the current candidate is "zar", then the "new prefixes" may be "zart", and "zal".
When we finally reach the end of the user input, the final candidates are sent back to the Latin IME as suggestions.
During this matching/mutation process, we associate a weight with each candidate we're considering. The weight of a candidate is the combined product of that candidate's frequency value (as stored in its TST node) and of how many, and what kinds of, mutations we had to make for (the prefix of) the input string to equal the "new prefix". We'll introduce the weighting details later in this section. To avoid exponentially growing our search space, candidates and suggestions are stored in a bounded priority queue, with weight as the priority. When we go from one step to the next, we always retrieve the highest-priority candidates from the queue and grow new candidates from them.
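A bounded priority queue of the kind described can be sketched as follows (an illustration, not the actual implementation in predictions.js):

```javascript
// Keep at most `capacity` candidates, sorted by weight, so the search
// space stays bounded and the highest-weight candidates grow first.
function BoundedPriorityQueue(capacity) {
  this.capacity = capacity;
  this.items = []; // kept sorted, highest weight first
}

BoundedPriorityQueue.prototype.add = function (candidate) {
  var i = 0;
  while (i < this.items.length && this.items[i].weight >= candidate.weight)
    i++;
  this.items.splice(i, 0, candidate);
  if (this.items.length > this.capacity)
    this.items.pop(); // the lowest-weight candidate falls off
};

BoundedPriorityQueue.prototype.pop = function () {
  return this.items.shift(); // retrieve the highest-weight candidate
};

var q = new BoundedPriorityQueue(2);
q.add({ word: 'zila', weight: 31 });
q.add({ word: 'zset', weight: 12 });
q.add({ word: 'zart', weight: 27 }); // 'zset' is evicted
var first = q.pop();
var second = q.pop();
console.log(first.word, second.word); // zila zart
```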
Generally speaking, we have a few kinds of mutations:
- Insertion: where we insert a character that's present in "new prefix" at position N but not in the user input at position N.
- Deletion: where we delete a character that's not present in "new prefix" at position N but present in the user input at position N.
- Transposition: where the character in "new prefix" at position N is actually the character in the user input at position N+1.
- Note: we are not checking if the character in "new prefix" at position N+1 is the character in the user input at position N. We merely swap the user input chars at positions N and N+1. In the next step, if the user input char at position N+1 (which was the char at position N in this step) can't be matched without expensive mutations, the grown candidate will be likely discarded according to weight/correction limit.
- Substitution: where we just brute-force change the character in the user input at position N to be the character of the "new prefix" at position N.
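The four mutation kinds above can be sketched as a classification of how the input character at a position could be made to match a "new prefix" character. The helper below is hypothetical, for illustration only:

```javascript
// Classify how the user input at position `pos` could be mutated to
// match `ch`, a character of a "new prefix". (Illustrative sketch.)
function possibleMutations(input, pos, ch) {
  var kinds = [];
  if (input[pos] === ch) {
    kinds.push('exact'); // no mutation needed
  } else {
    kinds.push('insertion');    // pretend ch was omitted from the input
    kinds.push('substitution'); // brute-force replace input[pos] with ch
    if (input[pos + 1] === ch)
      kinds.push('transposition'); // input chars at pos/pos+1 swapped
  }
  // Deletion is the remaining case: drop input[pos] and try to match
  // ch against input[pos + 1] instead.
  return kinds;
}

console.log(possibleMutations('teh', 1, 'h'));
// ['insertion', 'substitution', 'transposition']
```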
Apart from the weighting mechanism, the prediction engine also imposes a maximum number of corrections that can be made to a user input to match a specific suggestion. Most mutations count as corrections, but some specific mutation operations (such as substituting a character with its diacritic variant or its capitalized variant) don't. Note that we're talking about suggestions here -- as a candidate grows into a new candidate, it retains the count of corrections we have had to make to the user input so far. During the process, when matching a specific candidate would require more corrections than allowed, we drop that candidate from the queue right away.
Now back to the weight of the candidates. As said above, this weight is based on the word frequency of the candidate's TST node, but it is reduced when we make mutations to the user's input, to reflect our decreased confidence in making such a suggestion. In the code, we define a few weight multipliers for the various kinds of mutations. These multipliers change a candidate's weight as it grows through the matching process. Generally speaking, substituting variant forms and inserting apostrophes barely reduces the weight at all. Substituting a nearby key reduces the weight moderately (depending on how near the keys are on the screen). Insertions, deletions and transpositions reduce the weight significantly.
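As an illustration of the multiplier idea (the numeric values below are made up, not the constants from predictions.js):

```javascript
// Illustrative multipliers -- invented values, not the real constants.
var VARIANT_MULTIPLIER = 0.99;    // diacritic/case variant, apostrophe
var NEARBY_KEY_MULTIPLIER = 0.5;  // scaled further by key distance
var EDIT_MULTIPLIER = 0.3;        // insertion, deletion, transposition

// A candidate's weight starts at its TST node's frequency and shrinks
// with each mutation applied during matching:
var weight = 31;              // frequency from the TST node
weight *= VARIANT_MULTIPLIER; // e.g. matched 'á' against typed 'a'
weight *= EDIT_MULTIPLIER;    // e.g. one character had to be inserted
console.log(weight < 31); // true -- every mutation lowers confidence
```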
The prediction engine holds a nearbyKeys structure which contains spatial information about the keys -- which keys are near each other, and how near they are. This information is provided by the Latin IME, which in turn obtains it from the Keyboard App's layout rendering management logic. As said above, the spatial information is used to weight substitutions.
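The exact shape of nearbyKeys isn't spelled out here, but conceptually it could look like the sketch below (keys, their neighbors, and an invented 0..1 "nearness" score derived from on-screen geometry):

```javascript
// A hypothetical sketch of spatial key data; the values are invented.
var nearbyKeys = {
  'a': { 's': 0.7, 'q': 0.6, 'z': 0.5, 'w': 0.3 },
  's': { 'a': 0.7, 'd': 0.7, 'w': 0.5, 'x': 0.5 }
};

// Substituting a nearby key is penalized less than substituting a far
// one; a key absent from the neighbor map counts as "far".
function isNearby(from, to) {
  return !!(nearbyKeys[from] && nearbyKeys[from][to]);
}

console.log(isNearby('a', 's')); // true
console.log(isNearby('a', 'p')); // false
```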
Finally, the weight of a fully-grown candidate becomes the weight of the suggestion that the candidate has grown into. Or rather, we can say it's the confidence that the prediction engine has in that suggestion, for a specific user input.
Function Call Breakdown
(Not yet written)
Step-by-Step Example
(Not yet written)
Dictionary Blob Structure
The xml2dict.py script processes the word list, generates the TST along with supporting metadata and auxiliary data structures, and serializes everything into a compact binary blob, which is written as the *.dict file, to be deserialized and used by the prediction engine.
Header
The file begins with the 8 ASCII characters "FxOSDICT" and four more bytes. Bytes 8, 9 and 10 are currently unused and byte 11 is a dictionary format version number, currently 1.
Byte 12 specifies the length of the longest word in the dictionary.
Character Table
After these first 13 bytes, the file contains a character table that lists the characters used in the dictionary. This table begins with a two-byte big-endian integer that specifies the number of entries in the table. Table entries follow. Each table entry is a big-endian two-byte character code, followed by a big-endian 4-byte number that specifies the number of times the character appears in the dictionary.
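A sketch of how the header and character table could be parsed from an ArrayBuffer, following the layout above (illustrative only -- the actual deserialization lives in predictions.js):

```javascript
// Parse the blob header and character table per the layout above.
function parseHeader(buffer) {
  var view = new DataView(buffer);
  var magic = '';
  for (var i = 0; i < 8; i++)
    magic += String.fromCharCode(view.getUint8(i));
  if (magic !== 'FxOSDICT')
    throw new Error('not a FxOS dictionary blob');
  var header = {
    version: view.getUint8(11),        // bytes 8-10 are unused
    maxWordLength: view.getUint8(12),
    chars: []
  };
  var numEntries = view.getUint16(13); // big-endian (DataView default)
  for (var e = 0; e < numEntries; e++) {
    var off = 15 + e * 6;
    header.chars.push({
      code: view.getUint16(off),       // 2-byte character code
      count: view.getUint32(off + 2)   // 4-byte occurrence count
    });
  }
  return header;
}

// A tiny hand-made blob: version 1, longest word 9, one entry ('z' x 6).
var bytes = [
  0x46, 0x78, 0x4f, 0x53, 0x44, 0x49, 0x43, 0x54, // "FxOSDICT"
  0x00, 0x00, 0x00,                               // unused
  0x01,                                           // format version
  0x09,                                           // longest word length
  0x00, 0x01,                                     // 1 table entry
  0x00, 0x7a, 0x00, 0x00, 0x00, 0x06              // 'z', count 6
];
var hdr = parseHeader(new Uint8Array(bytes).buffer);
console.log(hdr.chars[0].code === 0x7a); // true
```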
Serialized Nodes
After the character table, serialized nodes are stored, starting at byte [15 + num_char_table_entries*6], depth-first. Each node has variable length, between 1 byte and 6 bytes long, encoded as follows.
The first byte of each node is always an 8-bit bitfield: csnfffff
The c bit specifies whether this node represents a character.
- If c is 1, then a character code follows this first byte.
- If c is 0, then this is a terminal node that marks the end of a word.
  - If the n bit is 0, such a terminal node is only 1 byte long (just this very first byte).
  - If the n bit is 1, the "next" pointer follows, as described below, and the node is 4 bytes long.
The s bit specifies the size of the character associated with this node.
- If s is 0, the character is one byte long.
- If s is 1, the character is a big-endian two-byte value.
The n bit specifies whether this node includes a next pointer.
- If n is 1, the character code is followed by a big-endian 3-byte number.
The fffff bits represent a number between 0 and 31 and provide a weight for the node. This is usually based on the word frequency data from the dictionary, though it may be tuned, for example by adjusting frequency depending on word length. At any node, these frequency bits represent the weight of the highest-frequency word that we can reach by traversing the node's center-pointer descendants.
If the c bit is set, the following one or two bytes (depending on the s bit) of the node are the Unicode character code that is stored in the node.
If the n bit is set, the following 3 bytes are a big-endian 24-bit unsigned integer, representing the pointer to the "next" character node. The pointer contains the byte offset from the beginning of the serialized nodes, i.e. after the character table.
If the c bit was set, the node has a character, and this means that it also has a center child, as defined by TST. We serialize the tree so that in the binary blob, the center child node always follows the current node in the blob, so we don't need to store the pointer to the center child.
As it turns out, we never need left and right nodes during prediction, so they are not serialized into the blob.
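A sketch of decoding one serialized node from a byte array, following the csnfffff bitfield described above (illustrative only):

```javascript
// Decode the node starting at `offset`; return its fields plus `end`,
// the offset where the following (center-child) node begins.
function readNode(bytes, offset) {
  var b = bytes[offset++];
  var node = {
    hasChar: !!(b & 0x80),  // c bit: is there a character?
    wideChar: !!(b & 0x40), // s bit: 2-byte character code?
    hasNext: !!(b & 0x20),  // n bit: is there a next pointer?
    freq: b & 0x1f,         // fffff: weight, 0..31
    ch: null,
    next: null
  };
  if (node.hasChar) {
    node.ch = node.wideChar ?
      (bytes[offset] << 8) | bytes[offset + 1] :
      bytes[offset];
    offset += node.wideChar ? 2 : 1;
  }
  if (node.hasNext) {
    node.next = (bytes[offset] << 16) | (bytes[offset + 1] << 8) |
                bytes[offset + 2];
    offset += 3;
  }
  node.end = offset; // the center child, if any, starts here
  return node;
}

// The first node of the serialization example below: 0x9f 0x7a is 'z'
// with frequency 31 and no next pointer.
var zNode = readNode([0x9f, 0x7a], 0);
console.log(zNode.freq); // 31
```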
TST Serialization Example
Let's serialize the TST in the "Data Structures and Algorithms" section. Suppose the normalized frequency (from 0 to 31, as described above) is:
| Word | Normalized Frequency |
|---|---|
| zila | 31 |
| zilian | 22 |
| ziliac | 7 |
| zart | 27 |
| zalarm | 17 |
| zset | 12 |
The TST will be serialized as follows (after the character table):
(Again, note that offset 0 denotes the beginning of the tree, and is byte [15 + 11*6] from the beginning of the whole blob -- 11 because the dictionary contains 11 distinct characters.)
| Node | Offset | Byte(s) | Description |
|---|---|---|---|
| Level 1 'z' | 0 | 0x9f = 0b10011111 | c=1, s=0, n=0, frequency = 31 |
| | 1 | 0x7a | chr('z') |
| Level 2 'i' | 2 | 0xbf = 0b10111111 | c=1, s=0, n=1, frequency = 31 |
| | 3 | 0x69 | chr('i') |
| | 4-6 | 0x00 0x00 0x1c | next pointer = 28, the Level 2 'a' node |
| Level 3 'l' | 7 | 0x9f = 0b10011111 | c=1, s=0, n=0, frequency = 31 |
| | 8 | 0x6c | chr('l') |
| Level 4 'a' | 9 | 0xbf = 0b10111111 | c=1, s=0, n=1, frequency = 31 |
| | 10 | 0x61 | chr('a') |
| | 11-13 | 0x00 0x00 0x0f | next pointer = 15, the Level 4 'i' node |
| terminal | 14 | 0x1f = 0b00011111 | c=0, n=0, frequency = 31; ends 'zila' |
| Level 4 'i' | 15 | 0x96 = 0b10010110 | c=1, s=0, n=0, frequency = 22 |
| | 16 | 0x69 | chr('i') |
| Level 5 'a' | 17 | 0x96 = 0b10010110 | c=1, s=0, n=0, frequency = 22 |
| | 18 | 0x61 | chr('a') |
| Level 6 'n' | 19 | 0xb6 = 0b10110110 | c=1, s=0, n=1, frequency = 22 |
| | 20 | 0x6e | chr('n') |
| | 21-23 | 0x00 0x00 0x19 | next pointer = 25, the Level 6 'c' node |
| terminal | 24 | 0x16 = 0b00010110 | c=0, n=0, frequency = 22; ends 'zilian' |
| Level 6 'c' | 25 | 0x87 = 0b10000111 | c=1, s=0, n=0, frequency = 7 |
| | 26 | 0x63 | chr('c') |
| terminal | 27 | 0x07 = 0b00000111 | c=0, n=0, frequency = 7; ends 'ziliac' |
| Level 2 'a' | 28 | 0xbb = 0b10111011 | c=1, s=0, n=1, frequency = 27 |
| | 29 | (and we encode character 'a') | |
| | 30-32 | (and the position of that "next" 's' node) | |
| ... | | | (...and the rest of the TST. As we serialize the nodes depth-first, the following nodes would be 'r', 't', 'l', 'a', 'r', 'm', 's', 'e', 't') |
Notes on Terminal Node
A terminal node (with the c bit = 0) can still have a next pointer.
For example, consider the word list:
| Word | Frequency |
|---|---|
| zilia | 127 |
| zilian | 255 |
| ziliac | 31 |
for which the TST will be:
```
[z]
 |
[i]
 |
[l]
 |
[i]
 |
[a]
 |
[n -> EOW -> c]
```
where the EOW denotes the terminal node.
We also need the terminal node to carry frequency information, so that the prediction engine knows the frequency of 'zilia' in this example.
Known Issues
Same XML, Different Dict Blob
- This was tested with Python 3.4.0 on Ubuntu 14.04 (Python 3 is required by the Makefile).
- It is possible that two runs of XML-to-dict conversion on the same wordlist XML produce different dictionary binary blobs, thus triggering git to think there is a new revision of the *.dict.
For example, the following commands can show the two blobs differing:

```
# rm en.dict if make doesn't want to make a new en.dict
$ make en.dict
$ mv en.dict en.dict.old
$ make en.dict
$ diff en.dict en.dict.old
Binary files en.dict and en.dict.old differ
```
- The difference does not affect any functionality in any way.
- We look further by examining the difference:

```
$ xxd en.dict > en.dict.xxd
$ xxd en.dict.old en.dict.old.xxd
$ diff en.dict.xxd en.dict.old.xxd
27,29c27,29
< 00001a0: 0300 e000 0000 0300 2400 0000 0200 3300  ........$.....3.
< 00001b0: 0000 0200 e400 0000 0200 3200 0000 0100  ..........2.....
< 00001c0: e500 0000 0100 eb00 0000 0100 ee00 0000  ................
---
> 00001a0: 0300 e000 0000 0300 e400 0000 0200 3300  ..............3.
> 00001b0: 0000 0200 2400 0000 0200 3200 0000 0100  ....$.....2.....
> 00001c0: eb00 0000 0100 e500 0000 0100 ee00 0000  ................
```
- We first determine that the difference is located in the character table:
  - Insert `print(output.tell())` in the `emit()` function, before and after the first for-loop.
- Now notice the actual difference: the bytes at {0x1a8 to 0x1ad} and at {0x1b4 to 0x1b9} are interchanged between the two runs: [2400 0000 0200] and [e400 0000 0200]. The same holds for the bytes at {0x1c0 to 0x1c5} and {0x1c6 to 0x1cb}.
- This is because `characterFrequency.items()` in `emit()` does not necessarily produce the same ordering on each run. It doesn't matter that `sorted()` does stable sorting: entries with equal frequencies stay in whatever relative order `items()` produced, and that order varies between runs.
- Let's insert `from pprint import pprint` and `pprint(list(characterFrequency.items()))` in the `emit()` function. The results of two runs are:
First run:

```
[('é', 220),
 ('L', 2372),
 ('-', 277),
 ('j', 2151),
 ('M', 4099),
 ('å', 1),
 ('H', 2115),
 ('W', 1544),
 ('è', 29),
 ('G', 2251),
 ('f', 15984),
 ('a', 110508),
 ('z', 5884),
 ('â', 8),
 ('v', 12864),
 ('V', 924),
 ('Y', 325),
 ('à', 3),
 ('s', 131933),
 ('ü', 17),
 ('m', 35668),
 ('n', 94375),
 ('r', 98945),
 ('w', 11092),
 ('i', 113155),
 ('o', 85820),
 ('h', 30601),
 ('R', 2268),
 ('û', 1),
 ('S', 4682),
 ('t', 85186),
 ('ï', 4),
 ('e', 147827),
 ('q', 2139),
 ('I', 1213),
 ('ë', 1),
 ('X', 105),
 ('Q', 205),
 ('l', 71483),
 ('F', 1783),
 ('A', 3463),
 ('ö', 11),
 ('K', 1575),
 ('k', 13233),
 ('g', 34650),
 ('D', 2396),
 ('N', 1525),
 ('O', 1209),
 ('P', 2837),
 ('ñ', 21),
 ('d', 45495),
 ('ê', 18),
 ('B', 3676),
 ('2', 1),
 ('3', 2),
 ('Z', 314),
 ('u', 43362),
 ('É', 3),
 ('ç', 12),
 ('ó', 7),
 ('J', 1081),
 ("'", 30168),
 ('C', 4323),
 ('c', 50164),
 ('b', 24532),
 ('E', 1684),
 ('$', 2),
 ('x', 3605),
 ('î', 1),
 ('á', 10),
 ('p', 33944),
 ('ô', 6),
 ('U', 602),
 ('T', 2515),
 ('y', 22361),
 ('ä', 2)]
```

Second run:

```
[('î', 1),
 ('ó', 7),
 ('Y', 325),
 ('å', 1),
 ('c', 50164),
 ('ä', 2),
 ('J', 1081),
 ('F', 1783),
 ('l', 71483),
 ('é', 220),
 ('U', 602),
 ('s', 131933),
 ('x', 3605),
 ('ô', 6),
 ('-', 277),
 ('ë', 1),
 ('j', 2151),
 ('ê', 18),
 ('y', 22361),
 ('i', 113155),
 ('K', 1575),
 ('r', 98945),
 ('D', 2396),
 ('ö', 11),
 ('O', 1209),
 ('k', 13233),
 ('t', 85186),
 ('B', 3676),
 ('p', 33944),
 ('h', 30601),
 ('$', 2),
 ('É', 3),
 ('u', 43362),
 ('3', 2),
 ('A', 3463),
 ('Z', 314),
 ('H', 2115),
 ('f', 15984),
 ('T', 2515),
 ('I', 1213),
 ('n', 94375),
 ('e', 147827),
 ('â', 8),
 ('b', 24532),
 ('Q', 205),
 ('ç', 12),
 ('G', 2251),
 ('W', 1544),
 ('g', 34650),
 ('è', 29),
 ("'", 30168),
 ('R', 2268),
 ('S', 4682),
 ('P', 2837),
 ('2', 1),
 ('o', 85820),
 ('ñ', 21),
 ('z', 5884),
 ('w', 11092),
 ('X', 105),
 ('ï', 4),
 ('v', 12864),
 ('V', 924),
 ('N', 1525),
 ('á', 10),
 ('û', 1),
 ('m', 35668),
 ('M', 4099),
 ('C', 4323),
 ('à', 3),
 ('L', 2372),
 ('d', 45495),
 ('ü', 17),
 ('q', 2139),
 ('E', 1684),
 ('a', 110508)]
```
- Let's look at the three tuples with frequency = 2, namely ('3', 2), ('$', 2), and ('ä', 2) (in the first run). In the second run, however, the ordering is ('ä', 2), ('$', 2), and ('3', 2). Thus, the written files are different.
- Let's go back to the xxd diff result above and recall that the difference is the interchanging of [2400 0000 0200] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Also, between these addresses is [3300 0000 0200], encoding ('3', 2). So we have a change of order between characters of the same frequency -- which does not affect any functionality, since no particular order is expected among characters of the same frequency in the character table.
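One way to avoid the nondeterminism would be to break frequency ties with a deterministic secondary key. The sketch below is in JavaScript for illustration only; the actual change would belong in xml2dict.py's sorting:

```javascript
// Break frequency ties deterministically, here by character code, so
// equal-frequency characters always serialize in the same order.
var entries = [['ä', 2], ['$', 2], ['3', 2], ['à', 3]];
entries.sort(function (a, b) {
  // primary key: frequency, descending; tie-breaker: char code, ascending
  return (b[1] - a[1]) || (a[0].charCodeAt(0) - b[0].charCodeAt(0));
});
var order = entries.map(function (e) { return e[0]; }).join('');
console.log(order); // 'à$3ä' -- the same for any input order
```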
Ideas for Future Improvements
Agglutinative Languages
- How do we predict for agglutinative languages? See bug 1119426 and bug 1113395.
- According to the bug comments, naïvely appending one word onto another in the dictionaries won't work ideally.
Combining Characters
- Combining characters aren't handled very gracefully right now. These characters take separate code points and occupy separate char positions in a string, but appear as one "letter", and may be lexically one "letter" in a specific language.
- Example: 'री' in Devanagari (the script used for Hindi) is composed of 'र' ('\u0930') and 'ी' ('\u0940').
- (Alright, I know Hindi is not covered by the Latin IME. But still...)
- Currently our engine would regard two combining characters as... two independent characters (no surprise), and they take the space of two nodes.
- However, ideally, if the two combining characters represent one single letter in a language, our engine should regard them as one character as though they're one char node.
- How?
Emoji and Non-BMP Characters
Unfortunately, not all Unicode characters fit in a 2-byte code space. The emoji blocks in the Unicode SMP are a good example: emoji characters occupy code points U+1F300 to U+1F6FF, so an emoji character cannot fit within two bytes.
- If you run xml2dict.py on a word list containing an emoji, then at the line where a character code point value is serialized into a 16-bit integer, `output.write(struct.pack(">H", ord(item[0])))`, you get a warning.
  - And the resulting dictionary blob is, of course, not usable.
- If you run word_list_converter.js on a word list containing an emoji, then -- alright, JavaScript strings are sequences of UTF-16 code units (tongue-twisting sentence, I know), so those emoji characters get broken into surrogate pairs.
- For example, this earth emoji '🌏', with code point value U+1F30F, is broken into the surrogate pair '\uD83C\uDF0F' in UCS-2.
- This will (conveniently) allow us to squeeze an emoji character into our current dictionary blob structure without modification. However, we'll still run into the same issue as with "Combining Characters": the engine needs to know that a surrogate pair, albeit taking the space of two nodes, is one single letter.
Misc
- bug 908487 may use nodes linked by "next" pointers to know which keys to enlarge.
Unclarified Questions
- Is the character table's frequency information simply not used by predictions.js?