Personal tools

Labs/Ubiquity/Parser Documentation

From MozillaWiki

Jump to: navigation, search

Back to Labs/Ubiquity.

The parser architecture, as depicted by Jono DiCarlo.

Contents

The Epic Journey

We will now follow the journey of a string through Ubiquity's parser.

Suppose that the user enters the following into the parser:

 trans hello world to ja

Early Parsing: Finding the Verb Matches

Language-specific parser plugins

To make internationalization easier, all operations which are (human)-language-dependent are handled by language-specific plugins. For instance, in some languages the verb comes before the object, in other languages after the object, etc., so figuring out which part of the sentence is the verb is language-dependent.

There is a wiki page on writing a language-specific plugin which describes how these plugins work from the perspective of implementing one; information that's on that page won't be repeated here.

See NLParser.makeParserForLanguage() in parser.js for how the parser chooses a particular language plugin to use; this is called on initialization. See NLParser.Parser.updateSuggestionList() in parser.js for how the parser dispatches the user input to the parser plugin.

So if Ubiquity is in English mode, then the sentence is dispatched to the English-specific parser plugin for the first stage of parsing. The entry point here is the function EnParser.parseSentence() in file locale_en.js.

For an example of how parsing works differently in other languages, see file locale_jp.js, the Japanese parser plugin, which would be called instead of the English one if the language had been set to "jp".

The job of the language-specific plugin is twofold:

  1. To find the verb in the input, so that it can be matched against command names
  2. To generate every valid assignment of input substrings to arguments of the command.

Where's the Verb?

Once inside the language-specific plugin, things might happen in a different order depending on the language, but we'll keep focusing on the English case for now.

The English parser starts by finding the verb. Currently it does this by splitting the input string on spaces, and taking the first word (i.e. everything up to the first space) as the verb. In our example, "trans" is our verb, and ["hello", "world", "to", "ja"] is the rest of our sentence.

(Note: this simplistic way of getting the verb currently limits us to having only single-word command names. It needs to be changed to support multi-word commands. We also need to consider whether we want to support the case of the user entering the object first and the verb afterward.)

Looking for a Verb Match

EnParser.parseSentence() in local_en.js

The parser maintains a list of every command that's installed in Ubiquity. When parsing input, the main parser passes this list to the language-specific plugin along with the input and the selection (if any).

The next step in the parsing process is to take the input verb (which we found in the last step) and compare it to the name of each command to figure out which ones are potential matches. (Note: this part is actually not language-specific, but it's done in the language-specific plugin for reasons of keeping the code relatively simple. In the future we may factor it out to the main parser.)

Each verb is scored against the input string, using the metric described in Scoring the Quality of the Verb Match. These scores will be used later to sort the suggestions in order from best to worst, but for now we only care whether a verb gets a score greater than zero. Verbs that get a match quality score of 0 are discarded at this point, while anything with a positive number -- meaning it matches at all, no matter how tenuously -- continues to the next step.

Mid-Stage Parsing: Assigning the Arguments

the PartiallyParsedSentence object

The NLParser.PartiallyParsedSentence object, defined in parser.js, represents a particular way of assigning input substrings to the arguments of a particular verb.

For example, the translate verb has these arguments:

  • Direct object (the string to translate) which can be any string.
  • "to", the language to translate to, which is noun_type_language so it can only be a language name.
  • "from", the language to translate from, also noun_type_language.

So when given the following input:

trans hello world to ja

there are two PartiallyParsedSentences that can be created:

  1. Direct object is "hello world to ja". "to" and "from" arguments undefined.
  2. Direct object is "hello world", "to" argument is "ja", "from" argument undefined.

The PartiallyParsedSentence class encapsulates the verb, a mapping of strings to arguments, and various logic for dealing with these things. It also handles selection interpolation, about which more later.

Assigning input substrings to command arguments

_recursiveParse() in locale_en.js

This is language specific, because it depends on the grammar of the language. In English, we expect prepositions to come before nouns -- that is, we expect the word "to" to come before the language that the user wants to translate to. So any substring that follows the word "to" could potentially be assigned to the "to" argument.

In the code, arguments that are identified by prepositions like "to" or "from" are often called modifiers. An argument that just comes after the verb, with no preposition, is called a direct object. Many verbs, like translate, have both modifiers and direct objects.

Since the direct object isn't marked by a preposition, that means anything that comes after the verb could be part of the direct object.

To further complicate things, arguments (both direct object and modifiers) can occur in any order.

(TODO: More about how the argument assignment works in en)

Mid-stage Parsing: Using the Selection

There are two ways that the user's selection can be incorporated into the input: explicitly, by the use of a "magic pronoun" like "this"; or implicitly/automatically, because the user has left an argument blank.

Interpolating the selection in place of magic pronouns

NLParser.PartiallyParsedSentence._suggestWithPronounSub() in parser.js

The algorithm used here is generic, so it's in the core parser; but the list of "magic pronouns" is language-specific, and therefore imported from the language plugin. For instance, here's the list of English magic pronouns imported from ubiquity/modules/parser/locale_en.js:

EnParser.PRONOUNS = ["this", "that", "it", "selection", "him", "her", "them"];

If any of those words appear in the input, the PartiallyParsedSentence will attempt to substitute the selection for the word. So if the user selects:

your mom

and enters:

trans this to spanish

the suggestions will be:

  • translate your mom to spanish
  • translate this to spanish

For each argument that has a substring assigned, the PartiallyParsedSentence constructor calls _suggestWithPronounSub() once to generate suggestions for that argument based on substituting the selection (i.e. the first suggestion above), and it also calls _argSuggest() once to generate suggestions for that argument based on the literal input substring without substitution (i.e. the second suggestion above.)

_suggestWithPronounSub() goes through every pronoun in the language-specific PRONOUNS list, and produces one argument suggestion for each match that it finds.

There may be multiple magic pronouns; for example, if the user has selected something and entered the following:

twitter look at this iterator it is cool

Then we generate one suggestion based on substituting the selection for "this", another suggestion based on substituting the selection for "it", and a suggestion with no interpolation at all.

But we don't substitute the selection for the "it" that is part of the word "iterator", because we're using a regexp that only matches at word boundaries.

Because the substitution happens before the substring is passed to a noun_type for argument suggestion, it's possible to -- for instance -- select "ja", say "translate hello world to this", and get a suggestion of "translate hello world to japanese".

Filling missing arguments with the selection

NLParser.PartiallyParsedSentence.getAlternateSelectionInterpolations() in parser.js

Even if the user doesn't use any magic pronouns, the parser will get the selection and try it out for each un-filled argument, to see if it fits anywhere.

This is not language-specific, and is handled in the core parser.

For example, if the user has selected the text:

your mom

and typed this into ubiquity:

trans

There's only one possible PartiallyParsedSentence, and it leaves all the arguments blank. The "to" and "from" arguments are noun_type_language, so we pass "your mom" to noun_type_language.suggest(); we get back no suggestions, because that doesn't match any language name. However, the direct object is noun_arb_text, and when we pass "your mom" to noun_arb_text.suggest(), we get back a single suggestion, which is {text: "your mom"}.

So we produce one FullyParsedSentence based on this argument suggestion:

verb: Translate; direct object: {text: "your mom"}

The user sees this as the one suggestion in the suggestion list:

translate your mom (to language) (from language)


If a noun type produces no suggestions based on the selection, this doesn't invalidate the PartiallyParsedSetence; we just don't put the selection there.

If more than one empty argument can take the selection, we produce one FullyParsedSentence for each possible placement of the selection. So if the user selects the word:

spanish

and then types into ubiquity:

trans

the selection is valid for all three arguments, so we will get these suggestions:

  • translate (text to translate) to spanish (from language)
  • translate (text to translate) (to language) from spanish
  • translate spanish (to language) (from language)

We assume that the selection goes into one argument at most -- we don't fill multiple arguments with the same selection.

Text selection vs. HTML selection

(TODO)

Noun-First Parsing

If there is no verb match: Noun-First Suggestions

NLParser.Parser.updateSuggestionList() in parser.js

NLParser.Parser.nounFirstSuggestions() in parser.js

If the input verb doesn't match *any* command names at all, the parser will try to look for a noun match. That is, the parser will assume that the user has skipped typing the verb and gone straight to typing an argument. It will treat the entire input string as a noun, and produce a list of verbs that could possibly take that noun as an argument.

In practice, many commands take arguments of type noun_arb_text, meaning that they can accept *any input whatsoever* as an argument. That means that these commands will always appear as part of valid verb list when doing noun-first suggestion. This isn't the most useful thing, so the parser draws a somewhat arbitrary distinction at this point between *generic* verbs (verbs that take any string whatsoever) and *specific* verbs (verbs that accept only some arguments and reject others) -- for instance the "tab" command, which accepts only names of open tabs as arguments. When doing noun-first suggestions, specific verbs that can accept the input are always ranked over generic verbs.

Generic verbs are ranked by the frequency with which they have been chosen as noun-first completions.

Verb quality matches play no part in ranking noun-first suggestions, because there is no verb match quality to compare. Only the frequency of past usage and the quality of the noun matches make any difference.

As an optimization, the parser doesn't actually try out every generic verb. Instead, it asks the Suggestion Memory module (see also) for the top N most often used generic verbs, and compares only to those. This doesn't make a difference in the final suggestions because less-frequently-used generic verbs wouldn't make it into the top of the list anyway; skipping them speeds things up.

If the input string is empty

NLParser.Parser.updateSuggestionList() in parser.js

Then the parser looks at the user selection. If it's non-empty, then we assume that the user is looking for suggestions of verbs that could operate on this selection. This case is then handled exactly like the case above, with the sole difference that the "argument" is coming from the selection and not from the typed input.


In the unlikely event that the input and the selection are both empty, then the parser returns an empty suggestion list.

Late Parsing: Noun Suggestions, FullyParsedSentences

Getting suggestions from NounTypes

Find out more about the built-in noun-types.

See also Writing your own Noun-Types on the command author tutorial page.

Noun-types are also language-specific. The built-in ones are defined in ubiquity/feed-parts/header/<language code>/nountypes.js -- e.g. ubiquity/feed-parts/header/en/nountypes.js

The parser uses the nountypes in NLParser.PartiallyParsedSentence._argSuggest().

(TODO)

Asynchronous NounType Suggestions

(TODO!)

Rejecting parsings with invalid argument assignments

If a NounType is given input, and it returns an empty suggestion list, then that noun assignment is considered invalid. This invalidates the PartiallyParsedSentence, and no fully ParsedSentences are created from it.

For example, if the input was

trans hello world to blarg

Then the PartiallyParsedSentences are:

  1. verb = TRANSLATE, directObject = "hello world to blarg", to = "", from = ""
  2. verb = TRANSLATE, directObject = "hello world", to = "blarg", from = ""

The "to" argument is noun_type_language. When it's given input "blarg", it produces no suggestions. Therefore the PartiallyParsedSentence that assigns the input string "blarg" to the "to" argument is considered invalid and dropped.

Note the difference between an invalid argument assignment, and leaving an argument unfilled. A PartiallyParsedSentence that leaves an argument unfilled is still valid. The above input would still produce a suggestion that treats "hello world to blarg" as the string to be translated, with the languages yet to be specified.

Turning PartiallyParsedSentences into FullyParsedSentences

The difference is that a PartiallyParsedSentence assigns a substring to each argument of the verb; a FullyParsedSentence assigns a noun suggestion to each argument of the verb. To turn one into the other involves giving the substring to a NounType and getting back a list of noun suggestions; we can then instantiate one FullyParsedSentence for each suggestion.

In practical terms, if our PartiallyParsedSentence is this:

Verb: Translate; Direct object: "hello, world"; To: "ja"

Then noun_type_language.suggest() might produce two suggestions based on "ja":

  • {text:"japanese", data: "jp"}
  • {text:"javanese", data: "jv"}

(This is a bad example because we don't actually have Javanese implemented, but bear with me.) Meanwhile, the type of the direct object, noun_arb_text, produces just one suggestion:

{text:"hello, world"}

So we now produce two FullyParsedSentences which are:

  • Verb: Translate; Direct object: {text:"hello, world"}; To: {text:"japanese"}
  • Verb: Translate; Direct object: {text:"hello, world"}; To: {text:"javanese"}

If you've written commands, you probably recognize these noun suggestions -- they are the objects that end up getting passed in to your .execute() and .preview() methods:

 CreateCommand({
   name: "translate",
   execute: function( directObj, languages ) {
     var toLangCode = languages.to.data  || this._getDefaultLang();
     // assuming user picked first suggestion, languages.to.data has value "jp"
     var fromLang = languages.from.data || "";

     translateTo( directObj.text, {to:toLangCode} );
     // assuming user picked first suggestion, directObj.text has value "hello, world"
 // etc.


There's no limit to how many arguments a verb can have, nor is there a limit on how many suggestions a noun_type can produce. If a PartiallyParsedSentence has three arguments, and each argument's noun_type returns three suggestions, then the algorithm must create a FullyParsedSentence for each possible combination, so there are 3 x 3 x 3 = 27 FullyParsedSentences to instantiate. (Which will later be ranked, sorted, and trimmed down to just the top few.) This is a pathological edge case; in the vast majority of situations there is at most one noun_type returning multiple suggestions. Nevertheless, we have to handle this stuff, which is why the code for turning PartiallyParsed into FullyParsed is so complex.

Filling missing arguments with defaults

NLParser.ParsedSentence.fillMissingArgsWithDefaults() in parser.js

At this point, we've filled in all the arguments we can fill based on stuff the user has typed and/or selected. There may still be arguments unfilled; if so, we'll attempt to fill them with default suggestions.

There are two sources that default suggestions can come from. The first place we check is the verb's argument definitions. There's a (not yet widely used) feature of the command creation API that lets verbs define defaults for its arguments, like so: (TODO)

If there's no verb-defined default, we look for a default defined by the noun type. For example, noun_type_date ( in ubiquity/feed-parts/header/en/nountypes.js ) defines a default() function that returns a suggestion for the current date.

A feature that is not yet implemented, but should be, is the ability for a nountype to return multiple defaults instead of just a single default. It might be useful, for instance, to have noun_type_language return as default suggestions the two or three languages that you use most often.

If no default is available from either source, the argument remains empty.

Scoring and Sorting Suggestions

Scoring the Quality of the Verb Match

(NLParser.Verb.match() in parser.js)

The current heuristic is extremely ad-hoc, but produces the ordering we want... so far.

When attempting to match an input verb to a command name, the parser tries the following types of matches:

  1. A perfect match between the input word and the verb name earns the maximum score, 1.0.
  2. If the input word matches the beginning of the verb name, e.g. input is "trans", command is "translate": it gets a good score, between 0.75 and 1.0. The more percentage of the verb name that is matched, the higher the score is within this range. (All else being equal, this will rank short verb names before longer ones; I'm not sure this is really needed or correct.)
  3. If the input word matches the verb name, but not at the beginning, (e.g.: input is "cal", verb is "add-to-calendar") then it gets a score between 0.5 and 0.75, again scaled by the percentage of the verb name that is matched. Match in the middle is not as good as match at the beginning, because it's assumed that if the user types "ans" they are more likely to want "answers" than "translate".
  4. If the input word matches the beginning of a synonym of the verb name, it gets a score between 0.25 and 0.5. For example, "tweet" is a synonym for "twitter", so if the input is "twee" then "twitter" will go in this category. (Note this means that any match to a synonym is ranked below any match to a primary name. This is to prevent synonyms from colonizing the namespace.)
  5. If the input word matches a synonym of the verb name, but not at the beginning, it gets a score between 0 and 0.5. For example, matching "eet" to "twitter" because it matches the end of "tweet". This type of match is considered a real long-shot, but better than nothing.

No match at all gets a 0.

The algorithm doesn't currently recognize disjoint or out-of-order matches. E.g. if the user typed "add cal", they might mean "add-to-calendar", and we might detect that fact if we did disjoint matches, but we don't so this will get a score of 0. It might be worth experimenting with matches like this, but how to rank them?

Scoring the Frequency of the Verb Choice

( NLParser.Parser._sortSuggestionList() in parser.js )


(TODO)

Scoring the Quality of the Noun Matches

NLParser.ParsedSentence._init() in parser.js

Each ParsedSentence gets assigned a third score: noun match quality. This is distinct from its verb match quality and its frequency score.

Once again, we draw a distinction between noun_arb_text (the type of argument that can accept any input whatsoever) and more discriminating noun types. Currently a ParsedSentence gets +1 to its noun match quality for every specific noun type argument that is filled by the input.

So for example, in parsing "trans hello world to ja", we have these three fully parsed sentences:

  1. verb = TRANSLATE, direct object = "hello world to ja", to = unfilled, from = unfilled
  2. verb = TRANSLATE, direct object = "hello world", to = "japanese", from = unfilled
  3. verb = TRANSLATE, direct object = "hello world", to = "javanese", from = unfilled

The direct object is worth no points because it's arbitrary text. The "to" and "from" arguments are a specific noun type (language) and therefore worth one point if filled. So ParsedSentence number 1 gets 0 points for noun match quality, while ParsedSentences number 2 and number 3 get 1 point each.

This is an extremely simplistic noun match quality ranking. Replacing it with a real measurement, one that takes into account string distance or past choice frequency of noun suggestions, is a major opportunity for future evolution of the parser.

Sorting the suggestions and returning them

NLParser.FullyParsedSentence.getMatchScores() in parser.js

NLParser.Parser._sortSuggestionList() in parser.js


The relative importance of the various scores depends on whether the suggestions were generated by the verb-first or by the noun-first parsing strategy. (It's guaranteed that all the suggestions in the suggestion list at any given time were either all generated verb-first or all generated noun-first.)

Verb-first suggestions are sorted thus:

  1. first by the frequency of the verb choice
  2. Ties on frequency are broken by verb match quality score.
  3. Ties on verb match quality score are broken by noun match quality score.

Multiple suggestions based on the same verb always tie on frequency and verb match quality for obvious reasons, so within the same verb suggestions end up being sorted only by noun match quality.

Noun-first suggestions have no verb match quality score. They have a verb choice frequency score, but this has to be calculated differently because there is no verb substring in the input to compare with the suggestion memory. Noun-first suggestions are sorted thus:

  1. first by noun match quality score
  2. Ties on noun match quality are broken by frequency of the verb choice

Because suggestions only get noun match quality from specific noun types (i.e. not noun_arb_text), this sorting order ensures that matches to specific data types in the input or selection get bumped to the top. For instance, try entering this:

today

You'll see that verbs that take dates as arguments (e.g. "check-calendar") appear at the top of the list, with your most commonly-used generic verbs (google, wikipedia, etc) under that.

Binding arguments to the FullyParsedSentences

Remember that the underlying Translate command object (defined in ubiquity/standard-feeds/general.js) has methods like:

 execute: function( directObj, languages )
 preview: function( pBlock, directObj, languages )

(where "languages" is a dictionary of modifier arguments, i.e. the "to" and "from" arguments)

A FullyParsedSentence object wraps a command object along with specific values for the directObj and modifier arguments. So client code can call .execute() or .preview() methods on a FullyParsedSentence object without having to supply values for the directObject or modifiers -- you can think of those values as already having been "bound". Instead, the function signatures look like this:

 execute: function( context )
 preview: function( context, previewBlock )


See ubiquity/modules/cmdmanager.js to see an example of client code using the external API of the FullyParsedSentence objects returned as suggestions from NLParser.Parser.getSuggestionList() and NLParser.Parser.getSentence().