Labs/Ubiquity/Parser 2: Difference between revisions

From MozillaWiki
< Labs‎ | Ubiquity
Jump to navigation Jump to search
No edit summary
No edit summary
 
(27 intermediate revisions by 2 users not shown)
Line 3: Line 3:
[[File:ParserTNG.jpg|640px|thumb|left|Photo of the whiteboard after Jono and mitcho's meeting on the parser design]]
[[File:ParserTNG.jpg|640px|thumb|left|Photo of the whiteboard after Jono and mitcho's meeting on the parser design]]


=Demo video=
=News=


Watch the latest demo video: http://vimeo.com/4307110
Parser 2 has shipped as part of [https://labs.mozilla.com/2009/07/ubiquity-0-5/ Ubiquity 0.5]. You can learn how to localize the parser (teach the parser the grammar of your language) in [[Labs/Ubiquity/Parser_2/Localization_Tutorial|this tutorial]].
 
Watch an older demo video: http://vimeo.com/4307110


=Intro=
=Intro=
Line 12: Line 14:
# split words/arguments + case markers
# split words/arguments + case markers
# pick possible verbs
# pick possible verbs
# (pick possible clitics - for the (near) future)
# pick possible clitics
# group into arguments (argument structure parsing)
# group into arguments (argument structure parsing)
# anaphora (magic word) substitution  
# anaphora (magic word) substitution  
# verb suggestion
# suggest normalized arguments
# suggest verbs for parses without one
# noun type detection
# noun type detection
# argument noun suggestion
# replace arguments with their nountype suggestions
# score + rank
# rank


==parser files==
==parser files==
Line 33: Line 36:


=step 1: split words/arguments + case markers=
=step 1: split words/arguments + case markers=
{{Ubiquity Parser infobox
|input = input argument (<code>Query.input</code>)
|output = updated input string <code>Query._input</code>
|configurable = <code>Parser.wordBreaker()</code>
|uses = }}
Step 1 doesn't actually split the words up into an array, but it does insert whitespace and no-width whitespace between characters in the input to facilitate future parsing.
Japanese: split on common particles... in the future get feedback from user for this
Japanese: split on common particles... in the future get feedback from user for this


Chinese: split on common functional verbs and prepositions
Chinese: split on common functional verbs and prepositions


strongly case marking languages: split off case affixes
=step 2: pick possible Verbs =
 
{{Ubiquity Parser infobox
|input = updated input (<code>Query._input</code>)
|output = verb-argument pairs <code>Query._verbArgPairs</code>
|uses = <code>Parser.verbFinder(), Parser._patternCache</code>
|configurable = }}


=step 2: pick possible Verbs =
Ubiq will cache a regexp for detection of substrings of verb names. For example: <code>(a|ad|add|add-|...|add-to-calendar|g|go|...google...)</code>
Ubiq will cache a regexp for detection of substrings of verb names. For example: <code>(a|ad|add|add-|...|add-to-calendar|g|go|...google...)</code>


Line 52: Line 69:
=step 3: pick possible clitics=
=step 3: pick possible clitics=


TODO ([http://en.wikipedia.org/wiki/Clitic#Clitics_in_Romance_languages wikipedia entry for Clitic])
[http://mitcho.com/blog/projects/solving-another-romantic-problem/ > see articles on clitic (weak pronoun) handling]
 
TODO


=step 4: group into arguments=
=step 4: group into arguments=
{{Ubiquity Parser infobox
|input = verb-argument pairs (<code>Query._verbArgPairs</code>)
|output = possible parses with incomplete arguments (<code>Query._possibleParses</code>)
|uses = <code>Parser.argFinder(), Parser.hasDelimiter(), Parser.getRoleByDelimiter, Parser._patternCache</code>
|configurable = <code>Parser.roles</code>}}
Find delimiters (see above).
Find delimiters (see above).


Line 86: Line 112:


=step 5: anaphora substitution=
=step 5: anaphora substitution=
{{Ubiquity Parser infobox
|input = <code>Query._possibleParses</code>
|output = <code>Query._possibleParses</code> (updated with additional parses)
|uses = <code>Query.context, Parser.substitute(), Parser._patternCache.anaphora</code>
|configurable = <code>Parser.anaphora</code>}}
Each language has a set of "anaphora" or "magic words", like the English <code>["this", "that", "it", "selection", "him", "her", "them"]</code>. This step will search for any occurrences of these in the parses' arguments and make substituted alternatives, if there is a selection text.
Each language has a set of "anaphora" or "magic words", like the English <code>["this", "that", "it", "selection", "him", "her", "them"]</code>. This step will search for any occurrences of these in the parses' arguments and make substituted alternatives, if there is a selection text.


=step 6: noun type detection=
=step 6: suggest normalized arguments=
 
{{Ubiquity Parser infobox
|input = <code>Query._possibleParses</code>
|output = <code>Query._possibleParses</code> (updated with additional parses)
|uses =
|configurable = <code>Parser.normalizeArgument()</code>}}
 
[http://mitcho.com/blog/projects/solving-another-romantic-problem/ > see blog post on argument normalization] and its use cases
 
For languages with a <code>normalizeArgument()</code> method, this method is applied to each argument. If any normalized alternatives are returned, a copy of the parse is made with that suggestion. Prefixes and suffixes stripped off through argument normalization is put in the <code>inactivePrefix</code> and <code>inactiveSuffix</code> properties of the argument.
 
=step 7: suggest verbs for parses without one=
 
{{Ubiquity Parser infobox
|input = parses with no verb from <code>Query._possibleParses</code>
|output = <code>Query._verbedParses</code>
|uses = <code>Parser.suggestVerb(), Query.addIfGoodEnough()</code>
|configurable = <code>Parser.normalizeArgument()</code>}}
 
For parses which don't yet have verbs, <code>Parser.suggestVerb()</code> is used to suggest a new verb whose arguments match those which have been parsed.
 
=A note on asynchronicity=
 
In order to support asynchronous nountype detection, steps 8-10 of the parser are done asynchronously, taken out of the general <code>Query._yieldingParse()</code> flow and instead put into a series of callback functions, <code>completeParse()</code> and <code>tryToCompleteParses()</code>.
 
First a list of dependencies between parses in <code>Query._verbedParses</code> and different argument strings is constructed. After each nountype detection is completed, <code>tryToCompleteParses</code> is called. If any parse has no more arguments to nountype-detect, <code>completeParse()</code> is called on those parses.
 
At the end of <code>completeParse()</code>, it checks to see if all parses have been completed and calls <code>Parser.onResults()</code> when done.
 
=step 8: noun type detection=
 
{{Ubiquity Parser infobox
|input = <code>Query._verbedParses</code>
|output = noun type suggestions in <code>nounCache</code>
|uses = <code>Parser.detectNounType()</code> and the nountypes themselves
|configurable = }}
 
For each parse, send each argument string to the noun type detector. The noun type detector will cache detection results, so it only checks each string once. This returns a list of possible noun types with their "scores".
For each parse, send each argument string to the noun type detector. The noun type detector will cache detection results, so it only checks each string once. This returns a list of possible noun types with their "scores".


<b>EX:</b>
<b>EX:</b>
  'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
  'Dan' -> [{type: noun_type_contact, text: 'Dan Benjamin', data: 'dan.benjamin@test.com', score: 1},
  'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]
{type: noun_arb_text, text: 'Dan', data: 'Dan', score: .7}]
  'my calendar' -> [{type: noun_type_service, score: 1},{type: noun_arb_text, score: .7}]
 
Nountypes that match everything (for example, <code>noun_arb_text</code>) should return a score < 1 so that parses with more specific nountype matches are suggested (see scoring below).
 
=step 9: replace arguments with nountype suggestions=
 
{{Ubiquity Parser infobox
|input = each <code>parse</code> from <code>Query._verbedParses</code> passed to it
|output = <code>Query._scoredParses</code>
|uses = <code>Parser.suggestArgs()</code>
|configurable = }}
 
Each nountype suggestion in nounCache includes a suggestion for what the argument input may be. For example, "dan" could refer to the contact "Dan Benjamin" or the text "dan". Each combination of different suggestions for different arguments is constructed for each parse. Only nountypes in the nounCache which match the target nountype for that argument (based on that verb) are used.
 
For example, a certain parse has an <code>object</code> argument ("truck") and a <code>goal</code> argument ("tok"). "truck" has just one suggestion in the nounCache ("truck" as arbitrary text) but "tok" has two: "tok" as arbitrary text and "Tokyo" as a physical location. We then yield 2 ( = 1 * 2) parses with complete arguments for that base parse:
 
object: truck, goal: tok
object: truck, goal: Tokyo
 
If the verb in this parse, however, only accepts one nountype in the <code>goal</code>—arbitrary text or location—only one of those parses will be allowed.
 
If a certain argument has no nountype suggestions at all, or none of the nountype suggestions match the nountype required by the verb, that base parse will not yield any completed parses.
 
=step 10: ranking=
 
{{Ubiquity Parser infobox
|input = <code>Query._scoredParses</code>
|output = <code>Query.hasResults, Query.suggestionList</code> (read-only)
|uses =
|configurable = }}
 
[http://mitcho.com/blog/observation/scoring-for-optimization/ > see Scoring for Optimization] which explains the overall scoring philosophy used in ranking Ubiquity suggestions.


=step 7: ranking=
The scoring of individual parses includes factors which affect the score negatively and positively. For [http://mitcho.com/blog/observation/scoring-for-optimization/ optimization reasons], all negative factors (having too many arguments per role, having to suggest the verb, missing arguments, etc.) are calculated in the <code>scoreMultiplier</code> early on in the derivation, and this <code>scoreMultiplier</code> is used in the additive calculation of positive factors.


  foreach parse (w/o V)
  score = scoreMultiplier * (\sum_{each semantic role in the verb} score(the content of that argument being the appropriate nountype) + base score (1) )
  by semantic roles in the parse, find appropriate verbs
  foreach possible verb
    score = \prod_{each semantic role in the verb} score(the content of that argument being the appropriate nountype)
    
    
<b>EX:</b>
<b>EX:</b>
Line 117: Line 216:
so...
so...


  score = P(DO is a bunch of arb) * P(with is a contact) * P(goal is a service)
scoreMultiplier = 0.3 (as we had to suggest a verb)
  = 1 * 1 * 1
  score = scoreMultiplier(base score (1) + P(DO is a bunch of arb) + P(with is a contact) * P(goal is a service)
  = 0.3 ( 1 + 1 + 1 + 1 )


so <code>score = 1</code>
so <code>score = 1.2</code>


<b>/EX</b>
<b>/EX</b>
Now lower the score for >1 direct objects:
score = score * (0.5**(#DO-1)) (example algorithm)
<b>EX:</b> <code>score = 1</code>, with 2 direct objects, so
score = 1 * (0.5**1) = 1 * 0.5 = 0.5

Latest revision as of 17:35, 13 July 2009

(formerly Parser: The Next Generation)

Photo of the whiteboard after Jono and mitcho's meeting on the parser design

News

Parser 2 has shipped as part of Ubiquity 0.5. You can learn how to localize the parser (teach the parser the grammar of your language) in this tutorial.

Watch an older demo video: http://vimeo.com/4307110

Intro

High level overview:

  1. split words/arguments + case markers
  2. pick possible verbs
  3. pick possible clitics
  4. group into arguments (argument structure parsing)
  5. anaphora (magic word) substitution
  6. suggest normalized arguments
  7. suggest verbs for parses without one
  8. noun type detection
  9. replace arguments with their nountype suggestions
  10. rank

parser files

Parser 2 files and language files are in the ubiquity/modules/parser/new directory.

The main parser file, parser.js, also has a lot of inline documentation: inline documentation in parser.js

each language will have:

  • a head-initial or head-final parameter (prepositions or postpositions, basically... this changes the way we find possible argument substrings)
  • "semantic role identifiers"/"delimiters" (currently pre/postpositions... in the future case marking prefixes/suffixes, etc.) for different semantic roles

EX: add lunch with Dan tomorrow to my calendar

step 1: split words/arguments + case markers

input: input argument (Query.input)
output: updated input string Query._input
configurable: Parser.wordBreaker()
uses:

Step 1 doesn't actually split the words up into an array, but it does insert whitespace and no-width whitespace between characters in the input to facilitate future parsing.

Japanese: split on common particles... in the future get feedback from user for this

Chinese: split on common functional verbs and prepositions

step 2: pick possible Verbs

input: updated input (Query._input)
output: verb-argument pairs Query._verbArgPairs
configurable:
uses: Parser.verbFinder(), Parser._patternCache

Ubiq will cache a regexp for detection of substrings of verb names. For example: (a|ad|add|add-|...|add-to-calendar|g|go|...google...)

Search the beginning and end of the string for a verb: ^(MAGIC) (if you have a space-lang) and (MAGIC)$. This becomes the verb and the rest of the string becomes the "argString".

This step will return a set of (V,argString) pairs. (Note, this includes one pair where V=null and argString is the whole input.)

EX:

('add','lunch with Dan tomorrow to my calendar'),
(null,'add lunch with Dan tomorrow to my calendar')

step 3: pick possible clitics

> see articles on clitic (weak pronoun) handling

TODO

step 4: group into arguments

input: verb-argument pairs (Query._verbArgPairs)
output: possible parses with incomplete arguments (Query._possibleParses)
configurable: Parser.roles
uses: Parser.argFinder(), Parser.hasDelimiter(), Parser.getRoleByDelimiter, Parser._patternCache

Find delimiters (see above).

EX: for (null,'add lunch with Dan tomorrow to my calendar'), we get:

add lunch *with* Dan tomorrow *to* my calendar
add lunch with Dan tomorrow *to* my calendar
add lunch *with* Dan tomorrow to my calendar

then move to the right of each argument (because English is head-initial... see parameter above) to get argument substrings:

for add lunch *with* Dan tomorrow *to* my calendar:

{V:    null,
 DO:   ['add lunch','tomorrow','calendar'],
 with: 'Dan'
 goal: 'my'},
{V:    null,
 DO:   ['add lunch','calendar'],
 with: 'Dan tomorrow'
 goal: 'my'},
{V:    null,
 DO:   ['add lunch','tomorrow'],
 with: 'Dan'
 goal: 'my calendar'},
{V:    null,
 DO:   ['add lunch'],
 with: 'Dan tomorrow'
 goal: 'my calendar'}

(Note: for words which are not incorporated into an oblique argument (aka "modifier argument"), they are pushed onto the DO list.)

step 5: anaphora substitution

input: Query._possibleParses
output: Query._possibleParses (updated with additional parses)
configurable: Parser.anaphora
uses: Query.context, Parser.substitute(), Parser._patternCache.anaphora

Each language has a set of "anaphora" or "magic words", like the English ["this", "that", "it", "selection", "him", "her", "them"]. This step will search for any occurrences of these in the parses' arguments and make substituted alternatives, if there is a selection text.

step 6: suggest normalized arguments

input: Query._possibleParses
output: Query._possibleParses (updated with additional parses)
configurable: Parser.normalizeArgument()
uses:

> see blog post on argument normalization and its use cases

For languages with a normalizeArgument() method, this method is applied to each argument. If any normalized alternatives are returned, a copy of the parse is made with that suggestion. Prefixes and suffixes stripped off through argument normalization is put in the inactivePrefix and inactiveSuffix properties of the argument.

step 7: suggest verbs for parses without one

input: parses with no verb from Query._possibleParses
output: Query._verbedParses
configurable: Parser.normalizeArgument()
uses: Parser.suggestVerb(), Query.addIfGoodEnough()

For parses which don't yet have verbs, Parser.suggestVerb() is used to suggest a new verb whose arguments match those which have been parsed.

A note on asynchronicity

In order to support asynchronous nountype detection, steps 8-10 of the parser are done asynchronously, taken out of the general Query._yieldingParse() flow and instead put into a series of callback functions, completeParse() and tryToCompleteParses().

First a list of dependencies between parses in Query._verbedParses and different argument strings is constructed. After each nountype detection is completed, tryToCompleteParses is called. If any parse has no more arguments to nountype-detect, completeParse() is called on those parses.

At the end of completeParse(), it checks to see if all parses have been completed and calls Parser.onResults() when done.

step 8: noun type detection

input: Query._verbedParses
output: noun type suggestions in nounCache
configurable:
uses: Parser.detectNounType() and the nountypes themselves

For each parse, send each argument string to the noun type detector. The noun type detector will cache detection results, so it only checks each string once. This returns a list of possible noun types with their "scores".

EX:

'Dan' -> [{type: noun_type_contact, text: 'Dan Benjamin', data: 'dan.benjamin@test.com', score: 1},
{type: noun_arb_text, text: 'Dan', data: 'Dan', score: .7}]
'my calendar' -> [{type: noun_type_service, score: 1},{type: noun_arb_text, score: .7}]

Nountypes that match everything (for example, noun_arb_text) should return a score < 1 so that parses with more specific nountype matches are suggested (see scoring below).

step 9: replace arguments with nountype suggestions

input: each parse from Query._verbedParses passed to it
output: Query._scoredParses
configurable:
uses: Parser.suggestArgs()

Each nountype suggestion in nounCache includes a suggestion for what the argument input may be. For example, "dan" could refer to the contact "Dan Benjamin" or the text "dan". Each combination of different suggestions for different arguments is constructed for each parse. Only nountypes in the nounCache which match the target nountype for that argument (based on that verb) are used.

For example, a certain parse has an object argument ("truck") and a goal argument ("tok"). "truck" has just one suggestion in the nounCache ("truck" as arbitrary text) but "tok" has two: "tok" as arbitrary text and "Tokyo" as a physical location. We then yield 2 ( = 1 * 2) parses with complete arguments for that base parse:

object: truck, goal: tok
object: truck, goal: Tokyo

If the verb in this parse, however, only accepts one nountype in the goal—arbitrary text or location—only one of those parses will be allowed.

If a certain argument has no nountype suggestions at all, or none of the nountype suggestions match the nountype required by the verb, that base parse will not yield any completed parses.

step 10: ranking

input: Query._scoredParses
output: Query.hasResults, Query.suggestionList (read-only)
configurable:
uses:

> see Scoring for Optimization which explains the overall scoring philosophy used in ranking Ubiquity suggestions.

The scoring of individual parses includes factors which affect the score negatively and positively. For optimization reasons, all negative factors (having too many arguments per role, having to suggest the verb, missing arguments, etc.) are calculated in the scoreMultiplier early on in the derivation, and this scoreMultiplier is used in the additive calculation of positive factors.

score = scoreMultiplier * (\sum_{each semantic role in the verb} score(the content of that argument being the appropriate nountype) + base score (1) )
  

EX:

{V:    null,
 DO:   ['add lunch','tomorrow'],
 with: 'Dan'
 goal: 'my calendar'}
'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]

"add" lexical item:

...args:{DO: arb, with: contact, goal: service}...

so...

scoreMultiplier = 0.3 (as we had to suggest a verb)
score = scoreMultiplier(base score (1) + P(DO is a bunch of arb) + P(with is a contact) * P(goal is a service)
= 0.3 ( 1 + 1 + 1 + 1 )

so score = 1.2

/EX