Labs/Ubiquity/Parser 2

(formerly Parser: The Next Generation)

Photo of the whiteboard after Jono and mitcho's meeting on the parser design

Demo video

Watch the latest demo video: http://vimeo.com/4307110

Intro

High level overview:

  1. split words/arguments + case markers
  2. pick possible verbs
  3. pick possible clitics
  4. group into arguments (argument structure parsing)
  5. anaphora (magic word) substitution
  6. suggest normalized arguments
  7. suggest verbs for parses without one
  8. noun type detection
  9. replace arguments with their nountype suggestions
  10. rank

parser files

Parser 2 files and language files are in the ubiquity/modules/parser/new directory.

The main parser file, parser.js, also has a lot of inline documentation.

Each language will have (see the sketch after this list):

  • a head-initial or head-final parameter (prepositions or postpositions, basically... this changes the way we find possible argument substrings)
  • "semantic role identifiers"/"delimiters" (currently pre/postpositions... in the future case marking prefixes/suffixes, etc.) for different semantic roles

EX: add lunch with Dan tomorrow to my calendar

step 1: split words/arguments + case markers

Japanese: split on common particles... in the future, get feedback from the user on this

Chinese: split on common functional verbs and prepositions

strongly case marking languages: split off case affixes
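
For a space-delimited language like English, this step can be very simple. A minimal JavaScript sketch (assuming plain whitespace splitting; the real parser and the languages above are more involved):

  // Illustrative only: split a space-delimited input into words.
  function splitWords(input) {
    return input.trim().split(/\s+/);
  }

  // splitWords('add lunch with Dan tomorrow to my calendar')
  //   -> ['add', 'lunch', 'with', 'Dan', 'tomorrow', 'to', 'my', 'calendar']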

step 2: pick possible verbs

Ubiquity will cache a regexp for detecting substrings of verb names. For example: (a|ad|add|add-|...|add-to-calendar|g|go|...google...)

Search the beginning and end of the string for a verb: ^(MAGIC) (followed by a space, in space-delimited languages) and (MAGIC)$. This match becomes the verb and the rest of the string becomes the "argString".

This step will return a set of (V,argString) pairs. (Note, this includes one pair where V=null and argString is the whole input.)

EX:

('add','lunch with Dan tomorrow to my calendar'),
(null,'add lunch with Dan tomorrow to my calendar')
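
A rough JavaScript sketch of the cached verb regexp and the start/end search (illustrative assumptions, not the parser's actual code):

  // Build a regexp source matching any prefix of any verb name, e.g.
  // (a|ad|add|add-|...|add-to-calendar|g|go|...|google). Illustrative only.
  function makeVerbRegexpSource(verbNames) {
    var prefixes = {};
    verbNames.forEach(function (name) {
      for (var i = 1; i <= name.length; i++)
        prefixes[name.slice(0, i)] = true;
    });
    return Object.keys(prefixes)
      .map(function (p) { return p.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); })
      .join('|');
  }

  // Return (V, argString) pairs, always including the V = null pair.
  function pickVerbs(input, verbNames) {
    var magic = makeVerbRegexpSource(verbNames);
    var pairs = [{V: null, argString: input}];
    var head = input.match(new RegExp('^(' + magic + ') (.*)$'));
    if (head) pairs.push({V: head[1], argString: head[2]});
    var tail = input.match(new RegExp('^(.*) (' + magic + ')$'));
    if (tail) pairs.push({V: tail[2], argString: tail[1]});
    return pairs;
  }

  // pickVerbs('add lunch with Dan tomorrow to my calendar', ['add', 'google'])
  //   -> [{V: null, argString: 'add lunch with Dan tomorrow to my calendar'},
  //       {V: 'add', argString: 'lunch with Dan tomorrow to my calendar'}]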

step 3: pick possible clitics

TODO (wikipedia entry for Clitic)

step 4: group into arguments

Find delimiters (see above).

EX: for (null,'add lunch with Dan tomorrow to my calendar'), we get:

add lunch *with* Dan tomorrow *to* my calendar
add lunch with Dan tomorrow *to* my calendar
add lunch *with* Dan tomorrow to my calendar

then move to the right of each delimiter (because English is head-initial... see the parameter above) to get the possible argument substrings:

for add lunch *with* Dan tomorrow *to* my calendar:

{V:    null,
 DO:   ['add lunch','tomorrow','calendar'],
 with: 'Dan',
 goal: 'my'},
{V:    null,
 DO:   ['add lunch','calendar'],
 with: 'Dan tomorrow',
 goal: 'my'},
{V:    null,
 DO:   ['add lunch','tomorrow'],
 with: 'Dan',
 goal: 'my calendar'},
{V:    null,
 DO:   ['add lunch'],
 with: 'Dan tomorrow',
 goal: 'my calendar'}

(Note: words which are not incorporated into an oblique argument (aka "modifier argument") are pushed onto the DO list.)
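
As a simplified JavaScript sketch of this grouping, the code below builds just the last of the parses above, treating every delimiter occurrence as a role marker and letting each argument run to the next delimiter; the real step also tries shorter argument substrings and skips delimiters, which is what yields all of the alternatives (the role table is an illustrative assumption):

  // Illustrative sketch only: one candidate grouping for a head-initial language.
  var roleDelimiters = {with: 'with', goal: 'to'};   // assumed subset of English roles

  function groupArguments(argString) {
    var words = argString.split(' ');
    var parse = {V: null, DO: []};
    var currentRole = 'DO';
    var buffer = [];

    function flush() {
      if (!buffer.length) return;
      var phrase = buffer.join(' ');
      if (currentRole === 'DO') parse.DO.push(phrase);
      else parse[currentRole] = phrase;
      buffer = [];
    }

    words.forEach(function (word) {
      var role = null;
      for (var r in roleDelimiters)
        if (roleDelimiters[r] === word) role = r;
      if (role) {       // head-initial: the argument follows its delimiter
        flush();
        currentRole = role;
      } else {
        buffer.push(word);
      }
    });
    flush();
    return parse;
  }

  // groupArguments('add lunch with Dan tomorrow to my calendar')
  //   -> {V: null, DO: ['add lunch'], with: 'Dan tomorrow', goal: 'my calendar'}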

step 5: anaphora substitution

Each language has a set of "anaphora" or "magic words", like the English ["this", "that", "it", "selection", "him", "her", "them"]. This step will search for any occurrences of these in the parses' arguments and make substituted alternatives, if there is a selection text.
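
A minimal JavaScript sketch of that substitution, under the assumption that string-valued arguments are checked word by word (array-valued arguments like DO are skipped here for brevity):

  // Illustrative sketch: for every magic word found in an argument, add a copy
  // of the parse with the selection text substituted in.
  var anaphora = ['this', 'that', 'it', 'selection', 'him', 'her', 'them'];

  function substituteAnaphora(parse, selection) {
    if (!selection) return [parse];                 // no selection, nothing to do
    var alternatives = [parse];
    for (var role in parse) {
      if (role === 'V' || typeof parse[role] !== 'string') continue;
      anaphora.forEach(function (word) {
        if (parse[role].split(' ').indexOf(word) === -1) return;
        var copy = JSON.parse(JSON.stringify(parse));
        copy[role] = parse[role].split(' ')
          .map(function (w) { return w === word ? selection : w; })
          .join(' ');
        alternatives.push(copy);
      });
    }
    return alternatives;
  }

  // substituteAnaphora({V: 'add', goal: 'this'}, 'lunch with Dan')
  //   -> [{V: 'add', goal: 'this'}, {V: 'add', goal: 'lunch with Dan'}]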

step 6: suggest normalized arguments

See the blog post on argument normalization (http://mitcho.com/blog/projects/solving-another-romantic-problem/) and its use cases.

For languages with a normalizeArgument() method, this method is applied to each argument. If any normalized alternatives are returned, a copy of the parse is made with that suggestion. Prefixes and suffixes stripped off through argument normalization are put in the inactivePrefix and inactiveSuffix properties of the argument.
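
A hedged JavaScript sketch of that flow; the return shape of normalizeArgument() and the article-stripping rule below are illustrative assumptions, not the real language code:

  // Illustrative stand-in for a language's normalizeArgument(): strip a leading
  // article and report what was removed. The return shape is an assumption.
  function normalizeArgument(text) {
    var match = text.match(/^(the )(.+)$/);
    return match ? [{prefix: match[1], text: match[2], suffix: ''}] : [];
  }

  // For each normalized alternative, copy the parse and keep the stripped-off
  // material in inactivePrefix/inactiveSuffix on the argument.
  function suggestNormalizedArguments(parse, role) {
    return normalizeArgument(parse[role].text).map(function (alt) {
      var copy = JSON.parse(JSON.stringify(parse));
      copy[role] = {
        text: alt.text,
        inactivePrefix: alt.prefix,
        inactiveSuffix: alt.suffix
      };
      return copy;
    });
  }

  // suggestNormalizedArguments({V: 'add', goal: {text: 'the calendar'}}, 'goal')
  //   -> [{V: 'add', goal: {text: 'calendar', inactivePrefix: 'the ', inactiveSuffix: ''}}]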

step 8: noun type detection

For each parse, send each argument string to the noun type detector. The noun type detector will cache detection results, so it only checks each string once. This returns a list of possible noun types with their "scores".

EX:

'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]
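
A small JavaScript sketch of the caching described above; the nountype objects and their suggest() scoring API are assumptions for illustration:

  // Illustrative caching detector: each distinct string is scored against the
  // available nountypes only once, then served from the cache.
  var nounTypeCache = {};

  function detectNounTypes(text, nounTypes) {
    if (text in nounTypeCache) return nounTypeCache[text];
    var suggestions = [];
    nounTypes.forEach(function (nt) {
      var score = nt.suggest(text);   // assumed: returns a score between 0 and 1
      if (score > 0) suggestions.push({type: nt.name, score: score});
    });
    nounTypeCache[text] = suggestions;
    return suggestions;
  }

  // With made-up contact/arb nountypes:
  //   detectNounTypes('Dan', nounTypes) -> [{type: 'contact', score: 1}, {type: 'arb', score: 0.7}]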

step 9: replace arguments with nountype suggestions

step 10: ranking

  foreach parse (w/o V)
    by the semantic roles in the parse, find appropriate verbs
    foreach possible verb
      score = product over each semantic role in the verb of
              score(that argument's content being the appropriate nountype)
  

EX:

{V:    null,
 DO:   ['add lunch','tomorrow'],
 with: 'Dan',
 goal: 'my calendar'}
'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]

"add" lexical item:

...args:{DO: arb, with: contact, goal: service}...

so...

score = P(DO is a bunch of arb) * P(with is a contact) * P(goal is a service)
= 1 * 1 * 1

so score = 1

/EX

Now lower the score for parses with more than one direct object:

score = score * (0.5**(#DO-1)) (example algorithm)

EX: score = 1, with 2 direct objects, so

score = 1 * (0.5**1) = 1 * 0.5 = 0.5
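
Putting the pieces together, a JavaScript sketch of this scoring for one (parse, verb) pair, using the made-up data shapes from the examples above (not the parser's real structures):

  // Illustrative scoring sketch: multiply, over each semantic role the verb
  // expects, the score of that argument's content being the expected nountype,
  // then apply the example penalty for extra direct objects.
  function scoreParse(parse, verb, nounTypeScores) {
    var score = 1;
    for (var role in verb.args) {
      var expected = verb.args[role];                       // e.g. 'arb', 'contact', 'service'
      var contents = role === 'DO' ? parse.DO : [parse[role]];
      contents.forEach(function (text) {
        var match = (nounTypeScores[text] || []).filter(function (s) {
          return s.type === expected;
        })[0];
        score *= match ? match.score : 0;
      });
    }
    // example algorithm from above: halve the score for each direct object past the first
    score *= Math.pow(0.5, Math.max(parse.DO.length - 1, 0));
    return score;
  }

  // With the 'add' lexical item ({DO: 'arb', with: 'contact', goal: 'service'}),
  // the example nountype scores, and each DO element scoring 1 as arb (as in the
  // worked example), the parse
  //   {V: null, DO: ['add lunch', 'tomorrow'], with: 'Dan', goal: 'my calendar'}
  // scores 1 before the penalty and 1 * 0.5**1 = 0.5 after it.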