Ubiquity Parser 2
(formerly Parser: The Next Generation)
Demo video
Watch the latest demo video: http://vimeo.com/4307110
Intro
High-level overview:
- (split words/arguments)
- pick possible V's
- (pick possible clitics - for the (near) future)
- group into arguments
- anaphora (magic word) substitution
- noun type detection
- rank
Each language will have:
- a head-initial or head-final parameter (basically, prepositions or postpositions); this changes the way we find possible argument substrings
- "semantic role identifiers"/"delimiters" (currently pre/postpositions... in the future case marking prefixes/suffixes, etc.) for different semantic roles
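As a rough illustration, the two per-language settings listed above could be held in a small record. This is only a sketch: the `LanguageParams` class, its field names, the `argument_side` helper, and the role names assigned to each delimiter are all assumptions made for this example, not part of the parser.

```python
from dataclasses import dataclass


@dataclass
class LanguageParams:
    """Hypothetical container for the per-language settings."""
    head_initial: bool   # True: prepositions, False: postpositions
    delimiters: dict     # delimiter word -> semantic role (illustrative)


# Illustrative values only; the role names are assumptions.
ENGLISH = LanguageParams(head_initial=True,
                         delimiters={"with": "with", "to": "goal"})
JAPANESE = LanguageParams(head_initial=False,
                          delimiters={"と": "with", "に": "goal"})


def argument_side(lang):
    """Side of a delimiter on which its argument is expected."""
    return "right" if lang.head_initial else "left"
```

With these values, `argument_side(ENGLISH)` gives `"right"` and `argument_side(JAPANESE)` gives `"left"`.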
EX: add lunch with Dan tomorrow to my calendar
step 1: split words/arguments + case markers
Japanese: split on common particles (in the future, get feedback from the user for this)
Chinese: split on common functional verbs and prepositions
strongly case marking languages: split off case affixes
step 2: pick possible Verbs
Ubiq will cache a regexp for detection of substrings of verb names. For example: (a|ad|add|add-|...|add-to-calendar|g|go|...google...)
Search the beginning and end of the string for a verb: ^(MAGIC) (if you have a space-lang) and (MAGIC)$, where MAGIC is the cached regexp above. This becomes the verb and the rest of the string becomes the "argString".
This step will return a set of (V,argString) pairs. (Note, this includes one pair where V=null and argString is the whole input.)
EX:
('add','lunch with Dan tomorrow to my calendar'), (null,'add lunch with Dan tomorrow to my calendar')
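A minimal sketch of this step, assuming a space-delimited language and a made-up verb list. The names `VERB_NAMES`, `build_verb_pattern` and `verb_candidates` are hypothetical, and a real implementation would cache the compiled pattern rather than rebuild it on every call.

```python
import re

# Hypothetical verb list, for illustration only.
VERB_NAMES = ["add", "add-to-calendar", "go", "google"]


def build_verb_pattern(names):
    """Build one alternation matching every prefix of every verb name."""
    prefixes = {name[:i] for name in names for i in range(1, len(name) + 1)}
    # Longest alternatives first, so the regexp prefers the longest match.
    ordered = sorted(prefixes, key=len, reverse=True)
    return "(" + "|".join(re.escape(p) for p in ordered) + ")"


def verb_candidates(text, names=VERB_NAMES):
    """Return (V, argString) pairs, including the V=None pair."""
    magic = build_verb_pattern(names)
    pairs = [(None, text)]
    # Verb at the beginning of the input, followed by whitespace.
    start = re.match("^" + magic + r"\s+(.*)$", text)
    if start:
        pairs.append((start.group(1), start.group(2)))
    # Verb at the end of the input, preceded by whitespace.
    end = re.match(r"^(.*)\s+" + magic + "$", text)
    if end:
        pairs.append((end.group(2), end.group(1)))
    return pairs
```

For the running example, `verb_candidates("add lunch with Dan tomorrow to my calendar")` yields the two pairs shown in the EX above.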
step 3: pick possible clitics
TODO (wikipedia entry for Clitic)
step 4: group into arguments
Find delimiters (see above).
EX: for (null,'add lunch with Dan tomorrow to my calendar'), we get:
add lunch *with* Dan tomorrow *to* my calendar
add lunch with Dan tomorrow *to* my calendar
add lunch *with* Dan tomorrow to my calendar
Then move to the right of each delimiter (because English is head-initial; see the parameter above) to get argument substrings:
for add lunch *with* Dan tomorrow *to* my calendar:
{V: null, DO: ['add lunch','tomorrow','calendar'], with: 'Dan', goal: 'my'},
{V: null, DO: ['add lunch','calendar'], with: 'Dan tomorrow', goal: 'my'},
{V: null, DO: ['add lunch','tomorrow'], with: 'Dan', goal: 'my calendar'},
{V: null, DO: ['add lunch'], with: 'Dan tomorrow', goal: 'my calendar'}
(Note: words which are not incorporated into an oblique argument (aka "modifier argument") are pushed onto the DO list.)
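The grouping for a head-initial language can be sketched as follows. This is an illustration under assumptions, not the parser's actual code: the `ROLES` table and both function names are hypothetical, each delimiter is assumed to take at least one word to its right, and a delimiter word that occurs twice in the input is not handled.

```python
from itertools import combinations, product

# Hypothetical delimiter -> semantic role table for English.
ROLES = {"with": "with", "to": "goal"}


def delimiter_choices(words):
    """Yield each subset of delimiter positions (all, ..., none)."""
    positions = [i for i, w in enumerate(words) if w in ROLES]
    for size in range(len(positions), -1, -1):
        for subset in combinations(positions, size):
            yield list(subset)


def group_arguments(words, delimiter_positions):
    """Build every parse for one choice of delimiter positions."""
    limits = delimiter_positions[1:] + [len(words)]
    spans = []
    for pos, limit in zip(delimiter_positions, limits):
        # The argument runs rightwards from the delimiter, 1 word or more,
        # stopping before the next delimiter.
        spans.append([(pos, end) for end in range(pos + 2, limit + 1)])
    parses = []
    for combo in product(*spans):
        parse = {"V": None, "DO": []}
        taken = set()
        for pos, end in combo:
            parse[ROLES[words[pos]]] = " ".join(words[pos + 1:end])
            taken.update(range(pos, end))
        # Words not incorporated into an oblique argument go to DO,
        # one entry per contiguous run.
        run = []
        for i, word in enumerate(words):
            if i in taken:
                if run:
                    parse["DO"].append(" ".join(run))
                    run = []
            else:
                run.append(word)
        if run:
            parse["DO"].append(" ".join(run))
        parses.append(parse)
    return parses
```

Calling `group_arguments` on the example words with both delimiters active (positions 2 and 5) produces the four parses listed above, though not necessarily in the same order. Including the empty subset in `delimiter_choices` is an assumption; it yields the parse in which the whole input is a single direct object.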
step 5: anaphora substitution
Each language has a set of "anaphora" or "magic words", like the English ["this", "that", "it", "selection", "him", "her", "them"]. This step will search for any occurrences of these in the parses' arguments and make substituted alternatives, if there is a selection text.
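One possible way to generate the substituted alternatives is sketched below. The function name and the parse layout (a dict with a DO list and string-valued roles) are assumptions for illustration; matching is whole-word and case-sensitive here, which the real parser may not require.

```python
import copy
import re

ANAPHORA = ["this", "that", "it", "selection", "him", "her", "them"]


def substitute_anaphora(parse, selection):
    """Return the parse plus one alternative per anaphor occurrence found."""
    alternatives = [parse]
    if not selection:
        return alternatives  # nothing selected, so nothing to substitute
    for role, value in parse.items():
        if role == "V":
            continue
        strings = value if isinstance(value, list) else [value]
        for index, text in enumerate(strings):
            for word in ANAPHORA:
                pattern = r"\b" + re.escape(word) + r"\b"
                if not re.search(pattern, text):
                    continue
                # A function replacement keeps backslashes in the
                # selection from being read as regexp escapes.
                replaced = re.sub(pattern, lambda m: selection, text, count=1)
                alternative = copy.deepcopy(parse)
                if isinstance(value, list):
                    alternative[role][index] = replaced
                else:
                    alternative[role] = replaced
                alternatives.append(alternative)
    return alternatives
```

The original parse is always kept as the first alternative, so a literal reading of "this" or "it" remains available to the ranking step.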
step 6: noun type detection
For each parse, send each argument string to the noun type detector. The noun type detector will cache detection results, so it only checks each string once. This returns a list of possible noun types with their "scores".
EX:
'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]
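The caching behaviour can be sketched with a memoized function. The lookup sets `CONTACTS` and `SERVICES` are stand-ins invented for this example (real noun types would run their own detection logic), and the scores simply mirror the EX above.

```python
from functools import lru_cache

# Stand-in data, for illustration only.
CONTACTS = {"Dan"}
SERVICES = {"my calendar"}


@lru_cache(maxsize=None)
def detect_noun_types(text):
    """Return (type, score) pairs for a string; cached per string."""
    results = []
    if text in CONTACTS:
        results.append(("contact", 1.0))
    if text in SERVICES:
        results.append(("service", 1.0))
    # In the example above, every string also scores .7 as arbitrary text.
    results.append(("arb", 0.7))
    return tuple(results)
```

Because the results are cached, an argument string such as 'Dan' that appears in several parses is only checked once; `detect_noun_types.cache_info()` reports the hit and miss counts.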
step 7: ranking
foreach parse (w/o V)
  by semantic roles in the parse, find appropriate verbs
  foreach possible verb
    score = \prod_{each semantic role in the verb} score(the content of that argument being the appropriate nountype)
EX:
{V: null, DO: ['add lunch','tomorrow'], with: 'Dan', goal: 'my calendar'}
'Dan' -> [{type: contact, score: 1},{type: arb, score: .7}]
'my calendar' -> [{type: service, score: 1},{type: arb, score: .7}]
"add" lexical item:
...args:{DO: arb, with: contact, goal: service}...
so...
score = P(DO is a bunch of arb) * P(with is a contact) * P(goal is a service) = 1 * 1 * 1
so score = 1
/EX
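The worked example can be reproduced with a short sketch. Several things here are assumptions: the `VERBS` and `DETECTED` tables are hypothetical, the arb scores of 1 for the direct objects are chosen only to match the example, a verb counts as "appropriate" when it accepts every role in the parse, roles missing from the parse are skipped, and a list of direct objects contributes one factor per item.

```python
from math import prod  # Python 3.8+

# Hypothetical lexicon: verb -> {semantic role: expected noun type}.
VERBS = {"add": {"DO": "arb", "with": "contact", "goal": "service"}}

# Hypothetical detection results; the DO scores are assumed values.
DETECTED = {
    "Dan": {"contact": 1.0, "arb": 0.7},
    "my calendar": {"service": 1.0, "arb": 0.7},
    "add lunch": {"arb": 1.0},
    "tomorrow": {"arb": 1.0},
}


def candidate_verbs(parse):
    """Verbs whose argument roles cover every role used in the parse."""
    roles = {role for role in parse if role != "V"}
    return [name for name, args in VERBS.items() if roles <= set(args)]


def score_parse(parse, verb):
    """Product of noun-type scores over the verb's semantic roles."""
    factors = []
    for role, wanted in VERBS[verb].items():
        if role not in parse:
            continue  # assumption: absent roles do not affect the score
        value = parse[role]
        strings = value if isinstance(value, list) else [value]
        for text in strings:
            factors.append(DETECTED.get(text, {}).get(wanted, 0.0))
    return prod(factors)
```

With the parse from the example, `candidate_verbs` returns `["add"]` and `score_parse(parse, "add")` evaluates to 1.0.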
Now lower the score for >1 direct objects:
score = score * (0.5**(#DO-1)) (example algorithm)
EX: score = 1, with 2 direct objects, so score = 1 * (0.5**1) = 1 * 0.5 = 0.5
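The example penalty is a one-liner in code. The function name is hypothetical, and clamping the exponent at zero is an added assumption: without it, a parse with no direct objects would have its score doubled by the formula as written.

```python
def penalize_direct_objects(score, direct_objects):
    """Halve the score for each direct object beyond the first."""
    extra = max(len(direct_objects) - 1, 0)  # assumption: never reward #DO = 0
    return score * 0.5 ** extra
```

For instance, `penalize_direct_objects(1, ['add lunch', 'tomorrow'])` gives 0.5, while a parse with a single direct object keeps its score unchanged.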