Labs/Ubiquity/Parser 2/Localization Tutorial

If you are interested in command localization, please read the wiki pages on localizing commands and making commands localizable. This entry is for adding your language to Parser 2, so you can use Ubiquity with the grammar of your language.

Introduction

Ubiquity's Parser 2 was written from the ground up with internationalization as one of its highest priorities: not just making commands localizable, but making it easy to teach Parser 2 the grammars of other languages. Key to this undertaking is an idea from the Principles and Parameters school of linguistics: that the grammars of all languages are made up of the following (quoted from Wikipedia):

  • A finite set of fundamental principles that are common to all languages; e.g., that a sentence must always have a subject, even if it is not overtly pronounced.
  • A finite set of parameters that determine syntactic variability amongst languages; e.g., a binary parameter that determines whether or not the subject of a sentence must be overtly pronounced.

Following this idea, we built a flexible universal parser, Parser 2, and paired it with an (often very small) set of settings for each individual language.

The result of this architecture is that it takes very little code to teach Parser 2 a new language. With a little bit of JavaScript and knowledge of and interest in your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step-by-step guide and please submit your language files, even if they are incomplete.

Set up your environment

If you’re new to Ubiquity core development, you’ll want to first read the Ubiquity Development Tutorial to learn how to get a live copy of the Ubiquity repository using Mercurial.

As you read, you may find it helpful to refer to some of the more complete language settings files included in Parser 2: English, Japanese, Danish.

Writing your language settings

The structure of the language file

Each language in Parser 2 gets its own settings file. You'll need to look up the ISO 639-1 code for your language. We'll use English (code en) as an example here; its language settings file is called en.js and goes in the /ubiquity/modules/parser/new/ directory of the repository.

Here is the basic template for a Ubiquity Parser 2 language file:

 function makeParser() {
   var en = new Parser('en');
   // ... set the language parameters here ...
   return en;
 }

Everything here is wrapped in a factory function called makeParser. This function initializes the new Parser object with the appropriate language code, sets a bunch of parameters (elided above) and returns it. That's it!

Now let's walk through some of the parameters you must set to get your language working. For reference, the properties the language parser object is required to have are: branching, anaphora, and roles.

Identifying your branching parameter

 en.branching = 'right'; // or 'left'

One of the first things you'll have to set for your parser is the branching parameter. Ubiquity Parser 2 uses the branching parameter to decide in which direction to look for an argument after finding a delimiter or "role marker" (most often, these are prepositions or postpositions). For example, in English "from" is a delimiter for the source role and its argument is on its right.

  Arrow-right.png
to Mary from John

So "John" is a possible argument for the source role, but "Mary" should not be. Ubiquity can figure this out because English has the property en.branching = 'right'.

In Japanese, on the other hand, the argument of a delimiter like から ("from") is found on the left of that delimiter, so the Japanese settings use .branching = 'left'.

Arrow-left.png   
メアリー-から ジョン-に
Mary-from John-to

In general, if your language has prepositions, you should use .branching = 'right', and if your language has postpositions, you should use .branching = 'left'.
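
To make the effect of the branching parameter concrete, here is a deliberately simplified sketch. The argumentCandidates helper is hypothetical and is not part of Parser 2; it only shows which side of a delimiter is searched, and it returns every word on that side, whereas the real parser does considerably more.

```javascript
// Hypothetical helper for illustration only, not part of Parser 2.
// Given a list of words and the index of a delimiter, return the words
// on the side of the delimiter named by `branching`.
function argumentCandidates(words, delimiterIndex, branching) {
  if (branching === 'right') {
    // Prepositional languages: the argument follows the delimiter.
    return words.slice(delimiterIndex + 1);
  }
  if (branching === 'left') {
    // Postpositional languages: the argument precedes the delimiter.
    return words.slice(0, delimiterIndex);
  }
  throw new Error("branching must be 'left' or 'right'");
}

var english = ['to', 'Mary', 'from', 'John'];
argumentCandidates(english, 2, 'right');  // ['John']

var japanese = ['メアリー', 'から', 'ジョン', 'に'];
argumentCandidates(japanese, 1, 'left');  // ['メアリー']
```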

Defining your roles

 en.roles = [
   {role: 'goal', delimiter: 'to'},
   {role: 'source', delimiter: 'from'},
   {role: 'position', delimiter: 'at'},
   {role: 'position', delimiter: 'on'},
   {role: 'alias', delimiter: 'as'},
   {role: 'instrument', delimiter: 'using'},
   {role: 'instrument', delimiter: 'with'}
 ];

The second required property is the inventory of semantic roles and their corresponding delimiters. Each entry pairs a role from the inventory of semantic roles with a delimiter. Note that this mapping can be many-to-many: each role can have multiple delimiters, and different roles can share a delimiter. Try to cover all of the roles in the inventory of semantic roles.
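
The many-to-many mapping can be illustrated with two small lookups over the English role list. This is a sketch only: rolesForDelimiter and delimitersForRole are hypothetical helpers written for this tutorial, not functions provided by Parser 2. Both return arrays, so a delimiter shared between roles, or a role with several delimiters, is handled the same way.

```javascript
// The English role inventory from the example above.
var roles = [
  {role: 'goal', delimiter: 'to'},
  {role: 'source', delimiter: 'from'},
  {role: 'position', delimiter: 'at'},
  {role: 'position', delimiter: 'on'},
  {role: 'alias', delimiter: 'as'},
  {role: 'instrument', delimiter: 'using'},
  {role: 'instrument', delimiter: 'with'}
];

// Hypothetical helper: all roles that a given delimiter can mark.
function rolesForDelimiter(roles, delimiter) {
  return roles
    .filter(function (entry) { return entry.delimiter === delimiter; })
    .map(function (entry) { return entry.role; });
}

// Hypothetical helper: all delimiters that can mark a given role.
function delimitersForRole(roles, role) {
  return roles
    .filter(function (entry) { return entry.role === role; })
    .map(function (entry) { return entry.delimiter; });
}

delimitersForRole(roles, 'instrument');  // ['using', 'with']
rolesForDelimiter(roles, 'from');        // ['source']
```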

Entering your anaphora ("magic words")

 en.anaphora = ["this", "that", "it", "selection", "him", "her", "them"];

The final required property is anaphora, which takes a list of "magic words". Currently no distinction is made between the different deictic anaphora, even though they might refer to different things.
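
Since branching, anaphora, and roles are all required, it can help to check a draft settings object before loading it. The following is a minimal sketch; missingRequiredProperties is a hypothetical helper written for this tutorial and is not part of Ubiquity.

```javascript
// Hypothetical checker for illustration only, not part of Ubiquity.
// Returns the names of any required properties that are not yet set.
function missingRequiredProperties(parser) {
  var required = ['branching', 'anaphora', 'roles'];
  return required.filter(function (name) {
    return parser[name] === undefined;
  });
}

// A draft settings object that still lacks its anaphora list.
var draft = {
  branching: 'right',
  roles: [{role: 'goal', delimiter: 'to'}]
};

missingRequiredProperties(draft);  // ['anaphora']
```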

Register your language

Before testing out your new language settings file, you must register that language with the parser. There is a parser registry file at ubiquity/modules/parser/new/parser_registry.json. Open it up and add a new line to the JSON object mapping your language code to the native name of your language or locale. For example, if we wanted to add Danish (language code da), we could add the following line:

 da: "Dansk",

Special cases

Some special language features can be handled by overriding the default behavior from Parser. Please note that the exact implementation of a number of these features is still in flux.

Languages with no spaces

If your language does not delimit arguments (or words, more generally) with spaces, you will need to write a custom wordBreaker() method and set usespaces = false and joindelimiter = ''. For an example, please take a look at the Japanese or Chinese parser files.

Case marking languages

In general, the plan for Parser 2 is not to attempt to handle strongly case-marked languages, and instead to encourage the use of adpositions (prepositions or postpositions) as role markers. For more information, please read In Case of Case....

Stripping articles

Some languages have delimiters that combine with articles. For example, in French, the preposition "à" combines with the masculine definite article "le" but not with the feminine "la":

  1. à + la = à la
  2. à + le = au

You can add both "à" and "au" as delimiters of the `goal` role, but then you will get feminine arguments back with the determiner (e.g. "la table") while masculine arguments will be parsed without a determiner (e.g. "chat").

  1. "à la table" = "to the table"
  2. "au chat" = "to the cat"

These types of portmanteau'ed prepositions can be handled through a process of argument normalization. Each language's parser can optionally define a normalizeArgument() method which takes an argument and returns a list of normalized alternates. Normalized arguments are returned in the form of {prefix: ' ', newInput: ' ', suffix: ' '}. For example, if you feed "la table" to the French normalizeArgument(), it ought to return

 [{prefix: 'la ', newInput: 'table', suffix: ''}]

If there are no possible normalizations, normalizeArgument() should simply return []. Each alternative returned by normalizeArgument() is substituted into a copy of the possible parses just before nountype detection. The prefixes and suffixes are stored in the argument (as inactivePrefix and inactiveSuffix) so they can be incorporated into the suggestion display.

Here, for example, is how the inactive prefix "l'" is displayed in the parser playpen (described below). This way the user is told that the "l'" prefix is being ignored, and the nountype detection and verb action can act on the argument "English". (In the future, of course, we could teach this nountype to accept the Catalan "anglès".)

Catalan portmanteau.png

The easiest way to produce this output is to use the String.match() method. For example normalizeArgument() code, take a look at the Catalan and French parser files.
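
As a rough illustration of the String.match() approach, here is a minimal sketch of a French-style normalizeArgument(). It is not the code from the actual French parser file: the plain fr object stands in for the parser object, and the list of articles in the regular expression is an illustrative assumption rather than a complete inventory.

```javascript
// A sketch only: `fr` stands in for the French parser object, and the
// article list below is illustrative, not complete.
var fr = {};

fr.normalizeArgument = function (input) {
  // Capture a leading article (with its trailing space or apostrophe)
  // and the remainder of the argument.
  var matches = input.match(/^(la |le |les |l')(.+)$/i);
  if (matches === null) {
    // No possible normalization: return an empty list.
    return [];
  }
  return [{prefix: matches[1], newInput: matches[2], suffix: ''}];
};

fr.normalizeArgument('la table');  // [{prefix: 'la ', newInput: 'table', suffix: ''}]
fr.normalizeArgument('chat');      // []
```

Returning a list rather than a single object leaves room for arguments that have more than one plausible normalization.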

Test your parser

Now you can go into your about:config page and change the value of "extensions.ubiquity.language" to your language code and restart. All the verbs and nountypes at this point will remain the same as in the English version, but it should obey the argument structure (the word order and delimiters) of your language.

Ubiquity Parser 2 Playpen.png

You can also test your parser in the Parser 2 Playpen at chrome://ubiquity/content/playpen.html. There's a video explaining how you can use the Parser Playpen.

Conclusion

If you run into any trouble, feel free to ask for help on the Ubiquity i18n listhost or find mitcho on the Ubiquity IRC channel (mitcho @ irc.mozilla.org#ubiquity). Of course, once you're at a good stopping point, please contribute your language file to Ubiquity.

The next logical step to getting a better Ubiquity experience in your language is to localize commands.