{{RELEASE_MANAGEMENT_OBSOLETE}}
=The problems=
*One of the features where Chrome has beaten Firefox is automatic translation of web content using Google Translate. Google has invested heavily in building a complex, proprietary machine translation engine to handle this. The feature in Chrome not only lets users retrieve machine translation output from the Google Translate engine; Google Translate also has an interface through which users can suggest improvements to a translation, allowing the engine to become more sophisticated and accurate.
*Before Chrome, Google Translate had an open API, which let Google collect content for use in its engine but also made the web a generally more multilingual place. Using this open API, any website could add a snippet of code and see its pages translated on the fly. Over three years ago, Google closed this API and began charging for the service, resulting in many websites becoming monolingual once again. The closure has left a massive gap in the web, and nothing has yet filled the need.
*The open MT ecosystem currently lacks a quality web service or API that both MT end users and web admins could use for their projects.
*Many Mozilla l10n teams consist of only 1-2 people. While they would love to provide coverage in their language for all of the support content and websites used to market to and assist users, they do not have the time to commit. Users thus have a localized Firefox but lack troubleshooting support in their language.
*More and more Mozillians are non-English speakers or do not have English writing skills. There have been efforts to provide language education for Mozillians; however, the opportunities are limited to a small percentage of them. These Mozillians are thus limited in their participation by the significant language barrier.
*Data for machine translation corpora is often collected via web crawling, or from content that users unknowingly hand over by agreeing to the obscure terms and conditions of an MT service. Open data collection for MT corpora is either non-existent or an obscure practice.
=Research questions=
===How does machine translation work?===
There are four general approaches to machine translation. Most of the early work, before massive corpora were available, was done with rule-based machine translation ( [http://en.wikipedia.org/wiki/Rule-based_machine_translation http://en.wikipedia.org/wiki/Rule-based_machine_translation] ). Most current work, however, uses statistical machine translation ( [http://en.wikipedia.org/wiki/Statistical_machine_translation http://en.wikipedia.org/wiki/Statistical_machine_translation] ). A brief description of each approach follows.
====Rule-Based Machine Translation====
Uses pre-defined grammatical and syntactic rules and large bilingual dictionaries to translate text. Producing the necessary resources for this type of translation can be very costly, but according to [http://blog.globalizationpartners.com/machine-translation.aspx http://blog.globalizationpartners.com/machine-translation.aspx] it can actually "produce better quality for language pairs with very different word orders (for, example English to Japanese)".
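As a rough illustration of the idea, the sketch below combines a bilingual dictionary with a single reordering rule. The lexicon, the part-of-speech tags, and the function name <code>translate_rule_based</code> are all invented for this example; a real rule-based engine would need far larger dictionaries and many more rules.

```python
# Toy rule-based translation: a bilingual dictionary plus one reordering rule.
# The lexicon, tags, and rule below are invented for illustration only.

# Hypothetical English -> Spanish lexicon: word -> (translation, part of speech)
LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "runs": ("corre", "VERB"),
}


def translate_rule_based(sentence):
    """Look up each word, then apply an adjective-noun reordering rule."""
    tagged = []
    for word in sentence.lower().split():
        # Words missing from the lexicon are passed through untranslated.
        tagged.append(LEXICON.get(word, (word, "UNK")))

    # Syntactic rule: English ADJ NOUN becomes Spanish NOUN ADJ.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1

    return " ".join(word for word, _ in tagged)


print(translate_rule_based("The red car runs"))  # prints "el coche rojo corre"
```

Even this tiny example hints at the cost mentioned above: every new word needs a dictionary entry and every structural difference between the two languages needs its own rule.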
====Statistical Machine Translation====
Uses probabilities estimated from data to choose the "best" translation among the possible translations of a text. As far as I know, all statistical machine translation work requires a bilingual corpus from which to estimate those probabilities.
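To make "calculating the necessary probabilities" more concrete, here is a minimal sketch that estimates word translation probabilities from a three-sentence bilingual corpus using expectation-maximisation in the style of IBM Model 1. It is a simplification under stated assumptions: the corpus is invented, there is no NULL word, no language model, and no decoder, and the names <code>train_word_model</code> and <code>best_translation</code> are hypothetical.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (invented for illustration): (English, German).
CORPUS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]


def train_word_model(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM, IBM Model 1 style."""
    target_vocab = {f for _, targets in pairs for f in targets}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word
        for sources, targets in pairs:
            for f in targets:
                # E-step: share one unit of count for f among the source words.
                norm = sum(t[(f, e)] for e in sources)
                for e in sources:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


model = train_word_model(CORPUS)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(model, word))
```

No dictionary is supplied anywhere: the pairing of "house" with "Haus" emerges only because "das" is better explained by "the", which co-occurs with it in two sentences. This is why the size and quality of the bilingual corpus matter so much for this approach.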
====Example-based Machine Translation====
Uses previously translated examples and analogies, drawn from a parallel corpus, to determine the best translation. Somewhat similar to Rule-Based ([http://en.wikipedia.org/wiki/Example-based_machine_translation http://en.wikipedia.org/wiki/Example-based_machine_translation]).
====Hybrid Machine Translation====
A combination of the previously mentioned approaches.
===What are the benefits and drawbacks to each methodology?===
===How do you measure the output quality of a machine translation engine?===
;Automated evaluation
* BLEU score - http://en.wikipedia.org/wiki/BLEU
** Compares MT output against reference translations produced by professional human translators and assigns a score, based on n-gram precision, that indicates how close the MT output comes to the human translation.
* NIST - http://en.wikipedia.org/wiki/NIST_(metric)
** Similar to BLEU, except that not all correct n-grams are treated equally: correct n-grams are weighted by how rarely they occur.
* METEOR - http://en.wikipedia.org/wiki/METEOR
** Based on the harmonic mean of unigram precision and recall, with recall weighted more heavily, rather than on precision alone (as BLEU and NIST are).
* LEPOR - http://en.wikipedia.org/wiki/LEPOR
** A newer MT evaluation model that combines precision, recall, a sentence-length penalty, and an n-gram based word-order penalty.
* WER score - https://en.wikipedia.org/wiki/Word_error_rate
** The Word Error Rate is the word-level Levenshtein distance between the MT output and a reference translation, normalized by the reference length. It should correlate with the difficulty of post-editing MT output for publication.
** PWER (Position-independent WER) is a variant in which reorderings are disregarded.
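The core quantities behind several of these metrics are simple enough to sketch. The code below is a minimal illustration, not a reference implementation: it computes the clipped n-gram precision that BLEU is built on (without BLEU's brevity penalty or multi-reference support), the word error rate, and one common bag-of-words formulation of the position-independent variant. The function names are made up for this example, and a non-empty reference is assumed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision against one reference (BLEU's core quantity)."""
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    if not candidate_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as the reference has it.
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


def word_error_rate(candidate, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    previous = list(range(len(candidate) + 1))
    for i, reference_word in enumerate(reference, start=1):
        current = [i]
        for j, candidate_word in enumerate(candidate, start=1):
            cost = 0 if reference_word == candidate_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(reference)


def position_independent_wer(candidate, reference):
    """Bag-of-words error rate: word order is ignored entirely."""
    matches = sum((Counter(candidate) & Counter(reference)).values())
    errors = max(len(candidate), len(reference)) - matches
    return errors / len(reference)


reference = "the cat sat on the mat".split()
candidate = "on the mat the cat sat".split()
print("unigram precision:", modified_ngram_precision(candidate, reference, 1))
print("bigram precision: ", modified_ngram_precision(candidate, reference, 2))
print("WER:              ", word_error_rate(candidate, reference))
print("PWER:             ", position_independent_wer(candidate, reference))
```

The example pair contains exactly the same words in a different order, so it shows how the metrics disagree: unigram precision and PWER see nothing wrong, while bigram precision and WER penalise the reordering.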
===What prominent machine translation engines are out there and what are they known for?===
;[https://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications This is a much more concise table] of the current offerings. Includes both open and closed source engines that have front-end applications.
;[http://www.computing.dcu.ie/~mforcada/fosmt.html This is a list of all open source MT engines.] Some have web services, many do not.
{| class="wikitable sortable" border="1"
|-
! Open/Closed
! # of supported languages
! Web hosted?
|-
| Google Translate
| Google
| Statistical
| Closed
| 70+
| translate.google.com
|-
| Microsoft Translator
| Microsoft
|
| Closed
|
|
|-
| Babelfish
| Yahoo!
|
| Closed
|
|
|-
| MosesMT
|
| Statistical
| Open
|
|
|-
| Apertium
|
| Rule-based
| Open
| 30+
| [http://wiki.apertium.org/wiki/Apy apy]
|-
| Other
|
|
|-
| [http://www.statmt.org/europarl/ EuroParl]
| European Parliament
|
| Open
| 22
| Sentence-aligned text
|-
| [http://ipsc.jrc.ec.europa.eu/index.php?id=198 JRC-Acquis]
| European Union
|
| Open
| 22
| Sentence-aligned text
|-
| [http://www.isi.edu/natural-language/download/hansard/ Hansards Corpus]
| Canadian Govt
|
| Open
| 2
| Text aligned at the sentence level or smaller
|-
| [http://opus.lingfil.uu.se/ OPUS]
|
|
| Open
| Many
| Contains a variety of different corpora, including some of those mentioned above
|-
| [http://www.euromatrixplus.net/multi-un/ MultiUN]
| United Nations
|
| Open
| 7
| Sentence-aligned text
|}
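For readers unfamiliar with "sentence-aligned text", the sketch below shows one way such data might be consumed. It assumes the common layout of two plain-text files with one sentence per line, where line N of one file is the translation of line N of the other; each corpus listed above may package its data differently, and the file names and the function <code>read_sentence_pairs</code> are hypothetical.

```python
import os
import tempfile


def read_sentence_pairs(source_path, target_path):
    """Yield (source, target) pairs from two files aligned line by line."""
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            source_line = source_line.strip()
            target_line = target_line.strip()
            # Skip pairs where either side is empty.
            if source_line and target_line:
                yield source_line, target_line


# Demo with two tiny hypothetical files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "demo.en")
    de_path = os.path.join(tmp, "demo.de")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("the house\nthe book\n")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("das Haus\ndas Buch\n")
    for en, de in read_sentence_pairs(en_path, de_path):
        print(en, "|||", de)
```

Pairs read this way are the raw material a statistical engine needs to estimate its translation probabilities.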
===What are the pros and cons of having a Mozilla MT engine?===
===What technology resources would be needed to build our own MT engine?===
===What human resources would be needed to build our own MT engine?===
===What partnership opportunities could be available for this project?===
See [https://www.taus.net/taus-machine-translation-showcase https://www.taus.net/taus-machine-translation-showcase].
=User stories=
==Firefox end-users==
== Researchers ==
* Andrew Ng (Stanford University)
* Philipp Koehn (University of Edinburgh) - Maintains http://statmt.org/
* Daniel Marcu (University of Southern California)
== Bibliography ==
=== Overview ===
I have split the bibliography into two sections. The first lists pages that collect papers, such as conference proceedings; the second lists specific papers and articles that would be good to read.
==== Websites/Conference Proceedings ====
* http://statmt.org/
* [http://amta2012.amtaweb.org/AMTA2012Files/start.htm AMTA 2012 Proceedings]
* [http://www.mt-archive.info/ Machine Translation Archive]
* [http://blog.globalizationpartners.com/machine-translation.aspx An Introduction to Machine Translation]
* [http://www.smartling.com/blog/2012/04/20/a-brief-history-of-machine-translation/ A (Brief) History of Machine Translation]
* [https://labs.taus.net/mt/mosestutorial TAUS Tutorial] - Requires logging in with a free TAUS account
* [http://machinetranslation.wordpress.com/2013/12/13/overcome-challenges-of-building-high-quality-mt-engines-with-sparse-data/ TAUS Blog Post] - Creating quality MT with sparse parallel corpora
* Heafield, K., & Lavie, A. (2010). Combining Machine Translation Output with Open Source. The Prague Bulletin of Mathematical Linguistics, (93), 27–36. doi:10.2478/v10108-010-0008-4
* Vasi, A. (2012). Enabling Users to Create Their Own Web-Based Machine Translation Engine, 295–298.
==== Individual Papers/Articles/Presentations ====
* https://en.wikipedia.org/wiki/Machine_translation
* http://ice.he.net/~hedden/intro_mt.html (A little old but has some good info)
* http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/
* Machine Translation: An Introductory Guide; Arnold, Douglas and Balkan, Lorna and Meijer, Siety and Humphreys, R Lee and Sadler, Louisa; 2001; http://promethee.philo.ulg.ac.be/engdep1/download/bacIII/Arnold%20et%20al%20Machine%20Translation.pdf (Direct to PDF link)