Intellego/GSoC/2014

From MozillaWiki

Intellego is participating in the Google Summer of Code program for 2014.

Project outline

Intellego is an initiative to develop a machine translation platform built on open corpus data, open corpus gathering techniques, and open web services APIs. Its goals are to lower the linguistic accessibility barrier for users and websites and to further promote the exploration of freedom of linguistic expression on the web.

This piece of the project will lay the foundational code for Intellego by completing the first key milestone: an automatic translation tool for websites that translates all key source terminology in a site into its target-language equivalents. This will be accomplished by scanning the DOM of a site, extracting the translatable text nodes, searching them for source terminology matches in a bilingual termbase, and returning the target-language terminology within the rendered page. The project will aim to perform these tasks on the Mozilla support sites.

If the student completes the basic scope of the project before the eight allotted weeks are over, the stretch goal is to add context-sensitive retrieval of target terminology.

Skills needed

  • DOM manipulation (JavaScript)
  • Information retrieval
  • XML
  • Understanding of open web services APIs
  • Python
  • Ability to quickly create an intuitive front-end web UI using an existing framework (e.g., Django)

Timeline

The GSoC program has allotted 8 weeks for coding. Here is the timeline we have come up with for that period:

Week 1

  • Create a bilingual termbase of terminology consisting of Mozilla-specific terminology from Mozilla l10n resources. (bug 983140)
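As a rough illustration of what a bilingual termbase could look like in code, the sketch below loads term pairs from XML into a dictionary. The XML layout, the element names, the `load_termbase` helper, and the English/Spanish sample terms are all assumptions made for this example; the actual Mozilla l10n resources and any real termbase format will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified termbase layout used only for illustration.
SAMPLE_TERMBASE = """
<termbase source="en-US" target="es">
  <entry><source>bookmark</source><target>marcador</target></entry>
  <entry><source>add-on</source><target>complemento</target></entry>
  <entry><source>private browsing</source><target>navegación privada</target></entry>
</termbase>
"""

def load_termbase(xml_text):
    """Return a dict mapping lower-cased source terms to target terms."""
    root = ET.fromstring(xml_text.strip())
    termbase = {}
    for entry in root.iter("entry"):
        source = entry.findtext("source", default="").strip()
        target = entry.findtext("target", default="").strip()
        if source and target:  # skip incomplete entries
            termbase[source.lower()] = target
    return termbase

termbase = load_termbase(SAMPLE_TERMBASE)
```

Lower-casing the keys is one possible design choice: it makes later lookups case-insensitive at the cost of losing any case distinctions in the source terms.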

Week 2

  • Create a front-end web portal UI in which the user simply enters a URL and clicks a button to run the machine translation and view the results. (bug 983142)
  • Create a back-end, Python-based program that will, given a URL, extract the DOM text nodes from the associated webpage. (bug 983143)
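The text-node extraction step might be sketched with the standard library's `html.parser` module, as below. The `TextNodeExtractor` class and `extract_text_nodes` function are hypothetical names, and the sketch takes an HTML string rather than a URL so that it runs offline; fetching the page for a given URL would be a separate step.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class TextNodeExtractor(HTMLParser):
    """Collect (parent_tag, text) pairs for every text node in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # currently open tags, innermost last
        self.text_nodes = []  # (parent tag or None, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        self.text_nodes.append((parent, data))

def extract_text_nodes(page):
    parser = TextNodeExtractor()
    parser.feed(page)
    parser.close()  # flush any buffered trailing text
    return parser.text_nodes
```

Keeping the parent tag alongside each text node is an assumption about what later steps need; it makes it possible to tell prose apart from, say, script content.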

Week 3

  • Filter out DOM text nodes whose text is not translatable. (bug 983257)
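What counts as untranslatable is not defined here, so the following is only a sketch under assumed heuristics: text inside code-like elements, whitespace-only text, bare URLs, and text without any letters are skipped. The input is assumed to be a list of (parent tag, text) pairs, and the function names are hypothetical.

```python
import re

# Assumed heuristic: text under these tags is code or markup, not prose.
NON_TRANSLATABLE_TAGS = {"script", "style", "code", "pre", "noscript"}
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def is_translatable(parent_tag, text):
    stripped = text.strip()
    if not stripped:
        return False  # whitespace-only node
    if parent_tag in NON_TRANSLATABLE_TAGS:
        return False
    if URL_RE.match(stripped):
        return False  # a bare URL
    # Require at least one letter, so bare numbers and punctuation are skipped.
    return any(ch.isalpha() for ch in stripped)

def filter_translatable(text_nodes):
    """Keep only the (parent_tag, text) pairs worth sending to term lookup."""
    return [(tag, text) for tag, text in text_nodes if is_translatable(tag, text)]
```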

Week 4

  • Search the translatable DOM text nodes (the source) for source terminology matches in the bilingual termbase. (bug 983266)
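One simple way to search a text node for termbase entries is a single case-insensitive regular expression built from all source terms. This is a sketch of that approach, not necessarily the method the project will adopt; `build_term_pattern` and `find_term_matches` are hypothetical names, and the term list is assumed to be non-empty.

```python
import re

def build_term_pattern(terms):
    """Compile one pattern that matches any of the given source terms."""
    # Longest terms first so "private browsing" wins over "browsing".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds stop matches inside longer words ("bookmarks").
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def find_term_matches(text, pattern):
    """Return (start, end, matched_text) for each termbase hit in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```

Matching whole words only means inflected forms such as plurals are missed; handling those would need stemming or extra termbase entries.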

Week 5

  • Map the source terminology to the matching target terminology from the termbase. (bug 983144)
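Mapping matched source terms to their targets can be sketched as a regular-expression substitution with a lookup callback. The sketch assumes the termbase is a non-empty dict whose keys are lower-cased source terms; `translate_terms` is a hypothetical helper, the sample terms are illustrative, and the capitalization rule is a simple heuristic rather than a real casing policy.

```python
import re

def translate_terms(text, termbase):
    """Replace each source term found in text with its target equivalent."""
    ordered = sorted(termbase, key=len, reverse=True)  # longest term first
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )

    def lookup(match):
        source = match.group(0)
        target = termbase[source.lower()]  # keys are assumed lower-cased
        # Carry over an initial capital from the source term.
        if source[0].isupper():
            target = target[0].upper() + target[1:]
        return target

    return pattern.sub(lookup, text)
```

Text that contains no termbase entries is returned unchanged, which is what produces the partially translated output described for this project.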

Week 6

  • All-At-Once Replacement Method: Regenerate the DOM with the replaced terminology, output to a new webpage, and render it. (bug 983146)
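A minimal sketch of the all-at-once idea, assuming the page is available as an HTML string and that a `translate` callable (hypothetical) performs the term replacement on one text node: the whole page is first split into markup and text tokens, and only then are the text tokens replaced and the page re-serialized. The `Tokenizer` class and `translate_page` function are names invented for this example.

```python
import html
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}

class Tokenizer(HTMLParser):
    """Split a page into ["markup", s] and ["text", s] tokens, in order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tokens.append(["markup", self.get_starttag_text()])
        if tag in RAW_TEXT_TAGS:
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(["markup", self.get_starttag_text()])

    def handle_endtag(self, tag):
        self.tokens.append(["markup", "</%s>" % tag])
        if tag in RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1

    def handle_comment(self, data):
        self.tokens.append(["markup", "<!--%s-->" % data])

    def handle_decl(self, decl):
        self.tokens.append(["markup", "<!%s>" % decl])

    def handle_data(self, data):
        # Script and style bodies are passed through untouched.
        kind = "markup" if self.raw_depth else "text"
        self.tokens.append([kind, data])

def translate_page(page, translate):
    tokenizer = Tokenizer()
    tokenizer.feed(page)
    tokenizer.close()
    # All-at-once: every token has been collected before any is replaced.
    out = []
    for kind, value in tokenizer.tokens:
        if kind == "text":
            out.append(html.escape(translate(value), quote=False))
        else:
            out.append(value)
    return "".join(out)
```

Because character references are decoded while parsing, text is re-escaped on output; the regenerated markup is equivalent to the original but not guaranteed to be byte-identical.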

Week 7

  • On-the-Fly Replacement Method: Perform the terminology replacement operation on the DOM segment by segment, instead of extracting all text nodes from the DOM at once. (bug 983148)
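The on-the-fly idea can be illustrated with a generator that walks the page one segment at a time and translates each text segment as soon as it is reached, without first collecting all text nodes. This sketch uses a deliberately crude regular-expression tokenizer, which is an assumption made for brevity and will mishandle cases such as ">" inside attribute values; `translate_on_the_fly` and the `translate` callable are hypothetical names.

```python
import re

# A tag, a run of text, or a stray "<" that does not start a tag.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+|<")
RAW_OPEN_RE = re.compile(r"<(script|style)\b", re.IGNORECASE)
RAW_CLOSE_RE = re.compile(r"</(script|style)\s*>", re.IGNORECASE)

def translate_on_the_fly(page, translate):
    """Yield the page piece by piece, translating text segments as they come."""
    in_raw = False  # True while inside <script> or <style>
    for match in TOKEN_RE.finditer(page):
        piece = match.group(0)
        if piece.startswith("<"):
            if RAW_OPEN_RE.match(piece):
                in_raw = True
            elif RAW_CLOSE_RE.match(piece):
                in_raw = False
            yield piece  # markup passes through unchanged
        elif in_raw:
            yield piece  # script and style bodies are left alone
        else:
            yield translate(piece)
```

Since the function is a generator, a caller can start sending output after the first segment, for example by writing each yielded piece to a response stream, or simply join all pieces to get the full page.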

Week 8

  • Evaluate each method (all-at-once or on-the-fly) for efficiency and analyze whether it would be beneficial to use one method over the other, or whether it would be better to offer a choice of either. (bug 983250)
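For the efficiency comparison, a small timing harness built on the standard `timeit` module may be enough as a starting point. The `best_time` helper below is a hypothetical name, and running time is only one possible measure; memory use or time to first rendered output may matter just as much.

```python
import timeit

def best_time(func, repeat=5, number=100):
    """Return the best per-call time in seconds over several timed runs."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other system activity.
    return min(runs) / number
```

Each replacement method could then be wrapped in a zero-argument callable that processes the same sample page, and the two `best_time` results compared.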

Final deliverable

  • Automatic terminology translation tool consisting of a web interface and a server-side tool. A user will enter the URL of a source-language website and the tool will return the rendered target-language website containing partially translated content. (bug 983138)

Interested students

If you are a student interested in participating with Intellego for the Google Summer of Code program, please add your information to the table below.

Name | Contact information | Website | Open source experience | Description of interest
Akshay Aurora (akshayaurora) | (:system64) akshayaurora[at]yahoo.com | Website // LinkedIn | Github | Full stack developer passionate about open technologies
Abdul Rauf (haseeb) | (:haseeb) abdulraufhaseeb[at]gmail.com | Website | Github // BitBucket |
Tharshan Muthulingam | tharshan09[at]gmail.com | Website | Github |
Sudheesh Singanamalla | sudheesh1994[at]yahoo.com | Website | Github |
Rishabh Roy | rishabhsixfeet[at]gmail.com | | Github | Python developer making tools to be used by mass

Team liaisons

Mentor
Reporter
Jeff Beatty (gueroJeff)