Intellego is participating in the Google Summer of Code program for 2014.
Project outline
Intellego is an initiative to develop a machine translation platform built on open corpus data, open corpus-gathering techniques, and open web service APIs. Its goal is to lower the linguistic accessibility barrier for users and websites and to further promote the exploration of freedom of linguistic expression on the web.
This piece of the project will lay the foundational code for Intellego by completing the first key milestone: an automatic translation tool for websites that translates all key source terminology in a site into its target-language equivalents. The tool will scan the DOM of a site, extract the translatable text nodes, search them for source terminology matches in a bilingual termbase, and return the target-language terminology within the rendered page. The project will aim to perform these tasks on the Mozilla support sites.
If the student completes the basic scope of the project before the eight weeks are up, the stretch goal is to add context-sensitive retrieval of target terminology.
Skills needed
- DOM manipulation (JavaScript)
- Information retrieval
- XML
- Understanding of open web service APIs
- Python
- Ability to quickly create an intuitive front-end web UI using an existing framework (e.g., Django)
Timeline
The GSoC program has allotted 8 weeks for coding. Here is the timeline we have come up with for that period:
Week 1
- Create a bilingual termbase of Mozilla-specific terminology drawn from Mozilla l10n resources.
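As a rough illustration of what such a termbase could look like in code, the sketch below loads source/target pairs from a small XML file into a dictionary. The XML layout, the `load_termbase` name, and the sample English/Spanish entries are assumptions made for this example only; they are not an actual Mozilla l10n format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified termbase layout used only for this sketch.
SAMPLE_XML = """
<termbase source="en-US" target="es">
  <entry><source>bookmark</source><target>marcador</target></entry>
  <entry><source>add-on</source><target>complemento</target></entry>
  <entry><source>private browsing</source><target>navegación privada</target></entry>
</termbase>
"""


def load_termbase(xml_text):
    """Return a dict mapping lowercased source terms to target terms."""
    root = ET.fromstring(xml_text.strip())
    termbase = {}
    for entry in root.iter("entry"):
        source = entry.findtext("source", default="").strip()
        target = entry.findtext("target", default="").strip()
        if source and target:  # skip incomplete entries
            termbase[source.lower()] = target
    return termbase


termbase = load_termbase(SAMPLE_XML)
```

Lowercasing the keys is one possible design choice; it makes case-insensitive lookup simple at the cost of losing case distinctions between terms.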
Week 2
- Create a front-end web portal UI in which the user simply enters a URL and clicks a button to run the machine translation and view the results.
- Create a back-end, Python-based program that will, given a URL, extract the DOM text nodes from the associated webpage.
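The back-end extraction step could be sketched with the Python standard library alone. The example below is a minimal sketch, not the project's implementation: `fetch_html`, `extract_text_nodes`, and `TextNodeExtractor` are hypothetical names, and the parser records each text node together with the tag that encloses it rather than building a full DOM tree.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they are not tracked as parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextNodeExtractor(HTMLParser):
    """Collect (parent_tag, text) pairs for every text node in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else None
        self.nodes.append((parent, data))


def extract_text_nodes(html):
    parser = TextNodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


def fetch_html(url):
    """Download a page and decode it using the charset the server declares."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_text_nodes(fetch_html(url))` would then return the text nodes for a given URL. Note that script and style contents are returned as text nodes too, so a later filtering step is still needed.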
Week 3
- Filter out DOM text nodes that contain non-translatable text.
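One possible filter is sketched below. It assumes text nodes arrive as `(parent_tag, text)` pairs and treats a node as non-translatable when it sits inside a code-like element, is whitespace only, or contains no letters. Both the pair representation and the list of excluded tags are assumptions for illustration; the real criteria would need to be worked out against the Mozilla support sites.

```python
# Assumed set of parent elements whose text should be left alone.
NON_TRANSLATABLE_PARENTS = {"script", "style", "code", "pre", "noscript", "textarea"}


def is_translatable(parent_tag, text):
    """Heuristic check for whether a text node is worth searching for terms."""
    if parent_tag in NON_TRANSLATABLE_PARENTS:
        return False
    stripped = text.strip()
    if not stripped:
        return False  # whitespace between tags
    # Nodes made only of digits and punctuation carry no terminology.
    return any(ch.isalpha() for ch in stripped)


def filter_translatable(nodes):
    return [(tag, text) for tag, text in nodes if is_translatable(tag, text)]
```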
Week 4
- Search the translatable DOM text nodes (the source) for source terminology matches in the bilingual termbase.
Week 5
- Map the source terminology to the matching target terminology from the termbase.
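The search and mapping steps from Weeks 4 and 5 could be combined in a single regular-expression pass, sketched below. The approach shown (one alternation of all terms, longest first, matched case-insensitively on whole words) is an assumption chosen for brevity rather than a decided design, and `compile_term_pattern` and `find_term_matches` are hypothetical names. The termbase is assumed to be a dict with lowercased source terms as keys.

```python
import re


def compile_term_pattern(termbase):
    """Build one pattern matching any source term as a whole word."""
    # Longest terms first, so multi-word terms win over their sub-terms.
    terms = sorted(termbase, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def find_term_matches(text, termbase):
    """Return (start, end, source_text, target_term) for each match in text."""
    if not termbase:
        return []
    pattern = compile_term_pattern(termbase)
    return [
        (m.start(), m.end(), m.group(0), termbase[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Because matching is on whole words only, an inflected form such as "bookmarks" is not matched by the entry "bookmark"; handling inflection would need stemming or additional termbase entries.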
Week 6
- All-At-Once Replacement Method: Regenerate the DOM with the replaced terminology, output to a new webpage, and render it.
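A minimal sketch of the all-at-once idea follows, under several assumptions: the whole page is first collected into a list of tokens, every text token is then rewritten in one batch, and the page is regenerated by joining the tokens. `PageTokenizer` and `translate_all_at_once` are hypothetical names, the termbase is assumed to be a dict keyed by lowercased source terms, and end tags are re-emitted in lowercase, so the output is not guaranteed to be byte-identical to the input markup.

```python
import re
from html import escape
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # contents are never translated


class PageTokenizer(HTMLParser):
    """Collect the whole page as [kind, string] tokens before any replacement."""

    def __init__(self):
        # convert_charrefs=False leaves entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.tokens = []
        self.raw_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TEXT_TAGS:
            self.raw_depth += 1
        self.tokens.append(["markup", self.get_starttag_text()])

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(["markup", self.get_starttag_text()])

    def handle_endtag(self, tag):
        if tag in RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1
        self.tokens.append(["markup", "</%s>" % tag])

    def handle_data(self, data):
        self.tokens.append(["markup" if self.raw_depth else "text", data])

    def handle_entityref(self, name):
        self.tokens.append(["markup", "&%s;" % name])

    def handle_charref(self, name):
        self.tokens.append(["markup", "&#%s;" % name])

    def handle_comment(self, data):
        self.tokens.append(["markup", "<!--%s-->" % data])

    def handle_decl(self, decl):
        self.tokens.append(["markup", "<!%s>" % decl])


def translate_all_at_once(html, termbase):
    tokenizer = PageTokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    if termbase:
        terms = sorted(termbase, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

        def swap(match):
            # Escape the target term because it is written back into raw HTML.
            return escape(termbase[match.group(0).lower()], quote=False)

        for token in tokenizer.tokens:
            if token[0] == "text":
                token[1] = pattern.sub(swap, token[1])
    return "".join(text for _, text in tokenizer.tokens)
```

Holding every token in memory is what makes this "all at once": nothing is emitted until the entire page has been parsed and rewritten.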
Week 7
- On-the-Fly Replacement Method: Perform the terminology replacement operation on the DOM segment by segment, instead of extracting all text nodes from the DOM at once.
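The on-the-fly idea can be sketched with an incremental parser that writes each piece of the page to an output stream as soon as it is parsed, replacing terms in every text segment as it goes. This is only a sketch under stated assumptions: `StreamingRewriter` is a hypothetical name, the termbase is assumed to be a dict keyed by lowercased source terms, and a term that happens to be split across two input chunks would be missed.

```python
import io
import re
from html import escape
from html.parser import HTMLParser


class StreamingRewriter(HTMLParser):
    """Write each parsed piece to `out` immediately, translating text segments."""

    def __init__(self, termbase, out):
        super().__init__(convert_charrefs=False)
        self.termbase = termbase
        self.out = out
        self.in_raw_text = False  # True while inside <script> or <style>
        terms = sorted(termbase, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        self.pattern = (
            re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
            if terms else None
        )

    def _swap(self, match):
        return escape(self.termbase[match.group(0).lower()], quote=False)

    def handle_starttag(self, tag, attrs):
        self.in_raw_text = tag in ("script", "style")
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_raw_text = False
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        if self.pattern and not self.in_raw_text:
            data = self.pattern.sub(self._swap, data)
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)


def translate_on_the_fly(chunks, termbase):
    """Feed the page chunk by chunk; output grows as each chunk is processed."""
    out = io.StringIO()
    rewriter = StreamingRewriter(termbase, out)
    for chunk in chunks:
        rewriter.feed(chunk)
    rewriter.close()
    return out.getvalue()
```

In a real deployment `out` could be the HTTP response stream instead of a `StringIO`, so the first translated segments would reach the browser before the rest of the page has been read.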
Week 8
- Evaluate both methods (all-at-once and on-the-fly) for efficiency and determine whether one is clearly preferable or whether the tool should offer a choice between them.
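A simple way to start the efficiency comparison is a timing harness built on the standard `timeit` module, sketched below. `compare_methods` is a hypothetical helper, and the two stub functions are placeholders standing in for the real replacement methods; they exist only so the harness can run. Wall-clock time is just one criterion, and memory use or time to first rendered output may matter as much.

```python
import timeit


def compare_methods(methods, html, number=20, repeat=3):
    """Return {name: best observed seconds per call} for each callable."""
    results = {}
    for name, func in methods.items():
        timings = timeit.repeat(lambda: func(html), number=number, repeat=repeat)
        results[name] = min(timings) / number
    return results


# Placeholder stand-ins for the two replacement methods (not real implementations).
def all_at_once_stub(html):
    return html.replace("bookmark", "marcador")


def on_the_fly_stub(html):
    return "".join(
        line.replace("bookmark", "marcador")
        for line in html.splitlines(keepends=True)
    )


sample_page = "<p>Add a bookmark</p>\n" * 200
results = compare_methods(
    {"all-at-once": all_at_once_stub, "on-the-fly": on_the_fly_stub},
    sample_page,
)
```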
Final deliverable
- Automatic terminology translation tool consisting of a web interface and a server-side tool. A user will enter the URL of a source-language website and the tool will return the rendered target-language website containing partially translated content.
Interested students
If you are a student interested in working with Intellego for the Google Summer of Code program, please add your information to the table below.
Name | Contact information | Website | Open source experience | Description of interest |
---|---|---|---|---|
Akshay Aurora (akshayaurora) (:system64) | akshayaurora[at]yahoo.com | Website // LinkedIn | Github | Full stack developer passionate about open technologies |
Abdul Rauf (haseeb) (:haseeb) | abdulraufhaseeb[at]gmail.com | Website | Github // BitBucket | |
Tharshan Muthulingam | tharshan09[at]gmail.com | Website | Github | |
Team liaisons
- Mentor
- Reporter
- Jeff Beatty (gueroJeff)