Drumbeat/MoJo/hackfest/berlin/projects/followthis: Difference between revisions

From MozillaWiki

Revision as of 08:45, 29 September 2011

Project Name: FollowThis

   Project Lead(s): Matt Terenzio
   

Big Goal for MoJo Hackfest:

  • Ship some usable code. (achieved)
  • Learn how to manage an Open Source project.
  • Work with others on related projects. (achieved)
  • Drink heavily. (achieved)

Key steps toward goal:

  • 1. Need to be able to extract RDFa and Microformats from pages. (working)
  • 2. Need to be able to use NLP to extract entities if semantic metadata is not present. (adopt and contribute to the metameta project for this)
  • 3. Need to be able to store and query the metadata. (I can currently query the RDF triplestore but need to hone the queries)
  • 4. Need a solid UI for users to interact with the service. (getting there)
  • 5. A crawler for the news sources would be nice. (deferred to version 0.2)

Pending needs:

  • Important: Need to make a button that is an embeddable widget for ease of deployment.
  • I have a working bookmarklet, but it needs work. Need jQuery help. (still need a session with a jQuery expert)
  • Totally clueless on entity extraction from pages that don't have semantic metadata. (solved somewhat)
  • Also need to figure out SPARQL and the best persistent data store for RDF. (Laurian gave me some good starting points)

Link for more info:

  • rNews has been brought up. I've installed the RDFa distiller from W3C, and you can use it to distill RDFa from pages.
   Example which distills a page with rNews in it: 
   To use it, just call:
http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill
  • (Update: Matt has a better entity extractor than this, using Stanford NLP; will use that.) To extract keywords from text, I set up a CGI script that does so when you feed it text.

example

It should accept POST requests to that URL as well as GET requests.

  • First pass at a readability-like way to extract the article text and headline from a web page:

http://followth.is/read/article/http%3A%2F%2Fwww.thehour.com%2Fstory%2F511535%2Ffrank-fay-way-we-were/

  • Another endpoint that distills RDFa from a web page (this one in PHP):

http://followth.is/transform/?type=rdfa&url=http://www.thehour.com/story/511535/frank-fay-way-we-were

  • A SPARQL endpoint for the triplestore of rNews data

http://followth.is/transform/sparql/
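
The endpoints listed above take the target page address in different ways: the Python distiller and the PHP transform endpoint take it as a query parameter, while the readability-like endpoint takes it percent-encoded in the path. Here is a minimal Python sketch of building those request URLs. The helper names are hypothetical, the `query` parameter for the SPARQL endpoint is an assumption based on the usual SPARQL protocol convention, and none of this is taken from the FollowThis source.

```python
from urllib.parse import quote, urlencode

BASE = "http://followth.is"  # host taken from the links above


def distiller_url(page_url):
    """URL for the RDFa distiller CGI script (takes ?uri=...)."""
    return f"{BASE}/cgi-bin/RDFa.py?" + urlencode({"uri": page_url})


def article_url(page_url):
    """URL for the readability-like endpoint (page URL percent-encoded in the path)."""
    return f"{BASE}/read/article/" + quote(page_url, safe="") + "/"


def transform_url(page_url, kind="rdfa"):
    """URL for the PHP transform endpoint (takes ?type=...&url=...)."""
    return f"{BASE}/transform/?" + urlencode({"type": kind, "url": page_url})


def sparql_url(query):
    """URL for a SPARQL GET request; the 'query' parameter name is an assumption."""
    return f"{BASE}/transform/sparql/?" + urlencode({"query": query})


if __name__ == "__main__":
    page = "http://www.thehour.com/story/511535/frank-fay-way-we-were"
    print(distiller_url(page))
    print(article_url(page))
    print(transform_url(page))
    print(sparql_url("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"))
```

Note that `transform_url` percent-encodes the `url` value, whereas the link above passes it unencoded. Encoding is the safer choice when the target address itself contains `&` or `?`.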

Link for demo:

FollowThis demo

Link to source code:

FollowThis on GitHub

Where from here:

  • Though the code is in working form, a few parts need to be cleaned up and organized for easier maintenance and extension
  • Continue to work on open alternatives to some of the portions that use third-party APIs
  • Write documentation for both developers and users
  • Promote rNews adoption
  • Deploy to at least one news site by end of 2011

Project Status

  • Currently working features include...
  • The project is currently capable of doing...
  • The project currently functions in these contexts...

Collaborators

The following folks helped with this project:

  • Laurian: how to model data for RDF storage and how to query that data using SPARQL
  • Raynor: TF-IDF (term frequency–inverse document frequency)
  • Laurian and Raynor: cosine similarity concepts for comparing documents
  • Jordan: what constitutes a valuable difference between documents from a user's or journalist's perspective
  • Chris: CMS perspectives from a journalist's standpoint
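
TF-IDF and cosine similarity, mentioned above, combine into a simple way to compare documents: each document becomes a vector of term weights, and the cosine of the angle between two vectors measures how alike they are. The sketch below is a minimal pure-Python illustration under stated assumptions (whitespace tokenization, raw term counts, and a smoothed IDF of log((1 + N) / (1 + df)) + 1). The function names and sample headlines are hypothetical, and this is not the FollowThis implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Assumed weighting: raw count times smoothed IDF.
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample headlines, for illustration only.
    docs = [
        "mayor opens new bridge in norwalk",
        "norwalk mayor opens bridge after delay",
        "local team wins championship game",
    ]
    v = tfidf_vectors(docs)
    print(cosine_similarity(v[0], v[1]))  # overlapping stories: between 0 and 1
    print(cosine_similarity(v[0], v[2]))  # no shared terms: 0.0
```

A real pipeline would likely add stop-word removal and stemming before weighting, since the naive tokenizer treats "bridge" and "bridges" as unrelated terms.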

Next steps

From here I would like to:

  • NEXT IMPLEMENTATION STEP 1
  • NEXT IMPLEMENTATION STEP 2
  • NEXT IMPLEMENTATION STEP 3

Places where this project might be tested include:

  • TEST CONTEXT 1
  • TEST CONTEXT 2
  • TEST CONTEXT 3