Drumbeat/MoJo/hackfest/berlin/projects/followthis


Revision as of 08:22, 29 September 2011 by Mterenzio (talk | contribs) (→‎Big Goal for MoJo Hackfest:)


Project Name: FollowThis

   Project Lead(s): Matt Terenzio

Big Goal for MoJo Hackfest:

Ship some usable code. (achieved)
Learn how to manage an Open Source project.
Work with others on related projects. (achieved)
Drink heavily. (achieved)

Key steps toward goal:

1. Extract RDFa and Microformats from pages.
2. Use NLP to extract entities when semantic metadata is not present.
3. Store and query the metadata.
4. Provide a solid UI so users can interact with the service.
5. A crawler for the news sources would be nice.
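As a rough illustration of step 1, here is a minimal sketch that collects RDFa `property` attributes from a page using only the Python standard library. It is not the project's extractor (FollowThis uses the W3C RDFa distiller); the class name, function name, and sample markup are all made up for this example, and the parser is deliberately simplified: it takes the value from a `content` attribute if present, otherwise from the text up to the next closing tag. An empty result would be the signal to fall back to NLP-based entity extraction (step 2).

```python
from html.parser import HTMLParser


class RDFaPropertyCollector(HTMLParser):
    """Collects (property, value) pairs from RDFa `property` attributes.

    Simplified on purpose: nested markup inside a property element is
    not handled, and no vocabulary or prefix resolution is done.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (property, value) pairs
        self._pending = None  # property still waiting for its text
        self._text = []       # text chunks for the pending property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # Value is given explicitly, e.g. <meta property=... content=...>
            self.pairs.append((prop, attrs["content"]))
        else:
            # Value is the element's text; wait for it.
            self._pending = prop
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None:
            self.pairs.append((self._pending, "".join(self._text).strip()))
            self._pending = None


def extract_metadata(html):
    """Return the (property, value) pairs found in an HTML string."""
    parser = RDFaPropertyCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical sample markup, invented for this sketch.
sample = (
    '<div typeof="NewsItem">'
    '<h1 property="headline">Example headline</h1>'
    '<meta property="dateCreated" content="2011-09-29">'
    '</div>'
)

if __name__ == "__main__":
    for prop, value in extract_metadata(sample):
        print(prop, "=", value)
```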

Pending needs:

Important: Need to make a button that is an embeddable widget for ease of deployment.
I have a working bookmarklet, but it needs work. Need jQuery help. (Still need a session with a jQuery expert.)
Totally clueless on entity extraction from pages that don't have semantic metadata. (Solved somewhat.)
Also need to figure out SPARQL and the best persistent data store for RDF. (Laurian gave me some good starting points.)

Link for more info:

rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.

   Example which distills a page with rNews in it: 
   To use it, just call:
http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill
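A small sketch of building that distiller URL from Python. The endpoint and the `uri` parameter come from the line above; percent-encoding the target address is an addition here, on the assumption that the CGI script decodes its query string in the usual way. The function name is made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint as given on this page.
DISTILLER = "http://followth.is/cgi-bin/RDFa.py"


def distiller_url(page_url):
    """Return the distiller URL for the page to distill (hypothetical helper).

    Assumption: the CGI script accepts a percent-encoded `uri` value.
    """
    return DISTILLER + "?" + urlencode({"uri": page_url})


if __name__ == "__main__":
    print(distiller_url("http://www.thehour.com/story/511535/frank-fay-way-we-were"))
```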

To extract keywords from text, I set up a CGI script that returns keywords for whatever text you feed it. (Update: Matt has a better entity extractor than this, using Stanford NLP; will use that.)

It should accept POST requests to that URL as well as GET requests.

First pass at a readability-like way to extract the article text and headline from a web page:

http://followth.is/read/article/http%3A%2F%2Fwww.thehour.com%2Fstory%2F511535%2Ffrank-fay-way-we-were/
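The example above passes the article address as a single percent-encoded path segment followed by a trailing slash. A minimal sketch of building such a URL, with a hypothetical helper name:

```python
from urllib.parse import quote

# Prefix as shown in the example URL on this page.
READ_PREFIX = "http://followth.is/read/article/"


def read_url(article_url):
    """Return the article-extraction URL for a page (hypothetical helper).

    safe="" makes quote() encode ":" and "/" too, so the whole article
    address fits in one path segment.
    """
    return READ_PREFIX + quote(article_url, safe="") + "/"


if __name__ == "__main__":
    print(read_url("http://www.thehour.com/story/511535/frank-fay-way-we-were"))
```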

Another endpoint that distills RDFa from a web page (this one in PHP):

http://followth.is/transform/?type=rdfa&url=http://www.thehour.com/story/511535/frank-fay-way-we-were

A SPARQL endpoint for the triplestore of rNews data

http://followth.is/transform/sparql/
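A sketch of how a query URL for this endpoint might be built. This page does not document the endpoint's interface, so the `query` parameter is an assumption based on the W3C SPARQL Protocol convention; the query itself is a generic "first ten triples" probe, not one taken from the project.

```python
from urllib.parse import urlencode

# Endpoint as given on this page.
SPARQL_ENDPOINT = "http://followth.is/transform/sparql/"

# Generic probe query: list any ten triples in the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_url(query):
    """Return a GET URL for the query (hypothetical helper).

    Assumption: the endpoint reads the query from a `query` parameter,
    as the SPARQL Protocol specifies.
    """
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query.strip()})


if __name__ == "__main__":
    print(sparql_url(QUERY))
```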

Link for demo:

FollowThis demo

Link to source code:

FollowThis on GitHub

Where from here:

Though the code works, a few parts need to be cleaned up and reorganized so that it is easier to maintain and extend
Continue to work on open alternatives to the portions that use third-party APIs
Documentation for both developers and users
Promote rNews adoption
Deploy to at least one news site by end of 2011
