Drumbeat/MoJo/hackfest/berlin/projects/followthis: Difference between revisions

Revision as of 08:28, 27 September 2011

   Project Lead(s): Matt Terenzio

1. Need to be able to extract RDFa and Microformats from pages.
2. Need to be able to use NLP to extract entities if semantic metadata is not present.
3. Need to be able to store and query the metadata.
4. Need a solid UI for users to be able to interact with the service.
5. A crawler for the news sources would be nice.

I have a working bookmarklet, but it needs work. jQuery help is welcome.
I'm totally clueless on entity extraction from pages that don't have semantic metadata.
I also need to figure out SPARQL and the best persistent data store for RDF.

rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.

   Example which distills a page with rNews in it: 
   To use it, just call:
http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill
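
A minimal sketch of building that call, assuming the page address only needs to be percent-encoded into the `uri` parameter. The optional parameters (`format`, `warnings`, `parser`, `space-preserve`) are the ones that appear in the example link in this page's diff; the helper name `distiller_url` is hypothetical. The snippet only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Endpoint named on this page.
DISTILLER = "http://followth.is/cgi-bin/RDFa.py"


def distiller_url(page_uri, options=None):
    """Return the distiller URL for page_uri (hypothetical helper).

    options is an optional dict of extra query parameters, e.g.
    {"format": "pretty-xml", "space-preserve": "true"}.
    """
    params = {"uri": page_uri}
    if options:
        params.update(options)
    # urlencode percent-encodes ':' and '/' so the page address
    # survives as a single query value.
    return DISTILLER + "?" + urlencode(params)


url = distiller_url(
    "http://www.thehour.com/story/511537/"
    "fewer-people-applied-for-unemployment-benefits",
    {"format": "pretty-xml"},
)
print(url)
```

The result can be pasted into a browser or passed to any HTTP client.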

To extract keywords from some text, I set up a CGI script that returns them when you feed it text. (Update: Matt has a better entity extractor than this, built on Stanford NLP; we will use that instead.)

It should accept POST requests to that URL as well as GET requests.
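
A rough sketch of both request styles against the keyword script. The GET form mirrors the example link (`extract.py?text=Mary+had+a+little+lamb`). For POST, sending a form-encoded `text` field in the body is an assumption, since the page only says posts should be accepted. The helper names are hypothetical, and the requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the example link on this page.
EXTRACT = "http://followth.is/cgi-bin/extract.py"


def extract_get(text):
    """Build a GET request carrying the text in the query string."""
    return Request(EXTRACT + "?" + urlencode({"text": text}))


def extract_post(text):
    """Build a POST request with the same field in the body.

    Assumption: the script reads a form-encoded 'text' field on POST.
    """
    body = urlencode({"text": text}).encode("ascii")
    return Request(EXTRACT, data=body, method="POST")


get_req = extract_get("Mary had a little lamb")
post_req = extract_post("Mary had a little lamb")
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Either request could then be sent with `urllib.request.urlopen(req)`; POST is the safer choice for article-length text, which may not fit in a query string.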

First pass at a readability-like way to extract the article text and headline from a web page:
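
For illustration only, here is a much simpler stand-in for that idea, not the code behind the service: it takes the first `<h1>` as the headline and joins the text of every `<p>` as the article body, ignoring everything else. Real readability-style extractors score blocks by text density; the class and function names below are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the first <h1> as headline and all <p> text as body."""

    def __init__(self):
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self._tag = None   # tag currently being captured, if any
        self._buf = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Start capturing on any <p>, or on the first <h1> only.
        if tag == "p" or (tag == "h1" and not self.headline):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if tag == "h1":
                self.headline = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None


def extract_article(html):
    """Return (headline, body_text) for an HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.headline, "\n\n".join(parser.paragraphs)


sample = (
    "<html><body><h1>Sample headline</h1><div>Home</div>"
    "<p>First  paragraph.</p><p>Second <b>bold</b> one.</p></body></html>"
)
headline, body = extract_article(sample)
print(headline)
print(body)
```

Navigation text outside `<p>` tags (the `Home` link in the sample) is dropped, which is the main effect a readability-like pass is after.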

@@ Line 24: / Line 24: @@
 == Link for more info: ==
-rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.
+* rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.
      [http://followth.is/cgi-bin/RDFa.py?uri=http%3A%2F%2Fwww.thehour.com%2Fstory%2F511537%2Ffewer-people-applied-for-unemployment-benefits&format=pretty-xml&warnings=false&parser=lax&space-preserve=true Example which distills a page with rNews] in it:
      To use it, just call:
   http://followth.is/cgi-bin/RDFa.py?uri=''uri-of-web-page-you-want-to-distill''
-To extract keywords from some text I set up a CGI script that does so if you feed it text.
+* (update: Matt has a better entity extractor than this using Stanford NLP -- will use that) To extract keywords from some text I set up a CGI script that does so if you feed it text.
 [http://followth.is/cgi-bin/extract.py?text=Mary+had+a+little+lamb example]
 It should accept posts to that URL as well as gets.
+* First pass at a readability-like way to extract the article text and headline from a web page:
+http://followth.is/read/article/http%3A%2F%2Fwww.thehour.com%2Fstory%2F511535%2Ffrank-fay-way-we-were/
 == Link for demo: ==