Drumbeat/MoJo/hackfest/berlin/projects/followthis: Difference between revisions
Revision as of 08:28, 27 September 2011
Project Name: FollowThis
Project Lead(s): Matt Terenzio
Big Goal for MoJo Hackfest:
- Ship some usable code.
- Learn how to manage an Open Source project.
- Work with others on related projects.
- Drink heavily.
Key steps toward goal:
- 1. Need to be able to extract RDFa and Microformats from pages.
- 2. Need to be able to use NLP to extract entities if semantic metadata is not present.
- 3. Need to be able to store and query the metadata.
- 4. Need a solid UI for users to be able to interact with the service.
- 5. A crawler for the news sources would be nice.
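Steps 1 and 2 describe a fallback: use embedded semantic metadata when a page has it, and fall back to NLP entity extraction when it does not. A minimal sketch of that control flow follows. The function names and both toy extractors are hypothetical stand-ins invented for illustration; they are not the project's code, and real RDFa or NLP extraction would replace them.

```python
from typing import Callable, Dict, List

# An extractor takes page text and returns a list of metadata items.
Extractor = Callable[[str], List[str]]


def collect_metadata(page: str, semantic: Extractor, nlp: Extractor) -> Dict[str, object]:
    """Prefer embedded semantic metadata; fall back to NLP entities."""
    items = semantic(page)
    if items:
        return {"source": "semantic", "items": items}
    return {"source": "nlp", "items": nlp(page)}


def toy_semantic_extractor(page: str) -> List[str]:
    # Hypothetical stand-in: only checks whether any RDFa-style
    # `property=` attribute appears in the markup.
    return ["has-rdfa-property"] if "property=" in page else []


def toy_entity_extractor(page: str) -> List[str]:
    # Hypothetical stand-in: treats capitalized words as "entities".
    return [word for word in page.split() if word[:1].isupper()]


if __name__ == "__main__":
    print(collect_metadata('<h1 property="rnews:headline">Hi</h1>',
                           toy_semantic_extractor, toy_entity_extractor))
    print(collect_metadata("Mary had a little lamb",
                           toy_semantic_extractor, toy_entity_extractor))
```

Passing the extractors in as arguments keeps the fallback logic separate from whichever RDFa parser or entity extractor ends up being used.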
Pending needs:
- I have a working bookmarklet, but it needs work. jQuery help wanted.
- Totally clueless on entity extraction from pages that don't have semantic metadata.
- Also need to figure out SPARQL and the best persistent data store for RDF.
Link for more info:
- rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.
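Example which distills a page with rNews in it: http://followth.is/cgi-bin/RDFa.py?uri=http%3A%2F%2Fwww.thehour.com%2Fstory%2F511537%2Ffewer-people-applied-for-unemployment-benefits&format=pretty-xml&warnings=false&parser=lax&space-preserve=true
To use it, just call: http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill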
- (Update: Matt has a better entity extractor than this, using Stanford NLP; will use that.) To extract keywords from some text, I set up a CGI script that extracts them from whatever text you feed it. Example: http://followth.is/cgi-bin/extract.py?text=Mary+had+a+little+lamb
It should accept POST requests to that URL as well as GETs.
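- First pass at a Readability-like way to extract the article text and headline from a web page: http://followth.is/read/article/http%3A%2F%2Fwww.thehour.com%2Fstory%2F511535%2Ffrank-fay-way-we-were/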
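
All three services take a URL-encoded page address or text, so building the request URLs by hand is error-prone. The sketch below builds them with Python's standard library and makes no network calls. The URL shapes are copied from the example links on this page (`/cgi-bin/RDFa.py?uri=`, `/cgi-bin/extract.py?text=`, `/read/article/<encoded-url>/`); the helper names are made up here, and the POST body format (a form-encoded `text` field) is an assumption, since the page only says that POSTs should be accepted.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE = "http://followth.is"  # host taken from the links on this page


def distiller_url(page_uri, options=None):
    """URL asking the RDFa distiller to process page_uri.

    `options` holds extra query parameters such as format or parser,
    as seen in the example link.
    """
    params = {"uri": page_uri}
    params.update(options or {})
    return BASE + "/cgi-bin/RDFa.py?" + urlencode(params)


def extractor_url(text):
    """GET URL for the keyword-extraction CGI script."""
    return BASE + "/cgi-bin/extract.py?" + urlencode({"text": text})


def extractor_post(text):
    """POST request object for the same script (not sent here).

    Assumption: the script reads a form-encoded `text` field.
    """
    body = urlencode({"text": text}).encode("ascii")
    return Request(BASE + "/cgi-bin/extract.py", data=body, method="POST")


def article_url(page_uri):
    """URL for the article text/headline extractor.

    The page address goes into the path, so every reserved character,
    including '/', is percent-encoded.
    """
    return BASE + "/read/article/" + quote(page_uri, safe="") + "/"


if __name__ == "__main__":
    print(distiller_url("http://www.thehour.com/story/511537/"
                        "fewer-people-applied-for-unemployment-benefits"))
    print(extractor_url("Mary had a little lamb"))
    print(article_url("http://www.thehour.com/story/511535/frank-fay-way-we-were"))
```

To actually fetch a result, pass any of these URLs (or the `Request` object) to `urllib.request.urlopen`, assuming the service is still reachable.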