Drumbeat/MoJo/hackfest/berlin/projects/followthis: Difference between revisions
Revision as of 08:45, 29 September 2011
Project Name: FollowThis
Project Lead(s): Matt Terenzio
Big Goal for MoJo Hackfest:
- Ship some usable code. (achieved)
- Learn how to manage an Open Source project.
- Work with others on related projects. (achieved)
- Drink heavily. (achieved)
Key steps toward goal:
- 1. Need to be able to extract RDFa and Microformats from pages. (working)
- 2. Need to be able to use NLP to extract entities if semantic metadata is not present. (adopt and contribute to the metameta project for this)
- 3. Need to be able to store and query the metadata. (currently able to query the RDF triplestore, but the queries need honing)
- 4. Need a solid UI for users to be able to interact with the service. (getting there)
- 5. A crawler for the news sources would be nice. (deferred to version 0.2)
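To give a feel for step 1, here is a rough sketch of pulling `property` values out of a page using only the Python standard library. It is not the code FollowThis runs (that uses the W3C RDFa distiller) and it is nowhere near a full RDFa processor: it ignores prefixes, `about`, `typeof` and nesting. The `rnews:` prefix and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from elements with an RDFa `property` attribute.

    Simplification: the value is either the `content` attribute or the text
    seen up to the next closing tag, so nested markup is not handled.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None   # property whose text content is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # e.g. <meta property="..." content="...">
            self.pairs.append((prop, attrs["content"]))
        else:
            self._open = prop
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


def extract_properties(html):
    """Return the list of (property, value) pairs found in an HTML string."""
    collector = PropertyCollector()
    collector.feed(html)
    return collector.pairs


# Hypothetical sample markup, not taken from a real news page.
sample = (
    '<div><h1 property="rnews:headline">Hello World</h1>'
    '<meta property="rnews:datePublished" content="2011-09-29"></div>'
)
pairs = extract_properties(sample)
```

For real pages a proper RDFa parser is the better choice; this only shows the shape of the data that step 3 then has to store and query.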
Pending needs:
- Important: need to make a button that is an embeddable widget for ease of deployment.
- I have a working bookmarklet, but it needs work. jQuery help. (still need a session with a jQuery expert)
- Totally clueless on entity extraction from pages that don't have semantic metadata. (solved somewhat)
- Also need to figure out SPARQL and the best persistent data store for RDF. (Laurian gave me some good starting points)
Link for more info:
- rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.
Example which distills a page with rNews in it: To use it, just call: http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill
- (Update: Matt has a better entity extractor than this using Stanford NLP -- will use that.) To extract keywords from some text, I set up a CGI script that does so if you feed it text.
It should accept POST requests to that URL as well as GET requests.
- First pass at a readability-like way to extract the article text and headline from a web page:
- Another endpoint that distills RDFa from a web page (this one in PHP)
- A SPARQL endpoint for the triplestore of rNews data
http://followth.is/transform/sparql/
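As a rough illustration of calling the endpoints above, the sketch below only builds the request URLs and does not send anything. The `uri` parameter comes from the distiller example above. The `query` parameter for the SPARQL endpoint is an assumption: it is what the standard SPARQL protocol uses, but I have not confirmed that this endpoint follows it. The function names and the sample query are made up for illustration.

```python
from urllib.parse import urlencode

DISTILLER_ENDPOINT = "http://followth.is/cgi-bin/RDFa.py"
SPARQL_ENDPOINT = "http://followth.is/transform/sparql/"


def distiller_url(page_uri):
    """Build a distiller request URL; the page URI must be percent-encoded."""
    return DISTILLER_ENDPOINT + "?" + urlencode({"uri": page_uri})


def sparql_url(query):
    """Build a SPARQL GET URL (assumes the endpoint accepts a `query` parameter)."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query})


# A generic query that lists a few triples without assuming any vocabulary.
sample_query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

distill_request = distiller_url("http://example.com/story")
sparql_request = sparql_url(sample_query)
```

Encoding the page URI matters because it usually contains `?`, `&` or `=` of its own, which would otherwise be read as parameters of the distiller call.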
Link for demo:
Link to source code:
Where from here:
- Though the code is in working form, a few parts need to be cleaned up and organized for better maintainability and extension
- Continue to work on open alternatives to some of the portions that use third-party APIs
- Documentation for both developers and users
- Promote rNews adoption
- Deploy to at least one news site by end of 2011
Project Status
- Currently working features include...
- The project is currently capable of doing...
- The project currently functions in these contexts...
Collaborators
The following folks helped with this project:
- Laurian/How to model data for RDF storage and how to query that data using SPARQL
- Raynor/TF-IDF (term frequency–inverse document frequency)
- Laurian/Raynor/Cosine similarity concepts for comparing documents
- Jordan/What constitutes a valuable difference between documents from a user or journalist perspective
- Chris/CMS perspectives from a journalist's standpoint
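For readers new to the TF-IDF and cosine similarity ideas mentioned above, here is a minimal textbook-style sketch. It is not the FollowThis implementation: the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, the function names and the sample headlines are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up headlines for illustration.
docs = [
    "mayor opens new bridge",
    "mayor opens new school",
    "stocks fall sharply",
]
vecs = tfidf_vectors(docs)
related = cosine_similarity(vecs[0], vecs[1])
unrelated = cosine_similarity(vecs[0], vecs[2])
```

The two mayor headlines share terms and score above zero, while the stocks headline shares none and scores exactly zero. Terms that appear in every document get weight zero, which is the point of the inverse document frequency part.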
Next steps
- From here I would like to:
- NEXT IMPLEMENTATION STEP 1
- NEXT IMPLEMENTATION STEP 2
- NEXT IMPLEMENTATION STEP 3
Places where this project might be tested include:
- TEST CONTEXT 1
- TEST CONTEXT 2
- TEST CONTEXT 3