Drumbeat/MoJo/hackfest/berlin/projects/MetaProject
Revision as of 09:30, 27 September 2011
The Meta Project is a tool which provides a simple service: take in any piece of media, spit out all the meta possible.
Meta Standards Resources
(Add links and summaries to documents discussing metadata)
- rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents.
- Metafragments is a proposed metadata markup for audio and video. (Julien Dorra)
Known APIs and Tools
(Add links and summaries of toolkits and APIs which can help generate data!)
- http://m.vid.ly/user/ - won't generate metadata but can help with format conversions
Desired Functionality
TEXT
Valid Inputs: URL, Plain Text, HTML
Optional Inputs: Known Metadata
Returned Metadata:
- Primary Themes (Document-wide)
- Primary Themes (Per-paragraph)
- Suggested Tags
- Entities (Names, Locations, Dates, Organizations) and their locations in text
- Author
- Publishing organization (if any)
- Date initially published and date last updated
- Names of people who are quoted
- Quotes
- Other texts cited and/or linked (books, articles, urls)
- All other numbers (that aren't dates) and their units (i.e. data points cited)
- Corrections
VIDEO
Valid Inputs: URL, Video (format? .mov and .mp4 are the dominant ones)
Optional Inputs: Transcript, Faces, Known Metadata
Returned Metadata:
- Transcript
- Moments of audio transition (new speaker)
- Moments of video transition (new scene)
- OCR data (any text that appears on image) and their timestamps
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Face identification and their timestamp ranges [only done if faces are provided]
AUDIO
Valid Inputs: URL, Audio (mp3, wav)
Optional Inputs: Transcript, Voice Samples, Known Metadata
Returned Metadata:
- Transcript
- Moments of audio transition (new speaker)
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Voice identification and their timestamp ranges [only done if voice samples are provided]
IMAGE
Valid Inputs: URL, Image (jpg, gif, bmp, png, tif)
Optional Inputs: Faces, Known Metadata
Returned Metadata:
- OCR data and its coordinate location
- Object identification
- Face identification [only done if faces are provided]
In photo we have:
- caption
- author and job title
- headline
- keywords
- location
- date
- copyright
- news org name
API
The API should be as RESTful as possible. The current thinking is that POST uploads the media item (if needed) and returns a Media Item ID (MIID), while GET performs the actual analysis, taking either an external URL or the MIID returned from a POST.
Entity Types
* text
* image
* video
* audio
Text
URL: /api/text
POST
Inputs
- text_file:file // The text file to store on the server
- url:str // The url containing the text to store on the server
- text:str // The text to store on the server
- ttl:int {D:0} // The number of seconds until the file will be removed from the system (0 means indefinitely)
Note: Either text_file, url, or text must be provided
Outputs
- miid:int // The unique media item id assigned to this item
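A minimal sketch of how a client might assemble the POST described above, using only the Python standard library. The host name, the `build_text_post` helper, and the choice of form encoding are assumptions made for illustration; the page only specifies the `/api/text` path and the field names. The request is built but not sent, and `text_file` upload is left out.

```python
import urllib.parse
import urllib.request

# Hypothetical host; the page only specifies the path /api/text.
BASE_URL = "http://localhost:8000"


def build_text_post(text=None, url=None, ttl=0, base_url=BASE_URL):
    """Build (but do not send) a POST request that stores text on the server."""
    if text is None and url is None:
        # The spec accepts text_file, url, or text; file upload is omitted here.
        raise ValueError("either url or text must be provided")
    fields = {"ttl": ttl}  # 0 means the item is kept indefinitely
    if text is not None:
        fields["text"] = text
    if url is not None:
        fields["url"] = url
    # Assumption: fields are sent form-encoded in the request body.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(base_url + "/api/text", data=body, method="POST")


request = build_text_post(text="Berlin hackfest notes", ttl=3600)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Per the Outputs above, the server's response would carry the `miid` to use in later GET calls; how that response is encoded is not stated on this page.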
GET
Inputs
- miid:int // The server-provided media item id to be analyzed
- url:str // The url containing the text to be analyzed
- text:str // The text to be analyzed
- tasks:dictionary {D: null} // The list of tasks to perform. (null means perform all)
Note: Either miid, url, or text must be provided
Outputs
- results:dictionary // The list of task results (one result object per task).
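The GET inputs above could be turned into a request URL roughly as follows. This is a sketch under assumptions: the host, the `build_text_get_url` helper, the task name `keyword_extraction`, and sending the `tasks` dictionary as a JSON string are all hypothetical, since the page does not say how a dictionary should be encoded in a query string.

```python
import json
import urllib.parse

# Hypothetical host; the page only specifies the path /api/text.
BASE_URL = "http://localhost:8000"


def build_text_get_url(miid=None, url=None, text=None, tasks=None, base_url=BASE_URL):
    """Build the analysis URL from a miid, an external url, or raw text."""
    if miid is None and url is None and text is None:
        raise ValueError("either miid, url, or text must be provided")
    params = {}
    if miid is not None:
        params["miid"] = miid
    if url is not None:
        params["url"] = url
    if text is not None:
        params["text"] = text
    if tasks is not None:
        # Assumption: the tasks dictionary travels as a JSON string.
        # Leaving it out corresponds to the default (null), i.e. perform all tasks.
        params["tasks"] = json.dumps(tasks)
    return base_url + "/api/text?" + urllib.parse.urlencode(params)


# "keyword_extraction" is a hypothetical task name.
print(build_text_get_url(miid=42, tasks={"keyword_extraction": {}}))
```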
Tasks
Keyword Extraction
Identify main keywords found in the text, either document wide, per paragraph, or both
Inputs
None
Outputs
- document_keywords:array // list of keywords for the entire document
- paragraph_keywords:array // list of keywords for each paragraph
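To make the output shape concrete, here is a small sketch that fills `document_keywords` and `paragraph_keywords` using plain word-frequency counting. The page does not say which keyword algorithm the project intends, so frequency counting is a stand-in, and the stopword list, the `limit` parameter, and the blank-line paragraph split are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not part of the spec).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on"}


def top_keywords(text, limit=3):
    """Return the most frequent non-stopword words, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


def keyword_extraction(text, limit=3):
    """Produce the two output fields named in the task description."""
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "document_keywords": top_keywords(text, limit),
        "paragraph_keywords": [top_keywords(p, limit) for p in paragraphs],
    }


sample = "Metadata helps news. Metadata matters.\n\nVideo needs transcripts. Video video."
print(keyword_extraction(sample, limit=1))
```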
Entity Extraction
Identify entities (names, locations, dates, organizations) found in the text, along with their type and position
Inputs
???
Outputs
- entities:array // array of [entity, type, position] tuples in the document.
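The `[entity, type, position]` tuples might look like the output of the sketch below. It uses a fixed lookup table rather than a real entity recognizer, purely to illustrate the shape of the result. The `KNOWN_ENTITIES` table, the type labels, and the reading of "position" as a character offset are all assumptions; the page does not define them.

```python
import re

# Hypothetical lookup table standing in for a real entity recognizer.
KNOWN_ENTITIES = {"Berlin": "Location", "Mozilla": "Organization"}


def extract_entities(text, known=KNOWN_ENTITIES):
    """Return {"entities": [[entity, type, position], ...]} sorted by position."""
    entities = []
    for name, entity_type in known.items():
        # Match whole words only, so "Berlin" does not match inside "Berliner".
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            # Assumption: position is the character offset of the match.
            entities.append([name, entity_type, match.start()])
    entities.sort(key=lambda item: item[2])
    return {"entities": entities}


print(extract_entities("Mozilla met in Berlin. Berlin was sunny."))
```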