Drumbeat/MoJo/hackfest/berlin/projects/MetaProject
Revision as of 09:25, 27 September 2011
The Meta Project is a tool that provides a simple service: take in any piece of media and return all the metadata that can be extracted from it.
Meta Standards Resources
(Add links and summaries to documents discussing metadata)
- rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents.
- Metafragments is a proposed metadata markup for audio and video. (Julien Dorra)
Known APIs and Tools
(Add links and summaries of toolkits and APIs which can help generate data!)
- http://m.vid.ly/user/ - won't generate metadata but can help with format conversions
Desired Functionality
TEXT
Valid Inputs: URL, Plain Text, HTML
Optional Inputs: Known Metadata
Returned Metadata:
- Primary Themes (Document-wide)
- Primary Themes (Per-paragraph)
- Suggested Tags
- Entities (Names, Locations, Dates, Organizations) and their locations in text
- Author
- Publishing organization (if any)
- Date initially published and date last updated
- Names of people who are quoted
- Quotes
- Other texts cited and/or linked (books, articles, urls)
- All other numbers (that aren't dates) and their units (i.e. data points cited)
- Corrections
VIDEO
Valid Inputs: URL, Video (format? .mov and .mp4 are the dominant ones)
Optional Inputs: Transcript, Faces, Known Metadata
Returned Metadata:
- Transcript
- Moments of audio transition (new speaker)
- Moments of video transition (new scene)
- OCR data (any text that appears on image) and their timestamps
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Face identification and their timestamp ranges [only done if faces are provided]
AUDIO
Valid Inputs: URL, Audio (mp3, wav)
Optional Inputs: Transcript, Voice Samples, Known Metadata
Returned Metadata:
- Transcript
- Moments of audio transition (new speaker)
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Voice identification and their timestamp ranges [only done if voice samples are provided]
IMAGE
Valid Inputs: URL, Image (jpg, gif, bmp, png, tif)
Optional Inputs: Faces, Known Metadata
Returned Metadata:
- OCR data and its coordinate location
- Object identification
- Face identification [only done if faces are provided]
In photo we have:
- caption
- author and job title
- headline
- keywords
- location
- date
- copyright
- news org name
API
The API should be as RESTful as possible. The current thought is that POST will be used to upload the media item (if needed) and will return a Media Item ID (MIID), while GET will be used to perform the actual analysis, taking in either an external URL or the MIID returned from a POST.
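The upload-then-analyze flow can be illustrated with a small in-memory stand-in for the server. This is only a sketch under stated assumptions: the class `MediaStore`, its method names, and the injectable `clock` are hypothetical and not part of the proposal, and fetching an external URL is left out so the example stays offline.

```python
import itertools
import time


class MediaStore:
    """Hypothetical in-memory stand-in for the server side of POST and GET."""

    def __init__(self, clock=time.time):
        self._clock = clock              # injectable so expiry can be tested
        self._ids = itertools.count(1)   # MIIDs are assumed to be sequential ints
        self._items = {}                 # miid -> (text, expiry timestamp or None)

    def post(self, text, ttl=0):
        """Store text and return its MIID; ttl=0 keeps the item indefinitely."""
        miid = next(self._ids)
        expires_at = None if ttl == 0 else self._clock() + ttl
        self._items[miid] = (text, expires_at)
        return miid

    def get(self, miid=None, text=None):
        """Resolve the text to analyze from a stored MIID or from inline text."""
        if miid is not None:
            stored, expires_at = self._items[miid]  # KeyError if unknown
            if expires_at is not None and self._clock() >= expires_at:
                del self._items[miid]               # item has outlived its ttl
                raise KeyError(miid)
            return stored
        if text is not None:
            return text
        raise ValueError("either miid or text must be provided")
```

A client would call `post` once, keep the returned MIID, and pass it to `get` for each later analysis request instead of uploading the media again.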
Entity Types
- text
- image
- video
- audio
Text
URL: /api/text
POST
Inputs
- text_file:file // The text file to store on the server
- url:str // The url containing the text to store on the server
- text:str // The text to store on the server
- ttl:int {D:0} // The number of seconds until the file will be removed from the system (0 means indefinitely)
Note: Either text_file, url, or text must be provided
Outputs
- miid:int // The unique media item id assigned to this item
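A client-side check of the POST inputs might look like the sketch below. The helper name `build_text_post_fields` is hypothetical, and the note above is read here as "at least one of the three sources must be present", which is an assumption since the spec does not say whether several may be combined.

```python
def build_text_post_fields(text_file=None, url=None, text=None, ttl=0):
    """Collect the form fields for POST /api/text (hypothetical helper).

    Assumes the note means at least one of text_file, url, or text is
    required; the spec does not say whether they may be combined.
    """
    candidates = {"text_file": text_file, "url": url, "text": text}
    fields = {name: value for name, value in candidates.items() if value is not None}
    if not fields:
        raise ValueError("one of text_file, url, or text must be provided")
    if ttl < 0:
        raise ValueError("ttl must be 0 (keep indefinitely) or a positive number of seconds")
    fields["ttl"] = int(ttl)  # the spec's default is 0
    return fields
```

For example, `build_text_post_fields(text="Some article body")` yields the `text` field plus the default `ttl` of 0.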
GET
Inputs
- miid:int // The server-provided media item id to be analyzed
- url:str // The url containing the text to be analyzed
- text:str // The text to be analyzed
- tasks:dictionary {D: null} // The list of tasks to perform. See the task list for more information. (null means perform all)
Note: Either miid, url, or text must be provided
Outputs
- results:dictionary // The list of task results (one result object per task). See the task list for more information
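One way a client could assemble the GET request is sketched below. Several details are assumptions rather than part of the spec: the helper name `build_text_get_url`, the host `http://example.org`, the task key `keyword_extraction`, and the choice to send the `tasks` dictionary as a JSON-encoded query parameter.

```python
import json
from urllib.parse import urlencode


def build_text_get_url(base_url, miid=None, url=None, text=None, tasks=None):
    """Build a GET /api/text URL (hypothetical helper, not part of the spec)."""
    params = {}
    if miid is not None:
        params["miid"] = str(miid)
    if url is not None:
        params["url"] = url
    if text is not None:
        params["text"] = text
    if not params:
        raise ValueError("either miid, url, or text must be provided")
    if tasks is not None:
        # Assumption: the tasks dictionary travels as JSON in the query string.
        # Leaving it out corresponds to the default (null), i.e. run all tasks.
        params["tasks"] = json.dumps(tasks, sort_keys=True)
    return base_url.rstrip("/") + "/api/text?" + urlencode(params)


# Analyze a previously uploaded item, requesting only one (hypothetical) task.
example = build_text_get_url("http://example.org", miid=42,
                             tasks={"keyword_extraction": {}})
```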
Tasks
Keyword Extraction
Identify main keywords found in the text, either document wide, per paragraph, or both
Inputs
None
Outputs
- document_keywords:array // list of keywords for the entire document
- paragraph_keywords:array // list of keywords for each paragraph
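To make the shape of these outputs concrete, here is a deliberately naive sketch that ranks words by frequency. It is not the project's extraction method; the stopword list, the blank-line paragraph split, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real service would need a fuller one.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
    "of", "on", "that", "the", "to", "was", "with",
})


def top_keywords(text, limit=5):
    """Return the most frequent non-stopword words, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


def extract_keywords(text, limit=5):
    """Build a result with the two output fields named in the task spec."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "document_keywords": top_keywords(text, limit),
        "paragraph_keywords": [top_keywords(p, limit) for p in paragraphs],
    }
```

`paragraph_keywords` is returned as one list per paragraph, in document order, so a caller can line the entries up with the original paragraphs.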