Drumbeat/MoJo/hackfest/berlin/projects/MetaProject: Difference between revisions

From MozillaWiki
Revision as of 09:39, 27 September 2011

The Meta Project is a tool that provides a simple service: take in any piece of media and return all the metadata that can be extracted from it.

Meta Standards Resources

(Add links and summaries to documents discussing metadata)

  • rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents.
  • Metafragments is a proposed metadata markup for audio and video. - (Julien Dorra)

Known APIs and Tools

(Add links and summaries of toolkits and APIs which can help generate data!)

Desired Functionality

TEXT

Valid Inputs: URL, Plain Text, HTML

Optional Inputs: Known Metadata

Returned Metadata:

- Primary Themes (Document-wide)
- Primary Themes (Per-paragraph)
- Suggested Tags
- Entities (Names, Locations, Dates, Organizations) and their locations in text
- Author
- Publishing organization (if any)
- Date initially published and date last updated
- Names of people who are quoted
- Quotes
- Other texts cited and/or linked (books, articles, urls)
- All other numbers (that aren't dates) and their units (i.e. data points cited)
- Corrections

VIDEO

Valid Inputs: URL, Video (format? .mov and .mp4 are the dominant ones)

Optional Inputs: Transcript, Faces, Known Metadata

Returned Metadata:

- Transcript
- Moments of audio transition (new speaker)
- Moments of video transition (new scene)
- OCR data (any text that appears in the image) and its timestamps
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Face identification and their timestamp ranges [only done if faces are provided]

AUDIO

Valid Inputs: URL, Audio (mp3, wav)

Optional Inputs: Transcript, Voice Samples, Known Metadata

Returned Metadata:

- Transcript
- Moments of audio transition (new speaker)
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Voice identification and their timestamp ranges [only done if voice samples are provided]

IMAGE

Valid Inputs: URL, Image (jpg, gif, bmp, png, tif)

Optional Inputs: Faces, Known Metadata

Returned Metadata:

- OCR data and its coordinate location
- Object identification
- Face identification [only done if faces are provided]

In a photo we have:

- caption
- author and job title
- headline
- keywords 
- location
- date
- copyright
- news org name

API

The API should be as RESTful as possible. The current thinking is that POST uploads the media item (if needed) and returns a Media Item ID (MIID), while GET performs the actual analysis, taking either an external URL or the MIID returned from a POST.

Entity Types

* text
* image
* video
* audio

Text

URL: /api/text

POST

Inputs

- text_file:file // The text file to store on the server
- url:str // The url containing the text to store on the server
- text:str // The text to store on the server
- ttl:int {D:0} // The number of seconds until the file will be removed from the system (0 means indefinitely)

Note: Either text_file, url, or text must be provided

Outputs

- miid:int // The unique media item id assigned to this item
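The POST inputs above could be validated on the client before upload roughly as follows. This is only a sketch: the function name `build_post_fields` is hypothetical, it assumes "either" means at least one source must be given, and the rejection of negative `ttl` values is an assumption not stated in the spec.

```python
from typing import Optional


def build_post_fields(text_file: Optional[bytes] = None,
                      url: Optional[str] = None,
                      text: Optional[str] = None,
                      ttl: int = 0) -> dict:
    """Validate the /api/text POST inputs and return them as form fields."""
    sources = {"text_file": text_file, "url": url, "text": text}
    # Keep only the sources the caller actually supplied.
    fields = {name: value for name, value in sources.items() if value is not None}
    if not fields:
        raise ValueError("one of text_file, url, or text must be provided")
    # Assumption: a negative ttl is meaningless, since 0 already means "keep indefinitely".
    if ttl < 0:
        raise ValueError("ttl must be 0 or a positive number of seconds")
    fields["ttl"] = ttl
    return fields
```

The returned dictionary could then be handed to any HTTP client as the form body; the server would answer with the `miid`.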

GET

Inputs

- miid:int // The server-provided media item id to be analyzed
- url:str // The url containing the text to be analyzed
- text:str // The text to be analyzed
- tasks:dictionary {D: null} // The tasks to perform, keyed by task (null means perform all)

Note: Either miid, url, or text must be provided

Outputs

- results:dictionary // The list of task results (one result object per task).
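A GET request for analysis might be assembled as in the sketch below. The spec does not say how the `tasks` dictionary travels in a query string, so JSON-encoding it is an assumption, as are the helper name `build_get_url` and the task key `keyword_extraction` used in the usage note.

```python
import json
from urllib.parse import urlencode


def build_get_url(base="/api/text", miid=None, url=None, text=None, tasks=None):
    """Build the query URL for a /api/text GET analysis request."""
    params = {}
    if miid is not None:
        params["miid"] = miid
    if url is not None:
        params["url"] = url
    if text is not None:
        params["text"] = text
    if not params:
        raise ValueError("one of miid, url, or text must be provided")
    # Assumption: tasks is sent as a JSON string; omitting it means "perform all".
    if tasks is not None:
        params["tasks"] = json.dumps(tasks, sort_keys=True)
    return base + "?" + urlencode(params)
```

For example, `build_get_url(miid=42, tasks={"keyword_extraction": {"type": "both"}})` would request only the keyword task for a previously uploaded item.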

Tasks

Keyword Extraction

Identify main keywords found in the text, either document wide, per paragraph, or both.

Powered by Luminoso (http://csc.media.mit.edu/luminoso)

Inputs

- type:enum('document','paragraph', 'both') {D: 'document'} // what is the scope of keywords to be extracted

Outputs

- document_keywords:array // list of keywords for the entire document
- paragraph_keywords:array // list of keywords for each paragraph
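To illustrate how the `type` input could select which outputs are returned, here is a minimal sketch. It does not use Luminoso: a naive word-frequency count stands in for the real extractor, and the stopword list, the `limit` parameter, and the blank-line paragraph split are all assumptions made for illustration. The `scope` argument corresponds to the `type` input above (renamed to avoid shadowing Python's built-in).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real service would use something richer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def top_keywords(text, limit=3):
    """Return the most frequent non-stopword tokens (stand-in for Luminoso)."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return [word for word, _count in Counter(words).most_common(limit)]


def keyword_extraction(text, scope="document", limit=3):
    """Return document_keywords and/or paragraph_keywords depending on scope."""
    if scope not in ("document", "paragraph", "both"):
        raise ValueError("scope must be 'document', 'paragraph', or 'both'")
    result = {}
    if scope in ("document", "both"):
        result["document_keywords"] = top_keywords(text, limit)
    if scope in ("paragraph", "both"):
        # Assumption: paragraphs are separated by one or more blank lines.
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        result["paragraph_keywords"] = [top_keywords(p, limit) for p in paragraphs]
    return result
```

With `scope="both"` the result carries both keys, with `paragraph_keywords` holding one keyword list per paragraph in document order.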

Entity Extraction

Identify entities (names, locations, dates, organizations) found in the text, along with their positions in the document.

Powered by [???]

Inputs

???

Outputs

- entities:array // array of [entity, type, position] tuples in the document.
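The shape of the `entities` output could look like the sketch below. Since the extraction engine is still undecided, a simple lookup against a caller-supplied list of known names stands in for it. The `Entity` type, the `find_known_entities` helper, the type labels, and the reading of `position` as a zero-based character offset are all assumptions.

```python
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    entity: str    # the matched text
    type: str      # e.g. "Name", "Location", "Date", "Organization"
    position: int  # assumed: zero-based character offset in the document


def find_known_entities(text: str, known: Dict[str, str]) -> List[Entity]:
    """Find every occurrence of each known name and return tuples sorted by position."""
    found = []
    for name, kind in known.items():
        start = text.find(name)
        while start != -1:
            found.append(Entity(name, kind, start))
            # Continue searching after the current match.
            start = text.find(name, start + len(name))
    return sorted(found, key=lambda e: e.position)
```

Because `Entity` is a named tuple, each result still serializes as a plain `[entity, type, position]` array while remaining readable in code.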