Drumbeat/MoJo/hackfest/berlin/projects/MetaMetaProject

About The Project

Name: Meta Meta Project

Code Repository: on GitHub

The Meta Meta Project is a tool that provides a simple service: take in any piece of media and spit out all the metadata possible.

Project Status

  • Much of the API is designed and documented.
  • Much of the API is stubbed out in code, ready to have the "brains" inserted.
  • Keyword extraction is implemented, in addition to a front-facing "test shell" which can easily be modified to show off the new features as they are added.

Collaborators

Oh so many folks at Hacktoberfest helped in discussions, brainstorms, and fleshing out the wish lists, and in some cases even in the code. Shout-outs in particular go to:

  • [Raynor Vliegendhart] who helped design the Python server template and served as a spectacular Python resource.
  • [Tathagata Dasgupta] who has been particularly enthusiastic about contributing his entity extraction work.
  • [Mark Boas] who is going to be a key player in the incorporation of microformats transcription features.
  • Laurian Gridinoc whose comments and advice helped shape the API design.

Next steps

There are some clear next steps:

  • Continue fleshing out the API, particularly for Text and Audio formats.
  • Continue to code the specific tasks in the API.
  • Flesh out and possibly streamline the installation process.
  • Encapsulate library includes so that, when setting up a server, it is possible to install only specific portions (for instance, someone who doesn't need the identify_keywords task shouldn't have to install nltk).
  • Design a "test script" which will make it clear which tasks are functional and which tasks don't have their dependencies properly installed.
  • Design a new media type, "Web Site", which will focus on component extraction (e.g. "identify_videos", "identify_content", etc.).

Places where this project might be tested include:

  • This tool can be used (and contributed to) by anyone hacking together a project that uses media, whether they're in a newsroom, a professional in a company, or a hobby coder.

Meta Standards Resources

(Add links and summaries to documents discussing metadata)

  • rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents.
  • Metafragments is a proposed metadata markup for audio and video. (Julien Dorra)

Known APIs and Tools

(Add links and summaries of toolkits and APIs which can help generate data!)

Desired Functionality

TEXT

Valid Inputs: URL, Plain Text, HTML

Optional Inputs: Known Metadata

Desired Metadata:

- Primary Themes (Document-wide)
- Primary Themes (Per-paragraph)
- Suggested Tags
- Entities (Names, Locations, Dates, Organizations) and their locations in text
- Author
- Publishing organization (if any)
- Date initially published and date last updated
- Names of people who are quoted
- Quotes
- Other texts cited and/or linked (books, articles, urls)
- All other numbers (that aren't dates) and their units (i.e. data points cited)
- Corrections

VIDEO

Valid Inputs: URL, Video (.mov, .mp4, VP8)

Optional Inputs: Transcript, Faces, Known Metadata

Desired Metadata:

- Transcript
- Moments of audio transition (new speaker)
- Moments of video transition (new scene)
- OCR data (any text that appears in the image) and their timestamps
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Face identification and their timestamp ranges [only done if faces are provided]
- caption/summary
- author and job title
- headline
- keywords 
- location
- date
- copyright
- news org name
- URL to related web story

AUDIO

Valid Inputs: URL, Audio (mp3, wav)

Optional Inputs: Transcript, Voice Samples, Known Metadata

Desired Metadata:

- Transcript
- Moments of audio transition (new speaker)
- Entities (Names, Locations) and their timestamps
- Suggested Tags
- Voice identification and their timestamp ranges [only done if voice samples are provided]

IMAGE

Valid Inputs: URL, Image (jpg, gif, bmp, png, tif)

Optional Inputs: Faces, Known Metadata

Desired Metadata:

- OCR data and its coordinate location
- Object identification
- Face identification [only done if faces are provided]
- Location identification

In the photo itself we have:

- caption
- author and job title
- headline
- keywords 
- location
- date
- copyright
- news org name

INTERACTIVE

Valid Inputs: URL

Optional Inputs: None

Desired Metadata: ???


WEB PAGE

Valid Inputs: URL

Optional Inputs: None

Desired Metadata:

- images
- audio
- videos
- content
- title
- author
- last update
- meta tags

API

The API is to be as RESTful as possible. The current thought is that POST will be used to upload the media item (if needed), which will return a Media Item ID (MIID), and GET will be used to perform the actual analysis (taking in either an external URL or the MIID returned from a POST).
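
To make the flow concrete, here is a hypothetical client session against the /api/text endpoint using Python's requests library. The base URL, the JSON response shapes, and the wire encoding of the tasks dictionary are all assumptions, not confirmed behavior:

  # Hypothetical client flow: POST the media, get back an MIID, then GET the analysis.
  import requests

  BASE = 'http://localhost:8000'  # assumed server address

  # POST stores the media item on the server and returns a Media Item ID.
  resp = requests.post(BASE + '/api/text',
                       data={'text': 'Mozilla opened its Berlin hackfest today.',
                             'ttl': 3600})
  miid = resp.json()['miid']

  # GET performs the analysis, referencing either the MIID or an external URL.
  # Encoding the tasks dictionary as a JSON string in the query is an assumption.
  resp = requests.get(BASE + '/api/text',
                      params={'miid': miid,
                              'tasks': '{"identify_keywords": {"kcount": 3}}'})
  print(resp.json()['results'])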

Entity Types

- text
- image
- video
- audio

Text

URL: /api/text

POST

Inputs

- text_file:file // text file to store on the server
- url:str // url containing the text to store on the server
- text:str // text to store on the server
- ttl:int {D:180} // number of seconds until the file will be removed from the system (0 means indefinitely)

Note: Either text_file, url, or text must be provided

Outputs

- miid:int // unique media item id assigned to this item

GET

Inputs

- miid:int // server-provided media item id to be analyzed
- url:str // url containing the text to be analyzed
- text:str // text to be analyzed
- tasks:dictionary // list of tasks to perform
- results:dictionary {D: null} // list of results from past tasks

Note: Either miid, url, or text must be provided

Outputs

- results:dictionary // list of task results (one result object per task).
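
To make the tasks and results dictionaries concrete, here is one plausible layout; the spec does not pin down the exact encoding, so treat the field shapes and values below as purely illustrative:

  # Illustrative only: a possible shape for the tasks/results dictionaries.
  tasks = {
      'identify_keywords': {'type': 'both', 'klen': 2, 'kcount': 5},
      'identify_entities': {},  # run with the task's defaults
  }

  results = {
      'identify_keywords': {
          'document_keywords': ['berlin hackfest', 'news metadata'],
          'paragraph_keywords': [['berlin hackfest'], ['news metadata']],
      },
      'identify_entities': {
          'entities': [[0, 'Mozilla', 'ORGANIZATION']],
      },
  }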

Tasks

identify_entities

Identify entities (e.g. people, organizations, and locations) found in the text, either document wide, per paragraph, or both

Powered by [???]

Inputs

None

Outputs

- entities:array // array of [position, entity, type] tuples in the document
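
The backend here is still an open question; as a placeholder, here is a minimal sketch using nltk's named-entity chunker (nltk is already a dependency for identify_keywords). Positions are recovered with a simple substring search, which is only approximate:

  # Sketch only: entity extraction via nltk.ne_chunk. Requires the punkt,
  # averaged_perceptron_tagger, maxent_ne_chunker, and words data packages.
  import nltk

  def identify_entities(text):
      entities = []  # [position, entity, type] tuples
      tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
      for node in tree:
          if hasattr(node, 'label'):  # a named-entity subtree, e.g. PERSON
              name = ' '.join(token for token, tag in node.leaves())
              entities.append([text.find(name), name, node.label()])
      return entities
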
identify_keywords

Identify main keywords found in the text, either document wide, per paragraph, or both

Powered by nltk

Inputs

- type:enum('document','paragraph', 'both') {D: 'document'} // The scope of keywords to be extracted
- klen:int {D: 1} // The number of words per "keyword"
- kcount:int {D: 5} // The number of keywords to return

Outputs

- document_keywords:array // list of keywords for the entire document
- paragraph_keywords:array // list of keywords for each paragraph
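
Since this is the task that's already implemented, here is a minimal sketch of how it could look on top of nltk, mirroring the inputs above; this is an illustration, not necessarily the code in the repository:

  # Sketch: frequency-ranked n-gram keywords with nltk. Requires the punkt
  # and stopwords data packages (via nltk.download).
  import nltk
  from nltk.corpus import stopwords
  from nltk.util import ngrams

  def identify_keywords(text, type='document', klen=1, kcount=5):
      stop = set(stopwords.words('english'))

      def top_keywords(chunk):
          # Tokenize, drop stopwords and punctuation, then rank klen-grams.
          words = [w.lower() for w in nltk.word_tokenize(chunk)
                   if w.isalnum() and w.lower() not in stop]
          grams = [' '.join(g) for g in ngrams(words, klen)]
          return [kw for kw, _ in nltk.FreqDist(grams).most_common(kcount)]

      out = {}
      if type in ('document', 'both'):
          out['document_keywords'] = top_keywords(text)
      if type in ('paragraph', 'both'):
          out['paragraph_keywords'] = [top_keywords(p)
                                       for p in text.split('\n\n') if p.strip()]
      return out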

Video

URL: /api/video

POST

Inputs

- video_file:file // video file to store on the server
- url:str // url containing the video to store on the server
- ttl:int {D:180} // number of seconds until the file will be removed from the system (0 means indefinitely)

Note: Either video_file or url must be provided

Outputs

- miid:int // unique media item id assigned to this item

GET

Inputs

- miid:int // server-provided media item id to be analyzed
- url:str // url containing the video to be analyzed
- tasks:dictionary // list of tasks to perform
- results:dictionary {D: null} // list of results from past tasks

Note: Either miid or url must be provided

Outputs

- results:dictionary // list of task results (one result object per task)

Tasks

identify_audio_transitions

Identify moments of distinct changes in audio content (e.g. speaker changes).

Powered by [???]

Inputs

None

Outputs

- audio_transitions:array // list of [HH:MM:SS, sound_id] tuples
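
The detection method is undecided; one crude baseline is to watch for sudden jumps in RMS energy over short windows. This sketch assumes the audio track has already been pulled out of the video into a WAV file (e.g. with ffmpeg):

  # Naive baseline: flag a transition when a window's RMS energy jumps
  # sharply relative to the previous window.
  import wave
  import audioop

  def identify_audio_transitions(wav_path, window_s=0.5, jump_ratio=3.0):
      w = wave.open(wav_path, 'rb')
      width, rate = w.getsampwidth(), w.getframerate()
      frames_per_window = int(rate * window_s)
      transitions, prev_rms, t = [], None, 0.0
      while True:
          chunk = w.readframes(frames_per_window)
          if not chunk:
              break
          rms = audioop.rms(chunk, width)
          if prev_rms and (rms > prev_rms * jump_ratio or
                           rms * jump_ratio < prev_rms):
              stamp = '%02d:%02d:%02d' % (t // 3600, (t % 3600) // 60, t % 60)
              transitions.append([stamp, len(transitions)])  # [HH:MM:SS, sound_id]
          prev_rms, t = rms, t + window_s
      w.close()
      return transitions
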
identify_entities

Identify entities (e.g. people, organizations, and locations) found in the video transcript

Powered by [???]

Inputs

None

Outputs

- entities:array // array of [HH:MM:SS, entity, type] tuples in the document

identify_faces

Identify faces that appear in the video

Powered by [???]

Inputs

- sample_rate:int {D: 1} // number of frames per second to sample for analysis

Outputs

- faces:array // list of [start HH:MM:SS, end HH:MM:SS, [x, y], miid] tuples
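
True face identification against the supplied samples needs a recognizer on top, but OpenCV's stock Haar cascade can cover the detection half as a starting point. A hedged sketch, sampling frames per the sample_rate input; the miid slot stays empty because no matching is done yet:

  # Sketch: face *detection* (not identification) with OpenCV's Haar cascade.
  import cv2

  def identify_faces(video_path, sample_rate=1):
      cascade = cv2.CascadeClassifier(
          cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
      cap = cv2.VideoCapture(video_path)
      fps = cap.get(cv2.CAP_PROP_FPS) or 25
      step = max(1, int(fps / sample_rate))  # sample sample_rate frames/second
      faces, frame_no = [], 0
      while True:
          ok, frame = cap.read()
          if not ok:
              break
          if frame_no % step == 0:
              gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
              t = frame_no / fps
              stamp = '%02d:%02d:%02d' % (t // 3600, (t % 3600) // 60, t % 60)
              for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
                  # start == end per detection; merging hits into time ranges
                  # and matching against sample faces (the miid) is still to do.
                  faces.append([stamp, stamp, [int(x), int(y)], None])
          frame_no += 1
      cap.release()
      return faces
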
identify_keywords

Identify main keywords found in the video, either video wide or per time segment

Powered by nltk

Inputs

- block_size:int {D: 0} // size of the time blocks in seconds (0 means entire video)

Outputs

- video_keywords:array // list of [start HH:MM:SS, [keywords]] tuples for each time block

identify_video_transitions

Identify moments of distinct changes in video content (e.g. scene changes).

Powered by [???]

Inputs

None

Outputs

- video_transitions:array // list of [HH:MM:SS, scene_id] tuples
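
Pending a real backend, one simple baseline is to diff consecutive downscaled grayscale frames and call a spike in mean pixel difference a scene change:

  # Baseline sketch: scene-cut detection by mean absolute frame difference.
  import cv2
  import numpy as np

  def identify_video_transitions(video_path, threshold=30.0):
      cap = cv2.VideoCapture(video_path)
      fps = cap.get(cv2.CAP_PROP_FPS) or 25
      prev, scene_id, transitions, frame_no = None, 0, [], 0
      while True:
          ok, frame = cap.read()
          if not ok:
              break
          # Downscale so the diff is cheap and robust to small motion.
          gray = cv2.cvtColor(cv2.resize(frame, (64, 36)), cv2.COLOR_BGR2GRAY)
          if prev is not None and \
             np.abs(gray.astype(int) - prev.astype(int)).mean() > threshold:
              scene_id += 1
              t = frame_no / fps
              transitions.append(['%02d:%02d:%02d' %
                                  (t // 3600, (t % 3600) // 60, t % 60),
                                  scene_id])  # [HH:MM:SS, scene_id]
          prev, frame_no = gray, frame_no + 1
      cap.release()
      return transitions
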
ocr

Attempt to extract any text characters that appear in the video.

Powered by [???]

Inputs

- focus_blocks:array {D: null} // list of [x, y, h, w] boxes that restrict OCR to specific regions of the frame
- sample_rate:int {D: 1} // number of frames per second to sample for analysis

Outputs

- ocr_results:array // list of [start HH:MM:SS, end HH:MM:SS, [x, y], string] tuples
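
Tesseract (via the pytesseract wrapper) is one obvious candidate for the still-open backend. A sketch honoring the focus_blocks and sample_rate inputs:

  # Sketch: per-frame OCR with pytesseract. focus_blocks crop the frame
  # before recognition; None means OCR the whole frame.
  import cv2
  import pytesseract

  def ocr(video_path, focus_blocks=None, sample_rate=1):
      cap = cv2.VideoCapture(video_path)
      fps = cap.get(cv2.CAP_PROP_FPS) or 25
      step = max(1, int(fps / sample_rate))
      results, frame_no = [], 0
      while True:
          ok, frame = cap.read()
          if not ok:
              break
          if frame_no % step == 0:
              t = frame_no / fps
              stamp = '%02d:%02d:%02d' % (t // 3600, (t % 3600) // 60, t % 60)
              blocks = focus_blocks or [[0, 0, frame.shape[0], frame.shape[1]]]
              for x, y, h, w in blocks:
                  text = pytesseract.image_to_string(
                      frame[y:y + h, x:x + w]).strip()
                  if text:
                      # start == end; grouping repeats into ranges is still to do.
                      results.append([stamp, stamp, [x, y], text])
          frame_no += 1
      cap.release()
      return results
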
transcribe

Attempt to create a timestamped transcript for the video. The transcript will either be ripped from closed-caption (CC) data or estimated using speech-to-text algorithms.

Powered by [???]

Inputs

None

Outputs

- transcript:array // list of [HH:MM:SS, transcript] tuples
- transcription_method:enum('cc','stt') // method used to generate the transcript

Audio

URL: /api/audio

Image

URL: /api/image