Project FoxEye

Abstract

The goal of this project is to bring the power of computer vision and image processing to the Web. By extending the Media Capture and Streams specification, web developers can write video processing applications in a better way. The primary idea is to incorporate Worker-based JavaScript video processing with MediaStreamTrack: the user's video processing script does the real image processing and analysis work frame by frame. We also propose an extension of ImageBitmap to allow better optimization opportunities on both the implementation side and the Web developer side. To support the video editor case, we would like to introduce OfflineMediaContext in the next phase. We also want to explore the concept of WebImage, a hardware-acceleration-friendly Web API, as an opportunity for performance improvement. By accomplishing these APIs step by step, we believe we can greatly improve the Web platform's competitiveness in the image processing and computer vision areas.

Introduction

To get a quick understanding of Project FoxEye, please see the files below.

The latest one:

Outdated:

The need for image processing and computer vision has been increasing in recent years. The introduction of the video element and media streams in HTML5 is very important, allowing for basic video playback and WebCam capability, but it is not powerful enough to handle complex video processing and camera applications. In particular, tons of mobile camera, photo, and video editor applications on Android show their creativity by using OpenCV and similar libraries. It is a goal of this project to include the capabilities found in modern video production and camera applications, as well as some processing, recognition, and stitching tasks.

This API is inspired by the WebAudio API[1]. Unlike the WebAudio API, we try to reach the goal by modifying the existing Media Capture and Streams API. The idea is to add functions that associate a Worker-based script with a MediaStreamTrack; the Worker's script code then runs image processing and/or analysis frame by frame. Since we move most of the processing work to the Worker, the main thread will not be blocked.

[Figure: FoxEye - Overview.png]
Basically, this project consists of four parts. The first part extends MediaStreamTrack to associate a Worker with it, providing a way to do image processing jobs frame by frame. The second part is the ImageBitmap extension, which extends the ImageBitmap interface to let JavaScript developers read the underlying data out of, and set external data into, an ImageBitmap in a set of supported color formats. The third part is OfflineMediaContext, which lets an offline stream render as fast as possible. The last part is WebImage, a hardware-acceleration-friendly API for the computer vision area that Web developers can use to compose high-performance vision processing.

Thanks to the amazing asm.js and Emscripten work, we also provide an asm.js version of OpenCV called OpenCV.js. Web developers can leverage the power of OpenCV in a simpler way.

Design Principle

  • Follow The Extensible Web Manifesto

The original design of this project was a WebAudio-like design called WebVideo, but it was deprecated because we want to follow The Extensible Web Manifesto. We believe this kind of design is best for the Web. Consider the following arguments against the WebVideo design.

The computer vision area is more thriving than ever since deep learning was introduced, and we cannot enumerate video nodes for every kind of future creativity. Suppose we enumerated everything we can imagine today; there would still be things we cannot imagine. For example, we might miss an area called image sentiment analysis, which tries to classify the emotion of a human face in an image. (I take this as an example because I only learned of it recently.) Should we add it as a new video node?

Take another extreme example. We would probably see no argument about face detection or face recognition nodes, but those cover only human faces. What if someone needs to detect and recognize animal faces? It sounds like a ridiculous idea, but it is a real business: a product called "Bistro: A Smart Feeder Recognizes Your Cat's Face"[2] was on Indiegogo. So should we extend the original face detection/recognition node, add a new video node for cats, or ask developers to implement it in a VideoWorker? If we ask them to implement it in a VideoWorker, then we should reconsider whether the existing video nodes make sense at all. Should we move them all into VideoWorker and implement them in JS? If so, WebVideo degenerates to only a source node, a video worker node, and a destination node. Compared to this degenerated WebVideo, the MediaStream-with-Worker design is more elegant.

  • Performance and power consumption do matter

This project was initiated by the FxOS team, which means that performance and power consumption on mobile devices have been considered from day one.

Concept

MediaStreamTrack with Worker:

The new design is a simple and minimal change to the current API. By extending MediaStreamTrack and adding Worker-related APIs, we can let MediaStream support video processing functionality through script code in a Worker. Below is the draft WebIDL. Please see [2] for the draft specification.

[Constructor(DOMString scriptURL)]
interface VideoMonitor : EventTarget {
   attribute EventHandler onvideomonitor;
};

[Constructor(DOMString scriptURL)]
interface VideoProcessor : EventTarget {
   attribute EventHandler onvideoprocess;
};

partial interface MediaStreamTrack {
   void addVideoMonitor(VideoMonitor monitor);
   void removeVideoMonitor(VideoMonitor monitor);
   MediaStreamTrack addVideoProcessor(VideoProcessor processor);
   void removeVideoProcessor();
};

[Exposed=(Window, Worker),
 Constructor(DOMString type, optional VideoMonitorEventInit videoMonitorEventInitDict)]
interface VideoMonitorEvent : Event {
   readonly attribute DOMString     trackId;
   readonly attribute double        playbackTime;
   readonly attribute ImageBitmap?  inputImageBitmap;
};

[Exposed=(Window, Worker),
 Constructor(DOMString type, optional VideoProcessorEventInit videoProcessorEventInitDict)]
interface VideoProcessEvent : VideoMonitorEvent {
    attribute Promise<ImageBitmap>  outputImageBitmap;
};


Main thread:
[Figure: NewProjectFoxEye1.png]

Worker thread:
[Figure: Worker - FLOW.png]

Example Code

Please check the examples section in MediaStreamTrack with Worker.
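Below is a minimal sketch of how the draft API above might be used, assuming an implementation of the draft. The interface and event names come from the WebIDL above; the worker file name ("grayscale-worker.js") and the pass-through processing are hypothetical.

// main.js - a sketch assuming the draft API above is implemented.
// We take a camera track and attach a VideoProcessor whose worker
// script transforms each frame; addVideoProcessor returns a new
// track carrying the processed frames.
navigator.mediaDevices.getUserMedia({ video: true }).then((stream) => {
  const track = stream.getVideoTracks()[0];
  const processor = new VideoProcessor("grayscale-worker.js"); // hypothetical script
  const processedTrack = track.addVideoProcessor(processor);
  document.querySelector("video").srcObject = new MediaStream([processedTrack]);
});

// grayscale-worker.js - each frame arrives as a VideoProcessEvent.
// A real worker would read the pixels out of inputImageBitmap (via
// the ImageBitmap extensions) and build a new frame; here we just
// pass the input through unchanged.
onvideoprocess = (event) => {
  event.outputImageBitmap = Promise.resolve(event.inputImageBitmap);
};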

ImageBitmap extensions

Please see [2] for more information.
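As a rough sketch of the idea: the extension lets a script read an ImageBitmap's underlying pixels in a chosen color format. The method names below (findOptimalFormat, mappedDataLength, mapDataInto) follow the direction of the draft in [2] but should be treated as assumptions here, not a settled API.

// Read the pixels of an ImageBitmap into an ArrayBuffer, using the
// color format the implementation can provide most cheaply.
function readPixels(imageBitmap) {
  const format = imageBitmap.findOptimalFormat();              // e.g. "RGBA32"
  const buffer = new ArrayBuffer(imageBitmap.mappedDataLength(format));
  // mapDataInto resolves with a pixel-layout descriptor once the copy is done.
  return imageBitmap.mapDataInto(format, buffer, 0, buffer.byteLength)
    .then((layout) => ({ format, buffer, layout }));
}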

WebImage:

Why do we need WebImage? Because performance matters if we want to promote computer vision on the Web. Most image processing tasks can be accelerated by WebGL, but that is not the case for the computer vision area. So we need a hardware-acceleration-friendly, computer-vision-related Web API to help Web developers deliver fast and portable Web pages and applications. That is the motivation of WebImage.

This is a new zone to explore, so we might start by referring to existing related APIs. At the end of 2014, Khronos released a portable, power-efficient vision processing API called OpenVX. This specification might be a good starting point for WebImage.

The following diagram shows the basic concept of OpenVX. It uses a graph-node architecture; the role OpenVX plays for computer vision is like the role OpenGL plays for graphics. The developer just constructs and executes a graph to process incoming images.
[Figure: OpenVX-NodeGFX.PNG]
OpenVX can be implemented on top of OpenCL, OpenGL ES with compute shaders, or C/C++ with SIMD. That means we can support OpenVX on a wide range of Web platforms, from PC to mobile, once an OpenVX engine is provided.

[Figure: OpenVX.PNG]
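To make the graph-node idea concrete, here is a purely hypothetical sketch of what a WebImage-style API could look like in JavaScript, mirroring OpenVX's construct-then-execute model. None of these names (WebImageGraph, createSourceNode, createNode, connect, execute) exist anywhere; they and inputImageBitmap only illustrate the concept.

// Hypothetical: build a small vision graph (color convert -> edge
// detection), then execute it on an input frame. This mirrors OpenVX's
// model of constructing a graph once and running it per image.
const graph = new WebImageGraph();
const src = graph.createSourceNode(inputImageBitmap);
const gray = graph.createNode("ColorConvert", { from: "RGBA", to: "GRAY" });
const edges = graph.createNode("CannyEdgeDetector", { thresholds: [50, 100] });
graph.connect(src, gray);
graph.connect(gray, edges);
graph.execute().then((outputImageBitmap) => {
  // consume the processed frame, e.g. draw it to a canvas
});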

OfflineMediaContext:

We introduce OfflineMediaContext and modify MediaStream here to enable offline (as fast as possible) MediaStream processing. When developers are going to perform offline MediaStream processing, they need to form a context that holds and keeps the relationship of all the MediaStreams to be processed together.

// typedef unsigned long long DOMTimeStamp;
[Constructor]
interface OfflineMediaContext {
    void start(DOMTimeStamp durationToStop);
    attribute EventHandler oncomplete;
};

// Add an optional argument into the constructors.
[Constructor (optional OfflineMediaContext context),
 Constructor (MediaStream stream, optional OfflineMediaContext context),
 Constructor (MediaStreamTrackSequence tracks, optional OfflineMediaContext context)]
interface MediaStream : EventTarget {
// No modification.
...
};
  • OfflineMediaContext is the holding place of all MediaStreams that are going to be processed together at a non-realtime rate.
  • OfflineMediaContext is also the object that can trigger the non-realtime processing.
  • OfflineMediaContext should be instantiated first; MediaStreams that are going to be processed together in the same context can then be instantiated by passing the pre-instantiated OfflineMediaContext object into the MediaStream constructor. (See the modified MediaStream constructors above; a usage sketch follows this list.)
  • The native implementation of OfflineMediaContext holds a non-realtime MediaStreamGraph (MSG), just like the OfflineAudioContext in the WebAudio specification.
  • The constructors are modified by adding an optional parameter, OfflineMediaContext. This way, developers are able to associate a MediaStream with an OfflineMediaContext.
  • If the optional OfflineMediaContext is given, the native implementation of the MediaStream should be hooked to the non-realtime MSG held by the given OfflineMediaContext instead of the global realtime MSG.
  • If the optional OfflineMediaContext is given, we need to check whether the newly created MediaStream can be processed at an offline rate. If not, the constructor should throw an error and return null. (Constructors are always allowed to throw.)
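Below is a minimal usage sketch, assuming the draft IDL above; recordedStream stands for a pre-existing MediaStream (for example, one captured earlier) and is an assumption of this example.

// Create the offline context first, then bind a MediaStream to it.
const offlineContext = new OfflineMediaContext();
const offlineStream = new MediaStream(recordedStream, offlineContext);
// ... attach a VideoProcessor to offlineStream's video track here ...
offlineContext.oncomplete = () => {
  console.log("offline processing finished");
};
// Process 30 seconds of media as fast as possible (faster than realtime).
offlineContext.start(30 * 1000);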

OpenCV.js
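As mentioned in the overview, OpenCV.js is an asm.js build of OpenCV produced with Emscripten, letting Web developers leverage OpenCV directly from JavaScript. Below is a hedged sketch of what calling it could look like; the cv.* names are assumptions modeled on OpenCV's C++ API rather than a documented binding, and imageData stands for an ImageData taken from a canvas.

// Hypothetical OpenCV.js usage: grayscale conversion plus face
// detection with a Haar cascade. All cv.* names are assumptions.
const src = cv.matFromImageData(imageData);     // wrap canvas pixel data
const gray = new cv.Mat();
cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY);     // convert to grayscale
const classifier = new cv.CascadeClassifier();
classifier.load("haarcascade_frontalface_default.xml");
const faces = new cv.RectVector();
classifier.detectMultiScale(gray, faces);       // detect faces
console.log("detected " + faces.size() + " face(s)");
// Emscripten-managed objects need explicit cleanup.
src.delete(); gray.delete(); faces.delete(); classifier.delete();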

Unlimited Potentials

FoxEye technology tree

This is the technology tree of FoxEye. Solid lines are dependencies; dashed lines indicate performance improvements. The green blocks are possible applications. The blue blocks are API drafts with prototypes. The red block is something we know how to do but have not yet started. The yellow blocks are only concepts and need more study. So, step by step, we can provide a photo manager, wide-angle panorama, HDR, and even gesture control.

[Figure: Multimedia Platform Team Technology Roadmap - New Page.png]


Use Cases

  • Digital Image Processing (DIP) for camera:
    • Face In, see Sony Face In
    • Augmented Reality, see IKEA AR
    • Camera Panorama,
    • Fisheye camera,
    • Comic Effect,
    • Long term; these might need Android Camera HAL 3 to control the camera:
      • Smile Snapshot
      • Gesture Snapshot
      • HDR
      • Video Stabilization
    • Bar code scanner
  • Photo and video editing
  • Object Recognition in Images (not only Firefox OS, but also the browser):
    • Shopping Assistant, see Amazon Firefly
    • Face Detection/Tracking,
    • Face Recognition,
    • Text Recognition,
    • Text Selection in Image,
    • Text Inpainting,
    • Image Segmentation,
    • Text translation on image, see Waygo
  • Duo Camera:
    • Natural Interaction (gesture, body motion tracking)
    • Interactive Foreground Extraction

and so on....

Some cool applications we can refer to in the real world



Current Status

Next Phase (2015 H2)

  • Try to land and standardize MediaStream with Worker, ImageBitmap, and the ImageBitmap extension. See [3] for how standardization proceeds in Mozilla.
  • Design an automated way to export JavaScript APIs for OpenCV.js, and try to upstream it to the OpenCV code base.
  • Start to work on OfflineMediaContext.
  • Support product requirements, for example an Instagram-like app, wide-angle panorama, or Fox Photo.
  • Do some exploratory experiments on the WebImage concept.
  • Initialize a sub-project called Project GlovePuppetry.

Beyond 2015

  • Proof of concept for WebImage.
  • A crazy idea of Kernel.js or SPIR.js for JS developers to customize WebImage?
  • Gestural control API with a depth camera? => WebNI (Web Natural Interaction)?
  • Project Cangjie

Conclusion

Step by step, we can make things different. FoxEye is a pioneering project trying to push the Web's boundary into a new area: image processing and computer vision. The key factor in ensuring this project's success is you, every Web contributor. Your support, your promotion, your feedback, and your comments are the nutrition of this project. Please share this project with every Web developer. Let's bring more and more amazing things to the Web, and let's enrich and improve the whole Web platform's competitiveness.

References

Acknowledgements

The whole idea of adopting WebAudio as the reference design for this project came from a conversation with John Lin. Thanks to Robert O'Callahan for his great feedback and comments. Thanks to John Lin and Chia-jung Hung for their useful suggestions and ideas. Also, big thanks to my team members who helped me debug the code.

About Authors

CTai

My name is Chia-hung Tai. I am a senior software engineer in the Mozilla Taipei office. I work on Firefox OS multimedia stuff. Before this job, I had some experience in OpenCL, NLP (Natural Language Processing), data mining, and machine learning. My IRC nickname is ctai. You can find me in the #media, #mozilla-taiwan, and #b2g channels. You can also reach me via email (ctai at mozilla dot com).

Kaku

Tzuhuo Kuo is an engineer in the Mozilla Taipei office.

CJ Ku

CJ Ku is responsible for the OpenCV.js part.