Abstract
The goal of this project is to bring the power of computer vision and image processing to the Web. By extending the Media Capture and Streams specification, web developers can write video processing applications in a better way. The primary idea is to incorporate Worker-based JavaScript video processing with MediaStreamTrack. The user's video processing script can then do real image processing and analysis work frame by frame. We also propose an extension of ImageBitmap to allow better optimization opportunities for both implementers and Web developers. To support the video editor case, we would like to introduce OfflineMediaContext in the next phase. We also want to explore the concept of WebImage, a Web API amenable to hardware acceleration, as an opportunity for performance improvement. By accomplishing these APIs step by step, we believe we can greatly improve the competitiveness of the Web platform in the image processing and computer vision area.
Introduction
To get a quick understanding of what Project FoxEye is, please see the files below.
The latest one:
- Presentation file from the Whistler Work Week
Outdated:
- Presentation file from the Portland Work Week: File:Project FoxEye Portland Work Week.pdf
- Presentation file from the P2PWeb Workshop: File:Project FoxEye 2015-Feb.pdf
- YouTube: https://www.youtube.com/watch?v=TgQWEWiGaO8
The need for image processing and computer vision has been increasing in recent years. The introduction of the video element and media streams in HTML5 is very important, enabling basic video playback and WebCam capabilities. But it is not powerful enough to handle complex video processing and camera applications. In particular, tons of mobile camera, photo, and video editor applications on Android show their creativity by using OpenCV and similar libraries. It is a goal of this project to include the capabilities found in modern video production and camera applications, as well as some of the processing, recognition, and stitching tasks.
This API is inspired by the WebAudio API [1]. Unlike the WebAudio API, we try to reach the goal by modifying the existing Media Capture and Streams API. The idea is to add functions that associate a Worker-based script with a MediaStreamTrack. The Worker's script code then runs image processing and/or analysis frame by frame. Since we move most of the processing work to the Worker, the main thread is not blocked.
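To illustrate the kind of per-frame work such a Worker script would do, here is a minimal, self-contained sketch of a grayscale filter over an RGBA pixel buffer. It is plain JavaScript; the `grayscale` function name and the flat buffer layout are our own illustration, not part of the proposed API:

```javascript
// Convert an RGBA pixel buffer to grayscale in place, using the
// Rec. 601 luma weights. `data` is a Uint8ClampedArray of length
// width * height * 4, as a frame's pixel data would be laid out.
function grayscale(data) {
  for (let i = 0; i < data.length; i += 4) {
    const y = Math.round(
      0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2]
    );
    data[i] = data[i + 1] = data[i + 2] = y; // leave alpha untouched
  }
  return data;
}
```

A Worker would run a loop like this (or hand the buffer to OpenCV.js) once per `onvideoprocess` event, keeping the main thread free.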
Basically, this project has four parts. The first part extends MediaStreamTrack to associate a VideoWorker; it provides a way to do image processing jobs frame by frame. The second part is the ImageBitmap extension, which extends the ImageBitmap interface to let JavaScript developers read the underlying data out of, and set external data into, an ImageBitmap in a set of supported color formats. The third part is OfflineMediaContext, which renders an offline stream as fast as possible. The last part is WebImage, a computer vision API amenable to hardware acceleration; Web developers can use it to compose high-performance vision processing.
Thanks to the amazing asm.js and Emscripten work, we also provide an asm.js version of OpenCV called OpenCV.js. Web developers can leverage the power of OpenCV in a simpler way.
Concept
MediaStreamTrack with VideoWorker:
The new design is a simple and minimal change to the current API. By extending MediaStreamTrack and adding VideoWorker-related APIs, we let MediaStream support video processing functionality through script code in a Worker. Below is the draft WebIDL. Please see [5] for the draft specification.
[Constructor(DOMString scriptURL)]
interface VideoWorker : Worker {
  void terminate ();
  [Throws]
  void postMessage (any message, optional sequence<any> transfer);
  attribute EventHandler onmessage;
};

partial interface MediaStreamTrack {
  void addWorkerMonitor (VideoWorker worker);
  void removeWorkerMonitor (VideoWorker worker);
  MediaStreamTrack addWorkerProcessor (VideoWorker worker);
  void removeWorkerProcessor ();
};

interface VideoWorkerGlobalScope : WorkerGlobalScope {
  [Throws]
  void postMessage (any message, optional sequence<any> transfer);
  attribute EventHandler onmessage;
  attribute EventHandler onvideoprocess;
};

interface VideoProcessEvent : Event {
  readonly attribute DOMString trackId;
  readonly attribute double playbackTime;
  readonly attribute ImageBitmap inputImageBitmap;
  readonly attribute ImageBitmap? outputImageBitmap;
};
This example runs text recognition in a VideoWorker. In the VideoWorker, the developer can directly use the asm.js version of OpenCV.
That means, however, that the developer or a library provider should supply a way to transform an ImageBitmap into the OpenCV::Mat type. The alternative is to provide a new API (WebImage) that deals with this kind of interface problem. We can start that kind of implementation with the B2G case first.
Example Code 1
Main JavaScript file:
var myMediaStream;
navigator.getUserMedia({video: true, audio: false}, function (localMediaStream) {
  myMediaStream = localMediaStream;
  var videoTracks = myMediaStream.getVideoTracks();
  var track = videoTracks[0];
  // Attach a monitor worker: it observes frames without modifying the track.
  var myWorker = new VideoWorker("textRecognition.js");
  track.addWorkerMonitor(myWorker);
  myWorker.onmessage = function (oEvent) {
    console.log("Worker recognized: " + oEvent.data);
  };
  var elem = document.getElementById('videoelem');
  elem.mozSrcObject = myMediaStream;
  elem.play();
}, null);
textRecognition.js:
var textDetector = null;
onvideoprocess = function (event) {
  var img = event.inputImageBitmap;
  // Create the detector lazily, once the frame size is known.
  if (!textDetector) {
    textDetector = WebImage.createTextDetector(img.width, img.height);
  }
  // Do text recognition.
  // We might use a built-in detection function or OpenCV in asm.js.
  var words = textDetector.findText(img);
  var recognizedText = "";
  for (var ix = 0; ix < words.length; ix++) {
    recognizedText = recognizedText + words[ix] + " ";
  }
  postMessage(recognizedText);
};
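The monitor pattern above only observes frames. A processing worker, by contrast, produces output frames that feed the new MediaStreamTrack returned by addWorkerProcessor. The main-thread wiring would look roughly like this (a sketch against the draft WebIDL above; the "sepiaFilter.js" script name is our own placeholder):

```javascript
// Main thread: route a camera track through a processing worker.
// Per the draft WebIDL, addWorkerProcessor() returns a NEW
// MediaStreamTrack carrying the frames the worker produces.
var filterWorker = new VideoWorker("sepiaFilter.js"); // hypothetical script
var processedTrack = track.addWorkerProcessor(filterWorker);
var processedStream = new MediaStream([processedTrack]);
document.getElementById('videoelem').mozSrcObject = processedStream;

// Tear down when done: detach the processor and stop the worker.
track.removeWorkerProcessor();
filterWorker.terminate();
```

Inside the worker, the handler would write its result via the event's outputImageBitmap; the exact mechanism for committing the output frame is still being worked out in the draft [5].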
ImageBitmap extensions
Please see [6] for more information.
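As a rough illustration of what the extension enables, the sketch below reads raw pixel data out of an ImageBitmap. The method names (findOptimalFormat, mappedDataLength, mapDataInto) follow the draft in [6], but should be treated as assumptions since the proposal may still change:

```javascript
// Read the pixel data of an ImageBitmap into an ArrayBuffer in the
// bitmap's preferred color format (method names per the draft in [6]).
function readPixels(imageBitmap) {
  var format = imageBitmap.findOptimalFormat();
  var length = imageBitmap.mappedDataLength(format);
  var buffer = new ArrayBuffer(length);
  // mapDataInto() fills `buffer` and resolves with a layout
  // descriptor (stride, per-channel offsets, etc.).
  return imageBitmap.mapDataInto(format, buffer, 0).then(function (layout) {
    return { format: format, buffer: buffer, layout: layout };
  });
}
```

This is what would let a library bridge an ImageBitmap into an OpenCV::Mat without an intermediate canvas copy.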
WebImage:
WebImage is a high-level library for image processing and analysis. We will not go into much depth about it on this wiki page right now; we are focusing on the Worker part at the current stage.
The underlying implementation of this library can be asm.js or native code, depending on the performance need. Some features in asm.js might run poorly on B2G. We will do some experiments, such as a performance comparison between OpenCV in asm.js and native OpenCV on B2G.
We might need WebImage for the hardware-accelerator case too. For example, we can provide some built-in detection functions like face detection.
In summary, if performance is not a consideration, we will use the asm.js version as much as we can, so the deciding factor will be the performance need. For example, if we need text recognition on B2G, performance is critical, and the asm.js version of Tesseract can't run smoothly, then we go the native built-in way.
Why we need WebImage:
- For performance-critical cases.
- No application changes are needed for platform-specific optimization: for example, we can use the vendor's face detector in the camera on B2G, or a native-code version on B2G if there is any memory/performance concern.
- It can be cross-platform: the asm.js version can run on all platforms/browsers.
OfflineMediaContext:
We introduce OfflineMediaContext and modify MediaStream here to enable offline (as fast as possible) MediaStream processing. When developers want to perform offline MediaStream processing, they need to form a context that holds and keeps the relationship of all MediaStreams that are going to be processed together.
// typedef unsigned long long DOMTimeStamp;
interface OfflineMediaContext {
  void start(DOMTimeStamp durationToStop);
};

// Add an optional argument into the constructors.
[Constructor (optional OfflineMediaContext context),
 Constructor (MediaStream stream, optional OfflineMediaContext context),
 Constructor (MediaStreamTrackSequence tracks, optional OfflineMediaContext context)]
interface MediaStream : EventTarget {
  // No modification.
  ...
};
- OfflineMediaContext is the holding place of all MediaStreams that are going to be processed together at a non-realtime rate.
- OfflineMediaContext is also the object that triggers the non-realtime processing.
- OfflineMediaContext should be instantiated first; then the MediaStreams that are going to be processed together in the same context can be instantiated by passing the pre-instantiated OfflineMediaContext object into the MediaStream constructor (see the modified MediaStream constructors above).
- The native implementation of OfflineMediaContext holds a non-realtime MediaStreamGraph, just like the OfflineAudioContext in the WebAudio specification.
- The constructors are modified by adding an optional parameter, an OfflineMediaContext. This way, developers are able to associate a MediaStream with an OfflineMediaContext.
- If the optional OfflineMediaContext is given, the native implementation of the MediaStream should be hooked to the non-realtime MSG held by the given OfflineMediaContext instead of the global realtime MSG.
- If the optional OfflineMediaContext is given, we need to check whether the newly created MediaStream can be processed at an offline rate. If not, the constructor should throw an error and return null. (Constructors are always allowed to throw.)
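The steps above can be sketched as follows (a hypothetical usage sketch against the draft IDL; the ten-second duration, the "effects.js" script, and the way sourceStream is obtained are our own illustration):

```javascript
// Create the offline context first, then hook streams to it.
var offlineContext = new OfflineMediaContext();

// A MediaStream constructed with the context is attached to the
// context's non-realtime MediaStreamGraph instead of the global one.
var offlineStream = new MediaStream(sourceStream, offlineContext);

// Process through a worker as usual, then kick off non-realtime
// rendering of the first ten seconds of media (DOMTimeStamp is in ms).
var track = offlineStream.getVideoTracks()[0];
track.addWorkerProcessor(new VideoWorker("effects.js"));
offlineContext.start(10000);
```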
Demo Snapshots
Unlimited Potentials
According to "Firefox OS User Research Northern India Findings" [3], one of the key table stakes is camera-related features. "Ways to provide photo & video editing tools" is what this Web API is for. So if we can deliver some cool photo and video editing features, we can fulfill one of the needs of our target market.
In [3], one of the mentioned purchase motivators is "educate my kids." Features like PhotoMath can satisfy the education part.
In the long term, if we can integrate text recognition with TTS (text-to-speech), we can help illiterate people read words and phrases. That would be a very useful feature.
Offline text translation in the camera might be a killer application too. Waygo and Word Lens are two such applications on Android and iOS.
Text selection in images is also an interesting feature for the browser. Project Naptha demos some potential functionality based on text selection in images.
Use Cases
- Digital Image Processing (DIP) for the camera:
- Face In, see Sony Face In
- Augmented Reality, see IKEA AR
- Camera Panorama,
- Fisheye camera,
- Comic Effect,
- In the long term, we might need Android Camera HAL 3 to control the camera
- Smile Snapshot
- Gesture Snapshot
- HDR
- Video Stabilization
- Bar code scanner
- Photo and video editing
- Video Editor, see WeVideo on Android
- A faster way to build video editing tools.
- Lots of existing image effects can be used for photo and video editing.
- https://www.facebook.com/thanks
- Object recognition in images (not only Firefox OS, but also the browser):
- Shopping Assistant, see Amazon Firefly
- Face Detection/Tracking,
- Face Recognition,
- Text Recognition,
- Text Selection in Image,
- Text Inpainting,
- Image Segmentation,
- Text translation on image, see Waygo
- Duo Camera:
- Nature Interaction(Gesture, Body Motion Tracking)
- Interactive Foreground Extraction
and so on....
Some cool real-world applications we can refer to
- Word Lens:
- Waygo
- PhotoMath
- Cartoon Camera
- Photo Studio
- Magisto
- Adobe PhotoShop Express
- Amazon(firefly app)
Bugs in bugzilla
https://bugzilla.mozilla.org/show_bug.cgi?id=801176
https://bugzilla.mozilla.org/show_bug.cgi?id=709490
https://bugzilla.mozilla.org/show_bug.cgi?id=1044102
Conclusion
This project is not a JavaScript API version of OpenCV. It is a way to let web developers do image processing and computer vision work more easily. It can be a huge project if we want it to be. At least five sub-projects can be based on this work.
- Photo/Video editing tools
- Camera Effect
- Face recognition
- Text recognition(copy/paste/search/translate text in image)
- Shopping applications like the Amazon app.
This might be a chance to build some unique features on Firefox OS via this project.
References
- [1]:WebAudio Spec, http://www.w3.org/TR/webaudio/
- [2]:Canvas 2D Context, http://www.w3.org/TR/2dcontext
- [3]:"Firefox OS User Research Northern India Findings", https://docs.google.com/a/mozilla.com/file/d/0B9VT90hlMtdSLWhKNTV1b3pHTnM
- [4]:"Frame by frame video effects using HTML5 canvas and video", http://www.kaizou.org/2012/09/frame-by-frame-video-effects-using-html5-and/
- [5]:"Media Capture Stream with Video Worker", http://chiahungtai.github.io/mediacapture-worker/
- [6]:"ImageBitmap Extensions", http://kakukogou.github.io/spec-imagebitmap-extension/
Acknowledgements
The whole idea of adopting WebAudio as the reference design for this project came from a conversation with John Lin. Thanks to Robert O'Callahan for his great feedback and comments. Thanks to John Lin and Chia-jung Hung for their useful suggestions and ideas. Also, big thanks to my team members who helped me debug the code.
About Authors
CTai
My name is Chia-hung Tai. I am a senior software engineer in the Mozilla Taipei office. I work on Firefox OS multimedia stuff. Before this job, I had some experience in OpenCL, NLP (natural language processing), data mining, and machine learning. My IRC nickname is ctai. You can find me in the #media, #mozilla-taiwan, and #b2g channels. You can also reach me via email (ctai at mozilla dot com).
Kaku
Tzuhuo Kuo is an engineer in the Mozilla Taipei office.