PDF.js is an HTML5-based Portable Document Format renderer.
Project Manager: Bill Walker
Developers: Artur Adib, Brendan Dahl
IRC: #pdfjs on irc.mozilla.org
Mailing list: dev-pdf-js
- 1 Meeting Notes
- 2 Current work items
- 3 Tracking in-progress work
- 4 Testing
- 5 Utils
- 6 Coding Style
- 7 Review (aka pull-request) policy
Meeting notes are now kept in etherpad and follow the format of https://etherpad.mozilla.org/pdfjs-<YYYY-MM-DD>. There should be one for every Thursday.
Current work items
Render PDF 1.7 spec perfectly
- (tracked by github issues)
Analyze pdf.js's feature completeness across all PDFs
- gather large corpus of PDFs (unassigned)
- run pdf.js over all PDFs to collect data on missing and broken features (unassigned)
- prioritize missing features (unassigned)
- implement missing features (unknown work)
Integrate code into Firefox/Gecko proper
- (as required)
Tracking in-progress work
We'll try to use github issues to track work. (If that proves too unwieldy, we can move to another system.)
Like bugzilla, "issues" are used to track both bugs filed by users and specific work items for developers. Try to file one issue per problem observed. For example,
Text looks bad in PDFs
is an unspecific issue. A more specific and more helpful one is
Glyph spacing is weird on page 10 of http://foo.com/bar.pdf
Work items should be just as specific. A vague description is an unspecific work item. Better is
Add code to convert Type 1 into CFF
For big projects that span many issues, let's try to use issue "labels" to track the work. (In the same way one would use metabugs to track big projects in bugzilla.) So for the font example above, while Type1->CFF conversion is a specific work item, we might want to tag it "fonts" for easier searching.
Once a specific work item (issue) has been filed, please assign the issue to yourself if you're working on it. This avoids multiple people working on the same projects and thereby wasting time.
Big project: SVG backend
Most of SVG maps well to PDF (was influenced by?). There are existing PDF->SVG translators. Perf is the biggest concern. We want to build the SVG document in the background, without affecting main-thread interactivity. The way to do that is by building the document with a Web Worker thread. The problem is, Workers don't have access to any DOM APIs. We'll probably need to build the document as a string in the background, then send it over to the main thread for parsing.
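A minimal sketch of the "build the document as a string" idea follows. The drawing-op format (`type`, `x`, `y`, `w`, `h`, `fill`, `str`) and the function names are assumptions made up for illustration; they are not pdf.js's internal representation. The point is only that nothing here touches the DOM, so it could run in a Worker.

```javascript
"use strict";

// Escape characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an SVG document as a plain string from a list of hypothetical
// drawing ops. No DOM APIs are used.
function buildSvgString(width, height, ops) {
  var parts = [];
  parts.push('<svg xmlns="http://www.w3.org/2000/svg" width="' + width +
             '" height="' + height + '">');

  var count = ops.length;
  for (var i = 0; i < count; i++) {
    var op = ops[i];
    if (op.type === "rect") {
      parts.push('<rect x="' + op.x + '" y="' + op.y +
                 '" width="' + op.w + '" height="' + op.h +
                 '" fill="' + escapeXml(op.fill) + '"/>');
    } else if (op.type === "text") {
      parts.push('<text x="' + op.x + '" y="' + op.y + '">' +
                 escapeXml(op.str) + "</text>");
    }
  }

  parts.push("</svg>");
  return parts.join("");
}
```

In a Worker, the resulting string could be handed over with postMessage(svgString), and the main thread could parse it with new DOMParser().parseFromString(svgString, "image/svg+xml"). Whether string building plus parsing is fast enough is exactly the open perf question above.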
Big project: Text selection
- Option 1: In SVG backend
- Draw to canvas first. On first selection, switch to SVG-rendered content.
- Let Gecko do all text selection in SVG document
- Option 2: In canvas backend
- Build data structure representing text drawn to screen (e.g., display list/BSP/etc.). For best results, collapse adjacent and same-height/width "text runs".
- Walk data structure and compute textruns at a particular point and/or within a bounding box
- Add UI for "highlighted" text above PDF and saving selected text to clipboard
- Corner cases: clipped text, occluded text, non-white backgrounds, non-black text
- Maybe: render without display-list building first, then on first selection re-interpret PDF to build display list. Or pre-build display list in the background.
See notes from "Baz", a poppler developer.
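For Option 2, here is a rough sketch of collapsing adjacent text runs and querying them by bounding box. The run record ({x, y, width, height, str}) and both function names are hypothetical, and a flat array with a linear scan stands in for whatever display list or BSP we would really use.

```javascript
"use strict";

// Merge runs that share a baseline and height and touch horizontally
// (within the given tolerance). Returns new records; input is not modified.
function collapseRuns(runs, tolerance) {
  var sorted = runs.slice().sort(function (a, b) {
    return a.y - b.y || a.x - b.x;
  });
  var out = [];

  for (var i = 0; i < sorted.length; i++) {
    var run = sorted[i];
    var last = out[out.length - 1];
    if (last && last.y === run.y && last.height === run.height &&
        Math.abs(last.x + last.width - run.x) <= tolerance) {
      // Extend the previous run to cover this one.
      last.width = run.x + run.width - last.x;
      last.str += run.str;
    } else {
      out.push({ x: run.x, y: run.y, width: run.width,
                 height: run.height, str: run.str });
    }
  }

  return out;
}

// Return the runs whose rectangles intersect the selection box.
function runsInBox(runs, box) {
  return runs.filter(function (r) {
    return r.x < box.x + box.width && box.x < r.x + r.width &&
           r.y < box.y + box.height && box.y < r.y + r.height;
  });
}
```

A point query is the same call with a box of width and height 1. This ignores the corner cases listed above (clipped or occluded text), which would need extra information per run.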
Big project: Accessibility
Kind of like text selection, except there's no web-visible accessibility API we could hook into with canvas. So:
- Somehow detect that a11y is enabled, permanently switch to SVG backend
- Let Gecko implement a11y interfaces
Big project: Vertical text
Somewhat pervasive mode switch in text-drawing code. Is it just a matter of transform hackery to put glyphs in the right place, or do we need canvas support? Canvas support might be a big project.
Big project: Search
We want browsers' find-in-page features to see PDF text too. This is hard with canvas. With an SVG or HTML backend, if we set things up properly, we might get search almost for free. Search is very similar to text selection but slightly harder: search potentially needs to know about word, sentence, paragraph, column, and page breaks, so that it can tell which chunks of text belong to the same word. Text selection just needs to know which text might be adjacent.
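To illustrate the word-break problem, here is a small sketch that turns positioned text runs into one searchable string. The run record ({x, y, width, str}), the function name, and the gap heuristic are all assumptions for illustration; real PDFs would need something smarter (columns, kerning, rotated text).

```javascript
"use strict";

// Join hypothetical text runs into a single string. A horizontal gap wider
// than gapThreshold is treated as a word break, and a change of baseline is
// treated as a line break.
function runsToText(runs, gapThreshold) {
  var sorted = runs.slice().sort(function (a, b) {
    return a.y - b.y || a.x - b.x;
  });
  var text = "";
  var prev = null;

  for (var i = 0; i < sorted.length; i++) {
    var run = sorted[i];
    if (prev) {
      if (run.y !== prev.y) {
        text += "\n";
      } else if (run.x - (prev.x + prev.width) > gapThreshold) {
        text += " ";
      }
    }
    text += run.str;
    prev = run;
  }

  return text;
}
```

With the text in this form, a plain indexOf on the string finds words that were drawn as several separate chunks.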
Big project: XFA or AcroForms
Big project: Sanitizer
PDF files can carry malicious content that can be used to compromise the security of the system. Rendering a PDF file with PDF.js is often slow or broken, which pushes users to open the files with native readers. It would be very useful to have a means of removing malicious content from a PDF.
- Use PDF.js to parse PDF into internal representation, but do not render it.
- Decompress and destream it.
- Remove all potentially malicious tags (this should be tweakable in a popup window similar to "Clear Recent History"): JS, Flash, 3D, forms, signatures, remote content, and anything else not needed for rendering.
- Recreate PDF file from the internal representation recomputing all the recomputable fields to mitigate memory corruption exploits.
Firefox should suggest sanitizing a PDF whenever the user downloads one by any means (either through the PDF.js GUI or the standard Firefox download dialog).
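The "remove potentially malicious tags" step could look roughly like the sketch below. It assumes the parsed PDF is available as a tree of plain objects and arrays, which is a stand-in and not PDF.js's actual internal representation. JS, JavaScript, AA and OpenAction are dictionary keys from the PDF specification; the list is illustrative, not a complete policy.

```javascript
"use strict";

// Illustrative block list only; a real sanitizer would make this tweakable.
var DEFAULT_BLOCKED_KEYS = ["JS", "JavaScript", "AA", "OpenAction"];

// Return a deep copy of a parsed object tree with every blocked key removed.
// The input tree is left untouched.
function sanitize(node, blockedKeys) {
  if (Array.isArray(node)) {
    return node.map(function (item) {
      return sanitize(item, blockedKeys);
    });
  }

  if (node !== null && typeof node === "object") {
    var clean = {};
    Object.keys(node).forEach(function (key) {
      if (blockedKeys.indexOf(key) === -1) {
        clean[key] = sanitize(node[key], blockedKeys);
      }
    });
    return clean;
  }

  // Numbers, strings, booleans, null: nothing to strip.
  return node;
}
```

Recreating the file from the cleaned tree (recomputing lengths, offsets, and so on) is the harder part and is not shown here.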
To run tests
- Have a look at this directory
pdf.js/test$ python test.py --help
Once you decide how you want to run it, that script will spin up one or more browsers, and a little web server to serve up test cases.
To uncompress a PDF
- install pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/)
- run |pdftk foo.pdf output uncompressed.foo.pdf uncompress|
Coding Style
- make sure HTML files are declared <!DOCTYPE html> (i.e., HTML5). IE9+ will load them in compatibility mode otherwise.
- add a "use strict"; statement (exactly that!) to the top of your JS files
- 2 spaces for indentation.
- Line breaks are free (I promise), so don't hesitate to use them to separate logical blocks inside your functions.
- Be sure to declare a variable with 'var' before using it; you don't want to be hurt by random variables living in the global scope.
- Files are named
- Don't use object methods and properties more than you have to. It is often faster to store the result in a temporary variable.
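A small sketch that follows the rules above: the strict-mode statement at the top, 2-space indentation, 'var' declarations, line breaks between logical blocks, and a property value cached in a temporary variable. The function itself (totalLength) is made up for illustration.

```javascript
"use strict";

// Sum the lengths of all strings in a list.
function totalLength(strings) {
  var total = 0;
  // Read the length property once instead of on every iteration.
  var count = strings.length;

  for (var i = 0; i < count; i++) {
    total += strings[i].length;
  }

  return total;
}
```

For example, totalLength(["ab", "cde"]) adds 2 and 3.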
If you have to do DOM manipulations (hopefully not!):
- Don't call getAttribute to see if an attribute exists, call hasAttribute instead.
- Prefer to loop through childNodes rather than using first/lastChild with next/previousSibling. But prefer hasChildNodes() to childNodes.length > 0. Similarly prefer document.getElementsByTagName(aTag).item(0) != null to document.getElementsByTagName(aTag).length > 0.
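The DOM rules above, put together in a short sketch. Both helper names (titleOrDefault, childTexts) are made up for illustration; the DOM methods they call (hasAttribute, getAttribute, hasChildNodes, childNodes) are the standard ones.

```javascript
"use strict";

// Use hasAttribute to test for existence; only read the value when present.
function titleOrDefault(element, fallback) {
  return element.hasAttribute("title") ?
    element.getAttribute("title") : fallback;
}

// Collect the text of every child node, looping through childNodes
// rather than walking firstChild/nextSibling.
function childTexts(element) {
  var texts = [];
  if (!element.hasChildNodes()) {
    return texts;
  }

  var children = element.childNodes;
  var count = children.length;
  for (var i = 0; i < count; i++) {
    texts.push(children[i].textContent);
  }

  return texts;
}
```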
Review (aka pull-request) policy
All pushes to the master must go through pull requests.
NB: this isn't being enforced yet
- New code has to pass all tests
- New code can't regress performance on (TBD) as measured by (TBD), unless it implements a new feature major enough to justify a temporary perf regression. This is up to common sense.
- Major new features should have architectural review from (TBD). Less major patches can be reviewed by (TBD).