L20n/File format principles

From MozillaWiki

There are a few first principles behind the file format for l20n, and I'd like to share those. Hopefully, these principles will help frame our discussion about the file format in a way that focuses on the overarching goals of l20n, instead of a direct comparison with the current LOL format.

  0. terminology
  1. error robustness, defined error recovery
  2. security
  3. familiar concepts: strings should be strings, arrays should be arrays, hashes should be hashes
  4. clear exposure of context

On a second level, there are

  a. no significance of whitespace
  b. easy things are easy
  c. complex things can be grokked
  d. l10n brief, aka file-wide comments
  e. authoring-friendly
  f. no ambiguous syntax

0) Terminology: l20n defines a domain-specific programming language, and that language is, AFAIK, declarative.

Rationale:

ad 1) error robustness, defined error recovery: Localization is really fuzz-testing. I've been hacking on l10n tools for the better part of 5 years now, and just last week I fixed a tooling issue that no localizer had managed to trigger yet. Editing tools don't really constrain the data that arrives at your l10n toolchain; the long tail is just going to try us. For a technology that's supposed to work off the web, being fault-resistant and fault-tolerant is even more important than for a build-time infrastructure like the one we have for Firefox.

I'm hoping to get to an error recovery performance that's comparable to CSS, which is why we also asked Fantasai for feedback on the format we proposed a while back.

I confess that the error recovery part isn't tangible in the docs so far; we'll need to arrive at a defined level of performance here.
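To make "defined error recovery" a bit more concrete, here is a minimal sketch of the CSS-style idea: when an entry is malformed, drop everything up to the next terminator and carry on with the rest of the file. The `name = "value";` format, the `ENTRY` pattern, and `parse_with_recovery` are all made up for this illustration; this is not l20n syntax and not how any existing l10n tool works.

```python
import re

# Hypothetical toy format, NOT l20n syntax: entries look like  name = "value";
ENTRY = re.compile(r'\s*([A-Za-z_]\w*)\s*=\s*"([^"]*)"\s*;')


def parse_with_recovery(source):
    """Collect all well-formed entries; skip malformed ones instead of aborting."""
    entries, errors = {}, []
    pos = 0
    while pos < len(source):
        if source[pos:].strip() == "":
            break  # only trailing whitespace is left
        match = ENTRY.match(source, pos)
        if match:
            entries[match.group(1)] = match.group(2)
            pos = match.end()
            continue
        # Defined recovery rule: discard everything up to and including
        # the next ';' (or the end of input), remember it, and continue.
        end = source.find(";", pos)
        end = len(source) if end == -1 else end + 1
        errors.append(source[pos:end].strip())
        pos = end
    return entries, errors


entries, errors = parse_with_recovery('ok = "OK"; broken = oops; cancel = "Cancel";')
print(entries)  # the two well-formed entries survive
print(errors)   # the malformed entry is reported, not fatal
```

The point of the sketch is only that the recovery rule is written down and predictable: one broken entry costs exactly that entry. A real format would need a rule that also copes with terminators inside quoted values, which this toy ignores.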

ad 2) security: You can't trust localized content. It's content you can't read, and it might even come over the web. Malicious content in a localization must not be able to do harm.
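One kind of evil impact is markup injection: a localized value that smuggles in a script tag. As a small sketch, and assuming the consuming code builds HTML by hand, the localized value can be treated strictly as text and escaped before it goes anywhere near markup. `render_label` is a hypothetical helper for this illustration; `html.escape` is from the Python standard library.

```python
import html


def render_label(localized_value):
    """Wrap an untrusted localized string in a label element, as plain text."""
    # Assumption: localized_value is untrusted and may have come over the web.
    return "<label>" + html.escape(localized_value) + "</label>"


print(render_label("Speichern & schließen"))
print(render_label("<script>alert(1)</script>"))
```

This covers only one attack surface. The broader principle is that nothing in a localization file should be able to reach further than the string it is supposed to replace.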

ad 3) familiar concepts: Kinda speaks for itself. For strings, arrays, and hashes, there's not much conflict between the concepts out there. For macros and strings with attributes, the path is more ambiguous.

ad 4) context: Much of an l20n localization file is going to be "use this string instead", but there's going to be a good deal of "this has an attribute and it may depend on that condition". There is a conceptual bucket in which things like multiple variations of a localizable string appear, or in which a tooltip and an accesskey are tied to the UI string. Not being able to group those is one of the failures of our existing infrastructure that we're trying to fix.
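As a rough picture of that conceptual bucket, the sketch below keeps a UI string, its attributes, and its variations together in one object, so a tool or a localizer sees them as one unit. The `Entity` class, its field names, and the `"many"` condition key are hypothetical and only illustrate the grouping; they say nothing about the actual l20n data model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One localizable unit: the string plus everything tied to it."""
    value: str
    attributes: dict = field(default_factory=dict)  # e.g. tooltip, accesskey
    variants: dict = field(default_factory=dict)    # condition -> alternative string

    def resolve(self, condition=None):
        # Fall back to the plain value when no variant matches the condition.
        return self.variants.get(condition, self.value)


save = Entity(
    value="Save file",
    attributes={"tooltip": "Save the current file", "accesskey": "S"},
    variants={"many": "Save files"},
)
print(save.resolve())        # plain value
print(save.resolve("many"))  # variant chosen by a condition
print(save.attributes["accesskey"])
```

With flat key-value files, the tooltip and the accesskey would be separate entries that are related only by naming convention.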

ad a) no significant whitespace: Gandalf, stas, and I are torn on this one. Stas would have left it out, to avoid it being provocative. Gandalf sees this as an item of consistency with other web specs like HTML, JS, CSS, and XML. I do a lot of Python programming, and the significance of whitespace there hasn't convinced me in everyday life. It somewhat ties into error robustness. Wrong indentation is the mistake that happens to me more frequently than any other in Python, and even worse, not all of my indentation errors actually result in invalid code.
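A small example of an indentation error that still yields valid Python, written for this page as an illustration: the two functions differ only in the indentation of the `return` line, both are accepted by the interpreter, and they return different results.

```python
def count_long_intended(words):
    count = 0
    for word in words:
        if len(word) > 3:
            count += 1
    return count


def count_long_misindented(words):
    count = 0
    for word in words:
        if len(word) > 3:
            count += 1
        return count  # one level too deep: returns during the first iteration


words = ["tool", "tip", "accesskey"]
print(count_long_intended(words))     # counts every word longer than 3 characters
print(count_long_misindented(words))  # stops after looking at the first word
```

No parser can flag the second version, because as far as the grammar is concerned there is nothing wrong with it.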

ad b) easy things: Easy things should be easy.

ad c) complex things: Complex things should be able to be grokked. This ties into context, most prominently. To some extent it also ties into whitespace, as people should be able to indent code as they see fit.

ad d) l10n brief: A common feature request from the l10n community is an l10n brief, a comment that is defined to cover the contents of the whole file.

ad e) authoring-friendly: In its worst counter-example, this means that we would use sqlite db dumps. (Not making this up, that has been suggested in the past.) Common operations like finding an entity, copy/paste, and reading the value should be easy. It also ties in to clearly defined syntax concepts, which leads to

ad f) no ambiguity: Ambiguity makes it hard to understand where you are, in context, type, etc. Sources of ambiguity include using characters for multiple meanings, requiring escaping that's not expected, and mode switches. Also, no "UI" is ambiguous with invisible "UI".