Mozilla2:Unified Storage

From MozillaWiki

Author

  • Vladimir Vukicevic <vladimir@pobox.com>

Related Links

Goals

  1. Provide a unified interface for storing and searching through data for all Mozilla components and extensions
  2. Provide Mozilla core components and extension authors with tools to enable richer interaction with user data
  3. Eliminate the need for components to write their own file serialization/deserialization
  4. Eliminate multiple file formats from Mozilla profiles (db1.85, Mork, rdf/xml, xml, various text formats)

Requirements

  1. Low-complexity API suitable for use with simple stores (bookmarks, history)
  2. Direct SQL access for richer interaction (mail header cache)
  3. API for consumers to be able to expose triggers, etc. as notifications (RDF and others; this is tied in with the data source story for moz going forward, whether it stays an RDF-centered world or whether we branch out a bit)
  4. Cross-application notification
  5. Straightforward upgrades from an older version of a particular store's schema to a newer version.

Design Ramblings

Current data stores in Mozilla

Common:

  • Prefs
prefs.js (evaluable JS)
  • SSL Certificates
cert8.db (db1.85)
  • Local Store
localstore.rdf (RDF/XML)
  • MIME Types
mimeTypes.rdf (RDF/XML)

Firefox:

  • History
history.dat (Mork)
Implements RDF data source on top of Mork.
  • Bookmarks
bookmarks.html (NETSCAPE-Bookmark-file-1)
Kept in-memory as RDF, only written out periodically.
  • Cookies
cookies.txt (HTTP Cookie File)
Kept in-memory as a hash-table by host.
  • Saved passwords
signons.txt (text)
  • Form History
formhistory.dat (Mork)
  • Host Permissions
hostperm.1 (text)
  • Download Manager History
downloads.rdf (RDF/XML)

Thunderbird:

  • Addressbook
abook.mab (Mork)
  • Mail Views (Saved Searches)
mailViews.dat (custom, text)
  • secmod.db (?)
secmod.db (db1.85)
  • Folder Cache
panacea.dat (Mork)
  • Junk Bayesian filter training file
training.dat (unknown)
  • Server info
ImapMail/<server>.dat (Mork)
  • Folder summary data
Mail*/*/*.msf (Mork)
  • Filters (?)
mail filter data

Notes

Unified store doesn't imply just one store, but more a unified way of accessing information. I can initially see possibly two stores -- one a more secure one that would store SSL certs and saved passwords, and another for everything else. I'm not sure what "more secure" means here, though; one store might be enough, and might make more interesting relationships possible.

Not all items need/should migrate to unified store: prefs probably wants to stay as JS. Junk mail bayesian training bits probably also want to stay as whatever format they currently are. Everything else should go into the unified store.

Most of the other data we have is fairly simple to store: history, bookmarks, cookies, form history, download manager bits, addressbook, etc. Some of this data really wants to be hierarchical, e.g. bookmarks, while other data, such as tbird folder cache, wants to be slightly hierarchical. A simple child_of column can work well there, especially if the data is being exposed as RDF.
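As a rough illustration of the `child_of` idea, the sketch below stores a small bookmark tree in a single table and walks it with a recursive query. It uses Python's built-in `sqlite3` module. The table name `bookmarks_demo`, the `id` and `title` columns, and the sample rows are assumptions made for the example, not a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookmarks_demo (
    id       INTEGER PRIMARY KEY,
    child_of INTEGER REFERENCES bookmarks_demo(id),  -- NULL for top-level items
    title    TEXT NOT NULL
);
INSERT INTO bookmarks_demo (id, child_of, title) VALUES
    (1, NULL, 'Toolbar'),
    (2, 1,    'News'),
    (3, 2,    'Slashdot'),
    (4, 1,    'Mozilla');
""")


def subtree(conn, root_id):
    """Return (depth, title) rows for root_id and everything below it."""
    return conn.execute(
        """
        WITH RECURSIVE tree(id, title, depth) AS (
            SELECT id, title, 0 FROM bookmarks_demo WHERE id = ?
            UNION ALL
            SELECT b.id, b.title, tree.depth + 1
            FROM bookmarks_demo AS b
            JOIN tree ON b.child_of = tree.id
        )
        SELECT depth, title FROM tree ORDER BY depth, id
        """,
        (root_id,),
    ).fetchall()


for depth, title in subtree(conn, 1):
    print(depth, title)
```

A store that is only "slightly hierarchical" can use the same column and simply never nest more than one level deep.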

We'll need an IPC interface/service to handle cross-application notifications. Sqlite won't give us trigger functions on multiple sqlite instances. There are two solutions here: one, turn mozStorage into a server that's shared by all the apps. This could suck from a performance standpoint, especially for things like generating content directly from sql. Two, require that each table have an owner component in just one app; this component would be responsible for managing sqlite triggers, and would in turn re-publish interesting bits in some xpcom/ipc-esque way.

Interfaces

Two interfaces:

Full SQL

Core interfaces `mozIStorageService`, `mozIStorageConnection`, `mozIStorageStatement`, `mozIStorageValueArray`. Auxiliary interfaces `mozIStorageFunction`, `mozIStorageSchema`. See the interfaces in bug 261861 for descriptions and available methods on these.

  • how do upgrades work? We could have schema versions -- a `moz_schemas` table that holds a simple string -> integer map, with the integer getting incremented each time there's a schema update. `mozStorage` can provide a `getSchemaVersion` call to get the version, and a `setSchemaVersion` to update it.

We can also just provide a CreateTable call that would compare the schema given with the schema in the database, and notify if the schemas differ.
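The schema-version idea can be sketched as follows. This is only an illustration in Python's built-in `sqlite3` module, not the `mozStorage` API: the helper names mirror the proposed `getSchemaVersion`/`setSchemaVersion` calls, while the column names of `moz_schemas`, the `UPGRADES` list and the `upgrade` helper are assumptions made for the example.

```python
import sqlite3


def get_schema_version(conn, name):
    """Return the recorded version for a named schema, or 0 if it is unknown."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS moz_schemas ("
        " name TEXT PRIMARY KEY, version INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT version FROM moz_schemas WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else 0


def set_schema_version(conn, name, version):
    """Record the version reached by a named schema."""
    conn.execute(
        "INSERT OR REPLACE INTO moz_schemas (name, version) VALUES (?, ?)",
        (name, version),
    )


# Hypothetical upgrade steps for a "bookmarks" store; step N produces version N.
UPGRADES = [
    "CREATE TABLE moz_bookmarks (id INTEGER PRIMARY KEY, url TEXT, title TEXT)",
    "ALTER TABLE moz_bookmarks ADD COLUMN keyword TEXT",
]


def upgrade(conn, name, steps):
    """Apply only the steps newer than the recorded version; return the new version."""
    current = get_schema_version(conn, name)
    for version, statement in enumerate(steps, start=1):
        if version > current:
            conn.execute(statement)
            set_schema_version(conn, name, version)
    conn.commit()
    return get_schema_version(conn, name)
```

A component would call `upgrade(conn, "bookmarks", UPGRADES)` at startup. A store that is already at the latest version is left untouched.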

RDF Data Store interface

Would need two tables per db-backed data store:

   CREATE TABLE foo__resource_store (
       resource_id   SERIAL,
       resource_name STRING
   );
   CREATE TABLE foo__triple_store (
       triple_id     SERIAL,
       subject       INTEGER, -- corresponds to resource_id in the resource_store
       predicate     INTEGER, -- corresponds to resource_id in the resource_store
       object        BLOB,    -- string, integer, datetime, etc. -- depending on lit type!
       object_is_literal BOOLEAN
   );

Perhaps best to consider this illustrative only. There are many RDF engines (Jena, Redland, ARC, ...) that already have schemas for storing RDF data in SQL tables. If a suitably similar structure is adopted here, it could save some debugging time w.r.t. spec compliance, as well as open up possibility of re-coding other useful work in Javascript. For example, http://arc.web-semantics.org/ offers a SPARQL implementation in PHP that works on top of MySQL, recoding this in Javascript could be a reasonably cheap way of providing SPARQL query capabilities over this Mozilla RDF data storage. --DanBri
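To make the two-table layout above concrete, here is a minimal sketch in Python's built-in `sqlite3` module. It keeps the table and column names from the schema above but swaps in SQLite-style column declarations (`INTEGER PRIMARY KEY`, `TEXT`), which is an assumption on my part. The helper functions and the sample URIs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo__resource_store (
    resource_id   INTEGER PRIMARY KEY,
    resource_name TEXT UNIQUE NOT NULL
);
CREATE TABLE foo__triple_store (
    triple_id         INTEGER PRIMARY KEY,
    subject           INTEGER NOT NULL REFERENCES foo__resource_store(resource_id),
    predicate         INTEGER NOT NULL REFERENCES foo__resource_store(resource_id),
    object            BLOB,     -- literal value, or a resource_id when not a literal
    object_is_literal BOOLEAN NOT NULL
);
""")


def resource_id(conn, name):
    """Map a resource name to its integer id, creating the row on first use."""
    conn.execute(
        "INSERT OR IGNORE INTO foo__resource_store (resource_name) VALUES (?)",
        (name,),
    )
    return conn.execute(
        "SELECT resource_id FROM foo__resource_store WHERE resource_name = ?",
        (name,),
    ).fetchone()[0]


def add_triple(conn, subject, predicate, obj, is_literal=True):
    """Store one triple; non-literal objects are stored as resource ids."""
    value = obj if is_literal else resource_id(conn, obj)
    conn.execute(
        "INSERT INTO foo__triple_store"
        " (subject, predicate, object, object_is_literal) VALUES (?, ?, ?, ?)",
        (resource_id(conn, subject), resource_id(conn, predicate), value, is_literal),
    )


def literals_for(conn, subject, predicate):
    """Return the literal objects asserted for a subject/predicate pair."""
    rows = conn.execute(
        """
        SELECT t.object
        FROM foo__triple_store AS t
        JOIN foo__resource_store AS s ON s.resource_id = t.subject
        JOIN foo__resource_store AS p ON p.resource_id = t.predicate
        WHERE s.resource_name = ? AND p.resource_name = ? AND t.object_is_literal
        ORDER BY t.triple_id
        """,
        (subject, predicate),
    )
    return [row[0] for row in rows]


add_triple(conn, "urn:demo:bookmark:1", "urn:demo:title", "Mozilla")
add_triple(conn, "urn:demo:bookmark:1", "urn:demo:visit_count", 42)
```

Because the `object` column is declared `BLOB`, SQLite stores each value with the type it was given, so `typeof(object)` can tell a string literal from an integer one.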

Shared profiles

See also: Mozilla2:Profile Sharing, Mozilla2:Multi User Sharing

Being able to unify things like the SSL cert database using anything less than a kludge will require a profile directory shared amongst the aviary members.

Ideally, the profile directory will contain the shared data stores (which can be opened multiple times through sqlite), and an individual directory for each application. The application directories would contain things like application-specific extensions -- this scheme would also allow firefox to install thunderbird extensions.

Various things that need to be shared with toolkit/xulrunner-wide apps (cache, cookies, etc.) will all go into the shared profile directory.

Other things should probably be kept in the unified store, just in case an extension wants to do something with them. For example, exposing bookmarks to Thunderbird would allow forumzilla to figure out which feeds you have livemarks for, as well as enable the mailing of bookmarks that can be accepted within the mailer. Firefox could take advantage of addressbook lookup. Both apps would share the same mime type settings, such as default applications.

However: numerous people have indicated that doing yet another profile directory change would be... less than desirable. So, an alternate approach is to just add a `common` directory at the same level as `firefox` and `thunderbird`. Creating a profile should also create a common profile with the same name. I'm not sure what happens if you create a firefox and thunderbird profile with the same name, but you don't expect them to share any common bits. This probably isn't an issue.

Notifications

To support the existing RDF observer interfaces, changes in unified store data need to be exposed using some sort of notification mechanism -- we can probably install SQLite callbacks for every operation; that shouldn't slow us down too much.

Given a triple store, it should be easy to translate this into RDF. Given an arbitrary store, it'll be up to the store user to figure out what changed and what notifications need to be emitted. Helpers for creating triggers would be a plus.

For non-RDF notifications, the trigger mechanism will be used, with the following limitation: triggers are only emitted if the trigger-causing actions occur on the same database connection. This means that different database connections (e.g. different apps) will not receive each other's triggers through the database. The planned solution is to wrap triggers in such a way that they use the xpcom IPC subsystem to notify any other storage-using applications of the trigger asynchronously -- to the app, the triggers would be delivered just like a database-originated trigger.
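The same-connection limitation can be seen in a small experiment. The sketch below uses Python's built-in `sqlite3` module, with two connections to one database file standing in for two applications. The SQL function name `notify_bookmark`, the trigger name and the table layout are assumptions made for the example. It also assumes an SQLite build that lets application-defined functions run inside triggers, which is the usual default.

```python
import os
import sqlite3
import tempfile


def open_connection(path, received):
    """Open a connection whose notify_bookmark() SQL function records row ids."""
    conn = sqlite3.connect(path)
    # Application-defined functions are registered per connection, so each
    # connection ends up with its own callback.
    conn.create_function("notify_bookmark", 1, received.append)
    return conn


def demo():
    """Insert one row through each connection and report who was notified."""
    path = os.path.join(tempfile.mkdtemp(), "store.sqlite")
    seen_by_a, seen_by_b = [], []
    conn_a = open_connection(path, seen_by_a)
    conn_b = open_connection(path, seen_by_b)

    conn_a.executescript("""
    CREATE TABLE moz_bookmarks (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TRIGGER bookmark_added AFTER INSERT ON moz_bookmarks
    BEGIN
        SELECT notify_bookmark(NEW.id);
    END;
    """)

    with conn_a:  # commits on exit
        conn_a.execute(
            "INSERT INTO moz_bookmarks (url) VALUES (?)", ("http://example.org/a",)
        )
    with conn_b:
        conn_b.execute(
            "INSERT INTO moz_bookmarks (url) VALUES (?)", ("http://example.org/b",)
        )

    conn_a.close()
    conn_b.close()
    return seen_by_a, seen_by_b


seen_by_a, seen_by_b = demo()
print("connection A was notified about rows", seen_by_a)
print("connection B was notified about rows", seen_by_b)
```

Each list should contain only the row inserted through its own connection: the trigger runs inside whichever connection performs the write, so the other connection learns nothing unless something outside the database (such as the IPC wrapper described above) relays the event.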

Database schema freezability

One of the goals is to provide the ability for app and extension authors to have much more fine-grained control over data queries than would normally be possible. However, we also want to retain the ability to change database schemas as necessary. To this end, table schemas will be frozen at column granularity; e.g. for `moz_bookmarks`, the `url`, `title`, `keyword`, and `last_visited` columns might be frozen, but the others might not be. This means that extension authors need to refrain from using "*" in their SELECT queries, but instead should get only the exact columns that they're interested in. (This is good practice in any case.)
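A small sketch of why explicit column lists matter once only some columns are frozen. It uses Python's built-in `sqlite3` module. The `favicon` column stands in for a hypothetical later addition, and both reader functions are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moz_bookmarks (id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
conn.execute(
    "INSERT INTO moz_bookmarks (url, title) VALUES (?, ?)",
    ("http://www.mozilla.org/", "Mozilla"),
)


def read_named_columns(conn):
    """Robust: asks only for the frozen columns it needs."""
    return conn.execute("SELECT url, title FROM moz_bookmarks").fetchall()


def read_with_star(conn):
    """Fragile: silently depends on the table having exactly three columns."""
    return [(url, title) for _id, url, title in conn.execute("SELECT * FROM moz_bookmarks")]


before = (read_named_columns(conn), read_with_star(conn))

# A later release adds a column that was never part of the frozen set.
conn.execute("ALTER TABLE moz_bookmarks ADD COLUMN favicon BLOB")

after_named = read_named_columns(conn)  # unchanged
try:
    read_with_star(conn)
    star_still_works = True
except ValueError:  # each row now has four values, not three
    star_still_works = False
```

The query that names its columns keeps returning the same rows after the schema change, while the `SELECT *` reader fails as soon as the row shape changes.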

Back-end

SQLite will be the back end for the unified store. Because it implements a SQL engine, we get querying "for free", without having to invent our own query language or query execution system. Its code-size footprint is moderate (250k), but it will hopefully simplify much existing code so that the net code-size change should be smaller. It has exceptional performance, and supports concurrent access to the database. Finally, it is released into the public domain, meaning that we will have no licensing issues.

Other options exist, though an overriding concern is one of license for a piece that is to be a core Mozilla component. For example, Berkeley DB's license is incompatible with Mozilla, as is Firebird's. In addition, Berkeley DB lacks a query language, meaning we'd have to write our own; Firebird has similar goals to SQLite, but is based upon a codebase unproven in the open source world. Using RDF as a back end is infeasible due to performance; and finally, using a home-grown solution makes no sense.


Comments from Darin:

This is great stuff. Have you seen Mozilla2:Profile Sharing (which should really be made public)?

I agree that we should move profile data that is shared across toolkit apps into a common profile directory. For example, ~/.mozilla/toolkit/ or ~/.mozilla/common/. It would make sense to have separate shared profile directories per profile name. So, you'd then have ~/.mozilla/toolkit/default.xyz/ available to any other toolkit app that is using the profile named "default".

If we introduce a major change in the way we store profile data such as the sqlite based solution you are proposing here, then we should definitely include any changes needed to support profile sharing across toolkit apps.

Notifications need to be interprocess, no? How does sqlite solve the problem of overlapped access? How do we keep the state of the shared database synchronized between processes? Would transactions made on the database be reflected as notifications to other processes?

As for the SSL DBs, have you seen the rdb mechanism? All we have to do is provide a DLL that implements the Berkeley DB API, name it rdb.dll and NSS will use it instead.

Is there any benefit to using the IPC daemon for managing interprocess transaction synchronization since it already has support for that? I guess I'm not clear on how much of this sqlite already does for us. Also, I envision including the IPC daemon as a core mozilla component since it enables support for distributed XPCOM, so we would potentially have it at our disposal.


Note from colmsmyth:

I've added a comment to Mozilla2:Profile Sharing relating to enabling Mozilla data formats and protocols to be used by non-Mozilla applications, facilitating Mozilla's role as a platform for internet-aware clients.

If the IPC daemon is to be used to enable distributed XPCOM, please a) consider using it only for service discovery and b) keep the protocol text-based (utf-8) to allow other applications to use the IPC daemon to synchronize access to Mozilla data like the cache, history and bookmarks.


Note from Standard8

Not all applications that want to share profile data will necessarily have the .mozilla prefix on the path name, e.g. SeaMonkey uses .mozilla.org/seamonkey, I could foresee other apps (songbird?) wanting to be able to access data in a common profile area as well.

Comments from Vlad:

Darin wrote: > Notifications need to be interprocess, no? How does sqlite solve the problem of overlapped access? How do we keep the state of the shared database synchronized between processes? Would transactions made on the database be reflected as notifications to other processes?

Just did some testing with this. SQLite does not give us cross-instance triggers, which is really too bad. We'll have to work out an IPC solution here.

> As for the SSL DBs, have you seen the rdb mechanism? All we have to do is provide a DLL that implements the Berkeley DB API, name it rdb.dll and NSS will use it instead.

Interesting.. I have not. That might be the sanest and safest approach; I was looking at hacking support into NSS directly, and that code scared the crap out of me. I'd much rather work with an implementation of a known API than have to audit random code changes in NSS.

Comments from Axel

I would veto against storing resources as strings for RDF, there should be a table mapping strings to numbers, so that you'd end up querying for numbers instead of strings, should be much more performant. Same for literals, probably. Uniqueness of literals is harder than for resources, btw, as typed literals have redundant string representations.

What about separate RDF DataSources? This is a common shortcoming of RDF, that you can't identify a triple once it is in the wild, but there are good use cases to identify sets of triples, such as previous settings or stuff like that.

Comments from Vlad

Axel wrote: > I would veto against storing resources as strings for RDF, there should be a table mapping strings to numbers, so that you'd end up querying for numbers instead of strings, should be much more performant.

Hmm.. it would mean two tables per database-backed store rather than one, but that's not a big deal. I'm wondering whether having to do multiple database queries per RDF query (to figure out integer mappings for resources) will hurt more or less than storing strings; probably much less, especially since the results of the resource->id queries can easily be cached. I've updated the schema above.. does that look better?

> Same for literals, probably. Uniqueness of literals is harder than for resources, btw, as typed literals have redundant string representations.

Hmm.. can you give me some examples of this? The literals that we currently have are strings, dates, ints, and blobs; does rdf define more? If it's just those 4, they map nicely to sql types that we can store in that field (sqlite does manifest typing of data values, so we can figure out whether the thing we stored was an int, string, date, or whatever).


Comments from Axel

Vlad wrote: > Hmm.. can you give me some examples of this?

The RDF primer mentions all of the XML Schema Part 2 datatypes for typed literals. Prominent are numbers, like 2.5 vs 2.500, or dates from different timezones.

Another important issue I miss is the safeness of the data. If we share application data, the backing store needs to be crash and power-off proof, IMHO.

Is there any method to prune data, like, what happens if application data changes the scheme? We probably don't want to bloat our database with left-overs for all days. What about extension un-installs?

One thing that I still have on my plate is profile migration, like, how would apps like Safari or Opera migrate firefox profiles? We blame MS for obfuscating their internal data and being non-interoperable; we don't wanna end up in the same hot spot, right?

Comments from Vlad

Axel wrote: > Is there any method to prune data, like, what happens if application data changes the scheme?

It will be up to the component to clean up after itself in this case; it can also provide views for backwards-compatibility as necessary.

> We probably don't want to bloat our database with left overs for all days. What about extension uninstalls?

Extensions will hopefully clean up after themselves as well. However, the plan is to give extension authors an API to go through for table creation that will involve creating a table name that includes the extension's GUID; this way we can identify if an extension is still installed or not for any given table.

> One thing that I still have on my plate is profile migration, like, how would apps like Safari or Opera migrate firefox profiles? We blame MS for obfuscating their internal data and being not interoperable, we don't wanna end up in the same hot spot, right?

The data won't be obfuscated in any way. sqlite is fully in the public domain; as such, if anyone wants to get at Firefox profile data, they'll be welcome to use sqlite and get the data out. Migration -from- Firefox (or any other xul/toolkit app) is not our problem, provided that we don't intentionally make it difficult.

Comments from shaver

> Migration -from- Firefox (or any other xul/toolkit app) is not our problem, provided that we don't intentionally make it difficult.

sqlite is _worlds_ more usable for other apps than our current mix of ad-hoc formats and Mork chicanery.

Comments from ago

Any plan to have multiuser access (even coarse-grained with full file locks during writes) coupled with offline (i.e. a syncing algorithm, possibly sql based) capabilities? Calendars, addresscards etc. would benefit from it. Typical use: network with a shared contacts storage, which can be accessed directly (rw), or dumped locally for offline use and subsequent sync operation. One way to do syncing might be to store a sequential number (not a timestamp, which depends on the clock, actually the clocks, plural...) for each record/group of records (i.e. a vcard). Every time the record/group is modified, the next sequential number is used. Knowing the max value of this field at the time you go offline, it should be possible to sync the contents of two tables. I am thinking only of a single function to address simple situations (which are also quite common), leaving more complex conflict resolution issues to the clients. Assuming identical table structures with an Index field, something like:

sync(recordsetA, recordsetB, indexField, conflictResolution)

conflictResolution = priority_to_A / priority_to_B / duplicate / only_return_a_list_of_conflicts

So defining two items: A and B, and I(A), I(B) their respective sequential indices, and I(O) the max value of the sequential index of tableA at the time of the last connection, the two recordsets are merged according to the following rules:

  • I(A) <= I(O) && I(B)<= I(O) : nothing to do
  • I(A) > I(O) && I(B) <= I(O) : A->B
  • I(A) > I(O) && B missing : A->B
  • A missing && I(B) <= I(O) : delete B
  • I(A) > I(O) && I(B) > I(O) : conflict -> resolve according to conflictResolution

Same thing inverting letters. More complex conflict resolution schemes can and should be implemented by the clients. Syncing two apps based on Unified Storage should not be a problem. Syncing with an external application (which does not use the sequential index) is still possible, but it will require an intermediate step (client-side). In this case, a copy of the recordset as of the time of the last connection must be stored. On the next connection, by comparing the old recordset to the new data, it is possible to establish which items were edited offline (i.e. I(B) > I(O))...

Having such syncing functionality within Unified Storage might avoid a lot of code duplication since it is a common task. Moreover it would be more elegant, safer and faster to assign the sequential index internally rather than letting the clients do so.
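The merge rules listed in this comment can be sketched as a planning function. This is one reading of the rules, written in Python, and not an existing API. The function name `plan_sync`, the action labels, and the representation of each recordset as a dict from record key to sequential index are all assumptions. The `duplicate` resolution mode is left out of the sketch, and any unrecognised mode simply reports the conflict.

```python
def plan_sync(index_a, index_b, last_sync_index, conflict_resolution="list_conflicts"):
    """Decide, per record key, what a sync between recordsets A and B should do.

    index_a and index_b map record keys to sequential indices, I(A) and I(B).
    last_sync_index is I(O), the highest index at the time of the last connection.
    Records that need no work are omitted from the returned plan.
    """
    plan = {}
    for key in sorted(set(index_a) | set(index_b)):
        a = index_a.get(key)  # None means "missing from A"
        b = index_b.get(key)  # None means "missing from B"
        a_changed = a is not None and a > last_sync_index
        b_changed = b is not None and b > last_sync_index

        if a_changed and b_changed:
            # Both sides were edited since the last sync.
            if conflict_resolution == "priority_to_A":
                plan[key] = "copy_a_to_b"
            elif conflict_resolution == "priority_to_B":
                plan[key] = "copy_b_to_a"
            else:
                plan[key] = "conflict"
        elif a_changed:
            plan[key] = "copy_a_to_b"    # B is unchanged or missing
        elif b_changed:
            plan[key] = "copy_b_to_a"    # A is unchanged or missing
        elif a is None:
            plan[key] = "delete_from_b"  # A deleted it, B never touched it
        elif b is None:
            plan[key] = "delete_from_a"  # B deleted it, A never touched it
        # Otherwise both sides are unchanged: nothing to do.
    return plan


index_a = {"alice": 3, "bob": 7, "carol": 8, "erin": 9}
index_b = {"alice": 2, "bob": 4, "dave": 1, "erin": 10}
print(plan_sync(index_a, index_b, last_sync_index=5))
```

Separating planning from applying the changes keeps the rules easy to test, and lets a client inspect the conflicts before anything is written.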


Comments from Relyea

If you need help with the NSS integration, let me know. I've already implemented on RDB.DLL (I'm also the one who added the hooks). I have a pretty keen interest in seeing a single unified keystore for all applications.

bob

Comments from Mariuz

> For example, Berkeley DB's license is incompatible with Mozilla, as is Firebird's.

Firebird's license is MPL-based and is compatible.

> Firebird has similar goals to SQLite, but is based upon a codebase unproven in the open source world.

How is that? Firebird has 20 years of history.

> Using RDF as a back end is infeasible due to performance; and finally, using a home-grown solution makes no sense.

I agree :) It would be nice if Unified Storage made use of other open source databases (like Bugzilla is doing): imagine storing your bookmarks on a remote server using one SQL db for it. It should be written with db independence in mind.


Comments from piers7 (just a random user)

It seems to me from the POV of a user that one of the most important things to consider is not the implementation, it's the interface.

One of my big gripes with Firefox is keeping my bookmarks in sync with IE. To this task there's been a whole host of extensions written, of varying calibre, all of which miss the holy grail of making both browsers just share the same store. I'm thinking of a simple strategy pattern where the default 'NetscapeBookmarkStore' was swappable with an 'IEFavoratesStore' and a whole host of others (eg '3rdPartyNetworkBasedBookmarkStore' etc...)

Unified Storage may simplify matters for the developers, but it's an extensible, modular, pluggable storage architecture that'll ultimately deliver the most benefit to the end users.


Comments from Grauw

What about using a native XML database, and XQuery?


Comments from asr

Oh good, let's add Yet Another Windows Registry to OSS systems; the Gnome Windows Registry isn't complex and incompatible enough.

Don't confuse the IPC problem with the settings-storage problem; by conflating them, you discard the profound flexibility provided by simple, plain-text files which can be edited by any old EMACS or PERL or whatever. It is still possible to do this with some current mozilla-suite tools.

Unified -API- for data storage, terrific. Unified storage method, excellent. But the schema you propose is implementable as X resource strings, for goodness sake; why add a freakin' database engine to the mix?

Comments from Kingsley Idehen

Any reason why you cannot use iODBC (http://www.iodbc.org) to keep your SQL Data Storage and Access DB Engine independent? This will not hinder your choice of SQLite as the default SQL Engine since ODBC Drivers already exist for SQLite. This is an important architectural decision with long-term ramifications.

Follow-on comments from Kingsley Idehen

Since my earlier post (above) we have now released OpenLink Virtuoso in Open Source form. Note that Virtuoso is an Object-Relational DBMS Engine that includes support for SPARQL, SQL-200n, XPath, and XQuery. It is a bona fide RDF Triple Store amongst other things.

SPARQL Protocol and Query Language support in Mozilla will open up a lot of intriguing possibilities for Mozilla and Firefox (this would make Virtuoso based Data Spaces and other SPARQL Query Language & Protocol compliant Triple Stores directly accessible to the browser, e.g. as an advanced search option feature).

Technical details are available from the Virtuoso Wiki at: http://virtuoso.openlinksw.com/wiki/main/

Live SPARQL Demo is available at: http://demo.openlinksw.com/sparql_demo/

Comments from guanxi

Per SQLite's website (http://www.sqlite.org/whentouse.html), it performs poorly over networks and they specifically recommend against multiple computers accessing it via a network:

SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, the file locking logic of many network filesystems implementation contains bugs (on both Unix and windows). If file locking does not work like it should, it might be possible for two or more client programs to modify the same part of the same database at the same time, resulting in database corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it.
A good rule of thumb is that you should avoid using SQLite in situations where the same database will be accessed simultaneously from many computers over a network filesystem.

Wouldn't that conflict with the goals of Mozilla2:Multi User Sharing? Also, users store profiles on network shares; whether that's officially supported or not (?), do we want to create additional issues for them?

Comments from nekrad

Introducing a unified storage scheme for everything in one seems to be something like reinventing SQL or LDAP. But this can't be the issue. Instead SQL, LDAP or even flat text files should be the available datasource backends.

There were many things mentioned here, e.g. bookmarks, SSL certs, preferences, mail header caches, etc. They're not all the same kind of data; instead they fall into several different classes.

For example, bookmarks are a hierarchical thing, history is a "flat" table, preferences are a property list. These are completely different types of data, so we should have different APIs for them. What they all have in common is that they're datasources which should be identified by a unique name or descriptor (URI).

We also should consider that people don't want to have all kinds of data in the same place. I personally like to have bookmarks stored in a PostgreSQL db, share my mail aliases w/ mutt, etc. So each kind of data should get its own datasource. Each profile could have its root property list, where the URLs to the actual datasources are defined.

You perhaps like to have a look at a related bug:



SteveW: I think FireBird would be much preferable to SQLite, especially for Windows users as SQLite is flakey on Windows network drives.


Comments from DanBri

The SPARQL query language (see notes in the W3C ESW wiki, http://esw.w3.org/topic/SparqlImplementations) is a good candidate for such an abstract layer. Mozilla always had a notion of breaking things out into datasources, ... but the granularity (per-triple calls) was too fine-grained. SPARQL addresses this in a couple of ways. Firstly, you can wrap datasources by having them expose their data "as RDF" without doing it in terms of a triple-centric API. So for example, look at http://jena.sourceforge.net/SquirrelRDF/ which allows both LDAP and vanilla SQL to be mapped into RDF. Secondly, the query language itself has a notion of data provenance, as expressed through the GRAPH keyword, which has some potential for dealing with the 'free floating triples' issue. A query can include constructs that match against RDF graph structures (like a cleaned-up version of the old Mozilla RDF templates mechanism), ... but it can also include constraints that target triples from particular "named graphs". This seems a good match to Mozilla's needs and multi-datasource design, to me.

The query language also comes with a simple XML representation of a result set, suitable for XSLT/Xpath/Xquery etc processing. This could replace a lot of what Templates used to do.

I don't follow all the Mozilla forums but give me a shout if I can be useful somehow!


Comments from Broofa (a pseudo-random user)

Would it be appropriate to have this Unified Storage system allow extension developers/XUL developers in general be able to persist arbitrary data structures? For example, if I wanted to do a "To Do List" extension for Firefox/T-Bird, it'd be nice to have a quick-n-convenient place to store that data w/out having to worry about writing my own XPCOM APIs. (... or maybe there's already a way to do that and I just don't know about it? That's actually what I was looking for when I stumbled across this page. :-) )


Comments from Kingsley Idehen

Unified Storage and Database Specific Storage simply do not mean the same thing. It was always my belief that "Unified Storage" would be an abstraction above SQL (Relational) and RDF (Graph) Data Models. What's going on?

If you want to talk SQL, then why not do so via an ODBC abstraction? Why the DBMS specificity? Also, if you are going to SQL, does it mean RDF has to be tossed out of Mozilla? How about simply leaving it alone for those who are RDF oriented to work with?