User:Andrew Sutherland/MailNews/GlobalDatabase
{draft}
(Desired) Features
- Given a Contact, easily find ((Bryan Clark)):
- Previous messages from the person
- Previous attachments from the person
- (A count of) messages from this person (to me)
- (A count of) messages to this person (from me)
- People connected to this person (people they have been involved in threads with)
- Previous conversations with this person
- URL's involved in the messages
 
- When reading a message, be able to:
- Know what lightning tasks/events were spawned from this message
- Take/associate (long) notes ((Ovidiu))
 
Schema
Givens (Assumptions)
We are using the existing SQLite as the database engine already supported by the code-base. CouchDB is getting close to usable (0.8 will have compaction and incremental reduction), but is too risky for the desired Thunderbord 3.0 release date.
The global database is not a message storage mechanism. It is a layer on top of the existing folder-centric message storage mechanism, and will rely on that fact. There are separate plans to make the message storage mechanism modular (for e-mail accounts), but that has nothing to do with the global database.
The global database is not a full-text search mechanism, at least not for now. Each of the major platforms has their own platform-wide search mechanism and we are already planning/working to integrate with those mechanisms.
A Normalized Start
Because we are building on the
- folderLocations TABLE
- id INTEGER PRIMARY KEY
- folderURI TEXT
 
- no index on folderURI for now; the number of folders should be sane enough that a table scan won't be too bad.
- conversations TABLE
- id INTEGER PRIMARY KEY
- subject TEXT
- oldestMessageDate INTEGER although generally less useful than the newest message, track/cache the date of the oldest/earliest message seen for the conversation
- newestMessageDate INTEGER for performance reasons, track/cache the date of the newest message seen for this conversation
 
- subject index: (subject) indexed for threading by subject
- newestMessageDate index: (newestMessageDate) indexed so we can order conversations from most-recent to oldest
- oldestMessageDate index: (oldestMessageDate) indexed mainly for symmetry...
- messages TABLE
- id INTEGER PRIMARY KEY
- folderID INTEGER NOT NULL REFERENCES folderLocations(id) together with messageKey identifies the message in the traditional folder-storage model
- messageKey INTEGER NOT NULL
- conversationID INTEGER NOT NULL REFERENCES conversations(id)
- parentID INTEGER REFERENCES messages(id)
- headerMessageID TEXT the Message-ID header from the message
- ?? bodySnippet ?? -- clarkbw's little snippet of text to display
 
- folderID index: (folderID)
- headerMessageID index: (headerMessageID)
- conversationID index: (conversationID)
- contacts TABLE
- id INTEGER PRIMARY KEY
- directoryUUID TEXT
- contactUUID TEXT
- name TEXT
 
- identities TABLE
- contactID INTEGER NOT NULL REFERENCES contacts(id)
- email TEXT
- description TEXT
 
- attributeDefinitions TABLE
- id INTEGER PRIMARY KEY
- attributeType TEXT
- pluginName TEXT
- name TEXT
- parameter BLOB
 
- messageAttributes TABLE
- conversationID INTEGER NOT NULL REFERENCES conversations(id)
- messageID INTEGER NOT NULL REFERENCES messages(id)
- attributeID INTEGER NOT NULL REFERENCES attributeDefinitions(id)
- value NUMERIC
- ?? maybe date ??
 
- attrib-query (covering) index: (attributeID, value, ...)
- message-attrib-fetch (covering) index: (messageID, ...)
- conversation-attrib-fetch (covering) index: (conversationID, ...)
- aggregateDataDefinitions TABLE
- id INTEGER PRIMARY KEY
- pluginName TEXT
- name TEXT
- subjectType TEXT
- parameter BLOB
 
- aggregateDataCache TABLE
- aggregateID INTEGER NOT NULL REFERENCES aggregateDataDefinitions(id)
- subject INTEGER
- value NUMERIC
- data BLOB
 
- lookup index: (aggregateID, subject, value)
Query Strategy
Examples
Fundamental Attributes
Fundamental attributes are inherent and defining properties of an e-mail message: who sent the message, who they sent it to, when they sent it, etc. More explicitly, they are the pieces of the header that Thunderbird would natively display and expose as column headers, excepting the subject (which is explicitly part of the message).
- From/To/cc
- Definitions
- "fundamental", "built-in", "FROM", NULL
- "fundamental", "built-in", "TO", NULL
- "fundamental", "built-in", "CC", NULL
 
- Usage
- conv(1), msg(1), attr(FROM), identity(my-email@my-domain.tld)
- conv(1), msg(1), attr(TO), identity(cheese@hat.com)
- conv(1), msg(1), attr(TO), identity(row@bow.org)
- conv(1), msg(1), attr(CC), identity(foo@bar.com)
 
 
- Definitions
- Other
- Definition
- "built-in", "DATE", NULL
 
 
- Definition
Optimization Attributes
Optimization attributes are attributes that can be derived from multiple fundamental attributes, but, owing to presumed frequency of use and execution cost, get to be their own special attributes.
The definitive example are messages sent to/from the user. At least in the case of e-mail accounts, we can reasonably expect that nearly all of the (non mailing-list) messages will involve the user in some way. This means that the number of "FROM":user/"TO":user attributes will be quite high. Additionally, many of our queries will involve searching for this attribute, because most of our functionality centers around the user and his/her relationship to others.
Because of the way our database schema (and query strategy) works, any query for messages possessing attribute A:1 and attribute B:2 effectively results in intersecting the set of all messages with attribute A:1 and the set of all messages with attribute B:2. Since the set of FROM:user/TO:user attributes will be relatively large, we can expect a nice performance benefit by adding a bounded amount of redundant data.
- From-Me-*, *-Me-From
- Definitions
- "optimized", "built-in", "FROM-ME-TO", identity(my-email@my-domain.tld)
- "optimized", "built-in", "FROM-ME-CC", identity(my-email@my-domain.tld)
- "optimized", "built-in", "TO-ME-FROM", identity(my-email@my-domain.tld)
- "optimized", "built-in", "CC-ME-FROM", identity(my-email@my-domain.tld)
- "optimized", "built-in", "FROM-ME-TO", identity(my-other-email@my-other-account.tld)
 
- Usage
- conv(1), msg(1), attr(FROM-ME-TO), identity(cheese@hat.com)
- conv(1), msg(1), attr(FROM-ME-TO), identity(row@bow.org)
- conv(1), msg(1), attr(FROM-ME-CC), identity(foo@bar.com)
 
 
 
- Definitions
Derived Attributes
Derived attributes are attributes populated by "data extractors" plugins/extensions from the contents of the message.
Simple Ones:
- Bugzilla/bug-tracker helper plugin
- Definitions
- "derived", "buglamp", "BUG", "bugzilla.mozilla.org"
 
- Usage
- conv(99), msg(99), attr(BUG), 123456
 
 
- Definitions
- A geolocation plugin that attempts to figure out where a message was sent from based on the IP address of the sender, explicit headers, or black magic.
- Definitions
- "derived", "geoplug", "LATITUDE", NULL
- "derived", "geoplug", "LONGITUDE", NULL
 
- Usage
- conv(1), msg(1), attr(LATITUDE), 12.34
- conv(1), msg(1), attr(LONGITUDE), 23.45
 
 
- Definitions
- A purchase detector plugin
- Definitions
- "derived", "amazon-spree-tector", "I-SPENT", NULL
 
- Usage
- conv(99), msg(99), attr(I-SPENT), 40.20
 
 
- Definitions
- Topic Analysis / Message Classifier plugin - Use analysis of the message content to assign the message to a topic.
Complex Ones (need their own table unless we change things up):
- URL Detector
- note: So, one thing we could do is use a (piece of an MD5) hash as the attribute value. This would technically be a win in that we could perform the message query without consulting the URL database. In theory we could discard/prune the URL database without losing the ability to attribute query an exact URL, but we'd lose so much other functionality it wouldn't be worth it.
- note: A more feature-ful implementation might try and overload the attribute value so that the high-order bits actually encode the protocol and domain-name, leaving the low-order bits to refer to the full/specific URL. This would allow attribute queries to use ranges to locate links to a given domain.
- note: We do not generate an attribute definition per URL (or domain) because the rate of growth of URLs (cardinality) is too high and additionally not under the control of the user.
- Definitions
- "derived", "built-in", "URL", NULL
 
- Helper Table: builtin_urls
- id INTEGER PRIMARY KEY
- ?? hash INTEGER -- 8 bytes 'o' MD5(url)
- url TEXT
 
- Helper Index: (url)
- Usage
- conv(99), msg(99), attr(URL), url("http://www.mozillamessaging.com/")
- ?? conv(99), msg(99), attr(URL), md5_8bytes("http://www.mozillamessaging.com/")
 
 
Explicit Attributes
Explicit attributes are attributes that are the result of a user action, creating data that cannot be reproduced, except by the user. For example, when the user applies a 'tag' to a message, this is an explicit attribute. Tags and the set of flags natively supported by MailNews are (in the case of e-mail), in theory, reliably preserved through their storage in the message (or on its keywords for IMAP servers). But any new explicit attributes created by add-ons/plug-ins/extensions (or users, should we expose such a capability, which we probably should) will not have that luxury.
This means we should probably go out of our way to find a way to ensure that explicit attributes (that are not already preserved else-where) can be backed-up and restored and/or journaled as they are set. At least, that's a reasonable lesson from the sadness of the Mork index files, though the situations are arguably different. (In theory, we can rely on SQLite to not corrupt itself, and perhaps we might assume that we back-up the global database at start-up like is under discussion/has been implemented for places.)
The simplest solution, if somewhat fill-up-your-disk-drive-y, is to log the message id of a message, the attribute's attribute definition, the attribute value, and any corresponding BLOB that the extension claims is associated with the attribute.
- Tags!
- note: We generate a separate attribute definition for each tag. This allows us to use the 'value' field of the attribute to provide a time-date stamp as to when the user assigned the tag. Awesome? Probably.
- Definitions
- "explicit", "tags", "TAG", "To Do"
- "explicit", "tags", "TAG", "Personal"
 
- Usages
- conv(2), msg(2), attr("TAG", "To Do"), date(2001/01/01)
 
 
- Calendar Tasks/Events
- note: You can see where I'm going here. I still need to figure out if we can get an internal unique identifier that we can trust to fit in 8 bytes...
 
Implicit Attributes
Like explicit but instead of the user explicitly creating an attribute, we created it by observing the user's behaviour and deciding that we should make a note of it. The most important difference being that it would probably be infeasible/a bad idea to back this information up using a journaling strategy unless the journal aggressively compacted.
Also, let's not go crazy with implicit attributes. Given the number of indices currently suggested on our attributes, it is not a trivial operation to add/modify an attribute.
- User Attention! Track messages the user opens a lot / looks at a lot!
- note: The idea is to not have this as an attribute on every message.  If we do that, it would probably be a better idea to just add a column to the message table.  The idea here is that we would only first tag a message once it crossed a certain threshold of usage.  This would probably be as a result of either:
- Obvious hints from the UI in a single message viewing experience.
- Use of some form of LRU-discarding cache of messages observed that would track enough information to decide when a message could graduate to explicitly having a tag (and at which point it would be evicted from the cache).
 
- Definitions
- "implicit", "achtung!", "ATTENTION", NULL
 
- Usages
- conv(5), msg(6), attr(ATTENTION), 42 (some arbitrary scale-factor)
 
 
- note: The idea is to not have this as an attribute on every message.  If we do that, it would probably be a better idea to just add a column to the message table.  The idea here is that we would only first tag a message once it crossed a certain threshold of usage.  This would probably be as a result of either:
Aggregate Data
'Aggregate Data' providers extract computationally expensive information from messages. We do not expect that users will need or want all of this information for all of their messages. Even when they do want it, they may not want it forever. For this reason, the global database provides a mechanism for extensions to define 'aggregate data providers'. The responsibility for determining when these providers should be invoked, when their results should be cached, when they should be updated, and when they should be deleted is left up to the global database layer (although informed by requests to the layer). By centralizing the responsibility, we simplify the extensions and hopefully avoid over-zealous extensions from consuming too many system resources.
The aggregate data mechanism does not simply cache prior results. Each entry in the table is considered a 'live' representation that is updated as new messages are processed by the global database. (Note: for performance reasons, the updates may not happen immediately, although they will happen before the aggregate data is return to a caller requesting it.)
Examples of possible aggregate data providers:
- Message Body Analysis:
- Topic-Analysis; categorization of message into distinct topics.
- Word-Histograms (term frequencies) per contact per time period, enabling themail-like analysis.
 
- Aggregation of values per time period over a period of time:
- Messages sent by a contact, per week, (since forever). Possibly restricted by tag, recipient, etc. These can be visualized using sparklines, etc.
- (Assuming a derived-attribute provider for money you spent on books from invoice/receipt e-mails), the amount of money you spent on books per month (since forever). Suitable for graphing.
- (Assuming a very fancy bugzilla derived-attribute provider), the number of bugs you fixed per week. Suitable for graphing.
 
Potential Denormalizations
Use Cases
Conversation Centric
- Conversations ordered by their most recent message
- By 'caching' the newest/most recent message on the rows in the conversations table, we are able to use a fairly cheap query to directly retrieve the recent conversations with enough information to perform a basic display. We can then fill in the rest with the actual underlying message information. (The information could also be retrieved from the messageAttributes
Bugzilla Workflow
As raised by Kent James, "If the mailnews team figured out how to effectively use TB or SM *in their own work* to combine information flow, it would be an amazing leap forward."
From the perspective of the global database, this is actually fairly simple. Each message you exchange with other people (e-mail/newsgroup), chat you are involved with (IRC/instant messaging), automated e-mails from bugzilla (e-mail), and check-in feed you subscribe to (RSS/Atom) simply needs to be tagged with a (derived) attribute that uniquely identifies the bug. For example, a BUGZILLA attribute (parameterized/defined by the URL of the bugzilla server which serves as the namespace) with the value being the bug identifier.
The difficult part is exposing those data sources that MailNews does not yet support as messages to the global database. Writing the derived attribute processor may also be non-trivial in some cases, at least in terms of reliably determining when a bug is being mentioned if there's just a plain number, and in the case where multiple bugzilla systems may potentially be discussed. Given the significance of bugzilla in the mozilla world, it might even be oddly practical to have the extension directly query the bugzilla server as a source of information.
Additional things a derived attribute 'data extractor' could provide would be:
- Indicating the relative importance/significance of messages. For example, an automated e-mail about someone being added to the CC list is not very interesting or significant. An automated e-mail where a patch is added or reviewed could be very significant.
- Indicating the state of the bug or significant events in the life of the bug for time-line visualization purposes. See when the bug was first opened, when it was actually confirmed, when patches showed up, patches were reviewed, things were committed, etc. The goal in this case is not to duplicate (or create) bugzilla features, but to provide context to the user's other communication. For example, on a time-line visualization, if the user can see when a patch was first posted and then rejected, it could help them to know that other communication around that time might be of specific interest to them. (And could inform automated assistance, fancy user queries, etc.)
Extension Interaction
Extensions are able to register attribute and aggregate data providers.
API
Implementation Plan
Milestones
The game plan for now is to produce two extensions:
* gloda, the global database extension. * expmess, the experimental message view extension.
The hope is to decouple the experimental message view from the global database development so that they can be worked on by different parties. For the first several milestones, the global database will be growing its fundamental features, limiting the functionality of the experimental message view (and coupling them), but still allowing for the experimental message view's interaction model/integration with the system to be fleshed out at least.
Below are proposed a series of milestones. If the proposed limitations at each stage sound silly/ridiculous, it's only because the goal is to try and get a steady stream of (small) milestones, and because eliminating the silliness requires an additional piece of functionality and adding that functionality would be equivalent to dropping a milestone (because it would be in a subsequent milestone).
- M0, the pathologically simple base case:
- The very primitive global database will:
- Perform simple/free conversation threading (in-reply-to/references). No subject threading. (This is not because threading by subject is difficult, but because naive subject threading, especially across many folders, can result in erroneously associated conversations, potentially to the tune of thousands of messages. This is not desirable for testing or usage.)
- Be able to process only e-mail accounts/folders.
- Process e-mail accounts statically; if you move a message around, it will get confused.
- Probably be far too synchronous and conceivably enjoy making the UI unusuable while it is thinking.
- Do nothing with contacts/identities.
- Do nothing with message attributes.
 
- The very basic experimental message view will:
- Show all the messages in a conversation (using the global database, so cross-folder).
- Be keyed off the currently selected message (easy to hook into...)
- Be its own little panel off to the right side of the screen.
- Be linear by time; no tree stuff.
- Not respond to message selection.
- Derive its display data for display from loading the underlying message header
 
 
- The very primitive global database will:
- M1: An actual (simple) proof-of-concept:
- Global Database
- Process/create contacts/identities. (No interaction with address book!)
- Populate fundamental attributes (To/Cc/From/Date)
- Populate 'body snippet', with at least some attempt to ignore quoted data.
- Expose fundamental attributes as data, not requiring loading the underlying message
- Provide query support for fundamental attributes.
 
- Experimental Message View
- Grow a 'data mine' sidebar in our frankenstein overlay pane.
- Message list gets its data only from the global database, without using the underlying message header.
- Clicking on a message in the list displays information on the sender in the 'data mine'. Namely, some conversations they have been involved in and some people they have e-mailed.
- The data-mine will not have anything you can click on.
 
 
- Global Database
Okay, everything after here is very nebulous, and the experimental message view should be thought of as a separate feature track once the global database is sufficiently feature-complete...
- M2: Broad strokes all there, maybe?  (As proposed, this may be too many changes and should be split into two).
- Global Database
- Process attachments.
- Process user tags.
- Update the global database based on listening to changes that happen.
 
- Experimental Message View
- Don't be a frankenstein overlay, but an actual view.
- Clicking on things in the data mine does stuff.
- Hook message list up to quick-search-alike.
 
 
- Global Database
- M3: Be reasonably usable
- Global Database
- Do as much as possible asynchronously, minimizing blocking of the user interaction thread.
- Have some form of support for subject threading.
 
 
- Global Database
- M4: Support derived attributes by extensions.
- Example Bugzilla extension!
- Global Database
- Support derived attributes by extensions.
 
- Experimental Message View
- Whatever it takes to make the bugzilla case happy.
 
 
- M5: Support aggregate data by extensions.
- Extension that draws pretty sparklines!
- Global Database
- Yeah, support that.
 
- Experimental Message View
- Show sparklines/graphs!
 
 
- Mx: Support newsgroups.
- My: Support RSS feeds.
Future Work
'Correct' Subject Threading
When References/In-Reply-To headers are not available (or Thread-Index/Thread-Topic headers for some Microsoft cases), MailNews has the ability to fail-over to threading based on the subject. (See MailNews:Message_Threading for background information on threading and MailNews' current implementation and the preferences that affect it.) This is workable, but is not without its issues.
A (more expensive) solution is to leverage the fact that many replies will include quotes from, or the entire body of, the message they are in response to. One could even make the argument that people who actually care about top-posting versus bottom-posting and cutting down the quoted content to only the relevant snippets are likely to use a mail client that produces References/In-Reply-To headers. One could then continue to argue that people who don't use well-behaved mail clients probably don't care about these things and probably leave the entire quoted message around (if quoting is enabled).
This suggests it is possible to more reliably thread messages by subject by also processing their content for quoted content from the previous message. Not only that, but that we probably do not need to be exceedingly thorough about the quote analysis; there should be some very juicy low-hanging fruit.
This work can also be used to enhance the user experience by linking quoted text (even reduced quotes) back to their original message.