User:Andrew Sutherland/MailNews/GlobalDatabase

From MozillaWiki
< User:Andrew Sutherland
Revision as of 04:44, 6 June 2008 by Andrew Sutherland (talk | contribs) (New page: {draft} = (Desired) Features = * Given a Contact, easily find ((Bryan Clark)): ** Previous messages from the person ** Previous attachments from the person ** (A count of) messages from ...)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

{draft}

(Desired) Features

  • Given a Contact, easily find ((Bryan Clark)):
    • Previous messages from the person
    • Previous attachments from the person
    • (A count of) messages from this person (to me)
    • (A count of) messages to this person (from me)
    • People connected to this person (people they have been involved in threads with)
    • Previous conversations with this person
    • URL's involved in the messages
  • When reading a message, be able to:
    • Know what lightning tasks/events were spawned from this message
    • Take/associate (long) notes ((Ovidiu))

Schema

Givens (Assumptions)

We are using the existing SQLite as the database engine already supported by the code-base. CouchDB is getting close to usable (0.8 will have compaction and incremental reduction), but is too risky for the desired Thunderbord 3.0 release date.

A Normalized Start

  • folderLocations TABLE
    • id INTEGER PRIMARY KEY
    • folderURI TEXT
  • conversations TABLE
    • id INTEGER PRIMARY KEY
    • subject TEXT
  • messages TABLE
    • id INTEGER PRIMARY KEY
    • folderID INTEGER NOT NULL REFERENCES folderLocations(id)
    • messageKey INTEGER NOT NULL
    • conversationID INTEGER NOT NULL REFERENCES conversations(id)
    • parentID INTEGER REFERENCES messages(id)
    • headerMessageID TEXT
    • ?? bodySnippet ?? -- clarkbw's little snippet of text to display
  • contacts TABLE
    • id INTEGER PRIMARY KEY
    • directoryUUID TEXT
    • contactUUID TEXT
    • name TEXT
  • identities TABLE
    • contactID INTEGER NOT NULL REFERENCES contacts(id)
    • email TEXT
    • description TEXT
  • attributeDefinitions TABLE
    • id INTEGER PRIMARY KEY
    • attributeType TEXT
    • pluginName TEXT
    • name TEXT
    • parameter BLOB
  • messageAttributes TABLE
    • conversationID INTEGER NOT NULL REFERENCES conversations(id)
    • messageID INTEGER NOT NULL REFERENCES messages(id)
    • attributeID INTEGER NOT NULL REFERENCES attributeDefinitions(id)
    • value NUMERIC
    • ?? maybe date ??
  • attrib-query (covering) index: (attributeID, value, ...)
  • message-attrib-fetch (covering) index: (messageID, ...)
  • conversation-attrib-fetch (covering) index: (conversationID, ...)


  • aggregateDataDefinitions TABLE
    • id INTEGER PRIMARY KEY
    • pluginName TEXT
    • name TEXT
    • subjectType TEXT
    • parameter BLOB
  • aggregateDataCache TABLE
    • aggregateID INTEGER NOT NULL REFERENCES aggregateDataDefinitions(id)
    • subject INTEGER
    • value NUMERIC
    • data BLOB
  • lookup index: (aggregateID, subject, value)

Query Strategy

Examples

Fundamental Attributes

Fundamental attributes are inherent and defining properties of an e-mail message: who sent the message, who they sent it to, when they sent it, etc. More explicitly, they are the pieces of the header that Thunderbird would natively display and expose as column headers, excepting the subject (which is explicitly part of the message).

  • From/To/cc
    • Definitions
      • "fundamental", "built-in", "FROM", NULL
      • "fundamental", "built-in", "TO", NULL
      • "fundamental", "built-in", "CC", NULL
    • Usage
      • conv(1), msg(1), attr(FROM), identity(my-email@my-domain.tld)
      • conv(1), msg(1), attr(TO), identity(cheese@hat.com)
      • conv(1), msg(1), attr(TO), identity(row@bow.org)
      • conv(1), msg(1), attr(CC), identity(foo@bar.com)
  • Other
    • Definition
      • "built-in", "DATE", NULL


Optimization Attributes

Optimization attributes are attributes that can be derived from multiple fundamental attributes, but, owing to presumed frequency of use and execution cost, get to be their own special attributes.

The definitive example are messages sent to/from the user. At least in the case of e-mail accounts, we can reasonably expect that nearly all of the (non mailing-list) messages will involve the user in some way. This means that the number of "FROM":user/"TO":user attributes will be quite high. Additionally, many of our queries will involve searching for this attribute, because most of our functionality centers around the user and his/her relationship to others.

Because of the way our database schema (and query strategy) works, any query for messages possessing attribute A:1 and attribute B:2 effectively results in intersecting the set of all messages with attribute A:1 and the set of all messages with attribute B:2. Since the set of FROM:user/TO:user attributes will be relatively large, we can expect a nice performance benefit by adding a bounded amount of redundant data.

  • From-Me-*, *-Me-From
    • Definitions
      • "optimized", "built-in", "FROM-ME-TO", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "FROM-ME-CC", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "TO-ME-FROM", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "CC-ME-FROM", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "FROM-ME-TO", identity(my-other-email@my-other-account.tld)
    • Usage
        • conv(1), msg(1), attr(FROM-ME-TO), identity(cheese@hat.com)
        • conv(1), msg(1), attr(FROM-ME-TO), identity(row@bow.org)
        • conv(1), msg(1), attr(FROM-ME-CC), identity(foo@bar.com)

Derived Attributes

Derived attributes are attributes populated by "data extractors" plugins/extensions from the contents of the message.

Simple Ones:

  • Bugzilla/bug-tracker helper plugin
    • Definitions
      • "derived", "buglamp", "BUG", "bugzilla.mozilla.org"
    • Usage
      • conv(99), msg(99), attr(BUG), 123456
  • A geolocation plugin that attempts to figure out where a message was sent from based on the IP address of the sender, explicit headers, or black magic.
    • Definitions
      • "derived", "geoplug", "LATITUDE", NULL
      • "derived", "geoplug", "LONGITUDE", NULL
    • Usage
      • conv(1), msg(1), attr(LATITUDE), 12.34
      • conv(1), msg(1), attr(LONGITUDE), 23.45
  • A purchase detector plugin
    • Definitions
      • "derived", "amazon-spree-tector", "I-SPENT", NULL
    • Usage
      • conv(99), msg(99), attr(I-SPENT), 40.20
  • Topic Analysis / Message Classifier plugin - Use analysis of the message content to assign the message to a topic.

Complex Ones (need their own table unless we change things up):

  • URL Detector
    • note: So, one thing we could do is use a (piece of an MD5) hash as the attribute value. This would technically be a win in that we could perform the message query without consulting the URL database. In theory we could discard/prune the URL database without losing the ability to attribute query an exact URL, but we'd lose so much other functionality it wouldn't be worth it.
    • note: A more feature-ful implementation might try and overload the attribute value so that the high-order bits actually encode the protocol and domain-name, leaving the low-order bits to refer to the full/specific URL. This would allow attribute queries to use ranges to locate links to a given domain.
    • note: We do not generate an attribute definition per URL (or domain) because the rate of growth of URLs (cardinality) is too high and additionally not under the control of the user.
    • Definitions
      • "derived", "built-in", "URL", NULL
    • Helper Table: builtin_urls
      • id INTEGER PRIMARY KEY
      • ?? hash INTEGER -- 8 bytes 'o' MD5(url)
      • url TEXT
    • Helper Index: (url)
    • Usage

Explicit Attributes

Explicit attributes are attributes that are the result of a user action, creating data that cannot be reproduced, except by the user. For example, when the user applies a 'tag' to a message, this is an explicit attribute. Tags and the set of flags natively supported by MailNews are (in the case of e-mail), in theory, reliably preserved through their storage in the message (or on its keywords for IMAP servers). But any new explicit attributes created by add-ons/plug-ins/extensions (or users, should we expose such a capability, which we probably should) will not have that luxury.

This means we should probably go out of our way to find a way to ensure that explicit attributes (that are not already preserved else-where) can be backed-up and restored and/or journaled as they are set. At least, that's a reasonable lesson from the sadness of the Mork index files, though the situations are arguably different. (In theory, we can rely on SQLite to not corrupt itself, and perhaps we might assume that we back-up the global database at start-up like is under discussion/has been implemented for places.)

The simplest solution, if somewhat fill-up-your-disk-drive-y, is to log the message id of a message, the attribute's attribute definition, the attribute value, and any corresponding BLOB that the extension claims is associated with the attribute.

  • Tags!
    • note: We generate a separate attribute definition for each tag. This allows us to use the 'value' field of the attribute to provide a time-date stamp as to when the user assigned the tag. Awesome? Probably.
    • Definitions
      • "explicit", "tags", "TAG", "To Do"
      • "explicit", "tags", "TAG", "Personal"
    • Usages
      • conv(2), msg(2), attr("TAG", "To Do"), date(2001/01/01)
  • Calendar Tasks/Events
    • note: You can see where I'm going here. I still need to figure out if we can get an internal unique identifier that we can trust to fit in 8 bytes...

Implicit Attributes

Like explicit but instead of the user explicitly creating an attribute, we created it by observing the user's behaviour and deciding that we should make a note of it. The most important difference being that it would probably be infeasible/a bad idea to back this information up using a journaling strategy unless the journal aggressively compacted.

Also, let's not go crazy with implicit attributes. Given the number of indices currently suggested on our attributes, it is not a trivial operation to add/modify an attribute.

  • User Attention! Track messages the user opens a lot / looks at a lot!
    • note: The idea is to not have this as an attribute on every message. If we do that, it would probably be a better idea to just add a column to the message table. The idea here is that we would only first tag a message once it crossed a certain threshold of usage. This would probably be as a result of either:
      • Obvious hints from the UI in a single message viewing experience.
      • Use of some form of LRU-discarding cache of messages observed that would track enough information to decide when a message could graduate to explicitly having a tag (and at which point it would be evicted from the cache).
    • Definitions
      • "implicit", "achtung!", "ATTENTION", NULL
    • Usages
      • conv(5), msg(6), attr(ATTENTION), 42 (some arbitrary scale-factor)

Aggregate Data

Potential Denormalizations

Use Cases

Bugzilla Workflow

As raised by Kent James, "If the mailnews team figured out how to effectively use TB or SM *in their own work* to combine information flow, it would be an amazing leap forward."

Extension Interaction

API

Future Work

'Correct' Subject Threading