MailNews:Message Threading

From MozillaWiki

Note that MailNews has the capability to group messages by subject (Sort by... Subject, Sort by... Grouped By Sort). Although Message Threading can thread by subject, these are two different pieces of functionality.

E-mail Threading Primer

Well-behaved mail clients add the following headers, although none of them is required:

  • Message-ID: A (sorta) globally unique identifier for the message.
  • In-Reply-To: Contains the Message-ID(s) of the message(s) this message is in reply to, assuming the message(s) actually had a Message-ID. (Omitted if the parent didn't have a Message-ID.)
  • References: Contains the contents of the "References" header of the message(s) being replied to, followed by the Message-ID of the message(s) being replied to. If the parent has no "References" header, the contents of its "In-Reply-To" header are used instead (again followed by the parent's Message-ID). Omitted if all of these fields are missing.

See RFC 2822 for details.
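As a rough illustration of the rules above, the following sketch builds the threading headers for a reply. The function name `build_reply_headers` and the dict-based message representation are invented for this example; they are not part of MailNews or of any mail library.

```python
def build_reply_headers(parent, new_message_id):
    """Return threading headers for a reply to `parent`.

    `parent` is a plain dict of header name -> value (hypothetical model).
    """
    parent_id = parent.get("Message-ID")
    # Start from the parent's References; fall back to its In-Reply-To.
    inherited = parent.get("References") or parent.get("In-Reply-To") or ""

    headers = {"Message-ID": new_message_id}
    if parent_id:
        # In-Reply-To is omitted when the parent has no Message-ID.
        headers["In-Reply-To"] = parent_id

    # Inherited references first, then the parent's own Message-ID.
    references = " ".join(part for part in (inherited, parent_id) if part)
    if references:
        headers["References"] = references
    return headers


parent = {
    "Message-ID": "<b@example.com>",
    "References": "<a@example.com>",
}
reply = build_reply_headers(parent, "<c@example.com>")
```

Here `reply["References"]` lists the oldest ancestor first and the direct parent last, which is why threading code can walk the list from the end to find the closest known ancestor.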

The bad news is that not all e-mail clients actually generate these message headers. For example, as of May 2008, Yahoo Mail still does not generate In-Reply-To/References headers. This means that clients may need to fall back on using message subjects or other heuristics for threading purposes.

General Implementation

Background

nsMsgKeys are unsigned longs that are used for a few purposes:

  • message key: uniquely identify a message in a folder. They are synonymous with the mOid_Id for the message's mork row object in the hdrRowScopeToken (ns:msg:db:row:scope:msgs:all) mOid_Scope. The meaning of the actual values is determined by (and only relevant to) the protocol in use.
  • thread id: identify a thread. The thread id does not have to be the message key of the root message of a thread, although it does start out that way. They are synonymous with the mOid_Id for the thread's mork row object in the threadRowScopeToken (ns:msg:db:row:scope:threads:all) mOid_Scope.
  • thread parent: identify the parent of a message in a thread using the parent's message key.

nsMsgThread/nsIMsgThread objects are persistent thread representations. They are characterized by a thread key and a thread root key, which are both initially the message key of the message the thread was created for. As processing continues, the thread root key (the message key of the root message) may change, but the thread key remains the same. The subject of the message initially used to create the thread is stored and never updated, no matter what happens to the thread. nsMsgThreads are a thin wrapper around their underlying mork representation, keeping only keys/cached values in memory, with all important/mutable data being stored in mork.

Details

Threading is first attempted using the message's references, which come from the "References" header or, failing that, the "In-Reply-To" header. For each reference (starting from the last reference and working towards the first), an attempt is made to either a) locate the message with the given Message-ID, or b) (if mail.correct_threading is True) locate a sibling message that references the same Message-ID.

If the References/In-Reply-To lookup doesn't pan out and the "mail.strict_threading" preference is False, then the subject is potentially used for threading. If the "mail.thread_without_re" preference is True, then the subject is used regardless of whether "Re:" is present. If it is False, then "Re:" or a variant must be present. The "Re:" checking is actually fairly thorough; NS_MsgStripRE allows for all case-variants plus "Re[#]:" variants. Additionally, the preference "mailnews.localizedRe" is used to provide a comma-delimited list of alternative prefixes that are allowed. Bug 319037 enhances things to use the locale's region.properties to provide this value.
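To make the prefix handling concrete, here is a small approximation of that kind of "Re:" stripping. It is not a port of NS_MsgStripRE: the function name `strip_re` is made up, and two behaviours are assumptions chosen for simplicity (localized prefixes are matched case-insensitively, and repeated prefixes such as "Re: Re:" are all removed).

```python
import re


def strip_re(subject, localized_prefixes=""):
    """Return (stripped_subject, had_prefix) for a message subject.

    `localized_prefixes` mimics a comma-delimited preference value
    such as "AW,SV" (hypothetical example values).
    """
    extra = [p.strip() for p in localized_prefixes.split(",") if p.strip()]
    alternation = "|".join(re.escape(p) for p in ["Re"] + extra)
    # One or more prefixes like "Re:", "RE:", "re[2]:", each optionally
    # followed by whitespace. Assumption: case is ignored for all prefixes.
    pattern = re.compile(
        r"^\s*(?:(?:%s)(?:\[\d+\])?\s*:\s*)+" % alternation,
        re.IGNORECASE,
    )
    match = pattern.match(subject)
    if not match:
        return subject, False
    return subject[match.end():], True
```

With "mail.thread_without_re" set to False, a caller would only attempt subject threading when the second element of the result is True.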

If a thread still hasn't been determined, and the "mail.correct_threading" preference is true, the code will check if the message is an ancestor of an already-threaded message. Because this comes after the subject-threading, it implies that if you enable mail.correct_threading, you will also want to enable mail.strict_threading.

Finally, if a thread still hasn't been found, a new thread is created.
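The order of these attempts can be summarised in a short sketch. Everything here is a simplified model rather than the code in nsMsgDatabase.cpp: the `pick_thread` function, the dict-based `msg` and `folder` structures, and the assumption that the subject has already been stripped of "Re:" are all invented for illustration.

```python
def pick_thread(msg, folder, prefs):
    """Return the thread id that `msg` should join (hypothetical model).

    msg:    {"key", "message_id", "references" (oldest first),
             "subject" (already stripped), "had_re"}
    folder: {"by_message_id": Message-ID -> thread id,
             "by_subject": stripped subject -> thread id,
             "referenced_ids": Message-ID referenced by a threaded
                               message -> thread id}
    """
    correct = prefs["mail.correct_threading"]

    # 1. References, starting from the last one (the closest ancestor).
    for ref in reversed(msg["references"]):
        if ref in folder["by_message_id"]:
            return folder["by_message_id"][ref]
        # A sibling that references the same (possibly missing) message.
        if correct and ref in folder["referenced_ids"]:
            return folder["referenced_ids"][ref]

    # 2. Subject, unless strict threading is on.
    if not prefs["mail.strict_threading"]:
        if prefs["mail.thread_without_re"] or msg["had_re"]:
            if msg["subject"] in folder["by_subject"]:
                return folder["by_subject"][msg["subject"]]

    # 3. Is this message an ancestor of something already threaded?
    if correct and msg["message_id"] in folder["referenced_ids"]:
        return folder["referenced_ids"][msg["message_id"]]

    # 4. New thread; its id starts out as the message key.
    return msg["key"]


folder = {
    "by_message_id": {"<a>": 1},
    "by_subject": {"lunch": 2},
    "referenced_ids": {"<z>": 3},
}
strict = {
    "mail.strict_threading": True,
    "mail.thread_without_re": False,
    "mail.correct_threading": True,
}
late_ancestor = {"key": 10, "message_id": "<z>", "references": [],
                 "subject": "lunch", "had_re": True}
thread_id = pick_thread(late_ancestor, folder, strict)
```

In this model, step 2 runs before step 3, so with strict threading off the late-arriving ancestor `<z>` would be claimed by the "lunch" subject thread instead of the thread that references it. That is the reason for pairing mail.correct_threading with mail.strict_threading.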

Preferences Controlling Threading

All core threading logic currently lives in mailnews/db/msgdb/src/nsMsgDatabase.cpp. It is controlled by the following preferences:

  • mail.thread_without_re : Thread by subject even when there is no "Re:" in the subject; default False in 3.0, used to be True in 2.0.
  • mail.strict_threading : Don't thread by subject; default True in 3.0, used to be False in 2.0.
  • mail.correct_threading : Thread things correctly (using References/In-Reply-To) regardless of the order in which messages are added to a folder; default True in 3.0. Requires extra memory and some extra processing once a folder (nsMsgDatabase) has new messages added to it. If you turn this on, you really should turn on strict_threading too.

mail.thread_without_re

mail.thread_without_re (gThreadWithoutRe/ThreadBySubjectWithoutRe) defaults to False in Thunderbird 3.0; it defaulted to True in 2.0.

If mail.thread_without_re is True, the subject does not have to start with "Re:" (or variants or localized variants) for subject threading to occur. If it is False, the subject does have to start with "Re:" or one of its variants.

mail.strict_threading

mail.strict_threading (gStrictThreading/UseStrictThreading) defaults to True in 3.0; it defaulted to False in 2.0.

If mail.strict_threading is True, subject-threading is disabled entirely. Messages sent by clients that do not generate "References"/"In-Reply-To" headers (or responding to clients that do not generate "Message-ID" headers) will not be threaded.

If mail.strict_threading is False, then we will attempt to thread using the subject. Whether we require the subject to start with "Re:" (or variants) depends on the "mail.thread_without_re" setting.

mail.correct_threading

mail.correct_threading (gCorrectThreading/UseCorrectThreading) defaults to True in 3.0. It was implemented by bug 181446 and is only available in 3.0 releases and later (never on the 2.0.0.x branch).

If mail.correct_threading is True, the references stored on every nsIMsgDBHdr are used to populate a hashtable mapping every Message-ID we have heard about for a thread to that thread's thread id. For example, if message D (with Message-ID D and a thread id of 42) has a "References:" header listing C, B, and A, the hashtable will map C, B, and A to 42.

Having this mapping allows us to do two things we could not otherwise do (as things are implemented...) in order to thread messages correctly regardless of the order in which they are added:

  • Thread messages together with common, but missing (not in the folder), ancestors. Otherwise, they would end up in different threads.
  • When we process one of those missing ancestors, detect it and properly add the message to the existing thread. (At least as long as there are fewer than 1000 messages in the thread.)

The hashtable is an in-memory-only structure, and is populated by processing all of the existing messages the first time access to the structure is required. This means that the overhead of traversing the messages and extra memory usage should only happen when adding new messages to an nsMsgDatabase.
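A minimal sketch of such a lazily built map follows. The class name `ReferenceMap` and its methods are invented for this example and do not correspond to names in nsMsgDatabase.cpp; existing headers are modelled as simple `(thread_id, references)` pairs.

```python
class ReferenceMap:
    """In-memory map from referenced Message-ID to thread id (illustrative)."""

    def __init__(self, existing_headers):
        # existing_headers: list of (thread_id, [referenced Message-IDs])
        self._existing_headers = existing_headers
        self._map = None  # not built until first needed

    def _ensure_built(self):
        # Walk all existing messages only on first access.
        if self._map is None:
            self._map = {}
            for thread_id, references in self._existing_headers:
                for ref in references:
                    self._map[ref] = thread_id

    def thread_for(self, message_id):
        """Return the thread id that references `message_id`, or None."""
        self._ensure_built()
        return self._map.get(message_id)

    def add(self, thread_id, references):
        """Record the references of a newly threaded message."""
        self._ensure_built()
        for ref in references:
            self._map[ref] = thread_id


# Message D lives in thread 42 and references A, B and C.
ref_map = ReferenceMap([(42, ["A", "B", "C"])])
built_before_use = ref_map._map is not None   # nothing built yet
thread_of_c = ref_map.thread_for("C")         # triggers the one-time build
```

A new message E that references only the missing message C can then be placed in thread 42 alongside D, and if C itself arrives later, the same lookup finds the thread it belongs to.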

nsParseMailbox.cpp sets the references via nsIMsgDBHdr::SetReferences using the "References" header as a first choice, and the "In-Reply-To" header as a second choice. (nsNNTPNewsgroupList.cpp and various compose pieces of functionality also call SetReferences, but they are not processing incoming e-mail.)

Deletion

When you delete a message A, all of A's children are re-parented so that their parent becomes A's parent. If A had no parent (that is, A was the root of the thread), then A's first child becomes the root of the thread.
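The re-parenting rule can be sketched as follows. The `delete_message` function and the dict layout are hypothetical. The text above does not say what happens to the other children of a deleted root; this sketch assumes they are attached to the newly promoted root, which should be treated as an assumption rather than documented behaviour.

```python
def delete_message(thread, key):
    """Remove message `key` from `thread` and re-parent its children.

    thread: {"thread_key": int, "root_key": int,
             "parents": {message key -> parent key, or None for the root}}
    Children are visited in insertion order of the "parents" dict.
    """
    parents = thread["parents"]
    new_parent = parents.pop(key)  # the deleted message's own parent
    for child, parent in list(parents.items()):
        if parent != key:
            continue
        parents[child] = new_parent
        if new_parent is None:
            # The deleted message was the root: promote the first child.
            thread["root_key"] = child
            # Assumption: remaining children hang off the new root.
            new_parent = child
    return thread


thread = {
    "thread_key": 1,
    "root_key": 1,
    "parents": {1: None, 2: 1, 3: 1, 4: 2},
}
delete_message(thread, 1)
```

After the call, message 2 is the root while `thread["thread_key"]` is still 1, matching the earlier point that the thread root key can change but the thread key does not.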