User:Mnyromyr:BayesFacts

From MozillaWiki

This page assembles some facts about how the Bayesian junk filter works in SeaMonkey/Thunderbird, as seen from a programmer's perspective. It does not explain how to set up or use the filter; see your application's help for that.

Basics

  • Messages are split into tokens, usually divided by whitespace, but we also generate special tokens for headers, attachments, etc.
  • For each token, we store the number of its occurrences in good ("ham") and bad ("junk") messages in the training.dat file.
    This data is used later by the Bayesian filter to decide whether an entire message itself is good or bad.
  • For each message, we store its "junkscore" (currently either "0" or "100" or "") and the "junkscoreorigin" (currently either "plugin" or "user") as a string property on its nsMsgHdr.
    A junkscore of "100" means bad ("junk"), a junkscore of "0" means good ("ham"), the empty junkscore denotes a message whose junk state hasn't been set yet, neither by the Bayesian filter nor manually. The junkscoreorigin "plugin" is set by the Bayesian filter, "user" is set for manually set junk states.
  • The default UI hides the difference between good, bad and unclassified messages: it shows both good and unclassified messages as non-junk and only bad ones as junk. As a result, good tokens are usually created only when the user corrects a "false positive" (a message incorrectly determined to be junk by the Bayesian filter). This leads to tokens being considered more bad than they actually are.
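The points above can be illustrated with a minimal Python sketch. It is not the Mozilla implementation: `tokenize`, `TokenStore` and `shown_as_junk` are hypothetical names, lower-casing and counting every occurrence are assumptions, and the special tokens for headers and attachments are left out.

```python
from collections import Counter

def tokenize(text):
    """Split a message body on whitespace (lower-casing is an assumption)."""
    return text.lower().split()

class TokenStore:
    """Hypothetical in-memory stand-in for the counts kept in training.dat."""

    def __init__(self):
        self.ham = Counter()   # token -> occurrences in good messages
        self.junk = Counter()  # token -> occurrences in bad messages

    def train(self, tokens, is_junk):
        # Add every token occurrence to the matching counter.
        target = self.junk if is_junk else self.ham
        target.update(tokens)

def shown_as_junk(junkscore):
    """Default UI view: only "100" is junk; "0" and "" both look non-junk."""
    return junkscore == "100"

store = TokenStore()
store.train(tokenize("Buy cheap pills"), is_junk=True)
```

Note that `shown_as_junk("")` and `shown_as_junk("0")` give the same answer, which is why an unclassified message never contributes good tokens unless the user explicitly marks it.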

Manual junk state marking

Hitting J or Shift-J will basically call JunkSelectedMessages, which in turn will go down into the backend to nsMsgDBView::SetAsJunkByIndex. From there, nsBayesianFilter::SetMessageClassification is called, which updates the token counts according to the requested junk state.
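What "updates the token counts" could look like is sketched below in Python. The names `Corpus` and `set_classification` are hypothetical, and the behaviour on reclassification (removing the message's tokens from the old class before adding them to the new one) is an assumption, not something confirmed by this page.

```python
from collections import Counter

UNCLASSIFIED, GOOD, JUNK = "", "0", "100"  # junkscore values from above

class Corpus:
    """Hypothetical token store keyed by junkscore."""

    def __init__(self):
        self.counts = {GOOD: Counter(), JUNK: Counter()}

    def set_classification(self, tokens, old, new):
        if old == new:
            return  # nothing to update
        if old in self.counts:
            # Assumption: undo the contribution made under the old state.
            self.counts[old].subtract(tokens)
            # Unary plus drops entries whose count fell to zero or below.
            self.counts[old] = +self.counts[old]
        if new in self.counts:
            self.counts[new].update(tokens)

corpus = Corpus()
tokens = ["cheap", "pills"]
corpus.set_classification(tokens, UNCLASSIFIED, JUNK)  # user hits J
corpus.set_classification(tokens, JUNK, GOOD)          # user hits Shift-J
```

After the second call the tokens are counted only as good, so a corrected false positive both removes bad evidence and adds good evidence in this sketch.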

Semi-automatic junk detection

"Tools→Run Junk Controls on Folder" calls analyzeMessageForJunk, which ends up in nsBayesianFilter::classifyMessage, where the junk state for a particular message will be determined: a junk probability is computed for each token, then the 150 most significant tokens (of length 3 to 13) are weighed by the chi square method to get the final junk state for the message. No token counts are changed.

Automatic junk detection

This works almost as in the semi-automatic case, just without frontend interaction: nsMsgDBFolder::CallFilterPlugins calls the virtual function SpamFilterClassifyMessage, which in turn ends up in nsBayesianFilter::classifyMessage again.