User:Rkentjames:Spam

From MozillaWiki

Spam Management

The Uncertain category

After completing the bugs that enable the use of an uncertain category, I came across a number of issues. The first is that virtual folders have problems, which got me working on search and filtering.

Another is: what do you do with messages that are hard to categorize? One example is a forward sent to me by my father-in-law of some cutesy stuff that I would consider annoying at best if I received it unsolicited from a stranger. Another example is a forward of a hoax to a long list of family members. Do I mark it as spam (and thereby increase the spam weighting of my entire extended family on the CC list)? Or do I mark it as ham (and encourage my filter to accept such junk)? I think that any display of uncertain emails requires some way to remove the email other than training it as ham or spam.
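To make the idea concrete, here is a minimal sketch of a three-way classification with a "dismiss" action that removes an uncertain message without training on it. The names (Verdict, classify, resolve_uncertain), the cutoff values, and the list used as a training log are all assumptions made up for illustration; none of them come from the existing junk mail code.

```python
from enum import Enum


class Verdict(Enum):
    GOOD = "good"
    UNCERTAIN = "uncertain"
    JUNK = "junk"


def classify(junk_score, good_cutoff=0.3, junk_cutoff=0.9):
    """Map a junk score in [0, 1] to one of three categories.

    The cutoffs are hypothetical defaults, not values taken from the filter.
    """
    if junk_score >= junk_cutoff:
        return Verdict.JUNK
    if junk_score <= good_cutoff:
        return Verdict.GOOD
    return Verdict.UNCERTAIN


def resolve_uncertain(message_id, action, training_log):
    """Apply the user's decision on an uncertain message.

    "junk" and "good" record a training event; "dismiss" clears the
    message from the uncertain view without touching the training data.
    Returns True if the message was used for training.
    """
    if action in ("junk", "good"):
        training_log.append((message_id, action))
        return True
    if action == "dismiss":
        return False
    raise ValueError(f"unknown action: {action!r}")
```

With something like this, the father-in-law forward could be dismissed, leaving the token counts for the whole CC list unchanged.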

Bayes data refactoring

The main goals of this refactoring are to:

  1. reduce the memory footprint of Bayes
  2. convert the external storage to an SQLite format
  3. support multiple feature types in counts
  4. support saving detailed data to assist in understanding per-message Bayes performance.

The main concepts are:

  1. Have separate token management for corpus and message
  2. Convert storage of training.dat into an SQLite database
  3. Store string values in the SQLite database; store only hash keys in memory.
  4. Unify storage of corpus counts per token. Store counts as "per feature" with Junk and Good as first two supported features.
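A rough sketch of concepts 2 through 4, using Python's sqlite3 module purely for illustration. The table and column names, the helper functions, and the use of CRC32 as the hash are all assumptions; the actual schema and hash function are still to be decided. The point is only that token strings live in the database, while the in-memory structure is keyed by integer hashes and holds one count per feature.

```python
import sqlite3
import zlib


def token_hash(token):
    # CRC32 is a stand-in chosen for this sketch, not the filter's real hash.
    return zlib.crc32(token.encode("utf-8"))


def open_corpus(path=":memory:"):
    """Create (or open) a hypothetical corpus database."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS tokens (
            token_hash INTEGER PRIMARY KEY,
            token_text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS counts (
            token_hash INTEGER NOT NULL,
            feature    TEXT NOT NULL,
            n          INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (token_hash, feature)
        );
    """)
    return db


def train(db, tokens, feature):
    """Add one to the per-feature count of every token in a message."""
    for tok in tokens:
        h = token_hash(tok)
        # The string is stored once, in the database only.
        db.execute("INSERT OR IGNORE INTO tokens VALUES (?, ?)", (h, tok))
        db.execute("INSERT OR IGNORE INTO counts VALUES (?, ?, 0)", (h, feature))
        db.execute(
            "UPDATE counts SET n = n + 1 WHERE token_hash = ? AND feature = ?",
            (h, feature),
        )
    db.commit()


def load_counts(db):
    """Build the in-memory view: hash -> {feature: count}, no strings."""
    in_memory = {}
    for h, feature, n in db.execute("SELECT token_hash, feature, n FROM counts"):
        in_memory.setdefault(h, {})[feature] = n
    return in_memory
```

Because a feature is just a value in a column, adding a feature beyond Junk and Good would not require a schema change. This sketch ignores hash collisions: two strings with the same hash would share one row.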

So how might I divide that into separate bugs?

  1. Split tokenizer into Message and Corpus versions
  2. Support detail interface to message classification
  3. Create an SQLite database that parallels training.dat
  4. Convert local hash functions to use saved hash function results instead of strings directly
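As a sketch of what the first two bugs might look like, the following separates short-lived per-message tokens from long-lived corpus counts, and adds a detail call that reports per-token data for a single message. Every name here (MessageTokens, CorpusTokens, TokenDetail, classification_detail) is hypothetical, the tokenizing regular expression is a placeholder, and the per-token probability is a simple ratio of relative frequencies that I am assuming for illustration; it is not the formula the current filter uses.

```python
import re
from collections import Counter
from dataclasses import dataclass


class MessageTokens:
    """Tokens for one message; discarded once the message is processed."""

    def __init__(self, text):
        self.tokens = set(re.findall(r"[a-z0-9']+", text.lower()))


class CorpusTokens:
    """Long-lived per-feature token counts for the whole training corpus."""

    def __init__(self):
        self.counts = {"junk": Counter(), "good": Counter()}
        self.messages = Counter()  # number of trained messages per feature

    def train(self, message, feature):
        self.counts[feature].update(message.tokens)
        self.messages[feature] += 1


@dataclass
class TokenDetail:
    token: str
    junk_count: int
    good_count: int
    junk_probability: float


def classification_detail(message, corpus):
    """Return per-token detail for every message token seen in training."""
    details = []
    for tok in sorted(message.tokens):
        junk = corpus.counts["junk"][tok]
        good = corpus.counts["good"][tok]
        if junk + good == 0:
            continue  # the corpus has never seen this token
        junk_rate = junk / max(corpus.messages["junk"], 1)
        good_rate = good / max(corpus.messages["good"], 1)
        details.append(
            TokenDetail(tok, junk, good, junk_rate / (junk_rate + good_rate))
        )
    return details
```

A detail list like this is the kind of data that could be saved to help understand why a particular message was scored the way it was.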