User:Rkentjames:Spam

Spam Management

The Uncertain category

After completing the bugs that enable the use of an uncertain category, I have come across lots of issues. The first is that virtual folders have problems, which got me working on search and filtering.

Another is: what do you do with messages that are hard to categorize? One example is a forward sent to me by my father-in-law of some cutesy stuff that I would consider annoying at best if I received it unsolicited from a stranger. Another example: a forward of a hoax to a long list of family members. Do I mark it as spam (and thereby increase the spam weighting of my entire extended family on the CC list)? Or do I mark it as ham (and encourage my filter to accept such junk)? I think that any display of uncertain emails requires some way to remove the email other than training it as ham or spam.
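To make the idea concrete, here is a minimal sketch of a three-way split with a "remove without training" action. Everything in it is hypothetical: the names Category, Action, categorize and apply_action, the cutoff values, and the dictionary of per-token counts are assumptions for illustration, not the existing filter code.

```python
from enum import Enum


class Category(Enum):
    GOOD = "good"
    UNCERTAIN = "uncertain"
    JUNK = "junk"


class Action(Enum):
    TRAIN_GOOD = "train_good"
    TRAIN_JUNK = "train_junk"
    DISCARD_UNTRAINED = "discard_untrained"  # remove the message, leave the corpus alone


def categorize(score, good_cutoff=0.2, junk_cutoff=0.9):
    """Map a junk score in [0, 1] to a category; the cutoffs are made-up defaults."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= junk_cutoff:
        return Category.JUNK
    if score <= good_cutoff:
        return Category.GOOD
    return Category.UNCERTAIN


def apply_action(counts, tokens, action):
    """Update per-token [good, junk] counts, unless the user discards untrained."""
    if action is Action.DISCARD_UNTRAINED:
        return counts  # the message goes away; no token weighting changes
    index = 0 if action is Action.TRAIN_GOOD else 1
    for token in set(tokens):  # count each token once per message
        counts.setdefault(token, [0, 0])[index] += 1
    return counts
```

With something like this, the hoax forwarded to the family list could be discarded untrained, so neither the relatives on the CC list nor the hoax text would shift the filter in either direction.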

Bayes data refactoring

The main goals of this refactoring are to:

reduce the memory footprint of Bayes
convert the external storage to an SQLite format
support multiple feature types in counts
support saving detailed data to assist in understanding per-message Bayes performance.

The main concepts are:

Have separate token management for corpus and message
Convert storage of training.dat into an SQLite database
Store string values in the SQLite database; store only hash keys in memory.
Unify storage of corpus counts per token. Store counts as "per feature" with Junk and Good as first two supported features.
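The storage concepts above might be sketched as follows, using Python's standard sqlite3 module. This is only an illustration under stated assumptions: the table and column names, the use of CRC32 as the token hash, and the helper names open_corpus, train and count_for are all hypothetical, and hash collisions are ignored for brevity.

```python
import sqlite3
import zlib

FEATURES = ("good", "junk")  # first two supported features; more could be added


def token_hash(token):
    """Hypothetical hash choice: CRC32 of the UTF-8 token text."""
    return zlib.crc32(token.encode("utf-8"))


def open_corpus(path=":memory:"):
    """Create the token-string table and the unified per-feature counts table."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS tokens (
            hash INTEGER PRIMARY KEY,
            text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS counts (
            hash    INTEGER NOT NULL,
            feature TEXT    NOT NULL,
            n       INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (hash, feature)
        );
    """)
    return db


def train(db, known_hashes, tokens, feature):
    """Add one message's tokens to a feature; only hash keys are kept in memory."""
    if feature not in FEATURES:
        raise ValueError("unknown feature: " + feature)
    for token in set(tokens):  # count each token once per message
        h = token_hash(token)
        if h not in known_hashes:
            # The string itself lives only in the database.
            db.execute("INSERT OR IGNORE INTO tokens (hash, text) VALUES (?, ?)",
                       (h, token))
            known_hashes.add(h)
        db.execute("INSERT OR IGNORE INTO counts (hash, feature, n) VALUES (?, ?, 0)",
                   (h, feature))
        db.execute("UPDATE counts SET n = n + 1 WHERE hash = ? AND feature = ?",
                   (h, feature))
    db.commit()


def count_for(db, token, feature):
    """Look up a token's count for one feature; 0 if never seen."""
    row = db.execute("SELECT n FROM counts WHERE hash = ? AND feature = ?",
                     (token_hash(token), feature)).fetchone()
    return row[0] if row else 0
```

Keying counts on (hash, feature) rather than on fixed junk and good columns is what would let further feature types be added later without changing the schema.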

So how might I divide that into separate bugs?

Split tokenizer into Message and Corpus versions
Support detail interface to message classification
Create an SQLite database that parallels training.dat
Convert local hash functions to use saved hash function results instead of strings directly
