User:Rkentjames:Bug228675: Difference between revisions



Data set 4 "2005&ALL" is the only set that starts with a non-empty training file. Its training file was obtained by training on all of the data from the TREC 2005 corpus. Note that this corpus is roughly 92,000 messages (compared to 37,821 for the 2006 corpus), so that is a lot of previous training!
For the combined TREC 2005+2006 corpus, there were 52,309 good messages, 77,568
junk messages, 593,186 good tokens, and 484,172 junk tokens. The resulting
training.dat is 43 megabytes. That is what gives the best results!
Training on 5/100 (5%) of the TREC 2006 corpus gave 27,108 good tokens and 31,002
junk tokens, for a combined 58,110 tokens. For the test using the patch, I limited
tokens to 4/3 of that, or 77,480.
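
As a quick check of the arithmetic above, here is a minimal Python sketch. The variable names are illustrative only and do not come from the patch; the 4/3 factor and the token counts are the ones quoted in the text.

```python
from fractions import Fraction

# Token counts from training on 5/100 of the TREC 2006 corpus (from the text).
good_tokens = 27108
junk_tokens = 31002

# Combined token count before any limit is applied.
combined_tokens = good_tokens + junk_tokens

# Limit used for the patched test: 4/3 of the combined count.
# Fraction keeps the arithmetic exact; the name token_limit is hypothetical.
token_limit = int(combined_tokens * Fraction(4, 3))

print(combined_tokens, token_limit)
```

Using an exact fraction avoids any floating-point rounding when the combined count is scaled.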


[[Image:SpamGraph.png]]