User:Rkentjames:Bug228675: Difference between revisions

399 bytes added , 6 June 2008

no edit summary

@@ Line 22: / Line 22: @@
 Data set 4 "2005&ALL" is the only set that starts with a non-empty training file. It uses a training file obtained by training on all of the data from the TREC 2005 corpus. Note that this corpus has roughly 92,000 messages (compared to 37,821 for the 2006 corpus), so that is a lot of previous training!
+For the combined TREC 2005+2006 corpus, there were 52,309 good messages, 77,568
+junk messages, 593,186 good tokens, and 484,172 junk tokens. The resulting
+training.dat is 43 megabytes. That's what gives the best results!
+Training on 5/100 of the TREC 2006 corpus gave 27,108 good tokens and 31,002
+junk tokens, for a combined 58,110 tokens. For the test using the patch, I
+limited tokens to 4/3 of that, or 77,480.
 [[Image:SpamGraph.png]]
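
The token limit quoted above is simple arithmetic on the 5/100 training run, and the short sketch below reproduces it. The function name `token_limit` and its parameters are hypothetical, chosen only for illustration; they are not taken from the patch itself. Only the counts (27108 good tokens, 31002 junk tokens) and the 4/3 factor come from the text.

```python
from fractions import Fraction


def token_limit(good_tokens, junk_tokens, factor=Fraction(4, 3)):
    """Return (combined token count, token limit).

    Hypothetical helper: the limit is the combined count scaled by
    `factor` and truncated to a whole number of tokens.
    """
    combined = good_tokens + junk_tokens
    return combined, int(combined * factor)


# Counts reported for training on 5/100 of the TREC 2006 corpus.
combined, limit = token_limit(27108, 31002)
print(combined, limit)

# Message counts reported for the combined TREC 2005+2006 corpus.
total_messages = 52309 + 77568
print(total_messages)
```

Using `Fraction` keeps the 4/3 scaling exact, so the result does not depend on floating-point rounding.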