PopFile Review
I wrote about my e-mail classification program, PopFile, in my E-mail/Spam solution entry. Since there are few reviews of PopFile, I am publishing the results of six months of classification (June 1 to November 30, 2005). The results are impressive: across a sample of more than 46,000 messages, accuracy is above 99%.
The messages were hand-classified and come from more than one e-mail account (including one Hotmail account), with the majority of non-spam being work-related messages of a technical software nature. There is a large sample of "bounce" messages, which stem from spammers forging my domain name to send messages; these include various types of bounces (spam filters, address not found). The bounces could be the subject of another article, but I did not keep statistics on the individual types.
The "buckets" are pretty straightforward, with "win" being Windows-related and "net" being network/domain-related. The "ad" bucket is a bit of a misnomer, as it refers to blatant adult-related messages. Most messages of that type are now classified as spam instead. The "spam-virus" bucket holds either messages already processed by Norton Antivirus with the attachment removed, or obvious virus-type messages with attachments that were not caught by the antivirus.
To calculate the spam hit rate, I subtracted the false positives and false negatives from the total number of classified messages and divided by that total, which gives a score of .992. The ham strike rate is .003 (false negatives only). PopFile does not include the unclassified value in its own calculation, so it reported 99.65% accuracy.
Bucket Name | Distinct Words | Word Count | Classification Count | False Pos. | False Neg. |
---|---|---|---|---|---|
ad | 8,307 | 22,709 (11.82%) | 99 (0.21%) | 0 | 4 |
bounce | 3,920 | 25,035 (13.03%) | 36,592 (79.39%) | 40 | 91 |
fax | 88 | 460 (0.23%) | 397 (0.86%) | 1 | 0 |
in | 14,078 | 54,053 (28.13%) | 2,967 (6.43%) | 13 | 18 |
net | 381 | 1,182 (0.61%) | 15 (0.03%) | 0 | 0 |
spam | 13,358 | 48,693 (25.34%) | 4,280 (9.28%) | 36 | 32 |
spam-virus | 14,261 | 19,624 (10.21%) | 1,062 (2.30%) | 0 | 9 |
win | 4,605 | 20,341 (10.58%) | 464 (1.00%) | 1 | 5 |
unclassified | | | 210 (0.45%) | 117 | |
Total | 58,998 | 192,256 (100.00%) | 46,086 (100.00%) | 208 | 159 |
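As a rough check, the rates quoted above can be reproduced from the Total row of the table. The sketch below is in Python; the function and variable names are only for illustration, and the formula for PopFile's own figure is an assumption (that it counts only the 159 false negatives as errors and ignores the 117 unclassified messages), not something taken from PopFile itself.

```python
# Totals copied from the "Total" row of the table.
TOTAL_CLASSIFIED = 46_086
FALSE_POSITIVES = 208   # includes the 117 in the "unclassified" row
FALSE_NEGATIVES = 159


def rates(total, false_pos, false_neg):
    """Return (hit_rate, strike_rate, popfile_style_accuracy).

    hit_rate:     share of messages with no false positive or negative
    strike_rate:  share of messages that were false negatives
    popfile_style_accuracy: ASSUMPTION - treats only false negatives
                  as errors, which happens to match the 99.65% figure
    """
    hit_rate = (total - false_pos - false_neg) / total
    strike_rate = false_neg / total
    popfile_style_accuracy = 1 - false_neg / total
    return hit_rate, strike_rate, popfile_style_accuracy


hit, miss, popfile_style = rates(
    TOTAL_CLASSIFIED, FALSE_POSITIVES, FALSE_NEGATIVES
)

print(f"hit rate:         {hit:.4f}")                  # 0.9920
print(f"strike rate:      {miss:.4f}")                 # 0.0035
print(f"PopFile-style accuracy: {popfile_style:.2%}")  # 99.65%
```

The strike rate comes out at about 0.0035, which matches the .003 in the text only when truncated rather than rounded to three decimals.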