DSPAM has an interesting idea: it converts every token into a CRC64 checksum (a 64-bit integer), so each token has a fixed 8-byte width. This helps reduce the overall size of the training file.
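For illustration, here is a minimal sketch of that scheme in Python. DSPAM's exact CRC64 variant is an assumption here; this uses the ECMA-182 polynomial with zero initial value and no final XOR, and the `tokenize` helper is hypothetical:

```python
import struct

# CRC-64/ECMA-182 polynomial -- an assumption; DSPAM's actual parameters may differ.
POLY = 0x42F0E1EBA9EA3693
MASK = 0xFFFFFFFFFFFFFFFF

def crc64(data: bytes) -> int:
    """Bitwise MSB-first CRC64 (zero init, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ POLY) & MASK
            else:
                crc = (crc << 1) & MASK
    return crc

def tokenize(token: str) -> bytes:
    """Map a token of any length to a fixed 8-byte key."""
    return struct.pack(">Q", crc64(token.encode("utf-8")))

print(len(tokenize("unsubscribe")))  # 8 -- an 11-byte token shrinks
print(len(tokenize("free")))         # 8 -- a 4-byte token grows
```

Note the trade-off visible in the last two lines: long tokens get smaller, but short tokens actually take more space than their raw text.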
What about words that are shorter than 8 bytes? Should they be exempt from the CRC'ification? That probably depends a lot on the language, but I'd guess many words are under 8 characters.
Many words that are near-certain signs of spam are actually shorter than 8 bytes: v1/\gr/\ is one of the longest, and even it is exactly 8 bytes of text.
I am not enthusiastic about this because it makes it very difficult to diagnose tokenization problems. Instead, I would prefer to move in a direction that makes it easy for users to see which tokens are being used to classify a message.
Product: Core → MailNews Core
Taking comment 3 as motivation for WONTFIX.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX