Tidy up nsBayesianFilter Tokenizer
Categories
(MailNews Core :: Filters, task)
Tracking
(Not tracked)
People
(Reporter: benc, Unassigned)
References
Details
Following on from Bug 1729682.
The nsBayesianFilter Tokenizer::ScannerNext() looks like it's fighting against WordBreaker::Next(). For example, we probably shouldn't be doing our own character classification.
So it'd be nice to poke through and see if the Tokenizer could be better aligned with, or replaced entirely by, WordBreaker. If so, a lot of code could be removed (starting with the bustage patch from Bug 1729682).
Note that Bug 1684927 is an effort to do text segmenting Properly(tm).
It looks like WordBreaker is being adapted to use that work behind the scenes, so if nsBayesianFilter uses WordBreaker "correctly", we might gain more correct tokenising (especially for languages without spaces or obvious word breaks). It might also avoid further bustage.
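To make the idea concrete, here's a rough sketch of a tokenizer loop that takes its boundaries entirely from a word breaker and does no character classification of its own. Everything in it is an assumption for illustration: NextBreak() is a hypothetical stand-in that only splits at whitespace transitions, not the real WordBreaker::Next(), and TokenizeWithBreaker() is not existing nsBayesianFilter code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a word breaker (NOT the real WordBreaker).
// Returns the offset of the next break after `pos`, or -1 if there is no
// further break before the end of the text. It only breaks where the text
// switches between whitespace and non-whitespace.
static int NextBreak(const std::string& text, std::size_t pos) {
  if (pos >= text.size()) return -1;
  const bool space = std::isspace(static_cast<unsigned char>(text[pos])) != 0;
  std::size_t i = pos + 1;
  while (i < text.size() &&
         (std::isspace(static_cast<unsigned char>(text[i])) != 0) == space) {
    ++i;
  }
  return i < text.size() ? static_cast<int>(i) : -1;
}

// Hypothetical tokenizer loop: segment boundaries come only from the breaker.
// The only thing the caller does itself is drop whitespace-only segments.
static std::vector<std::string> TokenizeWithBreaker(const std::string& text) {
  std::vector<std::string> tokens;
  std::size_t start = 0;
  while (start < text.size()) {
    const int next = NextBreak(text, start);
    const std::size_t end =
        next < 0 ? text.size() : static_cast<std::size_t>(next);
    assert(end > start);  // the breaker must always make progress
    const std::string segment = text.substr(start, end - start);
    if (!std::isspace(static_cast<unsigned char>(segment[0]))) {
      tokens.push_back(segment);
    }
    start = end;
  }
  return tokens;
}
```

With a real breaker behind NextBreak(), the loop itself wouldn't need to change for languages without spaces; only the filtering of uninteresting segments would stay in the filter code.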