Open Bug 1729842 Opened 3 years ago Updated 3 years ago

Tidy up nsBayesianFilter Tokenizer

Categories

(MailNews Core :: Filters, task)

Tracking

(Not tracked)

People

(Reporter: benc, Unassigned)

References

Details

Following on from Bug 1729682.
The nsBayesianFilter Tokenizer::ScannerNext() looks like it's fighting against WordBreaker::Next(). For example, we probably shouldn't be doing our own character classification.

So it'd be nice to poke through and see if the Tokenizer could be better aligned with WordBreaker, or replaced by it entirely. If so, a lot of code could be removed (the bustage patch from Bug 1729682, to start with).
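
To make the idea concrete, here is a rough, self-contained sketch of a tokenizer loop that takes all of its boundaries from a word breaker instead of classifying characters itself. Everything in it is an assumption for illustration: `NextBreakFn`, `ToyNextBreak` and `TokenizeWithBreaker` are hypothetical names, not the real WordBreaker or nsBayesianFilter API, and the toy breaker only splits on ASCII spaces.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a word breaker: given the text and a position,
// return the offset of the next break strictly after `pos` (text.size() at
// the end of the text). This is NOT the real WordBreaker signature.
using NextBreakFn =
    std::function<std::size_t(const std::u16string&, std::size_t)>;

// Toy breaker for this sketch only: reports a break wherever the text
// switches between a run of spaces and a run of non-spaces.
std::size_t ToyNextBreak(const std::u16string& text, std::size_t pos) {
  assert(pos < text.size());
  const bool startIsSpace = (text[pos] == u' ');
  std::size_t i = pos + 1;
  while (i < text.size() && (text[i] == u' ') == startIsSpace) {
    ++i;
  }
  return i;
}

// Walk the break positions and collect the segments between them.
// The only character check left here is dropping space-only segments;
// where the boundaries fall is entirely the breaker's decision.
std::vector<std::u16string> TokenizeWithBreaker(const std::u16string& text,
                                                const NextBreakFn& nextBreak) {
  std::vector<std::u16string> tokens;
  std::size_t pos = 0;
  while (pos < text.size()) {
    const std::size_t next = nextBreak(text, pos);
    // The breaker must always make forward progress and stay in bounds.
    assert(next > pos && next <= text.size());
    std::u16string segment = text.substr(pos, next - pos);
    if (segment.find_first_not_of(u' ') != std::u16string::npos) {
      tokens.push_back(segment);
    }
    pos = next;
  }
  return tokens;
}
```

With this shape, swapping the toy breaker for a proper segmenter would change the tokens produced without touching the loop, which is roughly the property the Tokenizer would gain by leaning on WordBreaker.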

Note that Bug 1684927 is an effort to do text segmenting Properly(tm).
Looks like WordBreaker is being adapted to use that work behind the scenes, so if nsBayesianFilter is using WordBreaker "correctly", then we might gain from more correct tokenising (especially for languages without spaces or obvious word breaks). It also might avoid further bustage.

See Also: → 1729682