Open Bug 1729842 Opened 3 years ago Updated 3 years ago

Tidy up nsBayesianFilter Tokenizer

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: benc, Unassigned)

Ben Campbell

Reporter

Description

3 years ago

Following on from Bug 1729682.
The nsBayesianFilter Tokenizer::ScannerNext() looks like it's fighting against WordBreaker::Next(). For example, we probably shouldn't be doing our own character classification.

So it'd be nice to go through it and see whether the Tokenizer could be better aligned with WordBreaker, or replaced by it entirely. If so, a lot of code could be removed (the bustage patch from Bug 1729682, to start with).
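To make the idea concrete, here is a rough sketch of a tokenizer that makes no boundary decisions of its own and just walks whatever breaks a word-breaker hands back. Everything in it is hypothetical: `NextBreakFn`, `ToyNextBreak` and `TokenizeWithBreaker` are made-up names for illustration, the toy breaker only understands ASCII spaces, and none of this is the real WordBreaker::Next() signature or the existing Tokenizer code.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a word-breaker: given text and a position,
// returns the index of the next break after pos, or npos if there is none.
// This is NOT the real WordBreaker::Next() signature.
using NextBreakFn =
    std::function<std::size_t(const std::string&, std::size_t)>;

// Toy breaker, for illustration only: reports a break wherever the text
// switches between space and non-space characters.
std::size_t ToyNextBreak(const std::string& text, std::size_t pos) {
  if (pos >= text.size()) return std::string::npos;
  const bool startIsSpace = (text[pos] == ' ');
  std::size_t i = pos + 1;
  while (i < text.size() && (text[i] == ' ') == startIsSpace) ++i;
  return i;
}

// Tokenizer that leaves every boundary decision to the breaker.
std::vector<std::string> TokenizeWithBreaker(const std::string& text,
                                             const NextBreakFn& nextBreak) {
  std::vector<std::string> tokens;
  std::size_t start = 0;
  while (start < text.size()) {
    std::size_t end = nextBreak(text, start);
    // If the breaker reports no further break (or a bogus one), treat the
    // rest of the text as the final segment.
    if (end == std::string::npos || end <= start) end = text.size();
    std::string segment = text.substr(start, end - start);
    // Drop segments made up only of spaces; keep everything else untouched.
    if (segment.find_first_not_of(' ') != std::string::npos) {
      tokens.push_back(segment);
    }
    start = end;
  }
  return tokens;
}
```

The point of the shape is that swapping in a better breaker would change the tokens without touching the tokenizer loop, which is roughly what aligning with WordBreaker would buy us.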

Note that Bug 1684927 is an effort to do text segmenting Properly(tm).
Looks like WordBreaker is being adapted to use that work behind the scenes, so if nsBayesianFilter uses WordBreaker "correctly", we might gain more correct tokenising (especially for languages without spaces or obvious word breaks). It might also avoid further bustage.

Categories

(MailNews Core :: Filters, task)

