Closed Bug 471885 Opened 13 years ago Closed 13 years ago
Bayes analysis should set probability = 0 or 1 with unbalanced tokens
Currently, if the number of training tokens in the Bayesian filter is small, and a message only matches trained messages of one type (either pro or anti indicators of the presence of the trait), then the result will be 50%. The problem with this, for traits with few trained messages, is that users get no explicit clue that their training is unbalanced.

What I would prefer to happen is this. If, for example, users have only trained messages as matching the "Personal" trait, then any message with a matching Personal token should classify as 100% Personal, whereas at present it classifies as 50% Personal. That way there will be an excessive number of messages classified as Personal, and users will be encouraged to mark some messages as not Personal, thereby restoring the balance.

This is only an issue for the first few trainings of a trait, but it will matter more than in the past as we start to use the Bayesian filter for new traits that are likely to have few trainings. I am admittedly applying the needs of my specific use case here, which is soft tagging in the TaQuilla extension (http://mesquilla.com/extensions/taquilla), but I think this is the correct behaviour in general as well.
Calculations changed per description. I also tested this with junk startup, and it still works as we want (and matches current behaviour). That is, when the user has only trained junk and not good, everything is classified as junk until the user trains some good messages.
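As a rough illustration of the behaviour requested in this bug (this is not the code from the checked-in patch, and it ignores how per-token values are combined into a message score), the per-token estimate could be sketched as below. The function name `token_probability` and its parameters are hypothetical, and the simple frequency ratio used in the balanced case is an assumption rather than the filter's actual formula.

```python
def token_probability(pro_count, anti_count, pro_messages, anti_messages):
    """Estimate how strongly a token indicates a trait (hypothetical helper).

    pro_count / anti_count: times the token was seen in messages trained
    as matching / not matching the trait.
    pro_messages / anti_messages: number of messages trained each way.
    """
    # Token never seen in training: no evidence either way.
    if pro_count == 0 and anti_count == 0:
        return 0.5
    # Unbalanced training: only "pro" messages exist, so force 1.0
    # instead of falling back to a neutral 0.5.
    if anti_messages == 0:
        return 1.0
    # Unbalanced the other way: only "anti" messages exist.
    if pro_messages == 0:
        return 0.0
    # Balanced case (assumed form): ratio of per-message token frequencies.
    pro_freq = pro_count / pro_messages
    anti_freq = anti_count / anti_messages
    return pro_freq / (pro_freq + anti_freq)


# Only "Personal" messages trained: a matching token scores 1.0.
print(token_probability(3, 0, 5, 0))
# Both sides trained: the estimate falls between 0 and 1.
print(token_probability(3, 1, 4, 4))
```

Under these assumptions, training only one side pushes every matching token to an extreme, which is what makes the imbalance visible to the user; once messages are trained the other way, the estimate moves back toward an ordinary ratio.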
fix checked in - http://hg.mozilla.org/comm-central/rev/629517ab551c
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED