Closed Bug 492555 Opened 11 years ago Closed 11 years ago

attachment reminder: Too common keywords, many false triggers

Categories

(Thunderbird :: Message Compose Window, defect, major)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b3

People

(Reporter: BenB, Assigned: mkmelin)

References

Details

Attachments

(1 file, 2 obsolete files)

See bug 244455 comment 80 ff and comment 41 ff.

It currently triggers on:
attachment
attached
doc
pdf
resume
cover letter

The list should not contain "attached" and "resume", as these
are normal language words ("I resumed the computer", "I was so attached to this
girl..." - TB: "Did you forget to add an attachment?"). Also, you can only
trigger on ".pdf", if you have code to not trigger on URLs, as .pdf very often
appears in URLs, and we do *not* want to encourage people to attach PDFs which
can just as well be linked, rather the opposite.
Even then, the keyword should be "\.doc", not ".doc", as this is a regexp and "." is a wildcard.

Keywords should, for now, only contain:

"attachment" and "cover letter"

After URLs are not checked, it should only contain:
attachment,\.doc,\.pdf

-mail.compose.attachment_reminder_keywords=attachment,attached,.doc,.pdf,resume,cover letter
+mail.compose.attachment_reminder_keywords=attachment,cover letter
Flags: blocking-thunderbird3?
Blocks: 244455
I'm not a native speaker, but "please find attached ..." is definitely one of the common phrases, at least in business correspondence.
Others I stumbled upon are
"attached please find/see"
"please find enclosed"
"I'm enclosing"

So I'd leave "attached" in, and add "enclosed" and "enclosing".
@Ben: You used as an example of false trigger, "I resumed the computer".

Does this then mean that, in the regexp, there is no checking to make sure that the characters preceding and following the keyword are non-word character?  That is: (^|[^\w])keyword(?!\w) (or something to that effect).  This would be a problem as not checking would trigger on sections of long or compound words.

And then there is the issue of this triggering if someone is sending from a saved draft of a reply having original correspondences that contain the keyword.

That said, it would also be nice to have an option with the dialog to check "Do not use this feature", in case there a user is exacerbated by too many false positives.
Steffen: Yes, but to me, the number (or percentage) of false positive is what matters, not the positives. And if a word is also often used in ways that do not referly to attachments, that means it should not trigger, as that would be
* annoying - MS paperclip effect
* diminish the useful of the feature itself.
* Worst, it causes dialogbox-weardown and -ignore, it's a well-known effect that users then ignore *all* dialog boxes, which is harmful for security. Such a feature here can well destroy the efforts the security team puts in reducing useless dialog boxen and improving security warnings.
> it would also be nice to have an option with the dialog to check
> "Do not use this feature"

Filed bug 492573 about that.
(In reply to comment #3)
> Steffen: [...]

Not Steffen. :-)  I usually add myself only to the voter list; that way I can add or remove myself from a bug without generating bug spam--especially on large lists.

Nonetheless, what do you think of using a regexp like: (^|[^\w])keyword(?!\w)

That would more strictly match a word and significantly reduce false positives.

So, for instance, if you have a keyword like ".doc", and people are communicating about code, you would avoid triggering on something like: xmlDoc.documentElement

You would also avoid, if a keyword was "resume", from triggering on: "resumed", "presumed", etc.

In my above regexp, "keyword" assumes that if you have a keyword like ".doc", the "." is already properly escaped somewhere else in code.
Yeah it's probably a good idea include word boundaries in the regexp, and the make the code escape dots before using dot-words in the regexp.

Though of course we can never be 100% right...
Chinmay: is that something you can take on?
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Target Milestone: --- → Thunderbird 3.0b3
Attached patch Fix, step 1, v1 (obsolete) — Splinter Review
This - for now - removes the keywords that cause too many false positives.

Once we have code to ignore URLs, we can re-add "\.doc" and "\.pdf", but for now, it's triggering too often (see comment 0).
Attachment #377021 - Flags: review?(mkmelin+mozilla)
Magnus: Yes, I will handle it.

I will add word boundaries in the regexp, and make the code escape dots before using dot-word in the regexp.

As I am busy with some other stuffs, I will be back with the solution in just 2 days.
Attached patch Fix, step 1, v2 (obsolete) — Splinter Review
This keeps "attached" in the list
Attachment #377030 - Flags: review?(mkmelin+mozilla)
(In reply to comment #8)
> Once we have code to ignore URLs, we can re-add "\.doc" and "\.pdf", but for
> now, it's triggering too often (see comment 0).

The word boundaries (+ dot escaping) will fix that, no?
Hm, though the extensions won't be terribly usable then i suppose.
> > ignore URLs, we can re-add "\.doc" and "\.pdf"

> The word boundaries (+ dot escaping) will fix that, no?

If you only want to trigger on "Please see the .doc in this mail", yes.
If you want to trigger on "In planning.doc, you can see", but not on
"http://foo/planning.pdf", then no, word boundaries don't help, you'll
need to recognize URLs.
I guess it shouldn't be too hard to filter out words starting with http: https:
Comment on attachment 377021 [details] [diff] [review]
Fix, step 1, v1

We'll have to bump the entity name too... I do thing "attached" should be in there. Resume should proably be résumé or get removed. Anyway, since it's a pain to bump, lets hold on until we know we have word boundary matching.
Attachment #377021 - Flags: review?(mkmelin+mozilla) → review-
Attachment #377030 - Attachment is obsolete: true
Attachment #377030 - Flags: review?(mkmelin+mozilla)
Re-assigning to dmose so we can move this forward.  Chinmay, I'm assuming you've gotten busy with work over the summer.  Let me know if you are in fact working on this still.
Assignee: chinmay_patel → dmose
Status: NEW → ASSIGNED
David:
My laptop has broken and I have sent it to HP to fix it :( .
I use Toronto Public Library PCs for internet access. Therefore, I am sorry to say that I won't be able to continue with this bug. Certainly, I am willing to work on other projects once I get my laptop back. 
Once again, I am really sorry to leave the bug unresolved.
Taking.
Assignee: dmose → mkmelin+mozilla
Status: ASSIGNED → NEW
This might help.

This code will set word boundaries and escape dotes from the keyword. Moreover, dot at the beginning of the keyword is most of the time an extension. Hence, I am taking out the starting word boundaries for such keywords.

          for (let i = 0; i < keywordsArray.length && !keywordFound; i++)
          {
            var temp = keywordsArray[i].replace(/\./g, "\\\.");

            if (temp.match("^\\\\\."))
              temp = temp + "\\b";
            else
              temp = " \b" + temp + "\\b";

            var re = new RegExp(temp, "i");
            keywordFound = re.test(mailData);
          }


Hopefully, this will help. 
(Note: I could't test this code for all possible cases.)
this is too smart. I created an email in the German language and used "doch" which also triggered this sheet. I have deactivated this *smart* feature because it is not useful for me. Please keep in mind keywords vs. locales.
Attached patch proposed fixSplinter Review
Require word boundaries, ignore links, tweak the keywords a bit (seem we can't bump the l10n value after all).

Some testing i did:
 o .doc          test http://example.com/test.doc          -> no match
 o doc           test http://example.com/test.doc now      -> no match
 o .doc          lieber doch nicht                         -> no match
 o doc           lieber doch nicht                         -> no match
 o .pdf          see this .pdf                             -> MATCH
 o .pdf          ...have a look at this .pdf?              -> MATCH
 o .pdf          see the file testcase.pdf                 -> MATCH
 o attachment    can you have a look at this attachment?   -> MATCH
 
Word boundary does mean stuff like /doc will also match, which sounds ok to me.
Attachment #377021 - Attachment is obsolete: true
Attachment #382356 - Flags: review?(bugzilla)
Status: NEW → ASSIGNED
Comment on attachment 382356 [details] [diff] [review]
proposed fix

This looks reasonable at a quick glance, but I'm going to pass to Phil as I think he'll be better at reviewing this.
Attachment #382356 - Flags: review?(bugzilla) → review?(philringnalda)
Duplicate of this bug: 498615
Attachment #382356 - Flags: review?(philringnalda) → review+
Comment on attachment 382356 [details] [diff] [review]
proposed fix

Where by "better" I think he meant "way slower." Looks fine, let's see if it's enough.
changeset:   2879:7674cfc3f3a4
http://hg.mozilla.org/comm-central/rev/7674cfc3f3a4

->FIXED
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: attachment warning: Too common keywords, many false triggers → attachment reminder: Too common keywords, many false triggers
Devs,

You're not going to come up with a list which pleases everyone; some of us (who forget attachments a lot) want a very broad list.  Other people who don't forget attachments find this feature very annoying ... and a lot of us use different words to indicate attachments.

The only real answers is to somehow make the word list user-editable.  If that's too hard, then just at least give this feature an on|off preference.
Josh: Have a look into the Composition preference settings, General tab.
Both options (on/off checkbox, list of keywords) are available there.
RSX,

Not on Mac they're not.
Josh: are you using the latest nightly? (This wasn't available for tb3beta2)
You need to log in before you can comment on or make changes to this bug.