Closed Bug 543210 Opened 14 years ago Closed 13 years ago

Disable .pdf attachment warner when .pdf is in a full URL

Categories

(Thunderbird :: Message Compose Window, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: burnus, Assigned: luckydog_1)

References

Details

(Keywords: polish, student-project)

Attachments

(1 file, 2 obsolete files)

Thunderbird 3.0.1 bothers me all the time when I paste an URL such as
  http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
claiming that I forgot to attach a file because of the string ".pdf".

It should never warn for .pdf if such a string is part of a URL.
Blake would that fit for students ?
Keywords: polish
Hmmm…  Okay, for a while there, I couldn't reproduce it, but now that I can, yeah, it seems like a reasonable student project.

(As a hint for the implementor, http://foo.com/bar.pdf doesn't show the warning, but http://foo.com/bar1.pdf does.  :)

Thanks,
Blake.
Keywords: student-project
I think that there is actually two bugs of this. 
  first, the logic of "attachment reminder" is wrong. Because when there is a English letter before ".pdf", thunderbird will not trigger a attachment reminder, but when there are other characters, like %, *, a number, or even Chinese characters, before ".pdf", the attachment reminder will show. so I think the logic is wrong. 
  Second, when ".pdf" is part of a URL, the attachment reminder shouldn't show.
Attached patch first try #1 (obsolete) — Splinter Review
This patch tries to fix the issue by splitting "mailData" into an array delimited by spaces or carriage returns and handling every string in the paragraphs instead of just handling the paragraphs as a whole.
Attachment #481470 - Flags: review?(bwinton)
Assignee: nobody → luckydog_1
Status: NEW → ASSIGNED
Version: 3.0 → Trunk
Attached patch second try #2 (obsolete) — Splinter Review
when i review my code, i found that my code will lead to duplicate keywords in the attachment reminder bar. so I just use the "break" in the for loop to solve this problem.
  In addition, I also fix the bug which is mentioned in my previous commment(the Comment 3 made by 2010-09-16 04:58:28). I add a new regular expression and handle the keywords separately. I think there are two types of keywords, the first type is ".pdf, .doc" and the second type is "attach, attachment, cv". The reason why I separate the keywords in two group is that when there is a letter, like "a", the first group should show the attachment reminder bar but the second group shouldn't.
Attachment #481784 - Flags: review?(bwinton)
Comment on attachment 481470 [details] [diff] [review]
first try #1

Beibei, please also turn your older patch obsolete if you're attaching newer and better patches, thanks! :)
Attachment #481470 - Attachment is obsolete: true
Attachment #481470 - Flags: review?(bwinton)
(In reply to comment #6)
> Comment on attachment 481470 [details] [diff] [review]
> first try #1
> 
> Beibei, please also turn your older patch obsolete if you're attaching newer
> and better patches, thanks! :)

okay, thanks!
Comment on attachment 481784 [details] [diff] [review]
second try #2

So, this patch is kind of bothering/worrying me for a couple of reasons, which I'll try to explain below.

>+++ b/mail/base/modules/attachmentChecker.js	Fri Oct 08 16:31:24 2010 +0800
>@@ -68,6 +68,8 @@
> {
>   // The empty string would get split to an array of size 1.  Avoid that...
>   var keywordsArray = (keywordsInCsv) ? keywordsInCsv.split(",") : [];
>+  var delimiter = /[" ", "\n"]/;
>+  var mailDataArray = mailData.split(delimiter);

Since we don't use delimiter anywhere else, we can just inline it.  But, it's really not going to match what you think it will.  What you have there will match any of the characters <quote>, <space>, <comma>, and <newline>, which is probably harmless, but I don't think it's what you intended.

>@@ -91,12 +93,17 @@
>     var start = IsCJK(kw.charCodeAt(0)) ? "" : ("(^|" + NOT_W + ")");
>     var end = IsCJK(kw.charCodeAt(kw.length - 1)) ? "" : ("(" + NOT_W + "|$)");
>     var re = new RegExp(start + kw + end, "i");

And then looking at where we use re got me thinking that perhaps we should split on NOT_W instead.  What do you think of that idea?

>+    var re1 = new RegExp(keywordsArray[i] + "$", "i");
>+    for(var j = 0; j < mailDataArray.length; j++) {
>+        var matching = re.test(mailDataArray[j]);

I'm also a little worried about the potential slowdown if we run the regular expression a million times.
Would you mind writing a little test which compares the two approaches, and sees how they perform on various sizes of input (and of keywords array)?

>+        // Ignore the match if it was a URL.
>+        if ((matching || (keywordsArray[i].charAt(0) == '.' && re1.test(mailDataArray[j]))) && !(/^http|^ftp/i.test(mailDataArray[j]))) {

So this will still catch "http://attachments.example.com/".  Do we want it to?
And it will also catch "<http://example.com/Long Directory/test.pdf>".  Again, do we want it to?

Thanks,
Blake.
Attachment #481784 - Flags: review?(bwinton) → review-
(In reply to comment #8)
> Comment on attachment 481784 [details] [diff] [review]
> second try #2
> 
> So, this patch is kind of bothering/worrying me for a couple of reasons, which
> I'll try to explain below.
> 
> >+++ b/mail/base/modules/attachmentChecker.js	Fri Oct 08 16:31:24 2010 +0800
> >@@ -68,6 +68,8 @@
> > {
> >   // The empty string would get split to an array of size 1.  Avoid that...
> >   var keywordsArray = (keywordsInCsv) ? keywordsInCsv.split(",") : [];
> >+  var delimiter = /[" ", "\n"]/;
> >+  var mailDataArray = mailData.split(delimiter);
> 
> Since we don't use delimiter anywhere else, we can just inline it.  But, it's
> really not going to match what you think it will.  What you have there will
> match any of the characters <quote>, <space>, <comma>, and <newline>, which is
> probably harmless, but I don't think it's what you intended.
> 
i already inline it, but i think you misunderstand my code, "/[" ", "\n"]/" is a regular expression, and the delimiter is <space> and <newline>, not <quote>, <space>, <comma>, and <newline>

> >@@ -91,12 +93,17 @@
> >     var start = IsCJK(kw.charCodeAt(0)) ? "" : ("(^|" + NOT_W + ")");
> >     var end = IsCJK(kw.charCodeAt(kw.length - 1)) ? "" : ("(" + NOT_W + "|$)");
> >     var re = new RegExp(start + kw + end, "i");
> 
> And then looking at where we use re got me thinking that perhaps we should
> split on NOT_W instead.  What do you think of that idea?
>
i try to use the NOT_W, but it seems that this doesn't work, i think we have to use <space> and <newline>, as we have to check every individual string or word, so that we can check whether it is a url or not.
 
> >+    var re1 = new RegExp(keywordsArray[i] + "$", "i");
> >+    for(var j = 0; j < mailDataArray.length; j++) {
> >+        var matching = re.test(mailDataArray[j]);
> 
> I'm also a little worried about the potential slowdown if we run the regular
> expression a million times.
> Would you mind writing a little test which compares the two approaches, and
> sees how they perform on various sizes of input (and of keywords array)?
> 
yes, this is a problem. i copy the whole book of little prince, and found that the performance is not good, after 1 to 2 seconds, the reminder bar will show. for the original one, the reminder bar will show in about 0.5 second. however, i think that people usually will not write such a long email. if they really need, they may prefer using attachments. in addition, 1 to 2 seconds may be a bit slow, but i think it is still acceptable. i try to come up with a new algorithm, but it seems to be hard to check whether it is a url by didn't split the mailData into strings by <space> and <newline>

> >+        // Ignore the match if it was a URL.
> >+        if ((matching || (keywordsArray[i].charAt(0) == '.' && re1.test(mailDataArray[j]))) && !(/^http|^ftp/i.test(mailDataArray[j]))) {
> 
> So this will still catch "http://attachments.example.com/".  Do we want it to?
> And it will also catch "<http://example.com/Long Directory/test.pdf>".  Again,
> do we want it to?
> 
actually, this will not catch "http://attachments(if this is "attachment", it will not match, either).example.com/", http://attachments.example.com/(without the quote), "<http://example.com/Long Directory/test.pdf>" or <http://example.com/Long Directory/test.pdf>(without the quote).i know what may want to show that, this will catch <http://example.com/Long Directory/test.pdf>
but actually, because the the original author define NOT_W in his own way, which means that "u003e"(unicode of ">") is not in the NOT_W, so neither the author's regular expression, which is re, nor my regular expression, which is re1, will catch this url.
> Thanks,
> Blake.
(In reply to comment #9)
> > >+  var delimiter = /[" ", "\n"]/;
> > >+  var mailDataArray = mailData.split(delimiter);
> > Since we don't use delimiter anywhere else, we can just inline it.  But,
> > it's really not going to match what you think it will.  What you have there
> > will match any of the characters <quote>, <space>, <comma>, and <newline>,
> > which is probably harmless, but I don't think it's what you intended.
> i already inline it,

By "inline it", I meant "delete the delimiter variable, and just use the regular expression in the call to mailData.split".

> but i think you misunderstand my code, "/[" ", "\n"]/" is
> a regular expression, and the delimiter is <space> and <newline>, not <quote>,
> <space>, <comma>, and <newline>

I realize that's what you intended, but that's not what the code does.

Testing it in node-repl (a handy Javascript command-line) shows us:
Fin:accountprovisioner (default⚡)… node-repl
Type '.help' for options.
(The REPL can also be started by typing 'node' without arguments)
> var delimiter = /[" ", "\n"]/;
> var mailData = "This,is a\"test.";

> var mailDataArray = mailData.split(delimiter);
> mailDataArray
[ 'This'
, 'is'
, 'a'
, 'test.'
]

Now, if it was only splitting on <space> and <newline>, then it would have given me:
[ 'This,is'
, 'a"test'
]
instead.

(Firebug in Firefox shows:
>>> var delimiter = /[" ", "\n"]/; var mailData = "T...Array = mailData.split(delimiter); mailDataArray
["This", "is", "a", "test."]
>>> mailData.split(" ");
["This,is", "a"test."]
as a check to see that we got the same result.)

> > >@@ -91,12 +93,17 @@
> > >     var start = IsCJK(kw.charCodeAt(0)) ? "" : ("(^|" + NOT_W + ")");
> > >     var end = IsCJK(kw.charCodeAt(kw.length - 1)) ? "" : ("(" + NOT_W + "|$)");
> > >     var re = new RegExp(start + kw + end, "i");
> > 
> > And then looking at where we use re got me thinking that perhaps we should
> > split on NOT_W instead.  What do you think of that idea?
> i try to use the NOT_W, but it seems that this doesn't work, i think we have
> to use <space> and <newline>, as we have to check every individual string or
> word, so that we can check whether it is a url or not.

Hmmm, I kind of agree, but I'm not sure about urls in CJKV languages.
i.e. the sentence "私がurlとしてhttp://www.thunderbird.com/attachment好き" (very roughly translated as "I have a url like http://www.thunderbird.com/attachment"), shouldn't trigger the attachment detector, even though it doesn't start with "http".  (Because we should split on the て before the http, and the 好 after attachment.)

Does that make sense to you?

> > I'm also a little worried about the potential slowdown if we run the regular
> > expression a million times.
> > Would you mind writing a little test which compares the two approaches, and
> > sees how they perform on various sizes of input (and of keywords array)?
> yes, this is a problem. i copy the whole book of little prince, and found that
> the performance is not good, after 1 to 2 seconds, the reminder bar will show.
> for the original one, the reminder bar will show in about 0.5 second.
> i try to come up with a new
> algorithm, but it seems to be hard to check whether it is a url by didn't split
> the mailData into strings by <space> and <newline>

Well, we could check the whole thing for "words" with attachment in them, and then see if each of those starts with "http".  (One call that scans the whole string should be faster than many calls that scan small pieces of the string.)

Thanks,
Blake.
i improve the algorithm. though the algorithm looks troublesome, but it is more efficient. the performance is almost the same with the original one. the idea of the algorithm is base on your suggestion(we could check the whole thing for "words" with attachment in them, and then see if each of those starts with "http")

thanks,
Beibei
Attachment #481784 - Attachment is obsolete: true
Attachment #486297 - Flags: review?
Attachment #486297 - Flags: review? → review?(bwinton)
Comment on attachment 486297 [details] [diff] [review]
improve the algorithm

A few general things first:

1) We should really add a new test.  Hopefully one that fails without this patch, but passes with this patch.  Fortunately, the testing infrastructure is pretty much all set up for you, and so you should only need to edit the file:
http://mxr.mozilla.org/comm-central/source/mail/base/test/unit/test_attachmentChecker.js#95
and run "make SOLO_FILE="test_attachmentChecker.js" -C mail/base/test/ check-one"

2) A bunch of these lines are longer than 80 characters.  You'll need to break them up at a reasonable point.

3) Come to think of it, you should probably fix all the problems listed at http://beaufour.dk/jst-review/?patch=https%3A%2F%2Fbug543210.bugzilla.mozilla.org%2Fattachment.cgi%3Fid%3D486297&file=

  (You can check future versions of your patch at http://beaufour.dk/jst-review/ yourself before I do, so that you don't run into this again.  :)

4) Another idea I just had was "What if we just removed the urls before checking for attachment keywords?"  There's a function called mozITXTToHTMLConv.FindURLInPlaintext which looks like it will find URLs in plaintext, and so in theory we could use that to loop through the mail data, removing urls as we go, before we pass it to the attachment checker.  (I'm not sure how well that would work, and I suspect that you won't be able to load that file from within the Web Worker, and so will need to filter the urls out in http://mxr.mozilla.org/comm-central/source/mail/components/compose/content/MsgComposeCommands.js#1364 before you call the GetAttachmentKeywords function, but it's probably worth trying that approach out in a different patch.  And, of course, that would require a different test to verify that it works correctly.)

>+++ b/mail/base/modules/attachmentChecker.js	Wed Oct 27 14:42:00 2010 +0800
>@@ -90,13 +90,45 @@
>     // space delimited.
>     var start = IsCJK(kw.charCodeAt(0)) ? "" : ("(^|" + NOT_W + ")");
>     var end = IsCJK(kw.charCodeAt(kw.length - 1)) ? "" : ("(" + NOT_W + "|$)");
>-    var re = new RegExp(start + kw + end, "i");
>-    var matching = re.exec(mailData);
>-    // Ignore the match if it was a URL.
>-    if (matching && !(/^http|^ftp/i.test(matching[0])))
>-      // We're not worried about matching too much because we only add the
>-      // keyword to the list of found keywords.
>-      keywordsFound.push(keywordsArray[i]);
>+    var re = new RegExp(start + kw + end, "g" + "i");
>+    var re1 = new RegExp("\\" + keywordsArray[i], "g" + "i");

I don't really understand why you're putting a \ in front of each keyword, instead of using the kw variable (which is the keyword with the regular expression characters escaped.

>+    do {
>+      var matching = re.exec(mailData);
>+      var matching1 = re1.exec(mailData);
>+      if (matching || (keywordsArray[i].charAt(0) == '.' && matching1 && !(/[a-z]/i.test(mailData.charAt(re1.lastIndex))))) {

Won't this miss things that end with Greek letters?  (I did a bunch of work to get that kind of thing working correctly in bug 527018, and I don't want to see that get reverted.)

>+        var j = 2;
>+	var splitString = "";
>+        // to get the string which contains the keywords out of the mailData
>+        //"[$-_.+!*(),';/?:@=&]|[0-9a-zA-Z]" is the characters which is allowed in an URL
>+        while(matching && re.lastIndex-j >= 0 && /[$-_.+!*(),';/?:@=&]|[0-9a-zA-Z]/.test(mailData.substring(re.lastIndex-j, re.lastIndex-j+1))) {
>+          splitString = mailData.substring(re.lastIndex - j, re.lastIndex);
>+          j++;
>+          // if there is http or ftp, the while loop will stop

Shouldn't that be "http://" or "ftp://"?  And should we also test for "https://" and "mailto:" and "sftp://"?

>+          if(/^http|^ftp/i.test(mailData.substring(re.lastIndex-j, re.lastIndex-j+4))) {
>+            splitString = mailData.substring(re.lastIndex-j, re.lastIndex);
>+            break;
>+          }
>+        }

So, you're incrementing j, but then subtracting it, and taking a lot of substrings?  Would it be easier to just start j at re.lastIndex, and decrement it?  And you're not updating matching, so you're really just looping on the rest of the conditions?  I think this code needs a comment describing what you're trying to do, and I also think that there's probably a way to do it that only runs the regular expression matcher once.

If you really want to go this route, what about adding:
([$-_.+!*(),';/?:@=&0-9a-zA-Z]+)?
to the start of the regular expression, and then check if matching[1] exists, and starts with http or ftp?

Of course, I think that will fail if you have input data like:
http://test.pdf  and another .pdf
since I think it will stop after finding the first ".pdf", which it will then ignore, and never get to the next ".pdf" which should trigger the warning.

(I've also run thunderbird with this patch a bit, and I find it extremely flaky.  It doesn't seem to warn me for most of my attachment keywords…)

>+	var k = 2;
>+	var splitString1 = "";
>+        // to get the string which contains the keywords out of the mailData
>+        //"[$-_.+!*(),';/?:@=&]|[0-9a-zA-Z]" is the characters which is allowed in an URL
>+        while(matching1 && re1.lastIndex-k >= 0 && /[$-_.+!*(),';/?:@=&]|[0-9a-zA-Z]/.test(mailData.substring(re1.lastIndex-k, re1.lastIndex-k+1))) {

Okay, I'm really unclear as to why you seem to be doing this whole thing twice.

Perhaps it would help if you added some comments to the code describing what you're trying to do, and why you need to do it.

Thanks,
Blake.
Attachment #486297 - Flags: review?(bwinton) → review-
Depends on: 571486
This bug has been fixed in Bug 635993.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: