Closed Bug 124641 Opened 22 years ago Closed 13 years ago

Filter or Search: does not handle multi-line (wrapped, folded) headers correctly when search term spans lines

Categories

(MailNews Core :: Search, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 5.0b1

People

(Reporter: superbiskit, Assigned: rkent)

References

(Blocks 1 open bug)

Details

(Keywords: testcase, Whiteboard: [see comment 68][datalossy])

Attachments

(2 files, 6 obsolete files)

Our friends at GNU have a couple of mailing lists where they got very wordy
about the List-Id: header, thus-
<quote>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/ddd>,
	<mailto:ddd-request@gnu.org?subject=subscribe>
List-Id: Discussion list for DDD,
	the GNU graphical debugger front end <ddd.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/ddd>,
	<mailto:ddd-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/ddd/>

</quote>

RFC822/2822 is precise that the additional lines with leading whitespace are
exactly equivalent to stringing the whole thing out with a single space
separating the "lines."
However, Mozilla cannot recognize either "List-Id contains <ddd.gnu.org>" or
"List-Id IS [first line][space][second line]"
Both of these constructions should be recognized.
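For reference, RFC 2822 section 2.2.3 describes unfolding as removing any CRLF that is immediately followed by whitespace; the whitespace itself stays. A minimal sketch of that step follows. The name UnfoldHeader is hypothetical and is not taken from the Mozilla source, and accepting bare LF or bare CR as a line break is a leniency assumed here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: unfold one header field in the sense of RFC 2822 2.2.3.
// A line break followed by a space or tab is dropped (the whitespace is kept);
// the first line break NOT followed by whitespace ends the field.
std::string UnfoldHeader(const std::string& raw)
{
  std::string out;
  out.reserve(raw.size());
  for (std::string::size_type i = 0; i < raw.size(); ++i) {
    char c = raw[i];
    if (c == '\r' || c == '\n') {
      std::string::size_type next = i + 1;
      if (c == '\r' && next < raw.size() && raw[next] == '\n')
        ++next;                          // treat CRLF as a single line break
      if (next < raw.size() && (raw[next] == ' ' || raw[next] == '\t')) {
        i = next - 1;                    // folded: skip only the line break
        continue;
      }
      break;                             // unfolded line break: field is done
    }
    out += c;
  }
  return out;
}
```

With the List-Id example above, the unfolded value contains "<ddd.gnu.org>" on the same logical line as the header name, so a "contains" test against it can succeed.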
QA Contact: esther → laurel
WFM 2002111512.  Closing.
Status: UNCONFIRMED → RESOLVED
Closed: 22 years ago
Resolution: --- → WORKSFORME
marking verified
Status: RESOLVED → VERIFIED
Not my experience at all!  In fact, this became markedly more critical for me
recently when many of the GNU mailing lists began wrapping the List-Id: header.

If the wrapped ID were handled correctly, by treating each run of whitespace
(including the newline) as a single space (%20), the filters would have
continued to work.  Instead, they all broke.

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3a) Gecko/20021206
Status: VERIFIED → UNCONFIRMED
Resolution: WORKSFORME → ---
Hmmm, this is odd, I can filter just fine with these multi-line items- I signed
up to the list just to check.   Adding qawanted keyword to try to get others to
reproduce.
Keywords: qawanted
Using Mac OS X.2.3, Moz 1.3a.

I think this bug may apply to search but not filters.  One of my filters
(searching for a spam-prone email address of mine in the custom header
"Received") worked on a multi-line header, but the search feature does not.  A
message that was caught by the filter, and logged, and put into my SPAM folder,
does not show up when I search the SPAM folder using that criterion!

Here is the header:

"Received: from yahoo.com ([200.21.238.118])
	by machine.domain.edu (8.12.5/8.12.5) with SMTP id h0D0XdGR019147	for
 <user@domain.edu>; Sun, 12 Jan 2003 18:33:44 -0600 (CST)"

Mind you, this was one of about six "Received:" headers in the message.

This is the operative bit of my filter:

OR (\"Received\",contains,user@domain.edu)"

which works.  But searching does not.
I probably should have included more precise examples, myself.  Here's one.

NB: Because some mail agents might do their own wrapping, I have 
 manually wrapped the "List-Id:" line at "GNU\" -- the actual example is
 all on one line and the '\' is not real at all.

Here is a header /before/ it was wrapped:
<quote>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/help-emacs-windows>,
	<mailto:help-emacs-windows-request@gnu.org?subject=subscribe>
List-Id: Discussion forum for users of the GNU\
 Emacs port to Windows <help-emacs-windows.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/help-emacs-windows>,
	<mailto:help-emacs-windows-request@gnu.org?subject=unsubscribe>
</quote>

And here is the RULE that was used to catch it:
<quote>
condition="OR (\"List-ID\",is,Users list for the GNU\
 Emacs text editor <help-gnu-emacs.gnu.org>)"
</quote>

Then they went and wrapped the header just before the canonical address, thus:
<quote>
Precedence: list
List-Id: Discussion forum for users of the GNU Emacs port to Windows
 <help-emacs-windows.gnu.org>
List-Help: <mailto:help-emacs-windows-request@gnu.org?subject=help>
</quote>

And the above RULE no longer worked.  If I am correctly interpreting the RFC,
the two versions of the header are exactly equivalent and the rule should *not*
have broken.  The List-Id: header should have been un-wrapped before doing the
comparison.

In my case, Filter and Search are having exactly the same results.
I too use filters, and sometimes Mozilla fails to move a mail.
I use an X-Text-Classification: tag in the header and have a Mozilla filter
to check the value of that. It works most of the time, but here is one sample
where it doesn't.

From - Thu Jan 16 23:52:40 2003
X-UIDL: 6d4d7ef704b9251136a425945481b254
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Auth-No: 
Return-Path: <pharmacy_118@yahoo.com>
Received: from mx1.mail.yahoo.com not authenticated [148.78.243.121]
	by smtp1.home.se with NetMail SMTP Agent $Revision:   3.16  $ on Novell NetWare;
	Thu, 16 Jan 2003 23:42:25 -119304547
From: Pharmacy <pharmacy_118@yahoo.com>
To: <doesnt@matter>
Subject: amazing youth regained
The health discovery that actually reverses aging while burning fat, without
dieting or exercise! This proven discovery has even been reported on by the New
England Journal of Medicine. Forget aging and dieting forever! And it's
Guaranteed! * Reduce body fat and build lean muscle WITHOUT EXERCISE! 
* Enhace sexual performance 
* Remove wrinkles and cellulite 
* Lower blood pressure and improve cholesterol profile 
* Improve sleep, vision and memory 
* Restore hair color and growth 
* Strengthen the immune system 
* Increase energy and cardiac output 
* Reverse Aging dramatically years in 6 months of use !!! 
X-Text-Classification: spam
X-POPFile-Link: http://127.0.0.1:8080/jump_to_message?view=popfile1042675200_145.msg

I'd guess it has to do with the subject line being on multiple rows.

(Using Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2) Gecko/20021126)
I can reproduce, and am confirming by popular demand. Removing qawanted keyword
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: qawanted
*** Bug 199540 has been marked as a duplicate of this bug. ***
OS: Windows 98 → All
Hardware: PC → All
Attached patch: proposed patch (obsolete)
This patch replaces GetNextLocalLine() by GetNextLocalLogicalLine() which
doesn't use ReadLine() but Read() and then gets a logical line from the buffer
using RemoveCRLF().
RemoveCRLF() is able to distinguish line breaks within a folded header from a
line break at the header line end. So we also get multi-line headers correctly.


Although GetNextFilterLine() already extracted multi-line headers correctly, I
also added RemoveCRLF() to GetNextFilterLine() to remove the line breaks in
folded headers. That's because of RFC 2822, 2.2.3: "Each header field should be
treated in its unfolded form for further syntactic and semantic evaluation."

RemoveCRLF() may look complicated and costly, but most of the constructs are
also in nsRandomAccessInputStream::readline() and the functions it calls. So the
new behaviour will not be noticeably slower than the old.

An existing problem is now also fixed: a line that couldn't be read completely
(because it is longer than the buffer, 512 bytes at the moment, so the buffer
doesn't contain a single CR or LF) led to wasTruncated == true and stopped
Mozilla from reading more of the message.

I also made some changes in MatchArbitrary() to save the PL_strncasecmp() if buf
starts with CR or LF (that is, it is the header terminator).
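To illustrate the idea of pulling logical lines out of raw reads rather than physical lines, here is a rough sketch. It is not the attached patch: the class name LogicalLineAssembler and its Feed() method are hypothetical, it uses std::string instead of a fixed 512-byte buffer, and it only recognises LF and CRLF line breaks.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: accumulate raw chunks and hand back complete logical
// (unfolded) header lines. A line is only complete once the byte AFTER its
// line break is known, because that byte decides whether the break is a fold.
class LogicalLineAssembler {
public:
  std::vector<std::string> Feed(const std::string& chunk)
  {
    mPending += chunk;
    std::vector<std::string> done;
    std::string::size_type pos = 0;
    while ((pos = mPending.find('\n', pos)) != std::string::npos) {
      if (pos + 1 >= mPending.size())
        break;                                    // need the next byte to decide
      // Start of the line break, including a CR directly before the LF.
      std::string::size_type brk =
          (pos > 0 && mPending[pos - 1] == '\r') ? pos - 1 : pos;
      char next = mPending[pos + 1];
      if (next == ' ' || next == '\t') {
        mPending.erase(brk, pos + 1 - brk);       // fold: drop the break only
        pos = brk;
      } else {
        done.push_back(mPending.substr(0, brk));  // complete logical line
        mPending.erase(0, pos + 1);
        pos = 0;
      }
    }
    return done;
  }

private:
  std::string mPending;  // bytes that do not yet form a complete logical line
};
```

Because nothing is emitted until the following byte has been seen, a header that straddles two reads is still returned as one line. A real implementation would also need a way to flush the final line at end of input.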
Assignee: mscott → ch.ey
Status: NEW → ASSIGNED
Attachment #141319 - Flags: review?(bienvenu)
Comment on attachment 141319
proposed patch

mostly looks good, thx for doing this - I have some nits about the variable
and method names in RemoveCRLF and a question.

cp and tp aren't useful variable names to me - I'm guessing that cp is charPtr
but I'm not sure what tp is.

Doesn't RemoveCRLF potentially remove multiple CRLF's? In which case it should
be called RemoveCRLFs. Or maybe CoalesceMultilineHeader().

If the next line starts with multiple spaces or tabs, I don't see us removing
all the spaces, just the leading space. And I don't see us replacing the tab
with a space. I could be mis-reading, however...
Comment on attachment 141319
proposed patch

Well, cp and tp were taken from readline that has originally been used. I guess
it means target pointer.
I don't know an intelligent name because these pointers are simply first and
second pointer. Only in the memmove it's clear that cp is the destination and
tp the source.

Coalesce - I had to look it up in my dictionary. :) But yes, that name would be
ok.

This patch doesn't remove a single blank or tab, and that was intended. I don't
know if the spaces and tabs are always just folding or if there can be leading
ones for other reasons.
But replacing all spaces and tabs by a single space is simple. Shall I?


One thing I just noticed is, that GetNextLine() is also used for MatchBody()
and that my changes screwed this up completely.
It was necessary to add a condition and move another.
Attachment #141319 - Attachment is obsolete: true
Attachment #141319 - Flags: review?(bienvenu)
Currently, lines that are part of a folded header but start after byte bufSize
(512) in the header entry aren't recognized as part of the header entry.
I could work around this in MatchArbitraryHeader() with some effort. But I'm not
sure if this is worth the cost and trouble, since headers that long are very
uncommon. Any thoughts?
Christian, I think you're right that hdrs > 512 bytes are an edge case that
might not be worth handling right now.

Yes, I figured those var names were from copied code but this is still new code.

This comment made me think we should do the replacing of multiple tabs/spaces:

"RFC822/2822 is precise that the additional lines with leading whitespace are
exactly equivalent to stringing the whole thing out with a single space
separating the "lines."

But thinking about it more, it just says "equivalent". However, if you wanted a
filter to match "Discussion list for DDD,
	the GNU graphical"..., it would be easier to write the filter if we made it into:

"Discussion list for DDD, the GNU graphical"

So I think it would be a good idea.
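A sketch of what "replacing multiple tabs/spaces" at a fold could look like is shown below. It is only an illustration of the behaviour discussed here, not the code that was attached; CollapseFolds is a hypothetical name, and trimming trailing whitespace before the line break is an extra assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: unfold one header field and replace each fold (line
// break plus the run of spaces/tabs around it) with exactly one space.
std::string CollapseFolds(const std::string& raw)
{
  std::string out;
  std::string::size_type i = 0;
  while (i < raw.size()) {
    char c = raw[i];
    if (c != '\r' && c != '\n') {
      out += c;
      ++i;
      continue;
    }
    std::string::size_type next = i + 1;
    if (c == '\r' && next < raw.size() && raw[next] == '\n')
      ++next;                                   // CRLF counts as one break
    if (next >= raw.size() || (raw[next] != ' ' && raw[next] != '\t'))
      break;                                    // not a fold: end of the field
    while (next < raw.size() && (raw[next] == ' ' || raw[next] == '\t'))
      ++next;                                   // skip leading whitespace
    while (!out.empty() && (out.back() == ' ' || out.back() == '\t'))
      out.pop_back();                           // drop trailing whitespace
    out += ' ';                                 // one space replaces the fold
    i = next;
  }
  return out;
}
```

Applied to the DDD header, this yields "Discussion list for DDD, the GNU graphical ..." so a filter can be written with a plain space where the fold used to be.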
> Yes, I figured those var names were from copied code but this is still
> new code.

What about cp == beforeSkip and tp == behindSkip. As I wrote, I don't know
intelligent names for them.

> So I think it would be a good idea.

Ok, done.
ok, cp points to the linebreak, and tp starts off at the line break and then
gets advanced to the start of the next line. Perhaps these variables are hard to
name because their meaning shifts around, which doesn't make the code easy to
read. Let me see if I can rewrite this to make more sense.
How about this? I haven't tested it, but I think it does what the old code does,
and it's more readable.


+PRInt32 nsMsgBodyHandler::RemoveCRLF(char *s)
+{
+  PRInt32 skipped = 0;
+
+  char *eol = strpbrk(s, "\n\r");
+  char *nextLine;
+  while(eol && eol != s)
+  {
+    nextLine = eol + 1;
+    if ((*eol == '\n' && *nextLine == '\r') || (*eol == '\r' && *nextLine == '\n'))
+      nextLine++; // possibly a pair.
+
+    // add chars we will skip
+    skipped += nextLine - eol;
+
+    // next line begins with white space
+    // and contains a line break (what means we have a whole line)
+    if(*nextLine == ' ' || *nextLine == '\t')
+    {
+      memmove(eol, nextLine, strlen(nextLine) + 1);   // delete the line breaks
+      eol = strpbrk(nextLine, "\n\r");          // jump to the next line break
+    }
+    else
+    {
+      *eol = '\0';
+      break;
+    }
+  }
+  
+  return skipped;
+ }
Hm, yes, that's fine.
But
> eol = strpbrk(nextLine, "\n\r");
is dangerous since nextLine is a few bytes in the next line after the memmove().
I'd use
 eol = strpbrk(eol, "\n\r");

And
> if ((*eol == '\n' && *nextLine == '\r') || (*eol == '\r' && *nextLine == '\n'))
is just copied, but
> if (*eol == '\r' && *nextLine == '\n')
should also be safe. Or am I missing something?


And another new question. MatchArbitraryHeader() is only called for non-standard
headers. So e.g. Subject: or To: lines, which are extracted using other
functions, do in fact contain all data from multiple lines, but also the CRLF
and all leading whitespace.

Do you think that's an issue we should care about? Applying 
CoalesceMultilineHeader() to the data in RowCellColumnToCharPtr() (which
delivers data to e.g. GetSubject() called in ProcessSearchTerm()) would work.
But apart from Subject: I don't think anyone searches for a string across line
boundaries (for Subject, CoalesceMultilineHeader() could also be applied to the
subject string in ProcessSearchTerm() before MatchRfc2047String()).
>I'd use
> eol = strpbrk(eol, "\n\r");

ah, you're right because the old line terminators are gone.

>> if (*eol == '\r' && *nextLine == '\n')
>should also be save. Or am I missing something?

I wondered about that too - it seems safe to me.
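Putting the two corrections from this exchange together (rescan from eol after the memmove, and test only for a CR LF pair), the routine might look like the sketch below. It is an untested-in-tree illustration: it is written as a free function with a plain int return so that it compiles on its own, whereas the thread's version is a member of nsMsgBodyHandler returning PRInt32.

```cpp
#include <cassert>
#include <cstring>

// Sketch of the routine discussed above, with both agreed changes applied.
// Unfolds the first header field of the NUL-terminated buffer s in place,
// terminates the buffer at the end of that field, and returns the number of
// line-break characters that were consumed.
int CoalesceMultilineHeader(char* s)
{
  int skipped = 0;
  char* eol = std::strpbrk(s, "\n\r");
  while (eol && eol != s) {
    char* nextLine = eol + 1;
    if (*eol == '\r' && *nextLine == '\n')
      ++nextLine;                             // CR LF pair

    skipped += static_cast<int>(nextLine - eol);

    if (*nextLine == ' ' || *nextLine == '\t') {
      // Continuation line: delete the line break by shifting the rest left.
      std::memmove(eol, nextLine, std::strlen(nextLine) + 1);
      // The shifted text now starts at eol, so rescan from there.
      eol = std::strpbrk(eol, "\n\r");
    } else {
      *eol = '\0';                            // real end of the header field
      break;
    }
  }
  return skipped;
}
```

Like the earlier draft, this keeps the leading whitespace of each continuation line; collapsing it to a single space would be a separate step.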

RowCellColumnToCharPtr() is too low a level since it can be used for arbitrary
string values other than headers. It should either be in GetSubject(), GetTo(),
etc., or in the caller. I lean towards the caller of GetSubject(), your last
suggestion.
In order to call CoalesceMultilineHeader() from ProcessSearchTerm() we have to
create a new nsMsgBodyHandler object. Or it has to be moved outside any class.
The first is slower and more complex, the latter is not such good style.
Problem still in Thunderbird version 0.6+ (20040428)
Right, since no patch has been checked in, why should it be different?

Unfortunately the patch from bug 197166 broke my existing draft, so it will take
a few more weeks to get a working patch.
Spam Assassin attaches some pretty long headers in verbose mode--over 512 
characters.
yeah, I know - even though I broke Christian's patch, I did fix our handling of
lines over 512 bytes...
*** Bug 252115 has been marked as a duplicate of this bug. ***
*** Bug 148612 has been marked as a duplicate of this bug. ***
Would this bug be a blocker for bug 240924?

What about the case of bug 87653, where a MIME boundary is getting folded and 
then, on unfolding, has whitespace in the middle of the boundary text?  (It's 
some other mail program doing that rather bogus folding.)
See comments #18 and #19 about using this patch for GetSubject() too. It doesn't
look like it's a good idea to use it for non-search/-filter work.
*** Bug 243479 has been marked as a duplicate of this bug. ***
Adding keywords to summary for easier searching.
Summary: Filter or Search: do not handle multi-line headers correctly → Filter or Search: do not handle multi-line (wrapped, folded) headers correctly
This mishandling of headers also manifests in improper thread display when the
In-Reply-To: header is wrapped.

<quote>
In-Reply-To: Your message of "Sun, 07 Nov 2004 20:39:07 +1100."
             <20041107093907.GK79646@cirb503493.alcatel.com.au>
</quote>

Do the proposed patches address this manifestation of the bug as well?
Let me just say: this is an amazingly rookie bug. I don't think there's another
RFC822 parser on the planet that doesn't handle continued lines in headers
correctly and automatically. A freshman compsci student who wrote RFC822
header-parsing code that didn't take continued lines into account would get an
F. How such a parser got into production code is beyond me.

That this is *even an issue* makes me extremely concerned; that it's been left
open for TWO+ YEARS is simply mind-boggling.
(In reply to comment #32)
> Let me just say: this is an amazingly rookie bug. I don't think there's
> another RFC822 parser on the planet that doesn't handle continued lines in
> headers correctly and automatically.

This is no excuse for the bug, but let me say that the affected code is not the
one for handling headers like Date, From, To and the MIME headers. It's search
code that only looks for the header name at the line start. This search is done
block-wise, and that is IMHO the main problem.

> A freshman compsci student who wrote RFC822
> header-parsing code that didn't take continued lines into account would get an
> F. How such a parser got into production code is beyond me.
> 
> That this is *even an issue* makes me extremely concerned; that it's been left
> open for TWO+ YEARS is simply mind-boggling.

It's great to have such a professional here and we always appreciate people who
know how everything works. We'd be happy if you'd let us partake in your wisdom
when you fix the bug. Thanks for helping.
Some extra information on folded headers which I hope will be useful. Some email
clients fold the Content-Type header field for each keyword-value pair after the
first, thus:

	Content-Type: text/plain;
		format=flowed;
		charset="iso-2022-jp";
		reply-type=original

Noteworthy, perhaps, because folding of the Content-Type is performed regardless.

 Having read the comments, may I suggest that the Summary for this bug be
modified to read "Search matches text only on the first line of multi-line
(wrapped, folded) header fields"?
Product: MailNews → Core
Cases where it failed before are now working correctly in Thunderbird 1.0. Huge
thanks to whoever fixed it.
(In reply to comment #35)
> Cases where it failed before are now working correctly in Thunderbird 1.0.

I'm not seeing this working.  Could you provide specific examples?

For instance, I tried a search on    Content-Type, contains, "boundary" --
the only hits have unfolded Content-Type headers.
I have the following in an email header:

X-Spam-Status: No, score=-102.6 required=6.5 tests=BAYES_00,NO_REAL_NAME,
	USER_IN_WHITELIST autolearn=no version=3.0.1

In Thunderbird 0.6 I could search for "WHITELIST" in X-Spam-Status and it would
not find it. Now it does.
(In reply to comment #37)
> In Thunderbird 0.6 I could search for "WHITELIST" in X-Spam-Status and it would
> not find it. Now it does.

Doesn't work for me in 1.0.  Example:

X-Spam-Status: No, score=-3.595 required=5
	tests=BAYES_00,NO_REAL_NAME,SPF_HELO_PASS,SPF_PASS
	autolearn=no version=3.0.2

Searched for "BAYES_00" in X-Spam-Status and the message was not found. 
Searching for "No" or "required" worked, though.
By any chance does the term "whitelist" appear in any other header?
See bug 209488 comment 4 -- false positives occur when searching using 
"contains" and other headers have the search string.

I don't use whitelisting, but if I search X-Spam-Status for "HTML" I get a bunch 
of hits -- I believe because Content-Type also has "html".  But if I search for 
"TBODY" or "REAL_NAME", the spams containing those strings in X-Spam-Status 
don't show a hit.
I checked several more examples, including a case where "autolearn" was on the
4th line. It works rock solid.

Did you delete your old TB installation before installing the new one? The
version I'm running is 1.0 20041206.
(In reply to comment #40)
> Did you delete your old TB installation before installing the new one? The
> version I'm running is 1.0 20041206.

Yes.

First time through I uninstalled and renamed the remaining directory before
installing the new version.  Just to be sure, I just uninstalled, wiped the
folder, and reinstalled.  No change.

It's also 1.0 20041206.

I have not wiped my profile, however, because I expect that would be a serious
pain to rebuild.
Just to make sure we're on the same page: I'm selecting a folder, say Inbox, and
then doing a control-Shift-F to display the "Search Messages" dialog. I'm
selecting a custom entry of X-Spam-Status, the operator "contains" and then a
string of "autolearn" (not in quotes).

It finds emails where the string autolearn appears on the second line or later
of the X-Spam-Status header.

I don't have a clue what the difference is. It used to fail for me and I've been
checking whether it was fixed or not with every new Thunderbird Release (I think
it worked at 0.9 too.)

FWIW, I'm running Windows 2000.
(In reply to comment #38)
> Searched for "BAYES_00" in X-Spam-Status and the message was not found. 
> Searching for "No" or "required" worked, though.
Will "Bayes_00" and "bayes_00" work?
(In reply to comment #42)
> Just to make sure we're on the same page: I'm selecting a folder, say Inbox, and
> then doing a control-Shift-F to display the "Search Messages" dialog. I'm
> selecting a custom entry of X-Spam-Status, the operator "contains" and then a
> string of "autolearn" (not in quotes).

Exactly the same.

> FWIW, I'm running Windows 2000.

Same here.

(In reply to comment #43)
> Will "Bayes_00" and "bayes_00" work?

No, though I can't imagine why they would if an exact match fails.
(In reply to comment #44)
> (In reply to comment #43)
> > Will "Bayes_00" and "bayes_00" work?
> No, though I can't imagine why they would if an exact match fails.
To clarify: your problem is NOT a "case"-related problem (a bug is already open
for that with 'begins with').
 
As you say in comment #38, your problem sounds an example of this bug.
> X-Spam-Status: No, score=-3.595 required=5
> 	tests=BAYES_00,NO_REAL_NAME,SPF_HELO_PASS,SPF_PASS
> 	autolearn=no version=3.0.2
> Searched for "BAYES_00" in X-Spam-Status and the message was not found. 
> Searching for "No" or "required" worked, though.

You say as follows in comment #42.
>I'm selecting a folder, say Inbox, and then doing a control-Shift-F to display
the "Search Messages" dialog.
> I'm selecting a custom entry of X-Spam-Status, the operator "contains" and
then a string of "autolearn" (not in quotes).
> It finds emails where the string autolearn appears on the second line or later
of the X-Spam-Status header.

Does "CTRL+Shift-F" find "BASE_00" or "tests="?
Does the message filter find "tests="?
These questions are meant to clarify:
  - whether a general search algorithm problem is involved
  - whether this is a problem in message filters only
    (was the problem in search, which the summary mentions, already resolved?)
(In reply to comment #45)
> Does "CTRL+Sfift-F" find "BASE_00" or "tests="?

No.

> Does message filter find "tests="?

No.

Ctrl-Shift-F, then searching in the header X-Spam-Status, the "contains" option,
and then a string that only ever appears in the second or third lines of the
folded header does not find any messages.

Searching the exact same header with the same options for something that appears
in the *first* line of the folded header DOES return results.

The same is true of message filters.
(In reply to comment #46)
Slight correction on filters: It does seem to work *as mail comes in*, but not
when I run the filter through Tools->Run Filters on Folder.

I didn't delete my test filter, and a message came in and tripped it.  (The
action was setting a label.  I then reset the label to None, ran the filter from
the menu, and this time the same message did *not* trip the filter.)

I have no idea why this would make a difference, but it's there in my filter log.
(In reply to comment #47)
> (In reply to comment #46)
> Slight correction on filters: It does seem to work *as mail comes in*, but not
> when I run the filter through Tools->Run Filters on Folder.

Do you use "Global Inbox"?
If yes, read thru next FAQs,
 http://kb.mozillazine.org/Thunderbird_:_FAQs_:_Global_Inbox
 http://kb.mozillazine.org/Thunderbird_:_FAQs_:_Filters
then clarify what/where/when problem occurs, please.
(In reply to comment #48)
> Do you use "Global Inbox"?

No.

> then clarify what/where/when problem occurs, please.

This is a filter on a particular account, and I did all my testing on the inbox
for that account.

(In fact, if I were using the global inbox, the FAQ you referenced suggests I
would see the *opposite* problem - i.e. filters working manually, but not on
incoming mail)

And of course, it doesn't work with searches either.
 A little testing on the same email account has convinced me that searching the
header contents in a Thunderbird POP account setup and a Thunderbird IMAP
account setup yields different results.

 The POP and IMAP accounts point to the same email account on the same server
and should yield the same results. Searching for the unique receiving server
name in the 'Received' header, I see that:

   the POP account only matches on the first line of a folded Received header
   the IMAP account matches on any line of a folded Received header

Perhaps this explains the difference in results observed above?

Kevin, Mike and Kelson: what kind of accounts are you using, POP or IMAP?
(In reply to comment #50)
> Kevin, Mike and Kelson: what kind of accounts are you using, POP or IMAP?

I'm using POP.


(In reply to comment #50)
>    the POP account only matches on the first line of a folded Received header
>    the IMAP account matches on any line of a folded Received header

This sounds like it can explain Kelson's result in comment #47.
> Slight correction on filters: It does seem to work *as mail comes in*, but not
> when I run the filter through Tools->Run Filters on Folder.

"Filter for incoming mail on POP3" is done during receiving mail, then it
becomes the same process as search on IMAP, so there is no problem.
 - The header is analyzed on receipt, and
   the internal header data is possibly unfolded.
But for a "manual filter" or "search" on locally saved mail, the header data is
still folded in the mail folder file.
AHA! Things didn't get fixed with a new version of Thunderbird, but when I
switched from POP3 to IMAP.

When I search on IMAP folders it works. When I search on Local folders, it fails.
*** Bug 283759 has been marked as a duplicate of this bug. ***
*** Bug 274753 has been marked as a duplicate of this bug. ***
*** Bug 302396 has been marked as a duplicate of this bug. ***
I confirm this is still happening on Thunderbird 1.0.6 on Windows 2000, using POP3.

Could someone please increase the priority and/or severity? It's very annoying.
This bug cannot be a blocker for Tb 1.5, since every released version of
Mozilla mail&news and Tb has this problem, but I hope Tb 1.5 ships with this
bug fixed. I'm very tired of answering "it's a very old bug" in BBS, and of
closing bugs as DUP of this bug :-) 
Severity: normal → major
I can confirm that it is still occurring on TB trunk nightly 20050713.

To be SURE that it is because of the newline characters, I copied a test message
to two subfolders of one folder.

I then exited TB.

I modified the folder file for *one* of the folders and stripped the newline
characters from a particular Received header. Then I deleted the .msf file for
that TB folder.

I then started TB and did a search for unique characters that were in the
'second line' of that Received header.
TB found the message I modified to have NO newlines (because the characters
were now in the 'first line') but did NOT find the one that still had the newlines.

Good luck
One simple solution is to keep an unfolded version of the mail header in the
local mail file, as Frank has proven.
The single-line length limit of SMTP/POP/IMAP does not apply to a line in a
file (only the file size limit of the OS/file system and the buffer size limit
of the implementation do), although care is required when forwarding mail or
moving it to an IMAP server, in order not to violate standards.
The biggest problem is the difficulty of fallback and compatibility.
Can older versions of Moz mail/Tb read and handle an unfolded mail file properly?
How about other mail import programs or mailers?
 
I don't think it's important to check how other mail clients handle these
headers; the parser simply should comply with the RFC.

I'm seeing the same problem with the Subject field. I don't know if it's because
of the line breaks or tabs, but it looks bad.
It appears that the Subject: field is searched fully even if it is wrapped. Nevertheless, if the matching string straddles a line break then the match will fail. That is, if the search string is "the string" and it occurs in the Subject: field as "the\n string" then the match will fail.
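The straddling case can be reproduced in a few lines. The sketch below is illustrative only: Canonicalize and SubjectContains are hypothetical names, the input is assumed to be a single header value that has already been isolated, and case-insensitive matching is an assumption about how "contains" should behave.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: collapse every run of spaces, tabs, CRs and LFs into a
// single space (dropping leading and trailing runs) and lower-case the rest.
static std::string Canonicalize(const std::string& value)
{
  std::string out;
  bool inGap = false;
  for (unsigned char c : value) {
    if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
      inGap = true;
      continue;
    }
    if (inGap && !out.empty())
      out += ' ';
    inGap = false;
    out += static_cast<char>(std::tolower(c));
  }
  return out;
}

// "contains" test that is insensitive to folding and to letter case.
bool SubjectContains(const std::string& rawSubject, const std::string& term)
{
  return Canonicalize(rawSubject).find(Canonicalize(term)) != std::string::npos;
}
```

A plain substring search for "the string" in "This is the\r\n string to find" fails because of the embedded line break, while SubjectContains() on the same input succeeds.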

This bug also, naturally, affects saved search folders (which is where I just noticed it, again).
Is someone working on this?
I took a look at the source code.

I don't know much about it, but my best guess is that the problem is inside mailnews/mime/src, probably mimehdrs.cpp:MimeHeaders_get. It seems that function adds ",CR-LF-TAB" at the end of each line break, and whatever uses that function needs to parse the header field.

I would say the whole code needs a redesign, but for now I would say MimeHeaders_get shouldn't add those line breaks, or whatever is being used to pass header fields to the UI should parse accordingly. I think it's nsMimeHeaders::ExtractHeader.

Any thoughts?
re: "Any thoughts?"

Why is there any need to preserve the line-breaks in the headers at all?  If I'm correctly interpreting RFC-?(2822?), the headers should simply be "normalized" by replacing everything from the last non-blank on line (n) up to the first non-blank on line (n+1) by a single blank (x#20).  I'm not aware of any of the storage mechanisms that require short lines.  

So, why not simply normalize each header as it is received, and store it that way.
(In reply to comment #65)
> So, why not simply normalize each header as it is received, and store it that
> way.

In fact I dug a little bit more, and what I found is that the MimeHeaders_get function is only called at some point in the message display, not in the search, and not in the folder display.

So it seems there is code that handles folded header fields in message display, search, and folder display for each of IMAP, POP3 and local. In each one of these it's different, and in none of them does it work properly except message display, AFAIK. So it's really ugly.

I would say a single message header normalization code is the way to go, it's just that I don't know where to put it, and how each of the relevant modules should call it.

I tried IRC and the mailing lists to find someone who knows the code better, but no one answered.
*** Bug 330597 has been marked as a duplicate of this bug. ***
the fix in bug 338310 has fixed this problem except for the case where the search term spans lines. I'll leave this bug open for that problem.
Summary: Filter or Search: do not handle multi-line (wrapped, folded) headers correctly → Filter or Search: do not handle multi-line (wrapped, folded) headers correctly when search term spans lines
(In reply to comment #68)
> the fix in bug 338310 has fixed this problem except for the case where the
> search term spans lines. I'll leave this bug open for that problem.

And what about the subject being displayed incorrectly in the main window? It is also a problem related to the parsing of multiple lines.

*** Bug 356156 has been marked as a duplicate of this bug. ***
I can confirm that this bug is in place in Thunderbird 1.5.0.9 (Windows/20061207).

Example:
I need to search messages with the " 2006 " substring in the Received: header. The search results include messages with single-line Received: only, just like this one

Received: by beetle.zenon.net (Postfix, from userid 400) id AFB945286; Wed, 30 Nov 2005 14:16:24 +0300 (MSK)

but not these multi-line

Received: from pc-170-40-215-201.cm.vtr.net (pc-170-40-215-201.cm.vtr.net [201.215.40.170])
	by bird.zenon.net (Postfix) with ESMTP id 2A94A2CBC1;
	Thu, 15 Dec 2005 19:32:02 +0300 (MSK)
Received: by fly.zenon.net (Postfix, from userid 400)
	id 0647E7E6; Fri, 16 Dec 2005 21:31:27 +0300 (MSK)
Received: from SERVER2.computery.ru ([213.85.58.145] verified)
  by backend4.aha.ru (CommuniGate Pro SMTP 4.2.10)
  with SMTP id 72147738 for andris@aernet.ru; Mon, 14 Nov 2005 16:23:45 +0300

and the similar ones.
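Since a message usually carries several Received: headers, each possibly folded, a search has to look at every occurrence as a whole logical field. The following sketch shows one way to do that; LogicalFields and AnyHeaderContains are hypothetical names, and the header name comparison is case-sensitive here purely to keep the example short.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a header block into logical (unfolded) fields.
std::vector<std::string> LogicalFields(const std::string& block)
{
  std::vector<std::string> fields;
  std::istringstream in(block);
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r')
      line.pop_back();                         // strip the CR of a CRLF
    if (line.empty())
      break;                                   // blank line ends the headers
    if ((line[0] == ' ' || line[0] == '\t') && !fields.empty())
      fields.back() += line;                   // continuation of previous field
    else
      fields.push_back(line);
  }
  return fields;
}

// True if ANY occurrence of the named header contains term in its value.
bool AnyHeaderContains(const std::string& block, const std::string& name,
                       const std::string& term)
{
  const std::string prefix = name + ":";
  for (const std::string& field : LogicalFields(block)) {
    if (field.compare(0, prefix.size(), prefix) == 0 &&
        field.find(term, prefix.size()) != std::string::npos)
      return true;
  }
  return false;
}
```

Restricting the match to fields whose name matches also avoids hits that come from a different header or from the body.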
have you tried 2.0? The fix in that bug is not in 1.5.0.9, afaik.
Not yet but i'll try with the different installation of TB 2.0 because 2.0 is still in alpha stage.

Thanks for the tip, David.
2.0 beta 2 came out last week.
And it still has the same bugs.

 From: "Felipe Contreras" <felipe.contreras@foobar.com>
 To: felipe.contreras@foobar.com
 List-Id: Discussion list for breakline,
 	the breakline <break.org>
 Subject: This is a test
 	that is testing

 This is a one line message.

/usr/sbin/sendmail felipe.contreras@foobar.com < testmail.txt

The filter on the second line of the Subject works, but not on List-Id. And the subject has weird characters when the character after the line break is a tab.

This is on IMAP.
Attached image Bad subject line
The bad-subject-line problem is bug 271312 (TB, bug 240924 for the suite).

Per bug 184490, custom headers (which List-ID is) under IMAP get filtered OK in arriving mail, but fail when the filter is run "after-the-fact."  MailViews (the dropdown list at the top of the thread pane) also fail for custom headers under IMAP.  In these cases, the failure will occur whether the text is on the first line of the header or one of the folded lines after.

Which is not to say that arriving-mail, IMAP, custom-header filtering on folded headers is working 100% correctly, but I haven't had a chance to test that yet.
Right. Maybe there should be a bug that supersedes all these bugs: "Headers' parser is ****".

From what I could see in the code, each backend type (IMAP, POP3, etc.) has its own way to handle the messages, which is far from ideal.

Anyway, I'll check if the List-ID works when the text to filter is in the first line, and at arrival or after-the-fact.

It seems I won't be using Thunderbird for another couple of years.

Sorry if this seems like negative feedback, but I tried to solve it and I got no feedback... I don't think these kinds of things should be happening.
All the code (imap, news, pop) use the exact same header parser. They all have different ways of fetching headers, but that's because the protocols are different, which should be obvious on the face of it.
FWIW, I can confirm that this problem still exists in TB 2.0.0.0 final -- at least on incoming mail or after the fact for an IMAP account. (Haven't had a case to reproduce on POP.)
I'm seeing this in Thunderbird 2.0.0.6 (20070811) when using the Quick Search box to search subject lines which are folded. A direct copy from part of the subject line (where the text is folded) in the message pane, and pasted into the box, fails to find the message.

This raw header...
Subject: this is a long
    subject line
...is displayed in the message pane as one line. Copy and paste various bits into the Quick Search box:
* "this is a" -> match
* "a long subject" -> FAIL
* "subject line" -> match
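For illustration, here is a minimal sketch of the matching one would expect in this case. It assumes the raw header value is unfolded first, and that the whitespace following each line break is collapsed to a single space, which is how the subject is displayed in the message pane. UnfoldAndCollapse and HeaderContains are hypothetical names, not Thunderbird functions, and the comparison is case-sensitive to keep the sketch short.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: remove each line break (CRLF or bare LF) that is
// followed by a space or tab, and replace the whitespace run after it with
// a single space. Line breaks NOT followed by whitespace are kept, since
// they do not start a continuation line.
std::string UnfoldAndCollapse(const std::string& raw) {
  std::string out;
  std::size_t i = 0;
  while (i < raw.size()) {
    std::size_t brk = 0;  // length of the line break starting at i, if any
    if (raw.compare(i, 2, "\r\n") == 0)
      brk = 2;
    else if (raw[i] == '\n')
      brk = 1;

    std::size_t j = i + brk;
    if (brk > 0 && j < raw.size() && (raw[j] == ' ' || raw[j] == '\t')) {
      // Continuation line: skip the whole whitespace run, emit one space.
      while (j < raw.size() && (raw[j] == ' ' || raw[j] == '\t')) ++j;
      out.push_back(' ');
      i = j;
    } else {
      out.push_back(raw[i]);
      ++i;
    }
  }
  return out;
}

// Hypothetical helper: "contains" test against the unfolded header value.
bool HeaderContains(const std::string& rawValue, const std::string& term) {
  return UnfoldAndCollapse(rawValue).find(term) != std::string::npos;
}
```

With the folded subject above as input, all three search strings ("this is a", "a long subject", "subject line") would be found, because the comparison runs against the unfolded text rather than against the individual physical lines.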
QA Contact: laurel → backend
Whiteboard: [see comment 68]
Product: Core → MailNews Core
Summary: Filter or Search: do not handle multi-line (wrapped, folded) headers correctly when search term spans lines → Filter or Search: does not handle multi-line (wrapped, folded) headers correctly when search term spans lines
Version 2.0.0.19 (20081209)
I just experienced this problem when applying a subject filter on my INBOX. The message contains a single subject line spanning multiple lines in the message source. The Subject is displayed in a single line everywhere, but the "subject"-"contains"-filter does not match the message.

Source:
Message-ID: <23984203.01231276489500.JavaMail.lm@PCCM2>
Subject: JDialog Server Build N-trunk on PCCM2 is finished. (Build
 successful)
MIME-Version: 1.0

Filter:
Subject contains: "trunk on PCCM2 is finished. (Build successful)"
The title may not be entirely accurate.  Sometimes filtering fails -- even when the search term does not wrap -- as in the case of the following:

I occasionally get spam with non-Western character encoded subject or sender information.  I try to filter these out by looking for the '?=' that precedes the text.  This fails.

Here's a scenario:

1. Download and unzip the attached spam message
2. Drag-and-drop the extracted message to your Shredder Inbox
3. Create a message filter as follows:
a) Click Tools --> Message Filters...
b) Select the Inbox where you dropped the message and choose New...
c) Filter name: test
d) Apply filter when: Check Mail or Manually Run
e) Match all of the following: {Subject} {begins with} ?=
f) Perform these actions: Delete Message
g) Click OK
4. Highlight the filter you've just added and click "Run Now"
5. Nothing happens.

So then, I guess this bug isn't going to be fixed, given it's been languishing now for well over 7 years. :-(
(In reply to comment #88)
> test case: Spam email with victim addresses obscured
> e) Match all of the following: {Subject} {begins with} ?=
> The title may not be entirely accurate.

Subject: header of attached mail. ({CRLF}==0x0D0A)
> Subject: =?GB2312?B?1MLIscqxztLP68Tjo6zUwtSyyrHO0sTuxOOjrM7ewtvUwtSy1MLIsaOsztK1?={CRLF}
>  =?GB2312?B?xNDEyOfEx7rjucWyu7HktcTUwrnixKzErLXEzqrE49ejuKMu16PW0Mfvvdq/?={CRLF}
>  =?GB2312?B?7MDW?={CRLF}
Decoded subject begins with:
> 月缺时我想你,...

Matching is executed after decoding of Base-64 with GB2312.
=> Your comment #88 is absolutely invalid report and is absolutely irrelevant to this bug.
The title of this bug is absolutely accurate.
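To illustrate why a "begins with ?=" filter cannot match here, below is a rough sketch of decoding a single B-encoded word (RFC 2047 style, "=?charset?B?payload?="). It is only an assumption-laden illustration, not Thunderbird's decoder: DecodeBEncodedWord and Base64Decode are hypothetical names, Q-encoding and adjacent encoded words are ignored, and the decoded bytes are returned as-is without charset conversion (GB2312 conversion would need a real charset library). The test input is a made-up UTF-8 example, not the subject from the attached mail.

```cpp
#include <cassert>
#include <string>

// Decode standard base64. Stops at the first '=' padding or invalid character.
std::string Base64Decode(const std::string& in) {
  static const std::string kAlphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  unsigned int buffer = 0;
  int bits = 0;
  for (char c : in) {
    std::size_t v = kAlphabet.find(c);
    if (v == std::string::npos) break;  // padding or invalid input
    buffer = (buffer << 6) | static_cast<unsigned int>(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
    }
  }
  return out;
}

// Hypothetical helper: split "=?charset?B?payload?=" and decode the payload.
// Returns false if `word` is not a B-encoded word.
bool DecodeBEncodedWord(const std::string& word, std::string* charset,
                        std::string* bytes) {
  if (word.size() < 8 || word.compare(0, 2, "=?") != 0 ||
      word.compare(word.size() - 2, 2, "?=") != 0)
    return false;
  std::size_t q1 = word.find('?', 2);  // '?' that ends the charset
  if (q1 == std::string::npos || q1 + 2 >= word.size() - 2) return false;
  char enc = word[q1 + 1];
  if ((enc != 'B' && enc != 'b') || word[q1 + 2] != '?') return false;
  std::size_t start = q1 + 3;          // first payload character
  std::size_t end = word.size() - 2;   // index of the closing "?="
  *charset = word.substr(2, q1 - 2);
  *bytes = Base64Decode(word.substr(start, end - start));
  return true;
}
```

Since the search term is compared with the decoded text, the "=?...?=" wrapper is never part of the string being matched, so neither "=?" nor "?=" can be found at the start of the subject.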
(In reply to comment #89)
> Matching is executed after decoding of Base-64 with GB2312.
> => Your comment #88 is absolutely invalid report and is absolutely irrelevant
> to this bug.
> The title of this bug is absolutely accurate.

Whoa.  Somebody needs some fresh air.
Here is an example, which could be quite relevant for a lot of people:

I have a sieve filter moving all messages where "X-Spam-Level" contains "***" into a spam folder. Running a search for "X-Spam-Level" doesn't contain "***" gives me quite a lot of hits. So for people without a sieve-capable IMAP server, server-side spam filtering won't really work.

This is IMHO clearly somehow related to this bug, but "search term spans lines" does not apply here. There seems to be a multi-line header earlier in the message that breaks all custom header searches after it. The behavior seems a little inconsistent to me, however, because sometimes headers are found even in the presence of multi-line headers.

Should I open a new bug, or is there a possibility to handle all header parsing issues relating to search in this bug, i.e. to remove "when..." from the title of this bug?

Or is there a bug concerning this already? At least it is not part of Bug 519202.

BTW: Is any work still being done on this? As stated before, it seems that header parsing is quite messed up, with quite a few side effects, and nothing is listed as "depends on". This quite irritates me...
Wayne asked me to look at this. I'll add it to my ASSIGN list to keep it on my radar screen, but if anyone else feels inspired to work on this (as if!) don't be dissuaded.

Changing the component to Search.
Assignee: ch.ey → kent
Component: Backend → Search
QA Contact: backend → search
xref 
- bug 338761 - searching body of emails for text doesn't match word-wrapped text
- Bug 353746 - [mozTXTToHTMLConv] Structured text recognition should span line breaks / Bug 5351 - [mozTXTToHTMLConv] MIME linkifying code should cross linebreaks
another encounter with this (and Bug 338761) - <insert adjectives>.  gloda gets it right - or at least, it returns the right results.
Keywords: testcase
Whiteboard: [see comment 68] → [see comment 68][datalossy]
Hello,

I wish to add to the chorus: I am seeing this behaviour when I try to run a filter on an existing message sitting in the inbox, and the filter just isn't catching the message.  Therefore, as I see it, the component still should include "filter".

I initially searched for this issue in Google and ended up (mistakenly) posting my 2 cents worth in this forum topic here:

http://forums.mozillazine.org/viewtopic.php?f=31&t=344189&start=0

Unfortunately, I should have read that topic more carefully - a review of the previous posts in it ended up leading me here.  To avoid duplication, please read my post there for the additional info I wish to add.

There you will find a detailed description of a case of the behaviour I have encountered. I have included a copy of the headers in the email (sanitized) and a copy of my filter rules file.  I hope this helps.

Thanks.
Attached patch Fix for arbitrary headers (obsolete) — Splinter Review
This works for strings split over multiple lines in arbitrary headers, but not for standard headers (like subject). Perhaps we need to split this bug in two to deal with the separate cases.

I also decided to accumulate received headers, so that if there are multiple received headers you can match on a string in any of them. Not sure if that is the right behaviour or not - comments?

I'll ask for a review in a day or two.
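As a rough sketch of what "accumulate received headers" means here: all values of a repeated header are joined into one string, so a single "contains" test can match text in any of them. This is not the code from the patch; HeaderList, EqualsIgnoreCase and AccumulateHeader are hypothetical names, and the header values are assumed to be unfolded already.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: (name, unfolded value) pairs in message order.
using HeaderList = std::vector<std::pair<std::string, std::string>>;

// Header names are case-insensitive, so compare them ignoring ASCII case.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i])))
      return false;
  }
  return true;
}

// Join the values of every header called `name`, separated by one space.
// Returns an empty string if the header does not occur at all.
std::string AccumulateHeader(const HeaderList& headers,
                             const std::string& name) {
  std::string joined;
  for (const auto& header : headers) {
    if (!EqualsIgnoreCase(header.first, name)) continue;
    if (!joined.empty()) joined.push_back(' ');
    joined += header.second;
  }
  return joined;
}
```

One trade-off of joining with a space is that a search term could, in principle, match across the boundary between two separate headers. Testing each occurrence on its own would avoid that, at the cost of one comparison per occurrence.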
Certain values, including the important subject, are parsed in nsParseMailbox.cpp and not in the search. This patch also fixes folded whitespace there.

I need to review it some more myself before I ask for a review since this is such critical code.
Attachment #527425 - Attachment is obsolete: true
Attached patch Patch for review (obsolete) — Splinter Review
Let's get this reviewed.

Two issues to consider:

1) I accumulate received headers in search to allow a single search to look for values in any received header

2) I fix a line of code which I assume was supposed to allow the parser to detect a header with an improper space before the colon.

These are both optional if you think these are bad ideas.
Attachment #413431 - Attachment is obsolete: true
Attachment #528137 - Attachment is obsolete: true
Attachment #528457 - Flags: review?(dbienvenu)
Whiteboard: [see comment 68][datalossy] → [see comment 68][datalossy][has patch for review]
Accumulating received headers seems like the right thing to do. 

Not sure about 2) - I'll look at that more closely.
a couple nits:

+    // Should we allow an incorrect space after a header name? It seems like
+    //  that is what this code was supposed to do.
+    while (end > buf && (*(end - 1) == ' ' || *(end - 1) == '\t'))
       end--;

that code hasn't worked in a very long time, if ever, so I think it should just go.

+    PRBool isContinuationHeader = searchingHeaders ? NS_IsAsciiWhitespace(buf.CharAt(0))
+                                                   : false;

PR_FALSE, not false.
Comment on attachment 528457 [details] [diff] [review]
Patch for review

r=me, modulo the aforementioned nits.
Attachment #528457 - Flags: review?(dbienvenu) → review+
Attached patch patch to checkin (obsolete) — Splinter Review
with nits fixed.
Attachment #528457 - Attachment is obsolete: true
Attachment #529119 - Attachment is obsolete: true
Comment on attachment 529120 [details] [diff] [review]
No, THIS is the patch to checkin

Checked in http://hg.mozilla.org/comm-central/rev/7b75008cb771
Status: ASSIGNED → RESOLVED
Closed: 22 years ago → 13 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [see comment 68][datalossy][has patch for review] → [see comment 68][datalossy]
Target Milestone: --- → Thunderbird 3.3a4
Depends on: 655578
I'm having problems with filters since v5. Could the work done here be related to https://bugzilla.mozilla.org/show_bug.cgi?id=678322 ?
The "break" from the loop introduced in this patch (looking via https://github.com/mozilla/releases-comm-central.git mirror):

commit 786ac11a7b1899cba6f42b314ea9dcdf6bcac745
Author: Kent James <kent@caspia.com>
Date:   Fri Apr 29 12:14:13 2011 -0700

    bug 124641 - Filter or Search: does not handle multi-line (wrapped, folded) headers correctly when search term spans lines, r=bienvenu


seems to cause a huge regression, tracked in #678322, where if there are multiple headers of the same name only one is checked (while all should be checked).

@rkent, could you look at it?