Closed Bug 132340 Opened 23 years ago Closed 15 years ago

Local body search does not work if the body is encoded as Base64

Categories

(MailNews Core :: Search, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9beta2

People

(Reporter: nhottanscp, Assigned: jcranmer)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

Attachments

(4 files, 7 obsolete files)

5.37 KB, text/plain
Details
12.26 KB, patch
jcranmer
: review+
Bienvenu
: superreview+
Details | Diff | Splinter Review
1.17 KB, text/plain
Details
12.06 KB, patch
standard8
: review+
standard8
: superreview+
Details | Diff | Splinter Review
Local body search does not work if the body is encoded as Base64. IMAP search does not have this problem.
*** Bug 177825 has been marked as a duplicate of this bug. ***
I'd like to point out (incase it's not obvious to everyone) this causes spam-filtering to fail on base-64 email messages. I've been seeing a marked increase in these spams getting through my filters and it's very annoying.
This bug is a real problem (as already noted) for spam filtering. I can confirm that mail filtering doesn't work in Mozilla 1.2 beta when the message arrives in base 64 encoding format. I wonder what the effect of creating a filter rule searching for 'base-64' (or whatever the identifying text may be) as a short term workaround. (Knowing how forgetful I am I know I'd forget to take it out once this bug is fixed.)
nhotta wrote: > I thought the spam filtering is getting output from libmime which is already transfer-encoding decoded unlike the old filter which is using the message search code. Yes, I'm pretty sure we took care of this for spam. But nsMsgSearchTerm::MatchBody() doesn't appear to be doing any sort of decoding.
Todd/David: when you say spam-filtering, are you referring to constructing your own regular mail filters or using Mozilla's new Junk Mail infrastructure? The Junk Mail infrastructure should not have this problem. Seth: I wonder if this shouldn't be handled one level lower down, in nsMsgBodyHandler, which is called by nsMsgSearchTerm.
*** Bug 185804 has been marked as a duplicate of this bug. ***
mass re-assign.
Assignee: naving → sspitzer
re: comment 6 I've run into this as well. base64 hidden spam take a good half-dozen attempts at training before they filter. I strongly suspect the junkmail filtering skips all base64. This sort of makes sense, you wouldn't want to train it on the contents of a binary. What really needs to be done is the base64 portion opened, sniffed, and if HTML/text, bayesian filtered.
Product: Browser → Seamonkey
local body search definitely doesn't convert from base64, and I doubt the junk mail filter does either.
Assignee: sspitzer → mail
Let me take it for now.
Assignee: mail → jshin1987
Keywords: intl
*** Bug 273580 has been marked as a duplicate of this bug. ***
I think bug 37031 is the same as this.
*** Bug 335493 has been marked as a duplicate of this bug. ***
I found this bug could also be reproduced on Thunderbird 1.5.0.8 (Win32) and Thunderbird 3.0 alpha1 (Win32). The "search messages" function does not work on lots of base64-encoded mails on Chinese and Japanese platform. I really hope someone could help us and improve this great mail client.
Bug 268459 is this same problem, but for quoted-printable.
Assignee: jshin1987 → nobody
Product: Mozilla Application Suite → Core
QA Contact: laurel → search
(In reply to comment #13) > I think bug 37031 is the same as this. Without reading that bug in details, I would agree that they are duplicates, as decoding the attachment would imply to stop processing it undecoded ;-)
(In reply to comment #6) > Todd/David: when you say spam-filtering, are you referring to constructing your > own regular mail filters or using Mozilla's new Junk Mail infrastructure? The > Junk Mail infrastructure should not have this problem. > > Seth: I wonder if this shouldn't be handled one level lower down, in > nsMsgBodyHandler, which is called by nsMsgSearchTerm. Why hasn't this bug been fixed long ago when it sounds like a relatively simple fix? Why isn't this a major bug? It completely prevents anyone receiving massive amounts of encoded messages from using thunderbird.
Searching local messages by [BODY] [contains] "word" criteria do not find any messages with requested word(s) if messages are encoded as Base64, although there are a lot of messages in the mailbox with requested "word" visible via mail preview. Instead, it searches encoded messages in plain text without decoding the body, so may produce invalid results. Also, caused by this bug, the tool "Message Filters..." also does not work correctly with [BODY] [contains] criteria. Thus mail filtering mostly does not work on [BODY] [contains] criteria. Tested on messages containing plain English text and text/html encoded with base64. Confirmed to be present in Thunderbird version 2.0.0.6 (20070728) and Thunderbird version 3.0a1pre (2007091203)
(In reply to comment #19) > Why hasn't this bug been fixed long ago when it sounds like a relatively simple > fix? > Why isn't this a major bug? It completely prevents anyone receiving massive > amounts of encoded messages from using thunderbird. > It's not quite as simple as it sounds: base64 in messages arbitrarily breaks at line points (always 72 chars in length minus endline?), and searching works by feeding line by line. A mild hack would probably work and I can probably pound this out tomorrow, but I would need to look more in the operation of the feeding of the lines.
Assignee: nobody → Pidgeot18
This patch will not work for multipart messages where the leaves are base64-encoded (nsMsgBodyHandler doesn't know about those... yet). With any luck, I should have multipart messages working by the end of the week.
Attached patch Full patch for base64 (obsolete) — Splinter Review
This patch is my review-ready version. It properly parses messages that are fully base64 and multipart/* messages with base64, although it skips non-text/* leaves of the multipart/* messages and avoids proper recursing into the multipart/* tree. The idea behind not parsing non-text/* leaves is that people probably don't want spurious matches for binary attachments (e.g., "PK" matching all zip files).
Attachment #282483 - Attachment is obsolete: true
Attachment #282642 - Flags: superreview?(bienvenu)
Attachment #282642 - Flags: review?(dmose)
Status: NEW → ASSIGNED
Comment on attachment 282642 [details] [diff] [review] Full patch for base64 This looks like a good direction. However, as written, it makes the code somewhat more difficult to follow. I'd like to see more comments overall describing what the code is doing. Among other things, it would be good to describe how the modal parsing works and why we're doing it (maybe linking to a spin-off bug about doing the Right Thing once better MIME-parsing code is available). A few other thoughts: * making it clear from the names of the member variables which ones apply to the message as a whole versus the current part being decoded would be good. * it feels to me like the base 64 decoding wants to be called from inside ApplyTransformations, as we discussed on IRC. >Index: public/nsMsgBodyHandler.h >=================================================================== >+ PRBool m_base64encoded; // PR_TRUE if the message is in base64 >+ PRBool m_isMultipart; // PR_TRUE if the message is a multipart/* message >+ PRBool m_isText; // PR_TRUE if the message is a text/* message >+ >+ nsCString boundary; // The boundary string to look for Please use spaces instead of tabs for indentation. > void nsMsgBodyHandler::Initialize() > // common initialization code regardless of what body type we are handling... > { > // Default transformations for local message search and MAPI access > m_stripHeaders = PR_TRUE; > m_stripHtml = PR_TRUE; > m_messageIsHtml = PR_FALSE; >+ m_base64encoded = PR_FALSE; >+ m_isMultipart = PR_FALSE; >+ m_isText = PR_FALSE; > m_passedHeaders = PR_FALSE; > m_headerBytesRead = 0; >- >+ boundary.Assign(""); nsCString should automagically initialize to the empty string; no need to do that explicitly.
Attachment #282642 - Flags: review?(dmose) → review-
Attachment #282642 - Flags: superreview?(bienvenu)
Attached patch Revised base64 implementation (obsolete) — Splinter Review
Refactored the code slightly so that it makes more sense and saves 2 PRBool's of memory (from the last patch)! Also much more comments and more sense in my MIME parsing.
Attachment #282642 - Attachment is obsolete: true
Attachment #283791 - Flags: review?(dmose)
Comment on attachment 283791 [details] [diff] [review] Revised base64 implementation This is definitely moving in the right direction. I'd still like to see more commentary explaining how ApplyLineTransformations works, complete with comments describing in detail how the parameters work. > PRBool m_messageIsHtml; // PR_TRUE if the Content-type header claims text/html >+ PRBool m_base64encoded; // PR_TRUE if the message is in base64 >+ PRBool m_isText; The naming and commentary for the above three variables still doesn't reflect that they only apply to the part of the message currently being processed, not the entire message. > PRInt32 nsMsgBodyHandler::GetNextLine (nsCString &buf) > { > PRInt32 length = 0; > PRBool eatThisLine = PR_FALSE; >- >+ nsCString nextLine; >+ Prefer nsCAutoString for strings that are on the stack. If I recall correctly, they're actually the same type in the external string API, but it's a good habit to get into for cases where you need to touch code that's in the core. >+ if (m_base64encoded) >+ { >+ Base64Decode(buf); >+ length = ApplyTransformations(buf, buf.Length(), eatThisLine, buf); >+ } It would be good to have a comment here explaining why the above is necessary. >- if (StringBeginsWith(buf, NS_LITERAL_CSTRING("Content-Type:")) && buf.Find("text/html") >= 0) >- m_messageIsHtml = PR_TRUE; >+ if (StringBeginsWith(line, NS_LITERAL_CSTRING("Content-Type:"))) >+ { >+ if (line.Find("text/html") >= 0) >+ m_messageIsHtml = PR_TRUE; >+ else if (line.Find("multipart/") >= 0) >+ m_isMultipart = PR_TRUE; >+ else if (line.Find("text/") == kNotFound) >+ m_isText = PR_FALSE; // We have disproved our assumption >+ } >+ >+ // TODO: make this work for nested multiparts (requires some redesign) >+ if (m_isMultipart && boundary.IsEmpty() && line.Find("boundary") >= 0) >+ { >+ PRInt32 start=line.Find("boundary=",PR_TRUE); >+ start += 9; >+ if (line[start] == '\"') >+ start++; >+ PRInt32 end = line.RFindChar('\"'); >+ if (end == kNotFound) >+ end = line.Length(); >+ >+ boundary.Assign("--"); >+ boundary.Append(Substring(line,start,end-start)); >+ } >+ >+ if (StringBeginsWith(line, NS_LITERAL_CSTRING("Content-Transfer-Encoding:")) && >+ line.Find("base64") >= 0) >+ m_base64encoded = PR_TRUE; Pushing the above chunk down to its own static function would make ApplyTransformations easier to read, I think.
Attachment #283791 - Flags: review?(dmose) → review-
Attached patch base64 patch, version 4 (obsolete) — Splinter Review
Hmm... I still find Javadoc comments easy to write. Using nsCAutoString, as you recommended, spliced out SniffMIME, and put in a lot more documentation.
Attachment #283791 - Attachment is obsolete: true
Attachment #284374 - Flags: review?(dmose)
Comment on attachment 284374 [details] [diff] [review] base64 patch, version 4 >+ // For non-multipart messages, the entire message minus headers is encoded >+ // ApplyTransformations can only decode a part >+ if (m_base64part) >+ { This seems to be relying on the behavior of ApplyTransformations resetting m_base64part at the end of the part. Presumably, if the message has been truncated, that won't happen. Maybe this wants to be if (!m_isMultipart && m_base64part) >-PRInt32 nsMsgBodyHandler::ApplyTransformations (nsCString &buf, PRInt32 length, PRBool &eatThisLine) >+/** >+ * This method applies a sequence of transformations to the message. >+ * >+ * It applies the following sequences in order >+ * @ Removes headers if the searcher doesn't want them >+ * @ Determines the current MIME type. >+ * @ Strips any HTML if the searcher doesn't want it >+ * @ Strips non-text parts >+ * @ Decodes any base64 part The above comments make for harder reading because it doesn't really apply transformations to the message: it applies transformations to the line of the message that is passed in, and also updates the state of various member vars and out/inout params. Cleaning this up would help readability, I think. >+ * @param line (inout) the current line >+ * @param length (in) the length of said line >+ * @param eatThisLine (out) whether or not to ignore this line >+ * @param partSoFar (inout) the part so far, needed for base64 decoding How about "the concatenation of the still-base64-encoded lines that have been seen so far while traversing the current MIME part"? >+PRInt32 nsMsgBodyHandler::ApplyTransformations (nsCString &line, PRInt32 length, >+ PRBool &eatThisLine, nsCString &partSoFar) > { > PRInt32 newLength = length; > eatThisLine = PR_FALSE; > >- if (!m_passedHeaders) // buf is a line from the message headers >+ if (!m_passedHeaders) // line is a line from the message headers > { > if (m_stripHeaders) > eatThisLine = PR_TRUE; >+ >+ SniffMIME(line); > >- if (StringBeginsWith(buf, NS_LITERAL_CSTRING("Content-Type:")) && buf.Find("text/html") >= 0) >- m_messageIsHtml = PR_TRUE; >- >- m_passedHeaders = buf.IsEmpty() || buf.First() == '\r' || buf.First() == '\n'; >+ m_passedHeaders = line.IsEmpty() || line.First() == '\r' || line.First() == '\n'; >+ >+ // We have already grabbed all worthwhile information from the headers, >+ // so there is no need to keep track of the current lines >+ partSoFar.Assign(line.get()); How about explicitly returning after this assignment? It feels like it would be less fragile in the long-term, and then you don't need the following |else| at all, which removes some a scope and contributes to readability. > } >- else // buf is a line from the message body >+ else // line is a line from the message body > { >- if (m_stripHtml && m_messageIsHtml) >+ PRBool partIsHtml = m_partIsHtml; >+ // Check to see if this is the boundary string => we ignore it This comment seems misleading: we take the opportunity to finish up the work associated with the part, possibly even doing the base64 decoding of the entire accumulated part as well as resetting some variables. >+ if (m_isMultipart && StringBeginsWith(line, boundary)) > { >- StripHtml (buf); >- newLength = buf.Length(); >+ if (m_base64part && m_partIsText) >+ { >+ Base64Decode(partSoFar); >+ // Work on the parsed string >+ line.Assign(partSoFar); >+ } >+ else >+ line.Truncate(); >+ >+ // Rest all assumed headers >+ m_base64part = PR_FALSE; >+ m_passedHeaders = PR_FALSE; >+ m_partIsHtml = PR_FALSE; >+ m_partIsText = PR_TRUE; >+ // And work on those... >+ eatThisLine = PR_TRUE; >+ } It appears that control will now fall-through to the "partSoFar.Assign(line.get());" and then return. Is doing that assignment necessary at all? In any case, the control flow is hard to see without careful reading and feels fragile. How about just explicitly returning here too (possibly doing the Assign first, if that's really required)? >+ if (!m_partIsText) >+ { >+ // Ignore non-text parts >+ line.Truncate(); >+ eatThisLine = PR_TRUE; >+ } >+ Shouldn't we just return immediately here? Otherwise it looks like we'll do a bunch of unnecessary processing. >+/** >+ * Determines the MIME type, if present, from the current line. >+ * >+ * m_partIsHtml, m_isMultipart, m_partIsText, m_base64part, and boundary are >+ * all set by this method at various points in time. >+ * >+ * @param line (in) a header line that may contain a MIME header >+ */ >+void nsMsgBodyHandler::SniffMIME (nsCString &line) How about calling this function SniffPossibleMIMEHeader in order to add to the readability of calling code? Unfortunately, I have to run to meet someone for dinner. I'll finish the review tomorrow. Sorry for the delay!
Comment on attachment 284374 [details] [diff] [review] base64 patch, version 4 >+/** >+ * Decodes the given base64 string. >+ * >+ * It returns its decoded string in its input. >+ * >+ * @param pBufInOut (inout) a buffer of the string >+ */ >+void nsMsgBodyHandler::Base64Decode (nsCString &pBufInOut) >+{ >+ char *decodedBody = PL_Base64Decode(pBufInOut.get(), pBufInOut.Length(), nsnull); >+ if (decodedBody) >+ pBufInOut.Adopt(decodedBody); >+ pBufInOut.ReplaceChar("\n\r",' '); // Searching does not like EOLs >+} One thing it's worth checking here is whether the allocators are correctly matched: PL_Base64Encode probably uses PR_Alloc under the hood; I'm not sure what the nsCString destructor in the external string API uses to free its storage. Otherwise, once the comments from last night are addressed, things should be in good shape.
Attachment #284374 - Flags: review?(dmose) → review-
Remember the value of Content-Transfer-Encoding, as well as mime types are not case sensitive. (Also true for the headers them selves I believe.)
Attached patch base64 patch, version 5 (obsolete) — Splinter Review
Version 5...
Attachment #284374 - Attachment is obsolete: true
Attachment #286085 - Flags: review?(dmose)
Comment on attachment 286085 [details] [diff] [review] base64 patch, version 5 >+ if (!m_partIsText) >+ { >+ // Ignore non-text parts >+ buf.Truncate(); >+ eatThisLine = PR_TRUE; >+ return 0; >+ } Looks like the above line has tabs instead of spaces. Otherwise; looks great! r=dmose with that that fixed. Thanks for your patience with my review turnaround.
Attachment #286085 - Flags: review?(dmose) → review+
Attachment #286085 - Flags: superreview?(bienvenu)
Comment on attachment 286085 [details] [diff] [review] base64 patch, version 5 I tried this patch out and it failed the first test I tried :-( I'll forward you the e-mail (as an attachment) so you can try creating a mailbox with that single message and try the search...
Attachment #286085 - Flags: superreview?(bienvenu) → superreview-
Attached patch Version 6 patch (obsolete) — Splinter Review
So the reason why it failed was because the message was a (breaths in) multipart/mixed message with a message/rfc822 inline part which is a multipart/alternative message... No base64 involved at all. :-( The message/rfc822 edge case has been resolved by assuming that message/* mime-types are really multipart/* messages (which makes your first test succeed).
Attachment #286085 - Attachment is obsolete: true
Attachment #289263 - Flags: review?(dmose)
Depends on: 37031
Comment on attachment 289263 [details] [diff] [review] Version 6 patch for StringBeginsWith, I think you're going to want to use nsCaseInsensitiveCStringComparator() because you can't count on the case you've specified.
e.g., the code that looks for Content-Type /Content-Transfer-Encoding, etc, should be case-insensitive. if (!m_passedHeaders) // line is a line from the message headers { if (m_stripHeaders) eatThisLine = PR_TRUE; // We have already grabbed all worthwhile information from the headers, // so there is no need to keep track of the current lines buf.Assign(line); SniffPossibleMIMEHeader(buf); m_passedHeaders = buf.IsEmpty() || buf.First() == '\r' || buf.First() == '\n'; return length; } why set eatThisLine to TRUE here? We don't ever check the value before returning...
sorry to do this one comment at a time... m_passedHeaders - does that mean m_pastHeaders? Or that the headers were passed in? If it means we've gone past the msg headers, we usually call that something like m_inMsgBody. And it looks like there are other places where we set eatThisLine and then return w/o checking the value...
oh, nm about eatThisLine - it's passed in by reference, sorry about that.
Comment on attachment 289263 [details] [diff] [review] Version 6 patch Since you're just making small changes to address David's comments, I don't think I need to re-review this. You can keep carrying my r+ forward unless you make fairly significant revisions.
Attachment #289263 - Flags: review?(dmose) → review+
Attached patch Version 7 patch (obsolete) — Splinter Review
Changed m_passedHeaders -> m_pastHeaders for enhanced readability and made the MIME header comparisons case-insensitive. r=dmose (carried over) per comment #40.
Attachment #289263 - Attachment is obsolete: true
Attachment #289565 - Flags: superreview?(bienvenu)
Attachment #289565 - Flags: review+
Attached patch Version 8 patchSplinter Review
Argggh... one tiny little omission. r+ carried over from dmose per comment #40 (again). Sorry for the continual spam and wastage of space on bugzilla with this 8th version of the patch.
Attachment #289565 - Attachment is obsolete: true
Attachment #290632 - Flags: superreview?
Attachment #290632 - Flags: review+
Attachment #289565 - Flags: superreview?(bienvenu)
Comment on attachment 290632 [details] [diff] [review] Version 8 patch thx, Joshua, that works a lot better. Thx for all the hard work!
Attachment #290632 - Flags: superreview? → superreview+
Keywords: checkin-needed
Checking in mailnews/base/search/public/nsMsgBodyHandler.h; /cvsroot/mozilla/mailnews/base/search/public/nsMsgBodyHandler.h,v <-- nsMsgBodyHandler.h new revision: 1.12; previous revision: 1.11 done Checking in mailnews/base/search/src/nsMsgBodyHandler.cpp; /cvsroot/mozilla/mailnews/base/search/src/nsMsgBodyHandler.cpp,v <-- nsMsgBodyHandler.cpp new revision: 1.31; previous revision: 1.30 done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M10
V fixed with TB 3a1pre-1201, Win2K. Thanks all!
Status: RESOLVED → VERIFIED
Depends on: 411988
Product: Core → MailNews Core
I read the whole thread through, but it didn't clear to me whether this bug should be eliminated in ThunderBird 2.0.0.16 (20080708) or not. My searches on base64 encoded messages are still fails with the above version on XP sp3 + all post SP3 patches. I can attach sample mail if required. The letter Content-Type is multipart/mixed; there are two actual message part the first is a text/plain body part (base64 encoded) and the second one is an application/x-zip-compressed attachment (base64 encoded).
No, 3.0 builds starting 2007-12-01 only.
Attached file Base64 encoded mail
Version 3.0a3 don't seems to find 'Monsieur' in this mail
Attachment #349381 - Attachment mime type: application/x-mimearchive → text/plain
(In reply to comment #49) > Created an attachment (id=349381) [details] > Base64 encoded mail > > Version 3.0a3 don't seems to find 'Monsieur' in this mail Oh, yuck, MIME's being its stupid old case-insensitive self again.
I don't think this bug is solved, as Joshua stated, there is still a case sensitivity problem.
Reopening. I suspect this doesn't need to block, since (afaik) our default search plan for Tb3 is Gloda-based.
Status: VERIFIED → REOPENED
Flags: wanted-thunderbird3?
Resolution: FIXED → ---
Blocks: 471328
Note that this bug also applies to Thunderbird's Quick Search feature (both for searching the body and the subject header). I had entered it as a separate bug (471328), but it has been marked as a duplicate of this one, even though it's not about the local body search and also affects subject searches.
No longer blocks: 471328
Sure would be nice to get this fixed once and for all...
Flags: wanted-thunderbird3? → wanted-thunderbird3+
Whiteboard: [case sensivity issue remaining]
This patch should fix the case-insensitivity problems.
Attachment #388104 - Flags: superreview?(bugzilla)
Attachment #388104 - Flags: review?(bugzilla)
Attachment #388104 - Flags: superreview?(bugzilla)
Attachment #388104 - Flags: superreview+
Attachment #388104 - Flags: review?(bugzilla)
Attachment #388104 - Flags: review+
Comment on attachment 388104 [details] [diff] [review] Fix case-sensitivity and test b/mailnews/base/test/unit/test_searchBody.js ... >+/* >+ * Testing of tag search features. >+ * >+ * Specifically tests changes implemented in bug 217034 >+ * Does not do comprehensive testing. >+ * >+ */ This comment needs updating. r/sr=Standard8 with that fixed.
Pushed as changeset 3108:561f6f3354fb, with minor comment updates in test_searchBody.js.
Status: REOPENED → RESOLVED
Closed: 17 years ago15 years ago
Resolution: --- → FIXED
Whiteboard: [case sensivity issue remaining]
I'm seeing this very same issue on version 45.1.1 over 14 years after the initial report... was this ever resolved or was this fix rolled back at some point? Thanks!
(In reply to Christian Dümmer from comment #60) > I'm seeing this very same issue on version 45.1.1 over 14 years after the > initial report... was this ever resolved or was this fix rolled back at some > point? Thanks! yes (you can see the patch listed in the bug. And no. For current issues see bug 1167972 and bug 1259534. No need to comment in them unless you have new information or a patch to contribute.
See Also: → 1259534
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: