Open Bug 521649 Opened 15 years ago Updated 6 months ago

Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (searching unparsed HTML entities &uuml; etc. in text/html), but succeeds for same msg when received (as text/plain, charset=ISO-8859-1)

Categories

(Thunderbird :: Search, defect)

x86
Windows XP
defect

Tracking

(Not tracked)

People

(Reporter: thomas8, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5pre) Gecko/20091010 Shredder/3.0pre

STR (test mail attached)
1) compose mail containing words with umlauts (ä,ö,ü) in body text
2) save as draft
3) in draft folder, select "Message body filter" quicksearch and
4) filter for word with German umlauts (ä,ö,ü, e.g. Münster) that is in the body
5) do the same on msg after having received it (see second attachment)

expected
4) and 5): msg body filter should find the msg

actual
4) msg body filter does not find the draft msg
5) msg body filter finds the same msg after it was received
Summary: Quick Search "Message body filter" does not find message text with umlauts in saved drafts (Character-encoding?) → Quick Search "Message body filter" does not find message text with umlauts in saved drafts messages (Character-encoding?)
This is basically the same message as testmail1, but after receiving it in inbox.
Thomas, does (ä,ö,ü) still fail?
Summary: Quick Search "Message body filter" does not find message text with umlauts in saved drafts messages (Character-encoding?) → Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (Character-encoding?)
it's WFM with current nightly
Blocks: tb-drafts
(In reply to Wayne Mery (:wsmwk) from comment #2)
> Thomas, does (ä,ö,ü) still fail?
Flags: needinfo?(bugzilla2007)
Yes, this still fails in both TB 24 and Trunk (32.0a1 (2014-05-01)).

This obviously depends on how the umlauts are saved in the draft, e.g. the word Münster:

In TB 24, composing new msg:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    M&uuml;nster<br>

Quick filtering for ü fails, regardless of containing folder (after copying the draft into other folders).
Quick filtering for &uuml; (sic) succeeds.
FWIW, that's on a German version of TB 24 sharing a profile with an English version of TB 24.
Not sure if that can cause confusion in language settings?
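
The TB 24 result above is what one would expect if the filter matched against the undecoded HTML source. A minimal Python sketch of that symptom, assuming (this is not confirmed by the Thunderbird code) that the filter is a plain case-insensitive substring test on the raw body; `raw_body` and `naive_body_match` are hypothetical names used only for illustration:

```python
# Raw HTML body line as saved in the TB 24 draft (copied from the source above).
raw_body = "M&uuml;nster<br>"


def naive_body_match(body: str, term: str) -> bool:
    """Hypothetical stand-in for a filter that searches the raw source text."""
    return term.lower() in body.lower()


# The literal character never occurs in the raw source, so this fails.
print(naive_body_match(raw_body, "ü"))       # False
# The entity name does occur in the raw source, so this succeeds.
print(naive_body_match(raw_body, "&uuml;"))  # True
```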

In Trunk, composing new msg:

<meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Münster

Or at least that's what Ctrl+U msg source viewer shows, which is probably a bug in the source viewer.
If saved as .eml and then opened in Notepad++, it correctly shows as UTF-8 and contains the word "Münster". But search still fails, see below.
That's on English Daily, profile should be reasonably clean.

Quick filtering for ü fails, regardless of folder (after copying the draft around)
Quick filtering for ü fails.
Quick filtering for &uuml; succeeds when they are in source (not applicable here).
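
One possible explanation for the Trunk case, offered only as an assumption, is that the UTF-8 bytes of the body are compared without being decoded as UTF-8 (for example, read as a single-byte charset). The Python sketch below shows why a search for ü would then miss; it does not reproduce Thunderbird's actual search code.

```python
# The word as stored on disk in a UTF-8 message body.
raw_bytes = "Münster".encode("utf-8")

# Assumption for illustration: the bytes are read as ISO-8859-1 instead of UTF-8.
misread = raw_bytes.decode("iso-8859-1")
# Correct handling: decode with the charset declared in the message.
decoded = raw_bytes.decode("utf-8")

# "ü" is two bytes in UTF-8, so the misread text is one character longer
# and no longer contains the single character "ü".
print(len(misread), len(decoded))  # 8 7
print("ü" in misread)              # False
print("ü" in decoded)              # True
```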
Flags: needinfo?(bugzilla2007)
For later duping
See Also: → 344130
See Also: → 1042681
As bug 1427124 shows, this doesn't have anything to do with drafts, but with messages which have a plaintext and an HTML part at the same time, like all drafts.
See Also: → 1427124
Summary: Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (Character-encoding?) → Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (since they are multipart)
(In reply to Jorg K (GMT+1) from comment #7)
> As bug 1427124 shows, this doesn't have anything to do with drafts, but with
> messages which have plaintext and HTML part at the same time, like all
> drafts.

???

I described correctly what I saw at the time, and the evidence is still attached. This had everything to do with drafts at the time of reporting, because the same msg failed when searching the saved draft, but succeeded when searching the received message. And I don't see any multipart in either test message, both have only one part, and both are MIME messages.
In a way this is another variation/symptom of the downgrading HTML to plain text saga, only this time plain text won for successful quick filtering, and HTML failed at the time of reporting.

My Comment 5 (almost 5 years later, so might not exactly apply to test cases from 8 years ago) correctly points to the most likely cause of this at the time, which is encoding (so I don't see why you removed that from the summary):
Draft = HTML -> &uuml; in source -> search for "ü" fails, but search for "&uuml;" succeeds -> searching raw text/HTML
Received = plaintext -> some other encoding (charset=ISO-8859-1) -> search succeeds for that text/plain encoding/charset.
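
The two cases can be contrasted with Python's standard `email` package. The messages below are simplified reconstructions, not the attached test cases: the headers, the transfer encodings and the helper name `body_text` are assumptions for illustration. Charset decoding alone is enough for the text/plain message, while the text/html draft still carries the entity after decoding.

```python
from email import message_from_bytes, policy

# Hypothetical draft: text/html with the umlaut stored as an HTML entity.
draft_raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"<html><body>M&uuml;nster<br></body></html>\r\n"
)

# Hypothetical received copy: text/plain with the umlaut as an ISO-8859-1 byte.
received_raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Münster\r\n".encode("iso-8859-1")
)


def body_text(raw: bytes) -> str:
    """Parse a single-part message and decode its body using the declared charset."""
    msg = message_from_bytes(raw, policy=policy.default)
    return msg.get_content()


print("ü" in body_text(draft_raw))     # False: body still contains "&uuml;"
print("ü" in body_text(received_raw))  # True: charset decoding yields "ü"
```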

That's essentially the same as what you're saying in bug 1427124, comment 5:
> Most likely the search is done on the raw UTF-8, so only ASCII text is found.

I don't see the link of that with multipart messages, can you enlighten me?
Pls don't just make me look wrong without reading the bug and testcases.
Summary: Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (since they are multipart) → Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (searching unparsed entities in text/html), but succeeds for same msg when received (as text/plain, charset=ISO-8859-1)
So from here we need to revisit the testcases and see what we're doing today under the same circumstances.
Attachment #405765 - Attachment mime type: message/rfc822 → text/plain
Attachment #405766 - Attachment mime type: message/rfc822 → text/plain
Wow, you're right, I didn't look at the test cases from back then. And drafts aren't even multipart :-(
So I was all wrong. That said, I have no idea how you managed to get
  M&uuml;nster ist eine der sch&ouml;nsten St&auml;dte der Welt.
into the draft. But yes, that wouldn't be found.

Sorry about the confusion and my mistake.

The basic problem is another facet of bug 1259534: We search some raw data instead of converting it into un-escaped and decoded text first.
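
As a rough sketch of what "un-escaped and decoded text" could mean for an HTML body, the Python snippet below strips tags and resolves entities before searching. `TextExtractor` and `searchable_text` are hypothetical names, and this is not how Thunderbird implements its search; it only illustrates the conversion step.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, with entities resolved."""

    def __init__(self):
        # convert_charrefs=True turns references such as &uuml; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def searchable_text(html_source: str) -> str:
    """Return the tag-free, entity-decoded text of an HTML body."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


source = "<body>M&uuml;nster ist eine der sch&ouml;nsten St&auml;dte der Welt.</body>"
text = searchable_text(source)
print("Münster" in text)  # True
```

Searching `text` instead of `source` would let a query for "ü" match the draft shown above.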
(In reply to Jorg K (GMT+1) from comment #10)
> Wow, you're right, I didn't look at the test cases from back then. And
> drafts aren't even multipart :-(
> So I was all wrong. That said, I have no idea how you managed to get
>   M&uuml;nster ist eine der sch&ouml;nsten St&auml;dte der Welt.
> into the draft.

Wasn't me, it was Thunderbird (at the time, long back, but I was already there...)

> But yes, that wouldn't be found.

Even today, in 2017. Just tested. And then, it's not all that hard to get &uuml; in source when importing .eml messages not created by TB...

> Sorry about the confusion and my mistake.

No problem, thanks.

> The basic problem is another facet of bug 1259534: We search some raw data
> instead of converting it into un-escaped and decoded text first.

Yes. That's an ugly bug that should be terminated. I know it's a multipart (pun intended) hydra, but cutting off a head here and there might one day kill the beast. Alternatively, blast the whole thing away and start reassembling phoenix from the ashes... Ah well, just dreaming... :|
Summary: Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (searching unparsed entities in text/html), but succeeds for same msg when received (as text/plain, charset=ISO-8859-1) → Quick Search "Message body filter" does not find message text with umlauts (ä,ö,ü) in saved drafts messages (searching unparsed HTML entities &uuml; etc. in text/html), but succeeds for same msg when received (as text/plain, charset=ISO-8859-1)
Severity: normal → S3

Bug 1855637 looks very similar to this

Duplicate of this bug: 1855637