Closed Bug 1509493 Opened 6 years ago Closed 6 years ago

Parsing bug leads to duplication of characters in plain-text emails

Categories

(Core :: Networking, defect, P2)

60 Branch
Desktop
All
defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: fabian, Assigned: fabian)

Details

(Whiteboard: [necko-triaged])

Attachments

(4 files, 5 obsolete files)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36

Steps to reproduce:

View a plain-text mail with the following body content in Thunderbird's mail view:
```
johndoe@gmail.com)@gmail.com
```


Actual results:

The content of the mail is shown as:
```
johnjohndoe@gmail.com)@gmail.com
```
where `john` is a hyperlink with target `mailto:johnjohndoe@gmail.com` and the rest of the body is a hyperlink with target `mailto:johnjohndoe@gmail.com)@gmail.com`. When inspecting the generated HTML markup with a debugger, the anchor tag for the first hyperlink is left dangling.


Expected results:

The plain-text part of the content as shown by Thunderbird should agree with the actual plain-text content of the mail. Thunderbird may add some hyperlink markup with mailto: links, but it should not change the textual representation of the mail.

This behavior occurs because `mozTXTToHTMLConv.cpp::CalculateURLBoundaries` incorrectly calculates the boundaries of URLs/mail addresses that are nested (as in the example). In order to avoid loops, it calculates the boundaries of the prefix without taking URLs into account. Thus, when the second `@` character is encountered, it cuts into the HTML markup that had already been inserted when processing the address `johndoe@gmail.com`. Since the markup generated from this address ends in `johndoe@gmail.com</a>` and `</a>` is four characters long, this results in the first four characters of the address remaining together with the opening anchor tag.
OS: Unspecified → All
Hardware: Unspecified → Desktop
Confirmed. Grrr, that damn mozTXTToHTMLConv again :-( - That code is actually in Mozilla core. Since you looked at it, can you submit a fix? Also please add a unit test to test_mozTXTToHTMLConv.js (that's the easy bit).
Status: UNCONFIRMED → NEW
Component: Message Reader UI → Networking
Ever confirmed: true
Product: Thunderbird → Core
Version: 60 → 60 Branch
I think that the current single-pass implementation with its recursive calls is hard to get right and reason about. I would propose switching to the following two-pass algorithm, which should not require too many changes:

1. Iterate over the plain-text message as before and store start, end and replacement string, but don't make any changes yet.
2. Sort the triples of start, end and replacement string first by start, then by (end - start).
3. Go through the string and greedily apply non-overlapping replacements in the order given by 2.

What do you think?
Not my code or responsibility. Original author: BenB, current reponsibility: Core::Netwerk people. You can attach a patch with a test case and someone will take a look in due course. I'm just the poor sod reading most bug reports and getting them channelled the right way (and fix some if I can).
Flags: needinfo?(ben.bucksch)
Bug confirmed. Minor severity, because this is malformed to start with. Garbage in, garbage out.
Fabian, if you have a small and specific patch that does not change the existing logic (which might break other cases), then great.
Severity: normal → minor
Flags: needinfo?(ben.bucksch)
(It would be great if someone from the TB team just took the component and rewrote it with mozilla::Tokenizer)
That comment looks like a déjà vu ;-) - See bug 1274242 comment #22 from May 2016.
The bug can also be triggered by more natural strings like:

(johndoe@gmail.com)johndoe@gmail.com

Log messages containing such substrings showed up in log files of an email server I maintain, and led to plenty of confusion when sent to Thunderbird users. 

The CalculateURLBoundaries function contains a comment /* prevent loops */ which supposedly explains why ScanTXT has to be called without URL checks enabled. Does anybody have a concrete example for why this is needed? If not, then my first attempt at a fix would just be to call the full ScanTXT, but if this would cause a regression, it would likely be observed in terms of degraded performance and thus not caught by the tests.
Flags: needinfo?(ben.bucksch)
Depends on D13389
In mozTXTToHTMLConv, FindURL is not able to correctly calculate replaceBefore for nested email addresses/URLs such as john@doe.org}john@doe.org. As a workaround, we keep track of the end of the last URL HTML markup in the output string and skip any subsequent URLs whose replaceBefore would cut into this markup.

Depends on D13390
Flags: needinfo?(ben.bucksch)
Attachment #9028597 - Flags: review+
Comment on attachment 9028598 [details]
Bug 1509493 - Do not include unmatched ')' in email addresses

r+ after a comment change
Attachment #9028598 - Flags: review+
Comment on attachment 9028599 [details]
Bug 1509493 - Fix markup generation for nested email addresses

r+ after a comment change
Attachment #9028599 - Flags: review+
Sadly we don't have approval rights in Core::Netwerk. Valentin or Honza, can you please take a look.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
You want to land this stuff?  If so, then you only need to add the checkin-needed keyword.
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Hold on, I can land it, but it needs review by a Core::Netwerk peer. Sure, BenB is the original author (and the patches look good to me as well), but formally they need approval by an M-C peer.
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Assignee: nobody → fabian
Status: NEW → ASSIGNED
I've reviewed changes to this component since decades :).
you can land this. Thanks.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
OK, but somehow Lando is declining landing this, something not accepted in https://phabricator.services.mozilla.com//D13391.
(In reply to Jorg K (GMT+1) from comment #18)
> OK, but somehow Lando is declining landing this, something not accepted in
> https://phabricator.services.mozilla.com//D13391.

For some reason it appears as one parent revision with two children.
You can land https://lando.services.mozilla.com/D13390/ first, then then https://phabricator.services.mozilla.com/D13391/
My bad, I messed up the hierarchy. It should have been reverted to the default now.
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/integration/autoland/rev/5e4a5cf81c26
Provide basic test coverage for email address parsing r=BenB
https://hg.mozilla.org/integration/autoland/rev/6498765e1d65
Do not include unmatched ')' in email addresses r=BenB
https://hg.mozilla.org/integration/autoland/rev/05562b7d3eff
Fix markup generation for nested email addresses r=BenB
Landed. Fabian, since you're an expert in this code now, maybe had over to bug 1505911. Feel free to adopt my test enhancements and fix the code, it shouldn't be hard.
The test failure is on Windows 7:

17:43:10     INFO -  Unexpected conversion by scanTXT: input=RFC1738: <URL:mailto:john.doe+test@mozilla.org> then, output=RFC1738: &lt;URL:mailto:john.doe+test@mozilla.org&gt; then, link= href="mailto:john.doe+test@mozilla.org"`

The mailto: link is not recognized. The only potentially OS dependent part of the code seems to be ShouldLinkify, is it known to behave differently under Windows?
Flags: needinfo?(fabian) → needinfo?(ben.bucksch)
I think you're on very dangerous ground here. Looks like the test only failed on Win7 and not Win10. Running it locally in a TB binary it passes on Win10, but then TB also provides a mailto handler. For FF the handler should be external
https://searchfox.org/mozilla-central/rev/e22c0d152060f4f8d4ca8904094f15f65a1b6f93/browser/app/profile/firefox.js#658
and who knows what |rv = externalHandler->ExternalAppExistsForScheme(scheme, &exists);| will return.

I think your best bet is to exclude Windows from the test using AppConstants.platform == "win".
> The only potentially OS dependent part of the code seems to be ShouldLinkify, is it known to behave differently under Windows?

ShouldLinkify() checks whether the concrete URL scheme is valid and can be handled, based on which URL handlers are installed on the user's system. This is an important feature. This way, we don't offer links that won't work when clicked, and we can support syntactically valid URLs of schemes we don't know yet, be it tel: or skype: or what have you.

What's surprising that it fails for mailto:. That shouldn't fail, and I've never heard of that. It might be because of the build machine setup, that it doesn't have a mail app installed, and Thunderbird OS integration hasn't run, because this is just unit tests. So, this is not a realistic case.

Still, I think it would make sense to hardcode "mailto", "http" and "https" and OK these 3 schemes, and not look up URL scheme handlers for them, and just return OK for them. This would be in ShouldLinkify() after the call to ExtractScheme().
Flags: needinfo?(ben.bucksch)
(In reply to Ben Bucksch (:BenB) from comment #26)
> Still, I think it would make sense to hardcode "mailto", "http" and "https"
> and OK these 3 schemes, ...
What happens if the code runs in FF and there is no mailto handler on the system?
> I think you're on very dangerous ground here. Looks like the test only failed on Win7 and not Win10. Running it locally in a TB binary it passes on Win10, but then TB also provides a mailto handler.

This is probably because Win10 ships with a mail application, whereas Win7 does not.

> I think your best bet is to exclude Windows from the test using AppConstants.platform == "win".

I feel a bit uneasy about this. Why should we be able to assume that other platforms will continue to offer the "mailto:" scheme by default?

> What happens if the code runs in FF and there is no mailto handler on the system?

Based on my understanding of the code, this would simply amount to rendering a link with an unsupported scheme, which an HTML mail could do anyway. Would this be considered an issue in certain situations? When clicked, a link with an unregistered scheme leads to a "The address wasn’t understood" error message in Firefox, which sounds reasonably descriptive.

> Still, I think it would make sense to hardcode "mailto", "http" and "https" and OK these 3 schemes, and not look up URL scheme handlers for them, and just return OK for them. This would be in ShouldLinkify() after the call to ExtractScheme().

I would prefer this solution, as it will guarantee consistent unit test behavior even in the face of potential platform changes in the future.
Attachment #9028910 - Attachment is obsolete: true
Attachment #9028911 - Attachment is obsolete: true
Priority: -- → P2
Whiteboard: [necko-triaged]
Please rebase all patches here. M-C reformatted all 10,000+ C++ files in the tree in bug 1511181 and also bug 1505911 caused conflicts on the test file.
Depends on D13644
In mozTXTToHTMLConv, FindURL is not able to correctly calculate replaceBefore for nested email addresses/URLs such as john@doe.org}john@doe.org. As a workaround, we keep track of the end of the last URL HTML markup in the output string and skip any subsequent URLs whose replaceBefore would cut into this markup.

Depends on D13645
I have rebased the commits, creating new Phabricator diffs for the three commits that had already been submitted for landing previously.

Possibly due to the changes in bug 1505911 (change to the signature of ScanTXT), moz build currently fails for me and I haven't been able to run the tests locally.
Valentin, in https://phabricator.services.mozilla.com//D13511 they suggest to add this code:
  nsAutoCString scheme;
  if (scheme == "http" || scheme == "https" || scheme == "mailto") {
    return true;
I'm really surprised that that even compiles. Can you help my confusion?

(In reply to Fabian Henneke from comment #36)
> Possibly due to the changes in bug 1505911 (change to the signature of
> ScanTXT), moz build currently fails for me and I haven't been able to run
> the tests locally.
Well, FF and TB compile so what's the problem if you rebased your patches properly?
Flags: needinfo?(valentin.gosu)
I've applied
D13645.patch
D13646.patch
D13644.patch
D13511.patch
and it all compiles and |mach xpcshell-test netwerk/test/unit/test_mozTXTToHTMLConv.js| passes. I thinks it's good to go after we clarify the string comparison issue.
(In reply to Jorg K (GMT+1) from comment #38)
> I've applied
> D13645.patch
> D13646.patch
> D13644.patch
> D13511.patch
> and it all compiles and |mach xpcshell-test
> netwerk/test/unit/test_mozTXTToHTMLConv.js| passes. I thinks it's good to go
> after we clarify the string comparison issue.

Never mind, I forgot to also hg pull /comm. Builds and passes tests for me now. If you don't mind, please submit for a try run in my stead, as I don't have the relevant commit access yet.
(In reply to Jorg K (GMT+1) from comment #37)
> Valentin, in https://phabricator.services.mozilla.com//D13511 they suggest
> to add this code:
>   nsAutoCString scheme;
>   if (scheme == "http" || scheme == "https" || scheme == "mailto") {
>     return true;
> I'm really surprised that that even compiles. Can you help my confusion?

I suspect one of the string implementations has an operator== defined, but I can't find it right now.
But I did find another similar instance so this should be fine:
https://searchfox.org/mozilla-central/rev/2edebf41a2b2e624859bf15ed5a98bdd002f73b4/dom/cache/DBSchema.cpp#2486
All green. Ben can you stick the last r+ on and I'll re-land. Fabian, I hope the dependencies are set correctly.
Attachment #9028883 - Flags: review+
Attachment #9029207 - Flags: review+
Comment on attachment 9029206 [details]
Bug 1509493 - Provide basic test coverage for email address parsing

Sorry that I missed one
Attachment #9029206 - Flags: review+
Attachment #9029208 - Flags: review+
Comment on attachment 9028597 [details]
Bug 1509493 - Provide basic test coverage for email address parsing

Fabian, you can mark supercedes patches with the "obsolete" flag in the attachment details, like I'm doing here.
Attachment #9028597 - Attachment is obsolete: true
Attachment #9028598 - Attachment is obsolete: true
Attachment #9028599 - Attachment is obsolete: true
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/integration/autoland/rev/51e67c682e39
Always linkify common URL schemes in plain text r=BenB
https://hg.mozilla.org/integration/autoland/rev/b4808c4e7d61
Provide basic test coverage for email address parsing r=BenB
https://hg.mozilla.org/integration/autoland/rev/00d7ff2eeb0a
Do not include unmatched ')' in email addresses r=BenB
https://hg.mozilla.org/integration/autoland/rev/95e75223f6c7
Fix markup generation for nested email addresses r=BenB
Looking at this https://hg.mozilla.org/integration/autoland/rev/95e75223f6c7#l2.14 again:
+      input: "bug#1509493 (john@mozilla.org)john@mozilla.org test",
+      url: "mailto:john@mozilla.org",
+      text: "john@mozilla.org"
Would it have been better to have two different e-mail addresses here? Which one does it pick for URL and text?
Not really. If it linkyfies only one of them, or both, that's both correct behavior. What's wrong is if it takes one of the () into the linked text. That was the bug, and that's what the test checks. Insofar, it's a good test for this specific bug, without producing false positives.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: