Closed Bug 1692771 Opened 10 months ago Closed 8 months ago

Simple HTML message with correctly formatted hyperlink <a href="URI">URI</a> gets sent as plaintext with ugly link duplication: URI <URI>

Categories

(MailNews Core :: Composition, defect)

Thunderbird 86
defect

Tracking

(thunderbird_esr78 wontfix, thunderbird88 fixed)

RESOLVED FIXED
89 Branch
Tracking Status
thunderbird_esr78 --- wontfix
thunderbird88 --- fixed

People

(Reporter: peter_bugzilla, Assigned: rnons)

References

Details

(Keywords: regression)

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0

Steps to reproduce:

In message body create a link. Put "https://bugzilla.mozilla.org" in both the text and the Link Location fields.

Actual results:

Link text is displayed in blue as expected, but when the mail is sent "https://bugzilla.mozilla.org https://bugzilla.mozilla.org" is inserted into the text with no formatting to indicate it's a link.

Expected results:

We should send something like "<a href="https://bugzilla.mozilla.org">https://bugzilla.mozilla.org</a>"
This started happening sometime last year I think. I've tested it with 86.0b2 (64-bit)

I don't see any problem using the 86.0b3 release candidate on Fedora 33 Workstation.
The link appears only once in the sent and received message.
I view messages as Original HTML.

If I view it as Plain Text there are two clickable links.

The source of my test message.

This is a multi-part message in MIME format.
--------------ulpaOwK54HUjAvgnJrqcMRyO
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

This is a test.

https://bugzilla.mozilla.org https://bugzilla.mozilla.org/

--------------ulpaOwK54HUjAvgnJrqcMRyO
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>
<body>
<p><font size="+1">This is a test.</font></p>
<p><font size="+1"><a moz-do-not-send="true"
href="https://bugzilla.mozilla.org/">https://bugzilla.mozilla.org</a></font><br>
</p>
</body>
</html>
--------------ulpaOwK54HUjAvgnJrqcMRyO--

(In reply to WaltS48 [:walts48] from comment #1)

I don't see any problem using the 86.0b3 release candidate on Fedora 33 Workstation.

Hey Walt, lucky you are, but I think you may not be using default settings.

Default is [x] Send messages as plaintext if possible, and the stupid algorithm thinks a link <a href...> where href == linktext isn't worth preserving and can be emitted as plaintext. So I'm not sure how you got HTML + plaintext.

Now if there's only plaintext, the double link is shown to any recipient by default, and it's looking worse than in your comment (second link with angle brackets) because you forgot to markdown the source as code block (with triple single quote marks) here in your comment, so it gets shown different on BMO.

This is the real resulting source of outgoing message (from outbox), and it looks amateurish on screen with the double link to say the least!

Content-Language: en-US
From: jd <johndoe@example.com>
Subject: test link with <a href...>
To: undisclosed-recipients: ;
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

This is a test

https://bugzilla.mozilla.org/ <https://bugzilla.mozilla.org/>
Status: UNCONFIRMED → NEW
Component: Untriaged → Backend
Ever confirmed: true
Keywords: dupeme
Product: Thunderbird → MailNews Core
See Also: → 780365
Summary: Not formatting links correctly → Message with correctly formatted hyperlink <a href="URI">URI</a> gets sent as plaintext with ugly link duplication: URI <URI>
Summary: Message with correctly formatted hyperlink <a href="URI">URI</a> gets sent as plaintext with ugly link duplication: URI <URI> → Simple HTML message with correctly formatted hyperlink <a href="URI">URI</a> gets sent as plaintext with ugly link duplication: URI <URI>

Okay, new profile.
Created POP3 and IMAP accounts

Send options are Send messages as plain text if possible.

Composed a message in HTML used Insert > Link with https://bugzilla.mozilla.org in the "Enter text to display for the link:" field in the POP3 account.

Entered https://bugzilla.mozilla.org in the Link Location field.

Sent it to the IMAP account and the received message still only displays the link once.

Test #2

Without typing anything in the text field and only typing https://bugzilla.mozilla.org in the Location field I do get both in the received message.

Subject: test 2
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

https://mozilla.bugzilla.org https://mozilla.bugzilla.org

I guess I also follow directions when I compose an email and enter only text in the "Enter text to display for the link:" field, not the link I want the text to link to in the email.

If I entered text and a link like instructed and the text didn't work as a link then I would consider it a bug.

Sorry, I got involved.

See Also: → 414299

Happened to me yesterday - but it's the first time I've seen this

WaltS48, thanks for the details, including specific reproductions where it works and where it does not work in comment 3, and for the sample message in comment 1.

When you just type or paste a URL as plaintext in the HTML editor (which is what most people do), without manually marking it as link, we detect the plaintext URL within the HTML text nodes (using the scanHTML() recognizer) and make it a link. This works as expected.

In the sample message, in the HTML part, you see

<a moz-do-not-send="true" href="https://bugzilla.mozilla.org/">https://bugzilla.mozilla.org</a>

The moz-do-not-send="true" means: When we downconvert to plaintext and see this attribute, we should output only the link text and not the link target. It's all correct until here.

(We do it in such an explicit way, instead of relying on heuristics like target == linktext, because we also convert other forms of URLs where the linktext does not match the URL precisely, but only vaguely, e.g. linktexts "foobar.com", "http://foobar.com" etc. give URL https://foobar.com . Instead of relying on heuristics that would reverse the recognition, we unabiguously mark which <a href> was artificially inserted.)

However, the moz-do-not-send="true" seems to get lost on the way.

This used to work, but apparently broke recently.

Probably, the mail sending code in Thunderbird was rewritten recently and broke it. I suspect that this broke simply during the recent C++ to JS rewrite of the mail send code.

This is not a bug in the HTML<->plaintext converters, but it's a regression in the sending process. Something gets lost between the call to scanHTML() and the HTML->plaintext converter.

Component: Backend → Composition
Keywords: dupeme
OS: Unspecified → All
Hardware: Unspecified → All

Ping Chen, you were working on MessageSend.jsm a lot recently. I would suspect that it's a regression from that. Could you please take a look at this, where this information gets lost?

Assignee: nobody → remotenonsense
Keywords: regression

This seems to be an old bug, I can reproduce on 78. It's related to nsIParserUtils and nsIDocumentEncoder.OutputFormatted. To reproduce in the devtools,

parserUtils = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils)
console.log(parserUtils.convertToPlainText('<a href="https://a.b">https://a.b</a>', Ci.nsIDocumentEncoder.OutputFormatted, 80))
// https://a.b <https://a.b>
console.log(parserUtils.convertToPlainText('<b>text</b>', Ci.nsIDocumentEncoder.OutputFormatted, 80))
// *text*

This is almost equivalent to https://searchfox.org/comm-central/rev/06250493d40d4007238a8cd0a486cb4c5176094a/mailnews/base/src/nsMsgUtils.cpp#1789,1794. Without Ci.nsIDocumentEncoder.OutputFormatted, we get

console.log(parserUtils.convertToPlainText('<a href="https://a.b">https://a.b</a>', 0, 80))
// https://a.b 
console.log(parserUtils.convertToPlainText('<b>text</b>', 0, 80))
// text

Notice for a text/plain mail, *text* is rendered as bold in the message pane.

The important part is the moz-do-not-send="true" attribute, but you're right, even with that attribute, the duplicate URL is still output:

parserUtils = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils)
console.log(parserUtils.convertToPlainText('<a moz-do-not-send="true" href="https://a.b">https://a.b</a>', Ci.nsIDocumentEncoder.OutputFormatted, 80))
// returns https://a.b <https://a.b>

Turns out that I was confused, and the attribute is actually supposed to be class="moz-txt-link-*", e.g. class="moz-txt-link-freetext". scanHTML() inserts that and nsPlainTextSerializer notices that and skips it. That's there specifically to prevent this bug here, and I know it used to work, but it broke at some point.

See:

parserUtils = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils)
console.log(parserUtils.convertToPlainText('<a class="moz-txt-link-freetext" href="https://a.b">https://a.b</a>', Ci.nsIDocumentEncoder.OutputFormatted, 80))
// returns https://a.b

which returns the correct result.

If the link was indeed generated by ScanHTML() (it should be, but it doesn't seem to be the case), the question would be:

  • Where does the class="moz-txt-link-freetext" get lost?
  • What inserts the moz-do-not-send="true" attribute?

So, what generates that <a moz-do-not-send="true"> link? It's definitely not scanHTML(), otherwise this bug wouldn't exist. Is scanHTML() called at all? It should be called (and used to be called) after the HTML composer closes and before SMTP, to make links clickable that the user didn't manually mark.

Can we define the expected output of <a href="$1">$2</a> in a text/plain part first? The way GMail handles it makes sense to me

  1. if $1 != $2, output $2 <$1>
  2. if $1 == $2, output $1

In the EdLinkProps dialog, we can check if $1 == $2 and maybe set class="moz-txt-link-freetext" accordingly. However, user can still change $2 in the mail editor later. So even if $1 == $2 at first, and we correctly set class="moz-txt-link-freetext", $2 can become different from $1 before sending, and $1 will be lost in the sent text/plain output.

So I think we need an extra step before sending, to check every <a> element, and set class="moz-txt-link-freetext" if $1 == $2, and remove the classname otherwise.

Can we define the expected output of <a href="$1">$2</a> in a text/plain part first?

That's not the bug here. It's not as simple as that. We also have:

(Please note that bugzilla mis-recognizes the links here. Our mozTXTtoHTMLConverter.scanHTML() does this better.)

You cannot just check URL == link text. That will not work.

Please see my comment 5. Even if you fix URL == link text, the bug here is not fixed. The link recognition (plain URL -> link) and the HTML->TXT conversion must be round trips. We achieve that with the HTML classes. The bug here is that the class is missing.

Which code is inserting the links here in this case during HTML email composition, if the user does not explicitly mark the URL as link?
Is scanHTML() being used? If not, why not? That's the bug here.

Well it's not as simple as checking $1 == $2, I think it's still useful for the discussion. Isn't this bug about html -> txt conversion?

Which code is inserting the links here in this case during HTML email composition, if the user does not explicitly mark the URL as link?
Is scanHTML() being used? If not, why not? That's the bug here.

Can you explain how scanHTML can fix this bug? I don't really know how it works.

cs = Cc["@mozilla.org/txttohtmlconv;1"].getService(Ci.mozITXTToHTMLConv);
cs.scanHTML('<a href="http://www.foobar.com">www.foobar.com</a>', Ci.mozITXTToHTMLConv.kURLs)
// "<a href=\"http://www.foobar.com\">www.foobar.com</a>"

ScanHTML() searches the text nodes in HTML, and enriches user-written plaintext URLs with clickable links. It's specifically designed to run after (or during, if you wish) HTML email composition and turn manually typed URLs into links. It does not touch existing explicitly <a href> links, only adds those that are missing. It recognizes a number of different kinds of links. Those in comment 10 were only an excerpt.

userhtmlinput = '<span>www.foobar.com</span>';
cs = Cc["@mozilla.org/txttohtmlconv;1"].getService(Ci.mozITXTToHTMLConv);
output = cs.scanHTML(userinput, Ci.mozITXTToHTMLConv.kURLs);
// "<span><a class=\"moz-txt-link-abbreviated\" href=\"http://www.foobar.com\">www.foobar.com</a></span>"

parserUtils = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
plaintext = parserUtils.convertToPlainText(output, Ci.nsIDocumentEncoder.OutputFormatted, 80);
console.log(userhtmlinput);
console.log(output);
console.log(plaintext);

If you run that output HTML through our HTML->TXT converter, you will see that you get the original www.foobar.com back, fixing this bug.

If we're not using ScanHTML() to look for URLs during email sending, then that's the bug here. Can you please show us the relevant code? I don't know where it is in the new JS implementation.

I think one of us misunderstood this bug.

From comment 0, the reporter says

In message body create a link. Put "https://bugzilla.mozilla.org" in both the text and the Link Location fields.

The reporter was using the EdLinkProps dialog to insert the link. The dialog can be opened from Insert --> Link menu. After that, the following html is inserted into the editor

<a moz-do-not-send="true" href="https://bugzilla.mozilla.org">https://bugzilla.mozilla.org</a>

The above happens in the compose window. The bug happens in converting this html to txt in the sending process. As I mentioned in comment 7, this is an old bug that can be reproduced on esr78. It's not introduced by the new JS code.

That's why I said the bug is about html -> txt conversion. I still don't see how scanHTML can help, since it seems to be dealing with txt -> html conversion.

:rnons, yes, indeed, I misunderstood the bug. I thought the user simply types a URL in the composer, and then sends it, and Thunderbird recognizes the URL and automatically makes it a link. But that's apparently not what the bug here is about.
I just tested it, and it works correctly in this case.

Reproduction:

  1. Write an email in the HTML mail composer, with:
    www.foobar.com
    http://www.foobar.com
    you@foobar.com
  2. Send the email as both plaintext and HTML, using menu | Options | Email Format | HTML and Plain text - (Do not forget this step!)

Actual and expected result:

--------------3FEF359792E8686CBF9A3BA6
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit


www.foobar.com

http://www.foobar.com

you@foobar.com



--------------3FEF359792E8686CBF9A3BA6
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <p><a class="moz-txt-link-abbreviated" href="http://www.foobar.com">www.foobar.com</a></p>
    <p><a class="moz-txt-link-freetext" href="http://www.foobar.com">http://www.foobar.com</a></p>
    <p><a class="moz-txt-link-abbreviated" href="mailto:you@foobar.com">you@foobar.com</a></p>
    <p><br>
    </p>
  </body>
</html>

This these links are generated by scanHTML(), as expected. You do not see duplicated URLs in the plaintext. That all works correctly.


Thanks for pointing out that I misunderstood the bug here. I understand now that the user manually marks the URL as link, using the composer [ Add link ] button and dialog.

Reproduction:

  1. Write an email in the HTML mail composer, with:
    http://www.foobar.com
  2. Select "http://www.foobar.com"
  3. Open dropdown, click on "Link" button
  4. Enter "http://www.foobar.com" as link target URL
  5. Send the email as both plaintext and HTML, using menu | Options | Email Format | HTML and Plain text - (Do not forget this step!)

Actual result:

--------------D4B3B2F174A84B5360C59504
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

http://www.foobar.com <http://www.foobar.com>



--------------D4B3B2F174A84B5360C59504
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <a moz-do-not-send="true" href="http://www.foobar.com">http://www.foobar.com</a><br>
    <br>
    <br>
  </body>
</html>

--------------D4B3B2F174A84B5360C59504--

Expected result:

--------------D4B3B2F174A84B5360C59504
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

http://www.foobar.com



--------------D4B3B2F174A84B5360C59504
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <a moz-do-not-send="true" class="moz-txt-link-manual" href="http://www.foobar.com">http://www.foobar.com</a><br>
    <br>
    <br>
  </body>
</html>

--------------D4B3B2F174A84B5360C59504--

The quick fix here is to let the Link dialog check whether link target == link text, and then add class="moz-txt-link-manual" to the <a href>.

I just tested it (inserting that manually as HTML in the composer, without changing TB), and it works: I get the expected result above.

I will send a patch soon.

Status: NEW → ASSIGNED
Target Milestone: --- → 89 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/daac333fc257
Avoid duplicated links when converting to text/plain. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED

Comment on attachment 9213702 [details]
Bug 1692771 - Avoid duplicated links when converting to text/plain. r=mkmelin

[Approval Request Comment]
Regression caused by (bug #): A quite old bug
User impact if declined: <a href="URI">URI</a> becomes URI <URI> in text/plain, but should be just URI
Testing completed (on c-c, etc.): c-c
Risk to taking this patch (and alternatives if risky): low

Attachment #9213702 - Flags: approval-comm-beta?

Comment on attachment 9213702 [details]
Bug 1692771 - Avoid duplicated links when converting to text/plain. r=mkmelin

[Triage Comment]
Approved for beta

Attachment #9213702 - Flags: approval-comm-beta? → approval-comm-beta+

Awesome! Thanks Ben and Ping for getting this fixed. Thunderbird is getting better every day! :-))

You need to log in before you can comment on or make changes to this bug.