Closed Bug 1563452 Opened 6 years ago Closed 2 years ago

When serialising with nsIDocumentEncoder::OutputFormatted, attributes, like href, can be broken across lines, which alters their value (broken links etc.)

Categories

(Core :: DOM: Core & HTML, defect, P2)

60 Branch
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox109 --- fixed

People

(Reporter: niente0, Assigned: benc)

References

Details

(Whiteboard: [patchlove])

User Story

For users of Thunderbird, this bug can make plain vanilla links like `http://www.test.com/pdf/Title of this document.pdf` get broken after wrapping adds extra spaces inside the href, which are then considered part of the href, making the link disfunctional.

Attachments

(2 files, 3 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Steps to reproduce:

  1. Open a new html message
  2. Add this text "http://www.test.com/pdf/Title of this document.pdf"
  3. Select this text
  4. Click on "add hyperlink" button
  5. Paste the same text ("http://www.test.com/pdf/Title of this document.pdf")
  6. Click on OK
  7. Send message

Actual results:

Lokk at the message in "sent messages", when you hover the mouse on the link, it was transformed to:
http://www.test.com/pdf/Title%20%20%20%20%20 of this document.pdf
and the link will not work.

Expected results:

Hyperlink should be transformed to:
http://www.test.com/pdf/Title%20of%20this%20document.pdf

Spaces in links are a constant source of trouble.

The HTML stored is in fact:

    <p><a moz-do-not-send="true" href="http://www.test.com/pdf/Title of
        this document.pdf">XXX<br>
      </a></p>

So the link gets broken over two lines. I don't see any multiple spaces when hovering, but copy link location results in
http://www.test.com/pdf/Title%20of%20%20%20%20%20%20%20%20this%20document.pdf

Masayuki-san, what's your opinion here? Looks like a serialiser bug to me. A href containing spaces should EDIT: NOT be serialised out like this. Or is this a TB editor bug? When we store the href, we should replace the spaces so this can't happen?

Also note bug 1371010 where percent-encoding was remove for hrefs and bug 1371559.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached patch 1563452-spaces-in-links.patch (obsolete) — Splinter Review

Something like this?

Attachment #9076064 - Flags: feedback?(masayuki)
Attachment #9076064 - Flags: feedback?(iann_bugzilla)
Attachment #9076064 - Flags: feedback?(acelists)
Attached patch 1563452-spaces-in-links.patch (obsolete) — Splinter Review

Typo :-(

Attachment #9076064 - Attachment is obsolete: true
Attachment #9076064 - Flags: feedback?(masayuki)
Attachment #9076064 - Flags: feedback?(iann_bugzilla)
Attachment #9076064 - Flags: feedback?(acelists)
Attachment #9076065 - Flags: feedback?(masayuki)
Attachment #9076065 - Flags: feedback?(iann_bugzilla)
Attachment #9076065 - Flags: feedback?(acelists)

Masatoshi-san, any comment here since you worked on bug 1371010?

Flags: needinfo?(VYV03354)
Attached file Test case (obsolete) —

How other browsers behave with this test case? If they percent-encode spaces, this is a web-compatiblity problem and we should fix the serializer. If they do not, we should deal with the problem in our editor.

Flags: needinfo?(VYV03354)

We should also make sure that the change does not affect contenteditable elements.

Comment on attachment 9076065 [details] [diff] [review] 1563452-spaces-in-links.patch Yeah, sounds like that solving in the UI code must be the best unless this causes some web-compat issues of Gecko. But perhaps, using XPCOM API must be better to fix this if you add this hack. But be careful, user may input %-encoded UI. So, your hack should run only when input URL is invalid.
Attachment #9076065 - Flags: feedback?(masayuki)
Comment on attachment 9076126 [details] Test case On this test case, Chrome does show the link with all the spaces when hovering the link with mouse. Copying the link copies http://www.test.com/pdf/Title%20of%20%20%20%20%20%20%20%20this%20document.pdf .

Is it actually legal to break HTML attributes inside the value like this? Could we prevent it for total safety?
I know contents of elements (text) can be line-wrapped freely (expcept <pre>) but then the renderer ignores surplus spaces (except one). Also tag attributes can be line-wrapped between attributes. But inside of attribute value is really strange.

I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.

(In reply to :aceman from comment #8)

On this test case, Chrome does show the link with all the spaces when
hovering the link with mouse.
Copying the link copies
http://www.test.com/pdf/Title%20of%20%20%20%20%20%20%20%20this%20document.pdf .

It is exactly the same behavior as us.

(In reply to :aceman from comment #9)

Is it actually legal to break HTML attributes inside the value like this? Could we prevent it for total safety?
I know contents of elements (text) can be line-wrapped freely (expcept <pre>) but then the renderer ignores surplus spaces (except one). Also tag attributes can be line-wrapped between attributes. But inside of attribute value is really strange.

Even if it is not legal, The HTML spec defines the error handling algorithm almost step-by-step, and all browsers follow that. We can't break the spec, interop with other browsers, and web-compat just for Thunderbird, sorry.

(In reply to Jorg K (GMT+2) from comment #10)

I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.

Could you quote the spec text that we failed to follow (that is, our bug)? I could not even find the word "escape" or "percent-encoding" in the spec.

(In reply to Masatoshi Kimura [:emk] from comment #11)

Even if it is not legal, The HTML spec defines the error handling algorithm almost step-by-step, and all browsers follow that. We can't break the spec, interop with other browsers, and web-compat just for Thunderbird, sorry.

We don't mean to change the reading of the attr value if it is already stored like that.
But can we somehow achieve that Thunderbird does not save the attr value wrapped like that? Replacing the spaces as in the patch is a way, but would there be a cleaner solution using some m-c API?

(In reply to Masatoshi Kimura [:emk] from comment #11)

(In reply to Jorg K (GMT+2) from comment #10)

I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.

Could you quote the spec text that we failed to follow (that is, our bug)? I could not even find the word "escape" or "percent-encoding" in the spec.

Is there a misunderstanding here? I'm not saying that serialising should escape or percent-encode. I am saying that if you pass DOM with a href of "x y z", it is not expected that the serialiser function returns a pretty-printed/formatted result where there is a line break inserted into the tag. That is a bug. If it wants to format, it should break before the tag. We can work around it by not producing tags with spaces, but that would be a work-around.

Note that TB does the serialising around here:
https://searchfox.org/comm-central/search?q=nsIDocumentEncoder%3A%3AOutputFormatted&case=false&regexp=false&path=mailnews%2F

If we didn't requests "formatted" resulting in illegible spaghetti HTML in the e-mail body, then there wouldn't be a problem.

What am I missing?

Ah, thanks for the clarification. I was misunderstanding because bug 1371010 was solely about percent-encoding.

Since nsIDocumentEncoder::OutputFormatted is not exposed to the web, this is indeed our (m-c) bug. nsXHTMLContentSerializer should consider the context on formatting.

Component: Message Compose Window → DOM: Core & HTML
Product: Thunderbird → Core
Version: 60 → 60 Branch
Summary: Extra space characters when inserting a hyperlink with spaces → When serialising with nsIDocumentEncoder::OutputFormatted, attributes, like href, can be broken across lines
Attachment #9076065 - Attachment is obsolete: true
Attachment #9076065 - Flags: feedback?(iann_bugzilla)
Attachment #9076065 - Flags: feedback?(acelists)

Just as a matter of interest, why doesn't that go to Core::Serialisers? Sadly still unowned, https://wiki.mozilla.org/Modules/Core, very bottom of page.

Comment on attachment 9076126 [details] Test case If I'm not mistaken, this is not a text case but an experiment. A test case would be some addition to TestPlainTextSerializer.cpp that recreated the case were the serialiser inserts a line break into the tag.
Attachment #9076126 - Attachment is obsolete: true

@Jork K: Core::Serializers is presumably the correct component. However, I still haven't understood how to reproduce this in Firefox (not Thunderbird). I've seen comment 13 but couldn't produce an example where this fails. A concrete case/STR leading to a failure in Firefox would be helpful.

Flags: needinfo?(jorgk)

It's a problem of the underlying backend libraries. You could write a page which contains a link (anchor) with a href attribute with spaces in it. Then serialise that using some JS using nsIDocumentEncoder with nsIDocumentEncoder::OutputFormatted which is triggered on load. Tweak the HTML of the page until the href is broken at one of the spaces and something like is shown in comment #1 comes out as a result.

I suggested to add some tests to TestPlainTextSerializer.cpp, or looking at this again, maybe that isn't the right test bed for it. But I'm sure you have tests for serialisation somewhere, like
https://searchfox.org/mozilla-central/search?q=nsIDocumentEncoder&case=false&regexp=false&path=.html

Flags: needinfo?(jorgk)

Maybe around here: https://searchfox.org/mozilla-central/rev/f16b7b648f2b294225edd0192f0fbb543327bce9/dom/serializers/nsXMLContentSerializer.cpp#671,673
I can't think it's ever allowed to do any line wrapping inside an attribute value.

Attached patch 1563452.patchSplinter Review

With the change Magnus suggested in comment #20 and with the suggestion for a gTest from comment #18 I took a copy of
https://searchfox.org/mozilla-central/source/dom/base/test/gtest/TestParser.cpp
and modified it slightly to show the problem. Without the changes in nsXMLContentSerializer.cpp, the gTest fails, with the changes it succeeds.

To run the gTest, simply type this into a command prompt:
mach gtest TestSerializerNoBreakLink*
(with the * at the end).

Magnus, can you take it from here? This would need a try run on M-C, agreement that this is the right way to go, review, etc.

Flags: needinfo?(mkmelin+mozilla)

Comment on attachment 9222976 [details] [diff] [review]
1563452.patch

Sadly Magnus hasn't moved this forward.

Is this the correct way forward? I can't f?hsivonen, hence the NI?

Flags: needinfo?(hsivonen)
Attachment #9222976 - Flags: feedback?(mbrodesser)
Comment on attachment 9222976 [details] [diff] [review] 1563452.patch Review of attachment 9222976 [details] [diff] [review]: ----------------------------------------------------------------- The fix looks correct to me, but I'll let Mirko confirm.
Attachment #9222976 - Flags: feedback+
Flags: needinfo?(hsivonen)
Blocks: 1707360

I'm not very familiar with percent-encoding and the specific code which is touched in this review. However, I'm surprised Firefox allows one space character, and doesn't percent-encode that one. Only the surplus space characters are percent-encoded as "%20". According to https://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters no space characters should be allowed.

Hence, the expectation of the reporter isn't met with the patch of #c23.

Chrome percent-encodes all space characters.

Firefox' behavior might be intentional, because URLs stay more human-readable. That's to be clarified. Does someone know? Only after answering that, it makes sense to continue evaluating the correctness of that patch.

(In reply to Mirko Brodesser (:mbrodesser) from comment #25)

Firefox' behavior might be intentional, because URLs stay more human-readable. That's to be clarified. Does someone know? Only after answering that, it makes sense to continue evaluating the correctness of that patch.

I believe this in an artifact of the serializer by default retaining the in-DOM attribute value string unless rewriting relative URLs is requested. I'm not sure if absolute URLs go through URL-level reserialization even when rewriting of relative URLs is requested.

Is it an option for Thunderbird to additonally use OutputRaw? That would avoid wrapping, too.

(In reply to Mirko Brodesser (:mbrodesser) from comment #27)

Is it an option for Thunderbird to additonally use OutputRaw? That would avoid wrapping, too.

No. The idea is to send HTML in human-readable formatted form, see bug 1707360 comment #6. The more general question is whether it is permissible to introduce extra white-space and even line breaks when serializing attribute values, see Magnus' comment #20.

One has to be careful to not break existing behavior. OutputFormatted is used at https://searchfox.org/mozilla-central/search?q=OutputFormatted&case=true&regexp=true&path=, which contains two relevant cases:

The patch proposed in #c23 would affect all attributes and I'm wondering if there could be cases, where readability for humans matters, but additional space characters not. For instance, if an attribute is used to store a long list of values "value1 value2 ... value100" that line would become long.

I'd assume there are such, rare, cases where readability is "worse". However, those cases are likely mostly about debugging, and then it's essential for people to actually have the technically correct value of the attribute. You can't debug if you can't trust what's on the screen.

Hence, the expectation of the reporter isn't met with the patch of #c23.

I would assume the reporter doesn't care about the actual percent encoding. He just want the link not to turn invalid due to the wrong handling of the attribute, breaking it up and adding spaces where it should not. In that regard the patch would fix the use case.

(In reply to Magnus Melin [:mkmelin] from comment #30)

I'd assume there are such, rare, cases where readability is "worse". However, those cases are likely mostly about debugging, and then it's essential for people to actually have the technically correct value of the attribute. You can't debug if you can't trust what's on the screen.

Hence, the expectation of the reporter isn't met with the patch of #c23.

I would assume the reporter doesn't care about the actual percent encoding. He just want the link not to turn invalid due to the wrong handling of the attribute, breaking it up and adding spaces where it should not. In that regard the patch would fix the use case.

Agree with the latter.

I'm considering to have the patch adapted to only affect "href" attributes.

There are also many other attributes that can have urls. Most notably "src".

(In reply to Magnus Melin [:mkmelin] from comment #32)

There are also many other attributes that can have urls. Most notably "src".

Thanks for pointing this out. Some of them can be found from the references of https://html.spec.whatwg.org/multipage/urls-and-fetching.html#valid-non-empty-url-potentially-surrounded-by-spaces.

Also, introducing line feeds in the title attribute changes rendering. I think it would be good not to add breaks inside any attribute values.

(In reply to Henri Sivonen (:hsivonen) from comment #34)

Also, introducing line feeds in the title attribute changes rendering.

Thanks for pointing this out.

I think it would be good not to add breaks inside any attribute values.

I'm not sure. With the current patch, the semantics of OutputWrap would change, too. It's used at multiple places, see https://searchfox.org/mozilla-central/search?q=%5CbOutputWrap%5Cb&path=&case=true&regexp=true. If we accept that patch, OutputWrap should be renamed to reflect attribute values are not wrapped, e.g. OutputWrapNonAttributeValues.

Comment on attachment 9222976 [details] [diff] [review] 1563452.patch Review of attachment 9222976 [details] [diff] [review]: ----------------------------------------------------------------- José: if possible, could you please incorporate the desired changes and create a Phabricator review for this? ::: dom/base/test/gtest/TestSerializerNoBreakLink.cpp @@ +1,1 @@ > +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ Please apply Mozilla's C++ coding style, https://firefox-source-docs.mozilla.org/code-quality/coding-style/coding_style_cpp.html#c-coding-style, to this file, `./mach clang-format` could help. @@ +1,5 @@ > +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ > +/* This Source Code Form is subject to the terms of the Mozilla Public > + * License, v. 2.0. If a copy of the MPL was not distributed with this > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > + Please check the order of the includes. @@ +12,5 @@ > +#include "mozilla/ErrorResult.h" > + > +// This is a test for serialising some DOM with a link > +// whose href contains a space. > + Please remove the line below. @@ +20,5 @@ > + bool allTestsPassed = false; > + constexpr auto htmlInput = > + u"<html><head>" > + "<meta http-equiv=\"content-type\" content=\"text/html; charset=\">" > + "</head><body>Hello Thunderbird! <a href=\"http:www.example.com/" Please make the test more robust by, checking a really long line. E.g. "link with space 1 2 ... 125". @@ +24,5 @@ > + "</head><body>Hello Thunderbird! <a href=\"http:www.example.com/" > + "link with space.txt\">Link</a></body></html>"_ns; > + > + do { > + // Parse the HTML source. `using namespace mozilla;` `using namepsace mozilla::dom;` would be OK for this .cpp file. Please add a separate scope for `rv2` and define `nsCOMPtr<mozilla::dom::Document> document` above it, so that it can be used outside of `rv2`'s scope below. @@ +34,5 @@ > + htmlInput, mozilla::dom::SupportedType::Text_html, rv2); > + if (rv2.Failed()) break; > + > + // Serialize it back to HTML source again. > + nsCOMPtr<nsIDocumentEncoder> encoder = Here we only test the "text/html" serializer, even though the patch affects other serializers too. Please either test all affected serializers, or rename the test file accordingly, e.g. "TestHtmlSerializer...". @@ +36,5 @@ > + > + // Serialize it back to HTML source again. > + nsCOMPtr<nsIDocumentEncoder> encoder = > + do_createDocumentEncoder("text/html"); > + if (!encoder) break; Could "text/html"_ns be bound to a variable and be reused above? @@ +37,5 @@ > + // Serialize it back to HTML source again. > + nsCOMPtr<nsIDocumentEncoder> encoder = > + do_createDocumentEncoder("text/html"); > + if (!encoder) break; > + nsresult rv = encoder->Init(document, u"text/html"_ns, Please also test this for `OutputWrap`. @@ +43,5 @@ > + if (NS_FAILED(rv)) break; > + nsString parsed; > + rv = encoder->EncodeToString(parsed); > + if (NS_FAILED(rv)) break; > + `EXPECT_NE(...)` would be more concise. ::: dom/serializers/nsXMLContentSerializer.cpp @@ +664,5 @@ > } > NS_ENSURE_TRUE(attrString.Append(sValue, mozilla::fallible), false); > NS_ENSURE_TRUE(attrString.Append(cDelimiter, mozilla::fallible), false); > } > + NS_ENSURE_TRUE(AppendToStringConvertLF(attrString, aStr), false); Please update the documentation of `OutputWrap` and perhaps `OutputFormatted` in the .idl file, too. See https://bugzilla.mozilla.org/show_bug.cgi?id=1563452#c35.

In general I agree with the fix, it just needs some polishing as requested in the review.

Thanks for the comments, Mirko. I don't intend to work on this any further. Hopefully someone from the TB team can pick this up. It was after all Magnus's idea in the first place.

Flags: needinfo?(lasana)
Flags: needinfo?(bugzilla2007)

(In reply to José M. Muñoz from comment #38)

Thanks for the comments, Mirko. I don't intend to work on this any further. Hopefully someone from the TB team can pick this up. It was after all Magnus's idea in the first place.

Thanks for your help so far, much appreciated :-)

Attachment #9222976 - Flags: feedback?(mbrodesser) → feedback-
Flags: needinfo?(lasana)

For Thunderbird, broken links in messages are a serious fallout.
I've sent a ping to tb-devs on Matrix today if someone can finish this off.

Severity: normal → S2
Flags: needinfo?(bugzilla2007)
Priority: P3 → P2

Isn't it expected that as messages are bumped around between SMTP servers that they may be wordwrapped in different ways? So unless we are base64-encoding the HTML, we have to expect that line breaks could be placed at any whitespace position. HTML parsers are robust to tolerate this, and URLs are defined to not allow white-space for this kind of reason.

It therefore seems to me that the correct fix is to percent-encode URLs that are entered.
Or at least, that should be done as well as any other fix.

From a user point of view, they won't know anymore that they need to percent-encode URLs for them to be valid as web browser address bars remove the percent-encoding when viewing the URL in the address bars to make the URLs look nicer,

Ryan, for users of Thunderbird, this bug can make plain vanilla links like http://www.test.com/pdf/Title of this document.pdf get broken in outgoing messages (depending on the link length or position in the text). This is caused by HML source wrapping/indenting adding extra spaces inside the href attribute (link URL), which are then considered part of the URL, making the link disfunctional (mailnews bug 1707360).

See also my bug 1707360 comment 1 with alternative STR and analysis.
Broken links sounds like something which we should try to avoid also wrt enterprise usage.

The fix isn't exactly trivial, but I understand from comment 36 that the WIP patch in this bug has been thoroughly discussed and would just need a couple of changes to be accepted and landed (comment 36 and some of the previous comments iiuc). What do you think?

User Story: (updated)
Flags: needinfo?(ryan)
Summary: When serialising with nsIDocumentEncoder::OutputFormatted, attributes, like href, can be broken across lines → When serialising with nsIDocumentEncoder::OutputFormatted, attributes, like href, can be broken across lines, which alters their value (broken links etc.)
Whiteboard: [patchlove]

Ben, any cycles to get this over the finishing line?

Flags: needinfo?(mkmelin+mozilla) → needinfo?(benc)
Assignee: nobody → benc
Flags: needinfo?(benc)

I can't pretend that I really grasp this area, so I'm just going by the patch itself and Mirko's feedback in comment #36.
Mirko seems to be away for another week or so - I've left it with no reviewer for now, but maybe someone else would like to jump in? (or even just point out the obvious bits I screwed up ;- )

Gotcha, eager to see this addressed. Thanks for pushing this forward Magnus and Ben.

Flags: needinfo?(ryan)
Attachment #9271852 - Attachment description: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. → WIP: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer.
Attachment #9271852 - Attachment description: WIP: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. → Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer.
Status: NEW → ASSIGNED
Attachment #9271852 - Attachment description: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. → WIP: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer.
Attachment #9271852 - Attachment description: WIP: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. → Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer.
Attachment #9271852 - Attachment description: Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. → Bug 1563452 - Don't break attributes when serializing with nsXMLContentSerializer. r=mbrodesser

Sorry, there was a problem with the detection of inactive users. I'm reverting the change.

Assignee: nobody → benc
Status: NEW → ASSIGNED
Flags: needinfo?(htsai)

Thanks! I got sidetracked with other stuff, but I'm still working on this one. Just a unit test hiccup or two to nail down and we're done.

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:benc, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit auto_nag documentation.

Flags: needinfo?(mbrodesser)
Flags: needinfo?(benc)
Flags: needinfo?(mbrodesser)
Flags: needinfo?(benc)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: