When serialising with nsIDocumentEncoder::OutputFormatted, attributes, like href, can be broken across lines, which alters their value (broken links etc.)
Categories
(Core :: DOM: Core & HTML, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox109 | --- | fixed |
People
(Reporter: niente0, Assigned: benc)
References
Details
(Whiteboard: [patchlove])
User Story
For users of Thunderbird, this bug can make plain vanilla links like `http://www.test.com/pdf/Title of this document.pdf` get broken after wrapping adds extra spaces inside the href, which are then considered part of the href, making the link disfunctional.
Attachments
(2 files, 3 obsolete files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0
Steps to reproduce:
- Open a new html message
- Add this text "http://www.test.com/pdf/Title of this document.pdf"
- Select this text
- Click on "add hyperlink" button
- Paste the same text ("http://www.test.com/pdf/Title of this document.pdf")
- Click on OK
- Send message
Actual results:
Lokk at the message in "sent messages", when you hover the mouse on the link, it was transformed to:
http://www.test.com/pdf/Title%20%20%20%20%20 of this document.pdf
and the link will not work.
Expected results:
Hyperlink should be transformed to:
http://www.test.com/pdf/Title%20of%20this%20document.pdf
Comment 1•6 years ago
•
|
||
Spaces in links are a constant source of trouble.
The HTML stored is in fact:
<p><a moz-do-not-send="true" href="http://www.test.com/pdf/Title of
this document.pdf">XXX<br>
</a></p>
So the link gets broken over two lines. I don't see any multiple spaces when hovering, but copy link location results in
http://www.test.com/pdf/Title%20of%20%20%20%20%20%20%20%20this%20document.pdf
Masayuki-san, what's your opinion here? Looks like a serialiser bug to me. A href containing spaces should EDIT: NOT be serialised out like this. Or is this a TB editor bug? When we store the href, we should replace the spaces so this can't happen?
Also note bug 1371010 where percent-encoding was remove for hrefs and bug 1371559.
Comment 2•6 years ago
|
||
Something like this?
Comment 3•6 years ago
|
||
Typo :-(
Comment 4•6 years ago
|
||
Masatoshi-san, any comment here since you worked on bug 1371010?
Comment 5•6 years ago
|
||
How other browsers behave with this test case? If they percent-encode spaces, this is a web-compatiblity problem and we should fix the serializer. If they do not, we should deal with the problem in our editor.
Comment 6•6 years ago
|
||
We should also make sure that the change does not affect contenteditable
elements.
Comment 7•6 years ago
|
||
Is it actually legal to break HTML attributes inside the value like this? Could we prevent it for total safety?
I know contents of elements (text) can be line-wrapped freely (expcept <pre>) but then the renderer ignores surplus spaces (except one). Also tag attributes can be line-wrapped between attributes. But inside of attribute value is really strange.
Comment 10•6 years ago
|
||
I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.
Comment 11•6 years ago
|
||
(In reply to :aceman from comment #8)
On this test case, Chrome does show the link with all the spaces when
hovering the link with mouse.
Copying the link copies
http://www.test.com/pdf/Title%20of%20%20%20%20%20%20%20%20this%20document.pdf .
It is exactly the same behavior as us.
(In reply to :aceman from comment #9)
Is it actually legal to break HTML attributes inside the value like this? Could we prevent it for total safety?
I know contents of elements (text) can be line-wrapped freely (expcept <pre>) but then the renderer ignores surplus spaces (except one). Also tag attributes can be line-wrapped between attributes. But inside of attribute value is really strange.
Even if it is not legal, The HTML spec defines the error handling algorithm almost step-by-step, and all browsers follow that. We can't break the spec, interop with other browsers, and web-compat just for Thunderbird, sorry.
(In reply to Jorg K (GMT+2) from comment #10)
I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.
Could you quote the spec text that we failed to follow (that is, our bug)? I could not even find the word "escape" or "percent-encoding" in the spec.
Comment 12•6 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #11)
Even if it is not legal, The HTML spec defines the error handling algorithm almost step-by-step, and all browsers follow that. We can't break the spec, interop with other browsers, and web-compat just for Thunderbird, sorry.
We don't mean to change the reading of the attr value if it is already stored like that.
But can we somehow achieve that Thunderbird does not save the attr value wrapped like that? Replacing the spaces as in the patch is a way, but would there be a cleaner solution using some m-c API?
Comment 13•6 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #11)
(In reply to Jorg K (GMT+2) from comment #10)
I agree, that's why I wrote in comment #1: Looks like a serialiser bug to me. A href containing spaces should NOT be serialised out like this. But given that bug 1371010 stopped percent encoding, all we can do it make sure that the link doesn't get broken across lines.
Could you quote the spec text that we failed to follow (that is, our bug)? I could not even find the word "escape" or "percent-encoding" in the spec.
Is there a misunderstanding here? I'm not saying that serialising should escape or percent-encode. I am saying that if you pass DOM with a href of "x y z", it is not expected that the serialiser function returns a pretty-printed/formatted result where there is a line break inserted into the tag. That is a bug. If it wants to format, it should break before the tag. We can work around it by not producing tags with spaces, but that would be a work-around.
Note that TB does the serialising around here:
https://searchfox.org/comm-central/search?q=nsIDocumentEncoder%3A%3AOutputFormatted&case=false®exp=false&path=mailnews%2F
If we didn't requests "formatted" resulting in illegible spaghetti HTML in the e-mail body, then there wouldn't be a problem.
What am I missing?
Comment 14•6 years ago
|
||
Ah, thanks for the clarification. I was misunderstanding because bug 1371010 was solely about percent-encoding.
Since nsIDocumentEncoder::OutputFormatted
is not exposed to the web, this is indeed our (m-c) bug. nsXHTMLContentSerializer
should consider the context on formatting.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 15•6 years ago
|
||
Just as a matter of interest, why doesn't that go to Core::Serialisers? Sadly still unowned, https://wiki.mozilla.org/Modules/Core, very bottom of page.
Comment 16•6 years ago
|
||
@Jork K: Core::Serializers is presumably the correct component. However, I still haven't understood how to reproduce this in Firefox (not Thunderbird). I've seen comment 13 but couldn't produce an example where this fails. A concrete case/STR leading to a failure in Firefox would be helpful.
Comment 18•6 years ago
|
||
It's a problem of the underlying backend libraries. You could write a page which contains a link (anchor) with a href attribute with spaces in it. Then serialise that using some JS using nsIDocumentEncoder with nsIDocumentEncoder::OutputFormatted which is triggered on load. Tweak the HTML of the page until the href is broken at one of the spaces and something like is shown in comment #1 comes out as a result.
I suggested to add some tests to TestPlainTextSerializer.cpp, or looking at this again, maybe that isn't the right test bed for it. But I'm sure you have tests for serialisation somewhere, like
https://searchfox.org/mozilla-central/search?q=nsIDocumentEncoder&case=false®exp=false&path=.html
Comment 20•4 years ago
|
||
Maybe around here: https://searchfox.org/mozilla-central/rev/f16b7b648f2b294225edd0192f0fbb543327bce9/dom/serializers/nsXMLContentSerializer.cpp#671,673
I can't think it's ever allowed to do any line wrapping inside an attribute value.
Comment 21•4 years ago
|
||
With the change Magnus suggested in comment #20 and with the suggestion for a gTest from comment #18 I took a copy of
https://searchfox.org/mozilla-central/source/dom/base/test/gtest/TestParser.cpp
and modified it slightly to show the problem. Without the changes in nsXMLContentSerializer.cpp, the gTest fails, with the changes it succeeds.
To run the gTest, simply type this into a command prompt:
mach gtest TestSerializerNoBreakLink*
(with the * at the end).
Magnus, can you take it from here? This would need a try run on M-C, agreement that this is the right way to go, review, etc.
Comment 22•4 years ago
|
||
Comment 23•4 years ago
|
||
Comment on attachment 9222976 [details] [diff] [review]
1563452.patch
Sadly Magnus hasn't moved this forward.
Is this the correct way forward? I can't f?hsivonen, hence the NI?
Comment 24•4 years ago
|
||
Updated•4 years ago
|
I'm not very familiar with percent-encoding and the specific code which is touched in this review. However, I'm surprised Firefox allows one space character, and doesn't percent-encode that one. Only the surplus space characters are percent-encoded as "%20". According to https://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters no space characters should be allowed.
Hence, the expectation of the reporter isn't met with the patch of #c23.
Chrome percent-encodes all space characters.
Firefox' behavior might be intentional, because URLs stay more human-readable. That's to be clarified. Does someone know? Only after answering that, it makes sense to continue evaluating the correctness of that patch.
Comment 26•3 years ago
|
||
(In reply to Mirko Brodesser (:mbrodesser) from comment #25)
Firefox' behavior might be intentional, because URLs stay more human-readable. That's to be clarified. Does someone know? Only after answering that, it makes sense to continue evaluating the correctness of that patch.
I believe this in an artifact of the serializer by default retaining the in-DOM attribute value string unless rewriting relative URLs is requested. I'm not sure if absolute URLs go through URL-level reserialization even when rewriting of relative URLs is requested.
Is it an option for Thunderbird to additonally use OutputRaw
? That would avoid wrapping, too.
Comment 28•3 years ago
|
||
(In reply to Mirko Brodesser (:mbrodesser) from comment #27)
Is it an option for Thunderbird to additonally use
OutputRaw
? That would avoid wrapping, too.
No. The idea is to send HTML in human-readable formatted form, see bug 1707360 comment #6. The more general question is whether it is permissible to introduce extra white-space and even line breaks when serializing attribute values, see Magnus' comment #20.
One has to be careful to not break existing behavior. OutputFormatted
is used at https://searchfox.org/mozilla-central/search?q=OutputFormatted&case=true®exp=true&path=, which contains two relevant cases:
-
https://searchfox.org/mozilla-central/rev/36904ac58d2528fc59f640db57cc9429103368d3/editor/libeditor/HTMLEditorDataTransfer.cpp#2978, which is only used by Thunderbird.
-
https://searchfox.org/mozilla-central/rev/36904ac58d2528fc59f640db57cc9429103368d3/dom/webbrowserpersist/WebBrowserPersistLocalDocument.cpp#1204, to which
ENCODE_FLAGS_FORMATTED
maps, see https://searchfox.org/mozilla-central/search?q=ENCODE_FLAGS_FORMATTED&redirect=false. It's only used by https://searchfox.org/mozilla-central/rev/36904ac58d2528fc59f640db57cc9429103368d3/toolkit/content/contentAreaUtils.js#480, for which one would have to check all callers.
The patch proposed in #c23 would affect all attributes and I'm wondering if there could be cases, where readability for humans matters, but additional space characters not. For instance, if an attribute is used to store a long list of values "value1 value2 ... value100" that line would become long.
Comment 30•3 years ago
|
||
I'd assume there are such, rare, cases where readability is "worse". However, those cases are likely mostly about debugging, and then it's essential for people to actually have the technically correct value of the attribute. You can't debug if you can't trust what's on the screen.
Hence, the expectation of the reporter isn't met with the patch of #c23.
I would assume the reporter doesn't care about the actual percent encoding. He just want the link not to turn invalid due to the wrong handling of the attribute, breaking it up and adding spaces where it should not. In that regard the patch would fix the use case.
(In reply to Magnus Melin [:mkmelin] from comment #30)
I'd assume there are such, rare, cases where readability is "worse". However, those cases are likely mostly about debugging, and then it's essential for people to actually have the technically correct value of the attribute. You can't debug if you can't trust what's on the screen.
Hence, the expectation of the reporter isn't met with the patch of #c23.
I would assume the reporter doesn't care about the actual percent encoding. He just want the link not to turn invalid due to the wrong handling of the attribute, breaking it up and adding spaces where it should not. In that regard the patch would fix the use case.
Agree with the latter.
I'm considering to have the patch adapted to only affect "href" attributes.
Comment 32•3 years ago
|
||
There are also many other attributes that can have urls. Most notably "src".
(In reply to Magnus Melin [:mkmelin] from comment #32)
There are also many other attributes that can have urls. Most notably "src".
Thanks for pointing this out. Some of them can be found from the references of https://html.spec.whatwg.org/multipage/urls-and-fetching.html#valid-non-empty-url-potentially-surrounded-by-spaces.
Comment 34•3 years ago
|
||
Also, introducing line feeds in the title
attribute changes rendering. I think it would be good not to add breaks inside any attribute values.
(In reply to Henri Sivonen (:hsivonen) from comment #34)
Also, introducing line feeds in the
title
attribute changes rendering.
Thanks for pointing this out.
I think it would be good not to add breaks inside any attribute values.
I'm not sure. With the current patch, the semantics of OutputWrap
would change, too. It's used at multiple places, see https://searchfox.org/mozilla-central/search?q=%5CbOutputWrap%5Cb&path=&case=true®exp=true. If we accept that patch, OutputWrap
should be renamed to reflect attribute values are not wrapped, e.g. OutputWrapNonAttributeValues
.
In general I agree with the fix, it just needs some polishing as requested in the review.
Comment 38•3 years ago
|
||
Thanks for the comments, Mirko. I don't intend to work on this any further. Hopefully someone from the TB team can pick this up. It was after all Magnus's idea in the first place.
(In reply to José M. Muñoz from comment #38)
Thanks for the comments, Mirko. I don't intend to work on this any further. Hopefully someone from the TB team can pick this up. It was after all Magnus's idea in the first place.
Thanks for your help so far, much appreciated :-)
Updated•3 years ago
|
Comment 41•3 years ago
|
||
For Thunderbird, broken links in messages are a serious fallout.
I've sent a ping to tb-devs on Matrix today if someone can finish this off.
Comment 42•3 years ago
|
||
Isn't it expected that as messages are bumped around between SMTP servers that they may be wordwrapped in different ways? So unless we are base64-encoding the HTML, we have to expect that line breaks could be placed at any whitespace position. HTML parsers are robust to tolerate this, and URLs are defined to not allow white-space for this kind of reason.
It therefore seems to me that the correct fix is to percent-encode URLs that are entered.
Or at least, that should be done as well as any other fix.
From a user point of view, they won't know anymore that they need to percent-encode URLs for them to be valid as web browser address bars remove the percent-encoding when viewing the URL in the address bars to make the URLs look nicer,
Comment 43•3 years ago
|
||
Ryan, for users of Thunderbird, this bug can make plain vanilla links like http://www.test.com/pdf/Title of this document.pdf
get broken in outgoing messages (depending on the link length or position in the text). This is caused by HML source wrapping/indenting adding extra spaces inside the href attribute (link URL), which are then considered part of the URL, making the link disfunctional (mailnews bug 1707360).
See also my bug 1707360 comment 1 with alternative STR and analysis.
Broken links sounds like something which we should try to avoid also wrt enterprise usage.
The fix isn't exactly trivial, but I understand from comment 36 that the WIP patch in this bug has been thoroughly discussed and would just need a couple of changes to be accepted and landed (comment 36 and some of the previous comments iiuc). What do you think?
Comment 44•3 years ago
|
||
Ben, any cycles to get this over the finishing line?
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 45•3 years ago
|
||
Assignee | ||
Comment 46•3 years ago
|
||
I can't pretend that I really grasp this area, so I'm just going by the patch itself and Mirko's feedback in comment #36.
Mirko seems to be away for another week or so - I've left it with no reviewer for now, but maybe someone else would like to jump in? (or even just point out the obvious bits I screwed up ;- )
Comment 47•3 years ago
|
||
Gotcha, eager to see this addressed. Thanks for pushing this forward Magnus and Ben.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment hidden (off-topic) |
Comment 49•3 years ago
|
||
Sorry, there was a problem with the detection of inactive users. I'm reverting the change.
Assignee | ||
Comment 50•3 years ago
|
||
Thanks! I got sidetracked with other stuff, but I'm still working on this one. Just a unit test hiccup or two to nail down and we're done.
Comment 51•2 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:benc, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•2 years ago
|
Comment 52•2 years ago
|
||
bugherder |
Description
•