Closed Bug 89780 Opened 19 years ago Closed 16 years ago

Mozilla wrongly converts space to CR+LF in innerHTML.


(Core :: DOM: Serializers, defect, P2)






(Reporter: val, Assigned: bzbarsky)



(Keywords: testcase)


(3 files)

From Bugzilla Helper:
User-Agent: Mozilla/4.6 [en-gb]C-CCK-MCD  (Win98; I)

The innerHTML of an element appears to be 'sectioned' by Mozilla, so that every 
so often (roughly every 75 characters) a space is converted to a carriage-return 
+ line-feed pair.

This may make no difference in the rendering of the HTML, but in string 
processing one would normally expect the original text to be faithfully 
represented in the innerHTML.

Reproducible: Always
Steps to Reproduce:
The attached testcase shows (eventually) some text, and the character codes of 
the innerHTML of that text.

Expected Results:  The innerHTML should reflect the original whitespace.
I've finally managed to isolate just what has been causing me problems here.

Whilst Mozilla inially inserts CR+LF pairs in the original innerHTML, when that 
innerHTML is copied to another element, the CR characters are dropped, causing 
serious inconsistencies in the two strings.

A second attached testcase illustrates this.
Ever confirmed: true
Keywords: testcase
OS: Windows 98 → All
Hardware: PC → All
Target Milestone: --- → Future
Priority: -- → P5
Blocks: 117617
QA Contact: lchiang → ian
this seems to work now. Val: could you check if this works for you in a recent
build? I don't understand the test case very well.

Marking WORKSFORME pending feedback.
Closed: 18 years ago
Resolution: --- → WORKSFORME
Hi Ian - looks like one of those badly written bug reports that should have been
separated into two.

First problem was that Mozilla was inserting CR+LF pairs in the innerHTML of an
element, as described in the summary, and as shown in the first testcase.

But the real problem I was having turned out to be that when the innerHTML was
copied, the CR characters were then dropped, so the two strings were no longer
identical, as shown in the second testcase.

The second problem gave rise to serious implications for string handling, but
seems to be resolved in that the CR characters are no longer dropped when
copying the innerHTML, so the innerHTML and the copy of the innerHTML are now
the same. So that looks like a WORKSFORME too (Mozilla 1.2b).

But since CR+LF are still being inserted in the innerHTML, then strictly
speaking the problem described in the summary still remains. I'm guessing it's
required by the editor or some such, so I wouldn't be surprised if it turns out
to be a WONTFIX.

... but I still think it's bad form to tamper with the innerHTML of an element ;-)
Ah, the first testcase.

Ok, reopening. This does indeed seem wrong. I can't work out why we would do this.
Resolution: WORKSFORME → ---
Mass-reassigning bugs to
Assignee: jst → dom_bugs
This is a serializer bug.  As far as I can tell, there is no way I can keep the
HTML serializer from inserting these line breaks.  Setting the "raw" output flag
is ignored in AppendText unless I also set the "do formatting" flag!  And if I
do that, some other functions check the latter but not the former.

That's why the newlines are inserted at all.  As for why they're CRLF on
windows, that's because the "use CR for newline" flag is NOT passed to the
document encoder by GetInnetHTML.  That should probably be fixed....
Assignee: general → dom-to-text
Component: DOM: Mozilla Extensions → DOM to Text Conversion
If the serializer is inserting linebreaks at 70-odd char intervals, that
generally means someone is telling it to use line-wrapping mode.

Where is the serializer being called in this code path?  Maybe the caller just
needs to stop setting nsIDocumentEncoder::OutputWrap.
(In reply to comment #10)
> Where is the serializer being called in this code path?  Maybe the caller just
> needs to stop setting nsIDocumentEncoder::OutputWrap.

The caller is at:

> If the serializer is inserting linebreaks at 70-odd char intervals, that
> generally means someone is telling it to use line-wrapping mode.

You must be thinking of the plaintext serializer, which is like sane and all. 
The HTML serializer is not at all like that...  Analysis follows.

In nsHTMLContentSerializer::AppendText we have the following code:

190   if (mPreLevel > 0) {
191     AppendToStringConvertLF(data, aStr);
192   }
193   else if (!mDoFormat) {
195     PRBool hasLongLines = HasLongLines(data, lastNewlineOffset);
196     if (hasLongLines) {
198       AppendToStringWrapped(data, aStr, PR_FALSE);
201     }
202     else {
203       AppendToStringConvertLF(data, aStr);
204     }
205   }
206   else if (mFlags & nsIDocumentEncoder::OutputRaw) {
208     AppendToString(data, aStr);
211   }
212   else {
213     AppendToStringWrapped(data, aStr, PR_FALSE);
214   }

where I've left out some lines that aren't really relevant to the basic logic. 
Now let's see here:

1)  mPreLevel > 0 tests false (since we're not inside a pre/style/script).
2)  mDoFormat is false (since we did not pass in the
    nsIDocumentEncoder::OutputFormatted flag).

So we follow the !mDoFormat path.  HasLongLines() returns true (since we do), so
we end up in AppendToStringWrapped.

Further notes:

A) nsHTMLContentSerializer never looks at the nsIDocumentEncoder::OutputWrap
B) The code has been this way at least since the noXIF landing
C) The fact that you can't even get to the nsIDocumentEncoder::OutputRaw branch
   without setting mDoFormat to true(!) is a regression from bug 108024

Fixing note C and using the "output raw" flag for innerHTML _may_ work.  I
suspect it will flub the entity conversion we want (which it probably does
anyway, since aTranslateEntities in AppendToString()'s arguments defaults to
false, no caller ever passes it that I can tell, and the entity conversion flags
are ignored if it's false).
I think nsIDocumentEncoder::OutputRaw is what should be used here -- and yes,
it's not well documented at all in nsIDocumentEncoder.

Needing to set mDoFormat in order to get raw output seems like a bug.  Is that
filed as a regression anywhere?  Should we handle it here?

If entity conversion changes as a result of using raw, that's not intentional,
or at least it wasn't the original intent.  So if that's true (and it may be)
then that's also a bug and worth fixing at the same time.

> nsHTMLContentSerializer never looks at the nsIDocumentEncoder::OutputWrap 
> flag.

I just noticed a few weeks ago that composer has no way to set the wrap column,
either (a related problem).  I could swear that it used to check a pref called
editor.htmlWrapColumn, and in fact nsEditor::GetWrapWidth does, but nsHTMLEditor
never actually calls that method.  We're all screwed up there (I've been meaning
to file a bug, and maybe fix it, or track down whether it used to work and
regressed).  But as you point out, the serializer would have to be fixed before
the editor could do anything about it.

Composer has a problem with wrapping, anyway, because there's no way of marking
text nodes with a flag to indicate which are already wrapped/formatted and which
have just been added by composer and so need to be wrapped.  That's what the
HasLongLines stuff is about, trying to guess what's already been formatted for
composer's "don't reformat existing html" case.  It's all a mess, and that's
probably why none of the html serializer wrapping stuff ever got properly tested.

Enough history.  I would suggest:
- change nsGenericHTMLElement.cpp to add nsIDocumentEncoder::OutputRaw
- Fix nsHTMLContentSerializer to not require Formatted in order to do raw
  (can we just reorder those two clauses?  It looks like the only user of
OutputRaw is nsGenericElement, and that should have the same needs as this bug
so it shouldn't break anything there)
- Fix the comment in nsIDocumentEncoder to state clearly that it means no
formatting and no wrapping, and should not change entity conversion

Then test, making sure entities and the nsGenericElement code both still work.

Meanwhile, the wrapping stuff is still broken, but that's not needed for this
bug and I (or someone) should file it as two separate bugs, one on the
serializer and one on   the editor.
Akkana, that sounds like an excellent plan of action.  Yes, I think we should
just handle the outputraw issue (and just reordering sounds great to me) in this
bug, as well as handling the entity conversion issue (if any).

For general wrapping, a new bug is definitely the way to go.  I have some
thoughts on the problems that may need some discussion.... ;)
*** Bug 280579 has been marked as a duplicate of this bug. ***
I was wrong about entities.  Those are handled fine, since we call
AppendTextData (which handles them), _then_ start messing with the

So this fixes the bug and doesn't break anything that I can see.  akkana, if
you have time to look at this that would be great, but I'm not requesting your
review in case you don't....
Attachment #173053 - Flags: superreview?(peterv)
Attachment #173053 - Flags: review?(peterv)
Comment on attachment 173053 [details] [diff] [review]
Patch per comment 12

>+    // XXXbz this doesn't seem to be used by all serializers...

Right, because only the HTML serializer needed stuff like pretty printing
and space preservation for Composer.
Now, Nvu needs it in XML too so...
> >+    // XXXbz this doesn't seem to be used by all serializers...
> Right, because only the HTML serializer needed stuff like pretty printing

The option that comment is on is not used by the HTML serializer, as it happens.
 It's used by the plaintext one only.

Hence the need for a comment.  Someone needs to actually document this interface
(and perhaps decide whether we really need so many mutually exclusive options in
a bitfield, or whether we should be actually using bits with combinations of
bits meaning something).
Attachment #173053 - Flags: superreview?(peterv)
Attachment #173053 - Flags: superreview+
Attachment #173053 - Flags: review?(peterv)
Attachment #173053 - Flags: review+
Fixed.   Akkana, do you want to file those followup bugs?
Assignee: dom-to-text → bzbarsky
Closed: 18 years ago16 years ago
Priority: P5 → P2
Resolution: --- → FIXED
Target Milestone: Future → mozilla1.8beta
bug 281407 on documenting the rest of the serializer flags (like those XXXbz
bug 281409 on setting wrap width in the editor.
Did I miss any?

I didn't file that second wrap-width bug I said should be filed on the
serializer; I'm not clear at the moment on exactly what the bug should say, so
that can be filed later or grouped into any work that goes into bug 281409.

I won't be working on these right away since I don't think many people care, but
at least they're in the system now.  Thanks for the reminder, bz, and the work
on this more important bug.
You need to log in before you can comment on or make changes to this bug.