Composer editor deletes all the &nbsp; characters on pages using UTF-8 character set

RESOLVED WORKSFORME

Status

Core
Serializers
16 years ago
15 years ago

People

(Reporter: Jim Booth, Assigned: Tanu Mutreja)

Tracking

Trunk
x86
Windows ME

Details

Attachments

(1 attachment)

(Reporter)

Description

16 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.2b) Gecko/20020928
Build Identifier: Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.2b) Gecko/20020928

All the non-breaking space characters (&nbsp;) are deleted when you switch to
HTML Source view.

Reproducible: Always

Steps to Reproduce:
1. Open Composer Test Page from the Debug menu
2. Observe the third line that says "This sentence has two &nbsp; tags between
each word."
3. Switch to HTML SOURCE view and note that it DOES NOT contain any &nbsp;s,
just two spaces.
4. Make any change to the source code
5. Switch back to NORMAL view, and note that the double spaces are now ignored
in the display  (The sentence appears with single spaces.)
6. Switch back to HTML SOURCE view and manually replace the double spaces with
&nbsp;&nbsp;
7. Switch to NORMAL view and back to HTML SOURCE view.  Note the &nbsp;s are
gone again
8. Change 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">  TO
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
     and resave the file.
9. Close Composer and reopen it with the same file (or change View/Character
Coding to Western (ISO-8859-1))
10. Switch to HTML SOURCE view and manually replace the double spaces with
&nbsp;&nbsp; again
11. Switch to NORMAL view and back to HTML SOURCE view.  Note the &nbsp;s ARE
THERE NOW! Now the formatting of the page will remain correct.

Actual Results:  
&nbsp; characters are converted to spaces and subsequently ignored (displayed as
single spaces)

Expected Results:  
Keep the formatting as it was originally meant to be.  (Retain multiple spaces
between words where the author intended.)

Bug is also present in Mozilla 1.1 (20020826) and 1.2 Alpha (20020910) at least.

Workaround: check the character set on any page before you open it in Composer,
and if it's UTF-8, open it first in another editor and change the character set.
 You'll then have to manually fix any extended characters that display incorrectly.
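
The charset check in the workaround can be scripted. The following is a minimal sketch, assuming the page declares its encoding in a meta Content-Type tag like the one in step 8; the helper name `declared_charset` and the 4096-byte scan limit are illustrative choices, not part of Mozilla.

```python
import re

# Hypothetical helper pattern: captures the charset named inside the
# content attribute of a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]+content\s*=\s*[\"'][^\"']*charset\s*=\s*([\w-]+)",
    re.IGNORECASE,
)

def declared_charset(html_bytes):
    """Return the lowercased charset from a meta Content-Type tag, or None."""
    # Only the start of the file is scanned; the 4096-byte limit is an assumption.
    match = _META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None

page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
if declared_charset(page) == "utf-8":
    print("UTF-8 page: change the charset before opening it in Composer")
```

A page with no such meta tag yields None, in which case the encoding has to be checked some other way.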

Also can be reproduced by setting View/Character Coding to UTF-8 and then typing
in multiple spaces between words in a blank page.

Comment 1

16 years ago
I have a feeling this is serializer related ... but over to jfrancis first to
make sure.
Assignee: kin → jfrancis

Comment 2

16 years ago
kin may be hesitant to hand off to serializer sans investigation, but I'm not.
This has to be serializer.
Assignee: jfrancis → harishd
Status: UNCONFIRMED → NEW
Component: Editor: Core → DOM to Text Conversion
Ever confirmed: true
(Assignee)

Comment 3

16 years ago
In nsHTMLContentSerializer.cpp, we check the charset and, based on that, 
convert each character to its corresponding entity. Right now we do this 
conversion only for charset "ISO-8859-1". 
As I understand it, character references are an encoding-independent 
mechanism, so I don't understand the reasoning behind doing it only for 
ISO-8859-1. Any pointers?

Comment 4

16 years ago
I have no idea why we do that. My advice would be to find out from Bonsai who
introduced those lines and ask them, if possible.
(Assignee)

Comment 5

16 years ago
Thanks, Heikki. This bug seems to be a side effect of the patch for bug 65324. 
CC'ing JST and Nhotta for their input.

I feel this bug is valid only for "nbsp". It's true that UTF-8 has a code 
point for the space character, so a space needs no reference like "nbsp", 
but HTML then collapses all adjacent spaces into a single space. 
That is exactly what happens here, and it seems correct (unless some 
specification requires UTF-8 documents to treat adjacent spaces the way nbsp 
is treated). Also, looking at the list of character references in the range 
127 to 256, I believe HTML takes no special action for any of the others. 
Based on this assumption, I'm attaching a patch here...
Assignee: harishd → t_mutreja
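
The collapsing behaviour described in this comment can be illustrated with a rough sketch. It only approximates how a renderer treats runs of ordinary whitespace and is not Gecko code; the function name `collapse_whitespace` is hypothetical.

```python
import re

NBSP = "\u00a0"  # the character that &nbsp; refers to

def collapse_whitespace(text):
    """Approximate HTML rendering: runs of ASCII whitespace become one space."""
    # NBSP is deliberately absent from the character class, so it survives.
    return re.sub(r"[ \t\r\n\f]+", " ", text)

plain = "two  spaces"                  # two ordinary spaces
kept = "two" + NBSP + NBSP + "spaces"  # two non-breaking spaces

print(repr(collapse_whitespace(plain)))
print(repr(collapse_whitespace(kept)))
```

Once the non-breaking spaces have been turned into ordinary spaces, the double spacing is lost on display, which matches step 5 of the report.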
(Assignee)

Comment 6

16 years ago
Created attachment 101813 [details] [diff] [review]
PatchV1.0

Irrespective of the "charset" value, treat "&nbsp;" as a special case and
retain it for all encodings.
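
As a rough illustration of the idea, and not the attached patch itself, the sketch below models a serializer that emits named entities only for ISO-8859-1 (the behaviour described in comment 3) and optionally keeps &nbsp; for every charset. The function `serialize_text` and its `always_keep_nbsp` flag are hypothetical names; markup characters such as < and & are ignored here for brevity.

```python
from html.entities import codepoint2name

NBSP = "\u00a0"

def serialize_text(text, charset, always_keep_nbsp=True):
    """Serialize text, writing named entities depending on the charset."""
    latin1 = charset.lower() == "iso-8859-1"
    out = []
    for ch in text:
        code = ord(ch)
        if ch == NBSP and (always_keep_nbsp or latin1):
            # Special case: keep the reference regardless of encoding.
            out.append("&nbsp;")
        elif latin1 and 127 < code < 256 and code in codepoint2name:
            # Charset-gated conversion for the remaining Latin-1 characters.
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append(ch)
    return "".join(out)

sample = "caf\u00e9" + NBSP + NBSP + "bar"
print(serialize_text(sample, "utf-8", always_keep_nbsp=False))
print(serialize_text(sample, "utf-8"))
print(serialize_text(sample, "ISO-8859-1"))
```

With the flag off and a UTF-8 charset, the non-breaking spaces are written as raw characters that look like ordinary spaces in a source view; with the flag on they stay visible as &nbsp;.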

Comment 7

16 years ago
I am not sure if everybody wants &nbsp;. 
I think this should be a pref for the serializer like the charset check (bug
169590).

Comment 8

I have a similar problem.  &nbsp; is being converted first into real spaces,
then into &Acirc;&nbsp; when publishing in Composer.  It happens in this version:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020913 Debian/1.1-1

but not in:

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.1) Gecko/20020826
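
One plausible explanation for the &Acirc;&nbsp; output, offered here as an assumption rather than a confirmed diagnosis, is that a non-breaking space was written as UTF-8 bytes and then read back as ISO-8859-1 before entity conversion. The sketch below reproduces that sequence; `to_named_entities` is a hypothetical helper.

```python
from html.entities import codepoint2name

def to_named_entities(text):
    """Replace non-ASCII characters that have an HTML name with &name;."""
    return "".join(
        "&" + codepoint2name[ord(ch)] + ";"
        if ord(ch) > 127 and ord(ch) in codepoint2name
        else ch
        for ch in text
    )

nbsp = "\u00a0"
# U+00A0 is the two bytes C2 A0 in UTF-8; read as ISO-8859-1 they become
# U+00C2 (A with circumflex) followed by U+00A0.
misread = nbsp.encode("utf-8").decode("iso-8859-1")
print(to_named_entities(misread))  # &Acirc;&nbsp;
```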
(Reporter)

Comment 9

15 years ago
Checking back through my old bugs, this one seems to be fixed now.  

Can someone confirm that and mark it WFM?
(Reporter)

Comment 10

15 years ago
Marking as WFM.  Can't reproduce my test case anymore.  Some other patch must
have fixed this.
Status: NEW → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → WORKSFORME