Closed Bug 194498 Opened 21 years ago Closed 20 years ago

non-breaking space turned into space

Categories

(Core :: DOM: Editor, defect)

x86
Windows 98
defect
Not set
normal

VERIFIED DUPLICATE of bug 218277

People

(Reporter: m.d.nauta, Assigned: mozeditor)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.3b) Gecko/20030221
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.3b) Gecko/20030221

If I enter a non-breaking space character (Alt-0160) in an input type=text box,
the ASP code that handles the form receives an ordinary space character. See also
the test HTML code below under Steps to Reproduce.

Reproducible: Always

Steps to Reproduce:
1. Write this simple HTML page.
<html><head><title>Try a non-breaking space.</title></head>
<body><form method='get' action='./Test0160.html'>
	<label>Enter non-breaking space Alt-0160</label>
	<input type='text' name='inbox' />
	<button type='submit'>Submit</button>
</form></body></html>

2. Enter Alt-0160 in the entry box and push Submit.
3. Look at the address bar.

Actual Results:  
The address ends in inbox=+

Expected Results:  
The address should end in inbox=%A0
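
For reference, the two query strings can be reproduced outside the browser. This is a minimal Python sketch, assuming the form is submitted in ISO-8859-1 (the report does not state the page charset); `encode_field` is a hypothetical helper for illustration, not browser code.

```python
from urllib.parse import quote_plus

def encode_field(name: str, value: str, charset: str = "iso-8859-1") -> str:
    """Percent-encode one form field roughly as a GET submission would."""
    return f"{name}={quote_plus(value, encoding=charset)}"

print(encode_field("inbox", "\u00a0"))  # prints inbox=%A0 (non-breaking space kept)
print(encode_field("inbox", " "))       # prints inbox=+   (ordinary space)
```

With a UTF-8 form the same character would be sent as %C2%A0 instead of %A0.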

Same trouble with Version 1.2.1  (20021130) and on Linux with 2001091712.
submission.
Assignee: form → form-submission
Component: Layout: Form Controls → Form Submission
QA Contact: desale → ashishbhatt
*** Bug 206925 has been marked as a duplicate of this bug. ***
This affects TEXTAREA and single line "text" INPUT fields, but not "hidden" form
elements. I have set up a testcase under
<http://bugzilla.ohrbelag.de/mozilla-bug-06.php>.

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Confirmed for Mozilla 2003121816 on Win2k.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I can't imagine what was in the developer's mind when he wrote an
nsPlainTextSerializer::Output which converts U+00A0 (NO-BREAK SPACE) to U+0020
(SPACE) and does nothing else!  It doesn't make any kind of sense, especially
if other Unicode space characters aren't treated in the same way: this code
just *has* to be wrong.  The attached patch (somebody please review!) just
removes this absurd transformation.
All right, now that I made this patch, what's the next step? I don't have CVS
checkin permission, obviously. And I'm sure nobody who does is watching this
bug, since it's assigned to <form-submission@content.bugs> which is a euphemism
for "nobody". So how do I attract attention?
Comment on attachment 140944 [details] [diff] [review]
Remove replacement of nbsp by space

How do I know whom to ask to review the patch?  (The help is most unclear.)
Attachment #140944 - Flags: review?
This affects the editor of Mozilla Mailnews too. I cannot write any mail which contains the 
character U+00A0. 
 
I do not think that Form Submission is the right component. Editor: Core should be better. 
 
(Note that I have put several non-breaking spaces in this comment, using the browser 
Konqueror.) 
 
(In reply to comment #8) 
> This affects the editor of Mozilla Mailnews too. I cannot write any mail which contains the  
> character U+00A0.  
 
The Mozilla location bar is affected too. If I type a non-breaking space in the location bar, it
is turned into a simple space.
 
> (Note that I have put several non-breaking spaces in this comment, using the browser  
> Konqueror.)  
 
Well... Not convincing. :-\ 
 
Component: HTML: Form Submission → Editor: Core
Moving bug to "Editor: Core" component, at Laurent Rineau's suggestion.
Assignee: form-submission → mozeditor
QA Contact: ashshbhatt → bugzilla
Attachment #140944 - Flags: review? → review?(mozeditor)
Why should I think that the original comment is wrong?  If it isn't, there are
plaintext consumers out there that will be confused.

What happens if I copy and paste into a unix command line, for example?
It won't work right on a Unix command line; it won't display right in many
plaintext mailers.  Non-breaking space is something that doesn't exist in a lot
of fonts.

The biggest reason for the conversion is that consecutive spaces in the mozilla
editor are turned into runs of <space><nbsp><space><nbsp>.  So if you go into
the plaintext mail editor and type Hello<sp><sp><sp><sp>world, if we didn't do
the conversion, it would send as Hello<sp><nbsp><sp><nbsp>world, and lots of
recipients would see it as Hello ? ?world, or instead of question marks they
might see an A-umlaut or a sequence like Hello \0xa0 \0xa0world.

It would be nice if we had some way of distinguishing non-breaking spaces
inserted by our editor from non-breaking spaces deliberately typed by the user.
Suggestions for how to represent this in the DOM would be welcomed.  Simply
emitting non-breaking spaces and breaking all the plaintext people is not a
viable option.
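
The interaction described in this comment can be modelled in a few lines. This is a rough Python sketch, not the actual Mozilla code: `editor_store` and `serialize_plain_text` are hypothetical names, and the alternating pattern is taken from the Hello<sp><nbsp><sp><nbsp>world example above.

```python
import re

SPACE = "\u0020"
NBSP = "\u00a0"

def editor_store(typed: str) -> str:
    """Model: store each run of 2+ spaces as alternating space / nbsp."""
    def alternate(match):
        length = len(match.group(0))
        return "".join(SPACE if i % 2 == 0 else NBSP for i in range(length))
    return re.sub(" {2,}", alternate, typed)

def serialize_plain_text(stored: str) -> str:
    """Model of the conversion under discussion: every nbsp becomes a space."""
    return stored.replace(NBSP, SPACE)

stored = editor_store("Hello    world")
print(ascii(stored))                                 # editor-internal form
print(ascii(serialize_plain_text(stored)))           # four plain spaces again
print(ascii(serialize_plain_text("Bonjour\u00a0!"))) # a typed nbsp is flattened too
```

The last line is the behaviour this bug complains about: the serializer cannot tell an nbsp typed by the user from one inserted by the editor.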
Attachment #140944 - Flags: review?(mozeditor) → review-
Well, I work in a French-speaking environment, we exchange a lot of plain text
files, typically encoded in iso-8859-1, and these tend to contain lots of
unbreakable spaces, because certain punctuation marks in French (colon,
semicolon, question mark, exclamation mark) are supposed to be preceded by an
unbreakable space.  The result of Mozilla mangling these spaces to ordinary
breakable spaces is catastrophic.  And the unbreakable space isn't any less
supported than any other iso-8859-1 character.  (Well, I admit I did *once* find
a font that was buggy as far as \240 was concerned, but it was on the Solaris
2.5 Openwin, which is a long time ago.)

But I won't argue.  If people can't agree as to what the correct behavior is,
the simple solution is to make that a configuration option.  I'm willing to code
this if nobody else is (although given my lack of Mozilla expertise I'd much
prefer it if somebody else volunteered; it's probably a very simple job just to
add one boolean option).  Just tell me how the option should be called, and
whether the patch would have a chance of being positively reviewed (assuming I
did things correctly; and understanding of course that this is no promise, but
I'd rather make sure in advance that it won't be rejected by principle).

As for a more acceptable long-term solution than replacing multiple spaces by
unbreakable spaces in the editor, I might suggest replacing them by a span
element with appropriate width (say, as many em's as the user typed spaces).
I'm confused about your example.  How is the conversion to spaces catastrophic? 
(I am a french speaker too.)

(In reply to comment #12)
> It won't work right on a Unix command line;

Can you explain what you mean by not working on unix command line? Obviously a
nbsp will not be taken as word separator.

> Non-breaking space is something that doesn't exist in a lot of fonts.

U+00A0 is in the iso-8859-1 charset (aka latin-1). I have never seen an
iso-8859-1 font where U+00A0 does not exist.


> The biggest reason for the conversion is that consecutive spaces in the
> mozilla editor are turned into runs of <space><nbsp><space><nbsp>.

That seems to be the problem. The editor creates non-breaking spaces to display
consecutive spaces. As some (really?) us-ascii fonts do not know U+00A0, the
plain text serializer has to transform them. This is not a good policy, because
some latin-1 users will want to use it. In French, for example, many punctuation
marks should almost always be preceded by a non-breaking space. Actually, there
are a lot of other spacing characters that should be used in well-written French:
those between U+2000 and U+200B, but they are not in the latin-1 charset. I am
part of the French translation project and we decided not to use them, because
only a few fonts have them. Anyway, as far as U+00A0 is concerned, some latin-1
users would like to use it.

Consecutive spaces in the editor should be replaced by something other than
<sp><nbsp><sp><nbsp><sp>.
It's easy to say that the editor shouldn't use nbsp's, but that's a dead snake.
The alternatives are even worse.  Whatever the solution is here, changing the
editor's nbsp usage internally is unlikely to be it.
Are you aware that Internet Explorer (and Opera for textareas only) doesn't
replace nbsp by space and still nothing "breaks"? 

If this is an issue for plain text mail composition (it really shouldn't),
couldn't that "feature" only be turned on in mail, and not in web forms?
(In reply to comment #14)
> I'm confused about your example.  How is the conversion to spaces catastrophic? 

Well, it can happen in two ways.  One is when the user enters plain text in a
textarea or text input field and takes care to enter unbreakable spaces before
punctuations (or in various other contexts) and these are transformed to
ordinary spaces: if the resulting text is, say, HTMLified and displayed by the
server, line breaks can occur at wrong places; I first noticed the problem
because, when commenting on various people's blogs, my comments tended to be
broken at places where I was certain I had placed unbreakable spaces (I enter
them by typing compose-space-space under Unix, incidentally).  The other problem
is the reverse: assume the user copies some text from an HTML page into a plain
text editor, and asks the editor to reformat the text (M-q under Emacs, say);
then spaces which were (justly) unbreakable in the HTML document might have
become ordinary spaces and might cause unwanted line breaks after reformatting.

Basically, unbreakable spaces are just a different character from the ordinary
space.  Turning one into the other is as bad as turning an X into a Y: then the
whole point of having unbreakable spaces is lost.

But I agree that the editor's "internal" use of unbreakable spaces, although
bugware, can't be simply ignored.  This is why I suggest having a configuration
option as a short-term solution to the problem (for knowledgeable users who are
aware of the existence of the unbreakable space and don't use the Mozilla editor
or at least don't type two spaces in a row or something), although a longer-term
solution probably has to be found also, but it won't be as urgent.
First, the context from my point of view:
Recently fr.wikipedia.org switched from &nbsp; to the non-breaking space
character; we have more than 8K pages containing non-breaking spaces. This is
annoying, since each time someone edits one of these pages with any Mozilla
version on any platform, it breaks them. So we will either rerun a bot on all
these pages to switch back to &nbsp;, or live with the fact that we need to
periodically run a bot on all the broken pages and use a regexp to replace
spaces by the 0xa0 code. Both solutions have drawbacks: the first uglifies the
HTML code; the second means we will get many spurious edits to review, and the
log history will be uglier.
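
As an illustration of the second option, a repair pass of that kind might look like the Python sketch below. It is hypothetical and not the actual Wikipedia bot; it only assumes the French rule mentioned earlier in this bug (colon, semicolon, question mark and exclamation mark are preceded by a non-breaking space).

```python
import re

NBSP = "\u00a0"

def restore_nbsp(text: str) -> str:
    """Replace an ordinary space directly before : ; ? ! with U+00A0."""
    return re.sub(r" (?=[:;?!])", NBSP, text)

print(ascii(restore_nbsp("Bonjour ! Comment allez-vous ?")))
```

A real bot would need exceptions (wiki markup, URLs, non-French text), which is part of why the resulting edits would need review.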

So I'll be very happy if there is any progress on this issue. If I understand
correctly the magic sequence used internally is 0x20 0xA0, so can't you change
the 0xA0 to 0x20 only if preceded by a 0x20? Is this an acceptable workaround
for other people looking at this problem?
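
The proposed workaround is easy to sketch. This is a minimal Python illustration with a hypothetical function name, not serializer code: it flattens U+00A0 only when the character before it is U+0020.

```python
import re

def flatten_editor_nbsp(text: str) -> str:
    """Turn U+00A0 into U+0020 only when it directly follows U+0020."""
    return re.sub(r"(?<=\u0020)\u00a0", "\u0020", text)

print(ascii(flatten_editor_nbsp("Hello \u00a0 \u00a0world")))  # editor-made run is flattened
print(ascii(flatten_editor_nbsp("Bonjour\u00a0!")))            # typed nbsp survives
```

The limitation is that a user who deliberately types a space followed by a non-breaking space still loses the latter.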
(In reply to comment #19) 
> So I'll be very happy if there is any progress on this issue. If I understand
> correctly the magic sequence used internally is 0x20 0xA0, so can't you change
> the 0xA0 to 0x20 only if preceded by a 0x20? Is this an acceptable workaround
> for other people looking at this problem?
 
Well, I agree that U+0020 U+00A0 is not a sequence that may appear in French, for
example.

Supposing that the editor has to do some temporary replacement, I don't see any
reason why it must be U+00A0 of all characters. If it is for internal use only
and replaced by an ordinary space later anyway, every other character could be
used as well, such as U+001A (substitute character), U+FFFC (object
placeholder), or some character from the U+E000 to U+E0FF range (private use
area). All of these are much less likely to appear in text entered by the user.
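
A sketch of this idea in Python, assuming U+E000 as the placeholder (an arbitrary pick from the private use area) and hypothetical function names; it is not how the editor actually stores text.

```python
import re

PLACEHOLDER = "\ue000"  # assumption: any private-use code point would do

def store_with_placeholder(typed: str) -> str:
    """Model: store runs of 2+ spaces as alternating space / placeholder."""
    def alternate(match):
        length = len(match.group(0))
        return "".join(" " if i % 2 == 0 else PLACEHOLDER for i in range(length))
    return re.sub(" {2,}", alternate, typed)

def serialize_placeholder(stored: str) -> str:
    """Only the placeholder becomes a space; U+00A0 is left untouched."""
    return stored.replace(PLACEHOLDER, " ")

round_trip = serialize_placeholder(store_with_placeholder("Hello    world"))
print(ascii(round_trip))
print(ascii(serialize_placeholder("Bonjour\u00a0!")))
```

The sketch covers only storage and serialization; the placeholder would also have to be displayed as a space while editing.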
Philippe: On the Wikipedia problem, what do you mean by "each time someone edit
one of these page with any mozilla version"?  Are they submitting a form, where
the user has been editing in a textarea and the submitted text changes nbsp to
space?  Form submission might be a different question, since we (I hope?) know
the charset the server is using, so it might be more reasonable to keep the nbsp
characters if we know we're using utf-8 (or some other charset) and not plain
us-ascii.
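
One way to express that check, as a Python sketch with hypothetical helper names: keep U+00A0 whenever the submission charset can encode it, and fall back to a space otherwise.

```python
NBSP = "\u00a0"

def can_keep_nbsp(charset: str) -> bool:
    """True if the given charset is able to represent U+00A0."""
    try:
        NBSP.encode(charset)
    except UnicodeEncodeError:
        return False
    return True

def prepare_for_submission(text: str, charset: str) -> str:
    """Flatten nbsp to space only when the charset cannot carry it."""
    return text if can_keep_nbsp(charset) else text.replace(NBSP, " ")

for name in ("utf-8", "iso-8859-1", "us-ascii"):
    print(name, can_keep_nbsp(name))
```

An unknown charset name raises LookupError here; real code would need to decide what to do in that case.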

There may be other issues, though; do I remember rightly that the text in a
textarea might not always be the same charset as the page which embeds it?

As to Uwe's suggestion of using some other character for space runs in the
editor: that might be possible in the editor (Joe would have to comment on
that), and the serializer part would be very easy, but it would also need a
layout change, since whatever character we use would have to display as a space. 

The core issue for the editor: if users type two spaces, they want to see the
caret move right twice, and be farther right than if they'd just typed one space.
(In reply to comment #22)

Yes, users submit new text through a form and, AFAIK, both the page and the form
use UTF-8. You can try it here: http://fr.wikipedia.org/wiki/Utilisateur:Phe/test1
(click on "Modifier cette page" to modify the text).

> The core issue for the editor: if users type two spaces, they want to see
> caret move right twice, and be farther right than if they'd just typed one
> space.

I see what your problem is. Magic values are evil, but such a case is very
tedious to fix without some sort of magic value.
This breaks i18n, and I think the input of a Mozilla i18n expert would be useful
here to find the best way out of this; I am CCing smontagu and jshin.

Maybe a solution would be, instead of replacing by nbsp one of every two spaces
typed, to insert a default_ignorable_code_point or a ZWNJ/ZWJ character between
successive spaces. And conversion to plain text would remove that character when
found between two spaces.
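
A small Python sketch of the ZWNJ variant, with hypothetical function names; it only illustrates the string transformation and says nothing about how layout would treat the inserted character.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def store_with_zwnj(typed: str) -> str:
    """Insert ZWNJ after every space that is directly followed by a space."""
    return re.sub(r" (?= )", " " + ZWNJ, typed)

def serialize_zwnj(stored: str) -> str:
    """Drop ZWNJ only where it sits between two spaces."""
    return re.sub(r"(?<= )\u200c(?= )", "", stored)

stored = store_with_zwnj("Hello   world")
print(ascii(stored))
print(ascii(serialize_zwnj(stored)))
```

A ZWNJ that the user typed inside a word is not between two spaces, so it passes through unchanged, as does U+00A0.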
This is a huge bug. It breaks the most important rule: don't touch user data. If you need to replace
something because other code needs it, fix the other broken code.
Glad to know Mozilla alters user data for the benefit of the outdated us-ascii
encoding... Who is still using it? ISO-8859-1 exists, guys!

When I read that "non-breaking space is something that doesn't exist in a lot
of fonts" I can't believe it!
"&nbsp;" is a well-known entity, used since the early days of mass web authoring.
Don't tell me authors used a character user agents didn't know how to render! So
what's the difference with the equivalent character?

We do want to use the richness of our encoding, not even considering Unicode!
This is another bug where Mozilla sucks, see #70610 for instance. And this bug
again has been put to sleep…
(This non us-ascii character was brought to you by UTF-8.)
Bug #218277 is about the same issue and has a r+sr patch and dependencies; this
bug should probably be marked as a duplicate.
Depends on: 218277

*** This bug has been marked as a duplicate of 218277 ***
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
*** Bug 268995 has been marked as a duplicate of this bug. ***