Closed Bug 20062 Opened 25 years ago Closed 25 years ago

Send messages with non-encoded NBSPs

Categories

(Core :: DOM: Editor, defect, P3)

All
Mac System 8.5
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: sfraser_bugs, Assigned: nhottanscp)

References

()

Details

Attachments

(2 files)

I've seen a couple of times that HTML messages posted to newsgroups with 5.0
contain raw non-breaking space characters (code point 160, 0xA0) in the middle of HTML text.
These should have been converted to &nbsp; somewhere.
Status: NEW → ASSIGNED
Target Milestone: M13
This isn't supposed to happen now that we're using Naoki's new entity
converter.  I'll look into what's happening (maybe I'm calling the converter
with the wrong flags).
This is expected because the entity conversion is applied as a fallback for
charset conversion. Since nbsp is in ISO-8859-1, no fallback (i.e. entity
conversion) happens.
I think that option is currently used for messenger only (because it may
benefit message search). But if this is undesirable it can be changed easily by
resetting the flag.
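A minimal sketch of the fallback behavior described above. It uses Python's `xmlcharrefreplace` codec error handler as a stand-in for a decimal-NCR fallback; this is an analogy, not the nsISaveAsCharset implementation, and `encode_with_fallback` is a hypothetical name.

```python
# Analogy only: Python's "xmlcharrefreplace" error handler plays the role of
# a decimal-NCR fallback. It is not the nsISaveAsCharset implementation.

def encode_with_fallback(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text, emitting &#NNN; only for characters the charset lacks."""
    return text.encode(charset, errors="xmlcharrefreplace")

# NBSP (U+00A0) exists in ISO-8859-1, so no fallback fires and it stays raw.
nbsp_result = encode_with_fallback("a\u00a0b")

# EURO SIGN (U+20AC) is not in ISO-8859-1, so the fallback produces an NCR.
euro_result = encode_with_fallback("a\u20acb")

print(nbsp_result)
print(euro_result)
```

Because the fallback only runs for unencodable characters, anything ISO-8859-1 can represent, including 0xA0, goes out as a raw byte.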
Naoki, if I change the flags to change the fallback option, will we still get
double quotes?  We don't want to go back to where " was always encoded into
&quot; since lots of people complained about that.  What's the right flag to use
to get &nbsp; but not &quot;?
No, &quot;, &amp;, &lt;, &gt; are always excluded from the conversion.
It is in mail/news code,
mailnews/base/util/nsMsgI18N.cpp line 409
nsISaveAsCharset::attr_EntityAfterCharsetConv
+ nsISaveAsCharset::attr_FallbackDecimalNCR :
change to
nsISaveAsCharset::attr_htmlTextDefault :
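A rough sketch of what the suggested change amounts to, assuming (as later comments in this bug describe) that the new flag runs entity conversion before charset conversion. The helper `entities_then_charset` and the `MANDATORY` set are hypothetical illustrations in Python, not Mozilla code.

```python
from html.entities import codepoint2name

# The four characters the converter never touches; they are escaped elsewhere.
MANDATORY = {ord('"'), ord("&"), ord("<"), ord(">")}

def entities_then_charset(text: str, charset: str = "iso-8859-1") -> bytes:
    """Replace characters that have a named HTML entity, then encode.

    Characters with no named entity that the charset cannot represent
    fall back to a decimal numeric character reference.
    """
    pieces = []
    for ch in text:
        code_point = ord(ch)
        name = codepoint2name.get(code_point)
        if name is not None and code_point not in MANDATORY:
            pieces.append(f"&{name};")
        else:
            pieces.append(ch)
    return "".join(pieces).encode(charset, errors="xmlcharrefreplace")

print(entities_then_charset('say "hi"\u00a0now'))
```

With this ordering NBSP is emitted as `&nbsp;` even though ISO-8859-1 could carry it raw, while the double quote is left alone.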
Assignee: akkana → rhp
Status: ASSIGNED → NEW
Sounds like this is Rich's bug, not mine, then; but I've changed
nsHTMLContentSinkStream.cpp to follow Naoki's suggestion.  But Naoki: even after
I make that change, I still don't see the &nbsp; entities; I just see spaces,
the same thing I saw when I was passing the flag
nsISaveAsCharset::attr_EntityAfterCharsetConv |
nsISaveAsCharset::attr_FallbackDecimalNCR.
Assignee: rhp → nhotta
Actually it's my bug. I need to change the flag in mail/news code.
Status: NEW → ASSIGNED
There are two issues around this bug.
1) Currently, messenger uses the Unicode interface to get data from the editor and then
converts it to the mail charset (using nsISaveAsCharset) inside messenger. This is why
the editor change didn't affect the messenger output. One benefit of getting
Unicode is that it makes text manipulation easier (e.g. parsing the body and
generating links and mailto as messenger does). If we used the stream interface
instead, we would need to consider moving some of the text manipulation into the
editor. That issue is separate from this bug, though.
2) Sending nbsp (and other Latin-1 characters) without entity encoding is the
current behavior of HTML mail send. Not using entities is not illegal because the mail
is labeled as ISO-8859-1, which includes nbsp as code point 0xA0. The change to
output &nbsp; can be done by just flipping the flag, but I need to know if the
change is really needed. For message searching, I don't think either IMAP or
local search (the current server and client) supports entity decoding (e.g.
cannot search &Eacute; but can search É).
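To illustrate the search concern, here is a small sketch assuming a search that does a plain byte match on the stored message body with no entity decoding. `naive_search` is a hypothetical stand-in, not the IMAP or client search code.

```python
def naive_search(body: bytes, term: str, charset: str = "iso-8859-1") -> bool:
    """Byte-level substring match with no entity decoding (an assumption)."""
    return term.encode(charset) in body

# The same text stored two ways in an ISO-8859-1 message body.
raw_body = "Caf\u00c9 menu".encode("iso-8859-1")   # raw 0xC9 byte for É
entity_body = b"Caf&Eacute; menu"                  # entity-encoded É

raw_hit = naive_search(raw_body, "Caf\u00c9")
entity_hit = naive_search(entity_body, "Caf\u00c9")

print(raw_hit, entity_hit)
```

The raw form is found and the entity form is not, which is the rationale given for leaving Latin-1 letters unencoded in mail.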
I agree that using raw NBSP is as legitimate as using raw e-acutes in iso-8859-1
HTML files.  But the rationale for using raw codes for alphanumerics and
punctuation was to make search/find work easier.  I don't think this rationale
holds true for NBSP.  Maybe we should entity-ize NBSP along with the mandatory
entities ('<', '>', '&', etc.).
I'm agnostic on this issue currently. Other than the fact that it is not widely done to
insert raw NBSP code point rather than the entity representation, is there a strong reason
why we must use the entity representation?  Are there other processes depending on
this NBSP entity?
BTW, what is the relevance of the news article link above?
I'd like to see nbsp turned into the entity &nbsp;.
I've been confused more than once by the non-entityizing of nbsp, wondering why
all the nbsp's in the editor's OutputHTML function were disappearing when
printed to stdout, and wondering whether it was a bug in the sink streams.
Plain ascii users use nbsp's even if they can't display characters like e-acute,
and by the time we get to the point of converting from unicode to ascii, we're
past the point where we can decide on what flags to pass in to the converter.
Editor uses a flag to do the entity conversion before the charset conversion
(for save/saveas), so &nbsp; is always generated. I think this option is good
for editor because the charset label is optional (by META tag), so it's safer to use as
many entities as possible. For mail, we always label the charset so we don't have to always
generate entities. Regarding treating nbsp specially, I prefer no special
handling (i.e. I do not want to change the interface only for nbsp). Other Latin-1
characters may be invisible depending on the glyph availability of the installed
fonts.
I'll hold the change until we have a reason to change this for mail send.
Currently, we special case the mandatory entities:
     "&lt;" represents the < sign
     "&gt;" represents the > sign
     "&amp;" represents the & sign
     "&quot;" represents the " mark
So, I was suggesting that we might add "&nbsp;" to this list even though it is
not mandatory.  The assumption is that raw 0xA0 is not very useful...
Yes, those four characters are excluded from the entity conversion
interface (i.e. the interface does not generate entities for those characters).
The four characters are entity encoded before coming to the entity converter
(needed for HTML escaping). I am not sure if nbsp belongs in the same category, but if
we add nbsp then the content sink/parser probably needs changes, I think.
It sounds like we're in agreement. Where possible, let's convert these
characters back to their entity versions when emitting HTML/XIF.
Akkana, can that be done in ContentSinks?
The content sinks depend on nsISaveAsCharset to do this entity encoding, so the
sinks will output whatever that class returns.
If we want that capability, the interface needs to be extended to accept
character-based options in addition to the category (e.g. Latin1, Symbol, etc.).
Or I can flip the flag now to do the entity conversion before the charset
conversion.
Maybe flipping the order is the best solution, then.  What's the disadvantage of
that?
All the Latin1 characters >= 160 are converted to entities. So far, I have not
heard a clear advantage or disadvantage of doing that; it seems to be a matter of
taste.
Let me flip the flag in early M13 (so that I can at least close the bug).
Depends on: 22315
There is an entity-related bug 22315 which needs to be resolved before checking in
the fix.
Checked in, now it always generates &nbsp;.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
I know I'm coming in way too late here, but nonetheless I'll suggest that encoding
nbsp's as entities is not the right thing to do for mail.  There are plenty of
mail clients that read ISO-8859-1 but not HTML.  When these receive HTML mail
they would be better off seeing the nbsp as 0xa0, which will render correctly,
rather than as a clutter of '&nbsp;', which will make the mail even harder to
read.

For non-mail use, I agree that &nbsp is superior.
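A small sketch of the point about non-HTML clients, assuming such a client simply decodes the body bytes using the labeled charset and shows them as-is. `plain_text_view` is a hypothetical name for illustration.

```python
def plain_text_view(body: bytes, charset: str = "iso-8859-1") -> str:
    """What a charset-aware, non-HTML client would display (an assumption):
    the decoded bytes, with any entities left as literal text."""
    return body.decode(charset)

# Raw 0xA0 decodes to a real NO-BREAK SPACE character.
raw_view = plain_text_view(b"Total:\xa0100")

# The entity is not interpreted, so the reader sees the markup itself.
entity_view = plain_text_view(b"Total:&nbsp;100")

print(repr(raw_view))
print(repr(entity_view))
```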
Adding phil for his opinion.
Status: RESOLVED → VERIFIED
verified in 1/7 build.
This issue is raised again in another bug, 27376. That bug mentions the HTML spec.
I think we want to reconsider the current behavior (both mail and composer).

>HTML 4.01 5.3 says

>A given character encoding may not be able to express all characters of the
>document character set. For such encodings, or when hardware or software
>configurations do not allow users to input some document characters directly,
>authors may use SGML character references.