Closed Bug 20062 Opened 26 years ago Closed 26 years ago

Send messages with non-encoded NBSPs

Categories

(Core :: DOM: Editor, defect, P3)

All
Mac System 8.5
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: sfraser_bugs, Assigned: nhottanscp)

References

()

Details

Attachments

(2 files)

I've seen a couple of times that HTML messages posted to newsgroups with 5.0 contain raw non-breaking space characters (ASCII 160) in the middle of HTML text. These should have been converted to &amp;nbsp; somewhere.
Status: NEW → ASSIGNED
Target Milestone: M13
This isn't supposed to happen now that we're using Naoki's new entity converter. I'll look into what's happening (maybe I'm calling the converter with the wrong flags).
This is expected because the entity conversion is applied as a fallback for charset conversion. Since nbsp is in ISO-8859-1, no fallback (i.e. entity conversion) happens. I think that option is currently used for messenger only (because it may benefit message search). But if this is undesirable it can be changed easily by resetting the flag.
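As a rough illustration of the fallback behavior described above, here is a minimal Python sketch. It does not use nsISaveAsCharset; it assumes Python's `xmlcharrefreplace` error handler is a fair stand-in for a decimal NCR fallback, and the function name `fallback_only` is made up for this example.

```python
def fallback_only(text: str) -> bytes:
    # Encode to ISO-8859-1. Only characters the charset cannot represent
    # fall back to decimal numeric character references (assumed analog
    # of a "fallback decimal NCR" option).
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# U+00A0 (nbsp) exists in ISO-8859-1, so it is emitted as the raw byte 0xA0.
print(fallback_only("a\u00a0b"))
# U+20AC (euro sign) is not in ISO-8859-1, so the fallback kicks in.
print(fallback_only("\u20ac"))
```

Under these assumptions the nbsp never reaches the fallback path, which matches the raw 0xA0 bytes seen in the posted messages.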
Naoki, if I change the flags to change the fallback option, will we still get double quotes? We don't want to go back to where " was always encoded into &amp;quot;, since lots of people complained about that. What's the right flag to use to get &amp;nbsp; but not &amp;quot;?
No, &quot;, &amp;, &lt;, &gt; are always excluded from the conversion. It is in mail/news code, mailnews/base/util/nsMsgI18N.cpp line 409: change nsISaveAsCharset::attr_EntityAfterCharsetConv + nsISaveAsCharset::attr_FallbackDecimalNCR to nsISaveAsCharset::attr_htmlTextDefault.
Assignee: akkana → rhp
Status: ASSIGNED → NEW
Sounds like this is Rich's bug, not mine, then; but I've changed the nsHTMLContentSinkStream.cpp to follow Naoki's suggestion. But Naoki: even after I make that change, I still don't see the &amp;nbsp; entities; I just see spaces, the same thing I saw when I was passing the flag nsISaveAsCharset::attr_EntityAfterCharsetConv | nsISaveAsCharset::attr_FallbackDecimalNCR.
Assignee: rhp → nhotta
Actually it's my bug. I need to change the flag in mail/news code.
Status: NEW → ASSIGNED
There are two issues around this bug. 1) Currently, messenger uses the unicode interface to get data from the editor and then converts it to the mail charset (using nsISaveAsCharset) inside messenger. This is why the editor change didn't affect the messenger output. One benefit of getting unicode is that it makes text manipulation easier (e.g. parsing the body and generating links and mailto like messenger does). If we used the stream interface instead, we would need to consider moving some of the text manipulation to the editor, although that is a separate issue from this bug. 2) Sending nbsp (and other latin1 characters) without entity encoding is the current behavior of html mail send. Not using entities is not illegal because the mail is labeled as ISO-8859-1, which includes nbsp as code point 0xA0. The change to output &amp;nbsp; can be done by just flipping the flag, but I need to know if the change is really needed. For message searching, I don't think either IMAP or local search (the current server and client) supports entity decoding (e.g. cannot search &Eacute but can search É).
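The search concern in point 2 can be sketched in Python. This is only an assumed model of a search that matches raw text without decoding entities; `naive_search` is a hypothetical helper, not the IMAP or client search code.

```python
import html


def naive_search(body: str, needle: str) -> bool:
    # A search that compares raw text and knows nothing about entities.
    return needle in body


raw_body = "R\u00c9SUM\u00c9"           # É sent as a raw Latin-1 character
entity_body = "R&Eacute;SUM&Eacute;"     # É sent as a named entity

print(naive_search(raw_body, "\u00c9"))
print(naive_search(entity_body, "\u00c9"))
# Decoding entities first would make the second body searchable again.
print(naive_search(html.unescape(entity_body), "\u00c9"))
```

A search that does not decode entities finds É only in the raw body, which is the rationale given for leaving Latin-1 letters unencoded in mail.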
I agree that using raw NBSP is as legitimate as using raw e-acutes in iso-8859-1 HTML files. But the rationale for using raw codes for alphanumerics and punctuation was to make search/find work easier. I don't think this rationale holds true for NBSP. Maybe we should entity-ize NBSP along with the mandatory entities ('<', '>', '&', etc.).
I'm agnostic on this issue currently. Other than the fact that it is not widely done to insert raw NBSP code point rather than the entity representation, is there a strong reason why we must use the entity representation? Are there other processes depending on this NBSP entity? BTW, what is the relevance of the news article link above?
I'd like to see nbsp turned into the entity &nbsp;. I've been confused more than once by the non-entityizing of nbsp, wondering why all the nbsp's in the editor's OutputHTML function were disappearing when printed to stdout, and wondering whether it was a bug in the sink streams. Plain ascii users use nbsp's even if they can't display characters like e-acute, and by the time we get to the point of converting from unicode to ascii, we're past the point where we can decide on what flags to pass in to the converter.
Editor uses a flag to do the entity conversion before the charset conversion (for save/saveas), so &nbsp; is always generated. I think this option is good for editor because the charset label is optional (by META tag), so it's safer to use as many entities as possible. For mail, we always label the charset so we don't have to always generate entities. Regarding treating nbsp specially, I prefer no special handling (i.e. I do not want to change the interface only for nbsp). Other latin1 characters may be invisible depending on the glyph availability of the installed fonts. I will hold the change until we have a reason to change this for mail send.
Currently, we special case the mandatory entities: "&lt;" represents the < sign, "&gt;" represents the > sign, "&amp;" represents the & sign, "&quot;" represents the " mark. So, I was suggesting that we might add "&nbsp;" to this list even though it is not mandatory. The assumption is that raw 0xA0 is not very useful...
Yes, those four characters are excluded from the entity conversion interface (i.e. the interface does not generate entities for those characters). The four characters are entity encoded before coming to the entity converter (needed for html escape). I am not sure if nbsp should be in the same category, but if we add nbsp then the contentsink/parser probably need changes, I think.
It sounds like we're in agreement. Where possible, let's convert these characters back to their entity versions when emitting HTML/XIF.
Akkana, can that be done in ContentSinks?
The content sinks depend on nsISaveAsCharset to do this entity encoding, so the sinks will output whatever that class returns.
If we want that capability, the interface needs to be extended to accept character-based options in addition to the category (e.g. Latin1, Symbol, etc.). Or I can flip the flag now to do the entity conversion before the charset conversion.
Maybe flipping the order is the best solution, then. What's the disadvantage of that?
All the Latin1 characters >= 160 are converted to entities. So far, I have not heard a clear advantage or disadvantage of doing that; it seems to be a matter of taste. Let me flip the flag in early M13 (so that I can at least close the bug).
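To make the effect of flipping the order concrete, here is a hedged Python sketch of "entity conversion before charset conversion". It is not the nsISaveAsCharset implementation; the function name `entities_then_charset` and the 160..255 range check are assumptions chosen for this example, using the HTML 4 entity table from the Python standard library.

```python
from html.entities import codepoint2name


def entities_then_charset(text: str) -> bytes:
    out = []
    for ch in text:
        cp = ord(ch)
        # Assumed rule: Latin-1 characters from 160 to 255 become named
        # entities; the range check leaves ", &, < and > untouched.
        if 160 <= cp <= 255 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(ch)
    # Anything still outside ISO-8859-1 falls back to a decimal NCR.
    return "".join(out).encode("iso-8859-1", errors="xmlcharrefreplace")


print(entities_then_charset("caf\u00e9\u00a0au lait"))
print(entities_then_charset('"<&>'))
```

With this ordering nbsp comes out as an entity, and so does every other Latin-1 character in that range (for example e-acute), which is the side effect being weighed in the thread.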
Depends on: 22315
There is an entity-related bug, 22315, which needs to be resolved before checking in the fix.
Checked in, now it always generates &nbsp;.
Status: ASSIGNED → RESOLVED
Closed: 26 years ago
Resolution: --- → FIXED
I know I'm coming in way too late here, but nonetheless I'll suggest that encoding nbsp's as entities is not the right thing to do for mail. There are plenty of mail clients that read ISO-8859-1 but not html. When these receive html mail they would be better off seeing the nbsp as 0xa0, which will render correctly, rather than as a clutter of '&nbsp;', which will make the mail even harder to read. For non-mail use, I agree that &nbsp; is superior.
Adding phil for his opinion.
Status: RESOLVED → VERIFIED
verified in 1/7 build.
This issue is raised again in another bug, 27376. The bug mentions the HTML spec. I think we want to reconsider the current behavior (both mail and composer).
>HTML 4.01 5.3 says
>A given character encoding may not be able to express all characters of the
>document character set. For such encodings, or when hardware or software
>configurations do not allow users to input some document characters directly,
>authors may use SGML character references.