Closed Bug 20062 Opened 26 years ago Closed 26 years ago

Send messages with non-encoded NBSPs

Tracking

Status:

VERIFIED FIXED

Milestone:

M13

People

(Reporter: sfraser_bugs, Assigned: nhottanscp)

References

(
URL
)

Details

Attachments

(2 files)

body of the message with 0xA0 replaced by nbsp (Simon Fraser [no longer active], 26 years ago; 2.06 KB, text/html)
a patch for the bug (nhottanscp, 26 years ago; 404 bytes, patch)

Simon Fraser [no longer active]

Reporter

Description

•

26 years ago

I've seen a couple of times that HTML messages posted to newsgroups with 5.0 contain raw non-breaking space characters (ISO-8859-1 code 160, 0xA0) in the middle of HTML text. These should have been converted to &nbsp; somewhere.

Akkana Peck

Updated

•

26 years ago

Status: NEW → ASSIGNED

Target Milestone: M13

Akkana Peck

Comment 1

•

26 years ago

This isn't supposed to happen now that we're using Naoki's new entity converter. I'll look into what's happening (maybe I'm calling the converter with the wrong flags).

Simon Fraser [no longer active]

Reporter

Comment 2

•

26 years ago

Attached file: body of the message with 0xA0 replaced by *nbsp*

nhottanscp

Assignee

Comment 3

•

26 years ago

This is expected because the entity conversion is applied as a fallback for charset conversion. Since nbsp is in ISO-8859-1, no fallback (i.e. entity conversion) happens. I think that option is currently used for messenger only (because it may benefit message search). But if this is undesirable it can be changed easily by resetting the flag.
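The fallback behavior described in this comment can be illustrated with a minimal Python sketch. It is not the nsISaveAsCharset implementation; the handler name `entity_fallback` and the function `entity_after_charset_conv` are hypothetical names chosen for the illustration. The idea is only that entities are produced when the target charset cannot represent a character, so U+00A0 passes through untouched in ISO-8859-1.

```python
import codecs
from html.entities import codepoint2name


def _entity_fallback(error):
    """Error handler: replace unencodable characters with HTML entities."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity; otherwise fall back to a decimal NCR.
        pieces.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(pieces), error.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("entity_fallback", _entity_fallback)


def entity_after_charset_conv(text, charset="iso-8859-1"):
    """Encode text; entities appear only where the charset conversion fails."""
    return text.encode(charset, errors="entity_fallback")
```

With ISO-8859-1 as the target, `entity_after_charset_conv("a\u00a0b")` leaves the non-breaking space as the raw byte 0xA0, while the same call with `charset="us-ascii"` would have to fall back to an entity.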

Akkana Peck

Comment 4

•

26 years ago

Naoki, if I change the flags to change the fallback option, will we still get double quotes? We don't want to go back to where " was always encoded into &quot; since lots of people complained about that. What's the right flag to use to get &nbsp; but not &quot;?

nhottanscp

Assignee

Comment 5

•

26 years ago

No, &quot;, &amp;, &lt;, &gt; are always excluded from the conversion. The flag is in mail/news code, mailnews/base/util/nsMsgI18N.cpp line 409: change nsISaveAsCharset::attr_EntityAfterCharsetConv + nsISaveAsCharset::attr_FallbackDecimalNCR to nsISaveAsCharset::attr_htmlTextDefault.

Akkana Peck

Updated

•

26 years ago

Assignee: akkana → rhp

Status: ASSIGNED → NEW

Akkana Peck

Comment 6

•

26 years ago

Sounds like this is Rich's bug, not mine, then; but I've changed nsHTMLContentSinkStream.cpp to follow Naoki's suggestion. But Naoki: even after I make that change, I still don't see the &nbsp; entities; I just see spaces, the same thing I saw when I was passing the flag nsISaveAsCharset::attr_EntityAfterCharsetConv | nsISaveAsCharset::attr_FallbackDecimalNCR.

nhottanscp

Assignee

Updated

•

26 years ago

Assignee: rhp → nhotta

nhottanscp

Assignee

Comment 7

•

26 years ago

Actually it's my bug. I need to change the flag in mail/news code.

nhottanscp

Assignee

Updated

•

26 years ago

Status: NEW → ASSIGNED

nhottanscp

Assignee

Comment 8

•

26 years ago

There are two issues around this bug.

1) Currently, messenger uses the unicode interface to get data from the editor, then converts it to the mail charset (using nsISaveAsCharset) inside messenger. This is why the editor change didn't affect the messenger output. One benefit of getting unicode is that it makes text manipulation easier (e.g. parsing the body and generating links and mailto like messenger does). If we used the stream interface instead, we would need to consider moving some of the text manipulation to the editor. That issue is separate from this bug, though.

2) Sending nbsp (and other latin1 characters) without entity encoding is the current behavior of html mail send. Not using entities is not illegal because the mail is labeled as ISO-8859-1, which includes nbsp as code point 0xA0. The change to output &nbsp; can be done by just flipping the flag, but I need to know if the change is really needed. For message searching, I don't think either IMAP or local search (the current server and client) supports entity decoding (e.g. cannot search &Eacute; but can search É).
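The search concern in point 2 can be sketched in a few lines of Python. This is only an illustration of why entity-encoded text is harder to search, not a model of the IMAP or local search code; the sample bodies and both function names are made up for the example.

```python
import html

raw_body = "Caf\u00e9 menu"        # e-acute sent as a raw ISO-8859-1 character
entity_body = "Caf&eacute; menu"   # e-acute sent as an entity


def naive_search(body, term):
    """Substring match with no entity decoding."""
    return term in body


def entity_aware_search(body, term):
    """Decode entities first, then match."""
    return term in html.unescape(body)
```

A search with no entity decoding finds "é" in `raw_body` but misses it in `entity_body`; only a search that decodes entities first finds it in both.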

bobj

Comment 9

•

26 years ago

I agree that using raw NBSP is as legitimate as using raw e-acutes in iso-8859-1 HTML files. But the rationale for using raw codes for alphanumerics and punctuation was to make search/find work easier. I don't think this rationale holds true for NBSP. Maybe we should entity-ize NBSP along with the mandatory entities ('<', '>', '&', etc.).

Katsuhiko Momoi

Comment 10

•

26 years ago

I'm agnostic on this issue currently. Other than the fact that it is not widely done to insert raw NBSP code point rather than the entity representation, is there a strong reason why we must use the entity representation? Are there other processes depending on this NBSP entity? BTW, what is the relevance of the news article link above?

Akkana Peck

Comment 11

•

26 years ago

I'd like to see nbsp turned into the entity &nbsp;. I've been confused more than once by the non-entityizing of nbsp, wondering why all the nbsp's in the editor's OutputHTML function were disappearing when printed to stdout, and wondering whether it was a bug in the sink streams. Plain ascii users use nbsp's even if they can't display characters like e-acute, and by the time we get to the point of converting from unicode to ascii, we're past the point where we can decide on what flags to pass in to the converter.

nhottanscp

Assignee

Comment 12

•

26 years ago

Editor uses a flag to do the entity conversion before the charset conversion (for save/saveas), so &nbsp; is always generated. I think this option is good for editor: because the charset label is optional (by META tag), it's safer to use as many entities as possible. For mail, we always label the charset so we don't have to always generate entities. Regarding treating nbsp specially, I prefer no special handling (i.e. I do not want to change the interface only for nbsp). Other latin1 characters may be invisible depending on the glyph availability of the installed fonts. I will hold the change until we have a reason to change this for mail send.

bobj

Comment 13

•

26 years ago

Currently, we special case the mandatory entities:

"&lt;" represents the < sign
"&gt;" represents the > sign
"&amp;" represents the & sign
"&quot;" represents the " mark

So, I was suggesting that we might add "&nbsp;" to this list even though it is not mandatory. The assumption is that raw 0xA0 is not very useful...

nhottanscp

Assignee

Comment 14

•

26 years ago

Yes, those four characters are excluded from the entity conversion interface (i.e. the interface does not generate entities for those characters). The four characters are entity encoded before coming to the entity converter (needed for html escape). I am not sure nbsp belongs in the same category, but if we add nbsp then the contentsink/parser probably needs changes, I think.

rickg

Comment 15

•

26 years ago

It sounds like we're in agreement. Where possible, let's convert these characters back to their entity versions when emitting HTML/XIF.

nhottanscp

Assignee

Comment 16

•

26 years ago

Akkana, can that be done in ContentSinks?

Akkana Peck

Comment 17

•

26 years ago

The content sinks depend on nsISaveAsCharset to do this entity encoding, so the sinks will output whatever that class returns.

nhottanscp

Assignee

Comment 18

•

26 years ago

If we want that capability, the interface needs to be extended to accept character-based options in addition to the category (e.g. Latin1, Symbol, etc.). Or I can flip the flag now to do the entity conversion before the charset conversion.

Akkana Peck

Comment 19

•

26 years ago

Maybe flipping the order is the best solution, then. What's the disadvantage of that?

nhottanscp

Assignee

Comment 20

•

26 years ago

All the Latin1 characters >= 160 are converted to entities. So far, I have not heard a clear advantage or disadvantage of doing that; it seems to be a matter of taste. Let me flip the flag in early M13 (so that I can at least close the bug).
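A minimal Python sketch of the flipped order (entity conversion first, charset conversion second) is given here for comparison. It is an illustration under assumptions, not the actual patch: `entity_before_charset_conv` and `ALREADY_ESCAPED` are hypothetical names, and the use of decimal NCRs for anything left over is a choice made for the sketch.

```python
from html.entities import codepoint2name

# The four characters handled earlier by HTML escaping; skipped here.
ALREADY_ESCAPED = {ord(c) for c in '<>&"'}


def entity_before_charset_conv(text, charset="iso-8859-1"):
    """Replace characters that have a named entity, then encode."""
    out = []
    for ch in text:
        cp = ord(ch)
        name = codepoint2name.get(cp)
        if name and cp not in ALREADY_ESCAPED:
            out.append(f"&{name};")
        else:
            out.append(ch)
    # Anything still unencodable becomes a decimal NCR.
    return "".join(out).encode(charset, errors="xmlcharrefreplace")
```

With this order, nbsp and every other Latin1 character that has a named entity are written as entities even though ISO-8859-1 could carry them raw, which is the trade-off discussed in the surrounding comments.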

nhottanscp

Assignee

Comment 21

•

26 years ago

Attached patch: a patch for the bug

nhottanscp

Assignee

Updated

•

26 years ago

Depends on: 22315

nhottanscp

Assignee

Comment 22

•

26 years ago

There is an entity-related bug, 22315, which needs to be resolved before checking in the fix.

nhottanscp

Assignee

Comment 23

•

26 years ago

Checked in, now it always generates &nbsp;.

Status: ASSIGNED → RESOLVED

Closed: 26 years ago

Resolution: --- → FIXED

Joe Francis

Comment 24

•

26 years ago

I know I'm coming in way too late here, but nonetheless I'll suggest that encoding nbsp's as entities is not the right thing to do for mail. There are plenty of mail clients that read ISO-8859-1 but not html. When these receive html mail they would be better off seeing the nbsp as 0xa0, which will render correctly, rather than as a clutter of '&nbsp;', which will make the mail even harder to read. For non-mail use, I agree that &nbsp; is superior.
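The receiving side of this argument can be sketched in Python as well. The function `show_as_plain_text` is a hypothetical stand-in for a mail client that decodes ISO-8859-1 but does not interpret HTML; the sample bytes are made up for the example.

```python
def show_as_plain_text(raw_bytes, charset="iso-8859-1"):
    """Decode the bytes for display without interpreting any HTML."""
    return raw_bytes.decode(charset)


# Raw 0xA0 decodes to a single no-break space character.
raw_nbsp = show_as_plain_text(b"10\xa0km")

# The entity is shown to the reader literally, as six extra characters.
entity_nbsp = show_as_plain_text(b"10&nbsp;km")
```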

nhottanscp

Assignee

Comment 25

•

26 years ago

Adding phil for his opinion.

sujay

Updated

•

26 years ago

Status: RESOLVED → VERIFIED

sujay

Comment 26

•

26 years ago

verified in 1/7 build.

nhottanscp

Assignee

Comment 27

•

25 years ago

This issue is raised again in another bug, 27376. That bug mentions the HTML spec. I think we want to reconsider the current behavior (both mail and composer).

>HTML 4.01 5.3 says
>A given character encoding may not be able to express all characters of the
>document character set. For such encodings, or when hardware or software
>configurations do not allow users to input some document characters directly,
>authors may use SGML character references.
