Closed
Bug 57248
Opened 25 years ago
Closed 24 years ago
Options|Format|Plaintext is not converting contents of HTMLcompose message to
Categories
(Core :: DOM: HTML Parser, defect, P2)
Core
DOM: HTML Parser
Tracking
()
VERIFIED
FIXED
mozilla0.9.1
People
(Reporter: bugzilla, Assigned: harishd)
References
()
Details
(Whiteboard: [nsbeta1+])
Attachments
(5 files)
40.47 KB,
patch
|
Details | Diff | Splinter Review | |
39.83 KB,
patch
|
Details | Diff | Splinter Review | |
32.28 KB,
patch
|
Details | Diff | Splinter Review | |
31.29 KB,
patch
|
Details | Diff | Splinter Review | |
40.24 KB,
patch
|
Details | Diff | Splinter Review |
Go to:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
Choose File -> Send Page
The Compose window now correctly says:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
Send the mail to yourself.
In the mail you revieve it says:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
Note the & in the last link!
That's dead wrong!
Updated•25 years ago
|
Reporter | ||
Comment 1•25 years ago
|
||
Dont delete the URL...:)
Comment 2•25 years ago
|
||
I'm investigating.
Reporter | ||
Comment 3•25 years ago
|
||
FYI: When reading the mail in Netscape 4.75 it says:
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
Comment 4•25 years ago
|
||
No, it doesn't, not for me. The msg source is wrong already, and 4.x doesn't
have a chance to recover from that, just like our viewer. It is a problem during
composition. -> Composition.
This bug does already appear on a 2000-09-13 build, so it is not caused by NOXIF.
If you do Debug|OutputText during composition, everything is fine.
Comment 5•25 years ago
|
||
.
Assignee: putterman → ducarroz
Component: Mail Window Front End → Composition
Comment 6•25 years ago
|
||
Update: The last sentence suggests that this is a bug in Composition. We seem
have a very odd data path during send (somthing like HTML->TXT->HTML->TXT or
similar*), which, IIRC, exposed various bugs already (which is partly good and
partly bad). IMO, it is time to clean this up.
The only thing I am wondering is what exactly inserts the &. This is
propably an additional bug, because, in a perfect world, HTML->TXT->HTML->TXT
would produce the same as HTML->TXT.
*IIRC, in previous bugs, we found that it is not just HTML/TXT conversions -
there was some other converter involved, but I don't remember the details. Do
you akk? Or am I erronously confusing this with the bugs we found during reply?
Summary: URL is htmlified (& = &, etcs..) in revieved mails that contains attached webpages → URL is htmlified (& -> & etc.)
Comment 7•25 years ago
|
||
The path followed by entity conversion changed with the noxif landing. This bug
was fixed at one point; perhaps it has regressed. Adding jst since he knows the
current status of entity handling in the output converters.
Comment 8•25 years ago
|
||
akk, I don't understand. And I said, I see the exact same behavious in an old build.
Comment 9•25 years ago
|
||
Isn't this the same issue as in bug 44372 (entities inside tags not converted
properly?) That was fixed some time ago, so if this is the same issue, it would
help to know what regressed.
What does the href look like if you do Debug->Output HTML during composition?
Comment 10•25 years ago
|
||
> Isn't this the same issue as in bug 44372 (entities inside tags not converted
> properly?)
Dunno. Note that here, the part in the tag has &.
We has a similar bug (tags not processed corrently) in libmime's flowed->HTML
converter, but that was supposed to be fixed, too. Ccing Daniel Bratell
nevertheless.
> What does the href look like if you do Debug->Output HTML during composition?
<a
href="http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1">http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1</a>
(which is very odd, because the link is not blue in the HTML composer).
OutputText gives
<<
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
>>
because the converter correctly recognized that text and href are identical.
Comment 11•25 years ago
|
||
Oh, I think I get it. This is the opposite of bug 44372. That bug was that
entities like & inside of a tags were getting converted to &, which was not
considered to be correct html.
This bug is that you WANT the & inside the a tag and don't want it converted to
&.
I don't know what is technically correct, or whether both are correct. Once
they're in the dom, though, we can't tell the two cases apart; the parser makes
them equivalent, so we have to choose one or the other form on output.
Comment 12•25 years ago
|
||
akk, the output is plaintext - the & is wrong.
Daniel, I enabled some debugging output and is seems that the libmime
flowed->HTML converter is not hit at all during send (which is correct).
Comment 13•25 years ago
|
||
The HTML srializer will convert '&' to '&', and doing that is absolutely
corect, not doing that conversion will produce non-wellformed HTML (ha ha, I
know, there's no well-formed HTML:-). If the plaintext serializer grabs
attribute values it should unescape the value before using it.
Comment 14•25 years ago
|
||
Accepting. I am confuse, is it a real bug?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9
Comment 15•25 years ago
|
||
> is it a real bug?
Yes. We send out plaintext. "&" is valid in URLs - no need to
escape. Actually the way we escape gives a wrong URL, since "&" is
HTML-escaping, not URL-escaping. I guess the only reason why it works
nevertheless in Mozilla (you can click on either link and get to the same page)
is that Mozilla is "smart" and fixes up the URL.
The only question is *where* the bug is. Is it in ScanHTML, ScanTXT, Compose, ...?
Comment 16•25 years ago
|
||
Daniel, akk, could HTML->TXT be the problem? Here's the code:
Opening a tag:
nsAutoString url;
if (NS_SUCCEEDED(GetAttributeValue(nsHTMLAtoms::href, url))
&& !url.IsEmpty()) {
mURL = url;
}
Closing a tag:
nsAutoString temp;
temp.AssignWithConversion(" <");
temp += mURL;
temp.AppendWithConversion(">");
Write(temp);
mURL.Truncate();
Does |GetAttributeValue| or |Write| any HTMLunescaping, i.e. "&" -> "&"? If
not, that would be the culprit.
Comment 17•25 years ago
|
||
I don't believe Write does any unescaping. GetAttributeValue calls mContent's
GetAttribute and mParserNode's GetValueAt, neither of which, AFAIK, should do
any unescaping.
Comment 18•25 years ago
|
||
Based on akk's comments, over to DOM->TXT.
BTW: How do we unescape normal text in the converter? Didn't find any Unescape
(*scape*) function call.
Assignee: ducarroz → akkana
Status: ASSIGNED → NEW
Component: Composition → DOM to Text Conversion
Product: MailNews → Browser
QA Contact: esther → sujay
Comment 19•25 years ago
|
||
The parser splits up text containing entitys in parts so foo<bar would enter
the converter as
text: "foo"
entity: "lt"
text: "bar"
You will se an entity handler in DoAddLeaf (IIRC).
Comment 20•25 years ago
|
||
Anthony is taking over dom-text conversion. I really need to figure out who to
ask to change the default component owner. :-)
Any unescaping in the plaintext serializer comes from the charset converters, AFAIK.
Assignee: akkana → anthonyd
Comment 21•25 years ago
|
||
this should be resolved for moz0.9
Comment 22•25 years ago
|
||
reassign to myself since anthonyd has a large buglist
Assignee: anthonyd → brade
Updated•24 years ago
|
Target Milestone: mozilla0.9 → mozilla0.9.1
Comment 23•24 years ago
|
||
based on RFC1738, & when it's not a delimiter should be %26.
If it's a delimiter, then we need to do the html-escaping (&)
Comment 25•24 years ago
|
||
> based on RFC1738, & when it's not a delimiter should be %26.
Please quote the exact sentence, with paragraph number. I think, you are wrong.
It must be %26, if it isn't used as query specifier. In this case (the common
case), it is a query specifier, so the "?" is completely legal. See RFC2396
(which updates RFC1793 - please use the former, not the latter), Section 3..
REOPENing.
> If it's a delimiter, then we need to do the html-escaping (&)
We're talking about the case where it is *not* a delimiter.
> verified
Did you actually check the RFC?
> in 3/5 build.
That's complete nonsense.
Status: VERIFIED → REOPENED
Resolution: INVALID → ---
Comment 26•24 years ago
|
||
reassign to beppe for explanation/resolution
Ben--in the case where it is *not* a delimiter, then the user should have entered
"%26" in the url. Am I mistaken?
Assignee: brade → beppe
Status: REOPENED → NEW
Comment 27•24 years ago
|
||
item 1:
Reserved:
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "@", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.
Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.
Item 2:
3.1. Common Internet Scheme Syntax
While the syntax for the rest of the URL may vary depending on the
particular scheme selected, URL schemes that involve the direct use
of an IP-based protocol to a specified host on the Internet use a
common syntax for the scheme-specific data:
//<user>:<password>@<host>:<port>/<url-path>
Some or all of the parts "<user>:<password>@", ":<password>",
":<port>", and "/<url-path>" may be excluded. The scheme specific
data start with a double slash "//" to indicate that it complies with
the common Internet scheme syntax. The different components obey the
following rules:
user
An optional user name. Some schemes (e.g., ftp) allow the
specification of a user name.
password
An optional password. If present, it follows the user
name separated from it by a colon.
The user name (and password), if present, are followed by a
commercial at-sign "@". Within the user and password field, any ":",
"@", or "/" must be encoded.
NOTE: ampersand is not listed
Item 3:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1
NOTE: for user and password the & is present, and must be encoded
Item 4:
; FTP (see also RFC959)
ftpurl = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath = fsegment *[ "/" fsegment ]
fsegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype = "A" | "I" | "D" | "a" | "i" | "d"
; FILE
fileurl = "file://" [ host | "localhost" ] "/" fpath
; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
NOTE: & in fsegment,ftptype, hsegment and search, and must be encoded
Item 5:
; NEWS (see also RFC1036)
newsurl = "news:" grouppart
grouppart = "*" | group | article
group = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host
NOTE: the ampersand is within article, and must be encoded
Item 6:
; PROSPERO
prosperourl = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath = psegment *[ "/" psegment ]
psegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec = ";" fieldname "=" fieldvalue
fieldname = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue = *[ uchar | "?" | ":" | "@" | "&" ]
NOTE: the ampersand is within psegment, fieldname and fieldvalue, and must be
encoded
Item 7:
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
NOTE: the & is listed as a reserved value -- now go back up and re-read item 1
To further illustrate:
2.1. The main parts of URLs
A full BNF description of the URL syntax is given in Section 5.
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed
by a colon and then a string (the <scheme-specific-part>) whose
interpretation depends on the scheme.
Scheme names consist of a sequence of characters. The lower case
letters "a"--"z", digits, and the characters plus ("+"), period
("."), and hyphen ("-") are allowed. For resiliency, programs
interpreting URLs should treat upper case letters as equivalent to
lower case in scheme names (e.g., allow "HTTP" as well as "http").
Thus, if the & is being utilized as a delimiter, it must be escaped,
consequently not be translated when passing through necko. If the & is used for
any other purpose, it must be encoded so it does get translated when passing
through necko
Status: NEW → RESOLVED
Closed: 24 years ago → 24 years ago
Comment 28•24 years ago
|
||
First, with "delimiter", I mean a charater that shows where the URL starts and
ends, i.e. is not part of the URL, i.e. is part of the normal text. The usual
delimiters are "<" and ">", e.g. <http://www.mozilla.org>.
RFC2396, 2.4.3.:
> The angle-bracket "<" and ">" and double-quote (") characters are
> excluded because they are often used as the delimiters around URI in
> text documents and protocol fields. The character "#" is excluded
> because it is used to delimit a URI from a fragment identifier in URI
> references (Section 4). The percent character "%" is excluded because
> it is used for the encoding of escaped characters.
> delims = "<" | ">" | "#" | "%" |
If you mean "delimiter" = HTTP query variable separator, e.g. "&" in
<http://www.mozilla.org/cgi/script.pl?var1=string1&var2=string2>, then, yes,
we're talking about a "delimiter" in this bug.
Second, apart from the discussion if to encode or not, even *if* you had to
encode the "&", using "&" would be wrong: It is HTML-encoding, while we're
talking about a URL in plaintext. *If* you'd encode, you'd have to use %26 or
similar. So, this is a bug in *any* case.
You quote RFC1738:
> Reserved:
> Many URL schemes reserve certain characters for a special meaning:
> their appearance in the scheme-specific part of the URL has a
> designated semantics.
"&" is such a "certain", "reserved" charater for the HTTP scheme - it separates
query variables.
> If the character corresponding to an octet is
> reserved in a scheme, the octet must be encoded.
...for the non-reserved use! If you encoded the variable separator "&", how
would you represent a "&" in the variable content? You'd have to encode it, too,
and it would be impossible to tell, if the encoded "&" was meant as variable
content or separator. To avoid exactly *that* is the whole point of the
encoding! The variable separator appears as raw "&", while "&" as variable
content has to be encoded as %26 or similar.
> The characters ";",
> "/", "?", ":", "@", "=" and "&" are the characters which may be
> reserved for special meaning within a scheme.
But, again, please do not use RFC1738, it is outdated. In fact, RFC2396 is much
more clear:
> 2.2. Reserved Characters
> Many URI include components consisting of or delimited by, certain
> special characters. These characters are called "reserved", since
> their usage within the URI component is limited to their reserved
> purpose. If the data for a URI component would conflict with the
> reserved purpose, then the conflicting data must be escaped before
> forming the URI.
> reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
> "$" | ","
Basically exactly what I said above.
> 3.4. Query Component [...]
> query = *uric
> Within a query component, the characters ";", "/", "?", ":", "@",
> "&", "=", "+", ",", and "$" are reserved.
REOPENing.
Status: RESOLVED → REOPENED
Comment 29•24 years ago
|
||
and reclosing
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → INVALID
Comment 30•24 years ago
|
||
Huh? Please say why I am wrong.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment 31•24 years ago
|
||
I must say I find it pretty barfling that you just reclose, completely ignoring
my well-founded comment.
Comment 32•24 years ago
|
||
ok, let me try this one last time before I reclose the bug. From a user
perspective, when they are entering a value they are normally doing that in some
sort of dialog such as the link dialog. As the user types, the expectation is
that they will actually see the ampersand. The reason I bring this up, is
because we need to account for that instance during this whole process. So, if a
user types in (or copies) the following string into such a dialog:
http://ruby:ban&far@www.foo.bar/foo.html?p="1"&q="2"&r="bert&ernie"
the user would expect to see just that, however, in the backend there will be a
need for the translations. Based on the above example, the ampersand (&) is
being utilized in three different instances. The first instance is within a
password and is treated as string value, the second instance in which it is
being used is as a delimiter, the thrid and last instance in which it is being
used is within an attribute value and is again treated as a text string value.
The ampersand cannot be used in any other fashion within the url -- this is it,
no other uses in the url string.
In such cases where the ampersand is being used as a text string value: it MUST
BE encoded to %26. It is done so, so it is not translated when retrieving the
value.
In such cases where it is used as a delimiter: it MUST BE html-entitied (&
or &#nn;). It is done so, so it IS translated when retrieving the value.
In either case, there is an additional translation required for rendering to the
user.
Now, hopefully for the last time -- if the ampersand is not being used as a
delimiter -- which would mean that no matter where it is placed in the string,
if it not specifially being used as the delimiter, it is translated as text, if
it is translated as text, then it gets encoded - that means if it is at the
beginning, the middle or the end -- doesn't matter.
Closing yet again
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → INVALID
Updated•24 years ago
|
Whiteboard: [nsbeta1+] → [nsbeta1-]
Comment 33•24 years ago
|
||
> let me try this one last time before I reclose the bug.
You seem to be very sure that you are right and I am too dumb to understand you.
Somebody else has an opinion here?
> the second instance in which it is being used is as a delimiter
(variable separator)
> In such cases where it is used as a delimiter: it MUST BE html-entitied
> (& or &#nn;).
Why? How do you come to HTML? This is a URL, sent in *plaintext*. No HTML at all
here. URLs are not "HTML-entitied". Plaintext isn't either.
*If* were talking about HTML source code, you were right - & has spepcial
memaning in HTML and thus must be "entitied" in HTML. But "&" appears in the
plaintext, on-the-wire msg source.
Comment 34•24 years ago
|
||
I'm was not alluding that you're dumb or any such thing. why you interpreted the
comment as such is an incorrect interpretation.
I was addressing this issue via html based context and in that context, it is
working correctly. However, in the context of plaintext mail, what I believe is
really happening is that the string (the url) is actually being filtered as if
it is an anchor. When you receive the mail message, the string is viewed as a
link -- do you see that? So, what I think is happening is that on send, the url
is being wrapped within an anchor, which in turn is/was getting escaped and what
you saw is happening because of that. However, using the current builds, the
string is still embedded in an anchor, but the last ampersand is being
displayed correctly. In that respect, the delimited (variable separator) is
being handled correctly.
If in the plaintext context, you are correct in that it should not be escaped,
but I do not think that we are actually doing plaintext in that sense, I believe
there is still some level of html.
variable separator == delimiter
Comment 35•24 years ago
|
||
beppe, do the following:
Reproduction:
1. Go to <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
2. File|Send Link...
3. Add yourself as recipient
4. Send, as Plaintext Only (this is the default, at least in Mozilla)
5. Recieve mail, in any mailer, e.g. Mozilla or 4.x
6. View Source
Actual result:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
Expected result:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
As for *why* this happens, see the discussion above.
Comment 36•24 years ago
|
||
I did the test and I do not see the & in the source -- regardless if I open
it in mozilla or 4.x -- In fact, if it is sent via plaintext, in the source
there is only this:
This is a multi-part message in MIME format.
--------------030401010301090802020903
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
--------------030401010301090802020903
However, in HTML compose, the source looks like this:
This is a multi-part message in MIME format.
--------------010300030303040103020804
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
Notice, that the html compose source is exactly what was mentioned in this bug
-- not the plaintext. Also note, that the second url line is wrapped in the < >
delimiters, and is treated as if it is the content (value) of the href
attribute.
What does this lead me to believe: that the issue is within html compose and not
plaintext compose, due to 1. the contstructs of the source within the plaintext
and html files verses the discussion and samples in this bug are consistent
with the source of html compose, 2. within the html source, the 2nd url is
treated as a link (easily tested by displaying the mail and mousing over the
line).
Consequently, since the first line being displayed is the text string of the
url, and the ampersand is not encoded, it is correct. The second url is being
treated as the content of the href atribute, the encoding of the ampersand is
correct.
Comment 38•24 years ago
|
||
beppe,
I'm sorry that I didn't give very exact instructions for reproduction. Next try:
Reproduction:
1. Set your prefs to use the HTML composer
2. Go to <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
3. File|Send Link...
4. Add yourself as recipient
5. Options|Format|Plaintext Only
6. Send
5. Recieve mail, in any mailer, e.g. Mozilla or 4.x
6. View Source
Yes, you do have to use the HTML composer. Nevertheless, the HTML composer can
send as Plaintext, after the HTML is converted. This is the default, at least in
Mozilla: Compose in HTML editor, send as Plaintext.
No matter how you compose is, if you send as plaintext, it has to be plaintext.
An "&" is completely misplaced in plaintext.
Yes, IIRC, the bug comes from "&" being escaped in the href attribute. But that
doesn't change the fact that the endresult is bogus and thus this is a bug.
As for where exactly the bug has to be fixed, see above.
Comment 39•24 years ago
|
||
I just tried Ben's instructions: the mail I got back, viewed in a plaintext mail
UA, said:
http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1
<http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
If I tried to go to the URL inside the angle brackets, I would go to the wrong
URL. So I have to agree with BenB that this is still a bug. Beth, am I missing
something?
Ben, any idea what the right fix for this is? Perhaps the plaintext serializer
is doing entity conversion and shouldn't? But I don't see anywhere in
nsPlaintextSerializer where it's doing entity conversion except for entity tags
(which obviously aren't the problem since this is inside an anchor tag).
Doesn't look like the plaintext serializer ever calls the charset encoder
either. Not sure where this conversion is happening if it isn't in the html
source inside the edit window (Beth, how are you viewing the source? I wish
mail would enable the debug menu in their compose window, or offer a source pane
tab).
Status: VERIFIED → REOPENED
Resolution: INVALID → ---
Comment 40•24 years ago
|
||
> Ben, any idea what the right fix for this is?
See the comments from 2000-12-08 17:45 to 2000-12-09 10:43.
I suggest to put some code that HTML-unescapes into
|nsPlaintextSerializer::GetAttributeValue|. (The actualy unescaping should
probably be written as a function in a more central place and just be called
from |GetAttributeValue|.)
Comment 41•24 years ago
|
||
Actually, I have an even simpler reproduction, which surely doesn't involve the
TXT->HTML converter:
1. Open HTMl composer
2. Enter "bla1", select it
3. Creat Link, with link-target (href) "bla&bla"
4. Options|Format|Plaintext Only
5. Send
6. Recieve with any mailer
7. View Source
Acutal result:
bla1 <bla&bpa>
Expected result:
bla1 <bla&bpa>
Comment 42•24 years ago
|
||
If I follow Ben's simpler repro instructions in the html editor (not mail
compose), so I can get the Debug menu and see what's going on, if I then Output
Text, I see:
bla <bla&bla>
If I follow the longer (mail send) instructions and check the value of url at
line 672 of nsPlaintextSerializer, just before it sets mURL = url, I see that
the & is already there. url came straight from mParserNode->GetValueAt(i)
and hasn't gone through any extra url escapes. This suggests that the mail
(libmime?) code which inserts the initial text into the compose window did the
url escaping.
That's at odds with Ben's latest experiment, though, where he typed the value in
to the link dialog in html compose. His experiment suggested to me (since I
didn't see the same thing happening in Output Text from the editor) that the
problem likely had something to do with the fact that mail saves as html, then
reads the html back in and converts it to plaintext, instead of just writing it
as plaintext in the first place.
In any case, I'm leaning strongly toward thinking this is a mail problem, unless
someone can find a way to demonstrate it in the editor without going through
mail code at all. I can't seem to repro it in the editor.
Comment 43•24 years ago
|
||
ok, yes that one step is quite important (Options|Format|Plaintext), if
selecting that option is indeed supposed to convert the content of the message
to plaintext, then yes that is an issue. But it is quite evident that that is
not what is happening (the content of the web page is sent as an inline html
file and should not if it truly is plaintext). I contacted Seth and have asked
him to take a look at this (will also cc him now).
Comment 44•24 years ago
|
||
forgot to mention, if that is the case, then this should be assigned to mail
team to work out the conversion issues.
Updated•24 years ago
|
Component: DOM to Text Conversion → Mail Window Front End
Product: Browser → MailNews
Summary: URL is htmlified (& -> & etc.) → Options|Format|Plaintext is not converting contents of HTML compose message to plaintext
Comment 45•24 years ago
|
||
reassigning to the mail team. The real problem:
Options|Format|Plaintext is not converting contents of HTML compose message to
plaintext
updating Summary with the correct issue
Comment 46•24 years ago
|
||
reassigning to mail team
Assignee: beppe → sspitzer
Status: REOPENED → NEW
QA Contact: sujay → esther
Comment 47•24 years ago
|
||
beppe, no, the msg is converted. You did "Send Page", not "Send Link". Even if
you do Send Page, the behaviour is correct, because the page is an attachment,
and the conversion applies only to the main body, i.e. the text you write. This
is exactly the same as 4.x. Restoring SUMMARY.
Summary: Options|Format|Plaintext is not converting contents of HTML compose message to plaintext → URL is htmlified (& -> & etc.)
Comment 48•24 years ago
|
||
no Ben, when you send page with the initial mail message set to plaintext it
works just fine, when you set the default method to be html and then select via
the options menu it doesn't work fine -- get it?
changing the summary back to THE CORRECT SUMMARY STATEMENT
Summary: URL is htmlified (& -> & etc.) → Options|Format|Plaintext is not converting contents of HTML compose message to
Comment 49•24 years ago
|
||
Beth and I talked about it and agreed that it is converting to plaintext, it
just isn't doing it correctly (since it's expanding the entities). Seth can
change the summary to whatever he feels best describes the bug.
One interesting thing I just discovered: when I view a folder that has a message
like this, I see both instances of the url linkified (the first one, which
corresponds to the link text, and the second one in angle brackets, which is the
real url) and the one in angle brackets still displays with the & in it; but
if I click on the one in angle brackets, it removes the & before it goes to
the link, so it goes to the right link. Perhaps someone noticed this problem
earlier, and wrote a workaround so that the links would work right, but forgot
that they'd still look wrong and they wouldn't work in other mail clients. (And
if someone ever did want to mail a link that had & in it, it wouldn't work
right because it would be converted to &.)
Summary: Options|Format|Plaintext is not converting contents of HTML compose message to → Options|Format|Plaintext is not converting contents of HTMLcompose message to
Comment 50•24 years ago
|
||
reassigning to ducarroz and marking nsbeta1+
Whiteboard: [nsbeta1-] → [nsbeta1+]
Comment 53•24 years ago
|
||
CCing harishd and jst.
Comment 54•24 years ago
|
||
AFAICT (from looking at the code and discussing this with harishd and varada)
this problem is in the HTML serializer (and is visible in the HTML editor as
well), the seiralizer is converting '&'s in anchor href attributes into '&'.
Attributes that are url's need to be special cased for HTML to not do the entity
expansion since that's just the way they work. Reassigning to anthonyd, fixing
this sohuld be fairly straight forward in the HTML serializer. Also cc:ing
nhotta since he's working on related changes to getting the characterset of
href's correct, or something.
Assignee: varada → anthonyd
Comment 55•24 years ago
|
||
Comment 56•24 years ago
|
||
Johnny: did you read my comment of 2001-03-08 14:35 ? This does not happen when
we go straight from the editor to the plaintext serializer, only when we go
through mailnews to the html serializer, then re-parse, after which the extra
characters are already in the dom and the serializer can't do anything about that.
That convinces me that the serializer is doing the right thing, and the problem
happens when mail/news sends the contents back around to be re-parsed and then
re-serialized.
Are you seeing something I'm not? Have you come up with a way to reproduce this
without going through mime code?
Comment 57•24 years ago
|
||
Akkana, start composer, click on the intsert link button, enter "foo" as the
text to show in the link and type "http://foo.bar/foo.html&a=b" as the href and
insert the link. Then save the file and look at what the serializer produced,
you'll see
<a href="http://foo.bar/foo.html&a=b">foo</a>
when it should've been
<a href="http://foo.bar/foo.html&a=b">foo</a>
Comment 58•24 years ago
|
||
jst, I believe that the current behaviour you describe is correct. [Note for
all: I am discussing that with jst in private mail, will log results here.]
Anyways, I think you can reproduce this bug even without URLs (so changing the
HTML serializer for URLs only won't fix this bug completely):
Reproduction:
1. New mail
2. Insert Image
3. Enter as alternative text: "An image & a test".
4. Send as 'Plaintext and HTML'.
5. Recieve
6. View Source
Actual result (similar to):
[...]
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
ghkfguk [an image & a text]
[...]
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit
<html><head></head><body>ghkfguk<img
src="cid:part1.03000002.09080505@netscape.com" alt="an image & a text"
width="191" height="79"></body></html>
[...]
I.e. in both the plaintext part and the HTML part, an "&" appears for "&".
Expected result:
[...]
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
ghkfguk [an image & a text]
[...]
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit
<html><head></head><body>ghkfguk<img
src="cid:part1.03000002.09080505@netscape.com" alt="an image & a text"
width="191" height="79"></body></html>
[...]
I.e. in the HTML part, an "&" appears for "&", while in the plaintext part,
a normal "&" appears.
This shows that this bug has nothing to do with URL-encoding at all, although
URLs are the worst and most common case.
Comment 59•24 years ago
|
||
Yeah, you're right, this is not specific to url attributes, it applies to all
attributes, and my suggestion would only fix this for href attributes. My
suggestion would fix the bug, bit it would make the serializer produce
non-standard (non-well formed if you like) HTML, but all browsers out there
deals with '&'s in href attributes correctly, I bet the common useage on the web
is to not encode '&' in an URL as '&', but it's still wrong...
So the problem here is probably that the serializer needs to expand entities in
attribute values when it's driven by the contentsink API (whihc it really
shouldn't be, but oh well), just like the HTML content sink does
(HTMLContentSink::GetAttributeValueAt()), the parser is really the one that
should be doing this, but for some odd reason it doesn't, it leaves it up to the
sink to do it.
So it seems like the real fix for this problem, in the long run, is to make
mailnews use the plaintext serializer to generate the plaintext output directly
from the DOM in stead of first serializing it to HTML and then doign a HTML to
text conversion (which is an expensive operation). But in the short run the fix
is to make the plaintext serializer expand entities in attribute values that are
gotten from the parser.
Comment 60•24 years ago
|
||
reassigning back to the mail team, as suggested back on 09-03-2001, the
Format|Options|Plaintext option needs to do the correct thing
Assignee: anthonyd → ducarroz
Comment 61•24 years ago
|
||
Adding extra code to "fix" the serializer (which one, html or plaintext?) to do
the wrong thing intentionally, to compensate for problems in the interaction
between incorrect mail code mail and the parser caused by doing extra work that
we know we shouldn't be doing, seems like the wrong approach. Please, let's not
do that.
Assignee | ||
Comment 62•24 years ago
|
||
It looks like the entity reduction should happen in the parser than in the sinks
( currently only the HTMLContentSink supports this ). So, moving over to the
parser and will come up with a fix based on risk factors.
Assignee: ducarroz → harishd
Component: Mail Window Front End → Parser
Product: MailNews → Browser
Comment 63•24 years ago
|
||
Akkana, I wasn't suggesting we'd add code to the serializers (plaintext) to do
the wrong thing, what I suggested was to add code to the plaintext serializer to
make it work correctly in the current model (that we've lived with for 3 years
now, but the model is IMO wrong) where the parser doesn't expand entities in
attribute values. To fix the plaintext serializer we wouldn't even need to write
any new code, we could use the code in the HTML content sink. I'd feel better
about adding the code to the plaintext serializer for now and not change the
parser since that change can and will change the interaction with the HTML sink
so we could screw up normal parsing, who knows what will regress, and we're
getting close to a critical milestone.
Harish is looking into changing the parser to do the right thing, and if that
change doesn't look to scary then I'm ok with that, but if that gets complicated
then I'd say we should fix the plaintext serializer (which is trivial).
In either case, this is not a mailnews problem, they might be using an API to
the serializer that they shouldn't be using but since that API exists, it should
work correctly.
Assignee | ||
Comment 65•24 years ago
|
||
Assignee | ||
Comment 66•24 years ago
|
||
Assignee | ||
Comment 67•24 years ago
|
||
I have moved the attribute entity parsing code from the sinks to the parser. The
patch also covers the issue raised by dbaron in bug 63188 ( where the parameter
aMode was being used inconsistently ). This change in addition to fixing the bug
should also enhance performance, a little bit ( sorry I don't have the numbers
..yet), since it completely avoids attribute parsing in the sink.
Status: NEW → ASSIGNED
You could add a little whistespace to make things more readable, for example on
these lines:
mFlags |= (aCommand==eViewSource)?
NS_IPARSER_FLAG_VIEW_SOURCE:NS_IPARSER_FLAG_VIEW_NORMAL;
isUsableAttr=!(mFlags &
(NS_IPARSER_FLAG_STRICT_MODE|NS_IPARSER_FLAG_TRANSITIONAL_MODE));
This comment does not make sense:
//Enabling strict comment parsing for Bug 53011 and 2749 contradicts!!!!
In ConsumeAttributeEntity() you have two declarations for theNCRValue. Easiest
would be to just use the outer, I guess. There are some other things in that
function I do not really like, but since they were copied from the old
implementation they can stay unless you want to do extra work (cast of PRInt32
to PRUnichar, ignoring error value from ToInteger).
Also I have trouble reading this:
+ // Nav 4.x does not support entities greater than 255 however IE
+ // does support them if terminated with semicolon. Let's do the same...
+ // ...atleast in the quirks mode.
+ if (-1!=theNCRValue &&
+ (theNCRValue<255 || !(aFlag & NS_IPARSER_FLAG_QUIRKS_MODE) ||
theTermChar == ';')) {
+ aString.Append(PRUnichar(theNCRValue));
To me the comment and code don't seem to match, and I think you changed how it
works from the old code. How would this work:
+ if ( (theNCRValue > 255) && (theTermChar == ';') &&
(aFlag & NS_IPARSER_FLAG_QUIRKS_MODE) ) {
+ aString.Append(PRUnichar(theNCRValue));
Comment 69•24 years ago
|
||
> // Nav 4.x does not support entities greater than 255 however IE
> // does support them if terminated with semicolon. Let's do the same...
> // ...atleast in the quirks mode.
HTML 4.0 spec <http://www.w3.org/TR/REC-html40/charset.html#h-5.3>:
> Note. In SGML [, on which HTML is based], it is possible to eliminate the
> final ";" after a character reference in some cases (e.g., at a line break
> or immediately before a tag). In other circumstances it may not be
> eliminated (e.g., in the middle of a word). We strongly suggest using
> the ";" in all cases to avoid problems with user agents that require this
> character to be present.
Assignee | ||
Comment 70•24 years ago
|
||
Heikki, I don't think that the logic is incorrect but will try to simplify it
further so that it can be more readable.
Assignee | ||
Comment 71•24 years ago
|
||
Ben: Are you suggesting that we should always look for ";", for entities greater
than 255, whether the mode is strict or not?
According to the spec., from what I understand, the semicolon is not a MUST but
strongly recommends using it. So, according to my comment, mozilla will not care
for a semicolon, for entity values greater than 255, in strict mode. That is,
in strict mode, the entities would get reduced irrespective of a semicolon.
Comment 72•24 years ago
|
||
harishd, I don't know enough about the actual web (pages) to make a good
suggestion, but I would try to allow what SGML allows, at least in strict mode.
The "recommondation" is for web authors to guard them from non-SGML-complying
user agents (what Mozilla would be with the logic in your patch, if I interpret
it correctly).
Assignee | ||
Comment 73•24 years ago
|
||
Assignee | ||
Comment 74•24 years ago
|
||
Assignee | ||
Comment 75•24 years ago
|
||
Comment 76•24 years ago
|
||
sr=jst
Assignee | ||
Comment 77•24 years ago
|
||
Fix is in. Marking FIXED.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Comment 78•24 years ago
|
||
Using build 2001-05-22 on win, linux and mac this is fixed per original
scenario. the & is not in the viewed msgs after doing a send page in html
and or plain text. Side note: when viewing this page with linux I sometimes
crash. We had crashing bugs in today's build with links and viewing msgs. I
will check this page again tomorrow (with a fix) to see if it still crashes. If
so, I will log a new bug.
Verified.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•