Closed Bug 57248 Opened 25 years ago Closed 24 years ago

Options|Format|Plaintext is not converting contents of HTMLcompose message to

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla0.9.1

People

(Reporter: bugzilla, Assigned: harishd)

References

()

Details

(Whiteboard: [nsbeta1+])

Attachments

(5 files)

Go to: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 Choose File -> Send Page The Compose window now correctly says: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 Send the mail to yourself. In the mail you revieve it says: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1> Note the &amp; in the last link! That's dead wrong!
Dont delete the URL...:)
I'm investigating.
FYI: When reading the mail in Netscape 4.75 it says: <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1>
No, it doesn't, not for me. The msg source is wrong already, and 4.x doesn't have a chance to recover from that, just like our viewer. It is a problem during composition. -> Composition. This bug does already appear on a 2000-09-13 build, so it is not caused by NOXIF. If you do Debug|OutputText during composition, everything is fine.
.
Assignee: putterman → ducarroz
Component: Mail Window Front End → Composition
Update: The last sentence suggests that this is a bug in Composition. We seem have a very odd data path during send (somthing like HTML->TXT->HTML->TXT or similar*), which, IIRC, exposed various bugs already (which is partly good and partly bad). IMO, it is time to clean this up. The only thing I am wondering is what exactly inserts the &amp;. This is propably an additional bug, because, in a perfect world, HTML->TXT->HTML->TXT would produce the same as HTML->TXT. *IIRC, in previous bugs, we found that it is not just HTML/TXT conversions - there was some other converter involved, but I don't remember the details. Do you akk? Or am I erronously confusing this with the bugs we found during reply?
Summary: URL is htmlified (& = &amp;, etcs..) in revieved mails that contains attached webpages → URL is htmlified (& -> &amp; etc.)
The path followed by entity conversion changed with the noxif landing. This bug was fixed at one point; perhaps it has regressed. Adding jst since he knows the current status of entity handling in the output converters.
akk, I don't understand. And I said, I see the exact same behavious in an old build.
Isn't this the same issue as in bug 44372 (entities inside tags not converted properly?) That was fixed some time ago, so if this is the same issue, it would help to know what regressed. What does the href look like if you do Debug->Output HTML during composition?
> Isn't this the same issue as in bug 44372 (entities inside tags not converted > properly?) Dunno. Note that here, the part in the tag has &amp;. We has a similar bug (tags not processed corrently) in libmime's flowed->HTML converter, but that was supposed to be fixed, too. Ccing Daniel Bratell nevertheless. > What does the href look like if you do Debug->Output HTML during composition? <a href="http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1">http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1</a> (which is very odd, because the link is not blue in the HTML composer). OutputText gives << http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 >> because the converter correctly recognized that text and href are identical.
Oh, I think I get it. This is the opposite of bug 44372. That bug was that entities like &amp; inside of a tags were getting converted to &, which was not considered to be correct html. This bug is that you WANT the & inside the a tag and don't want it converted to &amp;. I don't know what is technically correct, or whether both are correct. Once they're in the dom, though, we can't tell the two cases apart; the parser makes them equivalent, so we have to choose one or the other form on output.
akk, the output is plaintext - the &amp; is wrong. Daniel, I enabled some debugging output and is seems that the libmime flowed->HTML converter is not hit at all during send (which is correct).
The HTML srializer will convert '&' to '&amp;', and doing that is absolutely corect, not doing that conversion will produce non-wellformed HTML (ha ha, I know, there's no well-formed HTML:-). If the plaintext serializer grabs attribute values it should unescape the value before using it.
Accepting. I am confuse, is it a real bug?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9
> is it a real bug? Yes. We send out plaintext. "&" is valid in URLs - no need to escape. Actually the way we escape gives a wrong URL, since "&amp;" is HTML-escaping, not URL-escaping. I guess the only reason why it works nevertheless in Mozilla (you can click on either link and get to the same page) is that Mozilla is "smart" and fixes up the URL. The only question is *where* the bug is. Is it in ScanHTML, ScanTXT, Compose, ...?
Daniel, akk, could HTML->TXT be the problem? Here's the code: Opening a tag: nsAutoString url; if (NS_SUCCEEDED(GetAttributeValue(nsHTMLAtoms::href, url)) && !url.IsEmpty()) { mURL = url; } Closing a tag: nsAutoString temp; temp.AssignWithConversion(" <"); temp += mURL; temp.AppendWithConversion(">"); Write(temp); mURL.Truncate(); Does |GetAttributeValue| or |Write| any HTMLunescaping, i.e. "&amp;" -> "&"? If not, that would be the culprit.
I don't believe Write does any unescaping. GetAttributeValue calls mContent's GetAttribute and mParserNode's GetValueAt, neither of which, AFAIK, should do any unescaping.
Based on akk's comments, over to DOM->TXT. BTW: How do we unescape normal text in the converter? Didn't find any Unescape (*scape*) function call.
Assignee: ducarroz → akkana
Status: ASSIGNED → NEW
Component: Composition → DOM to Text Conversion
Product: MailNews → Browser
QA Contact: esther → sujay
The parser splits up text containing entitys in parts so foo&lt;bar would enter the converter as text: "foo" entity: "lt" text: "bar" You will se an entity handler in DoAddLeaf (IIRC).
Anthony is taking over dom-text conversion. I really need to figure out who to ask to change the default component owner. :-) Any unescaping in the plaintext serializer comes from the charset converters, AFAIK.
Assignee: akkana → anthonyd
this should be resolved for moz0.9
Keywords: correctness, nsbeta1
Priority: P3 → P2
Whiteboard: [nsbeta1+]
reassign to myself since anthonyd has a large buglist
Assignee: anthonyd → brade
Target Milestone: mozilla0.9 → mozilla0.9.1
based on RFC1738, & when it's not a delimiter should be %26. If it's a delimiter, then we need to do the html-escaping (&amp;)
Status: NEW → RESOLVED
Closed: 24 years ago
Keywords: correctness
Resolution: --- → INVALID
verified in 3/5 build.
Status: RESOLVED → VERIFIED
> based on RFC1738, & when it's not a delimiter should be %26. Please quote the exact sentence, with paragraph number. I think, you are wrong. It must be %26, if it isn't used as query specifier. In this case (the common case), it is a query specifier, so the "?" is completely legal. See RFC2396 (which updates RFC1793 - please use the former, not the latter), Section 3.. REOPENing. > If it's a delimiter, then we need to do the html-escaping (&amp;) We're talking about the case where it is *not* a delimiter. > verified Did you actually check the RFC? > in 3/5 build. That's complete nonsense.
Status: VERIFIED → REOPENED
Resolution: INVALID → ---
reassign to beppe for explanation/resolution Ben--in the case where it is *not* a delimiter, then the user should have entered "%26" in the url. Am I mistaken?
Assignee: brade → beppe
Status: REOPENED → NEW
item 1: Reserved: Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme. Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL. Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL. On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose. Item 2: 3.1. Common Internet Scheme Syntax While the syntax for the rest of the URL may vary depending on the particular scheme selected, URL schemes that involve the direct use of an IP-based protocol to a specified host on the Internet use a common syntax for the scheme-specific data: //<user>:<password>@<host>:<port>/<url-path> Some or all of the parts "<user>:<password>@", ":<password>", ":<port>", and "/<url-path>" may be excluded. The scheme specific data start with a double slash "//" to indicate that it complies with the common Internet scheme syntax. The different components obey the following rules: user An optional user name. Some schemes (e.g., ftp) allow the specification of a user name. password An optional password. If present, it follows the user name separated from it by a colon. The user name (and password), if present, are followed by a commercial at-sign "@". Within the user and password field, any ":", "@", or "/" must be encoded. NOTE: ampersand is not listed Item 3: ip-schemepart = "//" login [ "/" urlpath ] login = [ user [ ":" password ] "@" ] hostport hostport = host [ ":" port ] host = hostname | hostnumber hostname = *[ domainlabel "." ] toplabel domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit alphadigit = alpha | digit hostnumber = digits "." digits "." digits "." digits port = digits user = *[ uchar | ";" | "?" | "&" | "=" ] password = *[ uchar | ";" | "?" | "&" | "=" ] urlpath = *xchar ; depends on protocol see section 3.1 NOTE: for user and password the & is present, and must be encoded Item 4: ; FTP (see also RFC959) ftpurl = "ftp://" login [ "/" fpath [ ";type=" ftptype ]] fpath = fsegment *[ "/" fsegment ] fsegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ] ftptype = "A" | "I" | "D" | "a" | "i" | "d" ; FILE fileurl = "file://" [ host | "localhost" ] "/" fpath ; HTTP httpurl = "http://" hostport [ "/" hpath [ "?" search ]] hpath = hsegment *[ "/" hsegment ] hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ] search = *[ uchar | ";" | ":" | "@" | "&" | "=" ] NOTE: & in fsegment,ftptype, hsegment and search, and must be encoded Item 5: ; NEWS (see also RFC1036) newsurl = "news:" grouppart grouppart = "*" | group | article group = alpha *[ alpha | digit | "-" | "." | "+" | "_" ] article = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host NOTE: the ampersand is within article, and must be encoded Item 6: ; PROSPERO prosperourl = "prospero://" hostport "/" ppath *[ fieldspec ] ppath = psegment *[ "/" psegment ] psegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ] fieldspec = ";" fieldname "=" fieldvalue fieldname = *[ uchar | "?" | ":" | "@" | "&" ] fieldvalue = *[ uchar | "?" | ":" | "@" | "&" ] NOTE: the ampersand is within psegment, fieldname and fieldvalue, and must be encoded Item 7: reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" NOTE: the & is listed as a reserved value -- now go back up and re-read item 1 To further illustrate: 2.1. The main parts of URLs A full BNF description of the URL syntax is given in Section 5. In general, URLs are written as follows: <scheme>:<scheme-specific-part> A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme. Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http"). Thus, if the & is being utilized as a delimiter, it must be escaped, consequently not be translated when passing through necko. If the & is used for any other purpose, it must be encoded so it does get translated when passing through necko
Status: NEW → RESOLVED
Closed: 24 years ago24 years ago
First, with "delimiter", I mean a charater that shows where the URL starts and ends, i.e. is not part of the URL, i.e. is part of the normal text. The usual delimiters are "<" and ">", e.g. <http://www.mozilla.org>. RFC2396, 2.4.3.: > The angle-bracket "<" and ">" and double-quote (") characters are > excluded because they are often used as the delimiters around URI in > text documents and protocol fields. The character "#" is excluded > because it is used to delimit a URI from a fragment identifier in URI > references (Section 4). The percent character "%" is excluded because > it is used for the encoding of escaped characters. > delims = "<" | ">" | "#" | "%" | If you mean "delimiter" = HTTP query variable separator, e.g. "&" in <http://www.mozilla.org/cgi/script.pl?var1=string1&var2=string2>, then, yes, we're talking about a "delimiter" in this bug. Second, apart from the discussion if to encode or not, even *if* you had to encode the "&", using "&amp;" would be wrong: It is HTML-encoding, while we're talking about a URL in plaintext. *If* you'd encode, you'd have to use %26 or similar. So, this is a bug in *any* case. You quote RFC1738: > Reserved: > Many URL schemes reserve certain characters for a special meaning: > their appearance in the scheme-specific part of the URL has a > designated semantics. "&" is such a "certain", "reserved" charater for the HTTP scheme - it separates query variables. > If the character corresponding to an octet is > reserved in a scheme, the octet must be encoded. ...for the non-reserved use! If you encoded the variable separator "&", how would you represent a "&" in the variable content? You'd have to encode it, too, and it would be impossible to tell, if the encoded "&" was meant as variable content or separator. To avoid exactly *that* is the whole point of the encoding! The variable separator appears as raw "&", while "&" as variable content has to be encoded as %26 or similar. > The characters ";", > "/", "?", ":", "@", "=" and "&" are the characters which may be > reserved for special meaning within a scheme. But, again, please do not use RFC1738, it is outdated. In fact, RFC2396 is much more clear: > 2.2. Reserved Characters > Many URI include components consisting of or delimited by, certain > special characters. These characters are called "reserved", since > their usage within the URI component is limited to their reserved > purpose. If the data for a URI component would conflict with the > reserved purpose, then the conflicting data must be escaped before > forming the URI. > reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | > "$" | "," Basically exactly what I said above. > 3.4. Query Component [...] > query = *uric > Within a query component, the characters ";", "/", "?", ":", "@", > "&", "=", "+", ",", and "$" are reserved. REOPENing.
Status: RESOLVED → REOPENED
and reclosing
Status: REOPENED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → INVALID
Huh? Please say why I am wrong.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
I must say I find it pretty barfling that you just reclose, completely ignoring my well-founded comment.
ok, let me try this one last time before I reclose the bug. From a user perspective, when they are entering a value they are normally doing that in some sort of dialog such as the link dialog. As the user types, the expectation is that they will actually see the ampersand. The reason I bring this up, is because we need to account for that instance during this whole process. So, if a user types in (or copies) the following string into such a dialog: http://ruby:ban&far@www.foo.bar/foo.html?p="1"&q="2"&r="bert&ernie" the user would expect to see just that, however, in the backend there will be a need for the translations. Based on the above example, the ampersand (&) is being utilized in three different instances. The first instance is within a password and is treated as string value, the second instance in which it is being used is as a delimiter, the thrid and last instance in which it is being used is within an attribute value and is again treated as a text string value. The ampersand cannot be used in any other fashion within the url -- this is it, no other uses in the url string. In such cases where the ampersand is being used as a text string value: it MUST BE encoded to %26. It is done so, so it is not translated when retrieving the value. In such cases where it is used as a delimiter: it MUST BE html-entitied (&amp; or &#nn;). It is done so, so it IS translated when retrieving the value. In either case, there is an additional translation required for rendering to the user. Now, hopefully for the last time -- if the ampersand is not being used as a delimiter -- which would mean that no matter where it is placed in the string, if it not specifially being used as the delimiter, it is translated as text, if it is translated as text, then it gets encoded - that means if it is at the beginning, the middle or the end -- doesn't matter. Closing yet again
Status: REOPENED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → INVALID
Whiteboard: [nsbeta1+] → [nsbeta1-]
> let me try this one last time before I reclose the bug. You seem to be very sure that you are right and I am too dumb to understand you. Somebody else has an opinion here? > the second instance in which it is being used is as a delimiter (variable separator) > In such cases where it is used as a delimiter: it MUST BE html-entitied > (&amp; or &#nn;). Why? How do you come to HTML? This is a URL, sent in *plaintext*. No HTML at all here. URLs are not "HTML-entitied". Plaintext isn't either. *If* were talking about HTML source code, you were right - & has spepcial memaning in HTML and thus must be "entitied" in HTML. But "&amp;" appears in the plaintext, on-the-wire msg source.
I'm was not alluding that you're dumb or any such thing. why you interpreted the comment as such is an incorrect interpretation. I was addressing this issue via html based context and in that context, it is working correctly. However, in the context of plaintext mail, what I believe is really happening is that the string (the url) is actually being filtered as if it is an anchor. When you receive the mail message, the string is viewed as a link -- do you see that? So, what I think is happening is that on send, the url is being wrapped within an anchor, which in turn is/was getting escaped and what you saw is happening because of that. However, using the current builds, the string is still embedded in an anchor, but the last ampersand is being displayed correctly. In that respect, the delimited (variable separator) is being handled correctly. If in the plaintext context, you are correct in that it should not be escaped, but I do not think that we are actually doing plaintext in that sense, I believe there is still some level of html. variable separator == delimiter
beppe, do the following: Reproduction: 1. Go to <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1> 2. File|Send Link... 3. Add yourself as recipient 4. Send, as Plaintext Only (this is the default, at least in Mozilla) 5. Recieve mail, in any mailer, e.g. Mozilla or 4.x 6. View Source Actual result: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1> Expected result: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 As for *why* this happens, see the discussion above.
I did the test and I do not see the &amp; in the source -- regardless if I open it in mozilla or 4.x -- In fact, if it is sent via plaintext, in the source there is only this: This is a multi-part message in MIME format. --------------030401010301090802020903 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1> --------------030401010301090802020903 However, in HTML compose, the source looks like this: This is a multi-part message in MIME format. --------------010300030303040103020804 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1> Notice, that the html compose source is exactly what was mentioned in this bug -- not the plaintext. Also note, that the second url line is wrapped in the < > delimiters, and is treated as if it is the content (value) of the href attribute. What does this lead me to believe: that the issue is within html compose and not plaintext compose, due to 1. the contstructs of the source within the plaintext and html files verses the discussion and samples in this bug are consistent with the source of html compose, 2. within the html source, the 2nd url is treated as a link (easily tested by displaying the mail and mousing over the line). Consequently, since the first line being displayed is the text string of the url, and the ampersand is not encoded, it is correct. The second url is being treated as the content of the href atribute, the encoding of the ampersand is correct.
verified in 3/8 build.
Status: RESOLVED → VERIFIED
beppe, I'm sorry that I didn't give very exact instructions for reproduction. Next try: Reproduction: 1. Set your prefs to use the HTML composer 2. Go to <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1> 3. File|Send Link... 4. Add yourself as recipient 5. Options|Format|Plaintext Only 6. Send 5. Recieve mail, in any mailer, e.g. Mozilla or 4.x 6. View Source Yes, you do have to use the HTML composer. Nevertheless, the HTML composer can send as Plaintext, after the HTML is converted. This is the default, at least in Mozilla: Compose in HTML editor, send as Plaintext. No matter how you compose is, if you send as plaintext, it has to be plaintext. An "&amp;" is completely misplaced in plaintext. Yes, IIRC, the bug comes from "&" being escaped in the href attribute. But that doesn't change the fact that the endresult is bogus and thus this is a bug. As for where exactly the bug has to be fixed, see above.
I just tried Ben's instructions: the mail I got back, viewed in a plaintext mail UA, said: http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&test=1 <http://gemal.dk/browserspy/ws.cgi?server=http://gemal.dk&amp;test=1> If I tried to go to the URL inside the angle brackets, I would go to the wrong URL. So I have to agree with BenB that this is still a bug. Beth, am I missing something? Ben, any idea what the right fix for this is? Perhaps the plaintext serializer is doing entity conversion and shouldn't? But I don't see anywhere in nsPlaintextSerializer where it's doing entity conversion except for entity tags (which obviously aren't the problem since this is inside an anchor tag). Doesn't look like the plaintext serializer ever calls the charset encoder either. Not sure where this conversion is happening if it isn't in the html source inside the edit window (Beth, how are you viewing the source? I wish mail would enable the debug menu in their compose window, or offer a source pane tab).
Status: VERIFIED → REOPENED
Resolution: INVALID → ---
> Ben, any idea what the right fix for this is? See the comments from 2000-12-08 17:45 to 2000-12-09 10:43. I suggest to put some code that HTML-unescapes into |nsPlaintextSerializer::GetAttributeValue|. (The actualy unescaping should probably be written as a function in a more central place and just be called from |GetAttributeValue|.)
Actually, I have an even simpler reproduction, which surely doesn't involve the TXT->HTML converter: 1. Open HTMl composer 2. Enter "bla1", select it 3. Creat Link, with link-target (href) "bla&bla" 4. Options|Format|Plaintext Only 5. Send 6. Recieve with any mailer 7. View Source Acutal result: bla1 <bla&amp;bpa> Expected result: bla1 <bla&bpa>
If I follow Ben's simpler repro instructions in the html editor (not mail compose), so I can get the Debug menu and see what's going on, if I then Output Text, I see: bla <bla&bla> If I follow the longer (mail send) instructions and check the value of url at line 672 of nsPlaintextSerializer, just before it sets mURL = url, I see that the &amp is already there. url came straight from mParserNode->GetValueAt(i) and hasn't gone through any extra url escapes. This suggests that the mail (libmime?) code which inserts the initial text into the compose window did the url escaping. That's at odds with Ben's latest experiment, though, where he typed the value in to the link dialog in html compose. His experiment suggested to me (since I didn't see the same thing happening in Output Text from the editor) that the problem likely had something to do with the fact that mail saves as html, then reads the html back in and converts it to plaintext, instead of just writing it as plaintext in the first place. In any case, I'm leaning strongly toward thinking this is a mail problem, unless someone can find a way to demonstrate it in the editor without going through mail code at all. I can't seem to repro it in the editor.
ok, yes that one step is quite important (Options|Format|Plaintext), if selecting that option is indeed supposed to convert the content of the message to plaintext, then yes that is an issue. But it is quite evident that that is not what is happening (the content of the web page is sent as an inline html file and should not if it truly is plaintext). I contacted Seth and have asked him to take a look at this (will also cc him now).
forgot to mention, if that is the case, then this should be assigned to mail team to work out the conversion issues.
Component: DOM to Text Conversion → Mail Window Front End
Product: Browser → MailNews
Summary: URL is htmlified (& -> &amp; etc.) → Options|Format|Plaintext is not converting contents of HTML compose message to plaintext
reassigning to the mail team. The real problem: Options|Format|Plaintext is not converting contents of HTML compose message to plaintext updating Summary with the correct issue
reassigning to mail team
Assignee: beppe → sspitzer
Status: REOPENED → NEW
QA Contact: sujay → esther
beppe, no, the msg is converted. You did "Send Page", not "Send Link". Even if you do Send Page, the behaviour is correct, because the page is an attachment, and the conversion applies only to the main body, i.e. the text you write. This is exactly the same as 4.x. Restoring SUMMARY.
Summary: Options|Format|Plaintext is not converting contents of HTML compose message to plaintext → URL is htmlified (& -> & etc.)
no Ben, when you send page with the initial mail message set to plaintext it works just fine, when you set the default method to be html and then select via the options menu it doesn't work fine -- get it? changing the summary back to THE CORRECT SUMMARY STATEMENT
Summary: URL is htmlified (& -> & etc.) → Options|Format|Plaintext is not converting contents of HTML compose message to
Beth and I talked about it and agreed that it is converting to plaintext, it just isn't doing it correctly (since it's expanding the entities). Seth can change the summary to whatever he feels best describes the bug. One interesting thing I just discovered: when I view a folder that has a message like this, I see both instances of the url linkified (the first one, which corresponds to the link text, and the second one in angle brackets, which is the real url) and the one in angle brackets still displays with the &amp; in it; but if I click on the one in angle brackets, it removes the &amp; before it goes to the link, so it goes to the right link. Perhaps someone noticed this problem earlier, and wrote a workaround so that the links would work right, but forgot that they'd still look wrong and they wouldn't work in other mail clients. (And if someone ever did want to mail a link that had &amp; in it, it wouldn't work right because it would be converted to &.)
Summary: Options|Format|Plaintext is not converting contents of HTML compose message to → Options|Format|Plaintext is not converting contents of HTMLcompose message to
reassigning to ducarroz and marking nsbeta1+
Whiteboard: [nsbeta1-] → [nsbeta1+]
forgot to reassign
Assignee: sspitzer → ducarroz
Re-assigning to varada
Assignee: ducarroz → varada
CCing harishd and jst.
AFAICT (from looking at the code and discussing this with harishd and varada) this problem is in the HTML serializer (and is visible in the HTML editor as well), the seiralizer is converting '&'s in anchor href attributes into '&amp'. Attributes that are url's need to be special cased for HTML to not do the entity expansion since that's just the way they work. Reassigning to anthonyd, fixing this sohuld be fairly straight forward in the HTML serializer. Also cc:ing nhotta since he's working on related changes to getting the characterset of href's correct, or something.
Assignee: varada → anthonyd
I am changing the HTML serializer for bug 74137 to escape URI. For URI attributes like 'href' and 'src', do a charset conversion (if not ASCII) then apply URI encode (%xx). I think '&' will not be URL escaped so it still needs a special handling for that case separately from bug 74137.
Johnny: did you read my comment of 2001-03-08 14:35 ? This does not happen when we go straight from the editor to the plaintext serializer, only when we go through mailnews to the html serializer, then re-parse, after which the extra characters are already in the dom and the serializer can't do anything about that. That convinces me that the serializer is doing the right thing, and the problem happens when mail/news sends the contents back around to be re-parsed and then re-serialized. Are you seeing something I'm not? Have you come up with a way to reproduce this without going through mime code?
Akkana, start composer, click on the intsert link button, enter "foo" as the text to show in the link and type "http://foo.bar/foo.html&a=b" as the href and insert the link. Then save the file and look at what the serializer produced, you'll see <a href="http://foo.bar/foo.html&amp;a=b">foo</a> when it should've been <a href="http://foo.bar/foo.html&a=b">foo</a>
jst, I believe that the current behaviour you describe is correct. [Note for all: I am discussing that with jst in private mail, will log results here.] Anyways, I think you can reproduce this bug even without URLs (so changing the HTML serializer for URLs only won't fix this bug completely): Reproduction: 1. New mail 2. Insert Image 3. Enter as alternative text: "An image & a test". 4. Send as 'Plaintext and HTML'. 5. Recieve 6. View Source Actual result (similar to): [...] Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit ghkfguk [an image &amp; a text] [...] Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit <html><head></head><body>ghkfguk<img src="cid:part1.03000002.09080505@netscape.com" alt="an image &amp; a text" width="191" height="79"></body></html> [...] I.e. in both the plaintext part and the HTML part, an "&amp;" appears for "&". Expected result: [...] Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit ghkfguk [an image & a text] [...] Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit <html><head></head><body>ghkfguk<img src="cid:part1.03000002.09080505@netscape.com" alt="an image &amp; a text" width="191" height="79"></body></html> [...] I.e. in the HTML part, an "&amp;" appears for "&", while in the plaintext part, a normal "&" appears. This shows that this bug has nothing to do with URL-encoding at all, although URLs are the worst and most common case.
Yeah, you're right, this is not specific to url attributes, it applies to all attributes, and my suggestion would only fix this for href attributes. My suggestion would fix the bug, bit it would make the serializer produce non-standard (non-well formed if you like) HTML, but all browsers out there deals with '&'s in href attributes correctly, I bet the common useage on the web is to not encode '&' in an URL as '&amp;', but it's still wrong... So the problem here is probably that the serializer needs to expand entities in attribute values when it's driven by the contentsink API (whihc it really shouldn't be, but oh well), just like the HTML content sink does (HTMLContentSink::GetAttributeValueAt()), the parser is really the one that should be doing this, but for some odd reason it doesn't, it leaves it up to the sink to do it. So it seems like the real fix for this problem, in the long run, is to make mailnews use the plaintext serializer to generate the plaintext output directly from the DOM in stead of first serializing it to HTML and then doign a HTML to text conversion (which is an expensive operation). But in the short run the fix is to make the plaintext serializer expand entities in attribute values that are gotten from the parser.
reassigning back to the mail team, as suggested back on 09-03-2001, the Format|Options|Plaintext option needs to do the correct thing
Assignee: anthonyd → ducarroz
Adding extra code to "fix" the serializer (which one, html or plaintext?) to do the wrong thing intentionally, to compensate for problems in the interaction between incorrect mail code mail and the parser caused by doing extra work that we know we shouldn't be doing, seems like the wrong approach. Please, let's not do that.
It looks like the entity reduction should happen in the parser than in the sinks ( currently only the HTMLContentSink supports this ). So, moving over to the parser and will come up with a fix based on risk factors.
Assignee: ducarroz → harishd
Component: Mail Window Front End → Parser
Product: MailNews → Browser
Akkana, I wasn't suggesting we'd add code to the serializers (plaintext) to do the wrong thing, what I suggested was to add code to the plaintext serializer to make it work correctly in the current model (that we've lived with for 3 years now, but the model is IMO wrong) where the parser doesn't expand entities in attribute values. To fix the plaintext serializer we wouldn't even need to write any new code, we could use the code in the HTML content sink. I'd feel better about adding the code to the plaintext serializer for now and not change the parser since that change can and will change the interaction with the HTML sink so we could screw up normal parsing, who knows what will regress, and we're getting close to a critical milestone. Harish is looking into changing the parser to do the right thing, and if that change doesn't look to scary then I'm ok with that, but if that gets complicated then I'd say we should fix the plaintext serializer (which is trivial). In either case, this is not a mailnews problem, they might be using an API to the serializer that they shouldn't be using but since that API exists, it should work correctly.
Adding nsbeta1+ in keywords
Keywords: nsbeta1nsbeta1+
Attached patch Patch v1.2Splinter Review
I have moved the attribute entity parsing code from the sinks to the parser. The patch also covers the issue raised by dbaron in bug 63188 ( where the parameter aMode was being used inconsistently ). This change in addition to fixing the bug should also enhance performance, a little bit ( sorry I don't have the numbers ..yet), since it completely avoids attribute parsing in the sink.
Status: NEW → ASSIGNED
You could add a little whistespace to make things more readable, for example on these lines: mFlags |= (aCommand==eViewSource)? NS_IPARSER_FLAG_VIEW_SOURCE:NS_IPARSER_FLAG_VIEW_NORMAL; isUsableAttr=!(mFlags & (NS_IPARSER_FLAG_STRICT_MODE|NS_IPARSER_FLAG_TRANSITIONAL_MODE)); This comment does not make sense: //Enabling strict comment parsing for Bug 53011 and 2749 contradicts!!!! In ConsumeAttributeEntity() you have two declarations for theNCRValue. Easiest would be to just use the outer, I guess. There are some other things in that function I do not really like, but since they were copied from the old implementation they can stay unless you want to do extra work (cast of PRInt32 to PRUnichar, ignoring error value from ToInteger). Also I have trouble reading this: + // Nav 4.x does not support entities greater than 255 however IE + // does support them if terminated with semicolon. Let's do the same... + // ...atleast in the quirks mode. + if (-1!=theNCRValue && + (theNCRValue<255 || !(aFlag & NS_IPARSER_FLAG_QUIRKS_MODE) || theTermChar == ';')) { + aString.Append(PRUnichar(theNCRValue)); To me the comment and code don't seem to match, and I think you changed how it works from the old code. How would this work: + if ( (theNCRValue > 255) && (theTermChar == ';') && (aFlag & NS_IPARSER_FLAG_QUIRKS_MODE) ) { + aString.Append(PRUnichar(theNCRValue));
> // Nav 4.x does not support entities greater than 255 however IE > // does support them if terminated with semicolon. Let's do the same... > // ...atleast in the quirks mode. HTML 4.0 spec <http://www.w3.org/TR/REC-html40/charset.html#h-5.3>: > Note. In SGML [, on which HTML is based], it is possible to eliminate the > final ";" after a character reference in some cases (e.g., at a line break > or immediately before a tag). In other circumstances it may not be > eliminated (e.g., in the middle of a word). We strongly suggest using > the ";" in all cases to avoid problems with user agents that require this > character to be present.
Heikki, I don't think that the logic is incorrect but will try to simplify it further so that it can be more readable.
Ben: Are you suggesting that we should always look for ";", for entities greater than 255, whether the mode is strict or not? According to the spec., from what I understand, the semicolon is not a MUST but strongly recommends using it. So, according to my comment, mozilla will not care for a semicolon, for entity values greater than 255, in strict mode. That is, in strict mode, the entities would get reduced irrespective of a semicolon.
harishd, I don't know enough about the actual web (pages) to make a good suggestion, but I would try to allow what SGML allows, at least in strict mode. The "recommondation" is for web authors to guard them from non-SGML-complying user agents (what Mozilla would be with the logic in your patch, if I interpret it correctly).
sr=jst
Fix is in. Marking FIXED.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
Using build 2001-05-22 on win, linux and mac this is fixed per original scenario. the &amp; is not in the viewed msgs after doing a send page in html and or plain text. Side note: when viewing this page with linux I sometimes crash. We had crashing bugs in today's build with links and viewing msgs. I will check this page again tomorrow (with a fix) to see if it still crashes. If so, I will log a new bug. Verified.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: