948901 - Character encoding menu doesn't work correctly in the FTP directory view

Reporter

Description

•

11 years ago

Attached patch ftp_charsetmenu (obsolete) — Details — Splinter Review

STR: 1. Set Options-Content-Advanced-Fallback Character Encoding to Japanese. 2. Open <ftp://ftp.picosystems.net/pub/>. 3. Choose Unicode from View-Character Encoding. Actual result: The 20th file name looks like "PMI_リニア�E��E�_1988_89.pdf" (mojibake). Expected result: It should be look like "PMI_リニアＩＣ_1988_89.pdf". This is because nsDirIndexParser knows nothing about user-forced encoding. The attached patch will take the encoding information from the document for the generated HTML. As an added bonus, with this patch and bug 910211 patch, <ftp://ftp.asu.ru/incoming/> will be displayed correctly by default without changing any Character Encoding settings.

Attachment #8345833 - Flags: review?(honzab.moz)

Masatoshi Kimura [:emk]

Reporter

Comment 1

•

11 years ago

Try run: https://tbpl.mozilla.org/?tree=Try&rev=d38db806e5d2

Honza Bambas (:mayhemer)

Comment 2

•

11 years ago

Comment on attachment 8345833 [details] [diff] [review] ftp_charsetmenu Review of attachment 8345833 [details] [diff] [review]: ----------------------------------------------------------------- Does what it should do, but I cannot check for 100% if this is technically correct. r+ ::: netwerk/streamconv/converters/nsIndexedToHTML.cpp @@ +161,5 @@ > + channel->GetContentCharset(charset); > + if (EncodingUtils::FindEncodingForLabel(charset, charset) && > + EncodingUtils::IsAsciiCompatible(charset) && > + !charset.EqualsLiteral("ISO-2022-JP")) { > + mParser->SetEncoding(charset.get()); can you please add comments here what these lines do/decide on? What is ISO-2022-JP and why is it here? What if SetEncoding fails? ::: xpfe/components/directory/nsDirectoryViewer.cpp @@ +1370,5 @@ > + nsCOMPtr<nsIDocument> doc = (*aDocViewerResult)->GetDocument(); > + if (doc) { > + nsString charset; > + doc->GetCharacterSet(charset); > + aChannel->SetContentCharset(NS_ConvertUTF16toUTF8(charset)); I'm not that familiar with nsDirectoryViewerFactory, find someone who is peer to this code to check on this particular piece.

Attachment #8345833 - Flags: review?(honzab.moz) → review+

Masatoshi Kimura [:emk]

Reporter

Comment 3

•

11 years ago

Comment on attachment 8345833 [details] [diff] [review] ftp_charsetmenu Neil, could you look into the nsDirectoryViewer bit?

Attachment #8345833 - Flags: review?(neil)

Masatoshi Kimura [:emk]

Reporter

Comment 4

•

11 years ago

(In reply to Honza Bambas (:mayhemer) from comment #2) > ::: netwerk/streamconv/converters/nsIndexedToHTML.cpp > @@ +161,5 @@ > > + channel->GetContentCharset(charset); > > + if (EncodingUtils::FindEncodingForLabel(charset, charset) && > > + EncodingUtils::IsAsciiCompatible(charset) && > > + !charset.EqualsLiteral("ISO-2022-JP")) { > > + mParser->SetEncoding(charset.get()); > > can you please add comments here what these lines do/decide on? What is > ISO-2022-JP and why is it here? Will do. Summarizing here for the reference: FindEncodingForLabel will resolve an alias to a canonical encoding name. It is an implementation of the "get an encoding" algorithm of the Encoding Standard. http://encoding.spec.whatwg.org/#concept-encoding-get IsAsciiCompatible will reject XSS-unsafe encodings to prevent XSS attacks via FTP directory view. IsAsciiCompatible doesn't reject ISO-2022-JP, so I added an explicit check. It is unlikely that the stateful encoding is used for filenames. > What if SetEncoding fails? It will never fail. It just assign the value to mEncoding and always returns NS_OK. Even if it fails somehow, it is harmless. Although FTP directory listing will be garbled, it is not worse than the current status. mEncoding is initialized by mozilla::dom::FallbackEncoding::FromLocale() in the constructor.

neil@parkwaycc.co.uk

Assignee

Comment 5

•

11 years ago

Attached patch WIP Patch (obsolete) — Details — Splinter Review

I think I've spent far too much time thinking about this... As I understand it, the directory index parser currently assumes that everything uses the fallback character encoding. It takes the input data, unescapes and converts it to Unicode using that encoding, and passes the result to the indexed to HTML converter... which then promptly converts it back to that character encoding, so that it can feed it to the HTML parser. This patch therefore makes it so that the directory index parser acquires knowledge of the character set from the document in to which it's being loaded, so that it can convert it to Unicode and back again using the "correct" character encoding. However you can just feed the original bytes through and let the the Unicode decoding happen somewhere else. This will automatically pick up on forced character set changes without any extra code. The attached patch is an attempt to do just that, although I have only done basic testing on it. Rather annoyingly you can't set the fallback encoding to Unicode, so it might be better to omit the meta charset completely and let Gecko try to detect the character set automatically.

Attachment #8355048 - Flags: feedback?(VYV03354)

Masatoshi Kimura [:emk]

Reporter

Comment 6

•

11 years ago

Comment on attachment 8355048 [details] [diff] [review] WIP Patch Review of attachment 8355048 [details] [diff] [review]: ----------------------------------------------------------------- The overall idea looks great. ::: netwerk/streamconv/converters/nsIndexedToHTML.cpp @@ -814,3 @@ > > - // Truncate long names to not stretch the table > - //XXX this should be left to the stylesheet (bug 391471) Please add "text-overflow: ellipsis" to the embedded stylesheet to compensate for the removed code.

Attachment #8355048 - Flags: feedback?(VYV03354) → feedback+

Masatoshi Kimura [:emk]

Reporter

Updated

•

11 years ago

Attachment #8345833 - Attachment is obsolete: true

Attachment #8345833 - Flags: review?(neil)

neil@parkwaycc.co.uk

Assignee

Comment 7

•

11 years ago

Attached patch Proposed patch (obsolete) — Details — Splinter Review

Pinging bz because the file has more of his fingerprints than anyone else. Most of the changes relate to changing the internal buffer from nsString to nsCString. The charset is only set for file: URLs; for FTP URLs the bytes are transmitted to the HTML parser which then guesses the character set, applying any forced or default character set as appropriate. Instead of manually limiting the file names to 71 characters the output has been improved to allow the use of text-overflow: ellipsis; however this has necessitated a minor theme tweak. There is also some minor string usage cleanup; NCRs are appended directly to the output string, character equivalents of integer constants are used, and memory is managed using adopting strings.

Assignee: nobody → neil

Status: NEW → ASSIGNED

Attachment #8373064 - Flags: review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 8

•

11 years ago

Comment on attachment 8373064 [details] [diff] [review] Proposed patch Going to punt this to a necko peer.

Attachment #8373064 - Flags: review?(bzbarsky) → review?(mcmanus)

Patrick McManus [:mcmanus]

Updated

•

11 years ago

Attachment #8373064 - Flags: review?(mcmanus) → review?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 9

•

11 years ago

Comment on attachment 8373064 [details] [diff] [review] Proposed patch Review of attachment 8373064 [details] [diff] [review]: ----------------------------------------------------------------- Sorry for such a long delay (occupied). I have few questions before r+ given. The patch works (according the STR) but I don't know what else it could break. Also I want to mention I'm not deeply (if at all) familiar with this code and have no idea who is. Maybe Michal Novotny (:michal). You should ask him. ::: netwerk/streamconv/converters/nsIndexedToHTML.cpp @@ +658,5 @@ > return mListener->OnStopRequest(request, aContext, aStatus); > } > > nsresult > +nsIndexedToHTML::FormatInputStream(nsIRequest* aRequest, nsISupports *aContext, const nsACString &aBuffer) This no longer "formats" anything. Please rename the method to something more appropriate, like SendStringToStreamListener or so. when you are here, please remove the whitespace at the end @@ -772,4 @@ > pushBuffer.AppendLiteral("<tr"); > > - nsXPIDLString description; > - aIndex->GetDescription(getter_Copies(description)); this was a bug? you are changing this to location, why? @@ +691,5 @@ > + nsXPIDLCString loc; > + aIndex->GetLocation(getter_Copies(loc)); > + > + // Don't byte-to-Unicode conversion here, it is lossy. > + loc.SetLength(nsUnescapeCount(loc.BeginWriting())); can you more explain this line? @@ +849,1 @@ > if (!escaped) Are you sure this is right/still necessary to do? @@ +849,4 @@ > if (!escaped) > return NS_ERROR_OUT_OF_MEMORY; > pushBuffer.AppendLiteral("<tr>\n <td>"); > + AppendNonAsciiToNCR(escaped, pushBuffer); why is the conversion to &#x; now needed?

neil@parkwaycc.co.uk

Assignee

Comment 10

•

11 years ago

(In reply to Honza Bambas from comment #9) > I have few questions before r+ given. The patch works (according the STR) > but I don't know what else it could break. Also I want to mention I'm not > deeply (if at all) familiar with this code and have no idea who is. Thanks for looking at the patch. > > +nsIndexedToHTML::FormatInputStream(nsIRequest* aRequest, nsISupports *aContext, const nsACString &aBuffer) > > This no longer "formats" anything. Please rename the method to something > more appropriate, like SendStringToStreamListener or so. Fair enough, I was just avoiding fixing up the six callers. > when you are here, please remove the whitespace at the end Fixed locally. > > - nsXPIDLString description; > > - aIndex->GetDescription(getter_Copies(description)); > > this was a bug? you are changing this to location, why? GetDescription tries to decode using the character set set on the HTTPIndex object. We don't know the correct character set here, so we need to send the raw string to the parser which then decodes it with the user-selected character set. > > + // Don't byte-to-Unicode conversion here, it is lossy. > > + loc.SetLength(nsUnescapeCount(loc.BeginWriting())); > > can you more explain this line? The comment and line were moved up about 150 lines (including deleted lines). The comment is just clarifying that we don't want to convert to Unicode here because we don't know the correct character set. The code line simply ensures that the string is the correct length after it is unescaped in place. > > if (!escaped) > > Are you sure this is right/still necessary to do? I don't know whether it is possible for nsEscapeHTML2 to fail, but this will continue to detect it if it does. > > + AppendNonAsciiToNCR(escaped, pushBuffer); > > why is the conversion to &#x; now needed? I've been given a Unicode nsAString which I have to convert to NCRs because the pushBuffer is in an unknown encoding. If someone can provide a test case to show that the relevant information is being processed incorrectly then I could try to come up with a patch that changes the interface but I suggest that might be better in a dependent bug.

neil@parkwaycc.co.uk

Assignee

Updated

•

11 years ago

Attachment #8373064 - Flags: review?(michal.novotny)

Honza Bambas (:mayhemer)

Comment 11

•

11 years ago

(In reply to neil@parkwaycc.co.uk from comment #10) > Fair enough, I was just avoiding fixing up the six callers. Feel free to do, it's OK in this case - the naming is wrong. > GetDescription tries to decode using the character set set on the HTTPIndex > object. We don't know the correct character set here, so we need to send the > raw string to the parser which then decodes it with the user-selected > character set. Worth a comment. > > > > + // Don't byte-to-Unicode conversion here, it is lossy. > > > + loc.SetLength(nsUnescapeCount(loc.BeginWriting())); > > > > can you more explain this line? > > The comment and line were moved up about 150 lines (including deleted > lines). The comment is just clarifying that we don't want to convert to > Unicode here because we don't know the correct character set. The code line > simply ensures that the string is the correct length after it is unescaped > in place. I'm concerned here on how you calculate the length. If the string is not filled we will show an uninitialized data on the screen, or when overfilled, we crash (serious sec bug!). So please take care this is correct and explain how you calculate the length, why you have to call SetLength and how you ensure the string is filled correctly. Personally I try to avoid these constructions, definitely when encoding and escaping is involved. The length calculation and difference between unescaped and escaped is not always clear and usually leads to hard to find and dangerous mistakes. > I don't know whether it is possible for nsEscapeHTML2 to fail, but this will > continue to detect it if it does. I trust you :) > I've been given a Unicode nsAString which I have to convert to NCRs because > the pushBuffer is in an unknown encoding. If someone can provide a test case > to show that the relevant information is being processed incorrectly then I > could try to come up with a patch that changes the interface but I suggest > that might be better in a dependent bug. Again, worth a comment.

neil@parkwaycc.co.uk

Assignee

Comment 12

•

11 years ago

(In reply to Honza Bambas from comment #11) > > > > + // Don't byte-to-Unicode conversion here, it is lossy. > > > > + loc.SetLength(nsUnescapeCount(loc.BeginWriting())); > > > > > > can you more explain this line? > > > > The comment and line were moved up about 150 lines (including deleted > > lines). The comment is just clarifying that we don't want to convert to > > Unicode here because we don't know the correct character set. The code line > > simply ensures that the string is the correct length after it is unescaped > > in place. > > I'm concerned here on how you calculate the length. If the string is not > filled we will show an uninitialized data on the screen, or when overfilled, > we crash (serious sec bug!). So please take care this is correct and > explain how you calculate the length, why you have to call SetLength and how > you ensure the string is filled correctly. Personally I try to avoid these > constructions, definitely when encoding and escaping is involved. The > length calculation and difference between unescaped and escaped is not > always clear and usually leads to hard to find and dangerous mistakes. loc is an nsCString which contains a null-terminated string, however some of the characters may have been %-escaped. nsUnescapeCount takes a null-terminated string, unescapes it in-place, and returns the length of the resulting string. Unescaping always shortens the string, so it is not possible to overfill the string. Simply applying nsUnescapeCount to loc.BeginWriting() unescapes the string, however we need to update loc to its new length, otherwise we would accidentally output some of the escaped portion of the string.

Honza Bambas (:mayhemer)

Comment 13

•

11 years ago

(In reply to neil@parkwaycc.co.uk from comment #12) > loc is an nsCString which contains a null-terminated string, however some of > the characters may have been %-escaped. > > nsUnescapeCount takes a null-terminated string, unescapes it in-place, and > returns the length of the resulting string. Unescaping always shortens the > string, so it is not possible to overfill the string. > > Simply applying nsUnescapeCount to loc.BeginWriting() unescapes the string, > however we need to update loc to its new length, otherwise we would > accidentally output some of the escaped portion of the string. Ah! Yes.. Good example of bad naming of a function :) nsUnescapeCount alters the string passed in and returns the new length. So this is OK. Thanks for clarifying. Please just add a comment: "nsUnescapeCount() directly unescapes loc string content and returns the processed string length, so that we directly adjust nsCString's length to be correct again." ...or something like this.

Honza Bambas (:mayhemer)

Updated

•

11 years ago

Attachment #8373064 - Flags: review?(honzab.moz) → review+

Jason Duell

Comment 14

•

11 years ago

> Try run: https://tbpl.mozilla.org/?tree=Try&rev=d38db806e5d2 Note that try runs don't test FTP at all--we have *no* automated testing for FTP still :( So we need to do at least minimal hand-testing against common FTP server versions before we can land this. Sadly we don't even have a list of what the most common FTP servers are (that's the state of disrepair of our FTP testing).

Michal Novotny [:michal]

Updated

•

11 years ago

Attachment #8373064 - Flags: review?(michal.novotny) → review+

neil@parkwaycc.co.uk

Assignee

Comment 15

•

11 years ago

Attached patch Addressed review comments — Details — Splinter Review

Sorry for taking so long to get back to you with this.

Attachment #8355048 - Attachment is obsolete: true

Attachment #8373064 - Attachment is obsolete: true

Attachment #8392606 - Flags: review?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 16

•

11 years ago

Comment on attachment 8392606 [details] [diff] [review] Addressed review comments r=honzab thanks, sorry for the delay

Attachment #8392606 - Flags: review?(honzab.moz) → review+

neil@parkwaycc.co.uk

Assignee

Comment 17

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/18f95fb1d866

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 18

•

11 years ago

Backed out in http://hg.mozilla.org/integration/mozilla-inbound/rev/75a932c24023 for breaking OSX m-oth tests: https://tbpl.mozilla.org/php/getParsedLog.php?id=36696363&tree=Mozilla-Inbound

Flags: needinfo?(neil)

neil@parkwaycc.co.uk

Assignee

Comment 19

•

11 years ago

Well, it looks as if I have two ways to go here. A couple of people have mentioned that his particular assertion isn't that unexpected, and so I might just need to annotate the test and move on. However, it might be possible to avoid triggering the assertion, so I'm also testing that option on try.

Flags: needinfo?(neil)

neil@parkwaycc.co.uk

Assignee

Comment 20

•

11 years ago

Somehow my "workaround" actually manages to prevent one of the existing assertions, which shows how useful they are... Unfortunately there's a second test that's asserting (but only on the Mac) and my workaround wouldn't apply in that case anyway.

Alfred Kayser

Comment 21

•

11 years ago

What is this change about? 80 tbody > tr:hover { 80 body > table > tbody > tr:hover {

neil@parkwaycc.co.uk

Assignee

Comment 22

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/b2f9a0080f9c

neil@parkwaycc.co.uk

Assignee

Comment 23

•

11 years ago

(In reply to Alfred Kayser from comment #21) > What is this change about? > 80 tbody > tr:hover { > 80 body > table > tbody > tr:hover { In order to get the ellipsis to work, I can't put the file name directly inside the table cell, instead I have to nest it inside a table with fixed table layout. This change simply ensures that the hover rule only applies to the outer table.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 24

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/b2f9a0080f9c

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla31

Masatoshi Kimura [:emk]

Reporter

Comment 25

•

11 years ago

Looks like <ftp://ftp.asu.ru/incoming/> was broken by Neil's patch (I couldn't test it before because the server was down).

Masatoshi Kimura [:emk]

Reporter

Updated

•

11 years ago

Depends on: 989576

YF (Yang)

Updated

•

11 years ago

Depends on: 997519

Mike Hommey [:glandium]

Updated

•

10 years ago

Depends on: 1194497

BMO Automation

Updated

•

1 year ago

Product: Core → Core Graveyard

ftp_charsetmenu 11 years ago Masatoshi Kimura [:emk] 3.81 KB, patch	mayhemer : review+	Details \| Diff \| Splinter Review
WIP Patch 11 years ago neil@parkwaycc.co.uk 22.71 KB, patch	emk : feedback+	Details \| Diff \| Splinter Review
Proposed patch 11 years ago neil@parkwaycc.co.uk 26.63 KB, patch	mayhemer : review+ michal : review+	Details \| Diff \| Splinter Review
Addressed review comments 11 years ago neil@parkwaycc.co.uk 27.55 KB, patch	mayhemer : review+	Details \| Diff \| Splinter Review