Closed Bug 948901 Opened 6 years ago Closed 6 years ago

Character encoding menu doesn't work correctly in the FTP directory view

Categories

(Core :: Networking: FTP, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla31

People

(Reporter: emk, Assigned: neil)

References

Details

Attachments

(1 file, 3 obsolete files)

Attached patch ftp_charsetmenu (obsolete) — Splinter Review
STR:
1. Set Options-Content-Advanced-Fallback Character Encoding to Japanese.
2. Open <ftp://ftp.picosystems.net/pub/>.
3. Choose Unicode from View-Character Encoding.

Actual result:
The 20th file name looks like "PMI_リニア�E��E�_1988_89.pdf" (mojibake).

Expected result:
It should be look like "PMI_リニアIC_1988_89.pdf".

This is because nsDirIndexParser knows nothing about user-forced encoding.
The attached patch will take the encoding information from the document for the generated HTML.
As an added bonus, with this patch and bug 910211 patch, <ftp://ftp.asu.ru/incoming/> will be displayed correctly by default without changing any Character Encoding settings.
Attachment #8345833 - Flags: review?(honzab.moz)
Comment on attachment 8345833 [details] [diff] [review]
ftp_charsetmenu

Review of attachment 8345833 [details] [diff] [review]:
-----------------------------------------------------------------

Does what it should do, but I cannot check for 100% if this is technically correct.
r+

::: netwerk/streamconv/converters/nsIndexedToHTML.cpp
@@ +161,5 @@
> +    channel->GetContentCharset(charset);
> +    if (EncodingUtils::FindEncodingForLabel(charset, charset) &&
> +        EncodingUtils::IsAsciiCompatible(charset) &&
> +        !charset.EqualsLiteral("ISO-2022-JP")) {
> +        mParser->SetEncoding(charset.get());

can you please add comments here what these lines do/decide on?  What is ISO-2022-JP and why is it here?

What if SetEncoding fails?

::: xpfe/components/directory/nsDirectoryViewer.cpp
@@ +1370,5 @@
> +    nsCOMPtr<nsIDocument> doc = (*aDocViewerResult)->GetDocument();
> +    if (doc) {
> +      nsString charset;
> +      doc->GetCharacterSet(charset);
> +      aChannel->SetContentCharset(NS_ConvertUTF16toUTF8(charset));

I'm not that familiar with nsDirectoryViewerFactory, find someone who is peer to this code to check on this particular piece.
Attachment #8345833 - Flags: review?(honzab.moz) → review+
Comment on attachment 8345833 [details] [diff] [review]
ftp_charsetmenu

Neil, could you look into the nsDirectoryViewer bit?
Attachment #8345833 - Flags: review?(neil)
(In reply to Honza Bambas (:mayhemer) from comment #2)
> ::: netwerk/streamconv/converters/nsIndexedToHTML.cpp
> @@ +161,5 @@
> > +    channel->GetContentCharset(charset);
> > +    if (EncodingUtils::FindEncodingForLabel(charset, charset) &&
> > +        EncodingUtils::IsAsciiCompatible(charset) &&
> > +        !charset.EqualsLiteral("ISO-2022-JP")) {
> > +        mParser->SetEncoding(charset.get());
> 
> can you please add comments here what these lines do/decide on?  What is
> ISO-2022-JP and why is it here?

Will do.
Summarizing here for the reference:
FindEncodingForLabel will resolve an alias to a canonical encoding name. It is an implementation of the "get an encoding" algorithm of the Encoding Standard.
http://encoding.spec.whatwg.org/#concept-encoding-get
IsAsciiCompatible will reject XSS-unsafe encodings to prevent XSS attacks via FTP directory view.
IsAsciiCompatible doesn't reject ISO-2022-JP, so I added an explicit check. It is unlikely that the stateful encoding is used for filenames.

> What if SetEncoding fails?

It will never fail. It just assign the value to mEncoding and always returns NS_OK. Even if it fails somehow, it is harmless. Although FTP directory listing will be garbled, it is not worse than the current status. mEncoding is initialized by mozilla::dom::FallbackEncoding::FromLocale() in the constructor.
Attached patch WIP Patch (obsolete) — Splinter Review
I think I've spent far too much time thinking about this...

As I understand it, the directory index parser currently assumes that everything uses the fallback character encoding. It takes the input data, unescapes and converts it to Unicode using that encoding, and passes the result to the indexed to HTML converter... which then promptly converts it back to that character encoding, so that it can feed it to the HTML parser.

This patch therefore makes it so that the directory index parser acquires knowledge of the character set from the document in to which it's being loaded, so that it can convert it to Unicode and back again using the "correct" character encoding.

However you can just feed the original bytes through and let the the Unicode decoding happen somewhere else. This will automatically pick up on forced character set changes without any extra code. The attached patch is an attempt to do just that, although I have only done basic testing on it.

Rather annoyingly you can't set the fallback encoding to Unicode, so it might be better to omit the meta charset completely and let Gecko try to detect the character set automatically.
Attachment #8355048 - Flags: feedback?(VYV03354)
Comment on attachment 8355048 [details] [diff] [review]
WIP Patch

Review of attachment 8355048 [details] [diff] [review]:
-----------------------------------------------------------------

The overall idea looks great.

::: netwerk/streamconv/converters/nsIndexedToHTML.cpp
@@ -814,3 @@
>  
> -    // Truncate long names to not stretch the table
> -    //XXX this should be left to the stylesheet (bug 391471)

Please add "text-overflow: ellipsis" to the embedded stylesheet to compensate for the removed code.
Attachment #8355048 - Flags: feedback?(VYV03354) → feedback+
Attachment #8345833 - Attachment is obsolete: true
Attachment #8345833 - Flags: review?(neil)
Attached patch Proposed patch (obsolete) — Splinter Review
Pinging bz because the file has more of his fingerprints than anyone else.

Most of the changes relate to changing the internal buffer from nsString to nsCString. The charset is only set for file: URLs; for FTP URLs the bytes are transmitted to the HTML parser which then guesses the character set, applying any forced or default character set as appropriate.

Instead of manually limiting the file names to 71 characters the output has been improved to allow the use of text-overflow: ellipsis; however this has necessitated a minor theme tweak.

There is also some minor string usage cleanup; NCRs are appended directly to the output string, character equivalents of integer constants are used, and memory is managed using adopting strings.
Assignee: nobody → neil
Status: NEW → ASSIGNED
Attachment #8373064 - Flags: review?(bzbarsky)
Comment on attachment 8373064 [details] [diff] [review]
Proposed patch

Going to punt this to a necko peer.
Attachment #8373064 - Flags: review?(bzbarsky) → review?(mcmanus)
Attachment #8373064 - Flags: review?(mcmanus) → review?(honzab.moz)
Comment on attachment 8373064 [details] [diff] [review]
Proposed patch

Review of attachment 8373064 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for such a long delay (occupied).

I have few questions before r+ given.  The patch works (according the STR) but I don't know what else it could break.  Also I want to mention I'm not deeply (if at all) familiar with this code and have no idea who is.  Maybe Michal Novotny (:michal).  You should ask him.

::: netwerk/streamconv/converters/nsIndexedToHTML.cpp
@@ +658,5 @@
>      return mListener->OnStopRequest(request, aContext, aStatus);
>  }
>  
>  nsresult
> +nsIndexedToHTML::FormatInputStream(nsIRequest* aRequest, nsISupports *aContext, const nsACString &aBuffer) 

This no longer "formats" anything.  Please rename the method to something more appropriate, like SendStringToStreamListener or so.

when you are here, please remove the whitespace at the end

@@ -772,4 @@
>      pushBuffer.AppendLiteral("<tr");
>  
> -    nsXPIDLString description;
> -    aIndex->GetDescription(getter_Copies(description));

this was a bug?  you are changing this to location, why?

@@ +691,5 @@
> +    nsXPIDLCString loc;
> +    aIndex->GetLocation(getter_Copies(loc));
> +
> +    // Don't byte-to-Unicode conversion here, it is lossy.
> +    loc.SetLength(nsUnescapeCount(loc.BeginWriting()));

can you more explain this line?

@@ +849,1 @@
>      if (!escaped)

Are you sure this is right/still necessary to do?

@@ +849,4 @@
>      if (!escaped)
>          return NS_ERROR_OUT_OF_MEMORY;
>      pushBuffer.AppendLiteral("<tr>\n <td>");
> +    AppendNonAsciiToNCR(escaped, pushBuffer);

why is the conversion to &#x; now needed?
(In reply to Honza Bambas from comment #9)
> I have few questions before r+ given.  The patch works (according the STR)
> but I don't know what else it could break.  Also I want to mention I'm not
> deeply (if at all) familiar with this code and have no idea who is.

Thanks for looking at the patch.

> > +nsIndexedToHTML::FormatInputStream(nsIRequest* aRequest, nsISupports *aContext, const nsACString &aBuffer) 
> 
> This no longer "formats" anything.  Please rename the method to something
> more appropriate, like SendStringToStreamListener or so.

Fair enough, I was just avoiding fixing up the six callers.

> when you are here, please remove the whitespace at the end

Fixed locally.

> > -    nsXPIDLString description;
> > -    aIndex->GetDescription(getter_Copies(description));
> 
> this was a bug?  you are changing this to location, why?

GetDescription tries to decode using the character set set on the HTTPIndex object. We don't know the correct character set here, so we need to send the raw string to the parser which then decodes it with the user-selected character set.

> > +    // Don't byte-to-Unicode conversion here, it is lossy.
> > +    loc.SetLength(nsUnescapeCount(loc.BeginWriting()));
> 
> can you more explain this line?

The comment and line were moved up about 150 lines (including deleted lines). The comment is just clarifying that we don't want to convert to Unicode here because we don't know the correct character set. The code line simply ensures that the string is the correct length after it is unescaped in place.

> >      if (!escaped)
> 
> Are you sure this is right/still necessary to do?

I don't know whether it is possible for nsEscapeHTML2 to fail, but this will continue to detect it if it does.

> > +    AppendNonAsciiToNCR(escaped, pushBuffer);
> 
> why is the conversion to &#x; now needed?

I've been given a Unicode nsAString which I have to convert to NCRs because the pushBuffer is in an unknown encoding. If someone can provide a test case to show that the relevant information is being processed incorrectly then I could try to come up with a patch that changes the interface but I suggest that might be better in a dependent bug.
Attachment #8373064 - Flags: review?(michal.novotny)
(In reply to neil@parkwaycc.co.uk from comment #10)

> Fair enough, I was just avoiding fixing up the six callers.

Feel free to do, it's OK in this case - the naming is wrong.

> GetDescription tries to decode using the character set set on the HTTPIndex
> object. We don't know the correct character set here, so we need to send the
> raw string to the parser which then decodes it with the user-selected
> character set.

Worth a comment.

> 
> > > +    // Don't byte-to-Unicode conversion here, it is lossy.
> > > +    loc.SetLength(nsUnescapeCount(loc.BeginWriting()));
> > 
> > can you more explain this line?
> 
> The comment and line were moved up about 150 lines (including deleted
> lines). The comment is just clarifying that we don't want to convert to
> Unicode here because we don't know the correct character set. The code line
> simply ensures that the string is the correct length after it is unescaped
> in place.

I'm concerned here on how you calculate the length.  If the string is not filled we will show an uninitialized data on the screen, or when overfilled, we crash (serious sec bug!).  So please take care this is correct and explain how you calculate the length, why you have to call SetLength and how you ensure the string is filled correctly.  Personally I try to avoid these constructions, definitely when encoding and escaping is involved.  The length calculation and difference between unescaped and escaped is not always clear and usually leads to hard to find and dangerous mistakes.

> I don't know whether it is possible for nsEscapeHTML2 to fail, but this will
> continue to detect it if it does.

I trust you :)

> I've been given a Unicode nsAString which I have to convert to NCRs because
> the pushBuffer is in an unknown encoding. If someone can provide a test case
> to show that the relevant information is being processed incorrectly then I
> could try to come up with a patch that changes the interface but I suggest
> that might be better in a dependent bug.

Again, worth a comment.
(In reply to Honza Bambas from comment #11)
> > > > +    // Don't byte-to-Unicode conversion here, it is lossy.
> > > > +    loc.SetLength(nsUnescapeCount(loc.BeginWriting()));
> > > 
> > > can you more explain this line?
> > 
> > The comment and line were moved up about 150 lines (including deleted
> > lines). The comment is just clarifying that we don't want to convert to
> > Unicode here because we don't know the correct character set. The code line
> > simply ensures that the string is the correct length after it is unescaped
> > in place.
> 
> I'm concerned here on how you calculate the length.  If the string is not
> filled we will show an uninitialized data on the screen, or when overfilled,
> we crash (serious sec bug!).  So please take care this is correct and
> explain how you calculate the length, why you have to call SetLength and how
> you ensure the string is filled correctly.  Personally I try to avoid these
> constructions, definitely when encoding and escaping is involved.  The
> length calculation and difference between unescaped and escaped is not
> always clear and usually leads to hard to find and dangerous mistakes.

loc is an nsCString which contains a null-terminated string, however some of the characters may have been %-escaped.

nsUnescapeCount takes a null-terminated string, unescapes it in-place, and returns the length of the resulting string. Unescaping always shortens the string, so it is not possible to overfill the string.

Simply applying nsUnescapeCount to loc.BeginWriting() unescapes the string, however we need to update loc to its new length, otherwise we would accidentally output some of the escaped portion of the string.
(In reply to neil@parkwaycc.co.uk from comment #12)
> loc is an nsCString which contains a null-terminated string, however some of
> the characters may have been %-escaped.
> 
> nsUnescapeCount takes a null-terminated string, unescapes it in-place, and
> returns the length of the resulting string. Unescaping always shortens the
> string, so it is not possible to overfill the string.
> 
> Simply applying nsUnescapeCount to loc.BeginWriting() unescapes the string,
> however we need to update loc to its new length, otherwise we would
> accidentally output some of the escaped portion of the string.

Ah! Yes..  Good example of bad naming of a function :)  nsUnescapeCount alters the string passed in and returns the new length.  So this is OK.  Thanks for clarifying.  Please just add a comment:

"nsUnescapeCount() directly unescapes loc string content and returns the processed string length, so that we directly adjust nsCString's length to be correct again."

...or something like this.
Attachment #8373064 - Flags: review?(honzab.moz) → review+
> Try run: https://tbpl.mozilla.org/?tree=Try&rev=d38db806e5d2

Note that try runs don't test FTP at all--we have *no* automated testing for FTP still :(

So we need to do at least minimal hand-testing against common FTP server versions before we can land this.  Sadly we don't even have a list of what the most common FTP servers are (that's the state of disrepair of our FTP testing).
Attachment #8373064 - Flags: review?(michal.novotny) → review+
Sorry for taking so long to get back to you with this.
Attachment #8355048 - Attachment is obsolete: true
Attachment #8373064 - Attachment is obsolete: true
Attachment #8392606 - Flags: review?(honzab.moz)
Comment on attachment 8392606 [details] [diff] [review]
Addressed review comments

r=honzab

thanks, sorry for the delay
Attachment #8392606 - Flags: review?(honzab.moz) → review+
Well, it looks as if I have two ways to go here.

A couple of people have mentioned that his particular assertion isn't that unexpected, and so I might just need to annotate the test and move on.

However, it might be possible to avoid triggering the assertion, so I'm also testing that option on try.
Flags: needinfo?(neil)
Somehow my "workaround" actually manages to prevent one of the existing assertions, which shows how useful they are...

Unfortunately there's a second test that's asserting (but only on the Mac) and my workaround wouldn't apply in that case anyway.
What is this change about?
80 	tbody > tr:hover {
80 	body > table > tbody > tr:hover {
(In reply to Alfred Kayser from comment #21)
> What is this change about?
> 80 	tbody > tr:hover {
> 80 	body > table > tbody > tr:hover {

In order to get the ellipsis to work, I can't put the file name directly inside the table cell, instead I have to nest it inside a table with fixed table layout. This change simply ensures that the hover rule only applies to the outer table.
https://hg.mozilla.org/mozilla-central/rev/b2f9a0080f9c
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Looks like <ftp://ftp.asu.ru/incoming/> was broken by Neil's patch (I couldn't test it before because the server was down).
Depends on: 989576
Depends on: 997519
Depends on: 1194497
You need to log in before you can comment on or make changes to this bug.