Last Comment Bug 296795 - ZipReader doesn't handle non-ASCII characters
: ZipReader doesn't handle non-ASCII characters
Status: RESOLVED FIXED
: addon-compat, dev-doc-needed
Product: Core Graveyard
Classification: Graveyard
Component: File Handling (show other bugs)
: Trunk
: All All
: -- normal with 3 votes (vote)
: mozilla10
Assigned To: Wolfgang Germund
:
Mentors:
: 552367 (view as bug list)
Depends on:
Blocks: 445065 697061
  Show dependency treegraph
 
Reported: 2005-06-06 01:26 PDT by Daniel Kirsch
Modified: 2016-06-22 12:16 PDT (History)
22 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
input zip file to test case (209 bytes, application/zip)
2007-05-15 13:04 PDT, Gary Johnson
no flags Details
Javascript to show bug in a console window (4.96 KB, application/x-javascript)
2007-05-15 13:06 PDT, Gary Johnson
no flags Details
xul file to run Javascript, change path to where you put source js (1.07 KB, application/vnd.mozilla.xul+xml)
2007-05-15 13:07 PDT, Gary Johnson
no flags Details
a file with code points above 127 that extracts fine (280 bytes, application/zip)
2007-05-16 16:37 PDT, Gary Johnson
no flags Details
a file with a file name with code point above 127 (280 bytes, application/zip)
2007-05-16 16:39 PDT, Gary Johnson
no flags Details
nsIZipReader.idl use always AUTF8String as in parameter v1.0 (29.22 KB, patch)
2011-09-02 03:34 PDT, Wolfgang Germund
taras.mozilla: feedback-
Details | Diff | Splinter Review
nsIZipReader.idl use always AUTF8String as in parameter v2 (29.10 KB, patch)
2011-09-16 05:21 PDT, Wolfgang Germund
taras.mozilla: review+
Details | Diff | Splinter Review
nsIZipReader.idl use always AUTF8String as in parameter v3 (31.04 KB, patch)
2011-09-23 05:20 PDT, Wolfgang Germund
no flags Details | Diff | Splinter Review

Description Daniel Kirsch 2005-06-06 01:26:46 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

If a zip file contains files with Umlaute like "ä" nsIZipReader doesn't handle
them correctly. It seems as if such a character is just ignored so an entry like
"bär.gif" will become "br.gif".

test script:
var zippath = "d:\\temp\\tmp.zip";
var filetype = "*.gif";
var ZipReader = new
Components.Constructor("@mozilla.org/libjar/zip-reader;1","nsIZipReader");
var ZipPath = new
Components.Constructor("@mozilla.org/file/local;1","nsILocalFile","initWithPath");

var zip = new ZipReader();
var zipFile = new ZipPath(zippath);
zip.init(zipFile);
zip.open();
var entries=zip.findEntries(filetype || "*");
var names = [];
while (entries.hasMoreElements()) {
    
alert(entries.getNext().QueryInterface(Components.interfaces.nsIZipEntry).name);
      }

      zip.close();

Reproducible: Always

Steps to Reproduce:
1. Create a file with a Umlaut or another valid character > charCode 122
2. Zip that file. The name of the zip-File doesn't matter.
3. Run the given script with zippath set to your zip file

Actual Results:  
Characters > 122 will be ignored.

Expected Results:  
Get the correct name of an entry in a zip file including all valid characters.
Comment 1 Christian :Biesinger (don't email me, ping me on IRC) 2005-06-06 04:16:34 PDT
> valid character > charCode 122

127, right?
Comment 2 Daniel Kirsch 2005-06-06 05:17:12 PDT
Right. {, |, } and ~ seems to be handled correct.

"§" will not be ignored but becomes "õ".
Comment 3 Gervase Markham [:gerv] 2005-09-27 02:09:50 PDT
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
Comment 4 Gervase Markham [:gerv] 2005-10-13 10:39:24 PDT
This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Comment 5 Daniel Kirsch 2007-05-14 01:03:26 PDT
The bug is still present, however the zip and enumerator interface has changed a bit.

I can reproduce it on th current XULRunner trunk using the following method:

var zippath = "d:\\temp\\tmp.zip";
var filetype = "*.gif";
var ZipReader = new
Components.Constructor("@mozilla.org/libjar/zip-reader;1","nsIZipReader");
var ZipPath = new
Components.Constructor("@mozilla.org/file/local;1","nsILocalFile","initWithPath");

var zip = new ZipReader();
var zipFile = new ZipPath(zippath);
//PLA(zip)

zip.open(zipFile);
var entries=zip.findEntries("*");
var names = [];
//PLA(entries)
while (entries.hasMore()) {
  alert(entries.getNext());
}
zip.close();


All characters including and after a char code > 127 are cropped. So "bär.gif" will now become "b" only.
Comment 6 Gary Johnson 2007-05-15 12:43:05 PDT
I can confirm this BUG

See  mozilla\modules\libjar\nsJAR.cpp  - implements nsIZipReader

nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)
{
  NS_ENSURE_ARG_POINTER(result);

  nsZipFind *find;
  nsresult rv = mZip.FindInit(aPattern, &find);
  NS_ENSURE_SUCCESS(rv, rv);

  nsIUTF8StringEnumerator *zipEnum = new nsJAREnumerator(find);

which wants a UTF-8 string   

We end up in nsUTF8Utils.h

     NS_ERROR("Not a UTF-8 string. This code should only be used for
converting from known UTF-8 strings."); 
Comment 7 Dave Townsend [:mossop] 2007-05-15 13:04:54 PDT
The ultimate problem here is that nsZipArchive is built according to the original zip spec which only allowed for ascii characters in the names. Updating it for the UTF-8 extensions to zip would be some work and would also require changes to the nsIZipReader interface.
Comment 8 Gary Johnson 2007-05-15 13:04:56 PDT
Created attachment 264902 [details]
input zip file to test case
Comment 9 Gary Johnson 2007-05-15 13:06:11 PDT
Created attachment 264903 [details]
Javascript to show bug in a console window
Comment 10 Gary Johnson 2007-05-15 13:07:12 PDT
Created attachment 264904 [details]
xul file to run Javascript, change path to where you put source js
Comment 11 Daniel Veditz [:dveditz] 2007-05-16 16:12:20 PDT
(In reply to comment #7)
> original zip spec which only allowed for ascii characters in the names.

Actually the original zip just took bytes, interpreted in whatever character set the reader wanted to interpret them as (whatever the local file system understood, basically). If you archived your own files or shared zips with friends in the same country as you things basically worked.

The underlying nsZipArchive implementation is likewise character-set agnostic, and just takes C strings.

nsIZipReader appears to have turned into a horrible mix of methods that take "string" and "AUTF8String".
Comment 12 Gary Johnson 2007-05-16 16:37:10 PDT
Created attachment 265061 [details]
a file with code points above 127 that extracts fine

file in zip is named saute.txt
Sauté C:\saute.txt C:\saute.txt
C:\saute.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la bourguignonne.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la Maitre d'Hotel.txt
à la king.html
Pls	no quotes	in ->Pont l'Évêque cheese
Comment 13 Gary Johnson 2007-05-16 16:39:24 PDT
Created attachment 265062 [details]
a file with a file name with code point above 127

file named Sauté.txt with same contents as  C:\FileWithNONAsciiInternalCharters.ZIP.  Crash and burn
Comment 14 Gervase Markham [:gerv] 2007-05-21 08:15:26 PDT
Confirming due to comment #13.

Gerv
Comment 15 Daniel Veditz [:dveditz] 2007-05-21 19:22:05 PDT
Part of the problem is that prior to the relatively recent (2006) "general
purpose bit 11" flag to indicate utf8 filenames most zip implementations
allowed whatever filenames happened to be valid on the current OS -- it's
just storing length-counted char* so it doesn't really care what the caller
passes in. The "official" spec from the PKZIP people says that it's always
to be interpreted as IBM's old DOS codepage 437. That's already more than
just ascii, and since it's totally unsuited to anything outside US/western
europe that was taken as permission to just go ahead and use whatever
you've got. Windows versions of zip utilities do appear to call ANSIToOEM()
on the filenames, though, probably to remain compatible with old
DOS-produced archives.

Interesting facts:

- WinZip creates archives with cp437 file names (Window's "OEM" charset).
- WinRAR creates archives with "OEM" filenames.
- in contrast, command-line zip (cygwin) created archives using iso-8859
file names (window's "ANSI" character set) -- just guessing, but iso8859 is
a common file system charset on Unix platforms, so maybe that's "more
compatible" in that world.
- none of the three used utf8 encoding or the general purpose bit 11 flag
- all of command-line zip, WinZip, and 7-zip correctly read both versions,
restoring the correct characters when the archive was listed or extracted
whether cygwin's "ANSI" charset or the other's "OEM" charset.
- WinRAR created archives with "OEM" filenames, and automatically assumed
the characters were OEM; it mangled the cygwin zip filenames

The other difference between WinZip archives and zip archives is that the
cygwin zip archives are flagged as "unix" ("version made by" has a high
byte of x03 - UNIX, although the "version needed to extract" does not).
The cygwin zip also uses the optional "extra field" for file permissions and misc stuff, but I think that's less relevant.

Supporting "UTF8" zip archives is _not_ going to help these guys. If we
want to make all the interfaces UTF8 and then make a stab at charset
conversion from cp437 (or guess at other "OEM" charsets) that might help,
but the most common zip utilities are not creating archives with unicode
filenames.

If we want to support this we'll have to convert all the names as we read
the directory structure because theoretically this "made by unix" flag
could differ from file to file. Well, more than theoretical, I've created
them by using cygwin to add to a winzip-created archive and vice versa --
you get a nice mix of charsets there. Again, zip, WinZip, and 7-zip handle
the mixture gracefully, displaying the list in a unified character set
(whether at the command line or GUI display, or when extracting the files
into the filesystem charset).

Another potential wrinkle, the zip library is currently also built standalone so it can be incorporated into the old Suite installer. In that context it won't have access to our charset conversion routines. Maybe we no longer care if we
can get SuiteRunner to use the NSIS installer in the FF3 timeframe. Otherwise we'll have to put the conversion in the level above nsZipArchive which would be more awkward. I guess we could stub them out with calls to window's OEMtoAnsi calls.

If we could simply fix this internally to nsZipArchive I'd say just make
those char* interfaces always UTF8, converted from whatever they actually
are on disk.
Comment 16 Daniel Veditz [:dveditz] 2007-05-21 19:38:11 PDT
There are other "version made by" values we might be concerned about. I have absolutely no idea if they come up in practice but if they do they might imply different character set conversions. The Macintosh value, for example, is probably the Mac-Roman character set. There's a different value for OS X--what character set would that be?--but we'd have to poke around and see if that's used. There's a Windows NTFS value that none of the Windows zip archivers I tried actually used, they stuck with the original "MS-DOS" value for compatibility presumably.

We could peek at the info-zip code and see that they do with all this, I think their license is compatible with ours so we don't have to worry about being tainted just by looking.

For reference the PKZIP spec is found at
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
Comment 17 Gary Johnson 2007-05-22 06:49:11 PDT
If we change our thinking to supporting web applications, not just the legacy zip specification I think we will be better off.  Integrate something like libiconv (http://www.gnu.org/software/libiconv/) with compression and a new string interface to support it.  Support the compression part of the old standard for the major file systems, but add options for binary files and files to be converted from one codepage to another and support the file system calls on the major platforms.  Supporting the developers working on this project with development access to MACs, PCs and Linux and Unix boxes would very helpful.  Give them some clear specifications of what has to be supported and what is optional.  Exclude a few things.  ISO-8859-1 gives you windows in the US.  Forget about things like x80 - x90 on windows 1252.  Trying to work out the wrinkles with those code point could be a career or two. This is an area were a small teams needs to form.  Split the project into 2 with a Western and Eastern focus.  Another option that just supports file transfers between systems would be to distribute an open source zipper on windows and provide a protected way of calling tar on Mac and Unix and Linux.  

I think Mr. Daniel Veditz has proved that the ZIP spec has enough ambiguities that it should not be taken literally. BTW thanks.     
Comment 18 Daniel Veditz [:dveditz] 2007-05-22 11:04:48 PDT
What does "supporting web applications" mean? web applications don't make zip archives, they call command-line zip, use Java classes, or perl libraries. We need to handle what those produce, and what those produce. 

The zip spec hosted by PKWARE is a good place to start, it's a living document and does represent some cross-vendor consensus in the interests of compatibility. Like web standards, specific implementations do vary.

I don't understand the link to libiconv. We already have character set conversion functionality in Mozilla.

The trick is to figure out what conversion to use since there doesn't seem to be a marker present in the files. I don't care about the difference between iso-8859-1 and cp1252, I want to know how to tell the difference between iso-8859-1 and iso-8859-2 or whatever multibyte character set the Asian countries will be using. We have the converters, we just need to know which one to use.
Comment 19 Gary Johnson 2007-05-22 12:41:28 PDT
1) Bytes is bytes, unless you give them context.  For example, for US English,  if you want to support all valid filenames on both the PC and the MAC, first of all, you can’t.  Second you can support 99.99 of them by eliminating the illegal characters on both platforms (and a couple of other restrictions).  But you still have to call some kind of code page conversion so that the valid PC characters above dec 127 get converted into UTF MAC and vice versa. If you are going from MAC to PC, it gets a whole lot more complicated.  You can translate a resource fork, but without a lot more application level support, it’s not going to be meaningful.  But in a real sense that is presentation level data, just like application formats carry presentation level data, and you can choose not to support that.  What is missing from the zip spec is the concept of From and To.  It was not the pressing concern when we had 640k and most data transfer was done with an IBM mainframe tape. 

2) From our earlier posts, “The trick is to figure out what conversion” is  “options to convert them”  and “(whatever the local file system understood, basically)”. 

 3) Lets say “Like” Libiconv means your character conversion (mozilla’s) routines.  

 4) I am sure I do not understand all the things  a web application might do.  But I would not preclude doing things with files and data hosted on local machines.  And if you look at the success of the zip format, one of the things it did was help move data around from one machine to another.  For a definition of a what I was thinking about when I mentioned web applications, you can start with this http://en.wikipedia.org/wiki/Web_application, and then expand it.  

What I am trying to get at, is the state of the art is such that we cannot move all data around seamlessly.  We can move around a lot of data, accurately, if we live with a few restrictions.  

I do not know what would happen if someone on a mac tried to zip up files with non ascii characters and then someone tried to read that zip on a PC.  Does the mac zip up UTF?  What happens when you try to unzip them with various programs on the PC?   

If you use plain ascii, you do not have these problems, but then you do not fully support zip applications.  What we have now is going to fail as soon as filenames use non ascii data.   But, that is not a big deal, as long as you know what will and will not work for you.

In my case, I can live with accessing the zip utilities on the various platforms.  I will convert my data to the target machines format.  In the case of the MAC and UNIX, I can try and use tar, on the PC I can distribute an open source zip from info-zip.org.  

Comment 20 skierpage 2009-07-05 15:25:15 PDT
Firefox's amazing build-in jar: protocol handler shows this problem.   Using the attachment from comment #13 , the URL
  jar:https://bug296795.bugzilla.mozilla.org/attachment.cgi?id=265062!/
displays the ZIP file's contents, showing the file as "File:Saut?.txt"  (that appears to be the ASCII '?' mark symbol, not a missing glyph icon).

If you click the filename, Firefox attempts to load the file
  jar:https://bug296795.bugzilla.mozilla.org/attachment.cgi?id=265062!/Saut?.txt
(no escaping or encoding), and reports File not found.  There is nothing in the Error Console in Firefox 3.6apre1.

This also seems to expose bugs in the way the jar viewer represents and parses non-ASCII characters.
Comment 21 Gary Johnson 2010-02-26 11:16:15 PST
If you attach multiple files to an email and send them to a Hotmail recipient, the recipient gets the option to "download all" attachments.  When you click on "download all", it zips up all the files and saves them to your hard drive.  It zips up files with non ascii characters in the filename.
Comment 22 Dave Townsend [:mossop] 2010-03-15 11:03:37 PDT
*** Bug 552367 has been marked as a duplicate of this bug. ***
Comment 23 Wolfgang Germund 2011-08-12 01:09:46 PDT
Hi

I think the Problem above has two aspects:
1) What coding comes from the archive?
2) How works the interface nsIZipReader.idl from JavaScript to C++?

Too aspect 1:
My openSUSE Linux use UTF-8 for filenames.
For a file test_ü.txt in an archive zipinfo shows:
-rw-r--r--  3.0 unx        7 tx stor 11-Aug-10 16:15 test_??.txt


I have made a Source Code Analysis of aspect 2.

In comment 11 Dave come to the point.
>> nsIZipReader appears to have turned into a horrible mix of methods that take >> "string" and "AUTF8String".

The reason is in the files:
   mozilla/modules/libjar/nsIZipReader.idl
   mozilla/modules/libjar/nsJAR.cpp

The methods are:
   nsIUTF8StringEnumerator findEntries(in string aPattern);
   void extract(in string zipEntry, in nsIFile outFile);
   nsIZipEntry getEntry(in string zipEntry);
   boolean hasEntry(in AUTF8String zipEntry);

see IDL String types
   https://developer.mozilla.org/En/Mozilla_internal_string_guide

If I use this IDL in JavaScript the nsIUTF8StringEnumerator from findEntries() is auto converted to UTF-16. If I use this strings as input parameter in other methods there are differend types for zipEntry.

The method boolean hasEntry(in AUTF8String zipEntry); works fine!!!
The string from JavaScript is also auto converted in this direction to UTF-8.

All other methods in this IDL have ...(in string zipEntry).
In this case the high Byte value is cut off.
For germans Umlaute "äöü" this means, from UTF-8 convert to ISO-8859.
And this don't match in mZip->GetItem().

I will try to build a patch that improves the interface. Then we can test whether other functions are able to cope.
Comment 24 Wolfgang Germund 2011-09-02 03:34:19 PDT
Created attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0

This is a patch that improves the interface nsIZipReader.idl
I changed parameter (in string aEntryName) to (in AUTF8String zipEntry)

The JavaScript users don't need to change code. This is because the automatic type conversion from XPCOM do that.

The c++ users have to change
    parameter of type nsCString:  (xxx.get()) --> (xxx)
    parameter of type char*: (xxx) --> (nsDependentCString(xxx))
    for string literal use (NS_LITERAL_CSTRING("xxx"))
    for NULL use EmptyCString()

This patch changed all occurrence of nsIZipReader.idl in mozilla-central.
These are:
    modules/libjar/*
    caps/src/nsScriptSecurityManager.cpp
    xpcom/components/nsComponentManager.cpp
    xpinstall/src/nsXPInstallManager.cpp
Comment 25 Benjamin Smedberg [:bsmedberg] 2011-09-07 13:25:51 PDT
Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0

It's not clear to me that we want to support non-ASCII paths if there is any possible confusion about the character sets. But I'll defer to Taras since I really don't have the time to work through all the consequences.
Comment 26 (dormant account) 2011-09-14 17:31:50 PDT
Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0


> NS_IMETHODIMP
>-nsJAR::Test(const char *aEntryName)
>+nsJAR::Test(const nsACString &aEntryName)
> {
>-  return mZip->Test(aEntryName);
>+  char *entry = PL_strdup(PromiseFlatCString(aEntryName).get());
>+  if (!entry)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+
>+  if (*entry == '\0')
>+    entry = nsnull;

This is a memory leak.
Use nsCString for entry.
>+
>+  nsresult rv = mZip->Test(entry);
entry.get()

>-nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)
>+nsJAR::FindEntries(const nsACString &aPattern, nsIUTF8StringEnumerator **result)
> {
>   NS_ENSURE_ARG_POINTER(result);
> 
>+  char *pattern = PL_strdup(PromiseFlatCString(aPattern).get());
>+  if (!pattern)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+
>+  if (*pattern == '\0')
>+    pattern = nsnull;

leak

Sorry for the review lag, traveling atm. The rest of the patch looks good. This is an important fix, thanks for taking it on. I'm r-ing this for now, I'd like to review it again in case I missed anything.
Comment 27 Wolfgang Germund 2011-09-16 05:21:42 PDT
Created attachment 560540 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v2

I don't know, why I used local copies of the parameter.
Now I use aEntryName.IsEmpty()? nsnull : PromiseFlatCString(aEntryName).get()
It's easier to use and to understand. I changed that in nsJAR::Test and nsJAR::FindEntries.

What is about the IDL documentation in https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIZipReader and nsIZipReaderCache?
Is this auto generated or hand made? I miss also nsIZipWriter!

If I have to change the documentation by hand, I still need an example of how one describes a not kompatieble IDL. Must I create a new UUID for the interface?
Comment 28 O. Atsushi (Torisugari) 2011-09-16 23:21:03 PDT
If high bit is the only problem, why AUTF8String instead of ACString?

Most of non-ascii zip entries are not encoded in UTF-8. Indeed, attachment 265062 [details] (comment #13) seems Code Page 437.

S    a    u    t    é    .    t    x    t
0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)

In unicode, é is 0x000000E9 and in utf-8, 0xE9.
Comment 29 O. Atsushi (Torisugari) 2011-09-17 07:06:30 PDT
(In reply to O. Atsushi (Torisugari) from comment #28)
> In unicode, é is 0x000000E9 and in utf-8, 0xE9.

s/0xE9/0xC3A9/

I'm sorry for the spam.
Comment 30 Wolfgang Germund 2011-09-17 09:33:10 PDT
(In reply to O. Atsushi (Torisugari) from comment #28)
> If high bit is the only problem, why AUTF8String instead of ACString?

AUTF8String is the best type to communicate with unicode between c++ and javascript.
AUTF8String is auto converted to UTF-16 in Javascript and the UI works also with UTF-16.
see: https://developer.mozilla.org/En/Mozilla_internal_string_guide#IDL_String_types

> Most of non-ascii zip entries are not encoded in UTF-8. Indeed, attachment
> 265062 [details] (comment #13) seems Code Page 437.

This depends on the platform. Windos 7 and the many Linux works with UTF-8.

> S    a    u    t    é    .    t    x    t
> 0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)
> 
> In unicode, é is 0xC3A9 and in utf-8, 0xE9.

This is still a problem. In comment 23 I wrote:
   I think the Problem above has two aspects!

The Patch attachment 560540 [details] [diff] [review] solves only aspect 2!
Aspect 1 is still there  and we must remember that the bug 
is not completely fixed with my patch.

Important is, that the situation keeps getting better
and open the door for the next step!
Comment 31 (dormant account) 2011-09-22 17:09:10 PDT
Comment on attachment 560540 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v2

You do indeed need to bump the uuid in nsIZipReader.idl see https://developer.mozilla.org/en/Generating_GUIDs (short story: /msg firebot uuid & replace existing one)
Comment 32 (dormant account) 2011-09-22 17:13:48 PDT
Wolfgang, thanks for the patch. I think it is important to be able to read archives with unicode filenames.
Comment 33 Wolfgang Germund 2011-09-23 05:20:23 PDT
Created attachment 562019 [details] [diff] [review]
nsIZipReader.idl use always AUTF8String as in parameter v3

My changes to v2:
new UUID and CID for nsIZipReader and nsIZipReaderCache

Please check the patch on Try Server and check in.
Remember that the bug is not completely fixed with my patch. (comment 30)

What is about the IDL documentation in https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIZipReader and nsIZipReaderCache?
Is this auto generated or shall I change the wiki by hand?
Comment 34 (dormant account) 2011-09-23 12:24:51 PDT
Wolfgang,
You should apply for level 1 access so you can push to try yourself. Documentation is manual, assigning dev-doc-needed keyword to a bug after closing will get someone to update it.

I pushed the patch to try https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=bbfc4cc0c427 should get results posted in here. I would like to aim at landing this asap after the next aurora merge so this spends time in nightly testing.
Comment 35 Mozilla RelEng Bot 2011-09-23 19:21:05 PDT
Try run for bbfc4cc0c427 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=bbfc4cc0c427
Results (out of 171 total builds):
    exception: 3
    success: 162
    warnings: 5
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tglek@mozilla.com-bbfc4cc0c427
Comment 36 Ed Morley [:emorley] 2011-09-24 03:15:04 PDT
Comment on attachment 562019 [details] [diff] [review]
nsIZipReader.idl use always AUTF8String as in parameter v3

Thanks for the patch!

The try run looks good, but I'm going to remove the checkin flag for now (for the reason in comment 34), so this doesn't get checked in accidentally. As soon as the next aurora uplift happens (27th Sept) please re-add it and someone will push for you.

If you need any help getting level 1 commit access so you can push to try for future patches, just ask :-)
Comment 38 Michael Wu [:mwu] 2011-09-29 01:36:31 PDT
https://hg.mozilla.org/mozilla-central/rev/6032f7c15af8
Comment 39 Alfred Kayser 2011-09-29 22:18:40 PDT
Around 405:
  const char *filename = PromiseFlatCString(aFilename).get();
  if (*filename)
  {
    //-- Find the item
   nsCStringKey key(filename);
   nsJARManifestItem* manItem = static_cast<nsJARManifestItem*>(mManifestData.Get(&key));
Could have been:
  if (!aFilename.isEmpty())  {
    //-- Find the item
   nsJARManifestItem* manItem = static_cast<nsJARManifestItem*>(mManifestData.Get(&aFilename));

This saves the creation of a nsCString, and the copy of the string itself.
Comment 40 Wolfgang Germund 2011-10-26 04:33:54 PDT
(In reply to Alfred Kayser from comment #39)

I make a followup at bug 697061.

The Parameter of mManifestData.Get(&key) must be a nsCStringKey not a nsCString.
Comment 42 Chuck Baker 2012-01-28 09:24:28 PST
In the new documentation there is a paragraph under the "Warning" section that reads:

"In the ZIP specification are only filenames with 7 bit ASCII defined but usually the filesystems use 8 bit code pages for example UTF-8. This can mismatch characters in the nsIZipReader API. Since Gecko 10 the nsIZipReader API supports a limited 8 bit code page usage. All functions pass the filenames (in / out) as AUTF8String. So you can read and extract, what you wrote with nsIZipWriter. If you show filenames from the findEntries result in the UI,  the character matching is only on UTF-8 filesystems (ZIP archives) ok.".

(See https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIZipReader#Warning)

The wording, syntax, and grammar are confusing (at least to me).

I mean no disrespect to the author, but technical documentation needs to be concise with little room for misinterpretation.
Comment 43 O. Atsushi (Torisugari) 2012-02-01 08:32:39 PST
(In reply to Chuck Baker (Shadoefax on irc) from comment #42)

> I mean no disrespect to the author, but technical documentation needs to be
> concise with little room for misinterpretation.

That sounds we need reviewers (and a module owner) for documentation, besides web-developers.

By the way, filesystem is not the problem here. As comment #15 says, these days archiving applications encode a file name in their own ways, though they used to depend on filesystem's charset decades ago.

> what you wrote with nsIZipWriter
nsIZipWriter sets bit 11. So whether or not it can handle a given zip archive is predictable.
http://mxr.mozilla.org/mozilla-central/ident?i=FLAGS_IS_UTF8

Bit 11? ---> Yes -> OK.                            (UTF-8, typically by nsIZipWriter)
        |
        \--> No  -> pure ASCII? ---> Yes -> OK.    (ASCII, which is compatible with UTF-8)
                                |
                                \--> No  -> Fails. (Most of the Non-ASCII in the world.)

Probably we don't have to think about UTF-8 without bit 11. It's negligible, even though there are a lot of machines whose default charset is UTF-8.
Comment 44 :Paolo Amadini 2012-03-29 01:35:52 PDT
I've added the "addon-compat" keyword to this bug, since this interface change
require modifications to add-ons that use it. In fact, this keyword is almost
always needed when there is an interface change.

Also, while the documentation of the interface itself is updated in detail
(thanks!), this change should also be reported here:

https://developer.mozilla.org/en/Firefox_10_for_developers#Interface_changes

I would have added the required line to the page myself, but my MDN account is
temporarily unable to edit pages (bug 738900), thus I've also added the
"dev-doc-needed" keyword.
Comment 45 :Paolo Amadini 2012-03-29 01:56:07 PDT
Here is also some code showing how to keep compatibility with browser versions
prior to Firefox 10. In the add-on I'm maintaining, I'm using this technique now
(though slightly optimized). I'll keep this compatibility code at least until
Firefox 3.6 reaches end-of-life.


function isGecko10OrLater() {
  var platformVersion = Cc["@mozilla.org/xre/app-info;1"]
                        .getService(Ci.nsIXULAppInfo).platformVersion;
  return Cc["@mozilla.org/xpcom/version-comparator;1"]
         .getService(Ci.nsIVersionComparator)
         .compare(platformVersion, "10") >= 0;
}


var converter = Cc["@mozilla.org/intl/scriptableunicodeconverter"]
                .createInstance(Ci.nsIScriptableUnicodeConverter);
converter.charset = "UTF-8";


var zipReader = [...]


var zipEntries = zipReader.findEntries(null);
while (zipEntries.hasMore()) {
  var zipEntry = zipEntries.getNext();
  if (isGecko10OrLater()) {
    zipReader.extract(zipEntry, destFile);
  } else {
    zipReader.extract(converter.ConvertFromUnicode(zipEntry) +
                      converter.Finish(), destFile);
  }
}

Note You need to log in before you can comment on or make changes to this bug.