Closed Bug 296795 Opened 15 years ago Closed 9 years ago

ZipReader doesn't handle non-ASCII characters


(Core Graveyard :: File Handling, defect)






(Reporter: daniel, Assigned: wolfgang)


(Blocks 1 open bug)


(Keywords: addon-compat, dev-doc-needed)


(6 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

If a zip file contains files with umlauts like "ä", nsIZipReader doesn't handle
them correctly. Such a character seems to be silently dropped, so an entry like
"bär.gif" becomes "br.gif".

test script:
var zippath = "d:\\temp\\";
var filetype = "*.gif";
// Contract IDs restored; they were truncated in the original report.
var ZipReader = new Components.Constructor(
    "@mozilla.org/libjar/zip-reader;1", "nsIZipReader");
var ZipPath = new Components.Constructor(
    "@mozilla.org/file/local;1", "nsILocalFile", "initWithPath");

var zip = new ZipReader();
var zipFile = new ZipPath(zippath);
zip.init(zipFile);
zip.open();
var entries = zip.findEntries(filetype || "*");
var names = [];
// Loop body was truncated in the original; a typical body:
while (entries.hasMoreElements()) {
  names.push(entries.getNext().QueryInterface(Components.interfaces.nsIZipEntry).name);
}

Reproducible: Always

Steps to Reproduce:
1. Create a file with an umlaut or another character with charCode > 122 in its name
2. Zip that file. The name of the zip-File doesn't matter.
3. Run the given script with zippath set to your zip file

Actual Results:  
Characters with charCode > 122 are ignored.

Expected Results:  
Get the correct name of an entry in a zip file including all valid characters.
> valid character > charCode 122

127, right?
Summary: ZipReader doesn't handle Umlaute → ZipReader doesn't handle non-ASCII characters
Right. {, |, } and ~ seem to be handled correctly.

"§" will not be ignored but becomes "õ".
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Closed: 15 years ago
Resolution: --- → EXPIRED
The bug is still present, however the zip and enumerator interface has changed a bit.

I can reproduce it on the current XULRunner trunk using the following method:

var zippath = "d:\\temp\\";
var filetype = "*.gif";
// Contract IDs restored; they were truncated in the original comment.
var ZipReader = new Components.Constructor(
    "@mozilla.org/libjar/zip-reader;1", "nsIZipReader");
var ZipPath = new Components.Constructor(
    "@mozilla.org/file/local;1", "nsILocalFile", "initWithPath");

var zip = new ZipReader();
var zipFile = new ZipPath(zippath);
zip.open(zipFile);
var entries = zip.findEntries("*");
var names = [];
// Loop body was truncated in the original; a typical body:
while (entries.hasMore()) {
  names.push(entries.getNext());
}

All characters including and after a char code > 127 are cropped. So "bär.gif" will now become "b" only.
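For illustration, the cropping described here can be reproduced with a quick sketch (Python used purely for illustration; `crop_at_non_ascii` is a hypothetical stand-in mimicking the reported behavior, not the actual libjar code):

```python
def crop_at_non_ascii(name_bytes):
    """Mimic the reported bug: drop everything from the first byte > 127 on."""
    for i, b in enumerate(name_bytes):
        if b > 127:
            return name_bytes[:i].decode("ascii")
    return name_bytes.decode("ascii")

# "ä" is two bytes in UTF-8 (0xC3 0xA4), so the crop hits at index 1:
print(crop_at_non_ascii("bär.gif".encode("utf-8")))  # -> "b"
```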
Resolution: EXPIRED → ---
I can confirm this BUG

See  mozilla\modules\libjar\nsJAR.cpp  - implements nsIZipReader

nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)

  nsZipFind *find;
  nsresult rv = mZip.FindInit(aPattern, &find);

  nsIUTF8StringEnumerator *zipEnum = new nsJAREnumerator(find);

which wants a UTF-8 string   

We end up in nsUTF8Utils.h

     NS_ERROR("Not a UTF-8 string. This code should only be used for converting from known UTF-8 strings.");
The ultimate problem here is that nsZipArchive is built according to the original zip spec, which only allowed ASCII characters in the names. Updating it for the UTF-8 extensions to zip would be some work, and would also require changes to the nsIZipReader interface.
(In reply to comment #7)
> original zip spec which only allowed for ascii characters in the names.

Actually the original zip just took bytes, interpreted in whatever character set the reader wanted to interpret them as (whatever the local file system understood, basically). If you archived your own files or shared zips with friends in the same country as you things basically worked.

The underlying nsZipArchive implementation is likewise character-set agnostic, and just takes C strings.

nsIZipReader appears to have turned into a horrible mix of methods that take "string" and "AUTF8String".
file in zip is named saute.txt
Sauté C:\saute.txt C:\saute.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la bourguignonne.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la Maitre d'Hotel.txt
à la king.html
Pls	no quotes	in ->Pont l'Évêque cheese
file named Sauté.txt with same contents as  C:\FileWithNONAsciiInternalCharters.ZIP.  Crash and burn
Confirming due to comment #13.

Ever confirmed: true
Part of the problem is that prior to the relatively recent (2006) "general
purpose bit 11" flag to indicate utf8 filenames most zip implementations
allowed whatever filenames happened to be valid on the current OS -- it's
just storing length-counted char* so it doesn't really care what the caller
passes in. The "official" spec from the PKZIP people says that it's always
to be interpreted as IBM's old DOS codepage 437. That's already more than
just ascii, and since it's totally unsuited to anything outside US/western
europe that was taken as permission to just go ahead and use whatever
you've got. Windows versions of zip utilities do appear to call ANSIToOEM()
on the filenames, though, probably to remain compatible with old
DOS-produced archives.

Interesting facts:

- WinZip creates archives with cp437 file names (Windows' "OEM" charset).
- WinRAR creates archives with "OEM" filenames.
- in contrast, command-line zip (cygwin) created archives using iso-8859
file names (Windows' "ANSI" character set) -- just guessing, but iso8859 is
a common file system charset on Unix platforms, so maybe that's "more
compatible" in that world.
- none of the three used utf8 encoding or the general purpose bit 11 flag
- all of command-line zip, WinZip, and 7-zip correctly read both versions,
restoring the correct characters when the archive was listed or extracted
whether cygwin's "ANSI" charset or the other's "OEM" charset.
- WinRAR created archives with "OEM" filenames, and automatically assumed
the characters were OEM; it mangled the cygwin zip filenames

The other difference between WinZip archives and zip archives is that the
cygwin zip archives are flagged as "unix" ("version made by" has a high
byte of x03 - UNIX, although the "version needed to extract" does not).
The cygwin zip also uses the optional "extra field" for file permissions and misc stuff, but I think that's less relevant.
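The OEM/ANSI mismatch from the experiments above can be sketched in a few lines (Python used purely for illustration):

```python
# "bär.gif" as cygwin's zip would store it: ISO-8859-1 ("ANSI"),
# where "ä" is the single byte 0xE4.
ansi_bytes = "bär.gif".encode("latin-1")
assert ansi_bytes == b"b\xe4r.gif"

# A reader that assumes cp437 ("OEM") filenames, as WinRAR did in the
# experiment above, mangles it: byte 0xE4 is "Σ" in cp437.
print(ansi_bytes.decode("cp437"))  # -> "bΣr.gif"
```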

Supporting "UTF8" zip archives is _not_ going to help these guys. If we
want to make all the interfaces UTF8 and then make a stab at charset
conversion from cp437 (or guess at other "OEM" charsets) that might help,
but the most common zip utilities are not creating archives with unicode filenames.

If we want to support this we'll have to convert all the names as we read
the directory structure because theoretically this "made by unix" flag
could differ from file to file. Well, more than theoretical, I've created
them by using cygwin to add to a winzip-created archive and vice versa --
you get a nice mix of charsets there. Again, zip, WinZip, and 7-zip handle
the mixture gracefully, displaying the list in a unified character set
(whether at the command line or GUI display, or when extracting the files
into the filesystem charset).

Another potential wrinkle, the zip library is currently also built standalone so it can be incorporated into the old Suite installer. In that context it won't have access to our charset conversion routines. Maybe we no longer care if we
can get SuiteRunner to use the NSIS installer in the FF3 timeframe. Otherwise we'll have to put the conversion in the level above nsZipArchive, which would be more awkward. I guess we could stub them out with calls to Windows' OemToAnsi calls.

If we could simply fix this internally to nsZipArchive I'd say just make
those char* interfaces always UTF8, converted from whatever they actually
are on disk.
There are other "version made by" values we might be concerned about. I have absolutely no idea if they come up in practice but if they do they might imply different character set conversions. The Macintosh value, for example, is probably the Mac-Roman character set. There's a different value for OS X--what character set would that be?--but we'd have to poke around and see if that's used. There's a Windows NTFS value that none of the Windows zip archivers I tried actually used, they stuck with the original "MS-DOS" value for compatibility presumably.

We could peek at the info-zip code and see what they do with all this; I think their license is compatible with ours, so we don't have to worry about being tainted just by looking.

For reference the PKZIP spec is found at
If we change our thinking to supporting web applications, not just the legacy zip specification, I think we will be better off. Integrate something like libiconv with compression and a new string interface to support it. Support the compression part of the old standard for the major file systems, but add options for binary files and for files to be converted from one code page to another, and support the file system calls on the major platforms.

Supporting the developers working on this project with development access to Macs, PCs, and Linux and Unix boxes would be very helpful. Give them some clear specifications of what has to be supported and what is optional. Exclude a few things: ISO-8859-1 gives you Windows in the US. Forget about things like x80-x90 on Windows-1252; trying to work out the wrinkles with those code points could be a career or two.

This is an area where a small team needs to form. Split the project into two, with a Western and an Eastern focus. Another option, one that just supports file transfers between systems, would be to distribute an open-source zipper on Windows and provide a protected way of calling tar on Mac and Unix and Linux.

I think Mr. Daniel Veditz has proved that the ZIP spec has enough ambiguities that it should not be taken literally. BTW thanks.     
What does "supporting web applications" mean? Web applications don't make zip archives; they call command-line zip, use Java classes, or Perl libraries. We need to handle what those produce.

The zip spec hosted by PKWARE is a good place to start, it's a living document and does represent some cross-vendor consensus in the interests of compatibility. Like web standards, specific implementations do vary.

I don't understand the link to libiconv. We already have character set conversion functionality in Mozilla.

The trick is to figure out what conversion to use since there doesn't seem to be a marker present in the files. I don't care about the difference between iso-8859-1 and cp1252, I want to know how to tell the difference between iso-8859-1 and iso-8859-2 or whatever multibyte character set the Asian countries will be using. We have the converters, we just need to know which one to use.
1) Bytes is bytes, unless you give them context. For example, for US English, if you want to support all valid filenames on both the PC and the Mac, first of all, you can't. Second, you can support 99.99% of them by eliminating the illegal characters on both platforms (and a couple of other restrictions). But you still have to call some kind of code page conversion so that the valid PC characters above dec 127 get converted into UTF Mac and vice versa. If you are going from Mac to PC, it gets a whole lot more complicated. You can translate a resource fork, but without a lot more application-level support, it's not going to be meaningful. But in a real sense that is presentation-level data, just like application formats carry presentation-level data, and you can choose not to support that. What is missing from the zip spec is the concept of From and To. It was not the pressing concern when we had 640K and most data transfer was done with an IBM mainframe tape.

2) From our earlier posts, “The trick is to figure out what conversion” is  “options to convert them”  and “(whatever the local file system understood, basically)”. 

 3) Let's say "like libiconv" means your (Mozilla's) character conversion routines.

 4) I am sure I do not understand all the things a web application might do. But I would not preclude doing things with files and data hosted on local machines. And if you look at the success of the zip format, one of the things it did was help move data around from one machine to another. For a definition of what I was thinking about when I mentioned web applications, you can start with this, and then expand it.

What I am trying to get at, is the state of the art is such that we cannot move all data around seamlessly.  We can move around a lot of data, accurately, if we live with a few restrictions.  

I do not know what would happen if someone on a Mac tried to zip up files with non-ASCII characters and then someone tried to read that zip on a PC. Does the Mac zip up UTF? What happens when you try to unzip them with various programs on the PC?

If you use plain ASCII, you do not have these problems, but then you do not fully support zip applications. What we have now is going to fail as soon as filenames use non-ASCII data. But that is not a big deal, as long as you know what will and will not work for you.

In my case, I can live with accessing the zip utilities on the various platforms.  I will convert my data to the target machines format.  In the case of the MAC and UNIX, I can try and use tar, on the PC I can distribute an open source zip from  

Firefox's amazing built-in jar: protocol handler shows this problem. Using the attachment from comment #13, the URL
displays the ZIP file's contents, showing the file as "File:Saut?.txt"  (that appears to be the ASCII '?' mark symbol, not a missing glyph icon).

If you click the filename, Firefox attempts to load the file
(no escaping or encoding), and reports File not found.  There is nothing in the Error Console in Firefox 3.6apre1.

This also seems to expose bugs in the way the jar viewer represents and parses non-ASCII characters.
Assignee: file-handling → nobody
QA Contact: ian → file-handling
Blocks: 445065
If you attach multiple files to an email and send them to a Hotmail recipient, the recipient gets the option to "download all" attachments.  When you click on "download all", it zips up all the files and saves them to your hard drive.  It zips up files with non ascii characters in the filename.
Duplicate of this bug: 552367

I think the problem above has two aspects:
1) What encoding comes from the archive?
2) How does the interface nsIZipReader.idl work from JavaScript to C++?

On aspect 1:
My openSUSE Linux uses UTF-8 for filenames.
For a file test_ü.txt in an archive, zipinfo shows:
-rw-r--r--  3.0 unx        7 tx stor 11-Aug-10 16:15 test_??.txt

I have made a source code analysis of aspect 2.

In comment 11 Dave comes to the point:
>> nsIZipReader appears to have turned into a horrible mix of methods that take >> "string" and "AUTF8String".

The reason is in the files:

The methods are:
   nsIUTF8StringEnumerator findEntries(in string aPattern);
   void extract(in string zipEntry, in nsIFile outFile);
   nsIZipEntry getEntry(in string zipEntry);
   boolean hasEntry(in AUTF8String zipEntry);

see IDL String types

If I use this IDL in JavaScript, the nsIUTF8StringEnumerator from findEntries() is auto converted to UTF-16. If I use these strings as input parameters in other methods, there are different types for zipEntry.

The method boolean hasEntry(in AUTF8String zipEntry); works fine!
The string from JavaScript is also auto converted in this direction, to UTF-8.

All other methods in this IDL have ...(in string zipEntry).
In this case the high byte value is cut off.
For German umlauts "äöü" this means converting from UTF-8 to ISO-8859.
And this doesn't match in mZip->GetItem().
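This mismatch can be made concrete with a small sketch (Python used purely for illustration; the byte strings stand in for what the archive stores versus what a narrowed "in string" parameter delivers):

```python
name = "test_ü.txt"

# What a UTF-8 filesystem (and thus the archive) stores for the entry:
stored = name.encode("utf-8")                  # b"test_\xc3\xbc.txt"

# What arrives through an "in string" parameter: each UTF-16 code unit
# has its high byte cut off, which for "äöü" equals the Latin-1 bytes.
narrowed = bytes(ord(c) & 0xFF for c in name)  # b"test_\xfc.txt"

# A lookup comparing these byte strings cannot match:
print(stored == narrowed)  # -> False
```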

I will try to build a patch that improves the interface. Then we can test whether other functions are able to cope.
This is a patch that improves the interface nsIZipReader.idl.
I changed the parameter (in string aEntryName) to (in AUTF8String zipEntry).

The JavaScript users don't need to change code, because the automatic type conversion from XPCOM does that.

The C++ users have to change
    parameter of type nsCString:  (xxx.get()) --> (xxx)
    parameter of type char*: (xxx) --> (nsDependentCString(xxx))
    for string literal use (NS_LITERAL_CSTRING("xxx"))
    for NULL use EmptyCString()

This patch changes all occurrences of nsIZipReader.idl usage in mozilla-central.
These are:
Attachment #557793 - Flags: feedback?(tglek)
Attachment #557793 - Flags: feedback?(benjamin)
Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0

It's not clear to me that we want to support non-ASCII paths if there is any possible confusion about the character sets. But I'll defer to Taras since I really don't have the time to work through all the consequences.
Attachment #557793 - Flags: feedback?(benjamin)
Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0

>-nsJAR::Test(const char *aEntryName)
>+nsJAR::Test(const nsACString &aEntryName)
> {
>-  return mZip->Test(aEntryName);
>+  char *entry = PL_strdup(PromiseFlatCString(aEntryName).get());
>+  if (!entry)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+  if (*entry == '\0')
>+    entry = nsnull;

This is a memory leak.
Use nsCString for entry.
>+  nsresult rv = mZip->Test(entry);

>-nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)
>+nsJAR::FindEntries(const nsACString &aPattern, nsIUTF8StringEnumerator **result)
> {
>+  char *pattern = PL_strdup(PromiseFlatCString(aPattern).get());
>+  if (!pattern)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+  if (*pattern == '\0')
>+    pattern = nsnull;


Sorry for the review lag, traveling atm. The rest of the patch looks good. This is an important fix, thanks for taking it on. I'm r-ing this for now, I'd like to review it again in case I missed anything.
Attachment #557793 - Flags: feedback?(tglek) → feedback-
I don't know why I used local copies of the parameter.
Now I use aEntryName.IsEmpty() ? nsnull : PromiseFlatCString(aEntryName).get()
It's easier to use and to understand. I changed that in nsJAR::Test and nsJAR::FindEntries.

What about the IDL documentation for nsIZipReader and nsIZipReaderCache?
Is this auto generated or hand made? I also miss nsIZipWriter!

If I have to change the documentation by hand, I still need an example of how one describes an incompatible IDL change. Must I create a new UUID for the interface?
Attachment #557793 - Attachment is obsolete: true
Attachment #560540 - Flags: review?(tglek)
If high bit is the only problem, why AUTF8String instead of ACString?

Most non-ASCII zip entries are not encoded in UTF-8. Indeed, attachment 265062 [details] (comment #13) seems to be Code Page 437.

S    a    u    t    é    .    t    x    t
0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)

In Unicode, é is U+00E9 and in UTF-8, 0xC3 0xA9.
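A quick check confirms the byte sequences above (Python used purely for illustration):

```python
name = "Sauté.txt"

# cp437 stores "é" as the single byte 0x82 -- exactly the dump above:
assert name.encode("cp437") == b"Saut\x82.txt"

# UTF-8 would store it as the two bytes 0xC3 0xA9 instead:
assert name.encode("utf-8") == b"Saut\xc3\xa9.txt"
```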
(In reply to O. Atsushi (Torisugari) from comment #28)
> In Unicode, é is U+00E9 and in UTF-8, 0xC3 0xA9.


I'm sorry for the spam.
(In reply to O. Atsushi (Torisugari) from comment #28)
> If high bit is the only problem, why AUTF8String instead of ACString?

AUTF8String is the best type to communicate Unicode between C++ and JavaScript.
AUTF8String is auto converted to UTF-16 in JavaScript, and the UI also works with UTF-16.

> Most of non-ascii zip entries are not encoded in UTF-8. Indeed, attachment
> 265062 [details] (comment #13) seems Code Page 437.

This depends on the platform. Windows 7 and many Linux systems work with UTF-8.

> S    a    u    t    é    .    t    x    t
> 0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)
> In Unicode, é is U+00E9 and in UTF-8, 0xC3 0xA9.

This is still a problem. In comment 23 I wrote:
   I think the Problem above has two aspects!

The Patch attachment 560540 [details] [diff] [review] solves only aspect 2!
Aspect 1 is still there  and we must remember that the bug 
is not completely fixed with my patch.

The important thing is that the situation keeps getting better
and opens the door for the next step!
Comment on attachment 560540 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v2

You do indeed need to bump the uuid in nsIZipReader.idl see (short story: /msg firebot uuid & replace existing one)
Attachment #560540 - Flags: review?(tglek) → review+
Wolfgang, thanks for the patch. I think it is important to be able to read archives with unicode filenames.
My changes to v2:
new UUID and CID for nsIZipReader and nsIZipReaderCache

Please check the patch on Try Server and check in.
Remember that the bug is not completely fixed with my patch. (comment 30)

What about the IDL documentation for nsIZipReader and nsIZipReaderCache?
Is this auto generated, or shall I change the wiki by hand?
Attachment #560540 - Attachment is obsolete: true
Attachment #562019 - Flags: checkin?(tglek)
You should apply for level 1 access so you can push to try yourself. Documentation is manual; adding the dev-doc-needed keyword to a bug after closing will get someone to update it.

I pushed the patch to try; results should get posted in here. I would like to aim at landing this ASAP after the next aurora merge, so this spends time in nightly testing.
Try run for bbfc4cc0c427 is complete.
Detailed breakdown of the results available here:
Results (out of 171 total builds):
    exception: 3
    success: 162
    warnings: 5
    failure: 1
Builds available at
Assignee: nobody → wgermund
OS: Windows XP → All
Hardware: x86 → All
Whiteboard: [land after 27th sept aurora uplift]
Comment on attachment 562019 [details] [diff] [review]
nsIZipReader.idl use always AUTF8String as in parameter v3

Thanks for the patch!

The try run looks good, but I'm going to remove the checkin flag for now (for the reason in comment 34), so this doesn't get checked in accidentally. As soon as the next aurora uplift happens (27th Sept) please re-add it and someone will push for you.

If you need any help getting level 1 commit access so you can push to try for future patches, just ask :-)
Attachment #562019 - Flags: checkin?(tglek)
Keywords: dev-doc-needed
Whiteboard: [land after 27th sept aurora uplift] → [inbound]
Closed: 15 years ago → 9 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla10
Around line 405:

  const char *filename = PromiseFlatCString(aFilename).get();
  if (*filename) {
    //-- Find the item
    nsCStringKey key(filename);
    nsJARManifestItem* manItem =
        static_cast<nsJARManifestItem*>(mManifestData.Get(&key));
  }

Could have been:

  if (!aFilename.IsEmpty()) {
    //-- Find the item
    nsJARManifestItem* manItem =
        static_cast<nsJARManifestItem*>(mManifestData.Get(&aFilename));
  }

This saves the creation of a nsCString, and the copy of the string itself.
Blocks: 697061
(In reply to Alfred Kayser from comment #39)

I made a followup at bug 697061.

The parameter of mManifestData.Get(&key) must be an nsCStringKey, not an nsCString.
In the new documentation there is a paragraph under the "Warning" section that reads:

"In the ZIP specification are only filenames with 7 bit ASCII defined but usually the filesystems use 8 bit code pages for example UTF-8. This can mismatch characters in the nsIZipReader API. Since Gecko 10 the nsIZipReader API supports a limited 8 bit code page usage. All functions pass the filenames (in / out) as AUTF8String. So you can read and extract, what you wrote with nsIZipWriter. If you show filenames from the findEntries result in the UI,  the character matching is only on UTF-8 filesystems (ZIP archives) ok.".


The wording, syntax, and grammar are confusing (at least to me).

I mean no disrespect to the author, but technical documentation needs to be concise with little room for misinterpretation.
(In reply to Chuck Baker (Shadoefax on irc) from comment #42)

> I mean no disrespect to the author, but technical documentation needs to be
> concise with little room for misinterpretation.

That sounds like we need reviewers (and a module owner) for documentation, besides web developers.

By the way, the filesystem is not the problem here. As comment #15 says, these days archiving applications encode a file name in their own ways, though they used to depend on the filesystem's charset decades ago.

> what you wrote with nsIZipWriter
nsIZipWriter sets bit 11. So whether or not it can handle a given zip archive is predictable.

Bit 11? ---> Yes -> OK.                            (UTF-8, typically by nsIZipWriter)
        \--> No  -> pure ASCII? ---> Yes -> OK.    (ASCII, which is compatible with UTF-8)
                                \--> No  -> Fails. (Most of the Non-ASCII in the world.)

Probably we don't have to think about UTF-8 without bit 11. It's negligible, even though there are a lot of machines whose default charset is UTF-8.
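The decision tree above can be sketched as a small classifier (Python used purely for illustration; `classify_name` is a hypothetical helper, not part of libjar):

```python
def classify_name(name_bytes, flag_bits):
    """Apply the decision tree: bit 11 -> UTF-8; else only pure ASCII is safe."""
    if flag_bits & 0x800:                   # general purpose bit 11: UTF-8 names
        return "ok-utf8"
    if all(b < 0x80 for b in name_bytes):
        return "ok-ascii"                   # ASCII is a subset of UTF-8
    return "fails"                          # most non-ASCII archives in the wild

print(classify_name("Sauté.txt".encode("utf-8"), 0x800))  # -> "ok-utf8"
print(classify_name(b"readme.txt", 0))                    # -> "ok-ascii"
print(classify_name("Sauté.txt".encode("cp437"), 0))      # -> "fails"
```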
I've added the "addon-compat" keyword to this bug, since this interface change
requires modifications to add-ons that use it. In fact, this keyword is almost
always needed when there is an interface change.

Also, while the documentation of the interface itself is updated in detail
(thanks!), this change should also be reported here:

I would have added the required line to the page myself, but my MDN account is
temporarily unable to edit pages (bug 738900), thus I've also added the
"dev-doc-needed" keyword.
Here is also some code showing how to keep compatibility with browser versions
prior to Firefox 10. In the add-on I'm maintaining, I'm using this technique now
(though slightly optimized). I'll keep this compatibility code at least until
Firefox 3.6 reaches end-of-life.

// Cc/Ci are the usual shorthand for Components.classes/Components.interfaces.
// Contract IDs restored; they were truncated in the original comment.
function isGecko10OrLater() {
  var platformVersion = Cc["@mozilla.org/xre/app-info;1"]
                        .getService(Ci.nsIXULAppInfo).platformVersion;
  return Cc["@mozilla.org/xpcom/version-comparator;1"]
         .getService(Ci.nsIVersionComparator)
         .compare(platformVersion, "10") >= 0;
}

var converter = Cc["@mozilla.org/intl/scriptableunicodeconverter"]
                .createInstance(Ci.nsIScriptableUnicodeConverter);
converter.charset = "UTF-8";

var zipReader = [...]

var zipEntries = zipReader.findEntries(null);
while (zipEntries.hasMore()) {
  var zipEntry = zipEntries.getNext();
  if (isGecko10OrLater()) {
    zipReader.extract(zipEntry, destFile);
  } else {
    zipReader.extract(converter.ConvertFromUnicode(zipEntry) +
                      converter.Finish(), destFile);
  }
}
Product: Core → Core Graveyard