Closed Bug 746942 Opened 12 years ago Closed 12 years ago

Filename incorrectly encoded for upload using international UTF-8 characters

Categories

(Firefox :: File Handling, defect)

x86
macOS
defect
Not set
normal

RESOLVED DUPLICATE of bug 703161

People

(Reporter: m.swerts, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

1.64 KB, application/octet-stream
Attached file PHP app.zip
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.54.16 (KHTML, like Gecko) Version/5.1.4 Safari/534.54.16

Steps to reproduce:

When uploading a file with a UTF-8 filename (for example "WPF Délices  de Comines.pdf") the filename is not encoded correctly by Firefox 4 - 11.

We first hit this problem in a Java JSF application using a MySQL database. To rule out our own code, we wrote a small PHP application (attached to this issue) which writes the uploaded filename to a file. The same problem occurs.


Actual results:

The filename arrived as "WPF De´lices  de Comines.pdf". This does not happen in Firefox 3.6, Safari or Chrome.


Expected results:

The filename should arrive at the server as "WPF Délices  de Comines.pdf".
Product: Core → Firefox
Summary: Filename incorrect encoded for upload using international UTF-8 characters → Filename incorrectly encoded for upload using international UTF-8 characters
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Why is this a problem for your app?  The two filenames are equivalent as Unicode strings.

Note the extensive discussion about this in bug 695995 (of which this looks to be more properly a duplicate).
Depends on: 695995
I do not have a lot of knowledge regarding encodings, so I don't know which of the two bugs corresponds more closely to the one I reported.

The problem is that when I write the filename to a file/database using either PHP or Java, the filename is represented differently than when uploading the file using Safari or Chrome. So if they are represented differently, how can they be equivalent?

I first thought that the problem was in my code (we are using a custom application to store the files + filenames in an Alfresco repository), but even when uploading with the standard Alfresco user interface the problem occurs (question marks are rendered in the filename).
> So if they are represented differently, how can they be equivalent?

Please see discussion in bug 695995 and the W3C bug linked from it.  Basically, Unicode has two different ways to represent the "e with accent" letter: precomposed and decomposed.  Applications processing Unicode strings are supposed to treat the two ways as being equivalent (e.g. for things like string compares).  Equivalent doesn't mean "the same".

On the other hand, question marks being rendered in the filename should not happen, as long as you just pass through the Unicode string with the decomposed character.
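
To illustrate the precomposed/decomposed distinction described above, here is a minimal Java sketch. The class name `EquivalenceDemo` and the helper `canonicallyEquivalent` are hypothetical names chosen for this example; only `java.text.Normalizer` is a real JDK API. It shows that the two spellings differ as raw strings but compare equal once both are brought to the same normalization form.

```java
import java.text.Normalizer;

final class EquivalenceDemo {
    // Precomposed form: a single code point U+00E9 (e with acute).
    static final String COMPOSED = "D\u00E9lices";
    // Decomposed form: plain "e" followed by U+0301 (combining acute accent).
    static final String DECOMPOSED = "De\u0301lices";

    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean canonicallyEquivalent(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // A plain comparison sees two different code unit sequences.
        System.out.println(COMPOSED.equals(DECOMPOSED));                      // false
        System.out.println(COMPOSED.length() + " vs " + DECOMPOSED.length()); // 7 vs 8
        // After normalization the strings are identical.
        System.out.println(canonicallyEquivalent(COMPOSED, DECOMPOSED));      // true
    }
}
```

This is why "equivalent" is not the same as "identical": code that compares or stores the raw string without normalizing will treat the two forms as different values.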
Thanks for the quick reply.

After some research on normalization it indeed seems more appropriate to mark this bug as a duplicate of 695995. However I just experimented with normalization in the code and my results are the opposite of what you propose in your last comment. 

Your comment suggests that passing the decomposed string through should work, but when working with the Java Normalization API, I noticed that filenames from Firefox 3.6, Safari and Chrome arrive in NFC (composed), while filenames from Firefox 4 to Firefox 11 arrive in NFD (decomposed). Normalizing the Firefox filenames to NFC (composed) results in the correct accented characters being stored in the system. Leaving them as is results in wrong characters and question marks.

Does this make sense?
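
For reference, the kind of server-side normalization described above might look roughly like the following. This is a sketch only, not the actual application code: the class `UploadFilenames` and the method `toNfc` are hypothetical names, and the filename is assumed to have already been decoded into a Java `String`.

```java
import java.text.Normalizer;

final class UploadFilenames {
    // Hypothetical helper: returns the filename in NFC (composed) form.
    // Strings that are already NFC are returned unchanged.
    static String toNfc(String filename) {
        if (Normalizer.isNormalized(filename, Normalizer.Form.NFC)) {
            return filename;
        }
        return Normalizer.normalize(filename, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed name, as received from Firefox 4 to 11 in this report.
        String received = "WPF De\u0301lices  de Comines.pdf";
        String stored = toNfc(received);
        // The "e" + combining accent pair collapses into one code point.
        System.out.println(received.length() + " -> " + stored.length());
    }
}
```

Applying this once at the point where the upload is received means every later layer (database, repository, rendering) sees the same form regardless of which browser sent the file.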
It all makes sense except that last bit.  Why would there be question marks just because you put something in NFD on a web page?  Browsers should handle that....
The decomposed filename is stored in a MySQL database. In the database we can see that the stored decomposed filename is different from the composed filename (e.g. from Safari) and contains "e?" instead of "é".

Does this mean that an application that supports uploading files with international characters in the filename should always check the normalization form (NFC or NFD) and normalize manually?
Ah, that sounds like a bug somewhere between your server and the database, then, if the database ends up containing a '?'.

Filenames can in fact be NFD on Windows (in which case all Windows browsers will send them as NFD), so if part of your software stack can't handle NFD then you probably do want to normalize, no matter what we end up doing on Mac.
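
One possible way a '?' can appear, offered purely as a hypothesis since the thread never identifies the actual cause: if some layer between the server and the database converts the string to a single-byte charset such as ISO-8859-1, the precomposed "é" (U+00E9) has a mapping but the combining accent (U+0301) does not, so it is replaced. The sketch below reproduces that effect in Java; the class `LossyCharsetDemo` and the method `roundTripLatin1` are hypothetical, and the use of ISO-8859-1 is an assumption about the stack, not something stated in this bug.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class LossyCharsetDemo {
    // Hypothetical stand-in for a layer that stores text as ISO-8859-1.
    // String.getBytes(Charset) replaces unmappable characters with '?'.
    static String roundTripLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nfd = "De\u0301lices"; // "e" + combining acute accent
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);

        // The combining accent has no ISO-8859-1 mapping and is lost.
        System.out.println(roundTripLatin1(nfd)); // De?lices
        // The precomposed letter survives the same round trip.
        System.out.println(roundTripLatin1(nfc).equals(nfc)); // true
    }
}
```

If this is what is happening, normalizing to NFC hides the symptom for this filename, but any character outside the single-byte charset would still be lost, so checking the connection and column character sets is worth doing as well.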