Open Bug 695995 Opened 13 years ago Updated 2 years ago

Normalize filenames in form uploads and the DOM to NFC

Categories

(Core :: DOM: Core & HTML, defect, P3)

x86
macOS

People

(Reporter: ari.rantalainen, Unassigned)

Details

(Keywords: regression)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Build ID: 20110928134238

Steps to reproduce:

Upload a file named pöö.txt through HTML Upload <input type=file ...>


Actual results:

On server-side the file is stored as: po%CC%88o%CC%88.txt.


Expected results:

File should have been stored as: p%C3%B6%C3%B6.txt. This is the form sent by Chrome and Safari on OS X, and by IE and Firefox on Windows 7.
The bug is also reported by someone in FF 6.0.2 on OS X in https://support.mozilla.com/fi/questions/874246
Does the HTML code specify any encoding?
Component: General → HTML: Form Submission
Product: Firefox → Core
QA Contact: general → form-submission
From the page source:
<input type="file" name="filename1" value="" size="25" maxlength="200"/>

I'm also able to repeat this problem on FF8b3 and 4.2a1pre (2011-03-31).

In FF3.6.23 it works correctly.
Version: 7 Branch → 8 Branch
That didn't quite answer my question. Does the page set the encoding (codepage/charset), like utf-8 or iso-8859-1, anywhere? What does Page Info say about the encoding?
Page begins with:
<?xml version="1.0" encoding="utf-8"?>
> Page begins with:

That actually doesn't matter; in HTML the XML encoding declaration is ignored.

In any case, this is not an issue with the character encoding per se.  %CC%88 is the URL-encoded UTF-8 encoding of U+0308 COMBINING DIAERESIS.  So all you're seeing is the difference between precomposed forms (which is what the expected results above show) and decomposed forms.
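
To make the difference concrete, here is a small sketch showing that the two percent-encoded strings from the report are the precomposed and decomposed spellings of the same name. The variable names are illustrative only and do not come from the bug or from Gecko.

```javascript
// Precomposed form: U+00F6 LATIN SMALL LETTER O WITH DIAERESIS.
const precomposed = 'p\u00F6\u00F6.txt';
// Decomposed form: 'o' followed by U+0308 COMBINING DIAERESIS.
const decomposed = 'po\u0308o\u0308.txt';

// Percent-encoding the UTF-8 bytes reproduces both strings from the report.
const encodedPrecomposed = encodeURIComponent(precomposed); // 'p%C3%B6%C3%B6.txt'
const encodedDecomposed = encodeURIComponent(decomposed);   // 'po%CC%88o%CC%88.txt'

// The raw strings differ code unit by code unit...
const rawEqual = precomposed === decomposed; // false
// ...but they are canonically equivalent, so normalizing makes them match.
const nfcEqual = decomposed.normalize('NFC') === precomposed; // true
const nfdEqual = precomposed.normalize('NFD') === decomposed; // true
```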

The native filename encoding on Mac for HFS+ is in fact decomposed Unicode.

Josh, I assume this is fallout from the change to use nsLocalFileUnix on Mac?
Blocks: 571193
Status: UNCONFIRMED → NEW
Component: HTML: Form Submission → XPCOM
Ever confirmed: true
Keywords: regression
QA Contact: form-submission → xpcom
Version: 8 Branch → Trunk
Summary: Filename gets corrupted in upload → Filenames are uploaded in decomposed form on Mac
I should add that 'o' followed by U+0308 is a perfectly valid way of expressing 'ö' in Unicode, so I'm not sure why the behavior change here is a problem in the first place...
(In reply to Boris Zbarsky (:bz) from comment #6)

> Josh, I assume this is fallout from the change to use nsLocalFileUnix on Mac?

Possibly, but I worked on that a long time ago and I have little to no memory of what I did that might be related to this.
I doubt it's something you did; more likely MacOS hands out slightly different information to different APIs...
It'd be nice to have a regression range on this ... if there is one (if the problem hasn't always existed).
I went through quite a lot of nightlies and came to the conclusion that it still works as I expect on: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b3pre) Gecko/20100721 Minefield/4.0b3pre.

First version where this bug is present is: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b3pre) Gecko/20100722 Minefield/4.0b3pre
Steven, comment 3 is very clear that the behavior changed from 3.6 to 4.

The regression range from comment 11 is roughly http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2010-07-21&enddate=2010-07-23 (roughly because dates are hard to work with; the changeset ids from about:buildconfig would be a lot more useful).  That does in fact include the fix for bug 571193.

Ari, I still don't understand what problems this causes in practice.  Can you explain that, please?  In particular, see discussion in http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526
Regarding the w3 bug report: my application stores the files in objects. An object can only hold one file with a given name. In this case the object ends up holding two files with different names (pöö.txt vs. po¨o¨.txt) that display as having the same name.

It works as I'm expecting: http://hg.mozilla.org/mozilla-central/rev/1ac07fe5f6c9
It doesn't work as I'm expecting: http://hg.mozilla.org/mozilla-central/rev/0f5fc40c6a0f

I'm not an expert in how non-basic characters are handled. I'm reporting this bug because Firefox on OS X behaves differently from Firefox on other OSes (Windows 7 and Ubuntu 11.10), and also differently from the other browsers I tested on OS X (Safari 5.1, Chrome 14.0).

So is this a bug or a feature?
> In this case the object will house two files with different names 

Ah, and the comparison algorithm is not really doing Unicode string comparisons correctly?  See http://unicode.org/faq/normalization.html for info....

> So is this a bug or a feature?

It wasn't a purposeful behavior change, for sure.

So we're treating it as a bug for the moment.  But whatever comparison routine you're using that doesn't normalize before comparing is definitely broken, and might break in other browsers on other OSes any time.  That is, unless the W3C standardizes behavior here.
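
As a rough illustration of normalizing before comparing, here is a minimal sketch. The `sameFilename` helper and the `FileStore` class are hypothetical stand-ins for the reporter's application, which is not shown in this bug; the only real API used is `String.prototype.normalize`.

```javascript
// Hypothetical helper: compare two filenames after normalizing both to NFC.
const sameFilename = (a, b) => a.normalize('NFC') === b.normalize('NFC');

// Hypothetical store that holds one file per distinct name,
// keyed by the NFC form so that equivalent spellings collide.
class FileStore {
  constructor() {
    this.files = new Map();
  }
  put(name, contents) {
    this.files.set(name.normalize('NFC'), contents);
  }
  has(name) {
    return this.files.has(name.normalize('NFC'));
  }
  get size() {
    return this.files.size;
  }
}

const store = new FileStore();
store.put('p\u00F6\u00F6.txt', 'precomposed upload');
store.put('po\u0308o\u0308.txt', 'decomposed upload'); // replaces the same entry
```

With this keying the store ends up with a single entry, whichever form the browser sent. Note that NFC also unifies CJK compatibility ideographs with their unified counterparts, which may or may not be what an application wants.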
> Ah, and the comparison algorithm is not really doing Unicode string
> comparisons correctly?  See http://unicode.org/faq/normalization.html for

Yes, I finally understand that this is the case.

> So we're treating it as a bug for the moment.  But whatever comparison

Ok. I'll stay tuned to this bug report but will start to fix my code.

Thank you all very much for this triage and help with this. I learned a thing or two.
OK, so it seems like in practice we and Opera and IE all send whatever is on disk as the filename.  WebKit normalizes to NFC.

Looks like on Windows filenames usually end up as NFC if typed in directly, though there's no guarantee of that.  So Windows browsers will _usually_ end up sending NFC.

Ian is going to make the spec require NFC here (and in the DOM filename APIs).

We're probably not going to change our nsIFile code to do that, so we need to handle this on the DOM end.  Jonas, Kyle, any objections?
Component: XPCOM → DOM
QA Contact: xpcom → general
Summary: Filenames are uploaded in decomposed form on Mac → Normalize filenames in form uploads and the DOM to NFC
Please don't require normalizing to NFC. It will break CJK Compatibility Ideographs in filenames, which are popular in Japan because our government has approved those characters for use in personal names.
Filename normalization should be a platform convention rather than a DOM spec. Mac supports a dedicated API to normalize filenames to avoid breaking Compatibility Ideographs.
https://developer.apple.com/jp/technotes/tn1150.html#UnicodeSubtleties
(Sorry for Japanese, I couldn't find a corresponding English technote.)
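
A small sketch of the breakage being described, using U+FA10 as one example of a compatibility ideograph with a canonical decomposition (the choice of character and the variable names are illustrative, not taken from the technote):

```javascript
const compat = '\uFA10';  // CJK COMPATIBILITY IDEOGRAPH-FA10
const unified = '\u585A'; // the unified ideograph it canonically maps to

const afterNfc = compat.normalize('NFC');

// NFC replaces the compatibility ideograph with the unified one...
const changed = afterNfc !== compat;        // true
const becameUnified = afterNfc === unified; // true
// ...and no normalization form brings the original code point back.
const roundTrips = afterNfc.normalize('NFD') === compat; // false
```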
OK.  Can you please escalate that to the spec, then?  See link in url field.  I can do it if you really prefer, but having me pipe things back and forth is a little silly....

Also, does webkit not actually do full NFC normalization, then?
Will do, I can't log in to W3C Bugzilla right now.
> Also, does webkit not actually do full NFC normalization, then?
I have no working Mac. I'll ask Japanese Mac users.
> I have no working Mac. 

http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526#c11 says Safari and Chrome do some sort of normalization on Windows too.
Safari and Chrome normalized Compatibility Ideographs at least on Windows...
So their behavior there is broken in the way comment 18 says we should avoid?
Yes. Now I'm not sure I can convince them to pay the cost of non-standard normalization behavior.
emk: Please add a comment to http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526 describing what normalisation you think we should do. (I'm afraid I can't read Japanese so the link you cited above is not helpful to me. :-( )
See Also: → 703161
Blocks: 746942
I'm pretty sure Chrome removed this, but I can't find a reference. It also seems like it hasn't been a problem in the past five years?
I re-tested the original case with Firefox 55.0.2 (64-bit) and the behaviour is the same as when it was reported. Chrome has indeed moved to work the same way on OS X, but Safari is still using "C3 B6", as are all the Windows browsers that I have tested.

I have come to live with this over the years; I originally reported it because it was a sudden change in Firefox behaviour (and not an intended one, as stated in comment 14).
I believe the relevant part of that (now-ancient) Tech Note may be this, quoting from the English version at https://developer.apple.com/legacy/library/technotes/tn/tn1150.html#CanonicalDecomposition :

> The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs, and are not decomposed in HFS Plus strings.

This does indeed reduce breakage (in comparison to NFC/NFD/NFKC/NFKD) for some characters used in Japanese names. You might be able to approximate its effect these days in script like this:

const betterCompose = str => str.replace(/[^\uF900-\uFAFF]+/u, s => s.normalize('NFC'));

Note however that some of the characters in that range are visually indistinguishable from the characters with which normalization unifies them, and that this presence or absence of visual distinction furthermore varies both by locale and by font.
Oops! Missed the g when retyping that here:

const betterCompose = str => str.replace(/[^\uF900-\uFAFF]+/gu, s => s.normalize('NFC'));
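
For reference, a quick usage sketch of the corrected one-liner next to plain NFC. The sample filename is made up for illustration, and the helper covers only the U+F900 through U+FAFF range quoted above.

```javascript
const betterCompose = str => str.replace(/[^\uF900-\uFAFF]+/gu, s => s.normalize('NFC'));

// Decomposed 'pöö', then a compatibility ideograph, then an extension.
const input = 'po\u0308o\u0308\uFA10.txt';

// Runs outside U+F900..U+FAFF are composed; the ideograph is left alone.
const composed = betterCompose(input); // 'p\u00F6\u00F6\uFA10.txt'

// Plain NFC composes the diaereses too, but also rewrites the ideograph.
const plainNfc = input.normalize('NFC'); // 'p\u00F6\u00F6\u585A.txt'
```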
Priority: -- → P3
Component: DOM → DOM: Core & HTML
Severity: normal → S3