695995 - Normalize filenames in form uploads and the DOM to NFC

Reporter

Description

•

14 years ago

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 Build ID: 20110928134238

Steps to reproduce: Upload a file named pöö.txt through HTML Upload <input type=file ...>

Actual results: On the server side the file is stored as: po%CC%88o%CC%88.txt.

Expected results: File should have been stored as: p%C3%B6%C3%B6.txt. This encoding is used with Chrome and Safari on OS X and also IE and FF on Windows 7.

Ari Rantalainen

Reporter

Comment 1

•

14 years ago

The bug is also reported by someone in FF 6.0.2 on OS X in https://support.mozilla.com/fi/questions/874246

:aceman

Comment 2

•

14 years ago

Does the HTML code specify any encoding?

Component: General → HTML: Form Submission

Product: Firefox → Core

QA Contact: general → form-submission

Ari Rantalainen

Reporter

Comment 3

•

14 years ago

From the page source: <input type="file" name="filename1" value="" size="25" maxlength="200"/> I'm also able to repeat this problem on FF8b3 and 4.2a1pre (2011-03-31). In FF3.6.23 it works correctly.

Ari Rantalainen

Reporter

Updated

•

14 years ago

Version: 7 Branch → 8 Branch

:aceman

Comment 4

•

14 years ago

That didn't quite answer my question. Does the page set the encoding (codepage/charset), like utf-8 or iso-8859-1, anywhere? What does Page Info say about the encoding?

Ari Rantalainen

Reporter

Comment 5

•

14 years ago

Page begins with: <?xml version="1.0" encoding="utf-8"?>

Boris Zbarsky [:bzbarsky]

Comment 6

•

14 years ago

> Page begins with:

That actually doesn't matter; in HTML the XML encoding declaration is ignored. In any case, this is not an issue with the character encoding per se. %CC%88 is the URL-encoded UTF-8 encoding of U+0308 COMBINING DIAERESIS. So all you're seeing is the difference between precomposed forms (which is what the expected results above show) and decomposed forms. The native filename encoding on Mac for HFS+ is in fact decomposed Unicode.

Josh, I assume this is fallout from the change to use nsLocalFileUnix on Mac?
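To illustrate the precomposed/decomposed difference described in this comment, here is a minimal sketch in JavaScript. It only uses the standard `encodeURIComponent` and `String.prototype.normalize`; the variable names are made up for the example and nothing here reflects Gecko's actual upload code.

```javascript
// Sketch only: the same visible filename in two canonically equivalent spellings.
const precomposed = 'p\u00F6\u00F6.txt';     // "ö" as the single code point U+00F6
const decomposed = 'po\u0308o\u0308.txt';    // "o" followed by U+0308 COMBINING DIAERESIS

// Percent-encode the UTF-8 bytes of each spelling.
const encodedPrecomposed = encodeURIComponent(precomposed);
const encodedDecomposed = encodeURIComponent(decomposed);

console.log(encodedPrecomposed); // p%C3%B6%C3%B6.txt   (the "expected" name in the report)
console.log(encodedDecomposed);  // po%CC%88o%CC%88.txt (the "actual" name in the report)

// The two strings differ code point by code point...
console.log(precomposed === decomposed); // false
// ...but normalizing either one to the other's form makes them equal.
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true
```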

Blocks: 571193

Status: UNCONFIRMED → NEW

Component: HTML: Form Submission → XPCOM

Ever confirmed: true

Keywords: regression

QA Contact: form-submission → xpcom

Version: 8 Branch → Trunk

Boris Zbarsky [:bzbarsky]

Updated

•

14 years ago

Summary: Filename gets corrupted in upload → Filenames are uploaded in decomposed form on Mac

Boris Zbarsky [:bzbarsky]

Comment 7

•

14 years ago

I should add that 'o' followed by U+0308 is a perfectly valid way of expressing 'ö' in Unicode, so I'm not sure why the behavior change here is a problem in the first place...

Josh Aas

Comment 8

•

14 years ago

(In reply to Boris Zbarsky (:bz) from comment #6)
> Josh, I assume this is fallout from the change to use nsLocalFileUnix on Mac?

Possibly, but I worked on that a long time ago and I have little to no memory of what I did that might be related to this.

Boris Zbarsky [:bzbarsky]

Comment 9

•

14 years ago

I doubt it's something you did; more likely MacOS hands out slightly different information to different APIs...

Steven Michaud [:smichaud] (Retired)

Comment 10

•

14 years ago

It'd be nice to have a regression range on this ... if there is one (if the problem hasn't always existed).

Ari Rantalainen

Reporter

Comment 11

•

14 years ago

I went through quite a lot of nightlies and came to the conclusion that it still works as I expect on: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b3pre) Gecko/20100721 Minefield/4.0b3pre. The first version where this bug is present is: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b3pre) Gecko/20100722 Minefield/4.0b3pre

Boris Zbarsky [:bzbarsky]

Comment 12

•

14 years ago

Steven, comment 3 is very clear that the behavior changed from 3.6 to 4. The regression range from comment 11 is roughly http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2010-07-21&enddate=2010-07-23 (roughly because dates are hard to work with; the changeset ids from about:buildconfig would be a lot more useful). That does in fact include the fix for bug 571193. Ari, I still don't understand what problems this causes in practice. Can you explain that, please? In particular, see discussion in http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526

Ari Rantalainen

Reporter

Comment 13

•

14 years ago

Regarding the W3C bug report: my application stores files in objects, and an object can only hold one file with a given name. In this case the object will house two files with different names (pöö.txt vs. po¨o¨.txt) that display as having the same name. It works as I expect in: http://hg.mozilla.org/mozilla-central/rev/1ac07fe5f6c9 It doesn't work as I expect in: http://hg.mozilla.org/mozilla-central/rev/0f5fc40c6a0f I'm not an expert in how non-basic characters are handled. I'm reporting this bug because Firefox on OS X works differently from FF on other OSes (Windows 7 and Ubuntu 11.10), and also differently from the other browsers I tested on OS X (Safari 5.1, Chrome 14.0). So is this a bug or a feature?

Boris Zbarsky [:bzbarsky]

Comment 14

•

14 years ago

> In this case the object will house two files with different names

Ah, and the comparison algorithm is not really doing Unicode string comparisons correctly? See http://unicode.org/faq/normalization.html for info....

> So is this a bug or a feature?

It wasn't a purposeful behavior change, for sure. So we're treating it as a bug for the moment. But whatever comparison routine you're using that doesn't normalize before comparing is definitely broken, and might break in other browsers on other OSes any time. That is, unless the W3C standardizes behavior here.
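As a rough sketch of what "normalize before comparing" could look like on the application side, assuming the application is free to pick NFC as its canonical form: `sameFilename`, `filesByName` and `storeFile` are hypothetical names invented for this example, not part of the reporter's code or of any browser API.

```javascript
// Hypothetical helper: treat two filenames as equal if they are canonically
// equivalent, by bringing both to the same normalization form first.
function sameFilename(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Hypothetical store keyed by the normalized name, so that the precomposed and
// decomposed spellings of one name map to a single entry.
const filesByName = new Map();
function storeFile(name, contents) {
  filesByName.set(name.normalize('NFC'), contents);
}

storeFile('p\u00F6\u00F6.txt', 'uploaded with a precomposed name');
storeFile('po\u0308o\u0308.txt', 'uploaded with a decomposed name');

console.log(sameFilename('p\u00F6\u00F6.txt', 'po\u0308o\u0308.txt')); // true
console.log(filesByName.size); // 1
```

Normalizing at the point of storage keeps the lookup key stable regardless of which browser or OS produced the name; whether NFC is the right form for every script is a separate question.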

Ari Rantalainen

Reporter

Comment 15

•

14 years ago

> Ah, and the comparison algorithm is not really doing Unicode string
> comparisons correctly? See http://unicode.org/faq/normalization.html for

Yes, I finally understand that this is the case.

> So we're treating it as a bug for the moment. But whatever comparison

Ok. I'll stay tuned to this bug report but will start to fix my code. Thank you all very much for this triage and help with this. I learned a thing or two.

Boris Zbarsky [:bzbarsky]

Comment 16

•

14 years ago

OK, so it seems like in practice we and Opera and IE all send whatever is on disk as the filename. WebKit normalizes to NFC. Looks like on Windows filenames usually end up as NFC if typed in directly, though there's no guarantee of that. So Windows browsers will _usually_ end up sending NFC. Ian is going to make the spec require NFC here (and in the DOM filename APIs). We're probably not going to change our nsIFile code to do that, so we need to handle this on the DOM end. Jonas, Kyle, any objections?

URL: http://www.w3.org/Bugs/Public/show_bu...

Component: XPCOM → DOM

QA Contact: xpcom → general

Summary: Filenames are uploaded in decomposed form on Mac → Normalize filenames in form uploads and the DOM to NFC

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 17

•

14 years ago

Seems reasonable to me.

Masatoshi Kimura [:emk]

Comment 18

•

14 years ago

Please don't require normalizing to NFC. It will break CJK Compatibility Ideographs in filenames, which are popular in Japan because our government has certified those characters for use in personal names. Filename normalization should be a platform convention rather than a DOM spec. Mac supports a dedicated API to normalize filenames to avoid breaking Compatibility Ideographs. https://developer.apple.com/jp/technotes/tn1150.html#UnicodeSubtleties (Sorry for Japanese, I couldn't find a corresponding English technote.)
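A small sketch of the breakage being described, using one code point from the compatibility block as an example (U+F900 is my choice for illustration; the comment does not name a specific character). Compatibility ideographs have singleton canonical decompositions, so NFC replaces them with the unified ideograph and the original code point is lost.

```javascript
// Sketch only: NFC replaces a CJK compatibility ideograph with its unified form.
const compat = '\uF900';                       // CJK COMPATIBILITY IDEOGRAPH-F900
const normalized = compat.normalize('NFC');

const before = compat.codePointAt(0).toString(16);
const after = normalized.codePointAt(0).toString(16);

console.log(before);                // f900
console.log(after);                 // 8c48 (the unified ideograph U+8C48)
console.log(normalized === compat); // false: the name no longer round-trips
```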

Boris Zbarsky [:bzbarsky]

Comment 19

•

14 years ago

OK. Can you please escalate that to the spec, then? See link in url field. I can do it if you really prefer, but having me pipe things back and forth is a little silly.... Also, does webkit not actually do full NFC normalization, then?

Masatoshi Kimura [:emk]

Comment 20

•

14 years ago

Will do, I can't log in to W3C Bugzilla right now.

> Also, does webkit not actually do full NFC normalization, then?

I have no working Mac. I'll ask Japanese Mac users.

Boris Zbarsky [:bzbarsky]

Comment 21

•

14 years ago

> I have no working Mac.

http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526#c11 says Safari and Chrome do some sort of normalization on Windows too.

Masatoshi Kimura [:emk]

Comment 22

•

14 years ago

Safari and Chrome normalized Compatibility Ideographs at least on Windows...

Boris Zbarsky [:bzbarsky]

Comment 23

•

14 years ago

So their behavior there is broken in the way comment 18 says we should avoid?

Masatoshi Kimura [:emk]

Comment 24

•

14 years ago

Yes. Now I'm not sure I can convince them to pay the cost of non-standard normalization behavior.

Hixie (not reading bugmail)

Comment 25

•

14 years ago

emk: Please add a comment to http://www.w3.org/Bugs/Public/show_bug.cgi?id=14526 describing what normalisation you think we should do. (I'm afraid I can't read Japanese so the link you cited above is not helpful to me. :-( )

Masatoshi Kimura [:emk]

Updated

•

14 years ago

Updated

•

13 years ago

Blocks: 746942

Anne (:annevk)

Comment 26

•

8 years ago

I'm pretty sure Chrome removed this, but I can't find a reference. It also seems like it hasn't been a problem in the past five years?

Ari Rantalainen

Reporter

Comment 27

•

8 years ago

I re-tested the original case with Firefox 55.0.2 (64-bit) and the behaviour is as it was when reported. Chrome has indeed moved to work the same way on OS X, but Safari is still using "C3 B6", as are all Windows browsers that I have tested. I have come to live with this over the years; I reported it because it was a sudden change in Firefox behaviour (and not intended, as stated in comment 14).

Benjamin C. Wiley Sittler

Comment 28

•

8 years ago

I believe the relevant part of that (now-ancient) Tech Note is this, quoting from the English version at https://developer.apple.com/legacy/library/technotes/tn/tn1150.html#CanonicalDecomposition :

> The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs, and are not decomposed in HFS Plus strings.

This does indeed reduce breakage (in comparison to NFC/NFD/NFKC/NFKD) for some characters used in Japanese names. You might be able to approximate its effect these days in script like this:

const betterCompose = str => str.replace(/[^\uF900-\uFAFF]+/u, s => s.normalize('NFC'));

Note however that some of the characters in that range are visually indistinguishable from the characters with which normalization unifies them, and that this presence or absence of visual distinction furthermore varies both by locale and by font.

Benjamin C. Wiley Sittler

Comment 29

•

8 years ago

Oops! Missed the g when retyping that here: const betterCompose = str => str.replace(/[^\uF900-\uFAFF]+/gu, s => s.normalize('NFC'));
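For reference, a short usage sketch of the corrected function, compared against a plain NFC pass. `betterCompose` is copied from this comment; the sample filename and the other variable names are invented for the example. As noted in comment 28, this only approximates the HFS Plus rule.

```javascript
// betterCompose as given in this comment: NFC-normalize every run of characters
// outside U+F900..U+FAFF and leave the compatibility ideographs untouched.
const betterCompose = str =>
  str.replace(/[^\uF900-\uFAFF]+/gu, s => s.normalize('NFC'));

// Hypothetical filename: decomposed "pöö", one compatibility ideograph, ".txt".
const sample = 'po\u0308o\u0308\uF900.txt';

const selective = betterCompose(sample);
const plainNfc = sample.normalize('NFC');

// The diaereses are composed, and U+F900 survives.
console.log(selective === 'p\u00F6\u00F6\uF900.txt'); // true
// Plain NFC also composes the diaereses but rewrites U+F900 to U+8C48.
console.log(plainNfc === 'p\u00F6\u00F6\u8C48.txt');  // true
```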

Marion Daly [:mdaly]

Updated

•

7 years ago

Priority: -- → P3

Nobody; OK to take it and work on it

Assignee

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML

BMO Automation

Updated

•

3 years ago

Severity: normal → S3