Closed Bug 703161 Opened 13 years ago Closed 9 years ago

Filesystem encoding should be decomposed UTF8 on Mac OS X

Categories

(Core :: XPCOM, defect)

All
macOS
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: m_kato, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: jp-critical, regression)

Attachments

(2 files, 2 obsolete files)

Mac OS X filesystem uses UTF-8, but it isn't generic UTF-8. It is normalized like NFD (but not equal to it). So, iconv uses UTF-8-MAC for this. It was reported at past years, then we were fixed this issue by bug 227547. But, when I discuss with community yesterday, it occurs on Gecko 4/5 or later. So I have to investigate it.
By bug 571193, CopyUTF8toUTF16NFC is removed. This may be regression from bug 571193.
Blocks: 571193
Keywords: regression
Note that HFS+ does not use true NFD but a non-standard variant to avoid breaking CJK Compatibility Ideographs.
Keywords: jp-critical
Attached patch fix v1 (obsolete) — Splinter Review
Attachment #576696 - Flags: review?(smichaud)
Why this patch uses CFStringNormalize / kCFStringNormalizationFormC instead of ConvertFromUnicodeToText / kUnicodeHFSPlusCompVariant? The former will not preserve CJK Compatibility Ideographs.
(In reply to Masatoshi Kimura [:emk] from comment #4) > Why this patch uses CFStringNormalize / kCFStringNormalizationFormC instead > of ConvertFromUnicodeToText / kUnicodeHFSPlusCompVariant? The former will > not preserve CJK Compatibility Ideographs. Becasue back to original way of nsLocalFileOSX.mm. This is regression of bug 571193.
(In reply to Makoto Kato from comment #5) > (In reply to Masatoshi Kimura [:emk] from comment #4) > > Why this patch uses CFStringNormalize / kCFStringNormalizationFormC instead > > of ConvertFromUnicodeToText / kUnicodeHFSPlusCompVariant? The former will > > not preserve CJK Compatibility Ideographs. > > Becasue back to original way of nsLocalFileOSX.mm. This is regression of > bug 571193. It means the original code had a bug. We don't have to bring back the bug along.
(In reply to Masatoshi Kimura [:emk] from comment #6) > (In reply to Makoto Kato from comment #5) > > (In reply to Masatoshi Kimura [:emk] from comment #4) > > > Why this patch uses CFStringNormalize / kCFStringNormalizationFormC instead > > > of ConvertFromUnicodeToText / kUnicodeHFSPlusCompVariant? The former will > > > not preserve CJK Compatibility Ideographs. > > > > Becasue back to original way of nsLocalFileOSX.mm. This is regression of > > bug 571193. > > It means the original code had a bug. We don't have to bring back the bug > along. Do you have test case?
Feel free to fold into your patch. BTW, test file name based on the bug number is no longer recommended.
And this test fails with the current patch. https://tbpl.mozilla.org/?tree=Try&rev=690753b1af78
Comment on attachment 576696 [details] [diff] [review] fix v1 > nsLocalFile::GetNativeLeafName(nsACString &aLeafName) > { > nsACString::const_iterator begin, end; > LocateNativeLeafName(begin, end); >+#ifdef MOZ_WIDGET_COCOA >+ CocoaFileUtils::CopyUTF8toUTF8NFC(Substring(begin, end), aLeafName); >+#else > NS_IMETHODIMP > nsLocalFile::GetNativePath(nsACString &_retval) > { >+#ifdef MOZ_WIDGET_COCOA >+ CocoaFileUtils::CopyUTF8toUTF8NFC(mPath, _retval); >+#else Why native names are composed? Decomposed names are "native" on Mac OS X. Callers are responsible for differentiating nativeLeafName and leafName (like on Windows).
Comment on attachment 576696 [details] [diff] [review] fix v1 >+void CopyUTF8toUTF8NFC(const nsACString& aSrc, nsACString& aResult) >+{ >+ NS_OBJC_BEGIN_TRY_ABORT_BLOCK; >+ >+ char buf[PATH_MAX]; >+ const nsAFlatCString &inFlatSrc = PromiseFlatCString(aSrc); >+ >+ // Converting to NFC, length won't become bigger >+ CFMutableStringRef inStr = ::CFStringCreateMutable(NULL, inFlatSrc.Length()); Is |aSrc| guaranteed to be normalized by the file system? If not, this is wrong because Composition Exclusions will become longer.
(In reply to Masatoshi Kimura [:emk] from comment #10) > Why native names are composed? Decomposed names are "native" on Mac OS X. > Callers are responsible for differentiating nativeLeafName and leafName > (like on Windows). But HFS+ uses UTF-16, not UTF-8. Moreover, "native" names are Unicode on NTFS. Those property names are distorted...
Making the test more strict.
Attachment #576894 - Attachment is obsolete: true
This test failed without the patch. That is, file names were not normalized by the file system. https://tbpl.mozilla.org/?tree=Try&rev=e94074040bf9 Please forget about comment #10. nsILocalFile's "native" concept is fundamentally broken...
(In reply to Masatoshi Kimura [:emk] from comment #14) > Please forget about comment #10. nsILocalFile's "native" concept is > fundamentally broken... Actually, the original code (which is introduced in bug 227547 and removed in bug) didn't compose filename for "Native" properties. So the current patch does not "back to original way of nsLocalFileOSX.mm". I changed my mind again. We should not compose file names for "native" properties.
> (which is introduced in bug 227547 and removed in bug) (which is introduced in bug 227547 and removed in bug 571193)
To wrap up, the current patch have the following problems: - It breaks CJK Compatibility Ideographs. - It makes a wrong assumption that NFC'ed names will never grow. - It normilizes "native" names which have never been normalized. It also breaks nsIFile contract. https://hg.mozilla.org/mozilla-central/file/6f079f13c06a/xpcom/io/nsIFile.idl#l112 > * For the |nativeLeafName| method, the nativeLeafName must > * be in the native filesystem charset. Although HFS+ internal charset is UTF-16, system API takes UTF-8 strings.
Makoto Kato, are you going to revise your patch in response to Masatoshi Kimura's comments?
Decomposition algorithm of HFS Plus is not idential to NFD.
Summary: Filesystem encoding should be NFD based UTF8 on Mac OS X → Filesystem encoding should be decomposed UTF8 on Mac OS X
Attachment #576696 - Attachment is obsolete: true
Attachment #576696 - Flags: review?(smichaud)
Part 2 test case is not good. This character map area isn't excluded in HPS+. Since filename into HPS+ is customized NFD, it can access multiple filenames to the file. Also, part 1 is good test case. U+FAxx should not be NFD/NFC for HPS+. (I should add a test for U+2000-U+2FFF, too) TEC with kUnicodeHFSPlusCompVariant doesn't convert to correct target name. So we should consider excluded mapping for HPS+ when using NKC.
(In reply to Makoto Kato from comment #20) > Part 2 test case is not good. This character map area isn't excluded in > HPS+. Test case 2 is not a HFS+ specific problam but a generic problem of Unicode normalization (you can't assume that a composed file name will never be longer than the original). > Since filename into HPS+ is customized NFD, it can access multiple > filenames to the file. We should follow the Mac convention for "native" names (use decomposed filenames). > (I should add a test for U+2000-U+2FFF, too) U+212B ANGSTROM SIGN would be a good candidate.
> TEC with kUnicodeHFSPlusCompVariant doesn't convert to correct target name. What is the "correct target name"? Could you provide a test case?
(In reply to Masatoshi Kimura [:emk] from comment #22) > > TEC with kUnicodeHFSPlusCompVariant doesn't convert to correct target name. > What is the "correct target name"? Could you provide a test case? ex. U+2190 U+0338. kUnicodeHFSPlusCompVariant converts it to U+219A. You know, this behavior is invalid.
Also, TEC isn't thread safe. If using it, we cannot cache TECObjectRef (or need global lock).
(In reply to Makoto Kato from comment #23) > (In reply to Masatoshi Kimura [:emk] from comment #22) > > > TEC with kUnicodeHFSPlusCompVariant doesn't convert to correct target name. > > What is the "correct target name"? Could you provide a test case? > ex. U+2190 U+0338. kUnicodeHFSPlusCompVariant converts it to U+219A. You > know, this behavior is invalid. Is it really invalid? I thought HFS+ only excluded singletons from decomposition because singleton decomposition is irreversible.
(In reply to Masatoshi Kimura [:emk] from comment #25) > (In reply to Makoto Kato from comment #23) > > (In reply to Masatoshi Kimura [:emk] from comment #22) > > > > TEC with kUnicodeHFSPlusCompVariant doesn't convert to correct target name. > > > What is the "correct target name"? Could you provide a test case? > > ex. U+2190 U+0338. kUnicodeHFSPlusCompVariant converts it to U+219A. You > > know, this behavior is invalid. > Is it really invalid? I thought HFS+ only excluded singletons from > decomposition because singleton decomposition is irreversible. Hm, Makoto-san is right. HFS+ didn't decompose U+219A, indeed.
Thanks Makoto for linking my bug to this one. I must have overlooked it. Reading the thread of comments, I cannot deduce whether this bug is almost solved. Is there already an estimated time of resolution for this issue? This issue blocks certain uses of our application in a production environment at a client. Thanks.
These is test binary for this. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/m_kato@ga2.so-net.ne.jp-563ab823e3d0/try-macosx64/ Although I am investigating for performance for new fix, I think that I want to fix on version 15.
Assignee: m_kato → nobody
WFM Version 50.0a1 Build ID 20160620030215 User Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: