Closed Bug 960957 Opened 10 years ago Closed 6 years ago

Remove support for *nix systems whose char* APIs for file names don't use UTF-8

Categories

(Core :: XPCOM, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla59
Tracking Status
relnote-firefox --- 59+
firefox59 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References


Details

Attachments

(3 files)

We have a bunch of iconv-using code in https://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsNativeCharsetUtils.cpp that assumes
 1) That its initialization can read a "system encoding" from the environment.
 2) That the system APIs that use char* for file names use that encoding.
 3) That the system encoding might not be UTF-8.
 4) That the system encoding should be ISO-8859-1 if unreadable.

This code is used on non-OS X, non-Android *nix platforms, including desktop Linux. The above assumptions are probably wrong these days, and we'd be better off supporting only *nix systems whose char*-taking file name APIs use UTF-8, like OS X and Android.

(But we should probably check with the maintainers of the Solaris and *BSD ports first.)
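
For reference, a minimal sketch (not the actual Gecko code) of the kind of initialization being described: setlocale() adopts the locale from the environment, and nl_langinfo(CODESET) reports the inferred system encoding, with ISO-8859-1 as the assumed fallback from point 4.

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
  // Adopt the locale settings from the environment (LC_ALL/LC_CTYPE/LANG).
  setlocale(LC_CTYPE, "");
  const char* codeset = nl_langinfo(CODESET);
  // Point 4 above: treat an unreadable codeset as ISO-8859-1.
  if (!codeset || !*codeset) {
    codeset = "ISO-8859-1";
  }
  printf("system codeset: %s\n", codeset);
  // Nonzero exit if the char* file name APIs are presumably not UTF-8.
  return strcmp(codeset, "UTF-8") != 0;
}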
Is dropping support for non-UTF-8 file names in char* file system APIs OK for OpenBSD?
Flags: needinfo?(landry)
Is dropping support for non-UTF-8 file names in char* file system APIs OK for FreeBSD?
Flags: needinfo?(jbeich)
Is dropping support for non-UTF-8 file names in char* file system APIs OK for Solaris?
Flags: needinfo?(ginn.chen)
I had thought that Python 3 had some useful implementation advice to offer, because they went aggressively all-Unicode-internally in the 3.0 major version bump and then found that this caused, erm, difficulties with filesystem paths in legacy encodings.  Unfortunately, the solution adopted seems to have been to assume that *all* filesystem paths on POSIX-not-OSX systems are in the encoding returned by nl_langinfo(CODESET), which is wrong (see for instance the discussion at https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings).

We probably have fewer wacky legacy cases to trip over than a generic programming environment like Python or Glib does, but I do wonder if xdg-user-dirs is ever found to be in a non-Unicode encoding (e.g. legacy EUC-JP) in the wild.
I would think that the only 100% bombproof solution would be a special "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences somehow (map them into the private use area maybe?) but I don't think we should bother unless we discover we need to.
I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp seems to be a bit buggy

1. start firefox under LANG=ja_JP.eucJP locale
2. open about:support
3. save the page as |テスト|
4. |ls $HOME/Downloads| would show

  テスト (UTF-8, empty)  <----- shouldn't exist
  テスト (eucJP)
  テスト_files (eucJP)

USE_STDCONV tries to emulate CopyUTF8toUTF16 and CopyUTF16toUTF8 only to return garbage with non-UTF8 encodings.

(In reply to Henri Sivonen (:hsivonen) from comment #2)
> Is dropping support for non-UTF-8 file names in char* file system APIs OK
> for FreeBSD?

I don't think the situation is different from Linux. Many apps ignore user locale, some respect it and others try to enforce UTF-8. Since we're already linking against GLib it'd make sense to mimic its defaults.
Flags: needinfo?(jbeich)
(In reply to Jan Beich from comment #6)
> I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp
> seems to be a bit buggy
> 
> 1. start firefox under LANG=ja_JP.eucJP locale
> 2. open about:support
> 3. save the page as |テスト|
> 4. |ls $HOME/Downloads| would show
> 
>   テスト (UTF-8, empty)  <----- shouldn't exist
>   テスト (eucJP)
>   テスト_files (eucJP)
> 

I have some issues with TB. I am going to investigate TB's behavior further
when it tries to save an attachment from an external sender to the local
file system under Linux and the filename contains non-ASCII characters.
(I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)

One thing I noticed from the above post, especially the "UTF-8, empty" part:
I have been using Dropbox under Linux with the Japanese EUC locale, and found
that when I save an attachment from TB into the Dropbox directory under
Linux, the Dropbox client under Windows sometimes shows a pathname flagged
with something like "(UTF-8 illegal encoding)". The Windows client appears
to try saving the attachment first under this invalid (under Windows?)
pathname, leaving a zero-byte file behind, and then retries with a seemingly
identical name without the "(UTF-8 illegal encoding)" marker and stores the
content there. The two files have superficially the same pathname as far as
I can read on the Windows machine; the visible portions look exactly alike,
and it is not entirely clear what the actual difference between the two
pathnames is. There must be an encoding issue somewhere in the following
path:

TB under linux: saving attachment under an L10N pathname
  -> Dropbox client under linux
    -> Dropbox server
      -> Dropbox client under windows.

I have no idea where the retry under a valid file name on the Windows side
is initiated after the initial failure (maybe in the Windows Dropbox client).

I should clarify that this issue was noticed with the Windows XP client.
I checked the Windows 7 client that talks to the same Dropbox account and
see no problematic files remaining there (!).

Now that I have more clues and background information about the issue, I
will try to figure out what is causing the problem, and I will also
investigate the buggy behavior of the functions that use iconv() under
Linux further.

TIA



> USE_STDCONV tries to emulate CopyUTF8toUTF16 and CopyUTF16toUTF8 only to
> return garbage with non-UTF8 encodings.
> 
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> > Is dropping support for non-UTF-8 file names in char* file system APIs OK
> > for FreeBSD?
> 
> I don't think the situation is different from Linux. Many apps ignore user
> locale, some respect it and others try to enforce UTF-8. Since we're already
> linking against GLib it'd make sense to mimic its defaults.
I don't really understand well how charsets and filesystems interact, but we recently got somewhat native (though limited) support for UTF-8 on the system; IIRC it doesn't use iconv.
Flags: needinfo?(landry)
(In reply to Zack Weinberg (:zwol) from comment #5)
> I would think that the only 100% bombproof solution would be a special
> "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences
> somehow (map them into the private use area maybe?) but I don't think we
> should bother unless we discover we need to.

It's probably not worthwhile to do that engineering at this point, when non-UTF-8 file paths are already gone on mainstream Linux distros.

(In reply to Jan Beich from comment #6)
> I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp
> seems to be a bit buggy
> 
> 1. start firefox under LANG=ja_JP.eucJP locale

Is that considered a legitimate supported configuration for FreeBSD? (As opposed to an unsupported footgun or test case.)

(In reply to Zack Weinberg (:zwol) from comment #4)
> (see for instance the discussion at
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.
> html#file-name-encodings )
(In reply to Jan Beich from comment #6)
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> Since we're already
> linking against GLib it'd make sense to mimic its defaults.

So even if we don't go UTF-8-only, calling g_filename_to_utf8() and g_filename_from_utf8() would be better than what we have now. When that doesn't work, the user is probably already in trouble with *all* their glib/gtk apps.
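
(For concreteness, a hedged sketch of that GLib round-trip; the helper name NativePathToUtf8 is invented here, and the error handling is illustrative rather than anything Gecko ships.)

#include <glib.h>

// Hypothetical helper: convert a path in the GLib filename encoding
// (G_FILENAME_ENCODING / the locale) into UTF-8. Caller g_free()s the result.
char* NativePathToUtf8(const char* nativePath) {
  GError* error = nullptr;
  gsize bytesRead = 0;
  gsize bytesWritten = 0;
  char* utf8 =
      g_filename_to_utf8(nativePath, -1, &bytesRead, &bytesWritten, &error);
  if (!utf8) {
    // If this fails, every glib/gtk app on the system is likely choking on
    // the same path anyway; give up on the path.
    g_error_free(error);
    return nullptr;
  }
  return utf8;
}

// g_filename_from_utf8() has the mirror-image signature for the way back.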

(In reply to ISHIKAWA, Chiaki from comment #7)
> (I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)

How did the ja_JP.eucJP configuration happen? Is it still the default for some distro? Did you set it yourself? (Why?) Was the system originally installed as ja_JP.eucJP and has then been upgraded through the years without the updates forcing a migration to ja_JP.UTF-8?
(In reply to Henri Sivonen (:hsivonen) from comment #10)

To the question posed to me:
> 
> (In reply to ISHIKAWA, Chiaki from comment #7)
> > (I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)
> 
> How did the ja_JP.eucJP configuration happen? Is it still the default for
> some distro? Did you set it yourself? (Why?) Was the system originally
> installed as ja_JP.eucJP and has then been upgraded through the years
> without the updates forcing a migration to ja_JP.UTF-8?

No, I don't think ja_JP.eucJP is the default these days.
I set it myself.

Why: I chose ja_JP.eucJP (which may have gone by a slightly different name)
when I installed Red Hat Linux on a PC about a dozen years ago. I had been
using the EUC-JP encoding for close to 20 years by that time (shows my age,
huh), so there was no way I would have chosen any other setting on that PC.
I gradually updated the OS, initially from Red Hat to Fedora and then to
Debian, preserving the default locale.

On the other PC I selected ja_JP.UTF-8 from the start. This is because I had
learned over the last 10 years that more and more utilities, including
Emacs, my favorite editor, handle Japanese filenames better when I mix
heterogeneous systems if I select UTF-8 as the preferred encoding (yes,
Windows XP moved file name handling in a more unified direction, too, on the
desktop PC scene). Yes, I learned this lesson on the PC with the ja_JP.eucJP
default. Unfortunately, my document archive contains tons of EUC-encoded
documents, so I still need to use the EUC encoding on that PC. I am
gradually converting to UTF-8 for newly created documents, although
privately created tools assume the EUC encoding, so sometimes I have had to
revert to EUC for the sake of old shell scripts and such. That is life :-(

TIA
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> (In reply to Zack Weinberg (:zwol) from comment #5)
> > I would think that the only 100% bombproof solution would be a special
> > "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences
> > somehow (map them into the private use area maybe?) but I don't think we
> > should bother unless we discover we need to.
> 
> It's probably not worthwhile to do that engineering at this point, when
> non-UTF-8 file paths are already gone on mainstream Linux distros.

I don't know about that.  *New* files should be created UTF-8 throughout, but
archived files may persist with ISO-8859-n or other legacy encodings in their
names ... indefinitely.

That said, I don't know how likely it is for us to have to interact with
such files.  

> So even if we don't go UTF-8-only, calling g_filename_to_utf8() and
> g_filename_from_utf8() would be better than what we have now. When that
> doesn't work, the user is probably already in trouble with *all* their
> glib/gtk apps.

Agreed.

Also, I did some more digging on the Python thing; it turns out that their system for dealing with this problem is cleverer than it appears.  See:

http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-s-up-with-posix-systems-in-python-3
http://www.python.org/dev/peps/pep-0383/
https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

in decreasing order of clarity.
(In reply to Zack Weinberg (:zwol) from comment #12)
> I don't know about that.  *New* files should be created UTF-8 throughout, but
> archived files may persist with ISO-8859-n or other legacy encodings in their
> names ... indefinitely.

It seems like a *terrible* idea to either use file names in differing encodings on a single file system instance or to have file system APIs expose different encodings to applications depending on which one of multiple mounted file systems on a single operating system is being accessed.

For non-*nix-affiliated file systems like FAT and NTFS, Linux supports translating on-disk file names to UTF-8 for API visibility. It seems to me that the sane way to deal with archive *nix disks would be to declare their encoding as an option to |mount| and have the kernel expose the names on all mounted file systems as UTF-8. But looking at the man page for mount, it seems that options of this nature are only available to file systems that don't have a *nix origin. Am I missing something, or is the Linux world really insane enough not to have facilities to make the file system APIs talk UTF-8 and make non-UTF-8 file system paths the kernel's problem in the general case (as, AFAIK, is the case on OS X)?

> > So even if we don't go UTF-8-only, calling g_filename_to_utf8() and
> > g_filename_from_utf8() would be better than what we have now. When that
> > doesn't work, the user is probably already in trouble with *all* their
> > glib/gtk apps.
> 
> Agreed.

It has come to my attention that the problem reported about our current setup only manifests itself under |make mozmill|, so there might be less reason to poke at this legacy code than it initially appeared.

Throwing the problem from Gecko to glib seems attractive as a matter of principle, but it's unclear how much of a win the churn would be if the current code works outside |make mozmill|. (Still, it's terrible that a script can mess up the environment badly enough to break file name handling. That wouldn't change if we relied on glib, which also reads from the environment.)

> Also I did some more digging on the Python thing, it turns out that their
> system for dealing with this problem is cleverer than it appears.  See:
> 
> http://python-notes.curiousefficiency.org/en/latest/python3/
> questions_and_answers.html#what-s-up-with-posix-systems-in-python-3
> http://www.python.org/dev/peps/pep-0383/
> https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-
> utf8/2000-07/msg00040.html
> 
> in decreasing order of clarity.

So the key is: "non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF". ...and then the *encoders* need to give analogous special treatment to those lone surrogates.

It's not clear if introducing malformed UTF-16 to the mix would make things better in Gecko.
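
(For illustration, a minimal sketch, not Gecko code, of the decode/encode pair being described. A real decoder would attempt proper UTF-8 decoding first; this sketch escapes every non-ASCII byte just to make the round-trip visible, and both function names are hypothetical.)

#include <cstdint>
#include <string>

// Decode: ASCII passes through; any byte that cannot be decoded is smuggled
// into UTF-16 as a lone surrogate U+DC80..U+DCFF (the PEP 383 trick).
std::u16string SurrogateEscapeDecode(const std::string& bytes) {
  std::u16string out;
  for (unsigned char b : bytes) {
    if (b < 0x80) {
      out.push_back(static_cast<char16_t>(b));
    } else {
      out.push_back(static_cast<char16_t>(0xDC00 + b));  // U+DC80..U+DCFF
    }
  }
  return out;
}

// The paired encoder must reverse exactly this mapping, and only this
// mapping, so the original byte sequence round-trips losslessly.
std::string SurrogateEscapeEncode(const std::u16string& units) {
  std::string out;
  for (char16_t u : units) {
    if (u >= 0xDC80 && u <= 0xDCFF) {
      out.push_back(static_cast<char>(u - 0xDC00));  // restore raw byte
    } else if (u < 0x80) {
      out.push_back(static_cast<char>(u));
    }
    // Real code would UTF-8-encode all other code points; elided here.
  }
  return out;
}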
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> It has come to my attention that the problem reported about our current
> setup only manifests itself under |make mozmill|, so there might be less
> reason to poke at this legacy code than it initially appeared.

Under mozmill, the CODESET becomes ANSI_X3.4-1968 (i.e. US-ASCII), so any path with non-ASCII bytes breaks. I haven't investigated what exactly in mozmill causes this, but at least it indicates that the current code breaks if something (such as a wrapper script) manages to force CODESET to its POSIX default (which ANSI_X3.4-1968 is).
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> (In reply to Zack Weinberg (:zwol) from comment #12)
> > I don't know about that.  *New* files should be created UTF-8 throughout, but
> > archived files may persist with ISO-8859-n or other legacy encodings in their
> > names ... indefinitely.
> 
> It seems like a *terrible* idea to either use file names in differing
> encodings on a single file system instance or to have file system APIs
> expose different encodings to applications depending on which one of
> multiple mounted file systems on a single operating system is being accessed.

It is.  However, people did the former in the 1990s and early 2000s and we are stuck with the consequences of that.

> For non-*nix-affiliated file systems like FAT and NTFS, Linux supports
> translating on-disk file names to UTF-8 for API visibility. It seems to me
> that the sane way to deal with archive *nix disks would be to declare their
> encoding as an option to |mount| and have the kernel expose the names on all
> mounted file systems as UTF-8. But looking at the man page for mount, it
> seems that options of this nature are only available to file systems that
> don't have a *nix origin. Am I missing something, or is the Linux world
> really insane enough not to have facilities to make the file system APIs
> talk UTF-8 and make non-UTF-8 file system paths the kernel's problem in the
> general case (as, AFAIK, is the case on OS X)?

You are not missing anything.  However, I would characterize this as a reasoned choice to take the least bad available option given all the design constraints at the time (2001), not insanity.  (Keep in mind that in 2001 there were still large, vocal anti-Unicode constituencies, and some of them probably had reps on the POSIX committee.  Also keep in mind that Unicode is a moving target and has made backward-incompatible changes to normalization and case-folding on more than one occasion.)

The kernel perspective is that pathnames are strings of *bytes* - not characters - of which only the values 0x00 and 0x2F have kernel-assigned semantics.  (That's NUL and '/'.)  If you want to use UTF-8, ASCII, KOI8-R, and ISO-8859-7 simultaneously within the same filesystem, or even one *pathname*, or even one *component*, the kernel does not care.  Have fun.

This attitude has three significant advantages for the kernel: it allows pathname comparisons to continue to use blind string compare and therefore go fast(er than any alternative), it means the kernel does not have to take any stance on Unicode normalization, and it guarantees to not damage legacy data.  Making *sense* of legacy data is up to applications, and may be a pain in the ass, but at least it will all be there and intact.
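
(A tiny demonstration of that byte-level view, using plain POSIX calls; the bytes below are the EUC-JP encoding of テスト from comment 6, which is not valid UTF-8, and the kernel accepts them without complaint. A sketch; it creates a junk-named file in the current directory.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
  // EUC-JP bytes for テスト: valid EUC-JP, invalid UTF-8. The kernel only
  // cares that no byte is 0x00 or 0x2F.
  const char name[] = "\xA5\xC6\xA5\xB9\xA5\xC8";
  int fd = open(name, O_CREAT | O_WRONLY, 0644);
  if (fd >= 0) {
    close(fd);
    printf("created a file whose name is not valid UTF-8\n");
  } else {
    perror("open");
  }
  return 0;
}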

This attitude also has an advantage for applications, in that it means that they don't have to think about encoding unless they want to.  Two pathnames refer to the same file (modulo symlinks) if and only if they compare equal according to strcmp().  When I was hacking on Monotone, OSX's alternative choice of converting to some undocumented recension of NFKD and then comparing according to some undocumented recension of the Unicode case-folding algorithm meant that we could not guarantee that a repo created on some other operating system would check out successfully on OSX.  (And in fact people usually tripped over this because Windows picked *different*, but equally undocumented, recensions of the same.)

We only get in trouble because we internally want to do everything in UTF-16, which has always been an error; because we want to display paths to the user, which forces us to *convert* to UTF-16 rather than just pretending bytes 0x00 .. 0xFF coming from the OS are U+0000 .. U+00FF no matter how little sense that makes in context; and because we do that conversion immediately upon input from the OS rather than right before display, which also makes sense in context but means the *consequences* of failed conversion are worse than just mojibake in the URL bar.

> Throwing the problem from Gecko to glib seems attractive as a matter of
> principle, but it's unclear how much a win the churn would be if the current
> code works outside |make mozmill|. (Still, it's terrible that a script can
> mess up the environment badly enough to break file name handling. That
> wouldn't change if we relied on glib, which also reads from the environment.)

The more I think about this, the more this plan makes sense to me:

1. Use g_filename_{to,from}_utf8 when, but only when, talking to the Gtk file selection dialogs.
2. When we read paths directly from disk, assume UTF-8 regardless of locale.
   (I *think* this by itself would fix |make mozmill|.)
3. If that assumption causes us to trip over encoding errors, punt; that file
   is inaccessible from Firefox (see the validation sketch after this list).
4. If (3) causes support headaches, only then do we worry about doing something
   cleverer (such as what Python does).
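
A rough sketch of the strict-validation punt in steps 2 and 3 (the function name IsValidUtf8 is hypothetical; Gecko would presumably reuse an existing validator rather than this hand-rolled one):

#include <cstdint>
#include <string>

// Returns true iff |s| is well-formed UTF-8: no invalid lead bytes, stray
// continuation bytes, truncated sequences, overlong forms, surrogates, or
// values above U+10FFFF. A directory entry failing this check gets punted.
bool IsValidUtf8(const std::string& s) {
  size_t i = 0;
  const size_t n = s.size();
  while (i < n) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) { i += 1; continue; }
    size_t len;
    uint32_t cp;
    uint32_t min;
    if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else { return false; }                   // invalid lead byte
    if (i + len > n) { return false; }       // truncated at end of string
    for (size_t k = 1; k < len; ++k) {
      unsigned char t = static_cast<unsigned char>(s[i + k]);
      if ((t & 0xC0) != 0x80) { return false; }  // not a continuation byte
      cp = (cp << 6) | (t & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;                          // overlong form or surrogate
    }
    i += len;
  }
  return true;
}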

> So the key is: "non-decodable bytes >= 128 will be represented as lone
> surrogate codes U+DC80..U+DCFF". ...and then the *encoders* need to give
> analogous special treatment to those lone surrogates.

Well, just the special |surrogateescape| encoder that's paired with it.  And you (that is, Python) only use them when you're sure you have to.

(In reply to Henri Sivonen (:hsivonen) from comment #14)
> Under mozmill, the CODESET becomes ANSI_X3.4-1968 (i.e. US-ASCII), so any
> path with non-ASCII bytes breaks. I haven't investigated what exactly in
> mozmill causes this, but at least it indicates that the current code breaks
> if something (such a wrapper script) manages to force CODESET to its POSIX
> default (which ANSI_X3.4-1968 is).

Probably setting LC_ALL=C or equivalent.  I understand there is now a C.UTF-8 that has all of the desirable properties of the "C" locale but leaves CODESET=UTF-8.  We should be using that in the test farm instead.  (May require system configuration tweakage.)
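
(A quick way to observe the difference being described; a sketch, and whether C.UTF-8 exists at all depends on the libc/distro.)

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main() {
  setlocale(LC_ALL, "C");
  printf("C codeset: %s\n", nl_langinfo(CODESET));  // ANSI_X3.4-1968 on glibc
  if (setlocale(LC_ALL, "C.UTF-8")) {
    printf("C.UTF-8 codeset: %s\n", nl_langinfo(CODESET));  // UTF-8
  } else {
    printf("C.UTF-8 locale not available on this system\n");
  }
  return 0;
}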
Hi, I am uploading the content of an e-mail that I sent to Henri Sivonen
out of band. The e-mail was triggered by an exchange on the dev-platform
mailing list, but as you can see, it has become lengthy due to quotes from
the |make mozmill| log, etc., so I sent it privately to Henri for his
review of the strange behavior.

But now, having consulted Henri, I think it is best to expose the content
for wider dissemination, so I am uploading it here.

Since I wrote this e-mail, I found that a particular installation of 32-bit
Debian GNU/Linux (and presumably other 32-bit Debian installations too) does
not exhibit this error even though nl_langinfo() returns the "ANSI..."
string. There must be some environmental setup glitch, but it is hard to
figure out where. Using LANG=C.UTF-8, etc. may solve the issue (or may not).

Coupled with the issue of the Date string in the Japanese locale, which
caused |make mozmill| to produce non-RFC-compliant test messages (the Date:
header line contained dates that used Japanese characters, which are not
RFC-compliant), I think |make mozmill| ought to use a more sanitized set of
environment variables. (Should I file a bug on this?)

TIA
(In reply to ISHIKAWA, Chiaki from comment #16)
> Created attachment 8363759 [details]
> |make mozmill| under Debian GNU/Linux 64-bit somehow force
> nl_langinfo(CODESET) to return "ANSI_X3.4-1968" ?
> 
> Hi, I am uploading the content of an e-mail that I sent to Henri Sivonen
> out of band. The e-mail was triggered by an exchange on the dev-platform
> mailing list, but as you can see, it has become lengthy due to quotes from
> the |make mozmill| log, etc., so I sent it privately to Henri for his
> review of the strange behavior.

LC_ALL=C overrides all other locale information in the environment (LANG, LANGUAGE, LC_*).  Please do try instead LC_ALL=C.UTF-8.
(In reply to Zack Weinberg (:zwol) from comment #15)
> We only get in trouble because we internally want to do everything in
> UTF-16, which has always been an error;

Please pardon my ignorance (as a non-Linux, non-glibc user): how does UTF-16 get into the mix here?
(Just curious)
(In reply to Martin Husemann from comment #18)
> (In reply to Zack Weinberg (:zwol) from comment #15)
> > We only get in trouble because we internally want to do everything in
> > UTF-16, which has always been an error;
> 
> Please pardon my ignorance (as a non-Linux, non-glibc user): how does UTF-16
> get into the mix here?
> (Just curious)

JavaScript strings are, unfortunately, *specified* as encoded in UTF-16 (that is, a conformant JS program can observe this).  For consistency's sake, therefore, all of Gecko's C++ code uses UTF-16 strings as well, and (here's where this bug comes into it) pathnames retrieved from the OS, e.g. by readdir(), are immediately forced into that encoding.

By "has always been an error" I mean that the people responsible for this design decision, 'way back in the nineties, *should have known at the time* that UTF-16 was the wrong encoding to choose for internal operations, and even more so, that it should not have been visible to JavaScript.  This is, now, a problem for us on all our platforms, not just the GNUish ones.
Hi,

The wise suggestions so far seem to be:

- Make sure that |make mozmill| and other test suite scripts set the
  environment variables controlling the locale's language and charset
  selection to sane values.

- It was suggested that C.UTF-8 be chosen for testing.

Now, as it turns out, the story is not straightforward.


Problem-1: Mozilla code does not accept the single letter "C" as a valid language code.

When I set C.UTF-8 under the 64-bit version of Debian GNU/Linux (where
iconv() failed), I got error output due to the check in the following part
of the source code:

http://mxr.mozilla.org/comm-central/source/mozilla/intl/locale/src/unix/nsPosixLocale.cpp#126

     NS_ASSERTION((len == 2) || (len == 3), "language code too short");


So I needed to loosen the check to allow "C" as a language code. Actually,
"C" is downcased by the time execution reaches here, so I needed to check
for (len == 1) AND 'C' or 'c', as in the attached patch.
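
(A hedged reconstruction of what the loosened check might look like; the variable name lang_code is assumed for illustration, and the actual attached patch may differ in detail.)

     NS_ASSERTION((len == 2) || (len == 3) ||
                  (len == 1 && (lang_code[0] == 'C' || lang_code[0] == 'c')),
                  "language code too short");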

After changing the code with the attached patch,
|make mozmill| no longer complained about iconv failure (!)

Compared with the old report, the charset becomes "UTF-8" with the patch
(and with my setting LC_CTYPE=C.UTF-8, etc. in the script BEFORE calling
|make mozmill|).

An excerpt from an execution under 32-bit Debian GNU/Linux after the change
(64-bit Debian GNU/Linux showed basically the same information):

DEBUG: nsNativeCharsetConverter: native_charset = <<UTF-8>>
DEBUG: xp_iconv_open: VALID res=0x8c3fa00,to_name=<<UTF-16LE>>, from_name=<<UTF-8>>
DEBUG: xp_iconv_open: VALID res=0x8c47a60,to_name=<<UTF-8>>, from_name=<<UTF-16LE>>
DEBUG: nsNativeCharsetConverter: at the end of LayzInit call path
DEBUG: nsNativeCharsetConverter::gNativeToUnicode = 0x8c3fa00
DEBUG: nsNativeCharsetConverter::gUnicodeToNative = 0x8c47a60

I ran |make mozmill| and the xpcshell tests, and both ran without any ill
effect from the change (under 64-bit Debian GNU/Linux).

I was glad this patch helps solve the issue.

Problem-2:

*BUT* when I checked this under 32-bit Debian GNU/Linux (under which
iconv() did not fail even though native_charset was "ANSI...", which was a
mystery), TB worked as expected. *BUT* I noticed this message in the log:

Fontconfig warning: ignoring C.UTF-8: not a valid language tag

Huh?
This was seen even before |make mozmill| was invoked. (I think something in
the following command sequence, or a program triggered from it, which sets
up a dedicated window system for testing TB, complained about the C.UTF-8
setting.)

# Use a separate window as Xserver screen.
# Size 1024x768 -> 1280x768
Xephyr -ac -br -noreset -screen 1280x768 :1 &
DISPLAY=localhost:1.0
sleep 2
oclock &
xfwm4 &

(I am dedicating this screen to testing so that normal desktop usage is not
disrupted.)

During the testing of TB, this message was also seen, so fontconfig may not
like the C.UTF-8 setting.

One down and one more to go?

I think the patch should go in the source tree anyway.
Should I file a separate bug?

TIA
Hi, I found that the second issue I mentioned, "Fontconfig complaining about
C.UTF-8", is a known Debian issue that was supposedly fixed in the new
version released in November last year:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=721275

(My 64-bit Linux installation is for development and I often update it to get the latest development tools, whereas the 32-bit version is used for desktop work and I don't update its packages often, for stability reasons.)

After I upgraded the packages on 32-bit Debian GNU/Linux (specifically, fontconfig and libfontconfig1-dev), the updated version fixed the second issue. So the use of C.UTF-8 with TB (and presumably FF) is now possible, together with the proposed patch for the Mozilla code per se.

TIA
I think the situation is the same for Linux and Solaris.
Flags: needinfo?(ginn.chen)
For the problem found in comment 20,
I filed
Bug 981463 - Allow one letter "C" as in C.UTF-8 as a valid language code
as a separate entry.

TIA
See Also: → 981463
See Also: → 1228755
Blocks: 1342659
I tested that, with an EUC-JP locale, if the profile path and download directory have non-ASCII EUC-JP byte sequences in them, Firefox is already very broken: can't download, can't save page as, can't save history, can't save bookmarks.

From the relative lack of bug reports, it seems safe to conclude that Firefox users who might still have non-UTF-8 locale settings don't have non-ASCII home directory paths.
Attachment #8934491 - Flags: review?(m_kato)
Attachment #8934491 - Flags: review?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> I tested that, with an EUC-JP locale, if the profile path and download
> directory have non-ASCII EUC-JP byte sequences in them, Firefox is already
> very broken: can't download, can't save page as, can't save history, can't
> save bookmarks.
> 
> From the relative lack of bug reports, it seems safe to conclude that
> Firefox users who might still have non-UTF-8 locale settings don't have
> non-ASCII home directory paths.

My experience: I have used EUC-JP for a long time, since the early 1980s, and one of the PCs that runs Debian has used EUC-JP for a long time.

*HOWEVER*, to this day I have never tried to create a directory name that contains Japanese characters, because of issues like this one, which I have experienced over the years.

So I suspect programming types don't use non-ASCII directory names when they use EUC-JP. The problem happens with unsuspecting clerical types who are asked to use such a system and create directory names in Japanese. People who did not get bitten in the 1980s and early 1990s use Japanese directory names freely, and my heart skips a beat or two when I see such a desktop or folder listing.

Because some Japanese people tend to insert a space on both sides of an English word in a directory name, many people had directory names containing whitespace, and those have caused problems before; but that happens even with an ASCII-only pathname with whitespace in it.

These natural-language-related issues are hard to explain. Some people never experience them. Some, like me, who used computers before the software matured, got hit with such problems all the time.

TIA for people's attention.
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211040

::: xpcom/io/nsNativeCharsetUtils.h:43
(Diff revision 2)
>   */
> -#if defined(XP_UNIX) && !defined(XP_MACOSX) && !defined(ANDROID)
> -bool NS_IsNativeUTF8();
> -#else
>  inline bool
>  NS_IsNativeUTF8()

Let's make this a constexpr function while we are here.
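
(Roughly what that suggestion amounts to; a sketch, not the landed diff. With the iconv path gone, the answer is a compile-time constant on all non-Windows platforms.)

inline constexpr bool
NS_IsNativeUTF8()
{
  return true;
}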
Attachment #8934491 - Flags: review?(VYV03354) → review+
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211040

> Let's make this a constexpr function while we are here.

Added constexpr. Thanks.
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211454

I guess that we can remove HAVE_ICONV, HAVE_ICONV_WITH_CONST_INPUT, HAVE_MBRTOWC and HAVE_WCRTOMB from configure.in.  But I will file a bug for it.
Attachment #8934491 - Flags: review?(m_kato) → review+
Release Note Request (optional, but appreciated)
[Why is this notable]:
Firefox for Linux (and *NIX) will no longer support non-UTF-8 locales such as ja_JP.eucJP.

[Affects Firefox for Android]:
No.

[Suggested wording]:
Firefox for Linux will no longer support non-UTF-8 locales such as ja_JP.eucJP.

[Links (documentation, blog post, etc)]:
Nothing.
relnote-firefox: --- → ?
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/354794d16661
Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms. r=emk,m_kato
Blocks: 1423846
(In reply to Makoto Kato [:m_kato] (slow due to PTO?) from comment #32)

Thanks for the r+.

> Release Note Request (optional, but appreciated)
...
> [Suggested wording]:
> Firefox for Linux will no longer support non-UTF-8 locales such as
> ja_JP.eucJP.

I think this wording is problematic for two reasons:
 1) It suggests that we substantially dropped support at this point. However, the support was already very, very broken. We don't even know when in the past exactly we dropped support for practical purposes.
 2) It suggests that Firefox won't work with a non-UTF-8 locale at all even though it'll still work when the profile path is all-ASCII (the common case) and will work even for downloads if the download folder path and the name of the downloaded file are all-ASCII.

I suggest this wording instead:
"The vestiges of support for non-UTF-8 file paths were removed from Firefox for Linux."
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
(In reply to Henri Sivonen (:hsivonen) from comment #34)
> I suggest this wording instead:
> "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> for Linux."

I think this doesn't convey 1) very clearly.
(In reply to Mike Hommey [:glandium] from comment #35)
> (In reply to Henri Sivonen (:hsivonen) from comment #34)
> > I suggest this wording instead:
> > "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> > for Linux."
> 
> I think this doesn't convey 1) very clearly.

How about:
"Previously on Linux under non-UTF-8 glibc locale settings file paths were in some contexts treated as UTF-8 and in other contexts according to the locale's encoding leading to erroneous behavior. Now file paths are always treated as UTF-8."
https://hg.mozilla.org/mozilla-central/rev/354794d16661
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
(In reply to Henri Sivonen (:hsivonen) from comment #36)
> (In reply to Mike Hommey [:glandium] from comment #35)
> > (In reply to Henri Sivonen (:hsivonen) from comment #34)
> > > I suggest this wording instead:
> > > "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> > > for Linux."
> > 
> > I think this doesn't convey 1) very clearly.
> 
> How about:
> "Previously on Linux under non-UTF-8 glibc locale settings file paths were
> in some contexts treated as UTF-8 and in other contexts according to the
> locale's encoding leading to erroneous behavior. Now file paths are always
> treated as UTF-8."

That's maybe too much detail.

How about:
"Previously, paths were inconsistently treated as UTF-8 or not, leading to erroneous behavior. They are now always treated as UTF-8."
?
(In reply to Mike Hommey [:glandium] from comment #38)
> How about:
> "Previously, paths were inconsistently treated as UTF-8 or not, leading to
> erroneous behavior. They are now always treated as UTF-8."
> ?

Looks good to me provided that we add "on Linux":
"Previously on Linux, paths were inconsistently treated as UTF-8 or not, leading to erroneous behavior. They are now always treated as UTF-8."
But it is not just Linux; the other affected POSIX systems are unrelated to it and not recognized as "Linux".
(In reply to Martin Husemann from comment #40)
> But it is not just Linux, the other affected Posix systems are unrelated and
> not recognized as "Linux".

Right, but Mozilla publishes release notes only for Mac, Windows, Linux and Android, and in that scope, this change applies to Linux.
If you would like to write a blog post about it (on a Mozilla-hosted blog), it would be great to link to. Or anything on MDN or SUMO that explains it further.

For now I'm using this for 59.0b1 release notes:

Paths now are always treated as UTF-8 on Linux
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #42)
> If you would like to write a blog post about it (on a mozilla hosted blog)
> it would be great to link to. Or, anything on MDN or SUMO that explains
> further. 

A blog post would give this more attention than it deserves at this point. I'll see if I can write up something on SUMO.

> For now I'm using this for 59.0b1 release notes:
> 
> Paths now are always treated as UTF-8 on Linux

Thanks.