Closed Bug 960957 Opened 10 years ago Closed 6 years ago

Remove support for *nix systems whose char* APIs for file names don't use UTF-8

Categories

(Core :: XPCOM, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla59
Tracking Status
relnote-firefox --- 59+
firefox59 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References


Details

Attachments

(3 files)

We have a bunch of iconv-using code in https://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsNativeCharsetUtils.cpp that assumes
 1) That its initialization can read a "system encoding" from the environment.
 2) That the system APIs that use char* for file names use that encoding.
 3) That the system encoding might not be UTF-8.
 4) That the system encoding should be ISO-8859-1 if unreadable.

This code is used on non-OS X, non-Android *nix platforms, including desktop Linux. The above assumptions are probably wrong these days, and we'd be better off supporting only *nix systems whose char*-taking file name APIs use UTF-8, like OS X and Android.

(But we should probably check with the maintainers of the Solaris and *BSD ports first.)
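
For reference, a minimal sketch (not the actual Gecko code) of the kind of initialization being described: setlocale() adopts the locale from the environment, and nl_langinfo(CODESET) reports the inferred system encoding, with ISO-8859-1 as the assumed fallback from point 4.

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
  // Adopt the locale settings from the environment (LC_ALL/LC_CTYPE/LANG).
  setlocale(LC_CTYPE, "");
  const char* codeset = nl_langinfo(CODESET);
  // Point 4 above: treat an unreadable codeset as ISO-8859-1.
  if (!codeset || !*codeset) {
    codeset = "ISO-8859-1";
  }
  printf("system codeset: %s\n", codeset);
  // Nonzero exit if the char* file name APIs are presumably not UTF-8.
  return strcmp(codeset, "UTF-8") != 0;
}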
Is dropping support for non-UTF-8 file names in char* file system APIs OK for OpenBSD?
Flags: needinfo?(landry)
Is dropping support for non-UTF-8 file names in char* file system APIs OK for FreeBSD?
Flags: needinfo?(jbeich)
Is dropping support for non-UTF-8 file names in char* file system APIs OK for Solaris?
Flags: needinfo?(ginn.chen)
I had thought that Python 3 had some useful implementation advice to offer, because they went aggressively all-Unicode-internally in the 3.0 major version bump and then found that this caused, erm, difficulties with filesystem paths in legacy encodings.  Unfortunately, the solution adopted seems to have been to assume that *all* filesystem paths on POSIX-not-OSX systems are in the encoding returned by nl_langinfo(CODESET), which is wrong (see for instance the discussion at https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings).

We probably have fewer wacky legacy cases to trip over than a generic programming environment like Python or Glib does, but I do wonder if xdg-user-dirs is ever found to be in a non-Unicode encoding (e.g. legacy EUC-JP) in the wild.
I would think that the only 100% bombproof solution would be a special "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences somehow (map them into the private use area maybe?) but I don't think we should bother unless we discover we need to.
I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp seems to be a bit buggy

1. start firefox under LANG=ja_JP.eucJP locale
2. open about:support
3. save the page as |テスト|
4. |ls $HOME/Downloads| would show

  テスト (UTF-8, empty)  <----- shouldn't exist
  テスト (eucJP)
  テスト_files (eucJP)

USE_STDCONV tries to emulate CopyUTF8toUTF16 and CopyUTF16toUTF8 only to return garbage with non-UTF8 encodings.

(In reply to Henri Sivonen (:hsivonen) from comment #2)
> Is dropping support for non-UTF-8 file names in char* file system APIs OK
> for FreeBSD?

I don't think the situation is different from Linux. Many apps ignore user locale, some respect it and others try to enforce UTF-8. Since we're already linking against GLib it'd make sense to mimic its defaults.
Flags: needinfo?(jbeich)
(In reply to Jan Beich from comment #6)
> I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp
> seems to be a bit buggy
> 
> 1. start firefox under LANG=ja_JP.eucJP locale
> 2. open about:support
> 3. save the page as |テスト|
> 4. |ls $HOME/Downloads| would show
> 
>   テスト (UTF-8, empty)  <----- shouldn't exist
>   テスト (eucJP)
>   テスト_files (eucJP)
> 

I have some issues with TB. I am going to investigate TB's behavior further
when it tries to save an attachment from an external sender to the local
file system under Linux and the filename contains non-ASCII characters.
(I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)

One thing I noticed from the above post, especially the "UTF-8, empty" part:
I have been using Dropbox under Linux with the Japanese EUC locale, and found
that when I save an attachment from TB into the Dropbox directory under
Linux, the Dropbox client under Windows sometimes shows a pathname flagged
with something like "(UTF-8 illegal encoding)". The Windows client appears
to try saving the attachment first under this invalid (under Windows?)
pathname, leaving a zero-byte file behind, and then retries with a seemingly
identical name without the "(UTF-8 illegal encoding)" marker and stores the
content there. The two files have superficially the same pathname as far as
I can read on the Windows machine; the visible portions look exactly alike,
and it is not entirely clear what the actual difference between the two
pathnames is. There must be an encoding issue somewhere in the following
path:

TB under linux: saving attachment under an L10N pathname
  -> Dropbox client under linux
    -> Dropbox server
      -> Dropbox client under windows.

I have no idea where the retry under a valid file name on the Windows side
is initiated after the initial failure (maybe in the Windows Dropbox client).

I should clarify that this issue was noticed with the Windows XP client.
I checked the Windows 7 client that talks to the same Dropbox account and
see no problematic files remaining there (!).

Now that I have more clues and background information about the issue, I
will try to figure out what is causing the problem, and I will also
investigate the buggy behavior of the functions that use iconv() under
Linux further.

TIA



> USE_STDCONV tries to emulate CopyUTF8toUTF16 and CopyUTF16toUTF8 only to
> return garbage with non-UTF8 encodings.
> 
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> > Is dropping support for non-UTF-8 file names in char* file system APIs OK
> > for FreeBSD?
> 
> I don't think the situation is different from Linux. Many apps ignore user
> locale, some respect it and others try to enforce UTF-8. Since we're already
> linking against GLib it'd make sense to mimic its defaults.
I don't really understand well how charsets and filesystems interact, but we recently got somewhat native (though limited) support for UTF-8 on the system; IIRC it doesn't use iconv.
Flags: needinfo?(landry)
(In reply to Zack Weinberg (:zwol) from comment #5)
> I would think that the only 100% bombproof solution would be a special
> "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences
> somehow (map them into the private use area maybe?) but I don't think we
> should bother unless we discover we need to.

It's probably not worthwhile to do that engineering at this point, when non-UTF-8 file paths are already gone on mainstream Linux distros.

(In reply to Jan Beich from comment #6)
> I did a quick test on FreeBSD 10. USE_ICONV in nsNativeCharsetUtils.cpp
> seems to be a bit buggy
> 
> 1. start firefox under LANG=ja_JP.eucJP locale

Is that considered a legitimate supported configuration for FreeBSD? (As opposed to an unsupported footgun or test case.)

(In reply to Zack Weinberg (:zwol) from comment #4)
> (see for instance the discussion at
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.
> html#file-name-encodings )
(In reply to Jan Beich from comment #6)
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> Since we're already
> linking against GLib it'd make sense to mimic its defaults.

So even if we don't go UTF-8-only, calling g_filename_to_utf8() and g_filename_from_utf8() would be better than what we have now. When that doesn't work, the user is probably already in trouble with *all* their glib/gtk apps.
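
(For concreteness, a hedged sketch of that GLib round-trip; the helper name NativePathToUtf8 is invented here, and the error handling is illustrative rather than anything Gecko ships.)

#include <glib.h>

// Hypothetical helper: convert a path in the GLib filename encoding
// (G_FILENAME_ENCODING / the locale) into UTF-8. Caller g_free()s the result.
char* NativePathToUtf8(const char* nativePath) {
  GError* error = nullptr;
  gsize bytesRead = 0;
  gsize bytesWritten = 0;
  char* utf8 =
      g_filename_to_utf8(nativePath, -1, &bytesRead, &bytesWritten, &error);
  if (!utf8) {
    // If this fails, every glib/gtk app on the system is likely choking on
    // the same path anyway; give up on the path.
    g_error_free(error);
    return nullptr;
  }
  return utf8;
}

// g_filename_from_utf8() has the mirror-image signature for the way back.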

(In reply to ISHIKAWA, Chiaki from comment #7)
> (I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)

How did the ja_JP.eucJP configuration happen? Is it still the default for some distro? Did you set it yourself? (Why?) Was the system originally installed as ja_JP.eucJP and has then been upgraded through the years without the updates forcing a migration to ja_JP.UTF-8?
(In reply to Henri Sivonen (:hsivonen) from comment #10)

To the question posed to me:
> 
> (In reply to ISHIKAWA, Chiaki from comment #7)
> > (I am using ja_JP.UTF-8 on one PC, and ja_JP.eucJP on the other.)
> 
> How did the ja_JP.eucJP configuration happen? Is it still the default for
> some distro? Did you set it yourself? (Why?) Was the system originally
> installed as ja_JP.eucJP and has then been upgraded through the years
> without the updates forcing a migration to ja_JP.UTF-8?

No, I don't think ja_JP.eucJP is the default these days.
I set it myself.

Why: I chose ja_JP.eucJP (which may have gone by a slightly different name)
when I installed Red Hat Linux on a PC about a dozen years ago. I had been
using the EUC-JP encoding for close to 20 years by that time (shows my age,
huh), so there was no way I would have chosen any other setting on that PC.
I gradually updated the OS, initially from Red Hat to Fedora and then to
Debian, preserving the default locale.

On the other PC I selected ja_JP.UTF-8 from the start. This is because I had
learned over the last 10 years that more and more utilities, including
Emacs, my favorite editor, handle Japanese filenames better when I mix
heterogeneous systems if I select UTF-8 as the preferred encoding (yes,
Windows XP moved file name handling in a more unified direction, too, on the
desktop PC scene). Yes, I learned this lesson on the PC with the ja_JP.eucJP
default. Unfortunately, my document archive contains tons of EUC-encoded
documents, so I still need to use the EUC encoding on that PC. I am
gradually converting to UTF-8 for newly created documents, although
privately created tools assume the EUC encoding, so sometimes I have had to
revert to EUC for the sake of old shell scripts and such. That is life :-(

TIA
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> (In reply to Zack Weinberg (:zwol) from comment #5)
> > I would think that the only 100% bombproof solution would be a special
> > "filesystem UTF-8" encoding that arranges to preserve invalid byte sequences
> > somehow (map them into the private use area maybe?) but I don't think we
> > should bother unless we discover we need to.
> 
> It's probably not worthwhile to do that engineering at this point, when
> non-UTF-8 file paths are already gone on mainstream Linux distros.

I don't know about that.  *New* files should be created UTF-8 throughout, but
archived files may persist with ISO-8859-n or other legacy encodings in their
names ... indefinitely.

That said, I don't know how likely it is for us to have to interact with
such files.  

> So even if we don't go UTF-8-only, calling g_filename_to_utf8() and
> g_filename_from_utf8() would be better than what we have now. When that
> doesn't work, the user is probably already in trouble with *all* their
> glib/gtk apps.

Agreed.

Also, I did some more digging on the Python thing; it turns out that their system for dealing with this problem is cleverer than it appears.  See:

http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-s-up-with-posix-systems-in-python-3
http://www.python.org/dev/peps/pep-0383/
https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

in decreasing order of clarity.
(In reply to Zack Weinberg (:zwol) from comment #12)
> I don't know about that.  *New* files should be created UTF-8 throughout, but
> archived files may persist with ISO-8859-n or other legacy encodings in their
> names ... indefinitely.

It seems like a *terrible* idea to either use file names in differing encodings on a single file system instance or to have file system APIs expose different encodings to applications depending on which one of multiple mounted file systems on a single operating system is being accessed.

For non-*nix-affiliated file systems like FAT and NTFS, Linux supports translating on-disk file names to UTF-8 for API visibility. It seems to me that the sane way to deal with archive *nix disks would be to declare their encoding as an option to |mount| and have the kernel expose the names on all mounted file systems as UTF-8. But looking at the man page for mount, it seems that options of this nature are only available to file systems that don't have a *nix origin. Am I missing something, or is the Linux world really insane enough not to have facilities to make the file system APIs talk UTF-8 and make non-UTF-8 file system paths the kernel's problem in the general case (as, AFAIK, is the case on OS X)?

> > So even if we don't go UTF-8-only, calling g_filename_to_utf8() and
> > g_filename_from_utf8() would be better than what we have now. When that
> > doesn't work, the user is probably already in trouble with *all* their
> > glib/gtk apps.
> 
> Agreed.

It has come to my attention that the problem reported about our current setup only manifests itself under |make mozmill|, so there might be less reason to poke at this legacy code than it initially appeared.

Throwing the problem from Gecko to glib seems attractive as a matter of principle, but it's unclear how much of a win the churn would be if the current code works outside |make mozmill|. (Still, it's terrible that a script can mess up the environment badly enough to break file name handling. That wouldn't change if we relied on glib, which also reads from the environment.)

> Also I did some more digging on the Python thing, it turns out that their
> system for dealing with this problem is cleverer than it appears.  See:
> 
> http://python-notes.curiousefficiency.org/en/latest/python3/
> questions_and_answers.html#what-s-up-with-posix-systems-in-python-3
> http://www.python.org/dev/peps/pep-0383/
> https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-
> utf8/2000-07/msg00040.html
> 
> in decreasing order of clarity.

So the key is: "non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF". ...and then the *encoders* need to give analogous special treatment to those lone surrogates.

It's not clear if introducing malformed UTF-16 to the mix would make things better in Gecko.
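
(For illustration, a minimal sketch, not Gecko code, of the decode/encode pair being described. A real decoder would attempt proper UTF-8 decoding first; this sketch escapes every non-ASCII byte just to make the round-trip visible, and both function names are hypothetical.)

#include <cstdint>
#include <string>

// Decode: ASCII passes through; any byte that cannot be decoded is smuggled
// into UTF-16 as a lone surrogate U+DC80..U+DCFF (the PEP 383 trick).
std::u16string SurrogateEscapeDecode(const std::string& bytes) {
  std::u16string out;
  for (unsigned char b : bytes) {
    if (b < 0x80) {
      out.push_back(static_cast<char16_t>(b));
    } else {
      out.push_back(static_cast<char16_t>(0xDC00 + b));  // U+DC80..U+DCFF
    }
  }
  return out;
}

// The paired encoder must reverse exactly this mapping, and only this
// mapping, so the original byte sequence round-trips losslessly.
std::string SurrogateEscapeEncode(const std::u16string& units) {
  std::string out;
  for (char16_t u : units) {
    if (u >= 0xDC80 && u <= 0xDCFF) {
      out.push_back(static_cast<char>(u - 0xDC00));  // restore raw byte
    } else if (u < 0x80) {
      out.push_back(static_cast<char>(u));
    }
    // Real code would UTF-8-encode all other code points; elided here.
  }
  return out;
}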
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> It has come to my attention that the problem reported about our current
> setup only manifests itself under |make mozmill|, so there might be less
> reason to poke at this legacy code than it initially appeared.

Under mozmill, the CODESET becomes ANSI_X3.4-1968 (i.e. US-ASCII), so any path with non-ASCII bytes breaks. I haven't investigated what exactly in mozmill causes this, but at least it indicates that the current code breaks if something (such as a wrapper script) manages to force CODESET to its POSIX default (which ANSI_X3.4-1968 is).
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> (In reply to Zack Weinberg (:zwol) from comment #12)
> > I don't know about that.  *New* files should be created UTF-8 throughout, but
> > archived files may persist with ISO-8859-n or other legacy encodings in their
> > names ... indefinitely.
> 
> It seems like a *terrible* idea to either use file names in differing
> encodings on a single file system instance or to have file system APIs
> expose different encodings to applications depending on which one of
> multiple mounted file systems on a single operating system is being accessed.

It is.  However, people did the former in the 1990s and early 2000s and we are stuck with the consequences of that.

> For non-*nix-affiliated file systems like FAT and NTFS, Linux supports
> translating on-disk file names to UTF-8 for API visibility. It seems to me
> that the sane way to deal with archive *nix disks would be to declare their
> encoding as an option to |mount| and have the kernel expose the names on all
> mounted file systems as UTF-8. But looking at the man page for mount, it
> seems that options of this nature are only available to file systems that
> don't have a *nix origin. Am I missing something, or is the Linux world
> really insane enough not to have facilities to make the file system APIs
> talk UTF-8 and make non-UTF-8 file system paths the kernel's problem in the
> general case (as, AFAIK, is the case on OS X)?

You are not missing anything.  However, I would characterize this as a reasoned choice to take the least bad available option given all the design constraints at the time (2001), not insanity.  (Keep in mind that in 2001 there were still large, vocal anti-Unicode constituencies, and some of them probably had reps on the POSIX committee.  Also keep in mind that Unicode is a moving target and has made backward-incompatible changes to normalization and case-folding on more than one occasion.)

The kernel perspective is that pathnames are strings of *bytes* - not characters - of which only the values 0x00 and 0x2F have kernel-assigned semantics.  (That's NUL and '/'.)  If you want to use UTF-8, ASCII, KOI8-R, and ISO-8859-7 simultaneously within the same filesystem, or even one *pathname*, or even one *component*, the kernel does not care.  Have fun.

This attitude has three significant advantages for the kernel: it allows pathname comparisons to continue to use blind string compare and therefore go fast(er than any alternative), it means the kernel does not have to take any stance on Unicode normalization, and it guarantees to not damage legacy data.  Making *sense* of legacy data is up to applications, and may be a pain in the ass, but at least it will all be there and intact.
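
(A tiny demonstration of that byte-level view, using plain POSIX calls; the bytes below are the EUC-JP encoding of テスト from comment 6, which is not valid UTF-8, and the kernel accepts them without complaint. A sketch; it creates a junk-named file in the current directory.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
  // EUC-JP bytes for テスト: valid EUC-JP, invalid UTF-8. The kernel only
  // cares that no byte is 0x00 or 0x2F.
  const char name[] = "\xA5\xC6\xA5\xB9\xA5\xC8";
  int fd = open(name, O_CREAT | O_WRONLY, 0644);
  if (fd >= 0) {
    close(fd);
    printf("created a file whose name is not valid UTF-8\n");
  } else {
    perror("open");
  }
  return 0;
}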

This attitude also has an advantage for applications, in that it means that they don't have to think about encoding unless they want to.  Two pathnames refer to the same file (modulo symlinks) if and only if they compare equal according to strcmp().  When I was hacking on Monotone, OSX's alternative choice of converting to some undocumented recension of NFKD and then comparing according to some undocumented recension of the Unicode case-folding algorithm meant that we could not guarantee that a repo created on some other operating system would check out successfully on OSX.  (And in fact people usually tripped over this because Windows picked *different*, but equally undocumented, recensions of the same.)

We only get in trouble because we internally want to do everything in UTF-16, which has always been an error; because we want to display paths to the user, which forces us to *convert* to UTF-16 rather than just pretending bytes 0x00 .. 0xFF coming from the OS are U+0000 .. U+00FF no matter how little sense that makes in context; and because we do that conversion immediately upon input from the OS rather than right before display, which also makes sense in context but means the *consequences* of failed conversion are worse than just mojibake in the URL bar.

> Throwing the problem from Gecko to glib seems attractive as a matter of
> principle, but it's unclear how much a win the churn would be if the current
> code works outside |make mozmill|. (Still, it's terrible that a script can
> mess up the environment badly enough to break file name handling. That
> wouldn't change if we relied on glib, which also reads from the environment.)

The more I think about this, the more this plan makes sense to me:

1. Use g_filename_{to,from}_utf8 when, but only when, talking to the Gtk file selection dialogs.
2. When we read paths directly from disk, assume UTF-8 regardless of locale.
   (I *think* this by itself would fix |make mozmill|.)
3. If that assumption causes us to trip over encoding errors, punt; that file
   is inaccessible from Firefox (see the validation sketch after this list).
4. If (3) causes support headaches, only then do we worry about doing something
   cleverer (such as what Python does).
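
A rough sketch of the strict-validation punt in steps 2 and 3 (the function name IsValidUtf8 is hypothetical; Gecko would presumably reuse an existing validator rather than this hand-rolled one):

#include <cstdint>
#include <string>

// Returns true iff |s| is well-formed UTF-8: no invalid lead bytes, stray
// continuation bytes, truncated sequences, overlong forms, surrogates, or
// values above U+10FFFF. A directory entry failing this check gets punted.
bool IsValidUtf8(const std::string& s) {
  size_t i = 0;
  const size_t n = s.size();
  while (i < n) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) { i += 1; continue; }
    size_t len;
    uint32_t cp;
    uint32_t min;
    if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else { return false; }                   // invalid lead byte
    if (i + len > n) { return false; }       // truncated at end of string
    for (size_t k = 1; k < len; ++k) {
      unsigned char t = static_cast<unsigned char>(s[i + k]);
      if ((t & 0xC0) != 0x80) { return false; }  // not a continuation byte
      cp = (cp << 6) | (t & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;                          // overlong form or surrogate
    }
    i += len;
  }
  return true;
}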

> So the key is: "non-decodable bytes >= 128 will be represented as lone
> surrogate codes U+DC80..U+DCFF". ...and then the *encoders* need to give
> analogous special treatment to those lone surrogates.

Well, just the special |surrogateescape| encoder that's paired with it.  And you (that is, Python) only use them when you're sure you have to.

(In reply to Henri Sivonen (:hsivonen) from comment #14)
> Under mozmill, the CODESET becomes ANSI_X3.4-1968 (i.e. US-ASCII), so any
> path with non-ASCII bytes breaks. I haven't investigated what exactly in
> mozmill causes this, but at least it indicates that the current code breaks
> if something (such a wrapper script) manages to force CODESET to its POSIX
> default (which ANSI_X3.4-1968 is).

Probably setting LC_ALL=C or equivalent.  I understand there is now a C.UTF-8 that has all of the desirable properties of the "C" locale but leaves CODESET=UTF-8.  We should be using that in the test farm instead.  (May require system configuration tweakage.)
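
(A quick way to observe the difference being described; a sketch, and whether C.UTF-8 exists at all depends on the libc/distro.)

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main() {
  setlocale(LC_ALL, "C");
  printf("C codeset: %s\n", nl_langinfo(CODESET));  // ANSI_X3.4-1968 on glibc
  if (setlocale(LC_ALL, "C.UTF-8")) {
    printf("C.UTF-8 codeset: %s\n", nl_langinfo(CODESET));  // UTF-8
  } else {
    printf("C.UTF-8 locale not available on this system\n");
  }
  return 0;
}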
Hi, I am uploading the content of an e-mail that I sent to Henri Sivonen
out of band. The e-mail was triggered by an exchange on the dev-platform
mailing list, but as you can see, it has become lengthy due to quotes from
the |make mozmill| log, etc., so I sent it privately to Henri for his
review of the strange behavior.

But now, having consulted Henri, I think it is best to expose the content
for wider dissemination, so I am uploading it here.

Since I wrote this e-mail, I found that a particular installation of 32-bit
Debian GNU/Linux (and presumably other 32-bit Debian installations too) does
not exhibit this error even though nl_langinfo() returns the "ANSI..."
string. There must be some environmental setup glitch, but it is hard to
figure out where. Using LANG=C.UTF-8, etc. may solve the issue (or may not).

Coupled with the issue of the Date string in the Japanese locale, which
caused |make mozmill| to produce non-RFC-compliant test messages (the Date:
header line contained dates that used Japanese characters, which are not
RFC-compliant), I think |make mozmill| ought to use a more sanitized set of
environment variables. (Should I file a bug on this?)

TIA
(In reply to ISHIKAWA, Chiaki from comment #16)
> Created attachment 8363759 [details]
> |make mozmill| under Debian GNU/Linux 64-bit somehow force
> nl_langinfo(CODESET) to return "ANSI_X3.4-1968" ?
> 
> Hi, I am uploading the content of an e-mail that I sent to Henri Sivonen
> out of band. The e-mail was triggered by an exchange on the dev-platform
> mailing list, but as you can see, it has become lengthy due to quotes from
> the |make mozmill| log, etc., so I sent it privately to Henri for his
> review of the strange behavior.

LC_ALL=C overrides all other locale information in the environment (LANG, LANGUAGE, LC_*).  Please do try instead LC_ALL=C.UTF-8.
(In reply to Zack Weinberg (:zwol) from comment #15)
> We only get in trouble because we internally want to do everything in
> UTF-16, which has always been an error;

Please pardon my ignorance (as a non-Linux, non-glibc user): how does UTF-16 get into the mix here?
(Just curious)
(In reply to Martin Husemann from comment #18)
> (In reply to Zack Weinberg (:zwol) from comment #15)
> > We only get in trouble because we internally want to do everything in
> > UTF-16, which has always been an error;
> 
> Please pardon my ignorance (as a non-Linux, non-glibc user): how does UTF-16
> get into the mix here?
> (Just curious)

JavaScript strings are, unfortunately, *specified* as encoded in UTF-16 (that is, a conformant JS program can observe this).  For consistency's sake, therefore, all of Gecko's C++ code uses UTF-16 strings as well, and (here's where this bug comes into it) pathnames retrieved from the OS, e.g. by readdir(), are immediately forced into that encoding.

By "has always been an error" I mean that the people responsible for this design decision, 'way back in the nineties, *should have known at the time* that UTF-16 was the wrong encoding to choose for internal operations, and even more so, that it should not have been visible to JavaScript.  This is, now, a problem for us on all our platforms, not just the GNUish ones.
Hi,

The wise suggestions so far seem to be:

- Make sure that |make mozmill| and other test suite scripts set the
  environment variables controlling the locale's language and charset
  selection to sane values.

- It was suggested that C.UTF-8 be chosen for testing.

Now, as it turns out, the story is not straightforward.


Problem-1: Mozilla code does not accept the single letter "C" as a valid language code.

When I set C.UTF-8 under the 64-bit version of Debian GNU/Linux (where
iconv() failed), I got error output due to the check in the following part
of the source code:

http://mxr.mozilla.org/comm-central/source/mozilla/intl/locale/src/unix/nsPosixLocale.cpp#126

     NS_ASSERTION((len == 2) || (len == 3), "language code too short");


So I needed to loosen the check to allow "C" as a language code. Actually,
"C" is downcased by the time execution reaches here, so I needed to check
for (len == 1) AND 'C' or 'c', as in the attached patch.
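
(A hedged reconstruction of what the loosened check might look like; the variable name lang_code is assumed for illustration, and the actual attached patch may differ in detail.)

     NS_ASSERTION((len == 2) || (len == 3) ||
                  (len == 1 && (lang_code[0] == 'C' || lang_code[0] == 'c')),
                  "language code too short");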

After changing the code with the attached patch,
|make mozmill| no longer complained about iconv failure (!)

Compared with the old report, the charset becomes "UTF-8" with the patch
(and with my setting LC_CTYPE=C.UTF-8, etc. in the script BEFORE calling
|make mozmill|).

An excerpt from an execution under 32-bit Debian GNU/Linux after the change
(64-bit Debian GNU/Linux showed basically the same information):

DEBUG: nsNativeCharsetConverter: native_charset = <<UTF-8>>
DEBUG: xp_iconv_open: VALID res=0x8c3fa00,to_name=<<UTF-16LE>>, from_name=<<UTF-8>>
DEBUG: xp_iconv_open: VALID res=0x8c47a60,to_name=<<UTF-8>>, from_name=<<UTF-16LE>>
DEBUG: nsNativeCharsetConverter: at the end of LayzInit call path
DEBUG: nsNativeCharsetConverter::gNativeToUnicode = 0x8c3fa00
DEBUG: nsNativeCharsetConverter::gUnicodeToNative = 0x8c47a60

I ran |make mozmill| and the xpcshell tests, and both ran without any ill
effect from the change (under 64-bit Debian GNU/Linux).

I was glad this patch helps solve the issue.

Problem-2:

*BUT* when I checked this under 32-bit Debian GNU/Linux (under which
iconv() did not fail even though native_charset was "ANSI...", which was a
mystery), TB worked as expected. *BUT* I noticed this message in the log:

Fontconfig warning: ignoring C.UTF-8: not a valid language tag

Huh?
This was seen even before |make mozmill| was invoked. (I think something in
the following command sequence, or a program triggered from it, which sets
up a dedicated window system for testing TB, complained about the C.UTF-8
setting.)

# Use a separate window as Xserver screen.
# Size 1024x768 -> 1280x768
Xephyr -ac -br -noreset -screen 1280x768 :1 &
DISPLAY=localhost:1.0
sleep 2
oclock &
xfwm4 &

(I am dedicating this screen to testing so that normal desktop usage is not
disrupted.)

During the testing of TB, this message was also seen, so fontconfig may not
like the C.UTF-8 setting.

One down and one more to go?

I think the patch should go in the source tree anyway.
Should I file a separate bug?

TIA
Hi, I found that the second issue I mentioned, "Fontconfig complaining about
C.UTF-8", is a known Debian issue that was supposedly fixed in the new
version released in November last year:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=721275

(My 64-bit Linux installation is for development and I often update it to get the latest development tools, whereas the 32-bit version is used for desktop work and I don't update its packages often, for stability reasons.)

After I upgraded the packages on 32-bit Debian GNU/Linux (specifically, fontconfig and libfontconfig1-dev), the updated version fixed the second issue. So the use of C.UTF-8 with TB (and presumably FF) is now possible, together with the proposed patch for the Mozilla code per se.

TIA
I think the situation is the same for Linux and Solaris.
Flags: needinfo?(ginn.chen)
For the problem found in comment 20,
I filed
Bug 981463 - Allow one letter "C" as in C.UTF-8 as a valid language code
as a separate entry.

TIA
See Also: → 981463
See Also: → 1228755
Blocks: 1342659
I tested that, with an EUC-JP locale, if the profile path and download directory have non-ASCII EUC-JP byte sequences in them, Firefox is already very broken: can't download, can't save page as, can't save history, can't save bookmarks.

From the relative lack of bug reports, it seems safe to conclude that Firefox users who might still have non-UTF-8 locale settings don't have non-ASCII home directory paths.
Attachment #8934491 - Flags: review?(m_kato)
Attachment #8934491 - Flags: review?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> I tested that, with an EUC-JP locale, if the profile path and download
> directory have non-ASCII EUC-JP byte sequences in them, Firefox is already
> very broken: can't download, can't save page as, can't save history, can't
> save bookmarks.
> 
> From the relative lack of bug reports, it seems safe to conclude that
> Firefox users who might still have non-UTF-8 locale settings don't have
> non-ASCII home directory paths.

My experience: I have used EUC-JP for a long time, since the early 1980s, and one of the PCs that runs Debian has used EUC-JP for a long time.

*HOWEVER*, to this day I have never tried to create a directory name that contains Japanese characters, because of issues like this one, which I have experienced over the years.

So I suspect programming types don't use non-ASCII directory names when they use EUC-JP. The problem happens with unsuspecting clerical types who are asked to use such a system and create directory names in Japanese. People who did not get bitten in the 1980s and early 1990s use Japanese directory names freely, and my heart skips a beat or two when I see such a desktop or folder listing.

Because some Japanese people tend to insert a space on both sides of an English word in a directory name, many people had directory names containing whitespace, and those have caused problems before; but that happens even with an ASCII-only pathname with whitespace in it.

These natural-language-related issues are hard to explain. Some people never experience them. Some, like me, who used computers before the software matured, got hit with such problems all the time.

TIA for people's attention.
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211040

::: xpcom/io/nsNativeCharsetUtils.h:43
(Diff revision 2)
>   */
> -#if defined(XP_UNIX) && !defined(XP_MACOSX) && !defined(ANDROID)
> -bool NS_IsNativeUTF8();
> -#else
>  inline bool
>  NS_IsNativeUTF8()

Let's make this a constexpr function while we are here.
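
(Roughly what that suggestion amounts to; a sketch, not the landed diff. With the iconv path gone, the answer is a compile-time constant on all non-Windows platforms.)

inline constexpr bool
NS_IsNativeUTF8()
{
  return true;
}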
Attachment #8934491 - Flags: review?(VYV03354) → review+
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211040

> Let's make this a constexpr function while we are here.

Added constexpr. Thanks.
Comment on attachment 8934491 [details]
Bug 960957 - Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms.

https://reviewboard.mozilla.org/r/205406/#review211454

I guess that we can remove HAVE_ICONV, HAVE_ICONV_WITH_CONST_INPUT, HAVE_MBRTOWC and HAVE_WCRTOMB from configure.in.  But I will file a bug for it.
Attachment #8934491 - Flags: review?(m_kato) → review+
Release Note Request (optional, but appreciated)
[Why is this notable]:
Firefox for Linux (and *NIX) will no longer support non-UTF-8 locales such as ja_JP.eucJP.

[Affects Firefox for Android]:
No.

[Suggested wording]:
Firefox for Linux will no longer support non-UTF-8 locales such as ja_JP.eucJP.

[Links (documentation, blog post, etc)]:
Nothing.
relnote-firefox: --- → ?
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/354794d16661
Drop nsIFile support for non-UTF-8 file path encodings on non-Windows platforms. r=emk,m_kato
Blocks: 1423846
(In reply to Makoto Kato [:m_kato] (slow due to PTO?) from comment #32)

Thanks for the r+.

> Release Note Request (optional, but appreciated)
...
> [Suggested wording]:
> Firefox for Linux will no longer support non-UTF-8 locales such as
> ja_JP.eucJP.

I think this wording is problematic for two reasons:
 1) It suggests that we substantially dropped support at this point. However, the support was already very, very broken. We don't even know when in the past exactly we dropped support for practical purposes.
 2) It suggests that Firefox won't work with a non-UTF-8 locale at all even though it'll still work when the profile path is all-ASCII (the common case) and will work even for downloads if the download folder path and the name of the downloaded file are all-ASCII.

I suggest this wording instead:
"The vestiges of support for non-UTF-8 file paths were removed from Firefox for Linux."
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
(In reply to Henri Sivonen (:hsivonen) from comment #34)
> I suggest this wording instead:
> "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> for Linux."

I think this doesn't convey 1) very clearly.
(In reply to Mike Hommey [:glandium] from comment #35)
> (In reply to Henri Sivonen (:hsivonen) from comment #34)
> > I suggest this wording instead:
> > "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> > for Linux."
> 
> I think this doesn't convey 1) very clearly.

How about:
"Previously on Linux under non-UTF-8 glibc locale settings file paths were in some contexts treated as UTF-8 and in other contexts according to the locale's encoding leading to erroneous behavior. Now file paths are always treated as UTF-8."
https://hg.mozilla.org/mozilla-central/rev/354794d16661
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
(In reply to Henri Sivonen (:hsivonen) from comment #36)
> (In reply to Mike Hommey [:glandium] from comment #35)
> > (In reply to Henri Sivonen (:hsivonen) from comment #34)
> > > I suggest this wording instead:
> > > "The vestiges of support for non-UTF-8 file paths were removed from Firefox
> > > for Linux."
> > 
> > I think this doesn't convey 1) very clearly.
> 
> How about:
> "Previously on Linux under non-UTF-8 glibc locale settings file paths were
> in some contexts treated as UTF-8 and in other contexts according to the
> locale's encoding leading to erroneous behavior. Now file paths are always
> treated as UTF-8."

That's maybe too much detail.

How about:
"Previously, paths were inconsistently treated as UTF-8 or not, leading to erroneous behavior. They are now always treated as UTF-8."
?
(In reply to Mike Hommey [:glandium] from comment #38)
> How about:
> "Previously, paths were inconsistently treated as UTF-8 or not, leading to
> erroneous behavior. They are now always treated as UTF-8."
> ?

Looks good to me provided that we add "on Linux":
"Previously on Linux, paths were inconsistently treated as UTF-8 or not, leading to erroneous behavior. They are now always treated as UTF-8."
But it is not just Linux; the other affected POSIX systems are unrelated to it and not recognized as "Linux".
(In reply to Martin Husemann from comment #40)
> But it is not just Linux, the other affected Posix systems are unrelated and
> not recognized as "Linux".

Right, but Mozilla publishes release notes only for Mac, Windows, Linux and Android, and in that scope, this change applies to Linux.
If you would like to write a blog post about it (on a Mozilla-hosted blog), it would be great to link to. Or anything on MDN or SUMO that explains it further.

For now I'm using this for 59.0b1 release notes:

Paths now are always treated as UTF-8 on Linux
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #42)
> If you would like to write a blog post about it (on a mozilla hosted blog)
> it would be great to link to. Or, anything on MDN or SUMO that explains
> further. 

A blog post would give this more attention than it deserves at this point. I'll see if I can write up something on SUMO.

> For now I'm using this for 59.0b1 release notes:
> 
> Paths now are always treated as UTF-8 on Linux

Thanks.