Closed Bug 1291275 Opened 3 years ago Closed 3 years ago

Mach shouldn't die on chinese glyphs sent to stdout

Categories

(Firefox Build System :: Mach Core, enhancement)

enhancement
Not set

Tracking

(firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
mozilla51
Tracking Status
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: Callek, Assigned: Pike)

Details

Attachments

(1 file)

From IRC discussion, :Pike asked for me to file this...


02:07:46     INFO -  05:07:46     INFO -  Error running mach:
02:07:46     INFO -  05:07:46     INFO -      ['compare-locales', '--merge-dir', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/obj-firefox/merged', '--l10n-ini', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/mobile/android/locales/l10n.ini', '--l10n-base', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/mozilla-aurora', 'zh-CN']
02:07:46     INFO -  05:07:46     INFO -  The error occurred in code that was called by the mach command. This is either
02:07:46     INFO -  05:07:46     INFO -  a bug in the called code itself or in the way that mach is calling it.
02:07:46     INFO -  05:07:46     INFO -  You should consider filing a bug for this issue.
02:07:46     INFO -  05:07:46     INFO -  If filing a bug, please include the full output of mach, including this error
02:07:46     INFO -  05:07:46     INFO -  message.
02:07:46     INFO -  05:07:46     INFO -  The details of the failure are as follows:
02:07:46     INFO -  05:07:46     INFO -  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 721: ordinal not in range(128)
02:07:46     INFO -  05:07:46     INFO -    File "/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/python/compare-locales/mach_commands.py", line 81, in compare
02:07:46     INFO -  05:07:46     INFO -      print(observer.serialize().encode('utf-8', 'replace'))
02:07:46     INFO -  05:07:46     INFO -    File "/tools/python27/lib/python2.7/codecs.py", line 351, in write
02:07:46     INFO -  05:07:46     INFO -      data, consumed = self.encode(object, self.errors)
02:07:46     INFO -  05:07:46    ERROR - Return code: 1
02:07:46     INFO -  05:07:46    ERROR - 1 not in success codes: [0]
02:07:46     INFO -  05:07:46  WARNING - setting return code to 2


Full log was:

http://archive.mozilla.org/pub/mobile/nightly/2016/08/2016-08-02-00-40-03-mozilla-aurora-android-api-15/mozilla-aurora-android-api-15-nightly-bm77-build1-build1.txt.gz

And from https://l10n.mozilla.org/dashboard/compare?run=618717 this seems like it was an issue because of:

`unknown escape sequence, \警 at line 31, column 52 for EnterUserPasswordForCrossOrigin`
I can't reproduce this via

LANG=en_US.UTF-8 LC_ALL=C ./mach compare-locales --l10n-ini mobile/android/locales/l10n.ini --l10n-base=/src/central/releases/l10n/mozilla-aurora/ zh-CN

on my mac terminal.
I had the problem while building Debian packages, but that's related to how the environment locale is not utf-8. My workaround in Debian is to set PYTHONIOENCODING to utf-8 when running mach compare-locales.
(In reply to Mike Hommey [:glandium] from comment #2)
> but that's related to how the environment locale is not utf-8.

Ah no! I remember better now! The problem is that stdout is not a tty, in which case python doesn't use utf-8 as output encoding (how broken is that, right?)

Axel, I'm sure you can reproduce if you do `mach compare-locales ... | cat`

Maybe that's something we should work around at the mach level, because compare-locales is not the only thing that breaks when redirecting the mach output. Greg, thoughts?
Component: Build Config → mach
Flags: needinfo?(gps)
(In reply to Mike Hommey [:glandium] from comment #3)
> 
> Axel, I'm sure you can reproduce if you do `mach compare-locales ... | cat`

Indeed, that reproduces the error.
The code forcing UTF-8 output goes all the way back to bug 795394, which was pretty early in the days of mach and admittedly I was a less experienced Python developer then. Essentially, we assumed that mach only generated textual output and that output would either be str/bytes that conformed to UTF-8 or unicode. Ideally "unicode" and we would encode that to bytes as the current encoding saw fit.

So I guess the problem we're seeing here is that removing the tty from stdout results in the sys.stdout's encoding being "ascii", which of course will blow up on non-ascii byte sequences. But how does this happen? If I pipe `mach help`, sys.stdout.encoding is None and the code in main/main.py will use utf-8. print() from `mach compare-locales` will send UTF-8 encoded bytes, which should get passed through. Is this not happening? Where is the ASCII conversion happening? Or is Python being stupid and trying to re-convert that byte sequence (interpretted as ASCII) to UTF-8? If so, you can fix the problem by having the print() in compare-locales' mach_commands.py send down a unicode instance instead of str/bytes. e.g.

  print(observer.serialize())

If that doesn't work, try:

  sys.stdout.errors = 'replace'
  print(observer.serialize())

If you want to be really low level, you can get a handle on stdout and write bytes to it:

  sys.stdout.stream.write(observer.serialize().encode('utf-8', 'replace'))
Flags: needinfo?(gps)
printing a unicode instance is what causes the problem in the first place:
$ python -c 'print u"
.... thank you bugzilla for crapping on the poop unicode character

$ python -c 'print u"<unicode-poop>"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Where is the ascii encoding coming from? mach/main.py should swap out sys.stdout for a UTF-8 writer. Is the original sys.stdout instance enforcing ascii encoding causing it to barf when it sees non-ascii byte sequences?

Should we be bypassing sys.stdout completely and using os.fdopen(sys.stdout.fileno()) to obtain a raw file descriptor to feed into codecs.getwriter()?
Indeed, print(u"<poop>") in a MachCommandBase does work thanks to the wrapping in mach/main.py.

But quickly looking at the code, I'd expect observer.serialize() to return a byte string. So doing .encode('utf-8', 'replace') is not going to make any good. And I don't think printing the byte string is going to do any good either (print(b'\xf0\x9f\x92\xa9') in CompareLocales fails with a UnicodeDecodeError when stdout is not a tty).

I think what you need here is to turn the output of observer.serialize into a unicode string, with .decode('utf-8').
Oh. I was assuming observer.serialize() returned a unicode since it was doing the .encode('utf-8', 'strict') foo.

Blame Python 2 for allowing str.encode() and unicode.decode() to actually work (those operations don't make any sense and IMO they should not be implemented). *sigh*
From my local testing, serialize() actually does return a unicode string.

Now, the funny part is where the ASCII is coming from. And I have no idea precisely, but the following is true:

print u'(╯°□°)╯︵ ┻━┻'.encode('utf-8').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

WTF? I tried searching on github, but the 2.7 sources aren't indexed, so that's futile.

We should be OK just not double-encoding in main.py and in compare_locales/mach_commands.py, I think. Pushed the corresponding patch to rb, and tested locally with both ./mach compare-locales and ./mach compare-locales|cat.
Assignee: nobody → l10n
Comment on attachment 8785941 [details]
bug 1291275, sys.stdout is utf-8 in mach, don't double-encode,

https://reviewboard.mozilla.org/r/74964/#review73082
Attachment #8785941 - Flags: review?(mh+mozilla) → review+
Pushed by axel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e93facb078d9
sys.stdout is utf-8 in mach, don't double-encode, r=glandium
https://hg.mozilla.org/mozilla-central/rev/e93facb078d9
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Should this be uplifted to at least Aurora? e.g. I assume this is the same bug https://treeherder.mozilla.org/logviewer.html#?job_id=3431078&repo=mozilla-aurora#L3551
Comment on attachment 8785941 [details]
bug 1291275, sys.stdout is utf-8 in mach, don't double-encode,

Approval Request Comment
[Feature/regressing bug #]: make stdout of repacks more resistant
[User impact if declined]: fails to generate localized builds for errors
[Describe test coverage new/current, TreeHerder]: been successful on nightly
[Risks and why]: low risk, build change only, only affects localizations, only localizations with issues in their strings.
[String/UUID change made/needed]: -

Given where we are the cycle, we don't need this on beta anymore.
Attachment #8785941 - Flags: approval-mozilla-aurora?
Comment on attachment 8785941 [details]
bug 1291275, sys.stdout is utf-8 in mach, don't double-encode,

NPOTB, Aurora50+
Attachment #8785941 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.