Closed Bug 1291275 Opened 3 years ago Closed 3 years ago
Mach shouldn't die on chinese glyphs sent to stdout
58 bytes, text/x-review-board-request
From IRC discussion, :Pike asked for me to file this... 02:07:46 INFO - 05:07:46 INFO - Error running mach: 02:07:46 INFO - 05:07:46 INFO - ['compare-locales', '--merge-dir', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/obj-firefox/merged', '--l10n-ini', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/mobile/android/locales/l10n.ini', '--l10n-base', '/builds/slave/m-aurora-and-api-15-ntly-00000/build/mozilla-aurora', 'zh-CN'] 02:07:46 INFO - 05:07:46 INFO - The error occurred in code that was called by the mach command. This is either 02:07:46 INFO - 05:07:46 INFO - a bug in the called code itself or in the way that mach is calling it. 02:07:46 INFO - 05:07:46 INFO - You should consider filing a bug for this issue. 02:07:46 INFO - 05:07:46 INFO - If filing a bug, please include the full output of mach, including this error 02:07:46 INFO - 05:07:46 INFO - message. 02:07:46 INFO - 05:07:46 INFO - The details of the failure are as follows: 02:07:46 INFO - 05:07:46 INFO - UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 721: ordinal not in range(128) 02:07:46 INFO - 05:07:46 INFO - File "/builds/slave/m-aurora-and-api-15-ntly-00000/build/src/python/compare-locales/mach_commands.py", line 81, in compare 02:07:46 INFO - 05:07:46 INFO - print(observer.serialize().encode('utf-8', 'replace')) 02:07:46 INFO - 05:07:46 INFO - File "/tools/python27/lib/python2.7/codecs.py", line 351, in write 02:07:46 INFO - 05:07:46 INFO - data, consumed = self.encode(object, self.errors) 02:07:46 INFO - 05:07:46 ERROR - Return code: 1 02:07:46 INFO - 05:07:46 ERROR - 1 not in success codes:  02:07:46 INFO - 05:07:46 WARNING - setting return code to 2 Full log was: http://archive.mozilla.org/pub/mobile/nightly/2016/08/2016-08-02-00-40-03-mozilla-aurora-android-api-15/mozilla-aurora-android-api-15-nightly-bm77-build1-build1.txt.gz And from https://l10n.mozilla.org/dashboard/compare?run=618717 this seems like it was an issue because of: `unknown escape sequence, \警 at line 31, column 52 for EnterUserPasswordForCrossOrigin`
I can't reproduce this via LANG=en_US.UTF-8 LC_ALL=C ./mach compare-locales --l10n-ini mobile/android/locales/l10n.ini --l10n-base=/src/central/releases/l10n/mozilla-aurora/ zh-CN on my mac terminal.
I had the problem while building Debian packages, but that's related to how the environment locale is not utf-8. My workaround in Debian is to set PYTHONIOENCODING to utf-8 when running mach compare-locales.
(In reply to Mike Hommey [:glandium] from comment #2) > but that's related to how the environment locale is not utf-8. Ah no! I remember better now! The problem is that stdout is not a tty, in which case python doesn't use utf-8 as output encoding (how broken is that, right?) Axel, I'm sure you can reproduce if you do `mach compare-locales ... | cat` Maybe that's something we should work around at the mach level, because compare-locales is not the only thing that breaks when redirecting the mach output. Greg, thoughts?
Component: Build Config → mach
(In reply to Mike Hommey [:glandium] from comment #3) > > Axel, I'm sure you can reproduce if you do `mach compare-locales ... | cat` Indeed, that reproduces the error.
The code forcing UTF-8 output goes all the way back to bug 795394, which was pretty early in the days of mach and admittedly I was a less experienced Python developer then. Essentially, we assumed that mach only generated textual output and that output would either be str/bytes that conformed to UTF-8 or unicode. Ideally "unicode" and we would encode that to bytes as the current encoding saw fit. So I guess the problem we're seeing here is that removing the tty from stdout results in the sys.stdout's encoding being "ascii", which of course will blow up on non-ascii byte sequences. But how does this happen? If I pipe `mach help`, sys.stdout.encoding is None and the code in main/main.py will use utf-8. print() from `mach compare-locales` will send UTF-8 encoded bytes, which should get passed through. Is this not happening? Where is the ASCII conversion happening? Or is Python being stupid and trying to re-convert that byte sequence (interpretted as ASCII) to UTF-8? If so, you can fix the problem by having the print() in compare-locales' mach_commands.py send down a unicode instance instead of str/bytes. e.g. print(observer.serialize()) If that doesn't work, try: sys.stdout.errors = 'replace' print(observer.serialize()) If you want to be really low level, you can get a handle on stdout and write bytes to it: sys.stdout.stream.write(observer.serialize().encode('utf-8', 'replace'))
printing a unicode instance is what causes the problem in the first place: $ python -c 'print u"
.... thank you bugzilla for crapping on the poop unicode character $ python -c 'print u"<unicode-poop>"' | cat Traceback (most recent call last): File "<string>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Where is the ascii encoding coming from? mach/main.py should swap out sys.stdout for a UTF-8 writer. Is the original sys.stdout instance enforcing ascii encoding causing it to barf when it sees non-ascii byte sequences? Should we be bypassing sys.stdout completely and using os.fdopen(sys.stdout.fileno()) to obtain a raw file descriptor to feed into codecs.getwriter()?
Indeed, print(u"<poop>") in a MachCommandBase does work thanks to the wrapping in mach/main.py. But quickly looking at the code, I'd expect observer.serialize() to return a byte string. So doing .encode('utf-8', 'replace') is not going to make any good. And I don't think printing the byte string is going to do any good either (print(b'\xf0\x9f\x92\xa9') in CompareLocales fails with a UnicodeDecodeError when stdout is not a tty). I think what you need here is to turn the output of observer.serialize into a unicode string, with .decode('utf-8').
Oh. I was assuming observer.serialize() returned a unicode since it was doing the .encode('utf-8', 'strict') foo. Blame Python 2 for allowing str.encode() and unicode.decode() to actually work (those operations don't make any sense and IMO they should not be implemented). *sigh*
From my local testing, serialize() actually does return a unicode string. Now, the funny part is where the ASCII is coming from. And I have no idea precisely, but the following is true: print u'(╯°□°）╯︵ ┻━┻'.encode('utf-8').encode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128) WTF? I tried searching on github, but the 2.7 sources aren't indexed, so that's futile. We should be OK just not double-encoding in main.py and in compare_locales/mach_commands.py, I think. Pushed the corresponding patch to rb, and tested locally with both ./mach compare-locales and ./mach compare-locales|cat.
Assignee: nobody → l10n
Comment on attachment 8785941 [details] bug 1291275, sys.stdout is utf-8 in mach, don't double-encode, https://reviewboard.mozilla.org/r/74964/#review73082
Attachment #8785941 - Flags: review?(mh+mozilla) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/e93facb078d9 sys.stdout is utf-8 in mach, don't double-encode, r=glandium
Should this be uplifted to at least Aurora? e.g. I assume this is the same bug https://treeherder.mozilla.org/logviewer.html#?job_id=3431078&repo=mozilla-aurora#L3551
Comment on attachment 8785941 [details] bug 1291275, sys.stdout is utf-8 in mach, don't double-encode, Approval Request Comment [Feature/regressing bug #]: make stdout of repacks more resistant [User impact if declined]: fails to generate localized builds for errors [Describe test coverage new/current, TreeHerder]: been successful on nightly [Risks and why]: low risk, build change only, only affects localizations, only localizations with issues in their strings. [String/UUID change made/needed]: - Given where we are the cycle, we don't need this on beta anymore.
Attachment #8785941 - Flags: approval-mozilla-aurora?
Comment on attachment 8785941 [details] bug 1291275, sys.stdout is utf-8 in mach, don't double-encode, NPOTB, Aurora50+
Attachment #8785941 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.