"mach warnings-list | less" produces this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 100: ordinal not in range(128)

RESOLVED FIXED in mozilla18

Status

Firefox Build System
Mach Core
RESOLVED FIXED
6 years ago
2 months ago

People

(Reporter: dholbert, Assigned: gps)

Tracking

Trunk
mozilla18
x86_64
Linux

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

As noted in bug 794713 comment 20: if I apply that bug's patch and run...
>  ./mach warnings-list  | less
...then I get the following (and only the following) displayed in my "less" output:
{
No handlers could be found for logger "pymake.data"
 Traceback (most recent call last):
  File "./mach", line 48, in <module>
    mach.run(sys.argv[1:])
  File "/scratch/work/builds/mozilla-central/mozilla-central.12-03-19.10-25/mozilla/python/mach/mach/main.py", line 133, in run
obj-x86_64-unknown-linux-gnu/dist/include/mozilla/Assertions.h:263:54 [-Wvariadic-macros] anonymous variadic macros were introduced in C99
obj-x86_64-unknown-linux-gnu/dist/include/mozilla/Assertions.h:267:33 [-Wvariadic-macros] anonymous variadic macros were introduced in C99
obj-x86_64-unknown-linux-gnu/dist/include/mozilla/Assertions.h:275:22 [-Wvariadic-macros] anonymous variadic macros were introduced in C99
    fn(**stripped)
  File "/scratch/work/builds/mozilla-central/mozilla-central.12-03-19.10-25/mozilla/python/mach/mach/warnings.py", line 61, in list
    warning['column'], warning['flag'], warning['message']))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 100: ordinal not in range(128)
}

I suspect this is because unicode characters are now making it past the build process, but we can't handle them after that point if we're piped to e.g. "less".

If I don't pipe it through "less", I don't have any problems. (list of build warnings prints as expected)
(Assignee)

Comment 1

6 years ago
I can reproduce this.

The warnings.json file (objdir/.mozbuild/warnings.json) has standard Unicode escaping:

  "files": {
    "/home/gps/src/mozilla-central/xpcom/tests/TestCOMArray.cpp": {
      "hash": "00ad4df58de719ecca19b66d2cfd82a37b39b0dc",
      "warnings": [
        {
          "column": 7,
          "line": 239,
          "flag": "-Wuninitialized",
          "message": "\u2018fourthObject\u2019 may be used uninitialized in this function",
          "filename": "/home/gps/src/mozilla-central/xpcom/tests/TestCOMArray.cpp"
        },

i.e. the file itself is ASCII however the JSON serializer converts to/from UTF-8 at serialization time.

When we've parsed the file back into Python data structures, we see unicode instances. e.g.

{u'column': 7, u'filename': u'/home/gps/src/mozilla-central/xpcom/tests/TestCOMArray.cpp', u'flag': u'-Wuninitialized', u'line': 239, u'message': u'\u2018fourthObject\u2019 may be used uninitialized in this function'}

Just to be sure:

  print w['message'][0]
  ‘

If we print warnings-list directly to the terminal (the default), we see our Unicode smart quotes and all is well.

However, as soon as we pipe into less, we blow up.

I suspect sys.stdout.encoding is different/not-UTF-8 when piping to less.
(Assignee)

Comment 2

6 years ago
When printing to the terminal sys.stdout.encoding is UTF-8. When piping to less it is None. Yay default encoding of ASCII! Not.
(Assignee)

Comment 3

6 years ago
Searching the internets for a recommended solution to this problem (I don't want to have to .encode('utf-8') every time I print) reveals a minefield. CC'ing the Python brain trust.
(Assignee)

Comment 4

6 years ago
Created attachment 666011 [details] [diff] [review]
Transparently wrap standard streams with UTF-8 encoding, v1

This is how I think it should be done. The rest of the Internet seems to be on crack with its setdefaultencoding and PYTHONIOENCODING nonsense. This seems much saner (and simpler) to me. And, it appears to work!

If I could change anything about this patch it would be to have a context manager for transparent encoding overriding on file objects instead of the try..finally. I don't think I saw one in codecs. Maybe one is lurking elsewhere in the stdlib. (I'm too lazy to write my own, meh).
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #666011 - Flags: review?(jhammel)

Comment 5

6 years ago
Comment on attachment 666011 [details] [diff] [review]
Transparently wrap standard streams with UTF-8 encoding, v1

Review of attachment 666011 [details] [diff] [review]:
-----------------------------------------------------------------

This will now blow up if you try to write an non-ASCII |str|:

>>> codecs.getwriter("utf-8")(StringIO.StringIO()).write("\x98")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)

The bug 794713 patch basically forces output from process to be unicode, so I suppose you're assuming this won't be a problem.

Comment 6

6 years ago
Comment on attachment 666011 [details] [diff] [review]
Transparently wrap standard streams with UTF-8 encoding, v1

TBH, I'm wary of this change.  If we have non-UTF-8, will this break? (as much as I'd love if the whole world was UTF-8, that doesn't seem likely).  I would like to see tests with this.
(Assignee)

Comment 7

6 years ago
So, the theory is that everything inside of mach/mozbuild is unicode (Python 2.7) or str (Python 3). Either way, data is stored as Unicode.

I concede that this patch currently blows up if we try to write out str or bytes. But, mach should not currently write out non-textual data, so I don't think we have to cross this bridge today.

Anyway, if we are converting from Unicode to bytes, we need to choose an encoding. And, UTF-8 seems like a reasonable choice (if none has been defined). The only additional thing we should consider is whether to throw error='ignore' in there. I can make a case for that. But, at this juncture I'd like to leave it out because I'd like to know where we aren't dealing with Unicode internally.

Comment 8

6 years ago
Even if all mach/mozbuild is unicode, we still have to worry about non-unicode from m-c and the system.  I'll give this an r+, but still, am wary.  I suppose it is no worse than what we do now

Updated

6 years ago
Attachment #666011 - Flags: review?(jhammel) → review+
(Assignee)

Comment 9

6 years ago
https://hg.mozilla.org/mozilla-central/rev/85df971e0db1
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18

Updated

2 months ago
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.