Closed Bug 1274090 Opened 4 years ago Closed 4 years ago

Logging the output of a subprocess writing non-ascii to its stdout in python configure will fail

Categories

(Firefox Build System :: General, defect)

defect
Not set

Tracking

(firefox49 fixed)

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: chmanchester, Assigned: chmanchester)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

As a test case, putting:

@depends('--help')
@imports('subprocess')
@imports('os')
@imports('codecs')
@imports(_from='__builtin__', _import='open')
def dies_when_logging(_):
    test_file = 'test.txt'
    try:
        with open(test_file, 'w+') as fh:
            codecs.getwriter('utf-8')(fh).write('\u00B4')
        log.info(subprocess.check_output(['cat', 'test.txt']))
    finally:
        os.remove(test_file)

in moz.configure will produce a traceback like:

 0:00.55 Traceback (most recent call last):
 0:00.55   File "/home/chris/m-c/python/mozbuild/mozbuild/configure/util.py", line 128, in emit
 0:00.55     self._stdout_waiting = msg.endswith('... ')
 0:00.55 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
 0:00.55 Logged from file moz.configure, line 197
 0:00.55 Traceback (most recent call last):
 0:00.55   File "/usr/lib/python2.7/logging/__init__.py", line 859, in emit
 0:00.55     msg = self.format(record)
 0:00.55   File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
 0:00.55     return fmt.format(record)
 0:00.55   File "/usr/lib/python2.7/logging/__init__.py", line 474, in format
 0:00.55     s = self._fmt % record.__dict__
 0:00.55 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Assignee: nobody → cmanchester
Ok, so subprocess always returns str... we can at least attempt to detect the right encoding in check_cmd_output.
Attachment #8756038 - Flags: review?(mh+mozilla)
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/54930/diff/1-2/
Attachment #8756038 - Flags: review?(mh+mozilla)
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

https://reviewboard.mozilla.org/r/54930/#review51666

Considering other things than check_cmd_output might do this in the future (or might even be doing it already), it seems to me it would be better to just make the log functions impervious to this instead.

Also note that locale.getpreferredencoding is a piece of sh*t. See python/mozbuild/mozbuild/configure/util.py.

::: build/moz.configure/util.configure:33
(Diff revision 2)
>  @imports(_from='mozbuild.shellutil', _import='quote')
> +@imports(_from='locale', _import='getpreferredencoding')
>  def check_cmd_output(*args, **kwargs):
>      onerror = kwargs.pop('onerror', None)
>  
> +    # subprocess always returns bytes, but the rest of out output

s/out/our/ ?
Attachment #8756038 - Flags: review?(mh+mozilla)
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/54930/diff/2-3/
Attachment #8756038 - Attachment description: MozReview Request: Bug 1274090 - Don't fail when printing a process' non-ascii output from Python configure's check_cmd_output. r=glandium → MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium
Attachment #8756038 - Flags: review?(mh+mozilla)
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

https://reviewboard.mozilla.org/r/54930/#review52334

This seems reasonable to me, although the isinstance(..., str) is not python 3 friendly, but so is most of our string handling anyways... I'd still like gps to also take a look at this.

::: python/mozbuild/mozbuild/configure/__init__.py:168
(Diff revision 3)
> +                out_args = []
> +                for arg in args:
> +                    if encoding and isinstance(arg, str):
> +                        arg = arg.decode(encoding)
> +                    out_args.append(arg)

This could be written:

out_args = [
  arg.decode(encoding) if encoding and isinstance(arg, str) else arg
  for arg in args
]

::: python/mozbuild/mozbuild/configure/util.py:85
(Diff revision 3)
>                  # locale._parse_localename makes locale.getpreferredencoding
>                  # return None when LC_ALL is C, instead of e.g. 'US-ASCII' or
>                  # 'ANSI_X3.4-1968' when it uses nl_langinfo.

This should probably move as well.

::: python/mozbuild/mozbuild/test/configure/test_util.py:418
(Diff revision 3)
>              l.write('a\nb\nc')
>  
>              self.assertEqual(lines, ['a', 'b'])
>  
>          self.assertEqual(lines, ['a', 'b', 'c'])
>  

PEP8 wants 2 lines here.
Attachment #8756038 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

https://reviewboard.mozilla.org/r/54930/#review52662

I'll look at the final patch once glandium's (and possibly my) suggestions are incorporated

::: python/mozbuild/mozbuild/configure/__init__.py:161
(Diff revision 3)
> +        # Some callers will manage to log a bytestring with characters in it
> +        # that can't be converted to ascii. Make our log methods robust to this
> +        # by detecting the encoding that a producer is likely to have used.
> +        encoding = getpreferredencoding()
> +        def wrapped_log_method(logger, key):
> +            method = getattr(logger, key)
> +            def wrapped(*args, **kwargs):
> +                out_args = []
> +                for arg in args:
> +                    if encoding and isinstance(arg, str):
> +                        arg = arg.decode(encoding)
> +                    out_args.append(arg)
> +                return method(*out_args, **kwargs)
> +            return wrapped

So, the way `logging` works are function calls to log things get turned into `LogRecord` class instances. Those get passed to `LogHandler` instances. The default implementation of `LogHandler.emit()` calls `format()` on the `Formatter` registered to the handler, which usually takes the str/unicode message and the arguments and does a `%` formatting operation. Read the source at https://hg.python.org/cpython/file/2.7/Lib/logging/__init__.py.

The original stack in this bug says the error is occurring in the formatter. We're likely passing a mix of str and unicode to a `X % Y` operation and it attempts implicit encoding/decoding using "ascii" and fails.

The question I want to posit is where should the str/unicode conversion occur? Should it magically convert arguments to the log functions (as this commit implements)? Should it convert closer to output time? Should it prevent mixed types to log function calls and require callers to pass in specific types?

I'm kind of a fan of the latter. I think encoding should be done at the periphery. If the logging subsystem is getting of str and unicode types, the caller has already failed.

So I'm going to gently push back: why are we allowing str to be passed into the logging mechanism at all? Can we limit to unicode? Or, could we limit to unicode *or* str and have the log filter munge to a consistent type?

I hate Python 2's str/unicode foo. I doubly hate that the log APIs allow you to pass in mixed types. IMO it should provide an API for sending bytes and an API for sending unicode. And you could specify the encoding for bytes in case they need to be decoded. But alas.
Attachment #8756038 - Flags: review?(gps)
The thing is, we're not exactly talking about python code. We're talking about sandboxed configure code. We don't want callers to have to deal with the str/unicode mess on their own.
https://reviewboard.mozilla.org/r/54930/#review52662

> So, the way `logging` works are function calls to log things get turned into `LogRecord` class instances. Those get passed to `LogHandler` instances. The default implementation of `LogHandler.emit()` calls `format()` on the `Formatter` registered to the handler, which usually takes the str/unicode message and the arguments and does a `%` formatting operation. Read the source at https://hg.python.org/cpython/file/2.7/Lib/logging/__init__.py.
> 
> The original stack in this bug says the error is occurring in the formatter. We're likely passing a mix of str and unicode to a `X % Y` operation and it attempts implicit encoding/decoding using "ascii" and fails.
> 
> The question I want to posit is where should the str/unicode conversion occur? Should it magically convert arguments to the log functions (as this commit implements)? Should it convert closer to output time? Should it prevent mixed types to log function calls and require callers to pass in specific types?
> 
> I'm kind of a fan of the latter. I think encoding should be done at the periphery. If the logging subsystem is getting of str and unicode types, the caller has already failed.
> 
> So I'm going to gently push back: why are we allowing str to be passed into the logging mechanism at all? Can we limit to unicode? Or, could we limit to unicode *or* str and have the log filter munge to a consistent type?
> 
> I hate Python 2's str/unicode foo. I doubly hate that the log APIs allow you to pass in mixed types. IMO it should provide an API for sending bytes and an API for sending unicode. And you could specify the encoding for bytes in case they need to be decoded. But alas.

An earlier version of this patch just fixed a producer, but as glandium's feedback suggests this could be an issue for any caller using subprocess, so let's just fix it once.
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/54930/diff/3-4/
Attachment #8756038 - Flags: review?(gps)
Comment on attachment 8756038 [details]
MozReview Request: Bug 1274090 - Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium

https://reviewboard.mozilla.org/r/54930/#review53246

I hate text encoding.
Attachment #8756038 - Flags: review?(gps) → review+
Pushed by cmanchester@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aeee5ac6aeb9
Attempt to convert str objects containing non-ascii characters in Python configure rather than failing outright. r=glandium,gps
https://hg.mozilla.org/mozilla-central/rev/aeee5ac6aeb9
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.