Closed Bug 1384062 Opened 7 years ago Closed 7 years ago

AssertionError: assert self._running when shutting down resource monitor

Categories

(Testing :: Mozbase, defect)

defect
Not set
normal

Tracking

(firefox56 wontfix, firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: sa4zet, Assigned: glandium)

References

Details

Attachments

(2 files)

Attached file build.out
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0
Build ID: 20170707114307

Steps to reproduce:

hg clone https://hg.mozilla.org/releases/mozilla-beta/ beta
export SHELL=/bin/bash
export MOZ_CURRENT_PROJECT=browser
export MOZCONFIG=./mozconfig
./mach build

mozconfig:
mk_add_options MOZ_MAKE_FLAGS="-j8"
mk_add_options MOZ_BUILD_PROJECTS="browser"
ac_add_options --enable-application=browser
ac_add_options --disable-debug
ac_add_options --disable-pulseaudio
ac_add_options --enable-alsa
ac_add_options --disable-tests



Actual results:

23:40.38 Packaging quitter@mozilla.org.xpi...
23:40.45 0 compiler warnings present.
Error running mach:

    ['build']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

AssertionError

  File "/root/mozilla-beta/python/mozbuild/mozbuild/mach_commands.py", line 528, in build
    monitor.finish(record_usage=status==0)
  File "/root/mozilla-beta/python/mozbuild/mozbuild/mach_commands.py", line 241, in __exit__
    self.monitor.stop_resource_recording()
  File "/root/mozilla-beta/python/mozbuild/mozbuild/controller/building.py", line 258, in stop_resource_recording
    self.resources.stop()
  File "/root/mozilla-beta/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 289, in stop
    assert self._running



Expected results:

mach build end with success
Component: General → Build Config
Product: Release Engineering → Core
This bug is known to me. I suspect it is a dupe. But I can't find it right now.

It is a mostly harmless (but annoying) error. The underlying bug is in the resource monitor not handling shutting down when it is not running. It should probably silently fail unless the API was clearly used incorrectly.
Summary: mach build failed on beta → AssertionError: assert self._running when shutting down resource monitor
stop_resource_recording is called twice from python/mozbuild/mozbuild/controller/building.py and python/mozbuild/mozbuild/mach_commands.py.
Assignee: nobody → mh+mozilla
Component: Build Config → Mozbase
Product: Core → Testing
Comment on attachment 8897697 [details]
Bug 1384062 - Make SystemResourceMonitor.stop more resilient to errors.

Turns out the problem is way more subtle than this, and this patch actually fixes nothing.

The core problem is that there's either a bug in python or in the kernel, and either way, it's going to require weird hacks.

So, it turns out, BuildMonitor.stop_resource_recording has its own tracking, and *does* avoid calling stop again. The way it does it is:

   def stop_resource_recording(self):
       if self._resources_started:
           self.resources.stop()

       self._resources_started = False

Now, what's actually happening is that *during* self.resources.stop (so, in SystemResourceMonitor.stop), an exception is thrown (more on that later), which makes the self._resources_started = False skipped, *and* makes the BuildOutputManager context manager invoke its __exit__, which calls stop_resource_recording again, which is how we end up calling SystemResourceMonitor.stop a second time, because self._resource_started is still True.

And an exception is thrown from this line in SystemResourceMonitor.stop:
    while self._pipe.poll(1.0):

yes, the exception comes from poll. But here's the twist. Even though poll failed, there's still something to receive: if I wrap the while in a try/except, and do a _pipe.recv() in the except, I do receive the 'done' message from the end of _collect.

Printing out the exception message is instructive about what python thinks is going wrong:
poll() gave POLLNVAL or POLLERR

Looking at strace is instructive about what the underlying poll system call does:
poll([{fd=7, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=7, revents=POLLIN|POLLERR|POLLHUP}])

Yup, it's telling: there's stuff to read *and* an error *and* the other end hanged up.

I'm not sure the kernel is right to be returning POLLERR here, but the fact is it does, and that makes python unhappy, and there's apparently no clean way to detect this error case, because the exception's errno is None, and there's no way to get the underlying poll return values.

A reliable way to make the error happen, because it's dependent on a race between the collect thread closing the pipe and stop trying to read from it, is to add a time.sleep(1) before the loop.

That being said, I'm not sure why poll is being used at all here...
Attachment #8897697 - Flags: review?(ahalberstadt)
Comment on attachment 8897697 [details]
Bug 1384062 - Make SystemResourceMonitor.stop more resilient to errors.

https://reviewboard.mozilla.org/r/168986/#review174522

I think this makes sense, though I'm not all that familiar with mozsystemresourcemonitor. If you have any doubts about this, please flag gps instead.
Attachment #8897697 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8897697 [details]
Bug 1384062 - Make SystemResourceMonitor.stop more resilient to errors.

https://reviewboard.mozilla.org/r/168986/#review174560

::: testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py:304
(Diff revision 2)
> -        while self._pipe.poll(1.0):
> +        while True:
> +            try:
> -            start_time, end_time, io_diff, cpu_diff, cpu_percent, virt_mem, \
> +                start_time, end_time, io_diff, cpu_diff, cpu_percent, virt_mem, \
> -                swap_mem = self._pipe.recv()
> +                    swap_mem = self._pipe.recv()

`multiprocess.Pipe.recv()` can block. That's why the `poll()` is here.

With this patch applied locally, I was able to hang a `mach build` by issuing a ^C during just the right moment.

```
 0:05.32 checking for getpagesize... yes
Process Process-1:
Traceback (most recent call last):
  File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
 0:05.34 checking for gmtime_r... yes
 0:05.34 Traceback (most recent call last):
 0:05.34   File "/home/gps/src/firefox/configure.py", line 124, in <module>
 0:05.35     sys.exit(main(sys.argv))
 0:05.35   File "/home/gps/src/firefox/configure.py", line 29, in main
 0:05.35     sandbox.run(os.path.join(os.path.dirname(__file__), 'moz.configure'))
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 398, in run
 0:05.35     self.include_file(path)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 389, in include_file
 0:05.35 Traceback (most recent call last):
 0:05.35   File "/home/gps/src/firefox/configure.py", line 124, in <module>
 0:05.35     sys.exit(main(sys.argv))
 0:05.35     exec_(code, self)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 59, in exec_
 0:05.35   File "/home/gps/src/firefox/configure.py", line 29, in main
 0:05.35     sandbox.run(os.path.join(os.path.dirname(__file__), 'moz.configure'))
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 427, in run
 0:05.35     exec(object, globals, locals)
 0:05.35   File "/home/gps/src/firefox/moz.configure", line 7, in <module>
 0:05.35     include('build/moz.configure/init.configure')
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 693, in include_impl
 0:05.35     func(*args)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 473, in _value_for
 0:05.35     self.include_file(what)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 389, in include_file
 0:05.35     return self._value_for_depends(obj, need_help_dependency)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 925, in method_call
 0:05.35     exec_(code, self)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 59, in exec_
 0:05.35     exec(object, globals, locals)
 0:05.35   File "/home/gps/src/firefox/build/moz.configure/init.configure", line 160, in <module>
    self.run()
  File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gps/src/firefox/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 96, in _collect
    while not pipe.poll(sleep_interval):
KeyboardInterrupt
 0:05.35     @imports('distutils.sysconfig')
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 676, in decorator
 0:05.35     cache[args] = self.func(instance, *args)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 482, in _value_for_depends
 0:05.35     depends = DependsFunction(self, func, dependencies, when=when)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 95, in __init__
 0:05.35     sandbox._value_for(self)
 0:05.35   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 473, in _value_for
 0:05.35     return obj.result(need_help_dependency)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 925, in method_call
 0:05.36     return self._value_for_depends(obj, need_help_dependency)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 925, in method_call
 0:05.36     cache[args] = self.func(instance, *args)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 122, in result
 0:05.36     return self._func(*resolved_args)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 1001, in wrapped
 0:05.36     cache[args] = self.func(instance, *args)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 482, in _value_for_depends
 0:05.36     return obj.result(need_help_dependency)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/util.py", line 925, in method_call
 0:05.36     return new_func(*args, **kwargs)
 0:05.36   File "/home/gps/src/firefox/build/moz.configure/old.configure", line 350, in old_configure
 0:05.36     ret = subprocess.call(cmd)
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 523, in call
 0:05.36     cache[args] = self.func(instance, *args)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 122, in result
 0:05.36     return self._func(*resolved_args)
 0:05.36   File "/home/gps/src/firefox/python/mozbuild/mozbuild/configure/__init__.py", line 1001, in wrapped
 0:05.36     return new_func(*args, **kwargs)
 0:05.36   File "/home/gps/src/firefox/build/moz.configure/init.configure", line 217, in virtualenv_python
 0:05.36     sys.exit(subprocess.call([python] + sys.argv))
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 523, in call
 0:05.36     return Popen(*popenargs, **kwargs).wait()
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 1392, in wait
 0:05.36     pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
 0:05.36     return func(*args)
 0:05.36 KeyboardInterrupt
 0:05.36     return Popen(*popenargs, **kwargs).wait()
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 1392, in wait
 0:05.36     pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
 0:05.36   File "/home/gps/.pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
 0:05.36     return func(*args)
 0:05.36 KeyboardInterrupt
 0:05.36 client.mk:384: recipe for target 'configure' failed
 0:05.36 make: *** [configure] Interrupt
```
Attachment #8897697 - Flags: review-
(In reply to Gregory Szorc [:gps] from comment #7)
> Comment on attachment 8897697 [details]
> Bug 1384062 - Remove the poll loop from SystemResourceMonitor.stop.
> 
> https://reviewboard.mozilla.org/r/168986/#review174560
> 
> ::: testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py:304
> (Diff revision 2)
> > -        while self._pipe.poll(1.0):
> > +        while True:
> > +            try:
> > -            start_time, end_time, io_diff, cpu_diff, cpu_percent, virt_mem, \
> > +                start_time, end_time, io_diff, cpu_diff, cpu_percent, virt_mem, \
> > -                swap_mem = self._pipe.recv()
> > +                    swap_mem = self._pipe.recv()
> 
> `multiprocess.Pipe.recv()` can block. That's why the `poll()` is here.

It doesn't matter that it blocks. It only blocks if the other end is still alive and waiting to send data.

> With this patch applied locally, I was able to hang a `mach build` by
> issuing a ^C during just the right moment.

This makes no sense, and I can't even reproduce, no matter how hard I try.
Comment on attachment 8897697 [details]
Bug 1384062 - Make SystemResourceMonitor.stop more resilient to errors.

https://reviewboard.mozilla.org/r/168986/#review175148

This looks much more resilient!

I suspect there's edge cases where we need to catch detect KeyboardInterrupt (which doesn't inherit from Exception). But this is strictly better.
Attachment #8897697 - Flags: review?(gps) → review+
I suspect KeyboardInterrupt doesn't cause problems besides large tracebacks. First because apparently it's usually caught by the collection process, which leads to the pipe being in a non-pipe state, which is why you got the hang in comment 7, and we're now dealing with that. And if a KeyboardInterrupt happens on the main process during stop(), well, that just interrupts stop and prints a tracebacks but that should have no bad consequences anyways (at least it shouldn't leave us hanging).
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/4e9cd1d537cc
Make SystemResourceMonitor.stop more resilient to errors. r=ahal,gps
Attachment #8897697 - Flags: review+ → review?(gps)
Comment on attachment 8897697 [details]
Bug 1384062 - Make SystemResourceMonitor.stop more resilient to errors.

https://reviewboard.mozilla.org/r/168986/#review175162
Attachment #8897697 - Flags: review?(gps) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/8d9ae8c45dd0
Make SystemResourceMonitor.stop more resilient to errors. r=ahal,gps
https://hg.mozilla.org/mozilla-central/rev/8d9ae8c45dd0
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
my bugs have been marked as duplicate of this bug and I still cannot build Firefox
Ran into this problem attempting to build release 56.0.
See Also: → 1239939, 1272782
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: