Closed Bug 1622257 Opened 4 years ago Closed 4 years ago

Talos terminating processes that take too long to shut down should fail the task more reliably and work correctly on Windows

Categories

(Testing :: Talos, defect, P1)

Version 3
defect

Tracking

(firefox76 fixed)

RESOLVED FIXED
mozilla76
Tracking Status
firefox76 --- fixed

People

(Reporter: Gijs, Assigned: Gijs)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Talos' process wrapper has some shutdown logic ( https://searchfox.org/mozilla-central/rev/278046367dab878316f60f0bd7f740cf73f3c447/testing/talos/talos/talos_process.py#148-152 ) where it'll kill the process if it stops responding.

The resulting crashes get found and reported and break the task (see deps of bug 1339594).

There are two missing bits here:

  • having to terminate the process should just fail the task anyway. If the crashreporter fails or the minidumps end up in the wrong directory or, like on Windows, the launcher process means we fail to terminate the process, we should still be causing orange immediately. It seems right now, the fact that we don't do this is hiding shutdown hangs: the timeouts in bug 1598924 are caused by us ending the previous instance of Firefox, that only terminating the launcher (not the "real" Firefox), and then hanging on the next startup. When the same shutdown hang happens on Linux, we successfully terminate the browser in the first instance, so the next startup works fine, and the resulting run is green, but the shutdown issue is the same and should also be turning the linux job orange.
  • on Windows, we should terminate the real parent process, not the launcher.

I have patches for both of these.

I thought we'd be OK for crashreports based on bug 1598924 comment #27, but re-reading, I don't actually think this is the case.

(In reply to Gabriele Svelto [:gsvelto] from bug 1598924 comment #27)

It seems to me that the code is already doing the right thing by calling mozcrash.kill_and_get_minidump() (see here).

Fixed the link to have a line number. However, this is only the case where we time out waiting for the event. Here's the code:

        # wait until we saw __endTimestamp in the proc output,
        # or the browser just terminated - or we have a timeout
        if not event.wait(timeout):
            LOG.info("Timeout waiting for test completion; killing browser...")
            # try to extract the minidump stack if the browser hangs
            mozcrash.kill_and_get_minidump(proc.pid, minidump_dir)
            raise TalosError("timeout")
        if reader.got_end_timestamp:
            for i in range(1, wait_for_quit_timeout):
                if proc.wait(1) is not None:
                    break
            if proc.poll() is None:
                LOG.info(
                    "Browser shutdown timed out after {0} seconds, terminating"
                    " process.".format(wait_for_quit_timeout)
                )

If the talos run itself outputs the number for which we wait (which it does in this case), we then progress to waiting for the process to end. We do this for 5 seconds by default. If that doesn't work, we progress to killing the process differently:

    finally:
        # this also handle KeyboardInterrupt
        # ensure early the process is really terminated
        return_code = None
        try:
            return_code = context.kill_process()
            if return_code is None:
                return_code = proc.wait(1)
        except Exception:
            # Maybe killed by kill_and_get_minidump(), maybe ended?
            LOG.info("Unable to kill process")
            LOG.info(traceback.format_exc())

This doesn't use mozcrash, and so we don't get a dump.

For that to work however an exception handler must be installed and the launcher process doesn't have one which is why I think we're not getting a crash report. On the other hand killing the browser process that way should always yield a crash report.

Right. I suspect this is also true on Windows, ie we kill the "wrong" process anyway.

Gabriele, can you sanity-check the above?

Flags: needinfo?(gsvelto)

I've got limited experience with this code but I agree with your analysis. context.kill_process() will never yield a minidump on Linux because it's using SIGKILL IIRC (you'd need SIGABRT do get one) and on Windows one would need to explicitly grab a mindump first, like mozcrash does.

Flags: needinfo?(gsvelto)

This ensures that we shut down the "real" Firefox, rather than the launcher process,
when forcing process termination. It also ensures that, even when shutdown times out,
we use mozcrash so we get minidump information.

Depends on D66783

Attachment #9133217 - Attachment description: Bug 1622257 - shutdown hangs should fail the talos job, r?gbrown → Bug 1622257 - shutdown hangs should fail the talos job, r?#perftest-reviewers
Attachment #9133218 - Attachment description: Bug 1622257 - improve handling of crashes and shutdown hangs on talos runs, especially on Windows, r?gbrown → Bug 1622257 - improve handling of crashes and shutdown hangs on talos runs, especially on Windows, r?#perftest-reviewers
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/182f6b349371
shutdown hangs should fail the talos job, r=perftest-reviewers,whimboo
https://hg.mozilla.org/integration/autoland/rev/3537ff03365b
improve handling of crashes and shutdown hangs on talos runs, especially on Windows, r=perftest-reviewers,whimboo,gbrown
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
Depends on: 1623917
Blocks: 1625888
Blocks: 1629005
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: