handleTimeout() and friends should indicate they will kill the process

RESOLVED FIXED in Firefox 49

Status

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: Gijs, Assigned: Gijs)

Tracking

Trunk
mozilla49
Points:
---

Firefox Tracking Flags

(firefox49 fixed)

Details

Attachments

(1 attachment)

As in summary. Right now, reading the log it's not obvious whether the crash stacks that follow are from a crash that happened causing the timeout, or just the result of us killing the process because of the timeout.

The string "application timed out after %d seconds" has quite some hits in DXR; ideally we should update all of the ones that result in us pskill'ing the process.
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

https://reviewboard.mozilla.org/r/51733/#review48739

It might also be nice to add a new log line to killAndGetStack along the lines of:
"Forcing crash to get stack"

Which might make it even more obvious that this is not a legit crash.

::: testing/mochitest/runtests.py:2444
(Diff revision 1)
>      def handleTimeout(self, timeout, proc, utilityPath, debuggerInfo,
>                        browser_pid, processLog):
>          """handle process output timeout"""
>          # TODO: bug 913975 : _processOutput should call self.processOutputLine
>          # one more time one timeout (I think)
> -        error_message = "TEST-UNEXPECTED-TIMEOUT | %s | application timed out after %d seconds with no output" % (
> +        error_message = "TEST-UNEXPECTED-TIMEOUT | %s | application timed out after %d seconds with no output - going to terminate it." % (

nit: might as well use a comma here for consistency
Attachment #8751004 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

I'm really hoping this change in signature doesn't break all the "application timed out" oranges we have on file. Offhand, I don't *think* it will, but it does make me nervous.

I'm a fan of ahal's idea of putting a second line in the log instead, along the lines of "INFO - Force-crashing the hung process to get a stack." or something to that effect.
Attachment #8751004 - Flags: review?(ryanvm)
radar+

(In reply to Ryan VanderMeulen [:RyanVM] from comment #3)
> Comment on attachment 8751004 [details]
> MozReview Request: Bug 1270144 - fix message from timed out tests to
> indicate the following crash is caused by us terminating the browser
> process, r?ted,RyanVM
> 
> I'm really hoping this change in signature doesn't break all the
> "application timed out" oranges we have on file. Offhand, I don't *think* it
> will, but it does make me nervous.
> 
> I'm a fan of ahal's idea of putting a second line in the log instead, along
> the lines of "INFO - Force-crashing the hung process to get a stack." or
> something to that effect.

Fair enough. Will update the patch.
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51733/diff/1-2/
Attachment #8751004 - Attachment description: MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ted,RyanVM → MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM
Attachment #8751004 - Flags: review?(ryanvm)
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

I looked at putting this in killAndGetStack(), but it seems that gets called for child processes as well so it'd spam the log. :-\

I'm using warning() rather than info() because I want it to show up in the treeherder summary.
Attachment #8751004 - Flags: review+ → review?(ahalberstadt)
Attachment #8751004 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

https://reviewboard.mozilla.org/r/51733/#review49247
Comment on attachment 8751004 [details]
MozReview Request: Bug 1270144 - fix message from timed out tests to indicate the following crash is caused by us terminating the browser process, r?ahal,RyanVM

https://reviewboard.mozilla.org/r/51733/#review49251

Wording feels a bit awkward to me. How about "Force-terminating active process(es)." instead?
Attachment #8751004 - Flags: review?(ryanvm) → review+
https://hg.mozilla.org/mozilla-central/rev/cc0133021748
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
(In reply to :Gijs Kruitbosch from comment #6)
> I'm using warning() rather than info() because I want it to show up in the
> treeherder summary.

This doesn't seem to have helped. What's guaranteed to show up in the treeherder logs?
Status: RESOLVED → REOPENED
Flags: needinfo?(ahalberstadt)
Resolution: FIXED → ---
You mean the blue summary? Yeah, pretty sure that needs to be at the ERROR level. Ed, is there another way to get a log in the treeherder summary? Or is using ERROR the best bet?
Flags: needinfo?(ahalberstadt) → needinfo?(emorley)
Could you paste links to some sample logs and/or lines from them? :-)
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #13)
> Could you paste links to some sample logs and/or lines from them? :-)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=20211baffa40&selectedJob=21062223

raw log looks like:

12:23:15  WARNING -  TEST-UNEXPECTED-TIMEOUT | browser/components/search/test/browser_contextSearchTabPosition.js | application timed out after 330 seconds with no output
12:23:15     INFO -  Force-terminating active process(es).
12:23:15     INFO -  Determining child pids from psutil
12:23:15     INFO -  Found child pids: []
12:23:15     INFO -  Killing process: 492
12:23:15     INFO -  Not taking screenshot here: see the one that was previously logged
12:23:15     INFO -  TEST-INFO | Main app process: exit 1
12:23:15  WARNING -  TEST-UNEXPECTED-FAIL | browser/components/search/test/browser_contextSearchTabPosition.js | application terminated with exit code 1

The code is here:

https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#2443-2450

        error_message = "TEST-UNEXPECTED-TIMEOUT | %s | application timed out after %d seconds with no output" % (
            self.lastTestSeen, int(timeout))
        self.message_logger.dump_buffered()
        self.message_logger.buffering = False
        self.log.info(error_message)
        self.log.warning("Force-terminating active process(es).");
Flags: needinfo?(emorley)
Oh, in that case this could be a bug in the tbplformatter not including the level in the log message. It could be doing this to avoid redundant log levels due to the mozharness info level.

Assuming treeherder will highlight WARNING, then this is likely a bug in mozharness. The proper fix is to get mozharness to read from the structured log (bug 1261194). This way, mozharness can grab the 'level' from the JSON object and we don't need to print it twice.

Ed, I'll leave the needinfo for this question: will log lines with WARNING in them get highlighted?
Treeherder will match on {ERROR,CRITICAL,FATAL}:
https://github.com/mozilla/treeherder/blob/17eb4bc74cd4851b52bf12e087ef8d624dbb028c/treeherder/log_parser/parsers.py#L405

And additionally on any of the message strings that match:
https://github.com/mozilla/treeherder/blob/17eb4bc74cd4851b52bf12e087ef8d624dbb028c/treeherder/log_parser/parsers.py#L356-L395

Ideally this should use `.error()` - long term I'd like to see the list of hardcoded error strings reduce in size, rather than grow, as more suites/steps are wrapped in mozharness-style error level prefixes.
Flags: needinfo?(emorley)
https://hg.mozilla.org/mozilla-central/rev/e2f558143b0b
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
(In reply to Ed Morley [:emorley] from comment #16)
> Treeherder will match on {ERROR,CRITICAL,FATAL}:
> https://github.com/mozilla/treeherder/blob/
> 17eb4bc74cd4851b52bf12e087ef8d624dbb028c/treeherder/log_parser/parsers.
> py#L405
> 
> And additionally on any of the message strings that match:
> https://github.com/mozilla/treeherder/blob/
> 17eb4bc74cd4851b52bf12e087ef8d624dbb028c/treeherder/log_parser/parsers.
> py#L356-L395
> 
> Ideally this should use `.error()` - long term I'd like to see the list of
> hardcoded error strings reduce in size, rather than grow, as more
> suites/steps are wrapped in mozharness-style error level prefixes.

This *still* doesn't work: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=29519107#L3965

It doesn't look like these log levels work the way you'd expect as they end up in this log as "INFO" first and "ERROR" later, and the treeherder matching expression only cares about the first part of the string. Andrew, do you know why?
Flags: needinfo?(ahalberstadt)
Summary: https://dxr.mozilla.org/mozilla-central/rev/0a25833062a880f369e6f9f622413a94cc671bf4/testing/mochitest/runtests.py#2444-2445 and friends should indicate they will kill the process → handleTimeout() and friends should indicate they will kill the process
Ugh, regex log parsing is horrible and this is a great example why. There's actually two places that the log is being parsed via regexes, mozharness and treeherder.

First mozharness has its own log levels, and uses regexes to determine when something should get logged above INFO:
https://dxr.mozilla.org/mozilla-central/rev/c9edfe35619f69f7785776ebd19a3140684024dc/testing/mozharness/mozharness/mozilla/testing/errors.py#92

Notice there's nothing about ERROR in there, so that's why your log is still showing up with the INFO mozharness level. Secondly, treeherder now parses the (already processed by mozharness) log. Going from what Ed linked:
https://github.com/mozilla/treeherder/blob/17eb4bc74cd4851b52bf12e087ef8d624dbb028c/treeherder/log_parser/parsers.py#L405

It looks like ERROR is only highlighted if it appears at the start of the line (i.e if mozharness thinks it's an error). So we either need to:
1) Modify mozharness regexes to flag "93 ERROR" at the error level
2) Modify treeherder regexes to look at the whole string, not just the beginning
3) Use log.info("TEST-UNEXPECTED-ERROR | ...")

Dealing with shit like this is why there are so many "TEST-UNEXPECTED" logs even when it's not related to a specific test. Alternatively you can leave it as is and wait for bug 1068138 and dependencies. If it's important to get fixed asap, just go with 3). The first two options might be hard to land.
Flags: needinfo?(ahalberstadt)
You need to log in before you can comment on or make changes to this bug.