Closed Bug 840305 Opened 7 years ago Closed 7 years ago

add timeout retry to ScriptFactory for mozharness unittests ("command timed out: 1200 seconds without output, attempting to kill" or "command timed out: 3600 seconds without output, attempting to kill" during clone of tools repo)

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

Platform: x86 macOS
Type: defect
Priority: Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: aki)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mozharness][unittest])

Attachments

(2 files, 6 obsolete files)

This is specifically for hg pulls that take too long and get killed by buildbot.
I think ideally we would time out in mozharness before the buildbot timeout fires, so any parsing logic could live there, but any sort of retry here that doesn't cause other issues would be good.

https://tbpl.mozilla.org/php/getParsedLog.php?id=19634838&tree=Cedar#error0

08:21:16     INFO - #####
08:21:16     INFO - ##### Running pull step.
08:21:16     INFO - #####
08:21:16     INFO - Changing directory to /Users/cltbld/talos-slave/test/build.
08:21:16     INFO - Setting /Users/cltbld/talos-slave/test/build/tools to http://hg.mozilla.org/build/tools.
08:21:16     INFO - Cloning http://hg.mozilla.org/build/tools to /Users/cltbld/talos-slave/test/build/tools.
08:21:16     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', 'clone', 'http://hg.mozilla.org/build/tools', '/Users/cltbld/talos-slave/test/build/tools']
08:21:16     INFO - Copy/paste: hg --config ui.merge=internal:merge clone http://hg.mozilla.org/build/tools /Users/cltbld/talos-slave/test/build/tools

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1240.755553
The rc_eval_func probably won't work here because
a) we're using regex_eval_func, and
b) the killed command does have an exit code (-1).
Unless we figure out a solution, I'm going to not block rollout on this bug.
No longer blocks: 793022
Not actively working on this bug, nor do I have a solution yet.
Assignee: aki → nobody
Duplicate of this bug: 847920
Bug 848227 comment 4 may help for download_file() timeouts like this.  (Dependent on python 2.6+)
We could also do something like https://duckduckgo.com/?q=subprocess+timeout+python for comment 0.
Summary: add timeout retry to ScriptFactory for mozharness unittests → add timeout retry to ScriptFactory for mozharness unittests ("command timed out: 1200 seconds without output, attempting to kill" during clone of tools repo)
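To illustrate the "subprocess timeout python" idea from the comment above: modern Python (3.3+) has a built-in timeout in the subprocess module, though it wasn't available on the Python 2.x that mozharness ran on at the time, and it is a total-runtime limit rather than buildbot's "seconds without output" idle limit. A minimal sketch:

```python
import subprocess

# subprocess.run(timeout=...) kills the child and raises TimeoutExpired
# when the limit is hit. Note: this caps total runtime, unlike buildbot's
# no-output (idle) timeout that this bug is about.
try:
    subprocess.run(["sleep", "5"], timeout=1)
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True
```

This assumes a POSIX `sleep` binary is available; on Python 2.x a watchdog thread or select loop is needed instead.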
Aki, any chance we could bump the priority on this?
Flags: needinfo?(aki)
I'll see what I can do.
Assignee: nobody → aki
Flags: needinfo?(aki)
Attached patch mozharness idle timeout (obsolete) — Splinter Review
This passes unit tests on linux+osx.

Adds an idle timeout with a non-blocking read in run_command() in a thread, and watches for a timeout since we last got output.

Windows is barfing on the unit tests, but it's also barfing on the non-patched unit tests, so I need to fix that up.
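The approach described in this attachment (a reader thread plus an idle-timeout watch in run_command()) can be sketched roughly as follows. This is a hypothetical simplification under assumed names (`run_with_idle_timeout`), not the actual mozharness patch, which also feeds lines to an OutputParser:

```python
import subprocess
import threading
import time
from queue import Queue, Empty  # the Queue module on Python 2

def run_with_idle_timeout(cmd, idle_timeout=1200):
    """Run cmd, killing it if it prints nothing for idle_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    lines = Queue()

    def reader():
        # Blocking reads happen in this thread, so the main thread
        # never blocks on the pipe itself.
        for line in iter(proc.stdout.readline, b""):
            lines.put(line)
        proc.stdout.close()

    t = threading.Thread(target=reader)
    t.daemon = True
    t.start()

    last_output = time.time()
    output = []
    while True:
        try:
            # Block briefly on the queue instead of busy-looping.
            line = lines.get(timeout=1)
            output.append(line)
            last_output = time.time()
        except Empty:
            if proc.poll() is not None and not t.is_alive():
                break  # process exited and the queue is drained
            if time.time() - last_output > idle_timeout:
                proc.kill()
                proc.wait()
                return -1, b"".join(output)
    return proc.wait(), b"".join(output)
```

A quick command returns as soon as the reader thread hits EOF and the queue drains, rather than waiting for the full idle timeout.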
Depends on: 857853
Attachment #733094 - Flags: review?(catlee)
Comment on attachment 733094 [details] [diff] [review]
mozharness idle timeout

Review of attachment 733094 [details] [diff] [review]:
-----------------------------------------------------------------

r- for busy looping, assuming I'm reading this correctly.

::: mozharness/base/script.py
@@ +58,5 @@
> +        if not next_line:
> +            continue
> +        parser.add_lines(next_line)
> +        next_line = None
> +    proc.stdout.close()

some comments in this function would be great

@@ +557,5 @@
> +            handle = ctypes.windll.kernel32.OpenProcess(PROCESS_TERMINATE, False, pid)
> +            ctypes.windll.kernel32.TerminateProcess(handle, -1)
> +            ctypes.windll.kernel32.CloseHandle(handle)
> +        else:
> +            os.kill(pid, signal.SIGKILL)

it would be nice to have the option of not killing the process so abruptly to begin with. SIGKILL is kill -9, so we're not giving subprocesses the chance to clean up.
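One common graceful-shutdown pattern that addresses this review comment (an illustration, not what the r-'ed patch does) is to escalate from SIGTERM to SIGKILL only after a grace period:

```python
import time

def kill_gracefully(proc, grace=5):
    """Ask proc (a subprocess.Popen) to exit, escalating after `grace` seconds.

    terminate() sends SIGTERM on POSIX (TerminateProcess on Windows, which
    is still abrupt there); the child gets `grace` seconds to clean up
    before the un-ignorable SIGKILL.
    """
    proc.terminate()
    deadline = time.time() + grace
    while time.time() < deadline:
        if proc.poll() is not None:
            return proc.returncode  # exited on its own after SIGTERM
        time.sleep(0.1)
    proc.kill()  # SIGKILL: no chance to clean up, last resort only
    return proc.wait()
```

On POSIX the returncode is negative (the signal number) when the child dies from a signal.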

@@ +641,5 @@
> +                            self.log("Still not dead.  Giving up.", level=idle_error_level)
> +                            # We could |p.wait(); if t and t.is_alive(): t.join()| here
> +                            # However, blocking on something finishing when we're trying
> +                            # to time out seems counter-intuitive
> +                break

If I'm reading this right, it looks like we're busy looping here until idle_timeout happens or the child finishes? This is going to needlessly burn up a CPU core.

Also, if no idle_timeout is specified we busy loop until the child exits.
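One way to wait for output without spinning, as this review asks for, is to sleep in select() with a timeout (a POSIX-only sketch under an assumed name, not the actual interdiff):

```python
import os
import select
import time

def poll_output(proc, idle_timeout=1200):
    """Collect proc.stdout without busy-looping (POSIX only).

    select() sleeps until the pipe is readable or the timeout expires,
    so no CPU core is burned while the child is quiet.
    """
    last_output = time.time()
    chunks = []
    while True:
        remaining = idle_timeout - (time.time() - last_output)
        if remaining <= 0:
            proc.kill()  # idle timeout hit
            break
        readable, _, _ = select.select([proc.stdout], [], [],
                                       min(remaining, 1.0))
        if readable:
            data = os.read(proc.stdout.fileno(), 4096)
            if not data:
                break  # EOF: the child closed its end of the pipe
            chunks.append(data)
            last_output = time.time()
    proc.wait()
    return b"".join(chunks)
```

Because EOF makes the pipe readable, a fast command falls out of the loop immediately instead of waiting for any timeout to expire.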
Attachment #733094 - Flags: review?(catlee) → review-
Blocks: 847727
Attached patch timeout interdiff (obsolete) — Splinter Review
An attempt to address review comments.
Applies on top of previous r-'ed patch.
Attachment #734145 - Flags: feedback?(catlee)
Comment on attachment 734145 [details] [diff] [review]
timeout interdiff

Review of attachment 734145 [details] [diff] [review]:
-----------------------------------------------------------------

yeah, this looks better. how does it behave? does it return quickly enough for fast commands?
Attachment #734145 - Flags: feedback?(catlee) → feedback+
I wasn't sure how to test the quick-command thing; I'm just doing 3 echoes at < 1s each.

Not sure I'm thrilled about timing-based tests, but I'm not sure how else to test this either.
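Timing-based tests along these lines might look like the following, assuming a POSIX environment; this is a hypothetical sketch, not the actual mozharness test suite:

```python
import subprocess
import time
import unittest

class TimeoutBehaviorTest(unittest.TestCase):
    """Timing-based checks in the spirit of the tests described above."""

    def test_fast_command_is_not_delayed(self):
        # Even with a large timeout configured, a sub-second command
        # must return as soon as it exits, not when the timeout fires.
        start = time.time()
        rc = subprocess.call(["echo", "fast"], stdout=subprocess.DEVNULL)
        self.assertEqual(rc, 0)
        self.assertLess(time.time() - start, 2.0)

    def test_silent_command_is_killed(self):
        # A command that never prints should die near the idle limit,
        # not run to completion.
        idle_timeout = 1
        proc = subprocess.Popen(["sleep", "30"], stdout=subprocess.PIPE)
        start = time.time()
        try:
            proc.wait(timeout=idle_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        self.assertLess(time.time() - start, idle_timeout + 2)
```

The downside, as noted, is that such assertions can go intermittent on a loaded machine, which is exactly the kind of flakiness this bug is trying to remove.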
Attachment #733094 - Attachment is obsolete: true
Attachment #734145 - Attachment is obsolete: true
Attachment #736511 - Flags: review?(catlee)
Attached patch fix windows tests (obsolete) — Splinter Review
Oops.  With this change, these tests pass on windows as well.
Attachment #736511 - Attachment is obsolete: true
Attachment #736511 - Flags: review?(catlee)
Attachment #737782 - Flags: review?(catlee)
Attachment #737782 - Flags: review?(catlee) → review+
Comment on attachment 737782 [details] [diff] [review]
fix windows tests

https://hg.mozilla.org/build/mozharness/rev/d83f0ecb2b6b

I'm going to push to cedar to do one more level of testing.
Attachment #737782 - Flags: checked-in+
Merged to production.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Backed out of production due to unexplained oranges on osx.
I don't see how it would affect it, but playing safe.
http://hg.mozilla.org/build/mozharness/rev/6197294a06ea
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 737782 [details] [diff] [review]
fix windows tests

Reverting on default as well, to avoid inadvertent merges.
http://hg.mozilla.org/build/mozharness/rev/8da0ee335860
Attachment #737782 - Flags: checked-in+ → checked-in-
(In reply to Aki Sasaki [:aki] from comment #143)
> Backed out of production due to unexplained oranges on osx.
> I don't see how it would affect it, but playing safe.
> http://hg.mozilla.org/build/mozharness/rev/6197294a06ea

To provide some more context, this was somehow causing timeout failures across a range of tests on all platforms and on multiple trees. The timeouts cleared up once this was reverted.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #145)
> (In reply to Aki Sasaki [:aki] from comment #143)
> > Backed out of production due to unexplained oranges on osx.
> > I don't see how it would affect it, but playing safe.
> > http://hg.mozilla.org/build/mozharness/rev/6197294a06ea
> 
> To provide some more context, this was somehow causing timeout failures
> across a range of tests on all platforms and on multiple trees. The timeouts
> cleared up once this was reverted.

I wonder if this was causing oranges, or if it was just un-hiding oranges that are hidden by this bug's command timeout.  My other guess is maybe my logging algorithm is taking too many system resources and causing issues with test timing.

I can try a different approach later.
(In reply to Aki Sasaki [:aki] from comment #154)
> I wonder if this was causing oranges, or if it was just un-hiding oranges
> that are hidden by this bug's command timeout.

Given the volume we were seeing, that's a scary thought if true...
(In reply to Aki Sasaki [:aki] from comment #154)
> I can try a different approach later.

Aki, is this still on your radar?
Flags: needinfo?(aki)
It is.
It's going to take time, since I don't know what caused the oranges last time, and that was my best shot at solving this.  I'm going to have to take a different approach, and I don't know what that will be yet.
Flags: needinfo?(aki)
Any news on this since comment 205? :-)
Flags: needinfo?(aki)
(In reply to Ed Morley [:edmorley UTC+1] from comment #298)
> Any news on this since comment 205? :-)

Nope, same.
Working on b2g and hg-git for the time being.
Flags: needinfo?(aki)