Race conditions and possibility for infinite timeout in mozsystemmonitor

RESOLVED FIXED in Firefox 49

Status

Testing
Mozbase
RESOLVED FIXED
2 years ago
a month ago

People

(Reporter: gps, Assigned: gps)

Tracking

Version 3
mozilla49
Points:
---

Firefox Tracking Flags

(firefox49 fixed)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(6 attachments)

(Assignee)

Description

2 years ago
Resource monitor is causing hangs/timeouts in automation. This appears to be a result of upgrading automation from mozsystemmonitor 0.0 to 0.1.

I think I found the bug(s). Patches forthcoming.
(Assignee)

Comment 1

2 years ago
Created attachment 8752358 [details]
MozReview Request: Bug 1272782 - Send tuple properly; r?ahal

Found this bug when auditing the code for issues. We are attempting
to send a tuple but were forgetting the trailing comma on a single
element tuple.

Fortunately, this doesn't appear to impact anything because
the receiving end of the pipe doesn't care what data it receives.

Review commit: https://reviewboard.mozilla.org/r/52545/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/52545/
Attachment #8752358 - Flags: review?(ahalberstadt)
Attachment #8752359 - Flags: review?(ahalberstadt)
Attachment #8752360 - Flags: review?(ahalberstadt)
Attachment #8752361 - Flags: review?(ahalberstadt)
(Assignee)

Comment 2

2 years ago
Created attachment 8752359 [details]
MozReview Request: Bug 1272782 - Wait longer and stop after "done" message; r?ahal

Before, we kept waiting for data in the pipe after receiving the
"done" message. This didn't really make much sense because the
"done" message should be the final thing sent over the pipe!

e9113fd6cdb8 (bug 1239939) recently dropped the poll interval
of the pipe from 1.0 to 0.1s. This appears to have introduced
an intermittent failure in a test. The race condition was
between the child process sending data and the parent process
timing out (after only 0.1s) waiting for that data. Increasing
the timeout makes the failure reproduce less often. Although
technically the race condition is still present! I'm not
inclined to fix it at this time, however.

The rationale for dropping the pipe timeout was that it was
causing lag when terminating short-lived processes. Now that
we abort the pipe reading/polling loop as soon as the "done"
message is received, we no longer poll the pipe after receiving
"done" and no longer have to worry about its timeout impacting
shutdown time.

Review commit: https://reviewboard.mozilla.org/r/52547/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/52547/
(Assignee)

Comment 3

2 years ago
Created attachment 8752360 [details]
MozReview Request: Bug 1272782 - Don't wait forever for child process to exit; r?ahal

I believe this is the source of hangs/timeouts in automation.
join() waits forever. We add code to wait at most N seconds before
force terminating the process. The timeout is a bit high. But it is
better than infinite.

Review commit: https://reviewboard.mozilla.org/r/52549/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/52549/
(Assignee)

Comment 4

2 years ago
Created attachment 8752361 [details]
MozReview Request: Bug 1272782 - Bump mozsystemmonitor version to 0.3; r?ahal

So we can release the bug fixes we just made.

Review commit: https://reviewboard.mozilla.org/r/52551/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/52551/
Comment on attachment 8752358 [details]
MozReview Request: Bug 1272782 - Send tuple properly; r?ahal

https://reviewboard.mozilla.org/r/52545/#review49525
Attachment #8752358 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8752359 [details]
MozReview Request: Bug 1272782 - Wait longer and stop after "done" message; r?ahal

https://reviewboard.mozilla.org/r/52547/#review49527
Attachment #8752359 - Flags: review?(ahalberstadt) → review+
(Assignee)

Comment 7

2 years ago
Created attachment 8752363 [details]
MozReview Request: Bug 1272782 - Use mozsystemmonitor 0.3 in mozharness; r?catlee

Review commit: https://reviewboard.mozilla.org/r/52557/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/52557/
Attachment #8752363 - Flags: review?(catlee)
Attachment #8752360 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8752360 [details]
MozReview Request: Bug 1272782 - Don't wait forever for child process to exit; r?ahal

https://reviewboard.mozilla.org/r/52549/#review49531
Comment on attachment 8752361 [details]
MozReview Request: Bug 1272782 - Bump mozsystemmonitor version to 0.3; r?ahal

https://reviewboard.mozilla.org/r/52551/#review49533
Attachment #8752361 - Flags: review?(ahalberstadt) → review+
(Assignee)

Comment 10

2 years ago
Created attachment 8752367 [details]
mozsystemmonitor-0.3.tar.gz

Please upload this to our public PyPI mirror.
Comment on attachment 8752363 [details]
MozReview Request: Bug 1272782 - Use mozsystemmonitor 0.3 in mozharness; r?catlee

https://reviewboard.mozilla.org/r/52557/#review49539
Attachment #8752363 - Flags: review?(catlee) → review+

Comment 12

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd21c16b035a
https://hg.mozilla.org/integration/mozilla-inbound/rev/c67cb230d266
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7f3a7bff29c
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca9ed07960f5
https://hg.mozilla.org/integration/mozilla-inbound/rev/0569f59c7ce7

Comment 13

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/bd21c16b035a
https://hg.mozilla.org/mozilla-central/rev/c67cb230d266
https://hg.mozilla.org/mozilla-central/rev/f7f3a7bff29c
https://hg.mozilla.org/mozilla-central/rev/ca9ed07960f5
https://hg.mozilla.org/mozilla-central/rev/0569f59c7ce7
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox49: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49

Comment 14

2 years ago
12 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* mozilla-central: 4
* fx-team: 4

Platform breakdown:
* windows7-32: 9
* windowsxp: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1272782&startday=2016-05-10&endday=2016-05-16&tree=all
(Assignee)

Comment 15

2 years ago
And no failures since May 14, so I think this series addressed things.

Updated

a month ago
See Also: → bug 1384062
You need to log in before you can comment on or make changes to this bug.