Bug 1093804 (foopy56)

foopy56 problem tracking

RESOLVED FIXED

Status

Product: Release Engineering
Component: Buildduty
Status: RESOLVED FIXED
Opened: 4 years ago
Last modified: a year ago

People

(Reporter: coop, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

4 years ago
foopy56 has been showing load spikes for the past day. It may have too many pandas associated with it, or it may be experiencing hardware issues.
(Reporter)

Comment 1

4 years ago
I've disabled panda-0295. I may disable more if the load doesn't drop.
(Reporter)

Comment 2

4 years ago
Logging into the foopy, I see multiple pywebsocket_wrapper.py processes running for each panda, e.g.:

[cltbld@foopy56.p3.releng.scl3.mozilla.com builds]$ ps auxww | grep pywebsocket_wrapper | grep panda-0298
cltbld    1364  0.0  0.1  97868 10448 ?        S    Oct20   1:45 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld    6821  0.0  0.0  97980  7128 ?        S    May16  12:28 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld   27448  0.0  0.1  97868 10956 ?        S    Nov06   0:35 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir

This foopy has been up for 195 days, so it's had a lot of time to accumulate these extra processes. Killing off the older ones brought the load back down under 2 very quickly.
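
(For reference, a quick way to spot this kind of buildup; this is a hypothetical check, not part of any existing tooling, and it assumes the /builds/panda-* layout shown above:)

# Each panda directory should have at most one pywebsocket_wrapper.py running;
# report any that have accumulated extras.
for d in /builds/panda-*; do
  n=$(pgrep -f "$d/.*pywebsocket_wrapper.py" | wc -l)
  [ "$n" -gt 1 ] && echo "$d: $n pywebsocket_wrapper.py processes"
done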

We should do a few things here:

1) Look at how we launch pywebsocket_wrapper.py to make sure we don't end up with duplicates.
2) Clean up old pywebsocket_wrapper.py instances automatically (a rough sketch follows this list). It's not clear what the intended lifespan of these processes is.
3) Consider rebooting foopies on some cadence to avoid other stray processes building up over time.
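
As a rough sketch of item 2 only (hypothetical, not anything that exists in the tools repo; the one-day threshold is an arbitrary assumption), a periodic cleanup could key off the elapsed-time column that ps already reports:

# Kill any pywebsocket_wrapper.py that has been alive for more than a day.
# ps prints etime as [[dd-]hh:]mm:ss, so a "-" in that column means the
# process is at least a day old.
ps -eo pid,etime,args | awk '/[p]ywebsocket_wrapper\.py/ && $2 ~ /-/ {print $1}' | xargs -r kill
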
Comment 3

4 years ago
Created attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Let's save some trouble by trapping this with our existing "omg pre-existing proc" checks.
Attachment #8522598 - Flags: review?(coop)
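
(The patch itself isn't quoted in this bug. Purely as an illustration of the per-job approach, with a hypothetical device path and variable names, such a check might look like:)

# Before a job starts for a given panda, kill any pywebsocket_wrapper.py left
# over from an earlier job in that panda's build directory.
PANDA_DIR="/builds/panda-0298"   # example device; a real check would be parameterized
stale=$(pgrep -f "$PANDA_DIR/.*pywebsocket_wrapper.py")
if [ -n "$stale" ]; then
  echo "Killing stale pywebsocket_wrapper.py processes for $PANDA_DIR: $stale"
  kill $stale
fi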
(Reporter)

Comment 4

4 years ago
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8522598 [details] [diff] [review]:
-----------------------------------------------------------------

The script from comment #2 was pywebsocket_wrapper.py. Are you sure you're checking for the right thing here?
Comment 5

4 years ago
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Indeed not, too quick on the draw on this.
Attachment #8522598 - Flags: review?(coop) → review-
Comment 6

4 years ago
Created attachment 8523080 [details] [diff] [review]
[tools] v2 - stop pywebsocket on per-job sanity checks for foopies
Attachment #8522598 - Attachment is obsolete: true
Attachment #8523080 - Flags: review?(coop)
(Reporter)

Updated

4 years ago
Attachment #8523080 - Flags: review?(coop) → review+
(Reporter)

Comment 7

4 years ago
Comment on attachment 8523080 [details] [diff] [review]
[tools] v2 - stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8523080 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/7604f5d5748f
Attachment #8523080 - Flags: checked-in+
(Reporter)

Comment 8

4 years ago
I've deployed this change to all the foopies now.

Will check back in at the end of the week to see if we have any errant pywebsocket processes lingering.
(Reporter)

Comment 9

4 years ago
Cleanup is working; I no longer see any stray pywebsocket processes.

However, load on foopy56 is still spiking. I'm going to disable the pandas on it and send it for diagnostics.
(Reporter)

Comment 10

4 years ago
(In reply to Chris Cooper [:coop] from comment #9) 
> However, load on foopy56 is still spiking. I'm going to disable the pandas
> on it and send it for diagnostics.

I removed a few directories for pandas that had been decommissioned. The foopy was still spinning up watch_devices.sh every 5 minutes to check for buildbot.tac files for these missing machines. There were even a few older iterations of these checks stuck in the process table. Killing them off brought the load down again.

Still, this foopy seems to be in a weird state relative to the others: it's the only one we're getting alerts about. I've left all the pandas attached to it disabled and will send it for diagnostics.
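
(As a hypothetical follow-up check for this failure mode, assuming each active panda keeps a buildbot.tac directly under its /builds/panda-* directory, which is a guess about the layout:)

# Panda directories with no buildbot.tac are likely decommissioned and just
# keep watch_devices.sh spinning every 5 minutes; list them for removal.
for d in /builds/panda-*; do
  [ -f "$d/buildbot.tac" ] || echo "no buildbot.tac under $d (decommissioned?)"
done

# Older watch_devices.sh iterations stuck in the process table show up here:
ps -eo pid,etime,args | grep '[w]atch_devices\.sh'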
(Reporter)

Updated

4 years ago
Depends on: 1101278
(Reporter)

Comment 11

4 years ago
The deploypass in the default image hadn't been updated, so I'm running puppetize.sh by hand now to get this machine set up.
(Reporter)

Comment 12

4 years ago
Pandas re-enabled. Back in production.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Looks like we have a bad disk here.
Forgot to reopen when requesting the RMA.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pandas re-enabled and taking jobs.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago → 3 years ago
Resolution: --- → FIXED