Closed Bug 1093804 (foopy56), Opened 10 years ago, Closed 9 years ago

foopy56 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)

References

Details

Attachments

(1 file, 1 obsolete file)

foopy56 has been showing load spikes for the past day. It may have too many pandas associated with it, or it may be experiencing hardware issues.
I've disabled panda-0295. I may disable more if the load doesn't drop.
Logging into the foopy, I see multiple pywebsocket_wrapper.py processes running for each panda, e.g.:

[cltbld@foopy56.p3.releng.scl3.mozilla.com builds]$ ps auxww | grep pywebsocket_wrapper | grep panda-0298
cltbld    1364  0.0  0.1  97868 10448 ?        S    Oct20   1:45 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld    6821  0.0  0.0  97980  7128 ?        S    May16  12:28 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld   27448  0.0  0.1  97868 10956 ?        S    Nov06   0:35 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir

This foopy has been up 195 days, so it's had a lot of time to accumulate these extra processes. Killing the older ones off brought the load back down under 2 very quickly.

We should do a few things here:

1) Look at how we launch pywebsocket_wrapper.py to make sure we don't end up with duplicates.
2) Clean up old pywebsocket_wrapper.py instances automatically; I'm not sure what the intended lifespan is supposed to be (see the sketch after this list).
3) Consider rebooting foopies on some cadence so that other stray duplicate processes don't build up over time.
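
For (2), a rough sketch of what the automatic cleanup could look like, run from cron on each foopy. Assumptions: the one-day threshold is a guess at the intended lifespan, and this needs a ps that supports the etimes column.

#!/bin/bash
# Sketch only: kill pywebsocket_wrapper.py processes older than one day.
# MAX_AGE is a guess; requires a ps that supports etimes (elapsed seconds).
MAX_AGE=86400

ps -eo pid,etimes,args | grep '[p]ywebsocket_wrapper\.py' | \
while read pid age args; do
    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "killing stale pywebsocket_wrapper.py (pid $pid, ${age}s old)"
        kill "$pid"
    fi
done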
Let's save some trouble by trapping this with our existing "omg pre-existing proc" checks.
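Roughly this shape (sketch only: I don't know the exact structure of the existing checks, and the function name here is made up; the real change is the attached patch):

# Hypothetical helper for the per-job sanity check; not the actual patch.
# Before starting a job for a device, kill any pywebsocket_wrapper.py left
# over from a previous job in that device's build dir.
kill_stale_pywebsocket() {
    local device="$1"   # e.g. panda-0298
    if pgrep -f "/builds/${device}/.*pywebsocket_wrapper\.py" >/dev/null; then
        echo "omg pre-existing pywebsocket_wrapper.py for ${device}, killing it"
        pkill -f "/builds/${device}/.*pywebsocket_wrapper\.py"
    fi
}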
Attachment #8522598 - Flags: review?(coop)
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8522598 [details] [diff] [review]:
-----------------------------------------------------------------

The script from comment #2 was pywebsocket_wrapper.py. Are you sure you're checking for the right thing here?
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Indeed not, too quick to the draw on this.
Attachment #8522598 - Flags: review?(coop) → review-
Attachment #8523080 - Flags: review?(coop) → review+
Comment on attachment 8523080 [details] [diff] [review]
[tools] v2 - stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8523080 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/7604f5d5748f
Attachment #8523080 - Flags: checked-in+
I've deployed this change to all the foopies now.

Will check back in at the end of the week to see if we have any errant pywebsocket processes lingering.
Cleanup is working; I no longer see any stray pywebsocket processes.

However, load on foopy56 is still spiking. I'm going to disable the pandas on it and send it for diagnostics.
(In reply to Chris Cooper [:coop] from comment #9) 
> However, load on foopy56 is still spiking. I'm going to disable the pandas
> on it and send it for diagnostics.

I removed a few directories for pandas that had been decommissioned. The foopy was still spinning up watch_devices.sh every 5 minutes to check for buildbot.tac files for these missing machines. There were even a few older iterations of these checks stuck in the process table. Killing them off brought the load down again.

Still, this foopy seems to be in a weird state relative to the others. It's the only one we're getting alerts about. I've still disabled all the pandas attached to it and will send it for diagnostics.
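
Roughly, the kind of cleanup involved (sketch only: the device names are placeholders, and the real decommission list comes from inventory, not from guessing at the filesystem):

# Sketch only: placeholder device names; assumes the usual /builds/<device>
# layout on a foopy.
DECOMMISSIONED="panda-0123 panda-0456"

for device in $DECOMMISSIONED; do
    # Kill any watch_devices.sh iterations still stuck polling for this
    # device's buildbot.tac.
    pkill -f "$device"
    # Remove the build dir so nothing polls for it again.
    rm -rf "/builds/${device}"
done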
Depends on: 1101278
The deploypass in the default image hadn't been updated, so I'm running puppetize.sh by hand now to get this machine set up.
Pandas re-enabled. Back in production.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Looks like we have a bad disk here.
Forgot to reopen when requesting the RMA.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pandas re-enabled and taking jobs.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard