Closed Bug 1093804 (foopy56) - foopy56 problem tracking
Opened 10 years ago, Closed 9 years ago
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: coop, Unassigned)
References
Details
Attachments
(1 file, 1 obsolete file)
1.05 KB, patch: coop: review+, coop: checked-in+
foopy56 has been showing load spikes for the past day. It may have too many pandas associated with it, or it may be experiencing hardware issues.
Reporter
Comment 1•10 years ago
I've disabled panda-0295. I may disable more if the load doesn't drop.
Reporter
Comment 2•10 years ago
Logging into the foopy, I see multiple pywebsocket_wrapper.py processes running for each panda, e.g.:

[cltbld@foopy56.p3.releng.scl3.mozilla.com builds]$ ps auxww | grep pywebsocket_wrapper | grep panda-0298
cltbld  1364  0.0  0.1  97868 10448 ?  S  Oct20  1:45 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld  6821  0.0  0.0  97980  7128 ?  S  May16 12:28 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
cltbld 27448  0.0  0.1  97868 10956 ?  S  Nov06  0:35 /builds/panda-0298/test/build/venv/bin/python /builds/panda-0298/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/panda-0298/test/build/tests/mochitest -l /builds/panda-0298/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir

This foopy has been up 195 days, so it's had a lot of time to accumulate these extra processes. Killing off the older ones brought the load back down under 2 very quickly.

We should do a few things here:

1) Look at how we launch pywebsocket_wrapper.py to make sure we don't end up with duplicates.
2) Clean up old pywebsocket_wrapper.py instances automatically. Not sure what the intended lifespan is supposed to be.
3) Consider rebooting foopies on some cadence to avoid random other duplicate processes building up over time.
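As an illustration of the automatic cleanup idea in point 2, here is a minimal sketch. It reads process listings of the form `<pid> <elapsed-seconds> <panda-id>` (which could be derived from `ps -eo pid=,etimes=,args=` filtered for pywebsocket_wrapper.py) and prints the PIDs of every instance except the most recently started one per panda. The function name and input format are hypothetical; this is not the check that eventually landed.

```shell
# Hypothetical cleanup sketch: keep only the newest pywebsocket_wrapper.py
# per panda, printing the PIDs of older duplicates so they can be killed.
# Input lines on stdin: "<pid> <elapsed-seconds> <panda-id>".
stale_pywebsocket_pids() {
  awk '
    {
      pid = $1; age = $2; panda = $3
      if (!(panda in best) || age < best[panda]) {
        # Found a newer instance; the previously kept one is now stale.
        if (panda in best) print keep[panda]
        best[panda] = age; keep[panda] = pid
      } else {
        # Older than the one we are keeping: stale.
        print pid
      }
    }
  '
}
```

The output could then be fed to `kill`, but a production version would want to be much more careful (e.g. verifying the PID still belongs to pywebsocket before signalling it).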
Comment 3•10 years ago
Let's save some trouble by trapping this with our existing "omg pre-existing proc" checks.
Attachment #8522598 - Flags: review?(coop)
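In the spirit of the "pre-existing proc" checks mentioned above, a per-job sanity check might look roughly like the sketch below. This is NOT the actual tools patch; the function name is hypothetical, and it simply scans a `ps auxww`-style listing on stdin for an already-running pywebsocket_wrapper.py belonging to the given panda.

```shell
# Hypothetical sketch of a pre-existing-process check: succeed (exit 0) if a
# pywebsocket_wrapper.py is already running for the given panda, so a per-job
# sanity check can bail out (or kill it) before starting a new job.
preexisting_pywebsocket() {
  panda="$1"
  # The second grep drops the "grep pywebsocket" pipeline entry itself when
  # this is fed straight from `ps auxww | grep pywebsocket_wrapper`.
  grep "/builds/$panda/.*pywebsocket_wrapper\.py" | grep -qv "grep"
}
```

Usage might be `ps auxww | preexisting_pywebsocket panda-0298 && echo "stale pywebsocket for panda-0298"`.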
Reporter
Comment 4•10 years ago
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8522598 [details] [diff] [review]:
-----------------------------------------------------------------

The script from comment #2 was pywebsocket_wrapper.py. Are you sure you're checking for the right thing here?
Comment 5•10 years ago
Comment on attachment 8522598 [details] [diff] [review]
[tools] stop pywebsocket on per-job sanity checks for foopies

Indeed not; too quick on the draw on this one.
Attachment #8522598 - Flags: review?(coop) → review-
Comment 6•10 years ago
Attachment #8522598 - Attachment is obsolete: true
Attachment #8523080 - Flags: review?(coop)
Reporter
Updated•10 years ago
Attachment #8523080 - Flags: review?(coop) → review+
Reporter
Comment 7•10 years ago
Comment on attachment 8523080 [details] [diff] [review]
[tools] v2 - stop pywebsocket on per-job sanity checks for foopies

Review of attachment 8523080 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/7604f5d5748f
Attachment #8523080 - Flags: checked-in+
Reporter
Comment 8•10 years ago
I've deployed this change to all the foopies now. Will check back in at the end of the week to see if we have any errant pywebsocket processes lingering.
Reporter
Comment 9•10 years ago
Cleanup is working; I no longer see any stray pywebsocket processes. However, load on foopy56 is still spiking. I'm going to disable the foopies on it and send it for diagnostics.
Reporter
Comment 10•10 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> However, load on foopy56 is still spiking. I'm going to disable the foopies
> on it and send it for diagnostics.

I removed a few directories for pandas that had been decommissioned. The foopy was still spinning up watch_devices.sh every 5 minutes to check for buildbot.tac files for these missing machines. There were even a few older iterations of these checks stuck in the process table. Killing them off brought the load down again.

Still, this foopy seems to be in a weird state relative to the others. It's the only one we're getting alerts about. I've still disabled all the pandas attached to it and will send it for diagnostics.
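The decommissioned-panda cleanup described here could be audited with a sketch like the following. The bug doesn't show the actual commands used; the function name is hypothetical, and the /builds layout and buildbot.tac location are assumptions based on the paths in comment 2.

```shell
# Hypothetical audit sketch: list panda directories under the foopy's build
# root that have no buildbot.tac, i.e. likely decommissioned pandas whose
# leftover directories keep watch_devices.sh busy every 5 minutes.
list_orphan_panda_dirs() {
  root="${1:-/builds}"
  for d in "$root"/panda-*/; do
    [ -d "$d" ] || continue                     # glob matched nothing
    [ -e "${d}buildbot.tac" ] || printf '%s\n' "${d%/}"
  done
}
```

An operator could review that list before running `rm -rf` on the orphaned directories, rather than deleting anything automatically.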
Reporter
Comment 11•10 years ago
The deploypass in the default image hadn't been updated, so I'm running puppetize.sh by hand now to get this machine set up.
Reporter
Comment 12•10 years ago
Pandas re-enabled. Back in production.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 13•9 years ago
Looks like we have a bad disk here.
Comment 14•9 years ago
Forgot to reopen when requesting the RMA.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 15•9 years ago
Pandas re-enabled and taking jobs.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard