17:21 < nagios-releng> Thu 14:21:19 PST  kvm3.infra.scl1.mozilla.com:avg load is CRITICAL: CRITICAL - load average: 23.76, 25.70, 17.25 (http://m.allizom.org/avg+load)
17:36 < nagios-releng> Thu 14:36:49 PST  kvm4.infra.scl1.mozilla.com:avg load is CRITICAL: CRITICAL - load average: 32.27, 26.15, 19.38 (http://m.allizom.org/avg+load)
I don't see any smoking guns in htop. I checked buildbot-master40 and buildapi01, but neither is swapping, and neither shows high CPU utilization. There are no instances with primary and secondary on these nodes, so no guesses from that perspective. I'm about to take off, so Ben, if you have a chance to take a look, maybe you'll see what I'm missing.
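For anyone repeating the swap check on a guest, here is a rough sketch of one way to do it, assuming a Linux guest that exposes the `pswpin` and `pswpout` counters in `/proc/vmstat`. The helper names (`parse_vmstat`, `swap_delta`, `sample_host`) are made up for illustration and are not part of any tooling mentioned in this bug.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text into a dict of counter name -> int."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line is "<name> <non-negative integer>"; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Return (pages swapped in, pages swapped out) between two samples."""
    b = parse_vmstat(before)
    a = parse_vmstat(after)
    return (
        a.get("pswpin", 0) - b.get("pswpin", 0),
        a.get("pswpout", 0) - b.get("pswpout", 0),
    )


def sample_host(interval=5.0, path="/proc/vmstat"):
    """Take two samples `interval` seconds apart on the local host.

    Assumption: `path` exists and is readable (true on typical Linux systems).
    """
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return swap_delta(first, second)
```

A nonzero result from `sample_host()` suggests the guest was actively swapping during the interval. (0, 0) only says it was not swapping at that moment, which may be why a problem like this is easy to miss after the fact.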
I downtimed the load check for 6h.
Both nodes are fine now. I'm going to go ahead and blame the usual suspect: swap on DRBD.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations