Bug 949674 (Closed): Opened 11 years ago, closed 10 years ago

ec2 Builders timing out during mock-install step

Categories

(Release Engineering :: General, defect)

Platform: All
OS: Linux
Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: KWierso, Unassigned)

Details

https://tbpl.mozilla.org/php/getParsedLog.php?id=31892811&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=31892485&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=31891234&tree=Fx-Team

[13:53] <KWierso|sheriffduty> bhearsum|buildduty: ping
<bhearsum|buildduty> KWierso|sheriffduty: pong
<KWierso|sheriffduty> I've been seeing a few things like this today: https://tbpl.mozilla.org/php/getParsedLog.php?id=31892485&tree=Mozilla-Inbound
[13:54] <KWierso|sheriffduty> anything someone should be worried about?
<KWierso|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=31891234&tree=Fx-Team was another
<bhearsum|buildduty> hmm
[13:55] <bhearsum|buildduty> i bet that's related to this nagios alert
<bhearsum|buildduty> 16:50 < nagios-releng> Thu 13:50:44 PST [4846] releng-puppet2.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 11.97, 9.44, 5.31 (http://m.allizom.org/load)
<bhearsum|buildduty> that step downloads files from the puppet server
<bhearsum|buildduty> please file it while i look into it
<bhearsum|buildduty> load seems okay now - so it could've just been a brief spike of instances being started
It looks like this was a simple load spike, but it may have been made worse if we recently increased the maximum number of instances. Rail, I think you touched that earlier this week? We also deployed Foreman recently, which I don't _think_ affects load on the Puppet master, but it's worth checking. Dustin, do you know?
Flags: needinfo?(rail)
That's certainly possible. If load is spiking that high (and I see it did the same about an hour before this event) then we should have more puppetmasters in ec2.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #2)
> That's certainly possible. If load is spiking that high (and I see it did
> the same about an hour before this event) then we should have more
> puppetmasters in ec2.

As a temporary solution we can also bump the instance type from m1.large to m1.xlarge ($0.240 per hour vs $0.480 per hour).
Flags: needinfo?(rail)
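For illustration only, a minimal sketch of the instance-type bump suggested above, assuming boto3, a placeholder instance ID, and usw2; the actual releng tooling of the time used its own scripts, not this code.

    # Sketch: resize an EC2 instance from m1.large to m1.xlarge with boto3.
    # The instance ID and region below are placeholders, not real releng hosts.
    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical instance
    NEW_TYPE = "m1.xlarge"                # up from m1.large

    ec2 = boto3.client("ec2", region_name="us-west-2")

    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

    ec2.modify_instance_attribute(
        InstanceId=INSTANCE_ID,
        InstanceType={"Value": NEW_TYPE},
    )

    ec2.start_instances(InstanceIds=[INSTANCE_ID])

The tradeoff is purely cost versus headroom: the larger type doubles the hourly price but needs no changes to the Puppet setup itself.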
Looks like we already have releng-puppet1 & 2 in both use1 and usw2, and AFAICT all of them are getting used. Definitely seems like we should add a releng-puppet3 for at least usw2.
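As a rough illustration of adding another puppetmaster, here is a boto3 sketch; the AMI ID, region, and the "releng-puppet3" naming are placeholders, and the real hosts were provisioned through releng's own tooling rather than raw API calls like this.

    # Sketch: launch an additional puppetmaster instance in usw2 with boto3.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    resp = ec2.run_instances(
        ImageId="ami-00000000",        # hypothetical puppetmaster AMI
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name",
                      "Value": "releng-puppet3.srv.releng.usw2.mozilla.com"}],
        }],
    )
    # Print the new instance ID so it can be added to the masters list.
    print(resp["Instances"][0]["InstanceId"])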
Another option is to offload hosting of the yum/deb repos from the puppet masters and put them on S3 or some other file server. This was mentioned several times at the AWS re:Invent conference as a best practice for getting Puppet to scale.
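A minimal sketch of that S3 offload idea, assuming boto3 and hypothetical local paths and bucket names; it only shows mirroring repo files into a bucket so builders fetch packages over HTTP instead of from the puppet masters.

    # Sketch: mirror a local yum repo tree into S3.
    import os
    import boto3

    REPO_ROOT = "/data/repos/yum"          # hypothetical local repo tree
    BUCKET = "releng-package-repos"        # hypothetical S3 bucket

    s3 = boto3.client("s3")

    for dirpath, _dirnames, filenames in os.walk(REPO_ROOT):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, REPO_ROOT)
            # Upload each file; yum/apt clients would then point their
            # baseurl at the bucket's HTTP endpoint rather than the master.
            s3.upload_file(local_path, BUCKET, key)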
That's certainly more complicated, but an option if you can figure out how.
Moving this, because it's not an acute buildduty concern.
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
We haven't seen these for a long time.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General