Closed Bug 1164214 Opened 9 years ago Closed 9 years ago

Set up a 5 machine puppet configured 2008 datacenter test pool in try

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

(Whiteboard: [windows])

Setting up a 5-machine test pool of Puppet-configured 2008 machines.

As per an IRC conversation with coop, going to be using b-2008-ix-017(5-9).
Assignee: relops → mcornmesser
0175 has been re-imaged and enabled. 

The other 4 machines need to finish their current builds and then be reimaged.
All machines except 0177 have been reimaged and enabled. 0177 is currently being reimaged.
Summary: Set up a 5 machine puppet configured 2008 test pool in try → Set up a 5 machine datacenter puppet configured 2008 test pool in try
0177 is now enabled.
Summary: Set up a 5 machine datacenter puppet configured 2008 test pool in try → Set up a 5 machine puppet configured 2008 datacenterdatacenter test pool in try
The majority of builds are green. 

The failing builds appear to be due to issues with the builds themselves.

There were many green, successful builds. There were a few with failures or warnings, but Treeherder showed those parts failing and warning on other Windows builders as well.
Summary: Set up a 5 machine puppet configured 2008 datacenterdatacenter test pool in try → Set up a 5 machine puppet configured 2008 datacenter test pool in try
I disabled your b-2008-ix-0176 because it has failed every job but one for the last several days, timing out after 4800 seconds without output when it starts the actual build. Please decide whether you want to fix it, or to give it back to the pool to have buildduty treat it as broken hardware and get diagnostics run on it, because we're getting pretty desperate for working Windows try build slaves.
Flags: needinfo?(mcornmesser)
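The 4800 seconds mentioned above is an output-inactivity timeout: the master kills a step once it has gone that long without producing output. A minimal Python sketch of that idea, not buildbot's actual implementation, with a placeholder command:

    # Minimal sketch of an output-inactivity watchdog (not buildbot's code).
    # Kills the child process if it prints nothing for `timeout` seconds.
    # Note: the timer resets on each line of output.
    import subprocess
    import threading

    def run_with_output_timeout(cmd, timeout=4800):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        timer = threading.Timer(timeout, proc.kill)
        timer.start()
        try:
            for line in proc.stdout:      # blocks until the child writes a line
                print(line, end="")
                timer.cancel()            # output seen: restart the inactivity timer
                timer = threading.Timer(timeout, proc.kill)
                timer.start()
        finally:
            timer.cancel()
        return proc.wait()

    # Example with a hypothetical build command:
    # run_with_output_timeout(["mozmake", "-f", "client.mk"], timeout=4800)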
These are currently failing their puppet runs. Mark was out today but will look at them when he gets back tomorrow.
Huh, all the rest of them are actually working - they can fail puppet and still happily take jobs?

They also seem to be running Mercurial 2.9.1, substantially behind the 3.2.1 on the non-puppet machines.
I am looking into this now. 

If the machines are still taking jobs, that means they eventually ran Puppet without failures. They are set up to keep running Puppet until they get a successful run.
Depends on: 1170587, 1170588
Flags: needinfo?(mcornmesser)
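A rough sketch of that "retry Puppet until it succeeds" behavior, in Python rather than the actual provisioning scripts; with --test, puppet agent uses detailed exit codes (0 = no changes, 2 = changes applied), and the retry delay below is a placeholder:

    # Hedged sketch of a "retry Puppet until it succeeds" loop (not the real
    # deployment code).  With --test, puppet agent uses detailed exit codes:
    # 0 = no changes, 2 = changes applied; anything else is a failed run.
    import subprocess
    import time

    RETRY_DELAY = 300  # seconds between attempts; placeholder value

    def run_puppet_until_clean():
        while True:
            result = subprocess.run(["puppet", "agent", "--test"])
            if result.returncode in (0, 2):
                print("Puppet run succeeded; machine can start taking jobs.")
                return
            print("Puppet run failed (exit %d); retrying in %ds"
                  % (result.returncode, RETRY_DELAY))
            time.sleep(RETRY_DELAY)

    if __name__ == "__main__":
        run_puppet_until_clean()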
As for 0176: the new network configurations have landed in Puppet. I am going to reimage it and test to see if I see any network issues. If I do, then it may be hardware.
Network-wise the machine seems fine now. A 1.5 GB file transfer within the datacenter was at 50+ MB/s, and the same transfer out to S3 was at 40+ MB/s. 

I am going to reenable the machine in the AM and keep an eye on it.
Whiteboard: [windows]
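As a quick sanity check on those rates (assuming 1 GB = 1024 MB), the transfer times work out to roughly 30-40 seconds:

    # Back-of-the-envelope transfer times for a 1.5 GB file at the quoted rates.
    size_mb = 1.5 * 1024          # 1536 MB, assuming 1 GB = 1024 MB
    for label, rate in [("in-datacenter", 50), ("to S3", 40)]:
        print("%s at %d MB/s: ~%.0f s" % (label, rate, size_mb / rate))
    # in-datacenter at 50 MB/s: ~31 s
    # to S3 at 40 MB/s: ~38 s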
It seems that all but 0178 have been returned to the domain and re-enabled. 

I am going to reenable 0178 and keep an eye on it.
Disregard comment 11.

Sendchange appears to have been addressed and hg 3.2.1 has been installed through Puppet. 

I am reimaging and reenabling this test pool and will keep an eye out for any failures.
Hi Mark,
We have found a host that is failing to do sendchanges (even though the job happily shows as green).
Please disable all hosts on try until we can find a fix.

Otherwise, developers will get frustrated not knowing why their green Windows builds do not trigger test jobs.
Blocks: 1186586
Done. It was a single machine, currently 0175. Could you give me a link to a log that shows the failure, please?
Depends on: 1175701
It's in bug 1186586.
Thanks Mark! Good luck!
The sendchange issue has been addressed. I am going to enable 0175, and if all goes well with it I will spin up the rest of the test pool.
Looks like 0175-0179 were enabled on try, but they have staging keys and fail to upload to stage.mozilla.org (the build is still green though, for some reason). I've disabled them again.

When these have been swapped to production keys, please double check that try slaves only have trybld and b2gtry keys, and not anything for ffxbld, tbirdbird, or b2gbld.
That should read '... and not anything for ffxbld, tbirdbld, b2gbld.'
The slavealloc environment was set to dev/pp, which overrode the slavetrust level set in Puppet. It is now set to prod. I will verify which keys are there before enabling the slaves.
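One way to do that verification would be a small check along these lines. This is only a sketch: the key directory and file naming are assumptions, not the real slave layout; the point is to refuse to enable a slave that has any ffxbld/tbirdbld/b2gbld key while allowing trybld and b2gtry.

    # Hedged sketch: verify a try slave only has try-level keys before enabling it.
    # The key directory and naming below are assumptions, not the real layout.
    import os

    KEY_DIR = r"C:\Users\cltbld\.ssh"                 # hypothetical location
    ALLOWED_PREFIXES = ("trybld", "b2gtry")           # keys a try slave may hold
    FORBIDDEN_PREFIXES = ("ffxbld", "tbirdbld", "b2gbld")

    def check_keys(key_dir=KEY_DIR):
        bad = []
        for name in os.listdir(key_dir):
            if name.startswith(FORBIDDEN_PREFIXES):
                bad.append(name)
            elif not name.startswith(ALLOWED_PREFIXES):
                print("unexpected file, review manually:", name)
        if bad:
            raise SystemExit("production keys present, do not enable: %s"
                             % ", ".join(bad))
        print("only try-level keys found; safe to enable")

    if __name__ == "__main__":
        check_keys()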
On 0176, after the slavealloc update, there are b2gtry_dsa and trybld_dsa and none of the keys mentioned in comment 18. 

Enabled the machine for one build:
http://buildbot-master83.bb.releng.scl3.mozilla.com:8101/builders/WINNT%206.1%20x86-64%20try%20build/builds/6242

And I am not seeing any tracebacks or errors in regard to the upload or keys here:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dwillcoxon@mozilla.com-c4e957b01933/try-win64/try-win64-bm83-try1-build6242.txt.gz

Above is a raw log from here: 
https://treeherder.mozilla.org/logviewer.html#?job_id=10090623&repo=try

nthomas: Does it seem safe to enable these machines after the next Puppet run?
Flags: needinfo?(nthomas)
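That kind of log check can also be scripted. Below is a rough sketch that downloads the gzipped log linked above and prints any line that looks like a traceback or an upload/key problem; the search patterns are guesses, not an exhaustive list:

    # Rough sketch: scan a gzipped buildbot log for upload/key-related errors.
    # The URL is the one from this bug; the patterns are guesses only.
    import gzip
    import io
    import re
    import urllib.request

    LOG_URL = ("http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/"
               "dwillcoxon@mozilla.com-c4e957b01933/try-win64/"
               "try-win64-bm83-try1-build6242.txt.gz")
    PATTERN = re.compile(r"Traceback|upload.*(error|fail)|Permission denied|publickey",
                         re.IGNORECASE)

    def scan_log(url=LOG_URL):
        raw = urllib.request.urlopen(url).read()
        text = gzip.GzipFile(fileobj=io.BytesIO(raw)).read().decode("utf-8", "replace")
        hits = [line for line in text.splitlines() if PATTERN.search(line)]
        for line in hits:
            print(line)
        print("%d suspicious lines" % len(hits))

    if __name__ == "__main__":
        scan_log()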
0176 looks good to me, and that job uploaded fine (see the lines leading up to "11:39:30     INFO -  Running post-upload command: post_upload.py ...").  In the case of errors, the overall build status doesn't change due to bug 1118778.  Thanks for figuring it out.
Flags: needinfo?(nthomas)
No longer applicable.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED