Closed Bug 1164214 Opened 9 years ago Closed 9 years ago

Set up a 5 machine puppet configured 2008 datacenter test pool in try

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

(Whiteboard: [windows])

Setting up a 5-machine test pool of Puppet-configured 2008 machines.

As per an IRC conversation with coop, going to be using b-2008-ix-017(5-9).
Assignee: relops → mcornmesser
0175 has been re-imaged and enabled. 

The other 4 machines need to finish their current builds and then be reimaged.
All machines except 0177 have been reimaged and enabled. 0177 is currently being reimaged.
Summary: Set up a 5 machine puppet configured 2008 test pool in try → Set up a 5 machine datacenter puppet configured 2008 test pool in try
0177 is now enabled.
Summary: Set up a 5 machine datacenter puppet configured 2008 test pool in try → Set up a 5 machine puppet configured 2008 datacenterdatacenter test pool in try
The majority of builds are green. 

The failing builds appear to be due to issues with the builds themselves.

There were many green, successful builds. There were a few with failures or warnings, but Treeherder showed those parts failing and warning on other Windows builders as well.
Summary: Set up a 5 machine puppet configured 2008 datacenterdatacenter test pool in try → Set up a 5 machine puppet configured 2008 datacenter test pool in try
I disabled your b-2008-ix-0176 because it has failed every job but one for the last several days, timing out after 4800 seconds without output when it starts the actual build. Please decide whether you want to fix it, or to give it back to the pool to have buildduty treat it as broken hardware and get diagnostics run on it, because we're getting pretty desperate for working Windows try build slaves.
Flags: needinfo?(mcornmesser)
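The 4800 seconds mentioned above is an output-inactivity timeout: the master kills a step once it has gone that long without producing output. A minimal Python sketch of that idea, not buildbot's actual implementation, with a placeholder command:

    # Minimal sketch of an output-inactivity watchdog (not buildbot's code).
    # Kills the child process if it prints nothing for `timeout` seconds.
    # Note: the timer resets on each line of output.
    import subprocess
    import threading

    def run_with_output_timeout(cmd, timeout=4800):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        timer = threading.Timer(timeout, proc.kill)
        timer.start()
        try:
            for line in proc.stdout:      # blocks until the child writes a line
                print(line, end="")
                timer.cancel()            # output seen: restart the inactivity timer
                timer = threading.Timer(timeout, proc.kill)
                timer.start()
        finally:
            timer.cancel()
        return proc.wait()

    # Example with a hypothetical build command:
    # run_with_output_timeout(["mozmake", "-f", "client.mk"], timeout=4800)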
These are currently failing their puppet runs. Mark was out today but will look at them when he gets back tomorrow.
Huh, all the rest of them are actually working - they can fail puppet and still happily take jobs?

They also seem to be running Mercurial 2.9.1, substantially behind the 3.2.1 on the non-puppet machines.
I am looking into this now. 

If the machines are still taking jobs, that means they eventually ran Puppet without failures. They are set up to keep running Puppet until they get a successful run.
Depends on: 1170587, 1170588
Flags: needinfo?(mcornmesser)
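A rough sketch of that "retry Puppet until it succeeds" behavior, in Python rather than the actual provisioning scripts; with --test, puppet agent uses detailed exit codes (0 = no changes, 2 = changes applied), and the retry delay below is a placeholder:

    # Hedged sketch of a "retry Puppet until it succeeds" loop (not the real
    # deployment code).  With --test, puppet agent uses detailed exit codes:
    # 0 = no changes, 2 = changes applied; anything else is a failed run.
    import subprocess
    import time

    RETRY_DELAY = 300  # seconds between attempts; placeholder value

    def run_puppet_until_clean():
        while True:
            result = subprocess.run(["puppet", "agent", "--test"])
            if result.returncode in (0, 2):
                print("Puppet run succeeded; machine can start taking jobs.")
                return
            print("Puppet run failed (exit %d); retrying in %ds"
                  % (result.returncode, RETRY_DELAY))
            time.sleep(RETRY_DELAY)

    if __name__ == "__main__":
        run_puppet_until_clean()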
As for 0176: the new network configurations have landed in Puppet. I am going to reimage it and test to see if I see any network issues. If I do, then it may be hardware.
Network-wise the machine seems fine now. A 1.5 GB file transfer within the datacenter was at 50+ MB/s, and the same transfer out to S3 was at 40+ MB/s. 

I am going to reenable the machine in the AM and keep an eye on it.
Whiteboard: [windows]
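As a quick sanity check on those rates (assuming 1 GB = 1024 MB), the transfer times work out to roughly 30-40 seconds:

    # Back-of-the-envelope transfer times for a 1.5 GB file at the quoted rates.
    size_mb = 1.5 * 1024          # 1536 MB, assuming 1 GB = 1024 MB
    for label, rate in [("in-datacenter", 50), ("to S3", 40)]:
        print("%s at %d MB/s: ~%.0f s" % (label, rate, size_mb / rate))
    # in-datacenter at 50 MB/s: ~31 s
    # to S3 at 40 MB/s: ~38 s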
It seems that all but 0178 have been returned to the domain and re-enabled. 

I am going to reenable 0178 and keep an eye on it.
Disregard comment 11.

Sendchange appears to have been addressed and hg 3.2.1 has been installed through Puppet. 

I am reimaging and reenabling this test pool and will keep an eye out for any failures.
Hi Mark,
We have found a host that is failing to do sendchanges (even though the job happily shows as green).
Please disable all hosts on try until we can find a fix.

Otherwise, developers will get frustrated not knowing why their green Windows builds do not trigger test jobs.
Blocks: 1186586
Done. It was a single machine, currently 0175. Could you give me a link to a log that shows the failure, please?
Depends on: 1175701
It's in bug 1186586.
Thanks Mark! Good luck!
The sendchange issue has been addressed. I am going to enable 0175, and if all goes well with it I will spin up the rest of the test pool.
Looks like 0175-0179 were enabled on try, but they have staging keys and fail to upload to stage.mozilla.org (the build is still green though, for some reason). I've disabled them again.

When these have been swapped to production keys, please double check that try slaves only have trybld and b2gtry keys, and not anything for ffxbld, tbirdbird, or b2gbld.
That should read '... and not anything for ffxbld, tbirdbld, b2gbld.'
The slavealloc environment was set to dev/pp, which overrode the slavetrust level set in Puppet. It is now set to prod. I will verify which keys are there before enabling the slaves.
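One way to do that verification would be a small check along these lines. This is only a sketch: the key directory and file naming are assumptions, not the real slave layout; the point is to refuse to enable a slave that has any ffxbld/tbirdbld/b2gbld key while allowing trybld and b2gtry.

    # Hedged sketch: verify a try slave only has try-level keys before enabling it.
    # The key directory and naming below are assumptions, not the real layout.
    import os

    KEY_DIR = r"C:\Users\cltbld\.ssh"                 # hypothetical location
    ALLOWED_PREFIXES = ("trybld", "b2gtry")           # keys a try slave may hold
    FORBIDDEN_PREFIXES = ("ffxbld", "tbirdbld", "b2gbld")

    def check_keys(key_dir=KEY_DIR):
        bad = []
        for name in os.listdir(key_dir):
            if name.startswith(FORBIDDEN_PREFIXES):
                bad.append(name)
            elif not name.startswith(ALLOWED_PREFIXES):
                print("unexpected file, review manually:", name)
        if bad:
            raise SystemExit("production keys present, do not enable: %s"
                             % ", ".join(bad))
        print("only try-level keys found; safe to enable")

    if __name__ == "__main__":
        check_keys()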
On 0176, after the slavealloc update, there are b2gtry_dsa and trybld_dsa and none of the keys mentioned in comment 18. 

Enabled the machine for one build:
http://buildbot-master83.bb.releng.scl3.mozilla.com:8101/builders/WINNT%206.1%20x86-64%20try%20build/builds/6242

And I am not seeing any tracebacks or errors in regard to the upload or keys here:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dwillcoxon@mozilla.com-c4e957b01933/try-win64/try-win64-bm83-try1-build6242.txt.gz

Above is a raw log from here: 
https://treeherder.mozilla.org/logviewer.html#?job_id=10090623&repo=try

nthomas: Does it seem safe to enable these machines after the next Puppet run?
Flags: needinfo?(nthomas)
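That kind of log check can also be scripted. Below is a rough sketch that downloads the gzipped log linked above and prints any line that looks like a traceback or an upload/key problem; the search patterns are guesses, not an exhaustive list:

    # Rough sketch: scan a gzipped buildbot log for upload/key-related errors.
    # The URL is the one from this bug; the patterns are guesses only.
    import gzip
    import io
    import re
    import urllib.request

    LOG_URL = ("http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/"
               "dwillcoxon@mozilla.com-c4e957b01933/try-win64/"
               "try-win64-bm83-try1-build6242.txt.gz")
    PATTERN = re.compile(r"Traceback|upload.*(error|fail)|Permission denied|publickey",
                         re.IGNORECASE)

    def scan_log(url=LOG_URL):
        raw = urllib.request.urlopen(url).read()
        text = gzip.GzipFile(fileobj=io.BytesIO(raw)).read().decode("utf-8", "replace")
        hits = [line for line in text.splitlines() if PATTERN.search(line)]
        for line in hits:
            print(line)
        print("%d suspicious lines" % len(hits))

    if __name__ == "__main__":
        scan_log()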
0176 looks good to me, and that job uploaded fine (see the lines leading up to "11:39:30     INFO -  Running post-upload command: post_upload.py ...").  In the case of errors, the overall build status doesn't change due to bug 1118778.  Thanks for figuring it out.
Flags: needinfo?(nthomas)
No longer applicable.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED