Closed Bug 1366029 Opened 7 years ago Closed 7 years ago

Add Windows 10 machines to buildbot-configs so we can run the new talos tests there

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jmaher)

Attachments

(4 files, 3 obsolete files)

      No description provided.
Do we know what the machine naming scheme will be?
t-w1064-ix-NNNN.wintest.releng.scl3.mozilla.com
I am familiar with list_builder_differences for verifying changes when scheduling new jobs, but I am not familiar with adding new machine names and platforms in buildbot-configs (or maybe even buildbotcustom).  If there is prior art for doing this, I would be happy to use it as a starting point.
We previously removed win10 support from buildbot in bug 1330999. I would use the patches there as a starting point.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8869610 - Flags: feedback?(catlee)
Attached file buildbot differences for win10 (obsolete)
assuming my patch looks good, we can go ahead and schedule a time to replace win8 talos with win10; ideally this is something we can line up with reimaging machines.
:catlee, I would like to know whether this patch is worth pursuing. If you don't have time, could you redirect me to another buildbot hacker?  Getting this ready to land would help us move forward in finishing the win10 project.
Comment on attachment 8869610 [details] [diff] [review]
add win10-ix as a platform- shift win8 talos tests to win10

Review of attachment 8869610 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config.py
@@ -152,5 @@
>      'config_file': 'talos/windows_config.py',
>  }
>  
>  PLATFORMS['win64']['slave_platforms'] = ['win8_64']
> -PLATFORMS['win64']['talos_slave_platforms'] = ['win8_64']

Will we want to make a hard transition from win8 to win10 talos testing?
Attachment #8869610 - Flags: feedback?(catlee) → feedback+
in addition to buildbot-configs, we need support for slavehealth/slavealloc/puppet.

I see a puppet patch when win10 was removed:
https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1330999&attachment=8827909

there is also a cloudtools patch:
https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1330999&attachment=8827910

but I am not sure what slavehealth/slavealloc are; are they part of cloud-tools?
Flags: needinfo?(catlee)
buildduty can add the entries to slavealloc. I'm not sure about how machines get added to slavehealth. Alin, can you help Joel out?
Flags: needinfo?(catlee) → needinfo?(aselagea)
the plan here is to turn off win8 and turn on win10 at the same time.  If there are problems with that plan, let me know and I can do this in 2 stages.
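As a rough illustration of the two options (a hard cutover versus a two-stage rollout), the talos platform list could be computed as below. This is only a sketch: the helper name `talos_platforms` and the platform key `win10_64` are assumptions for illustration, not names taken from the patch.

```python
def talos_platforms(current, old="win8_64", new="win10_64", overlap=False):
    """Return the talos slave platform list after switching from old to new.

    overlap=False models the single-step plan (old off, new on together).
    overlap=True models a two-stage rollout where both run for a while.
    The default platform names are illustrative assumptions.
    """
    if overlap:
        result = list(current)
    else:
        result = [p for p in current if p != old]
    if new not in result:
        result.append(new)
    return result


if __name__ == "__main__":
    print(talos_platforms(["win8_64"]))
    print(talos_platforms(["win8_64"], overlap=True))
```

The result would be assigned to something like PLATFORMS['win64']['talos_slave_platforms'], following the line shown in the review above.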
Attachment #8869610 - Attachment is obsolete: true
Attachment #8871284 - Flags: review?(kmoir)
Attached patch add windows 10 ix to puppet (obsolete)
support for windows 10 ix hardware inside of puppet.
Attachment #8871285 - Flags: review?(kmoir)
(In reply to Chris AtLee [:catlee] from comment #11)
> buildduty can add the entries to slavealloc. I'm not sure about how machines
> get added to slavehealth. Alin, can you help Joel out?

Yeah, I can take care of those.
Flags: needinfo?(aselagea)
According to https://bugzilla.mozilla.org/show_bug.cgi?id=1367102#c4, we're going to enable 75 Win 10 machines at this point.   
Added those to slavealloc.

mysql> select count(*) from slaves where name like 't-w1064-ix%';
+----------+
| count(*) |
+----------+
|       75 |
+----------+
1 row in set (0.00 sec)
Comment on attachment 8871285 [details] [diff] [review]
add windows 10 ix to puppet

Do we need to include

  $slave_trustlevel = 'try'

here?
Comment on attachment 8871284 [details] [diff] [review]
add windows 10 ix to buildbot configs

I think this is fine except for

PLATFORMS['win64-devedition']['win10_64_devedition'] = {'name': 'Windows 10 64-bit DevEdition',
+                                                       'try_by_default': True}


'try_by_default': True should be False

we only run these tests on beta
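For reference, the requested change would look roughly like this. It is a minimal sketch that stubs out the PLATFORMS dictionary so it can run standalone; the real structure in mozilla-tests/config.py has many more keys.

```python
# Stub of the surrounding structure; only the keys quoted in the review
# are real, everything else here is a placeholder for illustration.
PLATFORMS = {"win64-devedition": {}}

# DevEdition tests only run on beta, so they should not be scheduled
# on try by default.
PLATFORMS["win64-devedition"]["win10_64_devedition"] = {
    "name": "Windows 10 64-bit DevEdition",
    "try_by_default": False,
}
```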
Attachment #8871284 - Flags: review?(kmoir) → review+
thanks!  I think with the two patches attached here, we will be all set.  I assume the puppet patch can land sooner rather than later, then the buildbot-config patch when we start shutting off win8 machines.
For the slave_health part, I simply reverted Coop's patch which actually disabled win10:
https://hg.mozilla.org/build/slave_health/rev/ed1e646be536
manifests/moco-nodes.pp should not have any node definitions for w10, since we are using GPO and AD.
removed the moco-nodes.pp changes.
Attachment #8871285 - Attachment is obsolete: true
Attachment #8871285 - Flags: review?(kmoir)
Attachment #8871312 - Flags: review+
Comment on attachment 8871312 [details] [diff] [review]
add windows 10 ix to puppet

sorry, this was not r+ from :kmoir already; the question about $slave_trustlevel = 'try' seems to be resolved by removing the changes for moco-nodes.pp
Attachment #8871312 - Flags: review+ → review?(kmoir)
Updated the patch to set try_by_default to False for win10 devedition.  Thanks for the review.
Attachment #8869611 - Attachment is obsolete: true
Attachment #8871313 - Flags: review+
One note here that I made in bug 1367102, the host regex is t-w1064-ix-NNN.wintest.releng.scl3.mozilla.com (3 digits instead of 4).
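A quick way to check hostnames against the three-digit scheme is sketched below. The pattern is written from the hostname format quoted in this comment; the constant and function names are hypothetical and not taken from any of the patches.

```python
import re

# Exactly three digits after "t-w1064-ix-", per the note above.
W10_IX_HOST = re.compile(r"t-w1064-ix-\d{3}\.wintest\.releng\.scl3\.mozilla\.com")


def is_w10_ix_host(fqdn):
    """Return True if fqdn matches the three-digit win10 ix naming scheme."""
    return W10_IX_HOST.fullmatch(fqdn) is not None
```

Using fullmatch rather than match means a four-digit name such as t-w1064-ix-0075 is rejected instead of being accepted on its prefix.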
Blocks: 1367102
Attachment #8871312 - Flags: review?(kmoir) → review+
Did a bit of research into what's needed in Treeherder so the new jobs show up, and I think we have everything in place from our previous setup to run Win 10 tests.

https://github.com/mozilla/treeherder/blob/master/ui/js/values.js#L38
https://github.com/mozilla/treeherder/blob/master/treeherder/etl/buildbot.py#L279

A test is also added:
https://github.com/mozilla/treeherder/blob/master/tests/etl/test_buildbot.py#L1018
We ran into several problems with this deploy on the releng side of things.  There were also relops issues, which I'll address in their own bug.

There were two main problems:
1) New w10 machines could not connect to buildbot masters
2) Huge windows pending counts were triggered

New w10 machines could not connect to buildbot masters
1) The initial reconfig failed because the win10 devedition key was missing in puppet.  The patch also contained Windows EOL characters; I am not sure whether they caused an issue, but I removed them as well.
I deployed this fix:
https://hg.mozilla.org/build/puppet/rev/3f09b62b7c30
2) The puppet patch landed, but a new reconfig was not triggered because the reconfig script did not see a version change since the previous, failed attempt (bug 1369164).
3) I triggered a reconfig and machines could connect
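The behaviour in step 2 suggests the trigger check compared versions only. A hedged sketch of a check that also retries after a failed attempt is below; the function and parameter names are hypothetical and this is not the actual reconfig script.

```python
def needs_reconfig(current_rev, last_attempted_rev, last_attempt_succeeded):
    """Decide whether a reconfig should run.

    A failed previous attempt always warrants a retry, even when the
    config revision has not changed since that attempt. This is an
    illustrative policy, not the logic of the real script.
    """
    if not last_attempt_succeeded:
        return True
    return current_rev != last_attempted_rev
```

With a version-only comparison, a fix that lands in a different repository (here, puppet) leaves the revision unchanged and the failed reconfig is never retried.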

Huge windows pending counts were triggered
When we enabled w10 as a platform, there was a huge increase in pending counts for w7 and w10 jobs.  We have seen this happen before when adding a new platform.  I opened bug 1369157 to investigate the root cause.

Alin fixed the db issues as well
Alin, can you include the db queries/updates you used to fix the issue on this bug?  I looked in the mysql console history, but you must have attached to the db from a different machine than I did.
Flags: needinfo?(aselagea)
noticed this alert because the range is not quite right

[sns alert] Jun 01 08:00:02 buildbot-master119.bb.releng.scl3.mozilla.com watch_twistd_log.py: Count: 372 | First instance: 2017-06-01 07:38:09-0700 | Most recent instance: 2017-06-01 08:00:00-0700 | Twistd exception: twisted.cred.error.UnauthorizedLogin - t-w1064-ix-075.wintest.releng.scl3.mozilla.com 10.26.42.97
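One plausible way to end up with machine 075 rejected is an off-by-one in the slave list, since Python's range() excludes its end value. This is an assumption about how the list was built, not something confirmed in this bug; the helper name is hypothetical.

```python
def slave_names(first, last):
    """Return t-w1064-ix-NNN names for first..last, inclusive."""
    return ["t-w1064-ix-%03d" % n for n in range(first, last + 1)]


# range(1, 75) stops at 74, so the 75th machine would be missing.
too_short = ["t-w1064-ix-%03d" % n for n in range(1, 75)]

if __name__ == "__main__":
    print(len(slave_names(1, 75)), len(too_short))
```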
Attachment #8873465 - Flags: checked-in+
(In reply to Kim Moir [:kmoir] from comment #27)

> Alin, can you include the db queries/updates you used to fix the issue on
> this bug.  I looked in the mysql console history but you must have attached
> to the db from a different machine than I did.

I first created a temporary table to store the IDs of all build requests that were submitted *after* May 31 07:00 PDT, but corresponding to changes that were done *before* May 31 07:00 PDT.
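The queries in this comment express times as Unix epoch seconds, so it can help to decode the constants before reusing them. The sketch below uses a fixed UTC-7 offset for PDT, which is adequate for these dates. Decoded this way, 1496214000 is 2017-05-31 07:00 UTC, which is 00:00 PDT, so it is worth confirming which cutoff was intended.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Daylight Time (UTC-7); fine for late May 2017.
PDT = timezone(timedelta(hours=-7), "PDT")


def describe(epoch):
    """Return (UTC, PDT) ISO-8601 strings for an epoch timestamp."""
    utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return utc.isoformat(), utc.astimezone(PDT).isoformat()


if __name__ == "__main__":
    for label, epoch in [("cutoff", 1496214000), ("complete_at", 1496223480)]:
        print(label, *describe(epoch))
```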

create temporary table ids
  select buildrequests.id
  from buildrequests, buildsets, sourcestamp_changes, changes
  where changes.changeid = sourcestamp_changes.changeid
    and sourcestamp_changes.sourcestampid = buildsets.sourcestampid
    and buildrequests.buildsetid = buildsets.id
    and buildrequests.complete = 0
    and buildrequests.claimed_at = 0
    and buildername like 'Windows%'
    and buildrequests.submitted_at > 1496214000
    and changes.when_timestamp < 1496214000;

I then simply marked those jobs as completed.

update buildrequests, ids
  set complete = 1, results = 2, complete_at = 1496223480
  where buildrequests.id = ids.id and complete = 0 and claimed_at = 0;
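The select-then-update approach can be rehearsed against a throwaway database before touching production. The sketch below uses Python's sqlite3 module with a heavily simplified schema containing only the columns the queries reference; the schema, the function name and the SQLite dialect are assumptions for illustration. SQLite has no multi-table UPDATE, so the second step is rewritten with an IN subquery.

```python
import sqlite3

CUTOFF = 1496214000       # cutoff epoch used in the queries above
COMPLETE_AT = 1496223480  # completion timestamp used in the update above

# Simplified stand-in schema (assumption), not the real buildbot schema.
SCHEMA = """
CREATE TABLE changes (changeid INTEGER PRIMARY KEY, when_timestamp INTEGER);
CREATE TABLE sourcestamp_changes (sourcestampid INTEGER, changeid INTEGER);
CREATE TABLE buildsets (id INTEGER PRIMARY KEY, sourcestampid INTEGER);
CREATE TABLE buildrequests (
    id INTEGER PRIMARY KEY, buildsetid INTEGER, buildername TEXT,
    submitted_at INTEGER, claimed_at INTEGER DEFAULT 0,
    complete INTEGER DEFAULT 0, results INTEGER, complete_at INTEGER);
"""


def cancel_stale_requests(conn):
    """Complete unclaimed Windows requests submitted after CUTOFF
    for changes made before CUTOFF. Returns the number of rows updated."""
    cur = conn.cursor()
    # Step 1: collect the ids of the affected requests.
    cur.execute("CREATE TEMPORARY TABLE ids (id INTEGER)")
    cur.execute(
        """
        INSERT INTO ids
        SELECT buildrequests.id
        FROM buildrequests
        JOIN buildsets ON buildrequests.buildsetid = buildsets.id
        JOIN sourcestamp_changes
          ON sourcestamp_changes.sourcestampid = buildsets.sourcestampid
        JOIN changes ON changes.changeid = sourcestamp_changes.changeid
        WHERE buildrequests.complete = 0
          AND buildrequests.claimed_at = 0
          AND buildrequests.buildername LIKE 'Windows%'
          AND buildrequests.submitted_at > ?
          AND changes.when_timestamp < ?
        """,
        (CUTOFF, CUTOFF),
    )
    # Step 2: mark them complete, re-checking that they are still unclaimed.
    cur.execute(
        """
        UPDATE buildrequests
        SET complete = 1, results = 2, complete_at = ?
        WHERE id IN (SELECT id FROM ids) AND complete = 0 AND claimed_at = 0
        """,
        (COMPLETE_AT,),
    )
    return cur.rowcount
```

Repeating the complete/claimed_at conditions in the update guards against requests that were claimed between the two statements.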
Flags: needinfo?(aselagea)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard