Upgrade all win7/win10 gecko workers to generic-worker 10.7.8

RESOLVED FIXED in mozilla61

Status

enhancement
RESOLVED FIXED
2 years ago
4 months ago

People

(Reporter: pmoore, Assigned: pmoore)

Tracking

(Blocks 2 bugs)

unspecified
mozilla61

Details

Attachments

(4 attachments)

Assignee

Description

2 years ago
These worker types currently run outdated versions of generic-worker:

> gecko-t-win7-32:      generic-worker 8.2.0
> gecko-t-win7-32-gpu:  generic-worker 8.2.0
> gecko-t-win10-64:     generic-worker 8.3.0

We should upgrade these worker types to 10.2.2 for the following benefits:

  * all tasks user-sandboxed (dedicated user for each task, which is deleted after the task completes)
  * more secure (no access to secrets on machines)
  * more realistic environment (winlogon session belonging to a regular user, with full dedicated desktop environment)
  * tasks cannot (intentionally or accidentally) interfere with each other
  * latest features available
  * includes several bug fixes and logging/monitoring improvements
  * avoid needing to maintain two different branches of generic-worker

Currently when upgrading we hit these failures:

> Windows 7 opt:
>   tc-M-e10s(5 bc1)
> 
> Windows 7 debug:
>   tc-M-e10s(5 bc5) tc-M(5 bc1 bc7)
> 
> windows7-32-stylo-disabled opt:
>   tc-M-e10s(5 bc2)
> 
> windows7-32-stylo-disabled debug:
>   tc-M-e10s(5 bc2)

> Windows 10 x64 opt:
>   tc-X(X)
> 
> Windows 10 x64 debug:
>   tc-X(X) tc-M-e10s(5)
> 
> windows10-64-stylo-disabled opt:
>   tc-M-e10s(5)
> 
> windows10-64-stylo-disabled debug:
>   tc-M-e10s(5)

Fixing these failures will allow us to roll out new worker features to these worker types.

This bug originates from https://bugzilla.mozilla.org/show_bug.cgi?id=1382204#c59
Assignee

Comment 1

2 years ago
Add this commit to your try push to switch to generic-worker 10.2.2:

  * https://hg.mozilla.org/try/raw-rev/835faadcf252b3b019476c713ab5459ccc6af951
Assignee

Updated

2 years ago
Duplicate of this bug: 1373722
Assignee

Comment 3

2 years ago
Hi Joel,

Is this something you can help me with?

Many thanks,
Pete
Flags: needinfo?(jmaher)
:pmoore, you can followup with :mattn for the alert/dialog/notification failures and :rstrong for the xpcshell failures related to installation/updating.

I could help after the migration, but as it stands many on our team are at full capacity for the rest of the month.
Flags: needinfo?(jmaher)
Assignee

Comment 5

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> :pmoore, you can followup with :mattn for the alert/dialog/notification
> failures and :rstrong for the xpcshell failures related to
> installation/updating.
> 
> I could help after the migration, but as it stands many on our team are at
> full capacity for the rest of the month.

Matt, could you help me diagnose these alert/dialog/notification failures?
Robert, could you help me diagnose the xpcshell failures?

Many thanks guys!
Pete
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(MattN+bmo)
Assignee

Comment 6

2 years ago
For convenience I've made a new try push from latest mozilla-central revision:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa0ec3f9f9c876dc0ec7b8d8237d86f206dcbb51

I just did this by applying the patch from comment 1 against the latest mozilla-central revision (893fe1549e1e).
Assignee

Comment 7

2 years ago
I spoke to Robert over IRC and he kindly pointed me to https://bugzilla.mozilla.org/show_bug.cgi?id=1067756#c21

That would explain it, because after the worker upgrade, task users would not have write access to these directories.
Flags: needinfo?(robert.strong.bugs)
Assignee

Comment 8

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #7)
> I spoke to Robert over IRC and he kindly pointed me to
> https://bugzilla.mozilla.org/show_bug.cgi?id=1067756#c21
> 
> That would explain it, because after the worker upgrade, task users would
> not have write access to these directories.

Rebuilding win7/win10 beta worker types in OpenCloudConfig:
  * https://tools.taskcluster.net/groups/L6NZFeUNSlabxdmqJSgcPw

Thanks Robert!
Assignee

Comment 10

2 years ago
rstrong has also just highlighted to me that the privilege granted in bug 1353889 for the GenericWorker account will need to be granted to the task users

e.g. for gecko-t-win10-64-beta that would be:
https://github.com/mozilla-releng/OpenCloudConfig/blob/b20fe867f48dde8dd040b339eef21047ca5b728d/userdata/Manifest/gecko-t-win10-64-beta.json#L1147
Assignee

Comment 11

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> For convenience I've made a new try push from latest mozilla-central
> revision:
> 
>   *
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=aa0ec3f9f9c876dc0ec7b8d8237d86f206dcbb51
> 
> I just did this by applying the patch from comment 1 against the latest
> mozilla-central revision (893fe1549e1e).

gecko-1-b-win2012-beta was broken, hopefully fixed now with:

https://github.com/mozilla-releng/OpenCloudConfig/commit/bae0c1c78410197cb64a2e9e122b37e6e515255e

When the rollout of the new AMIs completes, we should be able to retrigger those broken builds.

AMI rollouts are happening in:

  * https://tools.taskcluster.net/groups/bv4SHLSIT3eEsKUbriOoVw/tasks/LjPipx7TQTKl9IuMaACpZw/runs/0
Assignee

Comment 12

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> rstrong has also just highlighted to me that the privilege granted in bug
> 1353889 for the GenericWorker account will need to be granted to the task
> users
> 
> e.g. for gecko-t-win10-64-beta that would be:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/
> b20fe867f48dde8dd040b339eef21047ca5b728d/userdata/Manifest/gecko-t-win10-64-
> beta.json#L1147

Applied to beta worker types:
  * https://github.com/mozilla-releng/OpenCloudConfig/commit/02a45d1e675b25f95bf764b095fa037181d5aa2b
Assignee

Comment 15

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> New push with fixes:
> 
>   *
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=bdd9ec4a5222cf3a3b82db8d155015e655fbc986

This hasn't fixed the tc-X(X) tests yet. I'm now checking whether the change to make the 'Mozilla Maintenance Service' folder writable to task users worked, using this test task:

  * https://tools.taskcluster.net/groups/OO_g0scBShmyWdRDzYQ7mg/tasks/OO_g0scBShmyWdRDzYQ7mg/details
Assignee

Comment 16

2 years ago
Backslash escaping syntax mistake in previous task, new task:

  * https://tools.taskcluster.net/groups/Z_FR3TFAQ-axtTqdVOV93w/tasks/Z_FR3TFAQ-axtTqdVOV93w/details
Assignee

Comment 17

2 years ago
Updated task:

  * https://tools.taskcluster.net/groups/DlCMdhvqS5us3EIgcm6Fiw/tasks/DlCMdhvqS5us3EIgcm6Fiw/runs/0/logs/public%2Flogs%2Flive.log

Looks like my change[1] didn't work for some reason:


Z:\task_1505581330>echo hellooo  1>"C:\Program Files (x86)\Mozilla Maintenance Service\hello.txt" 
Access is denied.
[taskcluster 2017-09-16T17:05:27.754Z]    Exit Code: 1


---

[1] https://github.com/mozilla-releng/OpenCloudConfig/commit/b20fe867f48dde8dd040b339eef21047ca5b728d
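
The write check above uses echo with a redirect. A minimal Python sketch of the same probe follows, assuming Python is available in the task environment; the helper name can_write is made up for illustration and is not part of generic-worker or OpenCloudConfig.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a file can be created in (and removed from) directory."""
    try:
        # Use a uniquely named probe file so existing content is never clobbered.
        fd, path = tempfile.mkstemp(prefix="write-probe-", dir=directory)
    except OSError:
        # Covers "Access is denied" as well as a missing directory.
        return False
    os.close(fd)
    os.remove(path)
    return True


if __name__ == "__main__":
    target = r"C:\Program Files (x86)\Mozilla Maintenance Service"
    print(target, "writable:", can_write(target))
```

Like the echo command, this only tells us whether the current task user can create a file there; it does not say which ACL entry is responsible.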
Assignee

Comment 18

2 years ago
I still haven't got around to further investigating the issue in comment 17, and I'll be out for a few days now.

Rob, if you get the time to look into this, it would be awesome, otherwise I can have a look when I'm back next week. Basically, I made a change to a manifest in OCC so that a directory is read/writable to Everyone, rolled everything out, but I can't write to that directory in a task. All the links are in comment 17.

Like I say, I can also take a look when I'm back next week. Thanks guys!
Flags: needinfo?(rthijssen)
iirc the build system Windows images have the maintenance service installed. Do these systems?
my guess is that the command isn't succeeding because of missing quotes around the arg with spaces in it.

eg this line (https://github.com/mozilla-releng/OpenCloudConfig/commit/b20fe867f48dde8dd040b339eef21047ca5b728d#diff-87907bc6a1f2a26aacbddd5425eea212R977):

reads:
"C:\\Program Files (x86)\\Mozilla Maintenance Service",

but should read:
"\"C:\\Program Files (x86)\\Mozilla Maintenance Service\"",
Flags: needinfo?(rthijssen)
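
To illustrate the quoting guess above, here is a small Python sketch. It assumes (this is the guess, not a confirmed fact) that the manifest arguments end up joined into a single command line without extra quoting. The helper manifest_literal is hypothetical.

```python
import json

path = r"C:\Program Files (x86)\Mozilla Maintenance Service"


def manifest_literal(argument, quote):
    """Return the JSON string literal for one entry of an Arguments array."""
    value = '"{}"'.format(argument) if quote else argument
    return json.dumps(value)


unquoted = manifest_literal(path, quote=False)
quoted = manifest_literal(path, quote=True)

print(unquoted)  # the form the manifest currently has
print(quoted)    # the form with embedded, escaped quotes

# Without embedded quotes, a whitespace-splitting consumer sees several arguments.
print(json.loads(unquoted).split(" "))
```

The second printed literal is the "should read" form from the comment: the inner quotes survive JSON decoding, so the path reaches the command line as one quoted argument.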
Assignee

Updated

2 years ago
Blocks: 1370877
Assignee

Comment 21

2 years ago
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #20)
> my guess is that the command isn't succeeding because of missing quotes
> around the arg with spaces in it.
> 
> eg this line
> (https://github.com/mozilla-releng/OpenCloudConfig/commit/
> b20fe867f48dde8dd040b339eef21047ca5b728d#diff-
> 87907bc6a1f2a26aacbddd5425eea212R977):
> 
> reads:
> "C:\\Program Files (x86)\\Mozilla Maintenance Service",
> 
> but should read:
> "\"C:\\Program Files (x86)\\Mozilla Maintenance Service\"",

Thanks Rob! I'm trying a push with this now, to see if it helps. Thanks for looking. :)

https://github.com/mozilla-releng/OpenCloudConfig/commit/3357d441046aad0442f2032aed6d7cc78bf996c1
Assignee

Comment 24

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #23)
> Retrying failed xpcshell tasks:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=bdd9ec4a5222cf3a3b82db8d155015e655fbc986&filter-
> searchStr=xpcshell&duplicate_jobs=visible

That fixed xpcshell!

Now just the mochitests left.
great stuff
Assignee

Comment 26

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #25)
> great stuff

All credit goes to rstrong and grenade for that one :-)

New try push against latest mozilla central head revision:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3e3fd07da461fc449a0038c51b69db7392a2e2f&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded
there are a lot of failures which all seem to be related to popups/notifications/other windows.  For example, bc1 already runs on taskcluster VM and passes as tier-1, but it is failing with this change.

I verified in the log that we are running:
10:54:42     INFO -  17 INFO TEST-START | browser/base/content/test/alerts/browser_notification_do_not_disturb.js
10:54:45     INFO -  GECKO(2968) | MEMORY STAT | vsize 1742MB | vsizeMaxContiguous 131597346MB | residentFast 260MB | heapAllocated 108MB
10:54:45     INFO -  18 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_do_not_disturb.js | took 3117ms
10:54:45     INFO -  19 INFO checking window state
10:54:45     INFO -  20 INFO TEST-START | browser/base/content/test/alerts/browser_notification_open_settings.js
10:54:47     INFO -  GECKO(2968) | MEMORY STAT | vsize 1791MB | vsizeMaxContiguous 131597346MB | residentFast 311MB | heapAllocated 140MB
10:54:47     INFO -  21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 2163ms
10:54:47     INFO -  22 INFO checking window state
10:54:47     INFO -  23 INFO TEST-START | browser/base/content/test/alerts/browser_notification_remove_permission.js
10:54:48     INFO -  GECKO(2968) | MEMORY STAT | vsize 1792MB | vsizeMaxContiguous 131597346MB | residentFast 312MB | heapAllocated 142MB
10:54:48     INFO -  24 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_remove_permission.js | took 785ms
10:54:48     INFO -  25 INFO checking window state
10:54:48     INFO -  26 INFO TEST-START | browser/base/content/test/alerts/browser_notification_replace.js
10:54:49     INFO -  GECKO(2968) | MEMORY STAT | vsize 1792MB | vsizeMaxContiguous 131597346MB | residentFast 296MB | heapAllocated 117MB
10:54:49     INFO -  27 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_replace.js | took 462ms
10:54:49     INFO -  28 INFO checking window state
10:54:49     INFO -  29 INFO TEST-START | browser/base/content/test/alerts/browser_notification_tab_switching.js
10:54:49     INFO -  GECKO(2968) | MEMORY STAT | vsize 1783MB | vsizeMaxContiguous 131597346MB | residentFast 278MB | heapAllocated 104MB
10:54:49     INFO -  30 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_tab_switching.js | took 759ms


but on your try push I see failures like this:
07:58:53     INFO -  2 INFO TEST-START | browser/base/content/test/alerts/browser_notification_open_settings.js
07:59:38     INFO -  TEST-INFO | started process screenshot
07:59:38     INFO -  TEST-INFO | screenshot: exit 0
07:59:38     INFO -  Buffered messages logged at 07:58:53
07:59:38     INFO -  3 INFO Entering test bound test_settingsOpen_observer
07:59:38     INFO -  4 INFO Opening a dummy tab so openPreferences=>switchToTabHavingURI doesn't use the blank tab.
07:59:38     INFO -  5 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
07:59:38     INFO -  6 INFO simulate a notifications-open-settings notification
07:59:38     INFO -  7 INFO TEST-PASS | browser/base/content/test/alerts/browser_notification_open_settings.js | The notification settings tab opened -
07:59:38     INFO -  Buffered messages logged at 07:58:54
07:59:38     INFO -  8 INFO Leaving test bound test_settingsOpen_observer
07:59:38     INFO -  9 INFO Entering test bound test_settingsOpen_button
07:59:38     INFO -  10 INFO Adding notification permission
07:59:38     INFO -  11 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
07:59:38     INFO -  12 INFO Console message: [JavaScript Warning: "Unknown pseudo-class or pseudo-element ‘-moz-tree-line’.  Ruleset ignored due to bad selector." {file: "chrome://global/content/xul.css" line: 654}]
07:59:38     INFO -  13 INFO Waiting for notification
07:59:38     INFO -  Buffered messages finished
07:59:38    ERROR -  14 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_open_settings.js | Test timed out -
07:59:38     INFO -  GECKO(3176) | MEMORY STAT | vsize 685MB | vsizeMaxContiguous 804MB | residentFast 195MB | heapAllocated 63MB
07:59:38     INFO -  15 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 45078ms
07:59:38     INFO -  Not taking screenshot here: see the one that was previously logged
07:59:38    ERROR -  16 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_open_settings.js | Found a tab after previous test timed out: http://example.org/browser/browser/base/content/test/alerts/file_dom_notifications.html -
07:59:38     INFO -  17 INFO checking window state
07:59:38     INFO -  18 INFO TEST-START | browser/base/content/test/alerts/browser_notification_remove_permission.js
08:00:23     INFO -  Not taking screenshot here: see the one that was previously logged
08:00:23     INFO -  Buffered messages logged at 07:59:38
08:00:23     INFO -  19 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
08:00:23     INFO -  Buffered messages finished
08:00:23    ERROR -  20 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_remove_permission.js | Test timed out -
08:00:23     INFO -  GECKO(3176) | MEMORY STAT | vsize 684MB | vsizeMaxContiguous 804MB | residentFast 195MB | heapAllocated 65MB
08:00:23     INFO -  21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_remove_permission.js | took 45328ms



given that, it looks like there is an issue with focus with the worker changes you are making :pete.  Can you run the test on a loaner before/after your changes and watch what is going on?  I suspect the answer might be obvious.
Flags: needinfo?(MattN+bmo)
Assignee

Updated

2 years ago
Blocks: 1403490
Assignee

Updated

2 years ago
No longer blocks: 1403490
There was some investigation of those notification-related things in bug 1364517.
I'm hoping that attachment 8916804 [details] will fix any test failures related to browser/base/content/test/alerts/*. The issue being fixed is a race condition not specific to Windows 10 though so it may not be sufficient.
Assignee

Updated

2 years ago
Depends on: 1352791
> Depends on: 1352791

FYI, even though that bug is still open, I landed a fix there for Windows 10. Is that bug still blocking this landing? Does this change make the Win7 failures worse?
Assignee

Updated

2 years ago
Blocks: 1373551
Assignee

Updated

2 years ago
Blocks: 1419974
Assignee

Updated

2 years ago
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.2.2 or later → Upgrade all win7/win10 gecko workers to generic-worker 10.2.3
Assignee

Updated

2 years ago
Blocks: 1394557
Assignee

Updated

2 years ago
Blocks: 1343049
Pete, what are the next steps on this bug?
Assignee

Comment 35

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #34)
> Pete, what are the next steps on this bug?

I've just landed https://github.com/mozilla-releng/OpenCloudConfig/pull/111 which updates all of our beta worker types to be identical to our production worker types, except for generic-worker version and configuration.

I've also rigorously updated our worker type definitions in the aws provisioner to make sure the beta worker types also match the production versions.

When the OpenCloudConfig changes have propagated to AWS, I'll trigger a new try job using the beta worker types to see what issues remain.

This is much like I did in previous comments - just refreshing to latest versions, and then triggering a new try push.

I suspect my try push will have to wait for tomorrow as it takes a couple of hours for all the changes to propagate, but hopefully by this time tomorrow we should have a new completed try push that we can evaluate.
please do a --rebuild 20 on your try push, that will help get data on failure rates.
Assignee

Comment 37

2 years ago
So I've been having some problem getting the last beta worker type updated - gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys until now have been failing due to either not being able to get instances, or the instances I had losing network connectivity (so it isn't possible to see what is going wrong).

This could just be a case of bug 1372172 hitting us during AMI creation.

See e.g. last failed OCC task just now: https://tools.taskcluster.net/groups/R4iy9VIJSB6tG1f4UwVgmA/tasks/DTdqlTWNRC-naG3mtCgitg/runs/0/logs/public%2Flogs%2Flive.log

That links to a papertrail log that stops outputting for 95 minutes between 14:50:01 and 16:25:05 CET:
  https://papertrailapp.com/groups/2488493/events?q=i-09768a467658be83e



Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running. 
Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is running. 
Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: instance appears to be initialising. 
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running. 
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is running. 
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: instance appears to be initialising. 
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Microsoft-Windows-GroupPolicy: Shutdown script failed.   	GPO Name : Local Group Policy  	GPO File System Path : C:\Windows\System32\GroupPolicy\Machine  	Script Name: C:\scripts\set_user_data.ps1 
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000000100000. 
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error:   The request is not supported. 
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com OpenCloudConfig: Windows update service is running 




The above log extract indicates a problem, since HaltOnIdle is scheduled to run every 2 minutes, which it does up until 14:50, and then for 95 minutes we have no logging until we see the machine has rebooted. This suggests the worker is still running, but either loses network connectivity or the papertrail integration breaks. I did not reboot the machine from the outside, so I presume it rebooted as part of the environment preparation it was internally performing.
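
For anyone wanting to spot such gaps without reading the log by eye, here is a rough Python sketch. The timestamps are copied from the extract above; the year and the helper name gaps_in_minutes are assumptions added for illustration.

```python
from datetime import datetime

# Timestamps of the last two HaltOnIdle runs and the first line after the silence.
stamps = ["Dec 01 14:48:01", "Dec 01 14:50:01", "Dec 01 16:25:05"]


def gaps_in_minutes(timestamps, year=2017):
    """Return the gap in minutes between each pair of consecutive timestamps."""
    parsed = [
        datetime.strptime("{} {}".format(year, stamp), "%Y %b %d %H:%M:%S")
        for stamp in timestamps
    ]
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for start, gap in zip(stamps, gaps_in_minutes(stamps)):
        print("{}: next entry after {:.1f} minutes".format(start, gap))
```

Any gap much larger than the 2 minute HaltOnIdle schedule marks a period where the instance was either not logging or not reachable.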


The observant log reader may also notice a repeated message earlier in the log:

    "An error occurred (InstanceLimitExceeded) when calling the RunInstances operation: You have requested more instances (2) than your current instance limit of 1 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit."

This occurred multiple times in the log before the above extract, because there was another g3.4xlarge instance running in us-west-2, when we have a limit of 1. This other running instance was probably a runaway instance from a previously timed-out OCC task for a previous push. By terminating that rogue instance, I was able to get the task to continue. The price of this though, was a delay in time that ate into the maxRunTime of the task. So maybe the task might have been successful if it had been able to spawn a g3.4xlarge instance in us-west-2 immediately, rather than needing to wait for someone to terminate the running one. For future reference, such a rogue instance can be seen under:
  https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:search=g3.4xlarge;sort=instanceId
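
The same check could be scripted instead of using the console. This is only a sketch, assuming a boto3-style EC2 client is passed in. The function name running_instances is made up, and pagination of large result sets is ignored.

```python
def running_instances(ec2, instance_type):
    """Return (instance id, launch time) pairs for running instances of a type.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance-type", "Values": [instance_type]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    found = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            found.append((instance["InstanceId"], instance["LaunchTime"]))
    return found
```

With boto3 installed and credentials configured, it could be called as running_instances(boto3.client("ec2", region_name="us-west-2"), "g3.4xlarge"). An instance with an old launch time and no matching OCC task would be a candidate rogue instance.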



I've now ensured that there are no g3.4xlarge instances running in us-west-2, and made a new OCC push to try to rebuild gecko-t-win10-64-gpu-b again:

  https://tools.taskcluster.net/groups/GMo_mDgUSZqubrB74-7rrA/tasks/Kz1LVo5fS5ak5U1nWjR0MQ/details

I may not be around when this completes, but if it does complete successfully, I have prepared a try patch here: 
  https://bugzilla.mozilla.org/show_bug.cgi?id=1400012#c9
that makes it trivial to run a try push against the beta worker types - so if anyone wants to make a try push once the above task completes successfully, this patch should do the trick.
Assignee

Updated

2 years ago
Depends on: 1422870
Assignee

Comment 38

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #37)
> So I've been having some problem getting the last beta worker type updated -
> gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys
> until now have been failing due to either not being able to get instances,
> or the instances I had losing network connectivity (so it isn't possible to
> see what is going wrong).

All problems updating gecko-t-win10-64-gpu-b have been solved now, so I have made a new try push here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=67581af6162e8c0dfaaa726c3fda298ef576a846&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable

Let's see how that goes! :-)
Assignee

Comment 39

2 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #38)
> (In reply to Pete Moore [:pmoore][:pete] from comment #37)
> > So I've been having some problem getting the last beta worker type updated -
> > gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys
> > until now have been failing due to either not being able to get instances,
> > or the instances I had losing network connectivity (so it isn't possible to
> > see what is going wrong).
> 
> All problems updating gecko-t-win10-64-gpu-b have been solved now, so I have
> made a new try push here:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=67581af6162e8c0dfaaa726c3fda298ef576a846&filter-
> tier=1&filter-tier=2&filter-
> tier=3&duplicate_jobs=visible&group_state=expanded&filter-
> searchStr=windows&filter-resultStatus=testfailed&filter-
> resultStatus=busted&filter-resultStatus=exception&filter-
> resultStatus=runnable
> 
> Let's see how that goes! :-)

Some problems with Y: drive not getting mounted on some win 7 gpu jobs - but other than that, I think worth taking a look at already.
Flags: needinfo?(jmaher)
Assignee

Comment 40

2 years ago
Rob any idea what the Y: drive mounting problem on gecko-t-win7-32-gpu-b might be related to?

Thanks!
Flags: needinfo?(rthijssen)
Assignee

Comment 41

2 years ago
From this screenshot from a problematic worker[1] we see that the Y: drive got mounted as E: instead of Y:

--

[1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win7-32-gpu-b/workers/us-east-1/i-03115ac894967073d
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Assignee

Comment 42

2 years ago
I spotted that drive letters can be mapped in DriveLetterConfig.xml[1]. I think at the moment we are mounting drives in rundsc.ps1[2]. See EC2 docs[3] for details on the DriveLetterConfig.xml file.

If using DriveLetterConfig.xml works, could that be an alternative to mounting in rundsc.ps1?

I checked one of our instances, and saw the file exists, but contains no mappings at the moment:

Z:\task_1512497232>type "C:\Program Files\Amazon\Ec2ConfigService\Settings\DriveLetterConfig.xml" 
<?xml version="1.0" standalone="yes"?>
<DriveLetterMapping>
</DriveLetterMapping>


--

[1] C:\Program Files\Amazon\Ec2ConfigService\Settings\DriveLetterConfig.xml
[2] https://github.com/mozilla-releng/OpenCloudConfig/blob/cebf4fc5888510550a09f1ccdcf0d4001d7c32ec/userdata/rundsc.ps1#L306-L367
[3] http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/UsingConfig_WinAMI.html#UsingConfigInterface_WinAMI
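
As a rough illustration of what a mapping entry might look like, here is a Python sketch that reads the file. The Mapping, VolumeName and DriveLetter element names are my recollection of the EC2Config format and should be checked against the EC2 docs[3]. The volume name "task-drive" is invented.

```python
import xml.etree.ElementTree as ET

# Content of the file as shown on the worker: no mappings defined.
CURRENT = """<?xml version="1.0" standalone="yes"?>
<DriveLetterMapping>
</DriveLetterMapping>
"""

# Hypothetical mapping; the volume name "task-drive" is invented for illustration.
EXAMPLE = """<?xml version="1.0" standalone="yes"?>
<DriveLetterMapping>
  <Mapping>
    <VolumeName>task-drive</VolumeName>
    <DriveLetter>Y:</DriveLetter>
  </Mapping>
</DriveLetterMapping>
"""


def drive_mappings(xml_text):
    """Return {volume name: drive letter} for every Mapping element."""
    root = ET.fromstring(xml_text)
    return {
        mapping.findtext("VolumeName", default=""): mapping.findtext("DriveLetter", default="")
        for mapping in root.findall("Mapping")
    }


if __name__ == "__main__":
    print(drive_mappings(CURRENT))
    print(drive_mappings(EXAMPLE))
```

A mapping keyed on volume name would only help if the volume is consistently labelled before EC2Config runs, which I have not verified.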
this is failing to get rid of xendpriv.exe (look at the logs for 'cl').  This is a hack I put in to allow clipboard to run on the machines- we cannot remove the file due to access denied, I suspect you need to remove this in the setup, or fix the taskcluster worker to allow access to that file.

outside of this, many tests are failing for prompt/notification/multi-window- that is a common theme from earlier when looking at taskcluster workers for windows in the past.
Flags: needinfo?(jmaher)
most likely explanation is that gw started running before occ was able to assign the correct drive letter. we see in the logs that this is often the case on windows 2012 where we use newer gw. since the worker type experiencing the incorrect drive mappings is also running a newer gw, i suspect this is also the case here. when occ detects that gw has started before occ, it simply terminates itself (as a workaround to other issues experienced earlier) since we can't have both running. note that using ec2 to assign the drive letter in DriveLetterConfig.xml will not be 100% effective as gw also doesn't wait for ec2config to complete before it starts up. imo the best fix is to add a check inside gw to wait until occ has set the ready state flag before attempting to run tasks. there's simply nothing we can do in occ to get the drive mappings correct if gw starts before occ has run.
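
a rough sketch of the kind of wait described above, written in Python purely for illustration (gw itself is written in Go). how occ exposes its ready state is not specified here, so is_ready is a hypothetical callable standing in for that check, and the timeout values are arbitrary.

```python
import time


def wait_until_ready(is_ready, timeout=600.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout seconds have passed.

    Returns True if the ready state was seen, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

the worker would call this once at startup and only begin claiming tasks if it returns True. on timeout it is probably safer to refuse tasks than to run them with the wrong drive mappings.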
Flags: needinfo?(rthijssen)
Assignee

Comment 45

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #43)
> this is failing to get rid of xendpriv.exe (look at the logs for 'cl'). 
> This is a hack I put in the allow clipboard to run on the machines- we
> cannot remove the file due to access denied, I suspect you need to remove
> this in the setup, or fix the taskcluster worker to allow access to that
> file.
> 
> outside of this, many tests are failing for
> prompt/notification/multi-window- that is a common theme from earlier when
> looking at taskcluster workers for windows in the past.

Indeed - it looks like this file is included on the workers. Rob, do you know why this file is there, and where it comes from? Should this whole "C:\Program Files\Citrix\XenTools" directory be there at all?


-----



[taskcluster 2017-12-06T11:26:35.251Z] Worker Type (gecko-t-win7-32-beta) settings:
[taskcluster 2017-12-06T11:26:35.251Z]   {
[taskcluster 2017-12-06T11:26:35.251Z]     "aws": {
[taskcluster 2017-12-06T11:26:35.251Z]       "ami-id": "ami-18038d77",
[taskcluster 2017-12-06T11:26:35.251Z]       "availability-zone": "eu-central-1c",
[taskcluster 2017-12-06T11:26:35.251Z]       "instance-id": "i-00f0b6a2a91125f87",
[taskcluster 2017-12-06T11:26:35.251Z]       "instance-type": "c4.2xlarge",
[taskcluster 2017-12-06T11:26:35.251Z]       "local-ipv4": "10.147.50.97",
[taskcluster 2017-12-06T11:26:35.251Z]       "public-hostname": "ec2-18-194-58-53.eu-central-1.compute.amazonaws.com",
[taskcluster 2017-12-06T11:26:35.251Z]       "public-ipv4": "18.194.58.53"
[taskcluster 2017-12-06T11:26:35.251Z]     },
[taskcluster 2017-12-06T11:26:35.251Z]     "config": {
[taskcluster 2017-12-06T11:26:35.251Z]       "deploymentId": "bec3aef21ffa",
[taskcluster 2017-12-06T11:26:35.251Z]       "runTasksAsCurrentUser": false
[taskcluster 2017-12-06T11:26:35.251Z]     },
[taskcluster 2017-12-06T11:26:35.251Z]     "generic-worker": {
[taskcluster 2017-12-06T11:26:35.251Z]       "go-arch": "386",
[taskcluster 2017-12-06T11:26:35.251Z]       "go-os": "windows",
[taskcluster 2017-12-06T11:26:35.251Z]       "go-version": "go1.9",
[taskcluster 2017-12-06T11:26:35.251Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.3.1",
[taskcluster 2017-12-06T11:26:35.251Z]       "revision": "bc1ecb9aa266105bf8a936fa451bff4e2a35843e",
[taskcluster 2017-12-06T11:26:35.251Z]       "source": "https://github.com/taskcluster/generic-worker/tree/bc1ecb9aa266105bf8a936fa451bff4e2a35843e",
[taskcluster 2017-12-06T11:26:35.251Z]       "version": "10.3.1"
[taskcluster 2017-12-06T11:26:35.251Z]     },
[taskcluster 2017-12-06T11:26:35.251Z]     "machine-setup": {
[taskcluster 2017-12-06T11:26:35.251Z]       "ami-created": "2017-12-05 14:36:21.569Z",
[taskcluster 2017-12-06T11:26:35.251Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/bec3aef21ffac1363747d6d5dc49079be1b61d1c/userdata/Manifest/gecko-t-win7-32-beta.json"
[taskcluster 2017-12-06T11:26:35.251Z]     }
[taskcluster 2017-12-06T11:26:35.251Z]   }
[taskcluster 2017-12-06T11:26:35.251Z] Task ID: a_J-nNAvT2S_g1HbHQnypg
[taskcluster 2017-12-06T11:26:35.251Z] === Task Starting ===
[taskcluster 2017-12-06T11:26:36.299Z] Uploading redirect artifact public/logs/live.log to URL https://clbduniaaaawak4uxpi4qn4c3mgrkwj5uxhp3xkefmvo3mhn.taskcluster-worker.net:60023/log/TorEb--jSeqsgkgKBjqzPw with mime type "text/plain; charset=utf-8" and expiry 2017-12-06T11:27:35.889Z
[taskcluster 2017-12-06T11:26:36.738Z] Executing command 0: dir "C:\Program Files\Citrix\XenTools\XenDPriv.exe"
Z:\task_1512559551>dir "C:\Program Files\Citrix\XenTools\XenDPriv.exe" 
 Volume in drive C is OSDisk
 Volume Serial Number is FC62-2D8F
 Directory of C:\Program Files\Citrix\XenTools
04/08/2014  04:07 PM            12,288 XenDPriv.exe
               1 File(s)         12,288 bytes
               0 Dir(s)  11,514,023,936 bytes free
[taskcluster 2017-12-06T11:26:36.780Z]    Exit Code: 0
[taskcluster 2017-12-06T11:26:36.780Z] Success Code: 0x0
[taskcluster 2017-12-06T11:26:36.780Z]    User Time: 15.6001ms
[taskcluster 2017-12-06T11:26:36.780Z]  Kernel Time: 0s
[taskcluster 2017-12-06T11:26:36.780Z]    Wall Time: 30ms
[taskcluster 2017-12-06T11:26:36.780Z]  Peak Memory: 2273280
[taskcluster 2017-12-06T11:26:36.780Z]       Result: SUCCEEDED
[taskcluster 2017-12-06T11:26:36.780Z] === Task Finished ===
[taskcluster 2017-12-06T11:26:36.780Z] Task Duration: 42ms
Pete, that is related to the xen vm toolchain that amazon uses for its workers.  We need some of the Xen tools, but not that specific file which luckily works for fixing our clipboard problems.
Assignee

Updated

2 years ago
See Also: → 1394757
Assignee

Comment 47

2 years ago
I added the following to remove it from the golden AMIs (see https://github.com/mozilla-releng/OpenCloudConfig/commit/17db37e19674751ff1baacb9da438f494a148663):


+    {
+      "ComponentName": "DeleteXenDPriv.exe",
+      "ComponentType": "CommandRun",
+      "Comment": "See https://bugzilla.mozilla.org/show_bug.cgi?id=1399401#c43 and https://bugzilla.mozilla.org/show_bug.cgi?id=1394757",
+      "Command": "cmd.exe",
+      "Arguments": [
+        "/c",
+        "del",
+        "/f",
+        "/q",
+        "\"C:\\Program Files\\Citrix\\XenTools\\XenDPriv.exe\""
+      ],
+      "Validate": {
+        "PathsNotExist": [
+          "C:\\Program Files\\Citrix\\XenTools\\XenDPriv.exe"
+        ]
+      }
+    },


But the logs on a live worker show this isn't able to delete the file:

20171206164250-DeleteXenDPriv.exe-stderr.log
============================================
Access is denied.


Running attrib shows that the file is not read-only, which was my first thought about why we are not able to delete it:

Z:\task_1512656869>attrib "C:\Program Files\Citrix\XenTools\XenDPriv.exe" 
A       I    C:\Program Files\Citrix\XenTools\XenDPriv.exe


Rob (:grenade) suggested it might be because the file is in use. I'll look more in depth into bug 1394757 to see how this file was deleted in the test setup code before.

Note deleting this file during test setup no longer works, because tests do not run as admin.

Also, deleting it during test setup is bad because it changes system state: tests that run before the file is deleted could well behave differently from tests that run after it is deleted. It is better for the file never to make it into a live environment in the first place, so that the test environment is consistent between test runs. That is why I have chosen to delete it entirely from the worker environment in OpenCloudConfig.
what we do in the test script is:
1) kill XenDPriv.exe
2) rename the file (but deleting is fine as well)

it will try to restart all the time if you just kill the process and the file exists.
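
A rough sketch of that kill-then-rename sequence, in Python rather than the actual test script. The helper names and the ".disabled" suffix are hypothetical, and the kill step is injectable so the rename logic can be exercised without Windows; `taskkill /F /IM` is the standard Windows command for force-killing by image name.

```python
import ntpath
import os
import subprocess


def kill_process(image_name):
    # taskkill /F forces termination, /IM selects processes by image name.
    # check=False: a missing process is not treated as an error here.
    subprocess.run(["taskkill", "/F", "/IM", image_name], check=False)


def disable_executable(path, kill=kill_process, suffix=".disabled"):
    """Kill the running process, then rename its file so it cannot restart."""
    # Step 1: stop the running instance (it holds the file open).
    kill(ntpath.basename(path))
    # Step 2: move the file aside; the suffix is a hypothetical choice,
    # and deleting the file instead would work as well.
    renamed = path + suffix
    os.rename(path, renamed)
    return renamed
```

Both steps need administrator rights on the worker, so this would have to run during machine setup rather than inside a task.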
pmoore: if this file exists on the base ami, we can either remove it there and bake a new base ami or put the logic to kill it with fire in this method:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/rundsc.ps1#L105

i'd kind of like it in the remove-legacystuff method just so there's a record in source code that we are doing so deliberately but not too fussed since this bug is also a good record.
Assignee

Comment 50

2 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #49)
> i'd kind of like it in the remove-legacystuff method just so there's a
> record in source code that we are doing so deliberately but not too fussed
> since this bug is also a good record.

For now I've removed it in the manifest, but we can move it to rundsc.ps1 if you prefer.

---

I've made a new try push here:

  
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce439beeb616415a842e11a69d1ad10a58117eef&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable

---

(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #36)
> please do a --rebuild 20 on your try push, that will help get data on
> failure rates.

I've done this in the new try push above - hope I got the syntax right (I just added it to the end)! :-)
Pete, from comment 43:
outside of this, many tests are failing for prompt/notification/multi-window- that is a common theme from earlier when looking at taskcluster workers for windows in the past.

this holds true with your latest push, none of the test failures were fixed, so --rebuild 20 seemed a bit overkill.
Assignee

Comment 52

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #51)
> Pete, from comment 43:
> outside of this, many tests are failing for
> prompt/notification/multi-window- that is a common theme from earlier when
> looking at taskcluster workers for windows in the past.

Thanks - do you know if these problems were ever solved?

> this holds true with your latest push, none of the test failures were fixed,
> so --rebuild 20 seemed a bit overkill.

Sorry, I didn't realise --rebuild 20 would also rerun tasks that passed. Indeed, I was pretty alarmed to see how many tasks got generated when I came in this morning! I won't be doing that again unless there is some very strong justification.
Assignee

Comment 53

2 years ago
Summary of permanent failures
=============================

Windows 7 opt:
   1) test-windows7-32/opt-mochitest-browser-chrome-e10s-3
   2) test-windows7-32/opt-mochitest-chrome-3

Windows 7 debug:
   3) test-windows7-32/debug-mochitest-5
   4) test-windows7-32/debug-mochitest-browser-chrome-7
   5) test-windows7-32/debug-mochitest-browser-chrome-e10s-4
   6) test-windows7-32/debug-mochitest-clipboard

Windows 10 opt:
   7) test-windows10-64/opt-mochitest-browser-chrome-e10s-3
   8) test-windows10-64/opt-mochitest-chrome-3
   9) test-windows10-64/opt-mochitest-e10s-5

Windows 10 debug:
  10) test-windows10-64/debug-mochitest-browser-chrome-e10s-1
  11) test-windows10-64/debug-mochitest-chrome-3
  12) test-windows10-64/debug-mochitest-e10s-5
Assignee

Comment 54

2 years ago
One of the failures in test-windows7-32/debug-mochitest-clipboard is:


00:09:34    ERROR -  138 INFO TEST-UNEXPECTED-FAIL | devtools/client/commandline/test/browser_cmd_screenshot.js | arg.filename.value (for 'screenshot C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png') - Got C:\Users	ask_1512691342\AppData\Local\Temp\TestScreenshotFile.png, expected C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png


Here we can see the failure is simply because an escape sequence in the string is being interpreted, i.e. `C:\Users\task` -> `C:\Users<tab>ask` because `\t` is being read as the tab character. This is clearly a buggy test that needs fixing.

I suspect there is something funny going on here: https://dxr.mozilla.org/mozilla-central/rev/457b0fe91e0d49a5bc35014fb6f86729cd5bac9b/devtools/client/commandline/test/browser_cmd_screenshot.js#106
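
To illustrate the failure mode, here is a minimal sketch in Python (not the test's JavaScript). `interpret_escapes` is a hypothetical stand-in for whatever parses the typed command text, not the real GCLI code; it only assumes that known escapes such as `\t` are translated while unknown ones such as `\U` are left alone, which matches the log line above.

```python
def interpret_escapes(text):
    """Hypothetical parser: translate known backslash escapes, keep the rest."""
    known = {"t": "\t", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in known:
            out.append(known[text[i + 1]])
            i += 2  # consume the backslash and the escape character
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def escape_backslashes(path):
    """Double every backslash so the parser yields the original path."""
    return path.replace("\\", "\\\\")


path = r"C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png"
mangled = interpret_escapes(path)                        # "\t" becomes a tab
preserved = interpret_escapes(escape_backslashes(path))  # round-trips intact
```

Only the `\t` in `\task_...` is affected here, which is why the problem appears with per-task user names and not with other user names.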
Flags: needinfo?(jmaher)
Assignee

Comment 55

2 years ago
Hi Matt,

Do you have any ideas about what might be the cause of the failures in comment 50 (and comment 54)? Or do you know if there are any open existing bugs that I can make dependencies of this bug if any of them are currently being investigated?

Thanks!
Flags: needinfo?(MattN+bmo)
:pmoore, interesting find on the clipboard failure, could we make it upper case Task to avoid this?  I agree we should look into a fix for the test.

:jryans, I see you in the file commit history often for browser_cmd_screenshot.js, would you happen to know where we get the filename value and why it might interpret a \t in the full path as a <tab> character? ^^ see comment 54.
Flags: needinfo?(jmaher) → needinfo?(jryans)
Assignee

Comment 57

2 years ago
Out of curiosity I've triggered the tasks from comment 53 again (just once each) but configured the task users to be in the Administrators group (using the "osGroups" feature in generic-worker[1]).

I put them in a single task group here:
  https://tools.taskcluster.net/groups/caeMQxVJQf6ix0UiYgXnvQ

I'm curious if this will fix any of them.

--

[1] https://docs.taskcluster.net/reference/workers/generic-worker/payload
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #56)
> :jryans, I see you in the file commit history often for
> browser_cmd_screenshot.js, would you happen to know where we get the
> filename value and why it might interpret a \t in the full path as a <tab>
> character? ^^ see comment 54.

Hmm, that's a fun one!

I think this patch should fix the issue, but I don't have a simple way to verify it myself.
Flags: needinfo?(jryans)
Assignee

Comment 59

2 years ago
Comment on attachment 8935897 [details] [diff] [review]
Escape backslashes in GCLI screenshot test

Review of attachment 8935897 [details] [diff] [review]:
-----------------------------------------------------------------

::: devtools/client/commandline/test/browser_cmd_screenshot.js
@@ +104,3 @@
>        check: {
>          args: {
>            filename: { value: "" + file.path },

Shouldn't the replace be on line 106 instead of line 103 ? I think line 103 just creates a description, whereas line 106 is the filename that is passed through.
Assignee

Updated

2 years ago
Flags: needinfo?(jryans)
(In reply to Pete Moore [:pmoore][:pete] from comment #59)
> Comment on attachment 8935897 [details] [diff] [review]
> Escape backslashes in GCLI screenshot test
> 
> Review of attachment 8935897 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: devtools/client/commandline/test/browser_cmd_screenshot.js
> @@ +104,3 @@
> >        check: {
> >          args: {
> >            filename: { value: "" + file.path },
> 
> Shouldn't the replace be on line 106 instead of line 103 ? I think line 103
> just creates a description, whereas line 106 is the filename that is passed
> through.

I believe the `setup` is the text to enter, while the `check` block states the expected value certain arguments should have after parsing.

The issue seems to be related to how we parse the backslashes in the input, so that's why I modified the `setup` to escape the text entered.

However if the patch doesn't work and your version does, that's fine too!
Flags: needinfo?(jryans)
Assignee

Updated

a year ago
Blocks: 1360198
Assignee

Comment 61

a year ago
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #60)
> (In reply to Pete Moore [:pmoore][:pete] from comment #59)
> > Comment on attachment 8935897 [details] [diff] [review]
> > Escape backslashes in GCLI screenshot test
> > 
> > Review of attachment 8935897 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: devtools/client/commandline/test/browser_cmd_screenshot.js
> > @@ +104,3 @@
> > >        check: {
> > >          args: {
> > >            filename: { value: "" + file.path },
> > 
> > Shouldn't the replace be on line 106 instead of line 103 ? I think line 103
> > just creates a description, whereas line 106 is the filename that is passed
> > through.
> 
> I believe the `setup` is the text to enter, while the `check` block states
> the expected value certain arguments should have after parsing.
> 
> The issue seems to be related to how we parse the backslashes in the input,
> so that's why I modified the `setup` to escape the text entered.
> 
> However if the patch doesn't work and your version does, that's fine too!

Thanks Ryan! Trying your patch in

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=98eea9e5205be091f8f78af72c48c87f4c544870&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable
Assignee

Updated

a year ago
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.2.3 → Upgrade all win7/win10 gecko workers to generic-worker 10.4.1
Assignee

Comment 62

a year ago
The try push in comment 61 is looking much better!

Note, the try push is based on this mozilla-central push, which has some starred failures already:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=6f5fac320fcb6625603fa8a744ffa8523f8b3d71&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-searchStr=windows

I've retriggered failures, to see if they are intermittent.
Assignee

Comment 63

a year ago
Hey Joel,

Is there anyone that can help me with the last couple of failures?

https://tinyurl.com/ycatybqe

Many thanks!
Pete
Flags: needinfo?(jmaher)
I don't really see any obvious pattern. The notification tests have been disabled since they were intermittently failing on the old worker :(
Flags: needinfo?(MattN+bmo)
the failures are all prompts/multi-window failures, when you are logged into a session, can you use the browser and get prompts and multiple windows?  Can you run the tests locally in a vnc/rdp session and reproduce the failures?  Once we get to that point, it will be easier to determine who can help.
Flags: needinfo?(jmaher)
Assignee

Comment 66

a year ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #65)
> the failures are all prompts/multi-window failures, when you are logged into
> a session, can you use the browser and get prompts and multiple windows? 
> Can you run the tests locally in a vnc/rdp session and reproduce the
> failures?  Once we get to that point, it will be easier to determine who can
> help.

You're quite right - these are the next things to check. In order to do that, I've implemented a (rather basic) native RDP worker feature that will allow us to RDP in while the task is running (see bug 1172273). This is subtly different to the existing Windows loaner procedure as it will (hopefully, if it works) get you in to the actual session running the task with the real task user. I've released an alpha release of the worker which I'm deploying to our beta worker types in https://tools.taskcluster.net/groups/PKEayE2bRg-RnPWtAYYPHQ so when that deployment is complete, I will test out the new RDP procedure, and see if I can see what is going wrong.
Assignee

Updated

a year ago
Depends on: 1172273
Assignee

Updated

a year ago
Depends on: 1433854
Assignee

Updated

a year ago
Blocks: 1433854
No longer depends on: 1433854
Assignee

Comment 67

a year ago
So I'm able to watch tests running now via RDP.

Example try push: (5887a82a1d0416c0724ee355f59d3c90e6fcb83f):

  *  https://tinyurl.com/ydywxj2k


I connected initially with my native screen resolution, which seems to have caused the screen resolution update to 1280x1024 to fail, so tests did not run.

I then reconnected with 1280x1024 resolution, and was able to manually run the task, only to discover it passed.

I'll trigger the task again, connecting via RDP with 1280x1024, to see whether the tests then run, and whether we get the same failure[1] that we consistently get with the upgraded worker when we don't connect via RDP.

--

[1] https://public-artifacts.taskcluster.net/FZncnsa3QmWrvwUNYgfWyg/0/public/test_info/mozilla-test-fail-screenshot_jbm8ac.png
Assignee

Comment 68

a year ago
Note: in order to connect via RDP, the workflow is:


1) Add the following patches to your gecko (firefox) checkout, to enable the beta worker types:

> curl -L 'https://bug1399401.bmoattachments.org/attachment.cgi?id=8935897' | hg import -
> curl -L 'https://bug1400012.bmoattachments.org/attachment.cgi?id=8948627' | hg import -

2) Prepare any other commits for changes you'd like to test, as normal, and push to try.
3) Find the failing task you want to play with in treeherder, and visit the failing task in the taskcluster task inspector
4) Go to Actions -> Edit Task
5) In the "payload" section add "rdpInfo": "ldap/<ldapUser>/rdpinfo.txt" (e.g. "ldap/pmoore@mozilla.com/rdpinfo.txt")
6) Add "generic-worker:allow-rdp:aws-provisioner-v1/<workerType>" to scopes list, e.g.

> scopes:
>   - 'generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win7-32-beta'

7) Ask somebody in #taskcluster to grant you the generic-worker:allow-rdp:aws-provisioner-v1/<workerType> scope and queue:get-artifact:ldap/<ldapUser>/* for the workerType(s) and ldap user you use
8) Run the task, and when it starts, go to "Run Artifacts" to see the rdpInfo.txt file appear with rdp connection information
9) Enter the connection information into your RDP client of choice
10) Connect with screen resolution 1280x1024 !
Assignee

Comment 69

a year ago
Note, bug 1436002 will simplify step 7 in comment 68. :)
See Also: → 1436002
Assignee

Updated

a year ago
Blocks: 1368961
Assignee

Updated

a year ago
Blocks: tc-stability
Assignee

Updated

a year ago
Blocks: 1333957
No longer blocks: tc-stability
Assignee

Updated

a year ago
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.4.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.5.1
Assignee

Comment 70

a year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #68)
> Note: in order to connect via RDP, the workflow is: .... <snip/> ....


This is now a little bit simpler (step 5 changed, step 7 removed):

1) Add the following patches to your gecko (firefox) checkout, to enable the beta worker types:

> curl -L 'https://bug1399401.bmoattachments.org/attachment.cgi?id=8935897' | hg import -
> curl -L 'https://bug1400012.bmoattachments.org/attachment.cgi?id=8948627' | hg import -

2) Prepare any other commits for changes you'd like to test, as normal, and push to try.
3) Find the failing task you want to play with in treeherder, and visit the failing task in the taskcluster task inspector
4) Go to Actions -> Edit Task
5) Add rdpInfo to the payload section:

> payload:
>   rdpInfo: 'login-identity/<login-identity>/rdpinfo.txt'

For example, 'login-identity/mozilla-ldap/pmoore@mozilla.com/rdpinfo.txt' (check https://tools.taskcluster.net/credentials to see what your login identity is, e.g. you should have the scope queue:create-artifact:login-identity/<login-identity>/*).

6) Add "generic-worker:allow-rdp:aws-provisioner-v1/<workerType>" to scopes list, e.g.

> scopes:
>   - 'generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win7-32-beta'

7) Run the task, and when it starts, go to "Run Artifacts" to see the rdpInfo.txt file appear with rdp connection information
8) Enter the connection information into your RDP client of choice
9) Connect with screen resolution 1280x1024 !
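
The task edits in steps 5 and 6 can be sketched as a small Python helper. This is only an illustration of which fields change: `enable_rdp` and the sample task are hypothetical, and in practice the edit is made by hand in the task inspector. The `rdpInfo` path and the scope string follow the patterns given in the steps.

```python
import copy


def enable_rdp(task, login_identity, worker_type):
    """Return a copy of a task definition with the RDP feature requested."""
    edited = copy.deepcopy(task)  # leave the original definition untouched
    # Step 5: artifact path where the worker publishes the connection details.
    payload = edited.setdefault("payload", {})
    payload["rdpInfo"] = f"login-identity/{login_identity}/rdpinfo.txt"
    # Step 6: scope allowing RDP on this worker type.
    scope = f"generic-worker:allow-rdp:aws-provisioner-v1/{worker_type}"
    scopes = edited.setdefault("scopes", [])
    if scope not in scopes:
        scopes.append(scope)
    return edited


# Hypothetical minimal task definition, for illustration only.
task = {"payload": {"command": ["echo hello"]}, "scopes": []}
edited = enable_rdp(task, "mozilla-ldap/pmoore@mozilla.com", "gecko-t-win7-32-beta")
```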
Assignee

Updated

a year ago
Blocks: 1358545
Assignee

Updated

a year ago
Blocks: 1439517
Pete: can I ask you to trigger another try run so I can look at current results? My try access is still broken and this will allow me to retrigger as necessary.
Flags: needinfo?(pmoore)
(In reply to Chris Cooper [:coop] from comment #71)
> Pete: can I ask you to trigger another try run so I can look at current
> results? My try access is still broken and this will allow me to retrigger
> as necessary.

I spoke with Joel and Kendall in the TC migration mtg today.

Pete: can I ask you to collate a list of the currently failing tests (from a new Try run, hopefully) in a new bug comment? I'm going to look at the failures myself using your loaner method and then write that process up so we can get a dev to help.
Assignee

Comment 73

a year ago
(In reply to Chris Cooper [:coop] from comment #72)
> (In reply to Chris Cooper [:coop] from comment #71)
> > Pete: can I ask you to trigger another try run so I can look at current
> > results? My try access is still broken and this will allow me to retrigger
> > as necessary.

No problem - I've made a try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd

> 
> I spoke with Joel and Kendall in the TC migration mtg today.
> 
> Pete: can I ask you to collate a list of the currently failing tests (from a
> new Try run, hopefully) in a new bug comment? I'm going to look at the
> failures myself using your loaner method and then write that process up so
> we can get a dev to help.

I'll be gone by the time this try push completes - but following that treeherder link above should be the authoritative source of the information.

Note - I made it by running steps 1 and 2 from comment 70. If anyone is investigating failures, that same comment explains how to retrigger the task with an interactive loaner, and troubleshoot the issue while the task is actually running.
Flags: needinfo?(pmoore)
Assignee

Comment 74

a year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #73)

> No problem - I've made a try push:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd

The jobs are not currently running due to bug 1443595.
Depends on: 1443595
From jmaher via email:

"here is the most recent push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable&filter-searchStr=x64

here you can see:
c3 -
* toolkit/content/tests/chrome/test_bug360437.xul
* toolkit/content/tests/chrome/test_dialogfocus.xul
* toolkit/content/tests/chrome/test_showcaret.xul
* toolkit/content/tests/widgets/test_menubar.xul
* toolkit/mozapps/downloads/tests/chrome/test_unknownContentType_delayedbutton.xul
5 -
* toolkit/components/prompts/test/test_prompts.html
* toolkit/components/prompts/test/test_modal_prompts.html
bc2/bc3 - 
* browser/components/customizableui/test/browser_panelUINotifications_multiWindow.js


these are all tests that deal with focus, specifically multi modal/window test cases.  In many of the screenshots you can see we pop up the window, but it is opaque in that you can see a shadow of it in the foreground.

Ideally you could watch a test run locally and then compare it to a loaner and see the difference.  Most of these tests send keys to specific windows to type/click/hotkey.  I wonder if there is some quirk where the windows or keys are getting crossed between users such as current user and administrator."
here are the components for the failures:

c3 -
* toolkit/content/tests/chrome/test_bug360437.xul
** Toolkit :: Find Toolbar
* toolkit/content/tests/chrome/test_dialogfocus.xul
** Toolkit :: XUL Widgets
* toolkit/content/tests/chrome/test_showcaret.xul
** Toolkit :: XUL Widgets
* toolkit/content/tests/widgets/test_menubar.xul
** Core :: XUL
* toolkit/mozapps/downloads/tests/chrome/test_unknownContentType_delayedbutton.xul
** Toolkit :: Downloads API
5 -
* toolkit/components/passwordmgr/test/mochitest/test_prompt.html
** Toolkit :: Password Manager
* toolkit/components/prompts/test/test_modal_prompts.html
** Toolkit :: General
bc2/bc3 - 
* browser/components/customizableui/test/browser_panelUINotifications_multiWindow.js
** Firefox :: Toolbars and Customization


Ideally there is something in the code of the tests that is in common and not seen in other tests, we could pinpoint the actions which seem to cause failure in this new environment.
the test failures in mochitest-e10s-5 are concerning because when I disable the above mentioned tests, the next test(s) in the list start failing in the same way (timeout).  This looks to be that we would end up disabling all prompt and modal tests for password manager and in toolkit general- the other failures seem to go away clean with disabling tests.

One observation I noticed was many of these failures are on the 3rd window, so we have the harness and we open a new window for a test and that new window opens a dialog or another new window.
Assignee

Updated

a year ago
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.5.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.7.1
Pete and I met yesterday and discussed this. Here's a summation of our thoughts.

Windows has a few user experience interactions (pop-ups, messages, modal windows) that appear on first-run. Since the new worker is using a new user for every run, these interactions may appear every single time we run a test unless we find the correct settings to toggle them off. I can recall this happening before on Mac.

We don't know if this is the actual cause, and the timing of these interactions is unknown. Three ways we could proceed here:

1) To quote Pete, we should add a "big, dirty sleep" to the start of the test run, say 5 minutes. This will give us enough time to establish an RDP connection before the test starts to see if there's an errant popup, etc stealing focus. It may also give the pop-ups enough time to clear on their own before the test starts.

2) Failing that, we could use a Windows sys call to try to figure out which window has focus during each test.

3) We could do a screen recording of an entire test run. This would allow us to step through, rewind, etc to observe a behavior that may be too quick to manually notice otherwise.
Assignee

Comment 80

a year ago
(In reply to Chris Cooper [:coop] from comment #79)

> 3) We could do a screen recording of an entire test run. This would allow us
> to step through, rewind, etc to observe a behavior that may be too quick to
> manually notice otherwise.

https://www.dvdvideosoft.com/products/dvd/Free-Screen-Video-Recorder.htm looks like it might do the trick here.
Assignee

Comment 81

a year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #80)
> (In reply to Chris Cooper [:coop] from comment #79)
> 
> > 3) We could do a screen recording of an entire test run. This would allow us
> > to step through, rewind, etc to observe a behavior that may be too quick to
> > manually notice otherwise.
> 
> https://www.dvdvideosoft.com/products/dvd/Free-Screen-Video-Recorder.htm
> looks like it might do the trick here.

I had some issues with installing "Free Screen Video Recorder", so I'm taking a look at "OBS Studio" instead: https://obsproject.com/
Assignee

Comment 82

a year ago
(In reply to Chris Cooper [:coop] from comment #79)
> 1) To quote Pete, we should add a "big, dirty sleep" to the start of the
> test run, say 5 minutes. This will give us enough time to establish an RDP
> connection before the test starts to see if there's an errant popup, etc
> stealing focus. It may also give the pop-ups enough time to clear on their
> own before the test starts.

I've made a new try push to try this out:

remote: View your change here:
remote:   https://hg.mozilla.org/try/rev/04c887284be5672c06d78ae93624c0624e33e722
remote: 
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c887284be5672c06d78ae93624c0624e33e722
Assignee

Comment 83

a year ago
Note, I've (hopefully) fixed the issue with the taskbar on both Windows 7 and Windows 10 not being hidden in bug 1433851 and am testing in a new try push:

https://tinyurl.com/ycwrff4e
Assignee

Comment 84

a year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #82)
> (In reply to Chris Cooper [:coop] from comment #79)
> > 1) To quote Pete, we should add a "big, dirty sleep" to the start of the
> > test run, say 5 minutes. This will give us enough time to establish an RDP
> > connection before the test starts to see if there's an errant popup, etc
> > stealing focus. It may also give the pop-ups enough time to clear on their
> > own before the test starts.
> 
> I've made a new try push to try this out:
> 
> remote: View your change here:
> remote:  
> https://hg.mozilla.org/try/rev/04c887284be5672c06d78ae93624c0624e33e722
> remote: 
> remote: Follow the progress of your build on Treeherder:
> remote:  
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=04c887284be5672c06d78ae93624c0624e33e722

I forgot to say - the big dirty sleep didn't help. :(
Assignee

Comment 85

a year ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #77)
> the test failures in mochitest-e10s-5 are concerning because when I disable
> the above mentioned tests, the next test(s) in the list start failing in the
> same way (timeout).  This looks to be that we would end up disabling all
> prompt and modal tests for password manager and in toolkit general- the
> other failures seem to go away clean with disabling tests.
> 
> One observation I noticed was many of these failures are on the 3rd window,
> so we have the harness and we open a new window for a test and that new
> window opens a dialog or another new window.

This is now resolved. TL;DR: It was due to the settings here[1].


In release 10.7.7 I've upgraded generic-worker to go 1.10, and in the process, rediscovered these STARTUPINFO settings[3].

Go 1.10 makes it possible to use the standard library to make CreateProcessAsUser system calls, which was not possible before, so I have migrated the worker to use the standard library for spawning task user processes, and done away with these flags. While migrating, I discovered that the STARTUPINFO flags are controlled by the standard library in go 1.10, and do not allow for any customisation.


From MSDN docs[3,4] we see the difference between the flag settings I was using, and those used in the standard library:


Pre generic-worker 10.7.7
=========================

Previously we set the following STARTUPINFO process flags[1]:

	si.Flags = win32.STARTF_FORCEOFFFEEDBACK | syscall.STARTF_USESHOWWINDOW
	si.ShowWindow = syscall.SW_SHOWMINNOACTIVE

From the MSDN docs[3,4]:

STARTF_FORCEOFFFEEDBACK
Indicates that the feedback cursor is forced off while the process is starting. The Normal Select cursor is displayed.

STARTF_USESHOWWINDOW
The wShowWindow member contains additional information.

SW_SHOWMINNOACTIVE
Displays the window as a minimized window. This value is similar to SW_SHOWMINIMIZED, except the window is not activated.




Post generic-worker 10.7.7
==========================

Now we set the flags like this[2]:

	si.Flags = STARTF_USESTDHANDLES

From the MSDN docs[3]:

STARTF_USESTDHANDLES
The hStdInput, hStdOutput, and hStdError members contain additional information.....



Conclusion
==========

The problem here was with SW_SHOWMINNOACTIVE, which creates a non-activated window. When rereading the docs, I was reminded that our failures were focus related, which led me to try adopting the standard library instead, to see if that solved the issue. Of course we could have continued to use our custom runlib library, and adapted the flags, but this seemed like a good opportunity to simplify our codebase, and use the new feature of the go standard library.
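
As a rough illustration of the difference, here is a small Python sketch (not the worker's Go code). The constants are the documented Win32 values, defined locally so the snippet runs on any platform; `starts_without_activation` is a hypothetical helper. It relies on the documented rule that wShowWindow is only honoured when STARTF_USESHOWWINDOW is set.

```python
# Documented Win32 values for STARTUPINFO.dwFlags and ShowWindow commands.
STARTF_USESHOWWINDOW = 0x00000001
STARTF_FORCEOFFFEEDBACK = 0x00000080
STARTF_USESTDHANDLES = 0x00000100
SW_SHOWMINNOACTIVE = 7


def starts_without_activation(flags, show_window):
    """True if the settings ask for a minimized, non-activated first window."""
    # wShowWindow is ignored unless STARTF_USESHOWWINDOW is present in flags.
    return bool(flags & STARTF_USESHOWWINDOW) and show_window == SW_SHOWMINNOACTIVE


# Pre 10.7.7: custom runlib settings.
old_flags = STARTF_FORCEOFFFEEDBACK | STARTF_USESHOWWINDOW
old = starts_without_activation(old_flags, SW_SHOWMINNOACTIVE)

# Post 10.7.7: only STARTF_USESTDHANDLES is set, so wShowWindow has no effect.
new = starts_without_activation(STARTF_USESTDHANDLES, 0)
```

With the old settings the check is true, and with the new ones it is false, which is consistent with the focus-related failures going away after the upgrade.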

--

[1] https://github.com/taskcluster/runlib/blob/4ab38b9ff487347cfe9707ca800d305baab444b5/subprocess/subprocess_windows.go#L139-L140
[2] https://github.com/golang/go/blob/go1.10/src/syscall/exec_windows.go#L311-L320
[3] https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms633548(v=vs.85).aspx
Assignee

Updated

a year ago
Depends on: 1448197
Assignee

Updated

a year ago
Depends on: 1447265
Assignee

Comment 86

a year ago
I believe all blocking issues have now been resolved - but I'm on PTO - so will look at rolling out generic-worker next week when I'm back.
Assignee

Comment 87

a year ago
Latest try push with new settings. Failures were all intermittents, that passed in retries:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7af721a1d8a445af27f3ceaea939c75d6eb6266a&group_state=expanded
Assignee

Updated

a year ago
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.7.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.7.8
Assignee

Updated

a year ago
Blocks: 1180187
Assignee

Comment 88

a year ago
Currently preparing the deployment...
Assignee

Updated

a year ago
Attachment #8935897 - Flags: review+

Comment 89

a year ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1f8e34bd956b
Escape backslashes in GCLI screenshot test,r=pmoore

Comment 90

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/1f8e34bd956b
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
As Pete wrote, this upgrade hasn't been done yet, so I will reopen for now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee

Comment 92

a year ago
This should do all the magic. I've spent a couple of days double and triple checking worker type definitions, and am reasonably confident everything has been accounted for.

This is really a *major* upgrade, so the potential for something to go wrong is higher than normal.

I've taken a snapshot of the (confidential) worker type definitions, which I will share with the team, so that they can be rolled back if needed. This means that if any problems are discovered, the rollback process has two steps:

1) Revert PR 128 from OpenCloudConfig
2) Request that somebody in the taskcluster team reverts the worker type definitions to their current state (i.e. to the versions I am sending them in an email this afternoon)
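
For illustration only, the snapshot-and-restore idea behind step 2 could be sketched as below. This is a minimal sketch, assuming each worker type definition is a JSON-serialisable document; `snapshot_definitions`, `restore_definitions`, and the `fetch`/`update` callables are hypothetical names standing in for whatever service actually stores the definitions, and are not part of any Taskcluster API.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def snapshot_definitions(
    worker_types: Iterable[str],
    fetch: Callable[[str], dict],
    snapshot_dir: Path,
) -> None:
    """Write one JSON file per worker type (hypothetical helper).

    `fetch` is an assumed callable that returns the current definition
    for a worker type name.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for name in worker_types:
        path = snapshot_dir / f"{name}.json"
        path.write_text(json.dumps(fetch(name), indent=2, sort_keys=True))


def restore_definitions(
    snapshot_dir: Path,
    update: Callable[[str, dict], None],
) -> List[str]:
    """Push every saved definition back through `update` (hypothetical helper).

    Returns the restored worker type names in sorted filename order.
    """
    restored = []
    for path in sorted(snapshot_dir.glob("*.json")):
        update(path.stem, json.loads(path.read_text()))
        restored.append(path.stem)
    return restored
```

Keeping one file per worker type means a single worker type can be reverted without touching the others.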
Attachment #8964886 - Flags: review?(rthijssen)
Attachment #8964886 - Flags: review?(rthijssen) → review+
Assignee

Comment 94

a year ago
We haven't had any complaints of problems yet, so I'll close this now. Please reopen if any issues appear!
Status: REOPENED → RESOLVED
Last Resolved: a year ago → a year ago
Resolution: --- → FIXED
Assignee

Updated

a year ago
No longer depends on: 1448197
:pmoore, can we remove this comment and code:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/coalesce.py#30
Flags: needinfo?(pmoore)
Assignee

Updated

7 months ago
Status: RESOLVED → REOPENED
Flags: needinfo?(pmoore)
Resolution: FIXED → ---
Assignee

Comment 96

7 months ago
Nice spot, Joel!

Does this look ok?
Attachment #9023219 - Flags: review?(jmaher)
Comment on attachment 9023219 [details] [diff] [review]
gecko patch: enable coalescing on win7/win10 worker types

Review of attachment 9023219 [details] [diff] [review]:
-----------------------------------------------------------------

cool!
Attachment #9023219 - Flags: review?(jmaher) → review+
Pete: is this live now?
Flags: needinfo?(pmoore)

Comment 99

4 months ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cef0a23a3849
enable coalescing on Windows 7 and Windows 10 worker types,r=jmaher
Assignee

Comment 100

4 months ago

It should be soon - I've just pushed to mozilla-inbound, and this bug should get automatically closed when it lands on mozilla-central, so let's leave it open.

Fingers crossed! Thanks for chasing me up. :)

Flags: needinfo?(pmoore)
Status: REOPENED → RESOLVED
Last Resolved: a year ago → 4 months ago
Resolution: --- → FIXED
Component: Integration → Services
Product: Taskcluster → Taskcluster