Closed Bug 1372172 Opened 7 years ago Closed 6 years ago

investigate ec2 windows instances stalled in impaired state

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

Unspecified
Windows

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: grenade)

References

(Depends on 2 open bugs)

Details

Attachments

(2 files, 3 obsolete files)

Windows 7 and 10 instances are too frequently entering an impaired state (hung: no logging to papertrail and no running of taskcluster tasks). this is currently worked around by monitoring ec2 via a cron script and rebooting instances determined to be in this state.

event log entries from around the time the impaired state is detected:

Jun 12 12:37:52 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com generic-worker: 2017/06/12 11:37:51 Exiting worker with exit code 67
Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Faulting application name: Explorer.EXE, version: 10.0.15063.0, time stamp: 0xa25b6083 Faulting module name: PROPSYS.dll, version: 7.0.15063.0, time stamp: 0x74650d36 Exception code: 0xc0000006 Fault offset: 0x000000000003a895 Faulting process id: 0x17bc Faulting application start time: 0x01d2e36c5df05d7a Faulting application path: C:\Windows\Explorer.EXE Faulting module path: C:\Windows\SYSTEM32\PROPSYS.dll Report Id: bd44c21e-c016-49c6-a3c4-2ebae3a3c865 Faulting package full name: Faulting package-relative application ID:
Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Windows Explorer because of this error. Program: Windows Explorer File: The error value is listed in the Additional Data section. User Action 1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again. 2. If the file still cannot be accessed and - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted. - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer. 3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER. 4. If the problem persists, restore the file from a backup copy. 5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance. Additional Data Error value: C000026E Disk type: 0
Jun 12 12:37:58 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-026957CE49B4B) has initiated the restart of computer I-026957CE49B4B on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is not running.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 6/12/2017 11:09:19 AM; uptime: 00:28:42.9313708.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: remote desktop session is not active.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive format is not in progress.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: instance failed validity check and will be halted.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log" at offset 159744 (0x0000000000027000) for 4096 (0x00001000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: Unable to write to section 2 while flushing logfile Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log. Error -1022 (0xfffffc02).
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The logfile sequence in "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\" has been halted due to a fatal error. No further updates are possible for the databases that use this logfile sequence. Please correct the problem and restart or restore from backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\vedatamodel.jfm" at offset 0 (0x0000000000000000) for 8192 (0x00002000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The database engine stopped the instance (0) with error (-510). Internal Timing Sequence: [1] 0.000006 +J(0) [2] 0.000012 +J(0) [3] 0.000007 +J(0) [4] 0.000005 +J(0) [5] 0.000038 +J(0) +M(C:0K, Fs:1, WS:4K # 0K, PF:0K # 0K, P:0K) [6] 0.000016 +J(0) [7] - [8] 0.000009 +J(0) [9] 0.000885 +J(0) +M(C:0K, Fs:3, WS:-36K # 0K, PF:-48K # 0K, P:-48K) [10] - [11] 0.001168 +J(0) +M(C:0K, Fs:2, WS:8K # 0K, PF:0K # 0K, P:0K) [12] - [13] 0.000052 +J(0) +M(C:0K, Fs:3, WS:8K # 0K, PF:-4K # 0K, P:-4K) [14] 0.000037 +J(0) +M(C:0K, Fs:0, WS:-8K # 0K, PF:-8K # 0K, P:-8K) [15] 0.000016 +J(0) +M(C:0K, Fs:0, WS:-12K # 0K, PF:-12K # 0K, P:-12K) [16] 0.000006 +J(0).
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-GroupPolicy: Shutdown script failed. GPO Name : Local Group Policy GPO File System Path : C:\Windows\System32\GroupPolicy\Machine Script Name: C:\scripts\set_user_data.ps1
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000595E00000.
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error: The request is not supported.
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000000100000.
Attached file reboot-impaired-instances.sh (obsolete) —
this is the script i am using to reboot impaired instances. it runs under cron every 10 minutes. rebooted instances appear to perform fine and run tc tasks correctly.
Rob, this slightly modified version of the inner loop should only kill instances that have been impaired for more than ten minutes:

/usr/bin/aws ec2 describe-instance-status --profile taskcluster --region "${aws_region}" --filters 'Name=instance-status.status,Values=impaired' --query 'InstanceStatuses[*].{InstanceId:InstanceId,Since:InstanceStatus.Details[0].ImpairedSince}' --output text | while read instance_id impaired_since
do
  # check if impaired for more than 600 seconds (10 mins)
  # note, since dates are iso 8601 in UTC, a string comparison of dates is sufficient and correct
  if [ "${impaired_since}" \< "$(date -d @$(($(date +%s) - 600)) --iso-8601=seconds)" ]; then
    key_name=$(aws ec2 describe-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id} | jq -r '.Reservations[0].Instances[0].KeyName')
    case "${key_name}" in
      *gecko-t-win*)
        aws ec2 reboot-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id} > /home/grenade/cron/log/reboot-${aws_region}-${instance_id}.json 2>&1
        echo $(/usr/bin/date -u '+%Y-%m-%d %H:%M:%S') ${aws_region} ${instance_id} ${key_name} >> /home/grenade/cron/log/reboot.log
        ;;
      *)
        echo ${key_name} > /home/grenade/cron/log/skip-${aws_region}-${instance_id}.log 2>&1
        ;;
    esac
  fi
done

The reason I'd like to try this adjusted version is that I would expect instances to show up as impaired briefly while rebooting; only if an instance remains impaired for some time (I think 10 minutes should cover it) does that indicate a real problem. If the status checks run less frequently than every 10 minutes, we should increase the duration above (600s) accordingly.

I'm curious whether, once the false positives are filtered out, we're just left with win10 gpu or not (I'm wondering if the hits on the other instance types are just the status check running while an instance is in the process of rebooting). In any case, this script adjustment should hopefully filter out some of the noise and reduce the false positives, leaving us with a clearer picture of how widespread the problem is. Many thanks!
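For the record, a tiny standalone illustration of why the string comparison in that if test works, assuming both timestamps are rendered in the same UTC ISO-8601 form (the impaired_since value below is made up):

# ISO-8601 timestamps in the same UTC format sort lexically in chronological order,
# so a plain string comparison doubles as a date comparison
cutoff=$(date -u -d "@$(($(date +%s) - 600))" --iso-8601=seconds)  # ten minutes ago, UTC
impaired_since='2017-06-12T11:40:00+00:00'                         # example value only
if [ "${impaired_since}" \< "${cutoff}" ]; then
  echo "impaired for more than ten minutes"
fi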
My 2 cents is that it might be better to monitor for system status rather than instance status. I've seen instance status go into an impaired state right after reboot for a few minutes, but then resolve. The instances that are truly rogue always seem to have an impaired system status though.
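A sketch of what that suggestion could look like, reusing the ${aws_region} variable and taskcluster profile from the script above; the SystemStatus field path in the query is an assumption worth verifying against a real describe-instance-status response:

# hypothetical variant: select instances whose *system* status is impaired,
# rather than filtering on instance status
/usr/bin/aws ec2 describe-instance-status \
  --profile taskcluster \
  --region "${aws_region}" \
  --filters 'Name=system-status.status,Values=impaired' \
  --query 'InstanceStatuses[*].{InstanceId:InstanceId,Since:SystemStatus.Details[0].ImpairedSince}' \
  --output text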
Also, for some more datapoints to add to the investigation, while dealing with these instances last week and over the weekend, I noticed:
1. primarily concentrated in the windows 10 gpu worker type
2. failure can happen at different places
3. some instances claimed no tasks at all
4. some were in the middle of a task (and not the same type of task); these, I think, are the ones reported as exception here
5. some completed the last task successfully and failed after reboot

I assume I also was only analyzing instances which were in a system impaired state, not instance impaired. I found that instances can often be in an instance impaired state for a few minutes but then resolve. Some of the information was collected and put into this sheet: https://docs.google.com/spreadsheets/d/1rqFavRrFUefvtd6pAPmU0l2pGCuYwt86HZOzi4D4-zU/edit#gid=1470233531
Attached file reboot.log (obsolete) —
some quick counts of impaired instances by worker type (between 2017-06-12 11:41:16 and 2017-06-13 13:23:27 UTC):
- gecko-t-win7-32: 742
- gecko-t-win7-32-gpu: 504
- gecko-t-win10-64: 39
- gecko-t-win10-64-gpu: 1362
For papertrail troubleshooting, it looks like this papertrail search finds the instances that we reboot via this script: https://papertrailapp.com/groups/853883/events?q=+The+system+has+rebooted+without+cleanly+shutting+down+first
I notice that e.g. this worker (i-0f82062fc0bb59517/us-west-2/gecko-t-win7-32-gpu) successfully ran 8 tasks[1] (rebooting between each one) before being caught by reboot-impaired-instances.sh shortly after its last reboot[2] (before claiming a 9th task):

Jun 13 11:39:55 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:39:54 Running task https://tools.taskcluster.net/task-inspector/#LgCf6LkKR2emOJlhj_7OpA/0
Jun 13 11:57:40 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:57:39 Running task https://tools.taskcluster.net/task-inspector/#ObKuKSvISmq2Auos23tVGQ/0
Jun 13 12:03:43 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:03:42 Running task https://tools.taskcluster.net/task-inspector/#QKpFbfqbTRmBcWZuvSWJZQ/0
Jun 13 12:11:26 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:11:25 Running task https://tools.taskcluster.net/task-inspector/#LUMtilzmSaaJLrHuj96sxw/0
Jun 13 12:17:23 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:17:21 Running task https://tools.taskcluster.net/task-inspector/#Tgyfl2XdSrSuPG_ElaPhZA/0
Jun 13 12:48:02 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:47:59 Running task https://tools.taskcluster.net/task-inspector/#LlANDjYtRLyZJXS7YKP4Kg/0
Jun 13 13:16:21 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 11:16:19 Running task https://tools.taskcluster.net/task-inspector/#IbQKYt8dRXCcaPBbf2Z9nQ/0
Jun 13 14:21:37 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 12:21:35 Running task https://tools.taskcluster.net/task-inspector/#dr14IIpSSGaqlxDhMEGutg/0

This seems suspicious to me, because it was able to start up 8 times without crashing, yet was determined to be impaired on the 9th startup, and before any task was claimed that might have exercised it in some way. I still wonder if we are catching too many instances in our reboot-impaired-instances.sh net, even with the 10 minute wait from comment 2. Maybe we should adjust the query per comment 3? What do you guys think?

--
[1] https://papertrailapp.com/systems/942648811/events?q=Running+task&focus=811235686693920784
[2] https://papertrailapp.com/systems/942648811/events?focus=811265815423926311&selected=811265815423926311
we're not trapping instances that are not impaired. there's a different ec2 cli call for instances whose state has not yet been determined. the call we are running is specifically designed to only reboot instances that are genuinely impaired.

even if my statement above were completely incorrect (it's not, but hypothetically speaking), we know we have an issue with instances going into impaired state. the kill script is a temporary measure in order to keep the fires lit. worrying about the kill script does absolutely nothing about the very real issue that we have instances becoming impaired, which would simply hold resources until we grind to a halt. if we could focus attention on determining how instances are getting impaired instead, i think that would be the most productive use of our time.

i appreciate that the script may be flawed, but i am completely sure that the script is not what is causing instances to become impaired, so criticisms of the script might be perfectly valid but achieve precisely nothing in debugging the cause of the impaired state. polishing the temporary script will not solve our problem.
Can we roll back the golden AMIs and the OCC manifests, and then apply the changes (be they base AMI changes, or OCC changes) iteratively, until we reproduce the issue? Is there anything you would like a hand with?
Flags: needinfo?(rthijssen)
i'm focusing on the event log messages about lost network connectivity and the configuration of the pv driver. please feel free to roll back images if you prefer that approach for identifying the cause of the impaired state.
Flags: needinfo?(rthijssen)
found a suspect in the event logs for win 10 64 gpu:

Application: ViscosityService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Runtime.InteropServices.COMException
   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32, IntPtr)
   at System.Management.SinkForEventQuery.Cancel()
   at System.Management.ManagementEventWatcher.Stop()
   at System.Management.ManagementEventWatcher.Finalize()

could explain the loss of network since viscosity creates a virtual adapter for the vpn. timing of the viscosity install also coincides with the start of the issues with impaired instances so testing a patch now which disables it:
- https://github.com/mozilla-releng/OpenCloudConfig/commit/cd9c0ef083e416ee325760ba9e6b2c3d254f6bc1
- https://tools.taskcluster.net/task-group-inspector/#/JiBoMQ9JSMGR9-1ZbC17KQ
Assignee: relops → rthijssen
Status: NEW → ASSIGNED
Attached file reboot-impaired-instances.sh (obsolete) —
the windows 10 gpu instances still go into impaired state and need to be rebooted when this happens. if they aren't rebooted, they use up all of our configured allocation for gecko-t-win10-64-gpu and the pending count for that instance type goes up while no work gets done on the impaired instances. i run the attached script to hunt down impaired instances and reboot them as part of a cron job. which i have configured to run every five minutes. someone will need to run this script or something like it while i'm on pto or until we resolve the root cause of the instances going into this state.
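for reference, a minimal sketch of the kind of crontab entry described above; the script and log paths are hypothetical, adjust them to wherever the attachment is saved:

# reboot impaired gecko-t-win10-64-gpu instances every five minutes (hypothetical paths)
*/5 * * * * /path/to/reboot-impaired-instances.sh >> /path/to/log/reboot-impaired-instances.log 2>&1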
Attachment #8876703 - Attachment is obsolete: true
Attachment #8877149 - Attachment is obsolete: true
Attachment #8919820 - Flags: feedback?(mcornmesser)
I think I have something working here, though the logs are still generating currently. I am going to check it again in an hour.

(In reply to Rob Thijssen (:grenade - UTC+3) from comment #12)
> Created attachment 8919820 [details]
> reboot-impaired-instances.sh
>
> the windows 10 gpu instances still go into impaired state and need to be
> rebooted when this happens. if they aren't rebooted, they use up all of our
> configured allocation for gecko-t-win10-64-gpu and the pending count for
> that instance type goes up while no work gets done on the impaired instances.
>
> i run the attached script to hunt down impaired instances and reboot them as
> part of a cron job. which i have configured to run every five minutes.
>
> someone will need to run this script or something like it while i'm on pto
> or until we resolve the root cause of the instances going into this state.
Attachment #8919820 - Flags: feedback?(mcornmesser) → feedback+
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu is having issues again: 421 pending jobs and 256 instances in the pool, many in state "exception" or "completed".
Depends on: 1436274
No longer blocks: tc-stability
(In reply to Rob Thijssen (:grenade UTC+2) from comment #12)
> i run the attached script to hunt down impaired instances and reboot them as
> part of a cron job. which i have configured to run every five minutes.
>
> someone will need to run this script or something like it while i'm on pto
> or until we resolve the root cause of the instances going into this state.

Rob: Where is the script running right now? If it's still running on someone's (your?) personal machine, can we please get it moved onto an AWS instance or a datacenter machine?
Flags: needinfo?(rthijssen)
:jhford has established that since today, that script no longer will work, as it relies on some convention which is no longer followed. I've terminated workers in the most affected pools, and :jhford is seeing if he can stand up a modified version of the script. The path back to "old-world" where that script works is non-trivial, which is why :jhford is initially investigating patching the existing script from this bug.
(In reply to Pete Moore [:pmoore][:pete] from comment #16)
> :jhford has established that since today, that script no longer will work,
> as it relies on some convention which is no longer followed.
>
> I've terminated workers in the most affected pools, and :jhford is seeing if
> he can stand up a modified version of the script.
>
> The path back to "old-world" where that script works is non-trivial, which
> is why :jhford is initially investigating patching the existing script from
> this bug.

(canceling the NI on :grenade)

It's worrying in that we deployed a provisioning change that invalidated other parts of our infra, and found out about it after the fact. How do we keep the Taskcluster and RelOps sides of this coin in better sync going forward?
Flags: needinfo?(rthijssen)
Comment on attachment 8919820 [details]
reboot-impaired-instances.sh

Obsoleted in favour of https://bugzilla.mozilla.org/attachment.cgi?id=8950677&action=edit

This script was using internal implementation details of the provisioner to determine which instances belonged to which workerType. For future reference, the correct endpoint to determine this information is: https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu

Bug 1437973 will track deployment of this script.
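A rough sketch of how a replacement script might consume that endpoint instead of inferring worker types from EC2 key names; the .instances[].id path is an assumption for illustration, so inspect the actual JSON response before relying on it:

# hypothetical: ask the provisioner which instances it is tracking for a worker type,
# instead of inferring the worker type from the EC2 key pair name
worker_type='gecko-t-win10-64-gpu'
curl -s "https://aws-provisioner.taskcluster.net/v1/state/${worker_type}" \
  | jq -r '.instances[].id'   # field names assumed, not verified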
Attachment #8919820 - Attachment is obsolete: true
Trees remain closed for this since about 20:00UTC. We have moved the winstance-slayer script into a more visible hook in bug 1437973 and started terminating instead of rebooting, but to no avail -- pending seems to keep going up. It does not appear that the instances are taking any jobs.
Severity: normal → blocker
Priority: -- → P1
(In reply to Chris Cooper [:coop] from comment #17)
> It's worrying in that we deployed a provisioning change that invalidated
> other parts of our infra, and found out about it after the fact.

Agreed. I didn't know anything about this bug beyond a passing familiarity with Gecko testing on Windows being unreliable, so it came as a shock to me as well. The logic of this script could've been integrated into ec2-manager in about two hours. The bulk of that code is what was written for 1437973. Greg was involved in this bug and we'd spoken about checks similar to this, as well as the removal of the ssh key pair name, in our 1x1s for months, so it's confusing to me that I wasn't looped in.

More concerning, however, is the way that this script is developed and deployed. It's critical to having open trees, but it seems that it's being run as a cronjob on someone's workstation. That's fine when it's a couple of days to get through a hairy spot, but this has been running for eight months. It's using internal implementation details of another system. Using unsupported interfaces and workstation deployments for something critical is risky, as this situation has highlighted.

This script could've also negatively impacted EC2-Manager, causing other tree closures. It's using two API endpoints which are extremely expensive in terms of EC2 API call rate limits. The two describe* calls can list theoretically unbounded data inside the EC2 api. That could exhaust our API rate limit quota. If that happens, ec2-manager would fail when it needed to call the describeInstances endpoint itself. This call limit exhaustion is the direct cause of nearly all provisioning outages in the last year or so, so this is not an abstract concern.

Maybe the solution is to block all access to the instances in our EC2 account other than the supported interfaces of the AWS-Provisioner. When a need for more access arises, we can make sure it's done in the most reliable way possible, or at least document the exceptions.

> How do we keep the Taskcluster and RelOps sides of this coin in better sync
> going forward?

I will include Relops on emails relating to major EC2-Manager or AWS-Provisioner changes if desired. As I'm not sure which internal details are used, I don't know which changes are important. I can also suggest following the EC2-Manager repository. That's where the development for all of these changes happens. Issues and Pull Requests are the best source of information. Changes are usually landed to the master branch a day before deployment.

Some upcoming changes which might be important for Relops in EC2-Manager are:
- switching how we generate taskcluster credentials for workers
- removal of provisioner secrets
- we will be doing reliability based bidding
- we will be rotating the credentials which ec2-manager uses

The first two changes are changes to supported interfaces, and backwards compatibility is a high priority. The last two should be invisible.

It would be great if Relops could let me know which tools operate on the taskcluster ec2 account. The tools of interest will either call listing operations or change state. Anything which does either thing automatically would be great to know about too. Things like CloudWatch monitoring aren't important.
I cc-ed :markco on the AWS case[1], but in case he can't see it, I'm copying the response here:

"From my understanding of your issue, you have an instance which is failing status checks & would like some assistance in investigating this issue.

"I have reviewed your instance i-001e32d40884f82ed & can see from looking at the status checks there is an issue with EC2 detecting the Network card for your instance which is generally caused by the interface being disabled within Windows.

"To resolve this issue, can you please create a new Elastic Network Interface, Attach it to your instance, then RDP to your instance using the new ENI IP address & ensure the Primary Network interface is not disabled?"

1. https://console.aws.amazon.com/support/v1#/case/?displayId=4875481371&language=en
The problem turned out to be two-fold, but from the same root cause. We had code that relied on the key name of provisioned instances containing the worker type name.

1) The kill instances script relied on this, and therefore was no longer effective in killing workers.
2) OpenCloudConfig relied on this, and therefore OCC Windows instances would not work. Non-OCC Windows workers are not affected (such as those used by nss and nss-try).

1) Has a fix here: https://mozilla.logbot.info/taskcluster/20180214#c14298284
2) Has a fix here: https://github.com/mozilla-releng/OpenCloudConfig/commit/d7d2df5a174087bad52e7d3636ae92e043f999f0

The fix from 1) is already deployed. It will take one to two hours for 2) to automatically deploy to the affected worker types, at which point we should see the backlogs start to recover.
Thank you for the investigation and fixing the issue. Pending counts are going down, sheriffs waited for the jobs pending on non-Try trees to run up to latest potential merge candidates, did a merge and then reopened the trees a few minutes ago. Autoland will get closed again because there should be many patches pending and we will reopen it again after we got some build and test coverage.
Trees are closing again because Windows builds stopped running after some time. Did we hit an AWS query limit?

https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/ currently lists 256 pending jobs for gecko-3-b-win2012.

Pending build jobs whose platforms start with 'win', grouped by state: https://sql.telemetry.mozilla.org/queries/51150
pending 394
running 5
The 'running' number is falling.

Pending build jobs whose platforms start with 'win', showing the column 'state': https://sql.telemetry.mozilla.org/queries/51149
Flags: needinfo?(pmoore)
Summary: investigate ec2 windows instances stalled in impaired state → Tree closure: investigate ec2 windows instances stalled in impaired state
ec2-manager::CRIMSON=> select region, code, "workerType", count(code)
  from awsrequests
  where error = true and method = 'runInstances' and called > now() - interval '3h'
  group by "workerType", code, region;

    region    |             code             |      workerType      | count
--------------+------------------------------+----------------------+-------
 us-east-1    | InsufficientInstanceCapacity | gecko-1-b-linux      |     4
 us-east-1    | InsufficientInstanceCapacity | gecko-1-b-macosx64   |     5
 us-east-1    | RequestLimitExceeded         | gecko-1-b-macosx64   |     1
 us-west-1    | InsufficientInstanceCapacity | gecko-3-b-android    |     9
 us-west-1    | RequestLimitExceeded         | gecko-3-b-android    |     3
 us-west-2    | RequestLimitExceeded         | gecko-3-b-android    |     1
 us-east-1    | InsufficientInstanceCapacity | gecko-3-b-linux      |     6
 us-west-1    | InsufficientInstanceCapacity | gecko-3-b-linux      |    44
 us-west-1    | RequestLimitExceeded         | gecko-3-b-linux      |     4
 us-east-1    | InsufficientInstanceCapacity | gecko-3-b-macosx64   |     6
 us-west-2    | RequestLimitExceeded         | gecko-t-linux-large  |    46
 us-west-2    | RequestLimitExceeded         | gecko-t-linux-xlarge |    43
 eu-central-1 | InsufficientInstanceCapacity | gecko-t-win10-64     |   134
 us-west-1    | InsufficientInstanceCapacity | gecko-t-win10-64     |    66
 eu-central-1 | RequestLimitExceeded         | gecko-t-win10-64     |  1506
 us-west-1    | RequestLimitExceeded         | gecko-t-win10-64     |    34
 us-west-2    | InsufficientInstanceCapacity | gecko-t-win10-64-gpu |   520
 us-west-2    | InternalError                | gecko-t-win10-64-gpu |     1
 us-west-2    | RequestLimitExceeded         | gecko-t-win10-64-gpu |   646
 eu-central-1 | InsufficientInstanceCapacity | gecko-t-win7-32      |   139
 us-west-1    | InsufficientInstanceCapacity | gecko-t-win7-32      |    93
 eu-central-1 | RequestLimitExceeded         | gecko-t-win7-32      |  1673
 us-west-1    | RequestLimitExceeded         | gecko-t-win7-32      |    35
 us-west-2    | InsufficientInstanceCapacity | gecko-t-win7-32-gpu  |    20
 us-west-2    | RequestLimitExceeded         | gecko-t-win7-32-gpu  |    16
(25 rows)

These are the calls which attempted to spawn instances. The RequestLimitExceeded code is returned when too many calls are being made; the InsufficientInstanceCapacity code is returned when EC2 does not have capacity available for us. I have work in progress for exposing this information in the Aws-Provisioner tools page.
Flags: needinfo?(pmoore)
The Windows builds not starting is being tracked in bug 1438152, got told this bug is only about gecko-t-win10-64-gpu.
See Also: → 1438152
Severity: blocker → normal
Priority: P1 → P3
Depends on: 1438684
Summary: Tree closure: investigate ec2 windows instances stalled in impaired state → investigate ec2 windows instances stalled in impaired state
Any progress on this? It's blocking being able to run some kinds of tests in CI for QuantumRender, which in turn means it's going to be harder to ship QuantumRender.
Blocks: 1424755
No longer depends on: 1436274
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #27) > Any progress on this? It's blocking being able to run some kinds of tests in > CI for QuantumRender, which in turn means it's going to be harder to ship > QuantumRender. :coop, can you help prioritize this? This is blocking enabling QR for Windows Nightly.
Flags: needinfo?(coop)
If it helps, this is super easy to repro, just run with the patch from [1]:

> ./mach try -b o -p win64 -u awsy-e10s[windows10-64-qr] -t none --rebuild 5 --artifact

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=65633282852e99abbed00c59ef8df057cb9ababb
Redirecting NI to :fubar. I can't speak directly for :grenade's time.
Flags: needinfo?(coop) → needinfo?(klibby)
Attached file m3-livelog —
Per :pmoore's request in bug 1339136 comment 18, I ran curl -L on the livelog for one of these jobs that ended up going blue. The task is at [1], I think it was run 3 or 4 on which I ran curl:

$ curl --location -o m3-livelog "https://queue.taskcluster.net/v1/task/DTrZAoN_SQq6iMZ2RW858A/runs/3/artifacts/public%2Flogs%2Flive.log"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   138  100   138    0     0    461      0 --:--:-- --:--:-- --:--:--   463
100 2928k    0 2928k    0     0   3199      0 --:--:--  0:15:37 --:--:--     0
curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60

The last couple of lines in the log are this:

15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 23 (0000016EB614BA00) [pid = 3388] [serial = 823] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/file_window_bar.html?default]^M
15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 22 (0000016EB7843800) [pid = 3388] [serial = 820] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/fi

so it looks like it just cuts off at some point. The "full" log (i.e. the m3-livelog file I got from curl) is attached.

[1] https://tools.taskcluster.net/groups/Qq3j73IqQ76miagF9laLxQ/tasks/DTrZAoN_SQq6iMZ2RW858A/details
Also I should note that the m3-livelog file is 2998272 bytes, which is an exact multiple of 16k, so it's likely that something was buffering the output and the last stuff in the file is just whatever was last flushed, not necessarily the last output generated by the worker.
Mark, can you have a go at this while Rob's on PTO? Given that we have a method to reproduce (thanks!) maybe we can manually run that while watching the event log, etc, and get a better idea of what's tripping us up.
Flags: needinfo?(klibby) → needinfo?(mcornmesser)
Rgr.
Flags: needinfo?(mcornmesser)
Out of curiosity, are any of the tasks/jobs that we've run into this issue on also run on buildbot/ESR52?
What is the problem being reproduced in comment 29 and comment 30? Are those producing instances in an impaired state, or is an artifact upload issue? Do we have anyway to identify instances that are in an impaired state?
(In reply to Mark Cornmesser [:markco] from comment #38) > What is the problem being reproduced in comment 29 and comment 30? Are those > producing instances in an impaired state, or is an artifact upload issue? > > Do we have anyway to identify instances that are in an impaired state? I believe the instances are in an impaired state; trying to inspect one of the machines [1] indicates it's not provisioned. [1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu/workers/eu-central-1/i-0aace1bdc7bd7ba24
Because of bug 1443589 I have not been able to spend much time on this bug this week.
We can still reproduce this with Windows coverage builds. Are you still planning to investigate this?
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
hi Marco, yes, we're still actively trying to solve this. it would be good to test if the problem can be reproduced on the new auto-generated amis (from bug 1474548) used on our beta workers. if you have a good way to reproduce the problem, can i ask that you import this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to try? the commit forces all windows tasks onto our beta workers where we'd like to check if the generated windows 10 amis maybe don't suffer from problems that cause them to go offline. thank you!
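for anyone reproducing, a minimal sketch of what that looks like locally (assuming a mozilla-central checkout with try push access; the fuzzy query is just an example selection):

# apply the worker-swap commit on top of your local patches, then push to try
hg import https://hg.mozilla.org/try/raw-rev/0103760
./mach try fuzzy -q 'test-windows10-64-qr'   # example query; pick whatever reproduces the issue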
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #42)
> hi Marco, yes, we're still actively trying to solve this.
>
> it would be good to test if the problem can be reproduced on the new
> auto-generated amis (from bug 1474548) used on our beta workers.
>
> if you have a good way to reproduce the problem, can i ask that you import
> this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to
> try? the commit forces all windows tasks onto our beta workers where we'd
> like to check if the generated windows 10 amis maybe don't suffer from
> problems that cause them to go offline.
>
> thank you!

Yes, that patch seems to fix the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4535f9712505f214b9078a1081c442ebbc1fdc9f.
Great news, thanks guys!
jmaher: we think we're ready to roll out auto-generated windows 10 amis for gecko-t-win10-64-gpu worker types. we'd like to do a test run on the gecko-t-win10-64-gpu-b beta workers that runs a fairly heavy test load against that worker type.

i would like to create some try pushes that trigger all the tests that already run on gecko-t-win10-64-gpu but redirect them to gecko-t-win10-64-gpu-b using the worker swap from https://hg.mozilla.org/try/raw-rev/0103760.

do you know what try syntax i should use to trigger quite a lot of win 10 gpu tests? do you want to do any testing as well that i should wait on?
Flags: needinfo?(jmaher)
here are some examples: https://searchfox.org/mozilla-central/search?q=%3A+hardware&case=false&regexp=false&path=taskcluster%2Fci

ignore the |default: hardware|, those are perf tests. I would also do win10-ccov as well; you can find that in |./mach try fuzzy -q 'windows ccov'|.

since these tests fail 90% of the time on existing configurations, just getting them to run a few times will be more than adequate :) If you want to up the load, just run all win10 jobs 4 or 5 times :)
Flags: needinfo?(jmaher)
here's what i came up with to test (hopefully i got my patches and syntax correct): https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd4980f
this looks good- a few perma failing tests in reftest- those can be fixed easily.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #49)
> this looks good- a few perma failing tests in reftest- those can be fixed
> easily.

great. is there anything obvious on infra that could be causing those reftest failures? i converted some of the base64 images that had pixel differences to see if i could spot a desktop popup or something, but the images i saw were just greyscale spheres on a white background for both the reference png and the test png.

should i wait on promoting this ami to gecko-t-win10-64-gpu or is it easier to get reftest fixed if we go live with the updated ami?
Flags: needinfo?(jmaher)
here's the last test run i did last night that included mochitest-gpu as well: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=c406a22
here are the failures (debug job names; the same tests fail in opt as well):

R1
* https://searchfox.org/mozilla-central/source/layout/reftests/svg/reftest.list#340
** already fuzzy-if win10 to 15, we need 16
* https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/reftest.list#38
** NEW FAILURE - VISIBLE DIFFERENCE

R3
* fails already in verify mode: https://searchfox.org/mozilla-central/source/layout/reftests/canvas/reftest.list#80
** doesn't render in repeated attempts, I would skip it or fuzzy-if(1,45)
* https://searchfox.org/mozilla-central/source/layout/reftests/async-scrolling/reftest.list#58
** accepts up to 20 pixels incorrect, we have 22 now

R4 (these are in wpt already... can we turn these off?)
* https://searchfox.org/mozilla-central/source/layout/reftests/w3c-css/submitted/masking/reftest.list#33
** already fuzzy up to 3, we just need 7 pixels
* https://searchfox.org/mozilla-central/source/layout/reftests/w3c-css/submitted/masking/reftest.list#91-99
** already fuzzy to 21, we need 27 pixels (only tests that have existing fuzzy-if(winWidget))

R4
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#47
** fuzzy-if already for win 6.2 (i.e. win8), need to remove that and make win10 fuzzy to 12
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#50
** fuzzy-if for win8, need to make win10
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#83
** NEW FAILURE?
* https://searchfox.org/mozilla-central/source/gfx/tests/reftest/reftest.list#17
** already fuzzy to 26354, make it 26621

in all of this there are 2 things we probably should investigate, annotated with NEW FAILURE.

:mattwoodrow, can you look at the failure in transforms-1.html (https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#83)?
** https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/dKv_7M14TquglVDxpmdvBQ/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1

:dbaron, can you look at the failure in grid-placement-auto-implicit-001.html (https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/reftest.list#38)?
** https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/P6Cm0gv3QuKnIcn_LKfWqg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmaher)
Flags: needinfo?(dbaron)
crashtest, mochitest-(1-5) & mochitest-media-e10s are currently configured to run
- on gpu instances (virtual-with-gpu) when running a "Windows 10 x64 QuantumRender" [windows10-64-qr] test configuration and
- without gpu (virtual) when running a normal "Windows 10 x64" [Windows 10] test configuration.
see: https://dxr.mozilla.org/mozilla-central/search?q=windows10-64-qr%2F.*%3A+virtual-with-gpu

it looks like mochitest-media-e10s is failing when run on the new ami under the windows10-64-qr config. i don't know if that's significant, but thought i would mention it in case we need to fix that too. if someone understands the significance of why these suites run on gpu for the qr configuration only, maybe they could make a call on what we should do about this. it looks like it was switched on in bug 1426155.

here's a push showing failure for mochitest-media-e10s on the new ami (beta) with gpu and success on the old ami (prod) without gpu: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=68cc57c

crashtest & mochitest-(1-5) look fine: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=78df068
(In reply to Rob Thijssen (:grenade UTC+2) from comment #57)
> it looks like mochitest-media-e10s is failing when run on the new ami under
> the windows10-64-qr config. i don't know if that's significant, but thought
> i would mention it in case we need to fix that too.

For now we can disable the failing test on the windows10-64-qr platform. Although really it looks like the test doesn't stall, it just takes a really long time, as the log shows output still coming from the test at the point that it times out. I'm happy to write a patch to deal with this failure if you need me to.

> if someone understands
> the significance of why these suites run on gpu for the qr configuration
> only, maybe they could make a call on what we should do about this.

These suites run on gpu because the qr configuration has webrender enabled, and webrender only works if you have a gpu.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #58)
> For now we can disable the failing test on the windows10-64-qr platform.
> Although really it looks like the test doesn't stall, it just takes a really
> long time

Actually it looks like the test is waiting for a particular pixel to turn red but never sees that happening, and it keeps looping and checking until the test times out. I'm not sure offhand why this is happening as the screenshot seems to show the expected behaviour, but I'm happy to disable the test on QR until we figure it out.
transforms-1.html is failing with a max difference of 1 with the two images disagreeing on 127 vs 128 (~0.5 as a float). That's super subtle, and probably just a slight difference in floating point calculations somewhere, maybe the graphics driver. We should just fuzz it.
Flags: needinfo?(matt.woodrow)
redirecting my needinfo from comment 52 to mats
Flags: needinfo?(dbaron) → needinfo?(mats)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #52)
> :dbaron, can you look at the failure in
> grid-placement-auto-implicit-001.html
> (https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/
> reftest.list#38)?

Looks like it's using JS to generate some part of the test/reference, so I suspect it needs to use "reftest-wait". Please disable it and file a bug in Core/Layout:Grid.
Flags: needinfo?(mats)
jmaher: i believe we need 3 patches to tests in order to start using the new auto-generated amis:
- disable failing mda test on QR (as per :kats' patch in comment 62)
- fuzz reftests/transforms-1.html tests (as per :mattwoodrow's comment 63)
- disable reftests/grid-placement-auto-implicit-001.html (as per mats' comment 65)

is this something you are able to organise or assign to the appropriate people?
Flags: needinfo?(jmaher)
If Joel's not around I'm happy to do this.
let me get some bugs in place and we can resolve them independently (ideally this week)
Flags: needinfo?(jmaher)
in addition, there is a series of small adjustments to make to existing fuzzy-if statements, outlined in comment 52. I have try pushes going for the first two bugs.
Comment on attachment 9022091 [details] [diff] [review] patch to disable failing mda test on QR I landed this patch on bug 1509064
Attachment #9022091 - Flags: checked-in+
jmaher, kats: are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta to prod), based on the test results linked in comment 73 above, or should i wait for further fuzzy patches?

i note that there are a couple failures:
- test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries
- test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries
Flags: needinfo?(kats)
Flags: needinfo?(jmaher)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #74)
> i note that there are a couple failures:
> - test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries

This one needs fixing still - it looks like the test on windows has both the 'webrender' and 'skiaContent' annotations activated, and the latter is taking precedence over the former. If we switch the order of those annotations it should be good. I'll land a patch for this.

> - test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries

This is a pre-existing failure, so I'm not worried about this.

> jmaher, kats:
> are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta
> to prod), based on the test results linked in comment 73 above, or should i
> wait for further fuzzy patches?

Please wait for me to fix the fuzzy failure above. The trees are closed at the moment so it will be whenever the trees reopen.
Flags: needinfo?(kats)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #75)
> This one needs fixing still - it looks like the test on windows has both the
> 'webrender' and 'skiaContent' annotations activated, and the latter is
> taking precedence over the former. If we switch the order of those
> annotations it should be good. I'll land a patch for this.

Sorry, I was looking at an old version of the code. The actual problem is the &&isDebugBuild clause on the annotation. I'll need to drop that since it happens on both opt and debug. Will do that once trees reopen.
I landed the patch on inbound earlier today. If the AMI change happens as part of an in-tree change then you could land it on inbound now and it should be good. Otherwise we should wait until that patch has propagated to all the trunk trees.
The patch is on central now so we should be good to go!
New failures on Windows 10 have increased heavily: https://sql.telemetry.mozilla.org/queries/60373/

Rob, can you take a look at those, please?

Table with links to bugs, logs and info on job type and platform: https://sql.telemetry.mozilla.org/queries/60371/
Flags: needinfo?(rthijssen)
i'm afraid i don't know what to make of this data or what info is being asked of me, but i'll take a stab at a response:

the change implemented by this bug was to create and use a new windows 10 ami which uses a source controlled, automated process to deterministically create an ec2 ami from a windows build 1703 iso (see bug 1474548), replacing an ami that was manually created (by a human, installing windows on a vm, and importing the resulting vhd into ec2, as an ami).

the change gives us a mechanism to deterministically modify the ami generation process in an incremental and audited way in order to improve the repeatability of that exercise for subsequent builds of windows, without having to rely on human memory from the last time an ami was manually built. a happy byproduct of the change was that instances created from the new amis do not suffer from going into an impaired state whenever the gpu is put through its paces, hence the resolution of this bug.

we can of course revert the deployment to go back to the manually created ami which did not have the problems referred to by the telemetry link (but which did have other problems, like instances that die mid-task), or we can choose to bite the bullet and start to incrementally improve the automated process to reduce the number of intermittently failing tests, as and when we understand the causes of the individual failures. i prefer the latter, but it's not my call and i'm happy to work with whatever moves us forward.

if there's any other specific questions i can answer, please feel free to ni me again.
Flags: needinfo?(rthijssen)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #84) > i'm afraid i don't know what to make of this data I'm also somewhat overwhelmed by the data. Sebastian: can you (or one of the sheriffs) call out a few standout bugs from that list to block this bug on if you want Rob to dig into this more, please?
Flags: needinfo?(aryx.bugmail)
I think the problem is that there are so many new bugs that are individual; it isn't as if one or two bugs are happening at large scale. This is more of a trend that shows a low rate of new windows 10 intermittent bugs and then, in the last 3 days, a massive jump in new issues.

I also agree with you :coop, this is a hard problem to understand with that data. I think looking into common failures (crashes, hangs, startup hangs, focus, specific suites/messages) would help make this actionable. In fact it would be nice if we had metrics broken down daily like:
* config - overall intermittent rate, overall new intermittent rate
* test suite - overall intermittent rate, overall new intermittent rate
* buckets (hangs, timeouts, crashes, infra, test failure) - overall intermittent rate, overall new intermittent rate

We could at least have common ground for understanding what the different patterns are and catch more widespread issues more easily.
The failure rate uptick is caused by Windows 10 update prompts overlaying the screen which causes focus, click automation and color detection issues. See bug 1511293.
Flags: needinfo?(aryx.bugmail) → needinfo?(rthijssen)
That issue should be fixed in bug 1510220.
Flags: needinfo?(rthijssen)