Closed Bug 1372172
Opened 7 years ago; Closed 6 years ago
investigate ec2 windows instances stalled in impaired state
Categories: Infrastructure & Operations :: RelOps: General, task, P3
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: grenade, Assigned: grenade)
References: (Depends on 2 open bugs)
Attachments (2 files, 3 obsolete files):
- 2.86 MB, text/plain
- 1.28 KB, patch (jmaher: review+, kats: checked-in+)
Windows 7 and 10 instances are too frequently entering an impaired state (hung: no logging to papertrail and no running of taskcluster tasks).
This is currently worked around by monitoring EC2 via a cron script and rebooting instances determined to be in this state.
Event log entries from around the time the impaired state is detected:
Jun 12 12:37:52 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com generic-worker: 2017/06/12 11:37:51 Exiting worker with exit code 67
Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Faulting application name: Explorer.EXE, version: 10.0.15063.0, time stamp: 0xa25b6083 Faulting module name: PROPSYS.dll, version: 7.0.15063.0, time stamp: 0x74650d36 Exception code: 0xc0000006 Fault offset: 0x000000000003a895 Faulting process id: 0x17bc Faulting application start time: 0x01d2e36c5df05d7a Faulting application path: C:\Windows\Explorer.EXE Faulting module path: C:\Windows\SYSTEM32\PROPSYS.dll Report Id: bd44c21e-c016-49c6-a3c4-2ebae3a3c865 Faulting package full name: Faulting package-relative application ID:
Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Windows Explorer because of this error. Program: Windows Explorer File: The error value is listed in the Additional Data section. User Action 1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again. 2. If the file still cannot be accessed and - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted. - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer. 3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER. 4. If the problem persists, restore the file from a backup copy. 5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance. Additional Data Error value: C000026E Disk type: 0
Jun 12 12:37:58 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-026957CE49B4B) has initiated the restart of computer I-026957CE49B4B on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is not running.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 6/12/2017 11:09:19 AM; uptime: 00:28:42.9313708.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: remote desktop session is not active.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive format is not in progress.
Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: instance failed validity check and will be halted.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log" at offset 159744 (0x0000000000027000) for 4096 (0x00001000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: Unable to write to section 2 while flushing logfile Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log. Error -1022 (0xfffffc02).
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The logfile sequence in "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\" has been halted due to a fatal error. No further updates are possible for the databases that use this logfile sequence. Please correct the problem and restart or restore from backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\vedatamodel.jfm" at offset 0 (0x0000000000000000) for 8192 (0x00002000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The database engine stopped the instance (0) with error (-510). Internal Timing Sequence: [1] 0.000006 +J(0) [2] 0.000012 +J(0) [3] 0.000007 +J(0) [4] 0.000005 +J(0) [5] 0.000038 +J(0) +M(C:0K, Fs:1, WS:4K # 0K, PF:0K # 0K, P:0K) [6] 0.000016 +J(0) [7] - [8] 0.000009 +J(0) [9] 0.000885 +J(0) +M(C:0K, Fs:3, WS:-36K # 0K, PF:-48K # 0K, P:-48K) [10] - [11] 0.001168 +J(0) +M(C:0K, Fs:2, WS:8K # 0K, PF:0K # 0K, P:0K) [12] - [13] 0.000052 +J(0) +M(C:0K, Fs:3, WS:8K # 0K, PF:-4K # 0K, P:-4K) [14] 0.000037 +J(0) +M(C:0K, Fs:0, WS:-8K # 0K, PF:-8K # 0K, P:-8K) [15] 0.000016 +J(0) +M(C:0K, Fs:0, WS:-12K # 0K, PF:-12K # 0K, P:-12K) [16] 0.000006 +J(0).
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-GroupPolicy: Shutdown script failed. GPO Name : Local Group Policy GPO File System Path : C:\Windows\System32\GroupPolicy\Machine Script Name: C:\scripts\set_user_data.ps1
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000595E00000.
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error: The request is not supported.
Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000000100000.
Assignee
Comment 1•7 years ago
this is the script i am using to reboot impaired instances. it runs under cron every 10 minutes. rebooted instances appear to perform fine and run tc tasks correctly.
Comment 2•7 years ago
Rob, this slightly modified version of the inner loop should only kill instances that have been impaired for more than ten minutes:
/usr/bin/aws ec2 describe-instance-status --profile taskcluster --region "${aws_region}" --filters 'Name=instance-status.status,Values=impaired' --query 'InstanceStatuses[*].{InstanceId:InstanceId,Since:InstanceStatus.Details[0].ImpairedSince}' --output text | while read instance_id impaired_since
do
  # check if impaired for more than 600 seconds (10 mins)
  # note, since dates are iso 8601 in UTC, a string comparison of dates is sufficient and correct
  if [ "${impaired_since}" \< "$(date -d @$(($(date +%s) - 600)) --iso-8601=seconds)" ]; then
    key_name=$(aws ec2 describe-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id} | jq -r '.Reservations[0].Instances[0].KeyName')
    case "${key_name}" in
      *gecko-t-win*)
        aws ec2 reboot-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id} > /home/grenade/cron/log/reboot-${aws_region}-${instance_id}.json 2>&1
        echo $(/usr/bin/date -u '+%Y-%m-%d %H:%M:%S') ${aws_region} ${instance_id} ${key_name} >> /home/grenade/cron/log/reboot.log
        ;;
      *)
        echo ${key_name} > /home/grenade/cron/log/skip-${aws_region}-${instance_id}.log 2>&1
        ;;
    esac
  fi
done
The reason I'd like to try this adjusted version is that I would expect all rebooting instances to be impaired at times, but only if they remain impaired for some time (I think 10 mins should cover it), should this indicate a real problem. If the status checks run less frequently than every 10 mins, we should increase the duration above (600s) accordingly.
I'm curious whether, once we filter out the false positives, we're left with only win10 gpu instances (I'm wondering if the impaired status on the other instance types is just related to the status check running while the instance is in the process of rebooting).
In any case, this script adjustment should hopefully filter out some of the noise and reduce some of the false positives, leaving us with a clearer picture of how widespread the problem is.
Many thanks!
Comment 3•7 years ago
My 2 cents is that it might be better to monitor for system status rather than instance status. I've seen instance status go into an impaired state right after reboot for a few minutes, but then resolve. The instances that are truly rogue always seem to have an impaired system status though.
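For illustration, a minimal variant of the check that keys on system status instead of instance status could look like this; it uses the standard system-status.status filter and is otherwise the same shape as the loop in comment 2 (field paths follow the describe-instance-status output):
/usr/bin/aws ec2 describe-instance-status --profile taskcluster --region "${aws_region}" \
  --filters 'Name=system-status.status,Values=impaired' \
  --query 'InstanceStatuses[*].{InstanceId:InstanceId,Since:SystemStatus.Details[0].ImpairedSince}' --output text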
Comment 4•7 years ago
Also, for some more datapoints to add to the investigation, while dealing with these instances last week and over the weekend, I noticed:
1. the problem is primarily concentrated in the windows 10 gpu worker type
2. failure can happen at different places
3. some instances claimed no tasks at all
4. some were in the middle of a task (and not the same type of task); these I think are the ones reported as exception here
5. some completed the last task successfully and, I assume, failed after reboot
I was also only analyzing instances which were in a system-impaired state, not instance-impaired. I found that instances can often be in an instance-impaired state for a few minutes but then recover.
Some of the information was collected and put into this sheet:
https://docs.google.com/spreadsheets/d/1rqFavRrFUefvtd6pAPmU0l2pGCuYwt86HZOzi4D4-zU/edit#gid=1470233531
Assignee
Comment 5•7 years ago
some quick counts of impaired instances by worker type (between 2017-06-12 11:41:16 and 2017-06-13 13:23:27 UTC):
- gecko-t-win7-32: 742
- gecko-t-win7-32-gpu: 504
- gecko-t-win10-64: 39
- gecko-t-win10-64-gpu: 1362
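for illustration, one way such counts can be tallied from the reboot log that the cron script writes (assuming, as the script does, that the key name in the last column embeds the worker type name):
# tally reboots per worker type; the last column of reboot.log is the EC2 key name
awk '{print $NF}' /home/grenade/cron/log/reboot.log | grep -o 'gecko-t-win[0-9a-z-]*' | sort | uniq -c | sort -rn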
Comment 6•7 years ago
For papertrail troubleshooting, it looks like this papertrail search finds the instances that we reboot via this script:
https://papertrailapp.com/groups/853883/events?q=+The+system+has+rebooted+without+cleanly+shutting+down+first
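The same query can also be scripted against Papertrail's events search API (the token below is a placeholder, and the endpoint is Papertrail's documented search API):
curl -sG -H "X-Papertrail-Token: ${PAPERTRAIL_API_TOKEN}" \
  --data-urlencode 'q=The system has rebooted without cleanly shutting down first' \
  'https://papertrailapp.com/api/v1/events/search.json' | jq .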
Comment 7•7 years ago
I notice that e.g. this worker (i-0f82062fc0bb59517/us-west-2/gecko-t-win7-32-gpu) successfully ran 8 tasks[1] (rebooting between each one) before being caught by reboot-impaired-instances.sh shortly after its last reboot[2] (before claiming a 9th task):
Jun 13 11:39:55 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:39:54 Running task https://tools.taskcluster.net/task-inspector/#LgCf6LkKR2emOJlhj_7OpA/0
Jun 13 11:57:40 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:57:39 Running task https://tools.taskcluster.net/task-inspector/#ObKuKSvISmq2Auos23tVGQ/0
Jun 13 12:03:43 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:03:42 Running task https://tools.taskcluster.net/task-inspector/#QKpFbfqbTRmBcWZuvSWJZQ/0
Jun 13 12:11:26 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:11:25 Running task https://tools.taskcluster.net/task-inspector/#LUMtilzmSaaJLrHuj96sxw/0
Jun 13 12:17:23 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:17:21 Running task https://tools.taskcluster.net/task-inspector/#Tgyfl2XdSrSuPG_ElaPhZA/0
Jun 13 12:48:02 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:47:59 Running task https://tools.taskcluster.net/task-inspector/#LlANDjYtRLyZJXS7YKP4Kg/0
Jun 13 13:16:21 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 11:16:19 Running task https://tools.taskcluster.net/task-inspector/#IbQKYt8dRXCcaPBbf2Z9nQ/0
Jun 13 14:21:37 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 12:21:35 Running task https://tools.taskcluster.net/task-inspector/#dr14IIpSSGaqlxDhMEGutg/0
This seems suspicious to me, because it was able to start up 8 times without crashing, yet was determined to be impaired on the 9th startup, and before any task was claimed that might have exercised it in some way.
I still wonder if we are catching too many instances in our reboot-impaired-instances.sh net, even with the 10 minute wait from comment 2. Maybe we should adjust the query per comment 3? What do you guys think?
--
[1] https://papertrailapp.com/systems/942648811/events?q=Running+task&focus=811235686693920784
[2] https://papertrailapp.com/systems/942648811/events?focus=811265815423926311&selected=811265815423926311
Assignee
Comment 8•7 years ago
we're not trapping instances that are not impaired. there's a different ec2 cli call for instances whose state has not yet been determined. the call we are running is specifically designed to only reboot instances that are genuinely impaired.
even if my statement above were completely incorrect (it's not, but hypothetically speaking), we know we have an issue with instances going into impaired state. the kill script is a temporary measure in order to keep the fires lit. worrying about the kill script does absolutely nothing about the very real issue that we have instances becoming impaired, which would simply hold resources until we grind to a halt.
if we could focus attention on determining how instances are getting impaired instead, i think that would be the most productive use of our time. i appreciate that the script may be flawed. but i am completely sure that the script is not what is causing instances to become impaired so criticisms of the script might be perfectly valid but achieve precisely nothing in debugging the cause of the impaired state. polishing the temporary script will not solve our problem.
Comment 9•7 years ago
Can we roll back the golden AMIs and the OCC manifests, and then apply the changes (be they base AMI changes, or OCC changes) iteratively, until we reproduce the issue?
Is there anything you would like a hand with?
Flags: needinfo?(rthijssen)
Assignee
Comment 10•7 years ago
i'm focusing on the event log messages about lost network connectivity and the configuration of the pv driver. please feel free to roll back images if you prefer that approach for identifying the cause of the impaired state.
Flags: needinfo?(rthijssen)
Assignee
Comment 11•7 years ago
found a suspect in the event logs for win 10 64 gpu:
Application: ViscosityService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Runtime.InteropServices.COMException
at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32, IntPtr)
at System.Management.SinkForEventQuery.Cancel()
at System.Management.ManagementEventWatcher.Stop()
at System.Management.ManagementEventWatcher.Finalize()
this could explain the loss of network, since viscosity creates a virtual adapter for the vpn. the timing of the viscosity install also coincides with the start of the issues with impaired instances, so i'm testing a patch now which disables it:
- https://github.com/mozilla-releng/OpenCloudConfig/commit/cd9c0ef083e416ee325760ba9e6b2c3d254f6bc1
- https://tools.taskcluster.net/task-group-inspector/#/JiBoMQ9JSMGR9-1ZbC17KQ
Assignee
Updated•7 years ago
Assignee: relops → rthijssen
Status: NEW → ASSIGNED
Assignee
Comment 12•7 years ago
the windows 10 gpu instances still go into impaired state and need to be rebooted when this happens. if they aren't rebooted, they use up all of our configured allocation for gecko-t-win10-64-gpu and the pending count for that instance type goes up while no work gets done on the impaired instances.
i run the attached script to hunt down impaired instances and reboot them as part of a cron job, which i have configured to run every five minutes.
someone will need to run this script or something like it while i'm on pto or until we resolve the root cause of the instances going into this state.
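for reference, an example crontab entry for this (script and log paths are illustrative):
# run the impaired-instance reboot script every five minutes
*/5 * * * * /home/grenade/cron/reboot-impaired-instances.sh >> /home/grenade/cron/log/cron.log 2>&1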
Attachment #8876703 - Attachment is obsolete: true
Attachment #8877149 - Attachment is obsolete: true
Attachment #8919820 - Flags: feedback?(mcornmesser)
Comment 13•7 years ago
I think I have something working here, though the logs are still generating currently. I am going to check it again in an hour.
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #12)
> Created attachment 8919820 [details]
> reboot-impaired-instances.sh
>
> the windows 10 gpu instances still go into impaired state and need to be
> rebooted when this happens. if they aren't rebooted, they use up all of our
> configured allocation for gecko-t-win10-64-gpu and the pending count for
> that instance type goes up while no work gets done on the impaired instances.
>
> i run the attached script to hunt down impaired instances and reboot them as
> part of a cron job. which i have configured to run every five minutes.
>
> someone will need to run this script or something like it while i'm on pto
> or until we resolve the root cause of the instances going into this state.
Updated•7 years ago
Attachment #8919820 - Flags: feedback?(mcornmesser) → feedback+
Comment 14•7 years ago
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu is having issues again:
421 pending jobs and 256 instances in the pool - many in state "exception" or "completed".
Updated•7 years ago
Blocks: tc-stability
Updated•7 years ago
No longer blocks: tc-stability
Updated•7 years ago
Blocks: tc-stability
Comment 15•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #12)
> i run the attached script to hunt down impaired instances and reboot them as
> part of a cron job. which i have configured to run every five minutes.
>
> someone will need to run this script or something like it while i'm on pto
> or until we resolve the root cause of the instances going into this state.
Rob: Where is the script running right now?
If it's still running on someone's (your?) personal machine, can we please get it moved onto an AWS instance or a datacenter machine?
Flags: needinfo?(rthijssen)
Comment 16•7 years ago
:jhford has established that, as of today, that script will no longer work, as it relies on a convention which is no longer followed.
I've terminated workers in the most affected pools, and :jhford is seeing if he can stand up a modified version of the script.
The path back to "old-world" where that script works is non-trivial, which is why :jhford is initially investigating patching the existing script from this bug.
Comment 17•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #16)
> :jhford has established that since today, that script no longer will work,
> as it relies on some convention which is no longer followed.
>
> I've terminated workers in the most affected pools, and :jhford is seeing if
> he can stand up a modified version of the script.
>
> The path back to "old-world" where that script works is non-trivial, which
> is why :jhford is initially investigating patching the existing script from
> this bug.
(canceling the NI on :grenade)
It's worrying in that we deployed a provisioning change that invalidated other parts of our infra, and found out about it after the fact.
How do we keep the Taskcluster and RelOps sides of this coin in better sync going forward?
Flags: needinfo?(rthijssen)
Comment 18•7 years ago
Comment on attachment 8919820 [details]
reboot-impaired-instances.sh
Obsoleted in favour of https://bugzilla.mozilla.org/attachment.cgi?id=8950677&action=edit
This script was using internal implementation details of the provisioner to determine which instances belonged to which workerType. For future reference, the correct endpoint to determine this information is: https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu
Bug 1437973 will track deployment of this script.
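For reference, that endpoint can be queried directly; the response schema isn't reproduced here, so inspect it with jq before relying on specific field names:
curl -s 'https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu' | jq .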
Attachment #8919820 - Attachment is obsolete: true
Comment 19•7 years ago
Trees have remained closed for this since about 20:00 UTC. We have moved the winstance-slayer script into a more visible hook in bug 1437973 and started terminating instead of rebooting, but to no avail -- pending seems to keep going up. It does not appear that the instances are taking any jobs.
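(For illustration, terminating rather than rebooting an impaired instance with the CLI looks like this, reusing the selection variables from the earlier script:)
# same selection logic as before, but terminate rather than reboot the impaired instance
aws ec2 terminate-instances --profile taskcluster --region "${aws_region}" --instance-ids "${instance_id}"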
Updated•7 years ago
Severity: normal → blocker
Priority: -- → P1
Comment 20•7 years ago
(In reply to Chris Cooper [:coop] from comment #17)
> It's worrying in that we deployed a provisioning change that invalidated
> other parts of our infra, and found out about it after the fact.
Agreed. I didn't know anything about this bug beyond a passing familiarity with Gecko testing on Windows being unreliable, so it came as a shock to me as well. This script's logic could've been integrated into ec2-manager in about two hours; the bulk of that code is what was written for bug 1437973.
Greg was involved in this bug and we'd spoken about checks similar to this as well as the removal of the ssh key pair name in our 1x1s for months, so it's confusing to me that I wasn't looped in.
More concerning, however, is the way that this script is developed and deployed. It's critical to keeping the trees open, but it seems that it's being run as a cronjob on someone's workstation. That's fine when it's a couple of days to get through a hairy spot, but this has been running for eight months. It's using internal implementation details of another system. Using unsupported interfaces and workstation deployments for something critical is risky, as this situation has highlighted.
This script could also have negatively impacted EC2-Manager, causing other tree closures. It's using two API endpoints which are extremely expensive against the EC2 API call rate limits. The two describe* calls can list theoretically unbounded data inside the EC2 API, which could exhaust our API rate limit quota. If that happens, ec2-manager would fail when it needed to call the describeInstances endpoint itself. This call-limit exhaustion is the direct cause of nearly all provisioning outages in the last year or so, so this is not an abstract concern.
Maybe the solution is to block all access to the instances in our EC2 account other than the supported interfaces of the AWS-Provisioner. When a need for more access arises, we can make sure it's done in the most reliable way possible or at least document the exceptions.
> How do we keep the Taskcluster and RelOps sides of this coin in better sync
> going forward?
I will include Relops on emails relating to major EC2-Manager or AWS-Provisioner changes if desired. As I'm not sure which internal details are used, I don't know which changes are important. I can also suggest following the EC2-Manager repository. That's where the development for all of these changes happens. Issues and Pull Requests are the best source of information. Changes are usually landed to master branch a day before deployment.
Some upcoming changes which might be important for Relops in EC2-Manager are:
- switching how we generate taskcluster credentials for workers
- removal of provisioner secrets
- we will be doing reliability based bidding
- we will be rotating the credentials which ec2-manager uses
The first two changes are changes to supported interfaces and backwards compatibility is a high priority. The last two should be invisible.
It would be great if Relops could let me know which tools operate on the taskcluster EC2 account. The tools of interest are those which either call listing operations or change state; anything which does either of those automatically is of interest too. Things like CloudWatch monitoring aren't important.
Comment 21•7 years ago
I cc-ed :markco on the AWS case, but in case[1] he can't see it, I'm copying the response here:
"From my understanding of your issue, you have an instance which is failing status checks & would like some assistance in investigating this issue.
"I have reviewed your instance i-001e32d40884f82ed & can see from looking at the status checks there is an issue with EC2 detecting the Network card for your instance which is generally caused by the interface being disabled within Windows.
"To resolve this issue, can you please create a new Elastic Network Interface, Attach it to your instance, then RDP to your instance using the new ENI IP address & ensure the Primary Network interface is not disabled?"
1. https://console.aws.amazon.com/support/v1#/case/?displayId=4875481371&language=en
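For reference, the suggested workaround could be scripted roughly as follows (the subnet and security group IDs are placeholders; the instance ID is the one named above):
# create a secondary ENI and attach it at device-index 1, then RDP to its IP
eni_id=$(aws ec2 create-network-interface --subnet-id subnet-0123456789abcdef0 \
  --groups sg-0123456789abcdef0 \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)
aws ec2 attach-network-interface --network-interface-id "${eni_id}" \
  --instance-id i-001e32d40884f82ed --device-index 1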
Comment 22•7 years ago
The problem turned out to be two-fold, but from the same root cause. We had code that relied on the key name of provisioned instances containing the worker type name.
1) The kill instances script relied on this, and therefore no longer was effective in killing workers.
2) OpenCloudConfig relied on this, and therefore OCC Windows instances would not work.
Non-OCC Windows workers are not affected (such as those used by nss, and nss-try).
1) Has a fix here:
https://mozilla.logbot.info/taskcluster/20180214#c14298284
2) Has a fix here:
https://github.com/mozilla-releng/OpenCloudConfig/commit/d7d2df5a174087bad52e7d3636ae92e043f999f0
The fix from 1) is already deployed.
It will take one to two hours for 2) to automatically deploy to the affected worker types, at which point we should see the backlogs to start to recover.
Comment 23•7 years ago
Thank you for the investigation and fixing the issue.
Pending counts are going down, sheriffs waited for the jobs pending on non-Try trees to run up to latest potential merge candidates, did a merge and then reopened the trees a few minutes ago. Autoland will get closed again because there should be many patches pending and we will reopen it again after we got some build and test coverage.
Comment 24•7 years ago
Trees are closing again because Windows builds stopped running after some time. Did we hit an AWS query limit?
https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/ currently lists 256 pending jobs for gecko-3-b-win2012.
Pending build jobs whose platforms start with 'win' grouped by state: https://sql.telemetry.mozilla.org/queries/51150
pending 394
running 5
The 'running' number is falling.
Pending build jobs whose platforms start with 'win', looked at the column 'state': https://sql.telemetry.mozilla.org/queries/51149
Flags: needinfo?(pmoore)
Summary: investigate ec2 windows instances stalled in impaired state → Tree closure: investigate ec2 windows instances stalled in impaired state
Comment 25•7 years ago
ec2-manager::CRIMSON=> select region, code, "workerType", count(code) from awsrequests where error = true and method = 'runInstances' and called > now() - interval '3h' group by "workerType", code, region;
region | code | workerType | count
--------------+------------------------------+----------------------+-------
us-east-1 | InsufficientInstanceCapacity | gecko-1-b-linux | 4
us-east-1 | InsufficientInstanceCapacity | gecko-1-b-macosx64 | 5
us-east-1 | RequestLimitExceeded | gecko-1-b-macosx64 | 1
us-west-1 | InsufficientInstanceCapacity | gecko-3-b-android | 9
us-west-1 | RequestLimitExceeded | gecko-3-b-android | 3
us-west-2 | RequestLimitExceeded | gecko-3-b-android | 1
us-east-1 | InsufficientInstanceCapacity | gecko-3-b-linux | 6
us-west-1 | InsufficientInstanceCapacity | gecko-3-b-linux | 44
us-west-1 | RequestLimitExceeded | gecko-3-b-linux | 4
us-east-1 | InsufficientInstanceCapacity | gecko-3-b-macosx64 | 6
us-west-2 | RequestLimitExceeded | gecko-t-linux-large | 46
us-west-2 | RequestLimitExceeded | gecko-t-linux-xlarge | 43
eu-central-1 | InsufficientInstanceCapacity | gecko-t-win10-64 | 134
us-west-1 | InsufficientInstanceCapacity | gecko-t-win10-64 | 66
eu-central-1 | RequestLimitExceeded | gecko-t-win10-64 | 1506
us-west-1 | RequestLimitExceeded | gecko-t-win10-64 | 34
us-west-2 | InsufficientInstanceCapacity | gecko-t-win10-64-gpu | 520
us-west-2 | InternalError | gecko-t-win10-64-gpu | 1
us-west-2 | RequestLimitExceeded | gecko-t-win10-64-gpu | 646
eu-central-1 | InsufficientInstanceCapacity | gecko-t-win7-32 | 139
us-west-1 | InsufficientInstanceCapacity | gecko-t-win7-32 | 93
eu-central-1 | RequestLimitExceeded | gecko-t-win7-32 | 1673
us-west-1 | RequestLimitExceeded | gecko-t-win7-32 | 35
us-west-2 | InsufficientInstanceCapacity | gecko-t-win7-32-gpu | 20
us-west-2 | RequestLimitExceeded | gecko-t-win7-32-gpu | 16
(25 rows)
These are the calls which attempted to spawn instances. The RequestLimitExceeded code is returned when too many calls are being made, the InsufficientInstanceCapacity code is returned when EC2 does not have capacity available for us.
I have work in progress for exposing this information in the Aws-Provisioner tools page.
Flags: needinfo?(pmoore)
Comment 26•7 years ago
The Windows builds not starting is being tracked in bug 1438152, got told this bug is only about gecko-t-win10-64-gpu.
See Also: → 1438152
Updated•7 years ago
Severity: blocker → normal
Priority: P1 → P3
Assignee
Updated•7 years ago
Summary: Tree closure: investigate ec2 windows instances stalled in impaired state → investigate ec2 windows instances stalled in impaired state
Comment 27•7 years ago
Any progress on this? It's blocking being able to run some kinds of tests in CI for QuantumRender, which in turn means it's going to be harder to ship QuantumRender.
Blocks: 1424755
Comment 28•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #27)
> Any progress on this? It's blocking being able to run some kinds of tests in
> CI for QuantumRender, which in turn means it's going to be harder to ship
> QuantumRender.
:coop, can you help prioritize this? This is blocking enabling QR for Windows Nightly.
Flags: needinfo?(coop)
Comment 29•6 years ago
If it helps, this is super easy to repro, just run with the patch from [1]:
> ./mach try -b o -p win64 -u awsy-e10s[windows10-64-qr] -t none --rebuild 5 --artifact
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=65633282852e99abbed00c59ef8df057cb9ababb
Comment 30•6 years ago
You can also easily reproduce with the patch from https://treeherder.mozilla.org/#/jobs?repo=try&revision=c57c6fe3ba01f2e3e62b34d6d02ca77c573fe2ac.
Comment 32•6 years ago
Redirecting NI to :fubar. I can't speak directly for :grenade's time.
Flags: needinfo?(coop) → needinfo?(klibby)
Comment 33•6 years ago
Per :pmoore's request in bug 1339136 comment 18, I ran curl -L on the livelog for one of these jobs that ended up going blue. The task is at [1]; I think it was run 3 or 4 that I ran curl on:
$ curl --location -o m3-livelog "https://queue.taskcluster.net/v1/task/DTrZAoN_SQq6iMZ2RW858A/runs/3/artifacts/public%2Flogs%2Flive.log"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 138 100 138 0 0 461 0 --:--:-- --:--:-- --:--:-- 463
100 2928k 0 2928k 0 0 3199 0 --:--:-- 0:15:37 --:--:-- 0
curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
The last couple of lines in the log are this:
15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 23 (0000016EB614BA00) [pid = 3388] [serial = 823] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/file_window_bar.html?default]^M
15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 22 (0000016EB7843800) [pid = 3388] [serial = 820] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/fi
so it looks like it just cuts off at some point. The "full" log (i.e. the m3-livelog file I got from curl) is attached.
[1] https://tools.taskcluster.net/groups/Qq3j73IqQ76miagF9laLxQ/tasks/DTrZAoN_SQq6iMZ2RW858A/details
Comment 34•6 years ago
Also I should note that the m3-livelog file is 2998272 bytes, which is an exact multiple of 16k, so it's likely that something was buffering the output and the last stuff in the file is just whatever was last flushed, not necessarily the last output generated by the worker.
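(Quick arithmetic check of the multiple-of-16k claim:)
echo $((2998272 / 16384)) $((2998272 % 16384))   # prints "183 0", i.e. exactly 183 16-KiB blocks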
Comment 35•6 years ago
Mark, can you have a go at this while Rob's on PTO? Given that we have a method to reproduce (thanks!) maybe we can manually run that while watching the event log, etc, and get a better idea of what's tripping us up.
Flags: needinfo?(klibby) → needinfo?(mcornmesser)
Comment 37•6 years ago
Out of curiosity, are any of the tasks/jobs that we've run into this issue on also run on buildbot/ESR52?
Comment 38•6 years ago
What is the problem being reproduced in comment 29 and comment 30? Are those producing instances in an impaired state, or is it an artifact upload issue?
Do we have any way to identify instances that are in an impaired state?
Comment 39•6 years ago
(In reply to Mark Cornmesser [:markco] from comment #38)
> What is the problem being reproduced in comment 29 and comment 30? Are those
> producing instances in an impaired state, or is an artifact upload issue?
>
> Do we have anyway to identify instances that are in an impaired state?
I believe the instances are in an impaired state; trying to inspect one of the machines [1] indicates it's not provisioned.
[1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu/workers/eu-central-1/i-0aace1bdc7bd7ba24
Comment 40•6 years ago
Because of bug 1443589 I have not been able to spend much time on this bug this week.
Comment 41•6 years ago
We can still reproduce this with Windows coverage builds. Are you still planning to investigate this?
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Assignee
Comment 42•6 years ago
hi Marco, yes, we're still actively trying to solve this.
it would be good to test if the problem can be reproduced on the new auto-generated amis (from bug 1474548) used on our beta workers.
if you have a good way to reproduce the problem, can i ask that you import this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to try? the commit forces all windows tasks onto our beta workers where we'd like to check if the generated windows 10 amis maybe don't suffer from problems that cause them to go offline.
thank you!
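(for illustration, that workflow looks roughly like this; the try selection on the second line is only an example:)
hg import https://hg.mozilla.org/try/raw-rev/0103760   # redirects windows tasks to the beta workers
./mach try fuzzy -q 'test-windows10-64-qr'             # then push whatever selection reproduces the problem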
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Comment 43•6 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #42)
> hi Marco, yes, we're still actively trying to solve this.
>
> it would be good to test if the problem can be reproduced on the new
> auto-generated amis (from bug 1474548) used on our beta workers.
>
> if you have a good way to reproduce the problem, can i ask that you import
> this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to
> try? the commit forces all windows tasks onto our beta workers where we'd
> like to check if the generated windows 10 amis maybe don't suffer from
> problems that cause them to go offline.
>
> thank you!
Yes, that patch seems to fix the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4535f9712505f214b9078a1081c442ebbc1fdc9f.
Comment 44•6 years ago
Great news, thanks guys!
Comment 45•6 years ago
The patch worked for me as well: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=a73d0ef7bae6722730367ef060f312d8a375623c
Any ETA on landing it? Really excited to see this finally fixed!
Assignee
Comment 46•6 years ago
jmaher: we think we're ready to roll out auto-generated windows 10 amis for gecko-t-win10-64-gpu worker types. we'd like to do a test run on the gecko-t-win10-64-gpu-b beta workers that runs a fairly heavy test load against that worker type.
i would like to create some try pushes that trigger all the tests that already run on gecko-t-win10-64-gpu but redirect them to gecko-t-win10-64-gpu-b using the worker swap from https://hg.mozilla.org/try/raw-rev/0103760.
do you know what try syntax i should use to trigger quite a lot of win 10 gpu tests? do you want to do any testing as well that i should wait on?
Flags: needinfo?(jmaher)
Comment 47•6 years ago
here are some examples:
https://searchfox.org/mozilla-central/search?q=%3A+hardware&case=false&regexp=false&path=taskcluster%2Fci
ignore the |default: hardware|, those are perf tests.
I would also do win10-ccov as well- you can find that in |./mach try fuzzy -q 'windows ccov'|
since these tests fail 90% of the time on existing configurations, just getting them to run a few times will be more than adequate :)
If you want to up the load, just run all win10 jobs 4 or 5 times :)
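(for example, something along these lines; the query string and rebuild count are illustrative:)
./mach try fuzzy -q 'windows ccov' --rebuild 5   # repeat each selected job five times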
Flags: needinfo?(jmaher)
Assignee
Comment 48•6 years ago
here's what i came up with to test (hopefully i got my patches and syntax correct):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd4980f
Comment 49•6 years ago
this looks good- a few perma failing tests in reftest- those can be fixed easily.
Assignee
Comment 50•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #49)
> this looks good- a few perma failing tests in reftest- those can be fixed
> easily.
great. is there anything obvious on infra that could be causing those reftest failures? i converted some of the base64 images that had pixel differences to see if i could spot a desktop popup or something but the images i saw were just greyscale spheres
on a white background for both the reference png and the test png.
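(for reference, a base64 screenshot pulled out of the reftest log can be decoded like this; the file names are illustrative:)
base64 --decode screenshot.b64 > screenshot.png   # screenshot.b64 holds the base64 blob copied from the log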
should i wait on promoting this ami to gecko-t-win10-64-gpu or is it easier to get reftest fixed if we go live with the updated ami?
Flags: needinfo?(jmaher)
Assignee
Comment 51•6 years ago
here's the last test run i did last night that included mochitest-gpu as well:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=c406a22
Comment 52•6 years ago
here are the failures (debug job names- same tests fail in opt as well):
R1
* https://searchfox.org/mozilla-central/source/layout/reftests/svg/reftest.list#340
** already fuzzy-if win10 to 15, we need 16
* https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/reftest.list#38
** NEW FAILURE - VISIBLE DIFFERENCE
R3
* fails already in verify mode: https://searchfox.org/mozilla-central/source/layout/reftests/canvas/reftest.list#80
** doesn't render in repeated attempts, I would skip it or fuzzy-if(1,45)
* https://searchfox.org/mozilla-central/source/layout/reftests/async-scrolling/reftest.list#58
** accepts up to 20 pixels incorrect, we have 22 now
R4 (these are in wpt already...can we turn these off?)
* https://searchfox.org/mozilla-central/source/layout/reftests/w3c-css/submitted/masking/reftest.list#33
** already fuzzy up to 3, we just need 7 pixels
* https://searchfox.org/mozilla-central/source/layout/reftests/w3c-css/submitted/masking/reftest.list#91-99
** already fuzzy to 21, we need 27 pixels (only tests that have an existing fuzzy-if(winWidget))
R4
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#47
** fuzzy if already for win 6.2 (i.e. win8), need to remove that and make win10 fuzzy to 12
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#50
** fuzzyif for win8 need to make win10
* https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#83
** NEW FAILURE?
* https://searchfox.org/mozilla-central/source/gfx/tests/reftest/reftest.list#17
** already fuzzy to 26354, make it 26621
in all of this there are 2 things we probably should investigate annotated with NEW FAILURE.
:mattwoodrow, can you look at the failure in transforms-1.html (https://searchfox.org/mozilla-central/source/layout/reftests/border-radius/reftest.list#83)?
** https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/dKv_7M14TquglVDxpmdvBQ/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
:dbaron, can you look at the failure in grid-placement-auto-implicit-001.html (https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/reftest.list#38)?
** https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/P6Cm0gv3QuKnIcn_LKfWqg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmaher)
Flags: needinfo?(dbaron)
Assignee
Comment 53•6 years ago
Assignee
Comment 54•6 years ago
Assignee
Comment 55•6 years ago
Assignee
Comment 56•6 years ago
Assignee
Comment 57•6 years ago
crashtest, mochitest-(1-5) & mochitest-media-e10s are currently configured to run:
- on gpu instances (virtual-with-gpu) when running a "Windows 10 x64 QuantumRender" [windows10-64-qr] test configuration, and
- without gpu (virtual) when running a normal "Windows 10 x64" [Windows 10] test configuration.
see: https://dxr.mozilla.org/mozilla-central/search?q=windows10-64-qr%2F.*%3A+virtual-with-gpu
it looks like mochitest-media-e10s is failing when run on the new ami under the windows10-64-qr config. i don't know if that's significant, but thought i would mention it in case we need to fix that too. if someone understands the significance of why these suites run on gpu for the qr configuration only, maybe they could make a call on what we should do about this. it looks like it was switched on in bug 1426155.
here's a push showing failure for mochitest-media-e10s on the new ami (beta) with gpu and success on old ami (prod) without gpu:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=68cc57c
crashtest & mochitest-(1-5) look fine:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=78df068
Comment 58•6 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #57)
> it looks like mochitest-media-e10s is failing when run on the new ami under
> the windows10-64-qr config. i don't know if that's significant, but thought
> i would mention it in case we need to fix that too.
For now we can disable the failing test on the windows10-64-qr platform. Although really it looks like the test doesn't stall, it just takes a really long time as the log shows output still coming from the test at the point that it times out. I'm happy to write a patch to deal with this failure if you need me to.
> if someone understands
> the significance of why these suites run on gpu for the qr configuration
> only, maybe they could make a call on what we should do about this.
These suites run on gpu because the qr configuration has webrender enabled, and webrender only works if you have a gpu.
Assignee
Comment 59•6 years ago
Assignee
Comment 60•6 years ago
Comment 61•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #58)
> For now we can disable the failing test on the windows10-64-qr platform.
> Although really it looks like the test doesn't stall, it just takes a really
> long time
Actually it looks like the test is waiting for a particular pixel to turn red but never sees that happening, and it keeps looping and checking until the test times out. I'm not sure offhand why this is happening as the screenshot seems to show the expected behaviour, but I'm happy to disable the test on QR until we figure it out.
Comment 62•6 years ago
Comment 63•6 years ago
transforms-1.html is failing with a max difference of 1 with the two images disagreeing on 127 vs 128 (~0.5 as a float).
That's super subtle, and probably just a slight difference in floating point calculations somewhere, maybe the graphics driver.
We should just fuzz it.
Flags: needinfo?(matt.woodrow)
Comment 64•6 years ago
redirecting my needinfo from comment 52 to mats
Flags: needinfo?(dbaron) → needinfo?(mats)
Comment 65•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #52)
> :dbaron, can you look at the failure in
> grid-placement-auto-implicit-001.html
> (https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/
> reftest.list#38)?
Looks like it's using JS to generate some part of the test/reference,
so I suspect it needs to use "reftest-wait". Please disable it and
file a bug in Core/Layout:Grid.
Flags: needinfo?(mats)
Assignee
Comment 66•6 years ago
jmaher: i believe we need 3 patches to tests in order to start using the new auto generated amis.
- disable failing mda test on QR (as per :kats' patch in comment 62)
- fuzz reftests/transforms-1.html tests (as per :mattwoodrow's comment 63)
- disable reftests/grid-placement-auto-implicit-001.html (as per mats comment 65)
is this something you are able to organise or assign to the appropriate people?
Flags: needinfo?(jmaher)
Comment 67•6 years ago
If Joel's not around I'm happy to do this.
Comment 68•6 years ago
let me get some bugs in place and we can resolve them independently (ideally this week)
Flags: needinfo?(jmaher)
Updated•6 years ago
Attachment #9022091 - Flags: review+
Comment 69•6 years ago
in addition there are a series of small adjustments to make to existing fuzzy-if statements outlined in comment 52. I have try pushes going for the first two bugs
Assignee
Comment 70•6 years ago
Assignee
Comment 71•6 years ago
Comment 72•6 years ago
Comment on attachment 9022091 [details] [diff] [review]
patch to disable failing mda test on QR
I landed this patch on bug 1509064
Attachment #9022091 - Flags: checked-in+
Assignee
Comment 73•6 years ago
Assignee
Comment 74•6 years ago
jmaher, kats:
are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta to prod), based on the test results linked in comment 73 above, or should i wait for further fuzzy patches?
i note that there are a couple failures:
- test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries
- test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries
Flags: needinfo?(kats)
Flags: needinfo?(jmaher)
Comment 75•6 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #74)
> i note that there are a couple failures:
> - test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries
This one needs fixing still - it looks like the test on windows has both the 'webrender' and 'skiaContent' annotations activated, and the latter is taking precedence over the former. If we switch the order of those annotations it should be good. I'll land a patch for this.
> - test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries
This is a pre-existing failure, so I'm not worried about this
> jmaher, kats:
> are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta
> to prod), based on the test results linked in comment 73 above, or should i
> wait for further fuzzy patches?
>
Please wait for me to fix the fuzzy failure above. The trees are closed at the moment so it will be whenever the trees reopen.
Flags: needinfo?(kats)
Comment 76•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #75)
> This one needs fixing still - it looks like the test on windows has both the
> 'webrender' and 'skiaContent' annotations activated, and the latter is
> taking precedence over the former. If we switch the order of those
> annotations it should be good. I'll land a patch for this.
Sorry, I was looking at an old version of the code. The actual problem is the &&isDebugBuild clause on the annotation. I'll need to drop that since it happens on both opt and debug. Will do that once trees reopen.
Comment 77•6 years ago
I landed the patch on inbound earlier today. If the AMI change happens as part of an in-tree change then you could land it on inbound now and it should be good. Otherwise we should wait until that patch has propagated to all the trunk trees.
Comment 78•6 years ago
The patch is on central now so we should be good to go!
Assignee
Comment 79•6 years ago
Assignee
Comment 80•6 years ago
new amis are now live:
https://github.com/mozilla-releng/OpenCloudConfig/commit/606fe69601a6621b144580f17365234e5347766d
https://github.com/mozilla-releng/OpenCloudConfig/commit/496b6e8d17392503b50c05086c38167d5747fea6
https://tools.taskcluster.net/task-group-inspector/#/X-DxvMHGR3uKGODnIykkxA
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Comment 81•6 years ago
\o/ thank you!
Assignee
Comment 82•6 years ago
Comment 83•6 years ago
New failures on Windows 10 have increased sharply: https://sql.telemetry.mozilla.org/queries/60373/
Rob, can you take a look at those, please?
Table with links to bugs, logs and infos on job type and platform: https://sql.telemetry.mozilla.org/queries/60371/
Flags: needinfo?(rthijssen)
Assignee
Comment 84•6 years ago
i'm afraid i don't know what to make of this data or what info is being asked of me but i'll take a stab at a response:
the change implemented by this bug was to create and use a new windows 10 ami which uses a source controlled, automated process to deterministically create an ec2 ami from a windows build 1703 iso (see bug 1474548), replacing an ami that was manually created (by a human, installing windows on a vm, and importing the resulting vhd into ec2, as an ami).
the change gives us a mechanism to deterministically modify the ami generation process in an incremental and audited way in order to improve the repeatability of that exercise for subsequent builds of windows without having to rely on human memory from the last time an ami was manually built. a happy byproduct of the change was that instances created from the new amis do not suffer from going into an impaired state whenever the gpu is put through its paces, hence the resolution of this bug.
we can of course revert the deployment to go back to the manually created ami which did not have the problems referred to by the telemetry link (but which did have other problems, like instances that die mid-task), or we can choose to bite the bullet and start to incrementally improve the automated process to reduce the number of intermittently failing tests, as and when we understand the causes of the individual failures. i prefer the latter, but it's not my call and i'm happy to work with whatever moves us forward.
if there's any other specific questions i can answer, please feel free to ni me again.
Flags: needinfo?(rthijssen)
Comment 85•6 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #84)
> i'm afraid i don't know what to make of this data
I'm also somewhat overwhelmed by the data.
Sebastian: can you (or one of the sheriffs) call out a few standout bugs from that list to block this bug on if you want Rob to dig into this more, please?
Flags: needinfo?(aryx.bugmail)
Comment 86•6 years ago
I think the problem is that there are many individual new bugs: it isn't as if one or two bugs are happening at large scale; rather, the trend shows a low baseline rate of new windows 10 intermittent bugs and then, in the last 3 days, a massive jump in new issues.
I also agree with you :coop; this is a hard problem to understand with that data. I think looking into common failures (crashes, hangs, startup hangs, focus, specific suites/messages) would help make this actionable.
In fact it would be nice if we had metrics broken down daily like:
* config - overall intermittent rate, overall new intermittent rate
* test suite - overall intermittent rate, overall new intermittent rate
* buckets (hangs, timeouts, crashes, infra, test failure) - overall intermittent rate, overall new intermittent rate
We could at least have a common ground for understanding what the different patterns are and catch issues that are more widespread easier.
Comment 87•6 years ago
The failure rate uptick is caused by Windows 10 update prompts overlaying the screen which causes focus, click automation and color detection issues. See bug 1511293.
Flags: needinfo?(aryx.bugmail) → needinfo?(rthijssen)