Closed Bug 1317213 Opened 9 years ago Closed 9 years ago

nss-win2012r2 has tasks expiring by deadline and 50 instances not consuming tasks

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Unassigned)

Details

Attachments

(1 file)

https://tools.taskcluster.net/aws-provisioner/#nss-win2012r2/
@pmoore, you probably have a clue what is up.
Flags: needinfo?(pmoore)
It looks like NSS jobs are hanging here:

    # Run tests.
    cd nss/tests && ./all.sh
    + cd nss/tests
    + ./all.sh
    testdir is c:/Users/task_1478920874/tests_results/security
    which: domainname: unknown command
    init.sh init: Testing PATH .:/c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/bin:/c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/lib:/bin:/usr/bin:/c/Users/task_1478920874/vs2015u2/VC/bin/amd64:/c/Users/task_1478920874/vs2015u2/VC/bin:/c/Users/task_1478920874/vs2015u2/SDK/bin/x64:/c/Users/task_1478920874/vs2015u2/VC/redist/x64/Microsoft.VC140.CRT:/c/Users/task_1478920874/vs2015u2/SDK/Redist/ucrt/DLLs/x64:/c/mozilla-build/python:/usr/local/bin:/c/mozilla-build/7zip:/c/mozilla-build/info-zip:/c/mozilla-build/python/Scripts:/c/mozilla-build/yasm:/c/Windows/system32:/c/mozilla-build/upx391w:/c/mozilla-build/moztools-x64/bin:/c/mozilla-build/wget against LIB /c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/lib:
    ./all.sh: Testing with shared library ===============================
    Running tests for ssl
    TIMESTAMP ssl BEGIN: Sat Nov 12 03:22:33 CUT 2016
    ssl.sh: SSL tests ===============================
    ssl.sh: CRL SSL Client Tests - with ECC ===============================
    ssl.sh: TLS Request don't require client auth (client does not provide auth) ----
    selfserv starting at Sat Nov 12 03:22:34 CUT 2016
    selfserv -D -p 8443 -d ../server -n localhost.localdomain \
        -e localhost.localdomain-ecmixed -e localhost.localdomain-ec -S localhost.localdomain-dsa -w nss -r -i ../tests_pid.1904\
        -V ssl3:tls1.2 -H 1 &
    trying to connect to selfserv at Sat Nov 12 03:22:34 CUT 2016
    tstclnt -p 8443 -h localhost.localdomain -q \
        -d ../client < /c/Users/task_1478920874/nss/tests/ssl/sslreq.dat
    Enter Password or Pin for "NSS Certificate DB":
    (the prompt above repeats dozens of times on a single line in the log)
    [taskcluster 2016-11-13T02:53:32.675Z] TASK EXCEPTION due to reclaim failure - please report this in #taskcluster as it is a serious error
Flags: needinfo?(pmoore)
Attached file live_backing.log
An example task log.
This example task log was from: https://tools.taskcluster.net/task-inspector/#eINvHHNyTwaNMeEMmGJ9qA/

Note - because the worker is running a slightly older version of generic worker, the maxRunTime isn't respected (bug 1278639) and therefore the task runs for 24 hours. Yelp. This then also causes an unexpected error, so the generic worker exits.

The best solution to the general problem is to upgrade the generic worker. This will have the following benefits:

1) maxRunTime will be respected, so a task can only run for one hour (or whatever we put in the payload), not 24.
2) even if a task hangs, the worker will not shut down - it will run for maxRunTime, upload logs, mark the task as failed, and then continue taking jobs.
3) if a worker ever does decide it is in a bad state, it won't only exit, it will also terminate the instance, so the provisioner can spawn a new one.

Separately from upgrading the worker to overcome these already solved generic worker bugs, we should find out why the task prompts for "Password or Pin" in comment 1, and get the NSS tests fixed so that they do not require interaction.
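To illustrate points 1 and 2 above, here is a minimal sketch of running a task command with a hard time limit and with no interactive input available. It is not the generic worker's implementation (the function name `run_with_max_run_time` and the status strings are made up for this example); it only shows the idea that a command waiting on a prompt should be cut off after maxRunTime and reported as failed, leaving the caller free to take the next job.

```python
import subprocess
import sys


def run_with_max_run_time(command, max_run_time_seconds):
    """Run `command` non-interactively and return (status, exit_code).

    Hypothetical helper for illustration only; the status strings are
    not taken from any Taskcluster component.
    """
    try:
        completed = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,  # a prompt reads EOF instead of waiting for a human
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=max_run_time_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child process before raising,
        # so the caller can carry on with the next job.
        return ("failed: maxRunTime exceeded", None)
    status = "completed" if completed.returncode == 0 else "failed"
    return (status, completed.returncode)


if __name__ == "__main__":
    # Stand-in for a hung test: sleeps far longer than the allowed run time.
    hung_command = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_max_run_time(hung_command, max_run_time_seconds=1))
```

Note that this only terminates the direct child process; a real worker also has to clean up any processes the task itself spawned (such as a background selfserv).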
franziskus, tim, do you know what is going on here? Have we fixed this already? I know that we had a few pin-related issues recently.
Flags: needinfo?(ttaubert)
Flags: needinfo?(franziskuskiefer)
I believe this was backed out in http://hg.mozilla.org/projects/nss/rev/90d9e7ad5af2 so the NSS part is done. I'm just rolling out a new worker now, so that a change like this can't consume the entire worker pool in future.

Previous generic worker version: 5.3.1
New generic worker version: 7.0.2alpha1

Note, this is still an alpha release (not yet production tested), so if we see any problems, we should roll back. However, it does have a ton of fixes / improvements, so it should help.
Flags: needinfo?(ttaubert)
Flags: needinfo?(franziskuskiefer)
In case I'm offline, and this needs to get rolled back, these are the current AMIs we are running with. Anyone in the taskcluster team can roll back to these versions, if required (if the new worker doesn't play nicely).

    "regions": [
      {
        "launchSpec": {
          "ImageId": "ami-b387fca4"
        },
        "region": "us-east-1",
        "scopes": [],
        "secrets": {},
        "userData": {}
      },
      {
        "launchSpec": {
          "ImageId": "ami-9f3779ff"
        },
        "region": "us-west-1",
        "scopes": [],
        "secrets": {},
        "userData": {}
      },
      {
        "launchSpec": {
          "ImageId": "ami-a14d91c1"
        },
        "region": "us-west-2",
        "scopes": [],
        "secrets": {},
        "userData": {}
      }
    ],
(to be explicit: the above AMIs are the *old* AMIs with generic worker 5.3.1 - *not* the new AMIs with generic worker 7.0.2alpha1 that I am currently building)
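For anyone who has to perform the rollback, the following is a small sketch that pulls a region-to-AMI mapping out of the fragment quoted above. The fragment is wrapped in braces here so that it parses as JSON, and `amis_by_region` is an illustrative name, not part of any Taskcluster tool.

```python
import json

# The "regions" fragment from the comment above, wrapped in braces so that
# it is a complete JSON document. These are the OLD AMIs (generic worker 5.3.1).
OLD_REGIONS_JSON = """
{
  "regions": [
    {"launchSpec": {"ImageId": "ami-b387fca4"}, "region": "us-east-1",
     "scopes": [], "secrets": {}, "userData": {}},
    {"launchSpec": {"ImageId": "ami-9f3779ff"}, "region": "us-west-1",
     "scopes": [], "secrets": {}, "userData": {}},
    {"launchSpec": {"ImageId": "ami-a14d91c1"}, "region": "us-west-2",
     "scopes": [], "secrets": {}, "userData": {}}
  ]
}
"""


def amis_by_region(definition_text):
    """Return {region: ImageId} for every entry in the "regions" list."""
    definition = json.loads(definition_text)
    return {
        entry["region"]: entry["launchSpec"]["ImageId"]
        for entry in definition["regions"]
    }


if __name__ == "__main__":
    for region, ami in sorted(amis_by_region(OLD_REGIONS_JSON).items()):
        print(region, ami)
```

Comparing this mapping against the worker type definition currently shown in the provisioner is a quick way to tell whether a given region is on the old or the new AMI.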
New version is deployed. Note - newly provisioned workers will have the new worker version - but existing workers with the older version will continue to run until either they run for an hour without a job, or they run for 96 hours. So if you see any problems, be sure to check the generic worker version from the task log, to see if it is the old version, or the new version.
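The retirement rule described above can be sketched as follows, to help judge whether a worker with the old version could still be around at a given time. This is a simplified reading of the comment (one hour idle, or 96 hours total), not the worker's or provisioner's actual logic; the function names and the timestamp in the usage example are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(hours=1)       # "run for an hour without a job"
MAX_LIFETIME = timedelta(hours=96)  # "or they run for 96 hours"


def worker_should_stop(launched_at, idle_since, now):
    """True once a worker has been idle for an hour or alive for 96 hours.

    `idle_since` is None while the worker is busy with a task.
    """
    idle_too_long = idle_since is not None and now - idle_since >= MAX_IDLE
    too_old = now - launched_at >= MAX_LIFETIME
    return idle_too_long or too_old


def old_workers_may_remain(deployed_at, now):
    """True while a worker launched just before the deployment could still run."""
    return now < deployed_at + MAX_LIFETIME


if __name__ == "__main__":
    # Hypothetical deployment time, for illustration only.
    deployed_at = datetime(2016, 11, 14, 12, 0, tzinfo=timezone.utc)
    print(old_workers_may_remain(deployed_at, deployed_at + timedelta(hours=48)))
    print(old_workers_may_remain(deployed_at, deployed_at + timedelta(hours=97)))
```

In other words, under this reading, any task log from more than 96 hours after the deployment should come from the new worker version.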
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Operations → Operations and Service Requests