Closed
Bug 1317213
Opened 9 years ago
Closed 9 years ago
nss-win2012r2 has tasks expiring by deadline and 50 instances not consuming tasks
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jonasfj, Unassigned)
Details
Attachments
(1 file)
27.17 KB, text/plain
https://tools.taskcluster.net/aws-provisioner/#nss-win2012r2/
@pmoore, you probably have a clue what is up..
Flags: needinfo?(pmoore)
Comment 1•9 years ago
It looks like NSS jobs are hanging here:
# Run tests.
cd nss/tests && ./all.sh
+ cd nss/tests
+ ./all.sh
testdir is c:/Users/task_1478920874/tests_results/security
which: domainname: unknown command
init.sh init: Testing PATH .:/c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/bin:/c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/lib:/bin:/usr/bin:/c/Users/task_1478920874/vs2015u2/VC/bin/amd64:/c/Users/task_1478920874/vs2015u2/VC/bin:/c/Users/task_1478920874/vs2015u2/SDK/bin/x64:/c/Users/task_1478920874/vs2015u2/VC/redist/x64/Microsoft.VC140.CRT:/c/Users/task_1478920874/vs2015u2/SDK/Redist/ucrt/DLLs/x64:/c/mozilla-build/python:/usr/local/bin:/c/mozilla-build/7zip:/c/mozilla-build/info-zip:/c/mozilla-build/python/Scripts:/c/mozilla-build/yasm:/c/Windows/system32:/c/mozilla-build/upx391w:/c/mozilla-build/moztools-x64/bin:/c/mozilla-build/wget against LIB /c/Users/task_1478920874/dist/WIN954.0_x86_64_64_OPT.OBJ/lib:
./all.sh: Testing with shared library ===============================
Running tests for ssl
TIMESTAMP ssl BEGIN: Sat Nov 12 03:22:33 CUT 2016
ssl.sh: SSL tests ===============================
ssl.sh: CRL SSL Client Tests - with ECC ===============================
ssl.sh: TLS Request don't require client auth (client does not provide auth) ----
selfserv starting at Sat Nov 12 03:22:34 CUT 2016
selfserv -D -p 8443 -d ../server -n localhost.localdomain \
-e localhost.localdomain-ecmixed -e localhost.localdomain-ec -S localhost.localdomain-dsa -w nss -r -i ../tests_pid.1904\
-V ssl3:tls1.2 -H 1 &
trying to connect to selfserv at Sat Nov 12 03:22:34 CUT 2016
tstclnt -p 8443 -h localhost.localdomain -q \
-d ../client < /c/Users/task_1478920874/nss/tests/ssl/sslreq.dat
Enter Password or Pin for "NSS Certificate DB":
[prompt repeated many times in the original log]
[taskcluster 2016-11-13T02:53:32.675Z] TASK EXCEPTION due to reclaim failure - please report this in #taskcluster as it is a serious error
Flags: needinfo?(pmoore)
Comment 2•9 years ago
An example task log.
Comment 3•9 years ago
This example task log was from:
https://tools.taskcluster.net/task-inspector/#eINvHHNyTwaNMeEMmGJ9qA/
Note - due to being on a slightly older version of generic worker, the maxRunTime isn't respected (bug 1278639) and therefore the task runs for 24 hours. Yelp.
This then also causes an unexpected error, so the generic worker exits.
The best solution to the general problem is to upgrade the generic worker. This will have the following benefits:
1) maxRunTime will be respected, so a task can only run for one hour (or whatever we put in the payload), not 24.
2) even if a task hangs, the worker will not shut down - it will run for maxRunTime, upload logs, mark the task as failed, and then continue taking jobs
3) if a worker ever does decide it is in a bad state, it won't only exit, it will also terminate the instance, so the provisioner can spawn a new one
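To illustrate points 1) and 2), here is a minimal Python sketch of a worker loop that enforces a maximum run time per task and then carries on with the next task. It is not the generic worker's actual implementation; the function names and outcome labels are hypothetical, chosen only for this example.

```python
import subprocess
import sys


def run_with_max_run_time(command, max_run_time_seconds):
    """Run one task command and return a hypothetical outcome label."""
    try:
        result = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # no interactive input is available
            stdout=subprocess.PIPE,     # captured output stands in for the task log
            stderr=subprocess.STDOUT,
            timeout=max_run_time_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return "timed-out"
    return "completed" if result.returncode == 0 else "failed"


def work_through(tasks, max_run_time_seconds):
    """Process every task in turn; a hung task does not stop the loop."""
    return [run_with_max_run_time(cmd, max_run_time_seconds) for cmd in tasks]


if __name__ == "__main__":
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    healthy = [sys.executable, "-c", "print('ok')"]
    print(work_through([hanging, healthy], max_run_time_seconds=2))
```

The point of the sketch is only that the timeout is handled inside the loop, so one hung task costs at most the configured run time rather than taking the worker out of service.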
Separately from upgrading the worker to overcome these already-solved generic worker bugs, we should find out why the task prompts for "Password or Pin" in comment 1, and get the NSS tests fixed so that they do not require interaction.
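As a stopgap for spotting this kind of hang in a task log, something like the following Python sketch could flag logs where the prompt from comment 1 repeats. The function name and the threshold of 3 are assumptions made for the example, not part of any existing tool.

```python
# Prompt text copied from the task log in comment 1.
PROMPT = 'Enter Password or Pin for "NSS Certificate DB":'


def looks_interactive(log_text, threshold=3):
    """Return True if the password prompt appears at least `threshold` times."""
    return log_text.count(PROMPT) >= threshold


if __name__ == "__main__":
    sample = "tstclnt -p 8443 -h localhost.localdomain -q\n" + PROMPT * 5
    print(looks_interactive(sample))
```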
Comment 4•9 years ago
franziskus, tim, do you know what is going on here? Have we fixed this already? I know that we had a few pin-related issues recently.
Flags: needinfo?(ttaubert)
Flags: needinfo?(franziskuskiefer)
Comment 5•9 years ago
I believe this was backed out in http://hg.mozilla.org/projects/nss/rev/90d9e7ad5af2 so the NSS part is done. I am now rolling out a new worker so that a change like this cannot consume the entire worker pool in future.
Previous generic worker version: 5.3.1
New generic worker version: 7.0.2alpha1
Note, this is still an alpha release (not yet production tested), so if we see any problems, we should roll back. However, it does have a ton of fixes / improvements, so it should help.
Flags: needinfo?(ttaubert)
Flags: needinfo?(franziskuskiefer)
Comment 6•9 years ago
In case I'm offline, and this needs to get rolled back, these are the current AMIs we are running with. Anyone in the taskcluster team can roll back to these versions, if required (if the new worker doesn't play nicely).
"regions": [
  {
    "launchSpec": {
      "ImageId": "ami-b387fca4"
    },
    "region": "us-east-1",
    "scopes": [],
    "secrets": {},
    "userData": {}
  },
  {
    "launchSpec": {
      "ImageId": "ami-9f3779ff"
    },
    "region": "us-west-1",
    "scopes": [],
    "secrets": {},
    "userData": {}
  },
  {
    "launchSpec": {
      "ImageId": "ami-a14d91c1"
    },
    "region": "us-west-2",
    "scopes": [],
    "secrets": {},
    "userData": {}
  }
],
Comment 7•9 years ago
(to be explicit: the above AMIs are the *old* AMIs with generic worker 5.3.1 - *not* the new AMIs with generic worker 7.0.2alpha1 that I am currently building)
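If a rollback is needed, the edit to the worker type definition amounts to putting the old ImageId back in each region. The Python sketch below shows that edit on an in-memory copy of the definition. It assumes the definition has the "regions" layout from comment 6; the function name is hypothetical, and submitting the updated definition to the provisioner is deliberately not shown.

```python
import copy

# Old AMI IDs copied from comment 6 (generic worker 5.3.1).
OLD_AMIS = {
    "us-east-1": "ami-b387fca4",
    "us-west-1": "ami-9f3779ff",
    "us-west-2": "ami-a14d91c1",
}


def with_rolled_back_amis(worker_type, amis=OLD_AMIS):
    """Return a copy of the definition with each region's ImageId replaced."""
    updated = copy.deepcopy(worker_type)
    for region in updated["regions"]:
        name = region["region"]
        if name not in amis:
            # Refuse to guess: every region must have a recorded old AMI.
            raise KeyError(f"no rollback AMI recorded for region {name}")
        region["launchSpec"]["ImageId"] = amis[name]
    return updated
```

Working on a deep copy keeps the current definition intact, so the two versions can be compared before anything is changed.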
Comment 8•9 years ago
New version is deployed.
Note - newly provisioned workers will have the new worker version - but existing workers with the older version will continue to run until either they run for an hour without a job, or they run for 96 hours.
So if you see any problems, be sure to check the generic worker version from the task log, to see if it is the old version, or the new version.
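The retirement rule for existing workers can be written down as a small Python check. This is only a restatement of the two limits mentioned above (one idle hour, 96 hours total); the constant and function names are made up for the example.

```python
MAX_IDLE_HOURS = 1      # a worker stops after an hour without a job
MAX_UPTIME_HOURS = 96   # a worker stops after running for 96 hours


def old_worker_still_running(idle_hours, uptime_hours):
    """Return True while neither limit has been reached."""
    return idle_hours < MAX_IDLE_HOURS and uptime_hours < MAX_UPTIME_HOURS


if __name__ == "__main__":
    # A busy worker launched 10 hours ago may still be on the old version.
    print(old_worker_still_running(idle_hours=0.2, uptime_hours=10))
```

In other words, a busy pool may keep old-version workers around for up to 96 hours after the deployment.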
Comment 9•9 years ago
Retriggered a task, and all looks ok:
https://tools.taskcluster.net/task-inspector/#L9BTxumjTg2jbBwt21vVTg/0
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: Operations → Operations and Service Requests