Closed
Bug 1441208
Opened 7 years ago
Closed 7 years ago
Trees closed: we are getting many talos jobs failing with exceptions and no log file on windows as of this morning
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: markco)
References
Details
(Whiteboard: [stockwell infra])
you can see the purple jobs here:
https://treeherder.mozilla.org/#/jobs?repo=autoland&tochange=e3cce6ae4b1569a180aba116908d432483fa7b04&filter-searchStr=talos%20win&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable
this started this morning- was on central yesterday- but for days before that we didn't see jobs.
Without a log file, I don't have any way to debug this.
Reporter
Comment 1•7 years ago
:markco- do you know of any reason why we would be getting exceptions on our windows hardware jobs?
Flags: needinfo?(mcornmesser)
Comment 2•7 years ago
Another occurrence on central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=6d72eade26af359ffc3cd3e381fd79c88922b9b8&filter-searchStr=R2&filter-platform=windows10-64-ccov%20debug&selectedJob=164396800
The failure log keeps loading, the raw log does not show anything, and when we access the task info from taskcluster it simply says "no content". After a few refreshes, the content changes to the one from bug 1437407.
The following jobs are still stuck with the "no content" error:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=683e5aafae7c35798ebe01e4f8a6dabd251c4217&filter-searchStr=Windows%207%20pgo
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=74102ba585b5fc0184e896ad5b2858ba86dbbce3&filter-searchStr=windows%207%20pgo
Assignee
Comment 3•7 years ago
I am taking a look into this. I have no initial thoughts on what might be the cause.
Flags: needinfo?(mcornmesser)
Comment hidden (Intermittent Failures Robot) |
Assignee
Comment 5•7 years ago
I suspect that the exceptions are coming from machines that are being rebooted before the task starts.
I am still trying to figure this out. It is weird though. The papertrail logs are dominated by entries for T-W1064-MS-040, which relate to a disk space issue:
Feb 26 15:05:19 T-W1064-MS-040.mdc1.mozilla.com generic-worker: 2018/02/26 22:49:30 Not able to free up enough disk space - require 10737418240 bytes, but only have 9928675328 bytes - and nothing left to delete.
I shut down that node from the chassis and opened Bug 1441371.
However the weird part is multiple machines, which are the ones that have some of the exceptions, are reporting to papertrail as T-W1064-MS-040:
Feb 26 15:59:54 T-W1064-MS-040.mdc1.mozilla.com generic-worker: Worker Id: T-W1064-MS-205
Feb 26 16:19:00 T-W1064-MS-040.mdc1.mozilla.com generic-worker: Worker Id: T-W1064-MS-120
Which does not make sense.
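One way to find every affected entry in a papertrail export is to compare the reporting hostname with the logged Worker Id. The following is a minimal sketch, assuming the log lines have exactly the shape of the two quoted above; the function name and regular expression are illustrative and not part of any existing tooling.

```python
import re

# Assumed line shape (taken from the two log lines quoted in this comment):
#   Feb 26 15:59:54 <host>.mdc1.mozilla.com generic-worker: Worker Id: <worker>
WORKER_ID_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>[^.\s]+)\S*\s+"
    r"generic-worker:\s+Worker Id:\s+(?P<worker>\S+)"
)


def mismatched_workers(lines):
    """Return (reporting_host, worker_id) pairs where the two names differ."""
    mismatches = []
    for line in lines:
        match = WORKER_ID_RE.match(line)
        if match is None:
            continue  # not a "Worker Id" line
        host, worker = match.group("host"), match.group("worker")
        if host.lower() != worker.lower():
            mismatches.append((host, worker))
    return mismatches
```

Run against the two lines above, this would flag both 205 and 120 as reporting under the hostname of 040.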
Part of the difficulty in troubleshooting is that I am unable to connect to the node to see what is happening because of the reboots. However, I am going to dive back into this tomorrow morning.
grenade, pmoore: I don't have any additional information yet, but do you all have any thoughts? Any ideas on the clean portion of this?
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Comment 6•7 years ago
I see two problems:
1) A lot of machines are running out of disk space
2) Workers that run out of disk space are then taking more jobs - so eating their way through large numbers of tasks
https://papertrailapp.com/groups/1141234/events?q=Not%20able%20to%20free%20up%20enough%20disk%20space&focus=904978881818288152
Mark, can you find out what is eating up all the disk space on the machines? Also, if the worker exits with a return code of 69, it shouldn't take more jobs, and probably an alert should be sent, the machine should be quarantined, etc.
The generic-worker --help command provides some information about exit codes at the end of the usage text:
Exit Codes:
    0    Tasks completed successfully; no more tasks to run (see config setting
         numberOfTasksToRun).
    67   A task user has been created, and the generic-worker needs to reboot in order
         to log on as the new task user. Note, the reboot happens automatically unless
         config setting disableReboots is set to true - in either code this exit code will
         be issued.
    68   The generic-worker hit its idle timeout limit (see config settings idleTimeoutSecs
         and shutdownMachineOnIdle).
    69   Worker panic - either a worker bug, or the environment is not suitable for running
         a task, e.g. a file cannot be written to the file system, or something else did
         not work that was required in order to execute a task. See config setting
         shutdownMachineOnInternalError.
    70   A new deploymentId has been issued in the AWS worker type configuration, meaning
         this worker environment is no longer up-to-date. Typcially workers should
         terminate.
    71   The worker was terminated via an interrupt signal (e.g. Ctrl-C pressed).
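A wrapper around the worker could use these exit codes to decide whether a machine should keep taking jobs. The sketch below is only an illustration of that idea: the table mirrors the usage text above, while the quarantine policy (quarantine on 69, or on any code not in the list) is an assumption and not something generic-worker itself implements.

```python
# Exit codes as listed in the generic-worker usage text quoted above.
EXIT_CODE_MEANING = {
    0: "tasks completed successfully",
    67: "reboot required to log on as the new task user",
    68: "idle timeout limit reached",
    69: "worker panic or unsuitable environment",
    70: "new deploymentId issued",
    71: "terminated via interrupt signal",
}


def should_quarantine(exit_code):
    """Hypothetical policy: stop taking jobs after a panic (69) or an unlisted code."""
    return exit_code == 69 or exit_code not in EXIT_CODE_MEANING


def describe(exit_code):
    """Return a short description for logging or alerting."""
    return EXIT_CODE_MEANING.get(exit_code, "unknown exit code")
```

Treating unlisted codes as suspicious is a deliberately conservative choice, so that a worker version with a different exit code scheme is not silently allowed to keep claiming tasks.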
Flags: needinfo?(pmoore) → needinfo?(mcornmesser)
Comment 7•7 years ago
I've created bug 1441482 about generic-worker not claiming a task if there isn't enough disk space (I think currently it must clear up disk space after claiming a task). This will mean a worker won't even try to claim a task if it determines it won't have enough disk space to execute the task, and that should leave it available to other workers that aren't so low on disk space.
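The idea of checking disk space before claiming can be sketched as follows. This is not generic-worker code: the threshold is the 10737418240 bytes (10 GiB) that the worker log in this bug reports as required, and `claim` is a hypothetical callable standing in for whatever actually claims a task from the queue.

```python
import shutil

# 10 GiB, the "require 10737418240 bytes" figure from the worker log in this bug.
REQUIRED_FREE_BYTES = 10 * 1024**3


def has_enough_disk_space(path, required=REQUIRED_FREE_BYTES):
    """True when the volume containing path has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


def maybe_claim_task(path, claim, required=REQUIRED_FREE_BYTES):
    """Call claim() only if there is enough free space; otherwise claim nothing.

    `claim` is a hypothetical zero-argument callable. Returning None leaves
    the task in the queue for a worker that has room for it.
    """
    if not has_enough_disk_space(path, required):
        return None
    return claim()
```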
Comment 8•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #7)
> I've created bug 1441482 about generic-worker not claiming a task if there
> isn't enough disk space (I think currently it must clear up disk space after
> claiming a task). This will mean a worker won't even try to claim a task if
> it determines it won't have enough disk space to execute the task, and that
> should leave it available to other workers that aren't so low on disk space.
Note, this does nothing to solve the underlying problem - it will just prevent the trees from going blue because of it. The lack of disk space is a real problem which will need to be investigated and solved in order to resolve this bug. I don't think I have access to the machines to be able to do that myself.
Comment 9•7 years ago
i believe we need to implement a fix to the task directory delete routine in generic worker mentioned in bug 1433854
Flags: needinfo?(rthijssen)
Comment 10•7 years ago
to be clear, on windows hardware there's nothing cleaning up task directories between tasks except the routine in gw which is broken. it deletes most things but always leaves a bit behind. in ec2 it's no big deal because instances don't live forever but on hw, eventually the drive will be full from undeleted task garbage from previous tasks. the task cleanup routine in gw needs to be patched so that it retries failed delete operations instead of ignoring them.
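for illustration, retrying a failed delete instead of ignoring it could look roughly like the sketch below. this is a generic python example, not the gw routine; the function name, attempt count and delay are all made up.

```python
import shutil
import time


def remove_tree_with_retries(path, attempts=5, delay_seconds=1.0):
    """Delete a directory tree, retrying when the delete fails.

    Returns True once the tree is gone, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            return True  # nothing (left) to delete
        except OSError:
            # e.g. a file still held open by a process that is shutting down
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return False
```

a False result is the case that should be reported rather than swallowed, since that is where leftovers start to pile up.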
Comment 11•7 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #10)
> to be clear, on windows hardware there's nothing cleaning up task
> directories between tasks except the routine in gw which is broken. it
> deletes most things but always leaves a bit behind. in ec2 it's no big deal
> because instances don't live forever but on hw, eventually the drive will be
> full from undeleted task garbage from previous tasks. the task cleanup
> routine in gw needs to be patched so that it retries failed delete
> operations instead of ignoring them.
This task demonstrates this is not the case:
https://tools.taskcluster.net/groups/FHUE2eUoTRC25VyqQpghIg/tasks/FHUE2eUoTRC25VyqQpghIg/runs/2/logs/public%2Flogs%2Flive.log
We'll need to analyze a worker to find out what is eating up the disk space - it isn't task directories, but perhaps tasks are leaving files in other places, or some log files are growing too large.
Reporter
Comment 12•7 years ago
this is blocking windows 10 reftests as well- try hasn't scheduled windows 10 hardware jobs in 24+ hours, and we have 4+ hours on autoland. All I know is on buildbot/IX we didn't have this issue. I know the OS config is different and there might have been other scripts running from the buildbot side to clean things up.
I do see two directories:
GenericWorker
task_1519734166
I assume we delete the task_1519734166 directory each time we start a new job. On logs from other win10 hardware runs [1], I see:
12:10:34 INFO - 'TEMP': 'C:\\Users\\GenericWorker\\AppData\\Local\\Temp',
12:10:34 INFO - 'TMP': 'C:\\Users\\GenericWorker\\AppData\\Local\\Temp',
in addition, I see we use a profile in the genericworker directory:
12:10:34 INFO - TEST-INFO | started process 8960 (C:\Users\task_1519732928\build\application\firefox\firefox -profile c:\users\genericworker\appdata\local\temp\tmptanube\profile http://localhost:49754/startup_test/tresize/addon/content/tresize-test.html)
I imagine after a hundred profiles or so we would run out of disk space. I sanity checked on VMs and we do the same thing, so this is configured to run as recommended on the AWS VM machines.
[1] https://taskcluster-artifacts.net/SqbPLxzTRFSP4v9sN6m4Tg/1/public/logs/live_backing.log
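To check whether leftover profiles really account for the missing space, the size of the Temp directory could be measured on an affected machine. A minimal sketch, assuming it is run locally against a path such as the TEMP value shown above; the function name is illustrative.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping unreadable ones."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # File vanished or is not accessible; ignore it for the estimate.
                pass
    return total
```

For example, `directory_size_bytes(r"C:\Users\GenericWorker\AppData\Local\Temp")` would give the number of bytes currently held under the shared temp directory.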
Comment 13•7 years ago
Trees closed for this together with bug 1441557.
Severity: normal → blocker
Summary: we are getting many talos jobs failing with exceptions and no log file on windows as of this morning → Trees closed: we are getting many talos jobs failing with exceptions and no log file on windows as of this morning
Assignee
Comment 14•7 years ago
Currently out of 20 machines I found 1 that was not in the boot loop. I have disabled the generic worker task. Once the current task finishes I will be able to dive in and find the problematic directories.
Additionally I am going to start a rolling reinstall so tasks can continue to be picked up.
Comment 15•7 years ago
I see in the logs:
"runTaskAsCurrentUser": true
This means there is no task user separation, so a task can potentially write files anywhere to the file system. There is no means for the worker to clean up files written outside the task directory when running as current user.
I suspect that the tests should work running generic worker 10 with task user isolation enabled - I would recommend generic-worker 10.5.1 as the current latest stable release:
https://github.com/taskcluster/generic-worker/releases
Without task isolation, it isn't possible to know where the task has written files on the file system, and clean them up.
Comment 16•7 years ago
(In reply to Mark Cornmesser [:markco] from comment #14)
> Additionally I am going to start a rolling reinstall so tasks can continue
> to be picked up.
++
Reporter
Comment 17•7 years ago
how can we test this on try, specifically:
* new worker
* new worker + task isolation
I would love to get things in a more ideal state and happy to help run tests and analyze results.
Assignee
Comment 18•7 years ago
Added to run-hw-generic-worker-and-reboot.bat:
+rem needed for the generic worker 8.* to keep disk space free https://bugzilla.mozilla.org/show_bug.cgi?id=1441208#c12
+del /s /s C:\Users\GenericWorker\AppData\Local\Temp
+rmdir /s /q C:\Users\GenericWorker\AppData\Local\Temp
However, I am going to continue the reinstall because of https://bugzilla.mozilla.org/show_bug.cgi?id=1438606. The new changes may not be picked up.
Flags: needinfo?(mcornmesser)
Assignee
Comment 19•7 years ago
This may also be related to recent changes with GitHub. The machine may not be getting certain packages:
C:\generic-worker>powershell Invoke-WebRequest -Uri https://github.com/taskcluster/generic-worker/releases/download/v8.3.0/generic-worker-windows-amd64.exe
Invoke-WebRequest : The request was aborted: Could not create SSL/TLS secure channel.
At line:1 char:1
+ Invoke-WebRequest -Uri https://github.com/taskcluster/generic-worker/ ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
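"Could not create SSL/TLS secure channel" is typically what a client reports when it cannot agree on a protocol version or cipher with the server. Assuming the cause here is a client offering only older TLS versions (the output above does not confirm this), the general remedy is to make the client require TLS 1.2 or newer. A sketch of that idea in Python, purely as an illustration and not the fix applied to these machines:

```python
import ssl


def tls12_client_context():
    """Build a client-side SSL context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

Such a context can be passed to `urllib.request.urlopen(url, context=...)` when downloading a release binary.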
Assignee
Comment 20•7 years ago
The fix is in place that will remove the local/temp directory before generic worker is started.
I have a temporary workaround in place to get around the TLS OCC/GitHub issue.
Machines should start returning to the pool within the hour and I will work on getting the full pool back online tonight.
Comment hidden (Intermittent Failures Robot) |
Assignee
Comment 22•7 years ago
It appears that much of the pool has picked up the change to remove local\temp contents. So the initial issue should be dissipating. I will open bugs for the other issue(s).
Assignee
Comment 23•7 years ago
From what I can tell from papertrail and IRC scrollback, 3 machines had not recovered: 109, 118, and 119. I have kicked off installs on 109 and 118. 119 has been removed from production due to hardware errors (bug 1439774).
Comment 24•7 years ago
Yesterday around ~9pm UTC the backlog had reduced from 2400 to 1800 pending jobs. Sheriffs landed a backout and a patch and reopened trees because there were no pending or running jobs on the integration trees (autoland and inbound).
But the issue still affects the trees: there are many exceptions, and because of them talos jobs are still pending or running on pushes from 8 hours ago.
Trees will get closed again for that very soon.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Comment 25•7 years ago
i'm working on gaining access to the affected hardware now...
Flags: needinfo?(rthijssen)
Comment 26•7 years ago
Trees have been closed again for this.
Comment 27•7 years ago
I'm not sure this is the same issue - at least, the last instance I see of this in papertrail logs is > 5 hours ago:
Feb 28 06:27:40 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:27:38 Not able to free up enough disk space - require 10737418240 bytes, but only have 4851302400 bytes - and nothing left to delete.
Feb 28 06:28:34 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:28:31 Not able to free up enough disk space - require 10737418240 bytes, but only have 7994687488 bytes - and nothing left to delete.
Feb 28 06:29:07 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:29:05 Not able to free up enough disk space - require 10737418240 bytes, but only have 4850147328 bytes - and nothing left to delete.
Feb 28 06:30:25 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:30:23 Not able to free up enough disk space - require 10737418240 bytes, but only have 4841652224 bytes - and nothing left to delete.
Feb 28 06:31:11 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:29:46 Not able to free up enough disk space - require 10737418240 bytes, but only have 7993532416 bytes - and nothing left to delete.
Feb 28 06:31:22 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:31:19 Not able to free up enough disk space - require 10737418240 bytes, but only have 8000544768 bytes - and nothing left to delete.
Feb 28 06:31:29 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:27:19 Not able to free up enough disk space - require 10737418240 bytes, but only have 9140207616 bytes - and nothing left to delete.
Feb 28 06:31:29 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:29:28 Not able to free up enough disk space - require 10737418240 bytes, but only have 9132785664 bytes - and nothing left to delete.
Feb 28 06:31:39 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:31:36 Not able to free up enough disk space - require 10737418240 bytes, but only have 4843982848 bytes - and nothing left to delete.
Feb 28 06:31:40 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:31:37 Not able to free up enough disk space - require 10737418240 bytes, but only have 9136242688 bytes - and nothing left to delete.
Feb 28 06:33:08 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:33:06 Not able to free up enough disk space - require 10737418240 bytes, but only have 4837761024 bytes - and nothing left to delete.
Feb 28 06:33:50 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:33:48 Not able to free up enough disk space - require 10737418240 bytes, but only have 9130254336 bytes - and nothing left to delete.
Feb 28 06:36:00 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:32:51 Not able to free up enough disk space - require 10737418240 bytes, but only have 7988305920 bytes - and nothing left to delete.
Feb 28 06:36:00 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:34:40 Not able to free up enough disk space - require 10737418240 bytes, but only have 7992299520 bytes - and nothing left to delete.
Feb 28 06:36:01 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:35:58 Not able to free up enough disk space - require 10737418240 bytes, but only have 9123897344 bytes - and nothing left to delete.
Feb 28 06:36:10 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:36:08 Not able to free up enough disk space - require 10737418240 bytes, but only have 7986253824 bytes - and nothing left to delete.
Feb 28 06:42:18 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:38:06 Not able to free up enough disk space - require 10737418240 bytes, but only have 9129041920 bytes - and nothing left to delete.
Feb 28 06:42:18 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:40:16 Not able to free up enough disk space - require 10737418240 bytes, but only have 9124790272 bytes - and nothing left to delete.
Feb 28 06:42:29 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:42:28 Not able to free up enough disk space - require 10737418240 bytes, but only have 9121226752 bytes - and nothing left to delete.
Feb 28 06:42:29 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:37:21 Not able to free up enough disk space - require 10737418240 bytes, but only have 7981592576 bytes - and nothing left to delete.
Feb 28 06:42:29 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:38:41 Not able to free up enough disk space - require 10737418240 bytes, but only have 7990796288 bytes - and nothing left to delete.
Feb 28 06:42:30 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:39:55 Not able to free up enough disk space - require 10737418240 bytes, but only have 7982653440 bytes - and nothing left to delete.
Feb 28 06:42:30 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:41:14 Not able to free up enough disk space - require 10737418240 bytes, but only have 7991902208 bytes - and nothing left to delete.
Feb 28 06:42:40 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:42:38 Not able to free up enough disk space - require 10737418240 bytes, but only have 7972843520 bytes - and nothing left to delete.
Feb 28 06:43:24 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:43:23 Not able to free up enough disk space - require 10737418240 bytes, but only have 4846284800 bytes - and nothing left to delete.
Feb 28 06:44:40 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:44:38 Not able to free up enough disk space - require 10737418240 bytes, but only have 4839202816 bytes - and nothing left to delete.
Feb 28 06:45:59 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:45:57 Not able to free up enough disk space - require 10737418240 bytes, but only have 4834242560 bytes - and nothing left to delete.
Feb 28 06:47:17 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:47:16 Not able to free up enough disk space - require 10737418240 bytes, but only have 4834983936 bytes - and nothing left to delete.
Feb 28 06:48:41 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:48:39 Not able to free up enough disk space - require 10737418240 bytes, but only have 4838375424 bytes - and nothing left to delete.
Feb 28 06:49:32 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:44:03 Not able to free up enough disk space - require 10737418240 bytes, but only have 7978160128 bytes - and nothing left to delete.
Feb 28 06:49:32 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:45:42 Not able to free up enough disk space - require 10737418240 bytes, but only have 7974117376 bytes - and nothing left to delete.
Feb 28 06:49:33 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:47:06 Not able to free up enough disk space - require 10737418240 bytes, but only have 7971471360 bytes - and nothing left to delete.
Feb 28 06:49:33 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:48:31 Not able to free up enough disk space - require 10737418240 bytes, but only have 7970897920 bytes - and nothing left to delete.
Feb 28 06:49:43 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:49:41 Not able to free up enough disk space - require 10737418240 bytes, but only have 7967232000 bytes - and nothing left to delete.
Feb 28 06:50:05 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:50:04 Not able to free up enough disk space - require 10737418240 bytes, but only have 4831232000 bytes - and nothing left to delete.
Feb 28 06:51:25 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:51:23 Not able to free up enough disk space - require 10737418240 bytes, but only have 4830515200 bytes - and nothing left to delete.
Feb 28 06:52:19 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:51:04 Not able to free up enough disk space - require 10737418240 bytes, but only have 7969775616 bytes - and nothing left to delete.
Feb 28 06:52:30 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:52:27 Not able to free up enough disk space - require 10737418240 bytes, but only have 7963586560 bytes - and nothing left to delete.
Feb 28 06:52:38 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:52:36 Not able to free up enough disk space - require 10737418240 bytes, but only have 4828897280 bytes - and nothing left to delete.
Feb 28 06:53:51 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:53:50 Not able to free up enough disk space - require 10737418240 bytes, but only have 4829233152 bytes - and nothing left to delete.
Feb 28 06:55:12 T-W1064-MS-109.mdc1.mozilla.com generic-worker: 2018/02/28 05:55:10 Not able to free up enough disk space - require 10737418240 bytes, but only have 4826759168 bytes - and nothing left to delete.
Feb 28 06:55:17 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:44:36 Not able to free up enough disk space - require 10737418240 bytes, but only have 9118052352 bytes - and nothing left to delete.
Feb 28 06:55:17 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:46:48 Not able to free up enough disk space - require 10737418240 bytes, but only have 9112969216 bytes - and nothing left to delete.
Feb 28 06:55:17 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:48:57 Not able to free up enough disk space - require 10737418240 bytes, but only have 9118175232 bytes - and nothing left to delete.
Feb 28 06:55:17 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:51:06 Not able to free up enough disk space - require 10737418240 bytes, but only have 9106341888 bytes - and nothing left to delete.
Feb 28 06:55:18 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:53:15 Not able to free up enough disk space - require 10737418240 bytes, but only have 9108832256 bytes - and nothing left to delete.
Feb 28 06:55:28 T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 05:55:24 Not able to free up enough disk space - require 10737418240 bytes, but only have 9101242368 bytes - and nothing left to delete.
Feb 28 06:57:41 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:53:37 Not able to free up enough disk space - require 10737418240 bytes, but only have 7963439104 bytes - and nothing left to delete.
Feb 28 06:57:41 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:55:04 Not able to free up enough disk space - require 10737418240 bytes, but only have 7955894272 bytes - and nothing left to delete.
Feb 28 06:57:41 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:56:24 Not able to free up enough disk space - require 10737418240 bytes, but only have 7959216128 bytes - and nothing left to delete.
Feb 28 06:57:52 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:57:49 Not able to free up enough disk space - require 10737418240 bytes, but only have 7956316160 bytes - and nothing left to delete.
Feb 28 07:00:20 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 05:59:14 Not able to free up enough disk space - require 10737418240 bytes, but only have 7950729216 bytes - and nothing left to delete.
Feb 28 07:00:31 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 06:00:29 Not able to free up enough disk space - require 10737418240 bytes, but only have 7948390400 bytes - and nothing left to delete.
Feb 28 07:04:09 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 06:02:01 Not able to free up enough disk space - require 10737418240 bytes, but only have 7944708096 bytes - and nothing left to delete.
Feb 28 07:04:20 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 06:04:08 Not able to free up enough disk space - require 10737418240 bytes, but only have 7946424320 bytes - and nothing left to delete.
Feb 28 07:07:25 T-W1064-MS-119.mdc1.mozilla.com generic-worker: 2018/02/28 06:07:13 Not able to free up enough disk space - require 10737418240 bytes, but only have 7944331264 bytes - and nothing left to delete.
Could the latest problem be a different issue?
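A log dump like the one above is easier to read when reduced to one figure per host. The sketch below assumes the exact message format shown ("require N bytes, but only have M bytes") and reports the largest shortfall seen for each machine; the names are illustrative.

```python
import re

# Assumed line shape, taken from the papertrail lines quoted above.
DISK_SPACE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>[^.\s]+)\S*\s+generic-worker:.*"
    r"require (?P<required>\d+) bytes, but only have (?P<available>\d+) bytes"
)


def worst_shortfall_by_host(lines):
    """Map each host to the largest (required - available) value seen, in bytes."""
    worst = {}
    for line in lines:
        match = DISK_SPACE_RE.search(line)
        if match is None:
            continue
        shortfall = int(match.group("required")) - int(match.group("available"))
        host = match.group("host")
        worst[host] = max(worst.get(host, 0), shortfall)
    return worst
```

For the first line above, 109 requires 10737418240 bytes and has 4851302400, a shortfall of 5886115840 bytes (about 5.5 GiB).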
Flags: needinfo?(pmoore)
Comment 28•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Also, if the worker gives has a return code of 69, it shouldn't take more
> jobs, and probably an alert should be sent, the machine should be
> quarantined, etc.
Since this is generic-worker 8, not generic-worker 10 - the return code will be 1 and not 69.
@Mark, Rob - can we disable the worker if it exits with return code 1? This should help prevent a bad worker from eating its way through healthy tasks, resolving them all as exception.
(In reply to Pete Moore [:pmoore][:pete] from comment #27)
> I'm not sure this is the same issue - at least, the last instance I see of this in papertrail logs is > 5 hours ago:
...
...
> Could the latest problem be a different issue?
It looks like it is the same issue, but that the workers are not logging to papertrail. Looking at e.g. https://tools.taskcluster.net/groups/S89qPNvKSoys_YqTzDYDNw/tasks/KJuhozciRQGyVMR-9t9k1A I see 6 task runs (0,1,2,3,4,5), but when searching in papertrail, I see only worker logs for claims 1 and 2:
> T-W1064-MS-118.mdc1.mozilla.com generic-worker: 2018/02/28 02:34:11 Running task https://tools.taskcluster.net/task-inspector/#KJuhozciRQGyVMR-9t9k1A/1
> T-W1064-MS-113.mdc1.mozilla.com generic-worker: 2018/02/28 03:47:29 Running task https://tools.taskcluster.net/task-inspector/#KJuhozciRQGyVMR-9t9k1A/2
Since the logs seem to not be present for task runs 0, 3, 4, 5, I wonder if some workers aren't logging to papertrail?
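The same comparison can be scripted for other tasks: take the run numbers the queue knows about and subtract the runs that appear in "Running task" log lines. A minimal sketch, assuming the log line format quoted above; the function name is illustrative.

```python
import re

# Matches e.g. "Running task https://tools.taskcluster.net/task-inspector/#<taskId>/<run>"
RUNNING_TASK_RE = re.compile(r"Running task \S*#(?P<task_id>[\w-]+)/(?P<run>\d+)")


def runs_missing_from_logs(task_id, expected_runs, log_lines):
    """Return the sorted run numbers of task_id that never appear in log_lines."""
    seen = set()
    for line in log_lines:
        match = RUNNING_TASK_RE.search(line)
        if match and match.group("task_id") == task_id:
            seen.add(int(match.group("run")))
    return sorted(set(expected_runs) - seen)
```

With the two papertrail lines above and expected runs 0 to 5, this returns runs 0, 3, 4 and 5 as having no worker log.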
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Comment 29•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #28)
> @Mark, Rob - can we disable the worker if it exits with return code 1? This
> should help to avoid that a bad worker eats its way through healthy tasks,
> resolving them all as exception.
something like this Pete?
https://github.com/mozilla-releng/OpenCloudConfig/pull/122
Flags: needinfo?(rthijssen)
Comment 30•7 years ago
That looks good - but see my comment in the PR.
Comment 31•7 years ago
Is it understood why the change described in comment 18 did not fix the issue, or why the papertrail logs are missing (comment 28)?
If this could take a while to resolve, should we move these tasks back to buildbot to get the trees reopened?
Flags: needinfo?(klibby)
Flags: needinfo?(jmaher)
Flags: needinfo?(coop)
Assignee
Comment 32•7 years ago
Yes, it looks like not all info is going to papertrail, but the nxlog config looks right:
<Input generic_worker_log>
    Module       im_file
    File         'C:\generic-worker\generic-worker.log'
    SavePos      TRUE
    ReadFromLast TRUE
    InputType    LineBased
</Input>
Starting there to try to get more info.
Flags: needinfo?(mcornmesser)
Reporter
Comment 33•7 years ago
I am looking at backing out the patches, the reftests are not working on my initial try push (ix machines were <40 minutes/job, I am at 60 minutes and waiting)- but the talos tests appear to be working (although 'g2' is taking a long time).
It might take a few more try pushes to get to the bottom of this- stay tuned.
Reporter
Comment 34•7 years ago
ok, things run and look green- I am happy to push my patches:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=fbfdf86fd7576d8c41019180baedea997324e04a
Do we think we will have this fixed in the next hour or two? if not, I would like to get the trees open.
Flags: needinfo?(jmaher)
Comment 35•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #31)
> Is it understood why the change described in comment 18 did not fix the
> issue, or why the papertrail logs are missing (comment 28)?
No, I think partly because the systems are getting into a reboot loop when g-w has issues. Rob's got a patch to fix that, which should allow Mark to get onto one of the systems with disk space issues to figure out where it's all being used. We're still working with netops to try and fix the logging issue (bug 1441544); there's yet another piece where livelog.exe isn't being (re?)installed, because it hadn't been on tooltool, which caused OCC(?) to revert to pulling it from Github which fails due to TLS issues. Mark's uploading that to tooltool, so machines that were re-installed should be able to come back to life.
I think those are all of the various pieces I know of, offhand.
Flags: needinfo?(klibby)
Assignee
Comment 36•7 years ago
I am still working on this, but need to take off for a bit. The github tls issue has been circumvented with tooltool.
Overall, between disk space, the GitHub TLS mismatch, and other issues such as hardware machines receiving files needed for AWS, there are many machines in an odd state. I am going to work on getting more logging to papertrail tonight, and hopefully we will have a better picture once that happens.
Comment hidden (Intermittent Failures Robot) |
Comment 38•7 years ago
Removing blocker status, but still important to get this fixed.
Assignee: nobody → mcornmesser
Severity: blocker → critical
Flags: needinfo?(coop)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee
Comment 41•7 years ago
I think the cause of this was a graphics card driver update being installed and then knocking the node offline. There is now a registry setting being managed by OCC that will prevent this update.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Updated•7 years ago
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Updated•7 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard