Closed Bug 839375 Opened 11 years ago Closed 11 years ago

foopy24 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Callek, Unassigned)

References

Details

So, foopy24 currently has on it:

bash-3.2$ ls -d tegra-*
tegra-191       tegra-201       tegra-277       tegra-279       tegra-281       tegra-283       tegra-285       tegra-192       tegra-276       tegra-278       tegra-280       tegra-282       tegra-284

and was having issues earlier, (timeouts in steps doing foopy disk utilization)

(rm -rf ...; unzip -oq ...)

At fear of a larger issue I stopped clientproxy on all devices on this foopy after the poke from philor.

I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`) right now, because there was a LOT of these old copies of files around here, but disk capacity is awfully low usage:
bash-3.2$ df -h
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk0s2   465Gi   42Gi  423Gi    10%    /
devfs          183Ki  183Ki    0Bi   100%    /dev
map -hosts       0Bi    0Bi    0Bi   100%    /net
map auto_home    0Bi    0Bi    0Bi   100%    /home

Also ganglia doesn't show any obvious pattern of issue: http://ganglia3.build.mtv1.mozilla.com/ganglia/?r=4hr&cs=&ce=&c=RelEngMTV1&h=foopy24.build.mtv1.mozilla.com&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS

(I'll note however, that ganglia is not showing stats for wio at all on this foopy, which might just be a factor of it being a mac foopy, or something in the config)

Rebooting this foopy in moments, before I bring all the devices back up, having seen no obvious problems, this bug is to help track incase of future issues.

================
[Times ET]
[21:38:13]	philor	huh, what would produce a sudden rash of timeouts in tegra 'rm -rf build' steps?
[21:43:54]	Callek|Buildduty	philor: tegra rm -rf build !? odd
[21:47:14]	philor	Callek|Buildduty: tegra-191, tegra-279, tegra-201, tegra-278, tegra-282, https://tbpl.mozilla.org/php/getParsedLog.php?id=19553407&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=19553415&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound
[21:47:31]	philor	is that maybe all one foopy?
[21:48:09]	Callek|Buildduty	about to find out
[21:48:10]	Callek|Buildduty	
[21:48:58]	Callek|Buildduty	philor: yes, *ALL* one foopy
[21:49:13]	Callek|Buildduty	philor: sounds like a good case of "stop them all then investigate"
[21:49:14]	Callek|Buildduty	on it
[21:49:16]	=-=	philor is now known as philor|away
[21:49:59]	Callek|Buildduty	philor|away: I'm killing all on that one foopy
[23:05:52]	Callek|Buildduty	philor: sooo, I'm not seeing the rm -rf failure -- I do see hwoever in all those logs timeouts when unzipping tests
[23:07:54]	philor	Callek|Buildduty: ah, five failures that all appeared at once, the last two were unzipping, the first three were rm -rf, and I didn't notice the last two weren't the same
[23:08:28]	Callek|Buildduty	ahh ok
[23:08:29]	philor	https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=19553410&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=19553411&tree=Mozilla-Inbound
(In reply to Justin Wood (:Callek) from comment #0)
> I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`)
> right now, because there was a LOT of these old copies of files around here,
> but disk capacity is awfully low usage:

Ok, having thought about this, it is likely the problem, heres why:

Foopies have a high demand on disk i/o as it is, we need a lot of read/writes for the actual jobs.
twistd, will do an enumeration+rename of *every* twistd.log* file when it needs to rotate, and expand on that, indefinitely. twistd.log.2 --> twistd.log.3 , twistd.log.1 --> twistd.log.2 [c.f. http://mxr.mozilla.org/build/source/twisted/twisted/python/logfile.py#177 ]
clientproxy also uses log rotation, of a similar logic with RotatingFileHandler: http://docs.python.org/release/2.4.4/lib/node348.html

This combined can yeild a horrid I/O throughput if the conditions line up right. I'll be going through foopies tomorrow to clean up old logs, due to this issue.
Summary: foopy24 problem tracking → Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps.
Hrm maybe not... philor just reported more issues here... re-stopping all devices on this foopy, and will enroll it in diagnostics (stuck rm is not a good sign, imo) -- also whole thing is acting VERY slow/problematic when I am sshing in right now.


Processes: 97 total, 3 running, 10 stuck, 84 sleeping, 281 threads                                      22:38:35
Load Avg: 0.15, 0.15, 0.16  CPU usage: 0.72% user, 2.42% sys, 96.85% idle
SharedLibs: 12M resident, 4376K data, 0B linkedit.
MemRegions: 10162 total, 648M resident, 19M private, 192M shared.
PhysMem: 1303M wired, 606M active, 425M inactive, 2333M used, 5856M free.
VM: 231G vsize, 1148M framework vsize, 17489(0) pageins, 0(0) pageouts.
Networks: packets: 2098605/2001M in, 880390/337M out. Disks: 293904/3555M read, 736228/14G written.

PID   COMMAND      %CPU TIME     #TH  #WQ  #POR #MRE RPRVT  RSHRD  RSIZE  VPRVT  VSIZE  PGRP PPID STATE    UID
2909  rm           0.5  00:00.01 1    0    17   22+  104K-  224K+  380K+  17M-   2378M  646  647  stuck    501
2908  rm           0.7  00:00.01 1    0    17   21   108K-  224K+  380K+  17M-   2378M  667  668  stuck    501
2907  rm           0.6  00:00.01 1    0    17   21   112K+  220K-  380K+  17M+   2378M  634  635  stuck    501
2894  top          2.9  00:00.34 1/1  0    32   38   1040K  216K   1752K  19M    2379M  2894 2892 running  0
2892  bash         0.0  00:00.01 1    0    21   23   272K   840K   960K   17M    2378M  2892 2891 sleeping 501
2891  sshd         0.0  00:00.00 1    0    13   86   252K   1628K  816K   1504K  2413M  2889 2889 sleeping 501
2889  sshd         0.0  00:00.12 2    1    37   81   480K   1628K  2956K  712K   2412M  2889 1    sleeping 0
2887  unzip        1.0  00:00.58 1    0    18   23   228K   328K   588K   17M    2378M  494  495  stuck    501
2886  rm           0.2  00:00.43 1    0    17   21   128K   220K   396K   17M    2378M  488  489  stuck    501
2884  wget         0.2  00:00.48 2    1    40   67   652K   480K   1708K  40M    2404M  360  361  stuck    501
2781  Python       0.5  00:10.90 2    1    33   209  21M    1920K  42M    104M   2478M  548  549  stuck    501
2686  unzip        0.4  00:04.96 1    0    17   23   228K   324K   576K   17M    2378M  362  363  stuck    501
2557  distnoted    0.0  00:00.01 2    1    41   54   548K   240K   1184K  42M    2411M  2557 2540 sleeping 501
2540  launchd      0.0  00:00.09 2    0    52   48   412K   412K   844K   48M    2409M  2540 1    sleeping 501
1023  taskgated    0.0  00:00.36 1    0    30   43   720K   364K   2080K  46M    2406M  1023 1    sleeping 0
668   Python       0.0  00:02.48 2    0    11   181  18M    3568K- 13M    52M    2425M  667  1    sleeping 501
647   Python       0.0  00:02.58 2    0    11   180  18M    3568K- 13M    51M    2424M  646  1    sleeping 501
635   Python       0.0  00:02.61 2    0    11   181  18M    3568K  13M    52M    2425M  634  1    sleeping 501
604   Python       0.0  00:00.59 1    0    9    135  1288K  10M    2812K  2676K  2390M  601  602  sleeping 501
602   Python       0.0  00:00.19 2    0    10   143  1724K  10M    3520K  30M    2417M  601  1    sleeping 501
566   Python       0.0  00:02.47 2    0    11   181  16M    3568K  13M    52M    2425M  565  1    sleeping 501
551   Python       0.0  00:02.54 2    0    11   181  16M    3568K  13M    52M    2425M  550  1    sleeping 501
549   Python       0.0  00:02.68 2    0    11   181  17M    3568K  13M    52M    2425M  548  1    sleeping 501
Depends on: 839407
Blocks: tegra-191
Blocks: tegra-192
Blocks: tegra-201
Blocks: tegra-276
Blocks: tegra-277
Blocks: tegra-278
Blocks: tegra-279
Blocks: tegra-280
Blocks: tegra-281
Blocks: tegra-282
Blocks: tegra-283
Blocks: tegra-284
Blocks: tegra-285
changing summary to reflect that its about foopy24 in the end, since it was still having issues after this, and no other foopies exhibited issues
Summary: Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps. → foopy24 problem tracking
Blocks: 843918
Blocks: 851562
Depends on: 851579
No longer depends on: 851579
So, I just ignorantly  started all of these tegras up on foopy127 because I was trying to work my way through the buildduty queue. They're in production now, is this going to break things?
They seem to be green....
resolving as "incomplete" since we never actually fixed the issues on foopy24, but we now no longer have foopy24 for our devices. So I'm not expecting it to resurface on the linux foopies.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.