Closed Bug 985399 Opened 10 years ago Closed 10 years ago

buildbot-master54's buildbot process got OOM'ed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

I noticed this going off in Nagios. I looked at the log, and it ended abruptly:
[cltbld@buildbot-master54.srv.releng.usw2.mozilla.com tests1-linux]$ tail master/twistd.log
2014-03-19 04:44:44-0700 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['bash', '-c', 'for file in `ls -1`; do cat $file; done']'>)
2014-03-19 04:44:44-0700 [-]   cmd.args = {'workdir': 'properties', 'timeout': 1200, 'env': None, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'logEnviron': True, 'want_stderr': 1, 'logfiles': {}}
2014-03-19 04:44:44-0700 [-] <RemoteShellCommand '['bash', '-c', 'for file in `ls -1`; do cat $file; done']'>: RemoteCommand.run [1376180]
2014-03-19 04:44:44-0700 [-] command '['bash', '-c', 'for file in `ls -1`; do cat $file; done']' in dir 'properties'
2014-03-19 04:44:44-0700 [-] LoggedRemoteCommand.start
2014-03-19 04:44:44-0700 [-] nextAWSSlave: 0 retries for Ubuntu VM 12.04 mozilla-central pgo test cppunit
2014-03-19 04:44:44-0700 [-] nextAWSSlave: Choosing spot since there aren't any retries
2014-03-19 04:44:44-0700 [-] Claimed buildrequestids: [38336093L]
2014-03-19 04:44:44-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-central pgo test cppunit'' at 3508324328>: got assignments: {<SlaveBuilder builder='Ubuntu VM 12.04 mozilla-central pgo test cppunit' slave='tst-linux32-spot-437'>: [<buildbot.buildrequest.BuildRequest instance at 0x7c0d25f0>]}
2014-03-19 04:44:56-0700 [Broker,122567,10.132.57.97] <RemoteShellCommand '['/tools/buildbot/bin/python', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/linux_unittest.py', '--mochitest-suite', 'plain1', '--blob-upload-branch', 'try', '--download-symbols', 'ondemand']'> rc=0

Found this in syslog:
Mar 19 04:46:29 buildbot-master54 kernel: buildbot invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Mar 19 04:46:29 buildbot-master54 kernel: buildbot cpuset=/ mems_allowed=0
Mar 19 04:46:29 buildbot-master54 kernel: Pid: 17305, comm: buildbot Not tainted 2.6.32-220.el6.x86_64 #1
Mar 19 04:46:29 buildbot-master54 kernel: Call Trace:
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff810c2cb1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81113a30>] ? dump_header+0x90/0x1b0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81113eba>] ? oom_kill_process+0x8a/0x2c0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81113df1>] ? select_bad_process+0xe1/0x120
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81114310>] ? out_of_memory+0x220/0x3c0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff8112402e>] ? __alloc_pages_nodemask+0x89e/0x940
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81158c7a>] ? alloc_pages_vma+0x9a/0x150
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff8113beeb>] ? handle_pte_fault+0x76b/0xb50
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81006e2b>] ? xen_set_pmd_hyper+0x8b/0xc0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81006ecb>] ? xen_set_pmd+0x6b/0xb0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff8113c4b4>] ? handle_mm_fault+0x1e4/0x2b0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81042b39>] ? __do_page_fault+0x139/0x480
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff8114068a>] ? vma_merge+0x29a/0x3e0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81140a9b>] ? __vm_enough_memory+0x3b/0x190
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff81141a5c>] ? do_brk+0x26c/0x350
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff8100ba9d>] ? retint_restore_args+0x5/0x6
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff814f246e>] ? do_page_fault+0x3e/0xa0
Mar 19 04:46:29 buildbot-master54 kernel: [<ffffffff814ef825>] ? page_fault+0x25/0x30
Mar 19 04:46:29 buildbot-master54 kernel: Mem-Info:
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA per-cpu:
Mar 19 04:46:29 buildbot-master54 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Mar 19 04:46:29 buildbot-master54 kernel: CPU    1: hi:    0, btch:   1 usd:   0
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA32 per-cpu:
Mar 19 04:46:29 buildbot-master54 kernel: CPU    0: hi:  186, btch:  31 usd:  31
Mar 19 04:46:29 buildbot-master54 kernel: CPU    1: hi:  186, btch:  31 usd:   8
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 Normal per-cpu:
Mar 19 04:46:29 buildbot-master54 kernel: CPU    0: hi:  186, btch:  31 usd:  56
Mar 19 04:46:29 buildbot-master54 kernel: CPU    1: hi:  186, btch:  31 usd:  11
Mar 19 04:46:29 buildbot-master54 kernel: active_anon:1610693 inactive_anon:249289 isolated_anon:0
Mar 19 04:46:29 buildbot-master54 kernel: active_file:72 inactive_file:52 isolated_file:0
Mar 19 04:46:29 buildbot-master54 kernel: unevictable:0 dirty:0 writeback:0 unstable:0
Mar 19 04:46:29 buildbot-master54 kernel: free:8188 slab_reclaimable:3106 slab_unreclaimable:7669
Mar 19 04:46:29 buildbot-master54 kernel: mapped:170 shmem:163 pagetables:4841 bounce:0
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA free:7752kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7748kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 19 04:46:29 buildbot-master54 kernel: lowmem_reserve[]: 0 4024 7559 7559
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA32 free:19968kB min:5920kB low:7400kB high:8880kB active_anon:3493324kB inactive_anon:407256kB active_file:288kB inactive_file:208kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:0kB mapped:664kB shmem:512kB slab_reclaimable:4580kB slab_unreclaimable:11132kB kernel_stack:328kB pagetables:10608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:768 all_unreclaimable? yes
Mar 19 04:46:29 buildbot-master54 kernel: lowmem_reserve[]: 0 0 3535 3535
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 Normal free:5032kB min:5200kB low:6500kB high:7800kB active_anon:2949448kB inactive_anon:589900kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3619840kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:140kB slab_reclaimable:7844kB slab_unreclaimable:19544kB kernel_stack:672kB pagetables:8756kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 19 04:46:29 buildbot-master54 kernel: lowmem_reserve[]: 0 0 0 0
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA: 2*4kB 0*8kB 2*16kB 1*32kB 0*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7752kB
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 DMA32: 3460*4kB 2*8kB 2*16kB 2*32kB 2*64kB 2*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 19968kB
Mar 19 04:46:29 buildbot-master54 kernel: Node 0 Normal: 238*4kB 2*8kB 2*16kB 2*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 5032kB
Mar 19 04:46:29 buildbot-master54 kernel: 298 total pagecache pages
Mar 19 04:46:29 buildbot-master54 kernel: 0 pages in swap cache
Mar 19 04:46:29 buildbot-master54 kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 19 04:46:29 buildbot-master54 kernel: Free swap  = 0kB
Mar 19 04:46:29 buildbot-master54 kernel: Total swap = 0kB
Mar 19 04:46:29 buildbot-master54 kernel: 1966079 pages RAM
Mar 19 04:46:29 buildbot-master54 kernel: 54513 pages reserved
Mar 19 04:46:29 buildbot-master54 kernel: 1324 pages shared
Mar 19 04:46:29 buildbot-master54 kernel: 1899816 pages non-shared
Mar 19 04:46:29 buildbot-master54 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Mar 19 04:46:29 buildbot-master54 kernel: [  321]     0   321     2712      135   1     -17         -1000 udevd
Mar 19 04:46:29 buildbot-master54 kernel: [  786]     0   786     2293      125   0       0             0 dhclient
Mar 19 04:46:29 buildbot-master54 kernel: [  824]     0   824    62187      144   0       0             0 rsyslogd
Mar 19 04:46:29 buildbot-master54 kernel: [  836]     0   836     2301       40   1       0             0 irqbalance
Mar 19 04:46:29 buildbot-master54 kernel: [  850]    32   850     4756       58   0       0             0 rpcbind
Mar 19 04:46:29 buildbot-master54 kernel: [  867]    81   867     5392       92   0       0             0 dbus-daemon
Mar 19 04:46:29 buildbot-master54 kernel: [  891]    68   891     6216      128   0       0             0 hald
Mar 19 04:46:29 buildbot-master54 kernel: [  892]     0   892     4539       48   0       0             0 hald-runner
Mar 19 04:46:29 buildbot-master54 kernel: [ 1144]     0  1144     2304       56   0       0             0 abrt-dump-oops
Mar 19 04:46:29 buildbot-master54 kernel: [ 1153]     0  1153     5500       48   1       0             0 abrtd
Mar 19 04:46:29 buildbot-master54 kernel: [ 1161]     0  1161     5110      162   0       0             0 crond
Mar 19 04:46:29 buildbot-master54 kernel: [ 1172]     0  1172     1029       22   1       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1174]     0  1174     1029       21   1       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1176]     0  1176     1029       22   1       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1178]     0  1178     1029       21   1       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1180]     0  1180     1032       23   1       0             0 agetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1181]     0  1181     1029       22   0       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 1188]     0  1188     1029       22   0       0             0 mingetty
Mar 19 04:46:29 buildbot-master54 kernel: [ 6673]   500  6673     2322       52   0       0             0 run_command_run
Mar 19 04:46:29 buildbot-master54 kernel: [ 6674]   500  6674    24815     2287   0       0             0 python
Mar 19 04:46:29 buildbot-master54 kernel: [19032]     0 19032    19669      226   0       0             0 master
Mar 19 04:46:29 buildbot-master54 kernel: [19036]    89 19036    19732      255   1       0             0 qmgr
Mar 19 04:46:29 buildbot-master54 kernel: [ 5168]     0  5168    16017      183   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [ 8954]   498  8954    10248      109   1       0             0 nrpe
Mar 19 04:46:29 buildbot-master54 kernel: [17305]   500 17305  2121880  1839596   0       0             0 buildbot
Mar 19 04:46:29 buildbot-master54 kernel: [22419]   500 22419     2322       53   1       0             0 run_pulse_publi
Mar 19 04:46:29 buildbot-master54 kernel: [22420]   500 22420    30453     4360   0       0             0 python
Mar 19 04:46:29 buildbot-master54 kernel: [ 1699]     0  1699     1543       23   1       0             0 collectdmon
Mar 19 04:46:29 buildbot-master54 kernel: [ 1701]     0  1701   201048      195   1       0             0 collectd
Mar 19 04:46:29 buildbot-master54 kernel: [18042]    38 18042     7552      131   0       0             0 ntpd
Mar 19 04:46:29 buildbot-master54 kernel: [ 7233]    89  7233    19689      219   0       0             0 pickup
Mar 19 04:46:29 buildbot-master54 kernel: [11546]     0 11546    17175      210   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11548]   500 11548    17246      323   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11549]   500 11549    27098      108   0       0             0 bash
Mar 19 04:46:29 buildbot-master54 kernel: [11612]     0 11612    17175      210   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11619]   500 11619    17471      499   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11620]   500 11620    27098      106   0       0             0 bash
Mar 19 04:46:29 buildbot-master54 kernel: [11655]   500 11655    29687       63   1       0             0 screen
Mar 19 04:46:29 buildbot-master54 kernel: [11656]   500 11656    31666     2077   0       0             0 screen
Mar 19 04:46:29 buildbot-master54 kernel: [11657]   500 11657    27098      121   1       0             0 bash
Mar 19 04:46:29 buildbot-master54 kernel: [11671]   500 11671    29687       66   1       0             0 screen
Mar 19 04:46:29 buildbot-master54 kernel: [11672]     0 11672    17175      210   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11674]   500 11674    17471      499   0       0             0 sshd
Mar 19 04:46:29 buildbot-master54 kernel: [11675]   500 11675    27098      105   1       0             0 bash
Mar 19 04:46:29 buildbot-master54 kernel: [11698]   500 11698    29687       66   1       0             0 screen
Mar 19 04:46:29 buildbot-master54 kernel: [11865]   500 11865    37113     7392   0       0             0 view
Mar 19 04:46:29 buildbot-master54 kernel: Out of memory: Kill process 17305 (buildbot) score 964 or sacrifice child
Mar 19 04:46:29 buildbot-master54 kernel: Killed process 17305, UID 500, (buildbot) total-vm:8487520kB, anon-rss:7357904kB, file-rss:480kB
Mar 19 04:46:29 buildbot-master54 kernel: rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Mar 19 04:46:29 buildbot-master54 kernel: rsyslogd cpuset=/ mems_allowed=0
Mar 19 04:46:29 buildbot-master54 kernel: Pid: 1692, comm: rsyslogd Not tainted 2.6.32-220.el6.x86_64 #1
Mar 19 04:46:29 buildbot-master54 kernel: Call Trace:
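
For anyone reading the dump above: the total_vm and rss columns in the task table are in pages, while the "Killed process" line reports kB. Here is a rough Python sketch that cross-checks the two for the buildbot process. It assumes 4 kB pages (the usual x86_64 default, not stated in the log), and the helper names parse_table_row and parse_kill_line are made up for illustration. The input strings are copied from the syslog excerpt with the syslog prefix stripped.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the usual x86_64 default

# Two lines copied from the syslog excerpt (syslog prefix removed).
TABLE_ROW = "[17305]   500 17305  2121880  1839596   0       0             0 buildbot"
KILL_LINE = ("Killed process 17305, UID 500, (buildbot) "
             "total-vm:8487520kB, anon-rss:7357904kB, file-rss:480kB")


def parse_table_row(row):
    """Return (pid, total_vm_kb, rss_kb, name) from one OOM task-table row."""
    m = re.match(r"\[\s*(\d+)\]\s+\d+\s+\d+\s+(\d+)\s+(\d+)\s+.*\s(\S+)$", row)
    if m is None:
        raise ValueError("unrecognised row: %r" % row)
    pid, total_vm, rss, name = m.groups()
    # The table counts pages, so convert to kB for comparison.
    return int(pid), int(total_vm) * PAGE_KB, int(rss) * PAGE_KB, name


def parse_kill_line(line):
    """Return the kB figures from a 'Killed process' line as a dict."""
    return {key: int(val) for key, val in re.findall(r"([a-z-]+):(\d+)kB", line)}


pid, vm_kb, rss_kb, name = parse_table_row(TABLE_ROW)
killed = parse_kill_line(KILL_LINE)

# Compare the page-based table figures with the kB-based kill line.
print(name, pid)
print("total_vm matches:", vm_kb == killed["total-vm"])
print("rss matches:", rss_kb == killed["anon-rss"] + killed["file-rss"])
```

Under the 4 kB assumption, both figures line up with the kill line, so the buildbot process alone was holding roughly 7 GB resident on a box with no swap.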


And eventually:
Mar 19 04:46:29 buildbot-master54 kernel: Killed process 11865, UID 500, (view) total-vm:148452kB, anon-rss:29496kB, file-rss:72kB
Graphite shows used memory spiking from about 4.5G to almost 7.5G around 4:44, right before the OOM killer comes along. The only thing notable in the log before that is a lot of build claiming and builder prioritizing. E.g.:
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38333584L, 38334264L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335119L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335126L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335127L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38333214L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335290L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335289L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38334697L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335346L, 38335223L]
2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38334861L]
2014-03-19 04:44:40-0700 [-] Claimed buildrequestids: [38335335L]
2014-03-19 04:44:40-0700 [-] Claimed buildrequestids: [38334595L]
2014-03-19 04:44:40-0700 [-] Claimed buildrequestids: [38334586L]
<snip>
nd']'> rc=0
2014-03-19 04:44:43-0700 [-] prioritizeBuilders: 0.00s starting
2014-03-19 04:44:43-0700 [-] prioritizeBuilders: 0.12s got 241 request(s)
2014-03-19 04:44:43-0700 [-] prioritizeBuilders: 0.13s requests for my builders: 139
2014-03-19 04:44:43-0700 [-] prioritizeBuilders: 0.13s builders with requests: 139
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.26s found 5 available of 60 connected slaves
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s builders with slaves: 50
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s prioritized 50 builder(s): [.....]
<snip>
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s important builder Ubuntu VM 12.04 mozilla-central pgo test jetpack (p == (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s important builder Ubuntu VM 12.04 mozilla-central pgo test cppunit (p == (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s important builder Ubuntu VM 12.04 mozilla-central pgo test mochitest-other (p == (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s unimportant builder Ubuntu VM 12.04 b2g-inbound pgo test crashtest-ipc ((4, 0L, 100, 1395229400L) != (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s unimportant builder Ubuntu VM 12.04 b2g-inbound pgo test jetpack ((4, 0L, 100, 1395229400L) != (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s unimportant builder Ubuntu VM 12.04 b2g-inbound pgo test reftest-no-accel ((4, 0L, 100, 1395229400L) != (3, 0L, 100, 1395229278L))
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.27s unimportant builder Ubuntu VM 12.04 b2g-inbound pgo test mochitest-3 ((4, 0L, 100, 1395229400L) != (3, 0L, 100, 1395229278L))
<snip>
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.28s triggering builder loop again since we've dropped some lower priority builders
2014-03-19 04:44:44-0700 [-] prioritizeBuilders: 0.28s finished prioritization

There are a few build steps starting after that, and then it's dead.
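
If someone wants to check whether the claiming burst is actually unusual, one option is to count claimed build request IDs per second in twistd.log and compare against a quiet period. This is only a sketch: the function name claims_per_second and the regex are mine, and it assumes every claim line looks like the "Claimed buildrequestids" lines quoted in this bug. A high count would not by itself show that claiming caused the memory spike.

```python
import re
from collections import Counter

# Matches lines such as:
# 2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38333584L, 38334264L]
CLAIM_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S* \[-\] "
    r"Claimed buildrequestids: \[(.*)\]"
)


def claims_per_second(lines):
    """Count claimed build request IDs per timestamp (one-second buckets)."""
    counts = Counter()
    for line in lines:
        m = CLAIM_RE.match(line)
        if m is None:
            continue  # not a claim line
        ids = [part for part in m.group(2).split(",") if part.strip()]
        counts[m.group(1)] += len(ids)
    return counts


if __name__ == "__main__":
    # Three lines copied from the log excerpt in this bug.
    sample = [
        "2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38333584L, 38334264L]",
        "2014-03-19 04:44:39-0700 [-] Claimed buildrequestids: [38335119L]",
        "2014-03-19 04:44:40-0700 [-] Claimed buildrequestids: [38335335L]",
    ]
    for second, count in sorted(claims_per_second(sample).items()):
        print(second, count)
```

To run it against the real log, pass an open file handle for master/twistd.log instead of the sample list.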


I don't see any interesting changes to the test masters recently, so let's chalk this up to solar flares?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard