Cannot allocate memory errors on buildbot masters

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: catlee, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Assignee)

Description

7 years ago
Our masters have been hitting this exception on occasion:

2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Running ['/builds/buildbot/tests_master5/bin/python', '/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/bin/log_uploader.py', 'stage.mozilla.org', '-u', 'ffxbld', '-i', '/home/cltbld/.ssh/ffxbld_dsa', '-b', 'mozilla-aurora', '-p', 'macosx64', '././mozilla-aurora_snowleopard-debug_test-mochitests-4', '4']
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Error running ['/builds/buildbot/tests_master5/bin/python', '/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/bin/log_uploader.py', 'stage.mozilla.org', '-u', 'ffxbld', '-i', '/home/cltbld/.ssh/ffxbld_dsa', '-b', 'mozilla-aurora', '-p', 'macosx64', '././mozilla-aurora_snowleopard-debug_test-mochitests-4', '4']
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Log output:
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Unhandled Error
        Traceback (most recent call last):
          File "/tools/python-2.6.5/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/status/log_handlers.py", line 95, in handleLogs
            subprocess.check_call(cmd, stdout=output, stderr=subprocess.STDOUT)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 493, in check_call
            retcode = call(*popenargs, **kwargs)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 480, in call
            return Popen(*popenargs, **kwargs).wait()
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 633, in __init__
            errread, errwrite)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 1049, in _execute_child
            self.pid = os.fork()
        exceptions.OSError: [Errno 12] Cannot allocate memory
(Assignee)

Comment 1

7 years ago
I wonder if this could be somehow related to having too many threads running?

ps -eLf | grep tests_master6  | wc -l
172

Each SubprocessLogHandler has its own thread pool. Maybe they should all use reactor.callInThread instead of maintaining their own pool.
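A minimal sketch of that idea (not the actual buildbotcustom patch — the class shape, pool size, and helper names here are illustrative, and it's written in modern Python rather than the 2.6 of the era): every handler submits its subprocess work to one shared pool instead of owning a private one, analogous to routing everything through Twisted's `reactor.callInThread` and the reactor's global thread pool.

```python
# Illustrative only: one shared executor replaces per-handler thread
# pools, capping the total number of worker threads on the master.
from concurrent.futures import ThreadPoolExecutor

SHARED_POOL = ThreadPoolExecutor(max_workers=4)  # one pool for all handlers

def run_uploader(cmd):
    # Stand-in for the subprocess.check_call() done in handleLogs().
    return "ran: " + " ".join(cmd)

class SubprocessLogHandler:
    # No private ThreadPool here; every handler shares SHARED_POOL.
    def handle_logs(self, cmd):
        return SHARED_POOL.submit(run_uploader, cmd)

handlers = [SubprocessLogHandler() for _ in range(20)]
futures = [h.handle_logs(["log_uploader.py", str(i)])
           for i, h in enumerate(handlers)]
results = [f.result() for f in futures]
```

With 20 handlers the master still runs at most 4 worker threads, where per-handler pools would multiply the thread count by the number of handlers.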
Assignee: nobody → catlee
(Assignee)

Updated

7 years ago
Priority: -- → P2
Summary: Cannot allocate memory errors → Cannot allocate memory errors on buildbot masters
(Assignee)

Comment 2

7 years ago
Created attachment 526970 [details] [diff] [review]
stop using individual thread pools

Use the global thread pool instead of creating one thread pool per log handler.
Attachment #526970 - Flags: review?(dustin)
(Assignee)

Updated

7 years ago
Depends on: 651104

Comment 3

7 years ago
Comment on attachment 526970 [details] [diff] [review]
stop using individual thread pools

Looks good, although I don't see why you changed SubprocessLogHandler?
Attachment #526970 - Flags: review?(dustin) → review+
(Assignee)

Comment 4

7 years ago
(In reply to comment #3)
> Comment on attachment 526970 [details] [diff] [review]
>   --> https://bugzilla.mozilla.org/attachment.cgi?id=526970
> stop using individual thread pools
> 
> Looks good, although I don't see why you changed SubprocessLogHandler?

Needed to get rid of the now unused 'size' parameter.
(Assignee)

Comment 5

7 years ago
Comment on attachment 526970 [details] [diff] [review]
stop using individual thread pools

http://hg.mozilla.org/build/buildbotcustom/rev/5fd234770d41
Attachment #526970 - Flags: checked-in+
(Assignee)

Updated

7 years ago
Flags: needs-reconfig?
(Assignee)

Updated

7 years ago
Flags: needs-reconfig? → needs-reconfig+
(Assignee)

Comment 6

7 years ago
So it looks like this is actually caused by... not enough memory!

I don't have exact details, but a quick test of this script reproduces the same failure on my machine (with all swap disabled):

<<<
import os
big = "s"
while True:
    print len(big)
    if os.fork() == 0:
        break
    os.wait()
    big = big * 2
>>>

When len(big) == 1073741824, I have over 500 MB of RAM free on my machine, yet the fork fails. I'm guessing that the kernel is making sure there would be enough room in memory for a child process with the same resident set size as the calling process. Using vfork() here might help, but Python doesn't expose that system call.

Adding more swap space counts as adding more memory, so it could be a good short-term solution to this problem.
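Another stopgap (hypothetical — this is not what landed for this bug) would be to treat ENOMEM from fork() as transient and retry the subprocess call after a short backoff, on the theory that memory pressure on the master fluctuates:

```python
# Sketch: retry subprocess.check_call() when fork() fails with
# ENOMEM ("Cannot allocate memory"); re-raise anything else.
import errno
import subprocess
import sys
import time

def check_call_retry_enomem(cmd, retries=3, delay=0.5):
    for attempt in range(retries):
        try:
            return subprocess.check_call(cmd)
        except OSError as e:
            if e.errno != errno.ENOMEM or attempt == retries - 1:
                raise
            time.sleep(delay)  # back off and hope memory was freed

rc = check_call_retry_enomem([sys.executable, "-c", "pass"])
```

This only papers over the symptom, of course — if the master's footprint keeps growing, the retries eventually fail too.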
(Assignee)

Comment 7

7 years ago
Created attachment 529205 [details] [diff] [review]
Die json!

With the json web status disabled, an idle master maxes out at 396,376 MB according to 'ps -o rss -p <pid>'

With the json web status enabled, we max out at 11,673,00 MB.
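For reference when reading these figures: on Linux, `ps -o rss` reports kilobytes, so a quick way to watch a master's footprint in both units (PID is a placeholder for the master's process id) is:

```shell
# Print a process's resident set size; ps -o rss is in KB on Linux.
PID=$$  # placeholder: substitute the buildbot master's pid
rss_kb=$(ps -o rss= -p "$PID" | tr -d ' ')
echo "RSS: ${rss_kb} KB ($((rss_kb / 1024)) MB)"
```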
Attachment #529205 - Flags: review?(dustin)

Comment 8

7 years ago
Comment on attachment 529205 [details] [diff] [review]
Die json!

> 11,673,00 MB.

I think you mean 1,167,300 MB - not 11GB :)  Still, a sizeable savings!
Attachment #529205 - Flags: review?(dustin) → review+

Updated

7 years ago
Blocks: 649734

Comment 9

7 years ago
Created attachment 529483 [details]
[screenshot] memory usage reduction

For posterity, here is a screenshot of a week's worth of memory usage on the masters [1] that have had this patch deployed since Saturday.

Memory usage dropped on all of them by varying amounts; some of the masters have crept back up, but not as high as they were originally.

I am curious to see what this will look like after this week's business days come into play.
(Assignee)

Updated

7 years ago
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering