Closed Bug 650801 Opened 13 years ago Closed 13 years ago

Cannot allocate memory errors on buildbot masters

Categories: Release Engineering :: General, defect, P2
Platform: x86_64 Linux
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: catlee; Assigned: catlee
Attachments: 3 files

Our masters have been hitting this exception on occasion:

2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Running ['/builds/buildbot/tests_master5/bin/python', '/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/bin/log_uploader.py', 'stage.mozilla.org', '-u', 'ffxbld', '-i', '/home/cltbld/.ssh/ffxbld_dsa', '-b', 'mozilla-aurora', '-p', 'macosx64', '././mozilla-aurora_snowleopard-debug_test-mochitests-4', '4']
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Error running ['/builds/buildbot/tests_master5/bin/python', '/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/bin/log_uploader.py', 'stage.mozilla.org', '-u', 'ffxbld', '-i', '/home/cltbld/.ssh/ffxbld_dsa', '-b', 'mozilla-aurora', '-p', 'macosx64', '././mozilla-aurora_snowleopard-debug_test-mochitests-4', '4']
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Log output:
2011-04-18 07:20:08-0700 [Broker,7059,10.12.50.71] Unhandled Error
        Traceback (most recent call last):
          File "/tools/python-2.6.5/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests_master5/lib/python2.6/site-packages/buildbotcustom/status/log_handlers.py", line 95, in handleLogs
            subprocess.check_call(cmd, stdout=output, stderr=subprocess.STDOUT)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 493, in check_call
            retcode = call(*popenargs, **kwargs)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 480, in call
            return Popen(*popenargs, **kwargs).wait()
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 633, in __init__
            errread, errwrite)
          File "/tools/python-2.6.5/lib/python2.6/subprocess.py", line 1049, in _execute_child
            self.pid = os.fork()
        exceptions.OSError: [Errno 12] Cannot allocate memory
I wonder if this could be somehow related to having too many threads running?

ps -eLf | grep tests_master6  | wc -l
172

Each SubprocessLogHandler has its own thread pool. Maybe they should all use reactor.callInThread instead of maintaining their own pool.
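
A minimal sketch of the change being suggested, assuming Twisted's ThreadPool and reactor.callInThread APIs; the handler internals here are simplified placeholders, not the actual buildbotcustom code:

<<<
from twisted.internet import reactor
from twisted.python.threadpool import ThreadPool

class SubprocessLogHandlerBefore(object):
    """Before: every handler owns its own pool of worker threads."""
    def __init__(self, size=10):
        self.pool = ThreadPool(minthreads=1, maxthreads=size)
        self.pool.start()

    def handleLogs(self, cmd):
        self.pool.callInThread(self._run, cmd)

    def _run(self, cmd):
        pass  # run the upload subprocess here

class SubprocessLogHandlerAfter(object):
    """After: all handlers share the reactor's global thread pool, so the
    number of idle threads no longer scales with the number of handlers."""
    def handleLogs(self, cmd):
        reactor.callInThread(self._run, cmd)

    def _run(self, cmd):
        pass  # run the upload subprocess here
>>>

With the shared pool, a per-handler 'size' argument no longer has anything to configure, which is consistent with the signature change discussed below.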
Assignee: nobody → catlee
Priority: -- → P2
Summary: Cannot allocate memory errors → Cannot allocate memory errors on buildbot masters
Use the global thread pool instead of creating one thread pool per log handler.
Attachment #526970 - Flags: review?(dustin)
Depends on: 651104
Comment on attachment 526970 [details] [diff] [review]
stop using individual thread pools

Looks good, although I don't see why you changed SubprocessLogHandler?
Attachment #526970 - Flags: review?(dustin) → review+
(In reply to comment #3)
> Comment on attachment 526970 [details] [diff] [review]
>   --> https://bugzilla.mozilla.org/attachment.cgi?id=526970
> stop using individual thread pools
> 
> Looks good, although I don't see why you changed SubprocessLogHandler?

Needed to get rid of the now unused 'size' parameter.
Comment on attachment 526970 [details] [diff] [review]
stop using individual thread pools

http://hg.mozilla.org/build/buildbotcustom/rev/5fd234770d41
Attachment #526970 - Flags: checked-in+
Flags: needs-reconfig?
Flags: needs-reconfig? → needs-reconfig+
So it looks like this is actually caused by... not enough memory!

I don't have exact details, but a quick test of this script reproduces the same failure on my machine (with all swap disabled):

<<<
import os

big = "s"
while True:
    print len(big)
    if os.fork() == 0:
        # child: leave the loop (and the process) immediately
        break
    # parent: wait for the child, then double the allocation and fork again
    os.wait()
    big = big * 2
>>>

When len(big) == 1073741824 (1 GiB), I still have over 500 MB of RAM free on my machine, yet the fork fails. I'm guessing the kernel is checking that there would be enough memory for a child process with the same resident set size as the calling process. Using vfork() here might help, but Python doesn't expose that system call.
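
As an illustration only (not anything from this bug's patches), a common workaround when fork() from a large process fails is to fork a small helper early, while the parent is still small, and have it launch subprocesses on the parent's behalf. The names below (start_helper, _helper) are made up for the sketch:

<<<
import multiprocessing
import subprocess

def _helper(cmd_queue, result_queue):
    # Runs in a small, long-lived child; its own fork()s stay cheap because
    # its resident set never grows with the parent's.
    for cmd in iter(cmd_queue.get, None):
        result_queue.put(subprocess.call(cmd))

def start_helper():
    # Fork the helper early, while this process is still small.
    cmd_q = multiprocessing.Queue()
    res_q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_helper, args=(cmd_q, res_q))
    proc.daemon = True
    proc.start()
    return cmd_q, res_q, proc

if __name__ == '__main__':
    cmd_q, res_q, helper = start_helper()
    # ... the parent may now grow arbitrarily large ...
    cmd_q.put(['echo', 'hello'])   # ask the helper to run a command
    print(res_q.get())             # 0 on success
    cmd_q.put(None)                # tell the helper to exit
    helper.join()
>>>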

Adding more swap space does count as adding more memory, so it could be a good short-term solution to this problem.
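
For reference, whether that fork fails depends on the kernel's overcommit accounting, and the commit limit includes swap, which is why adding swap helps. A quick way to inspect the current policy and commit numbers using standard Linux proc files:

<<<
def overcommit_info():
    # /proc/sys/vm/overcommit_memory: 0 = heuristic, 1 = always, 2 = never
    with open('/proc/sys/vm/overcommit_memory') as f:
        mode = f.read().strip()
    limits = {}
    with open('/proc/meminfo') as f:
        for line in f:
            # CommitLimit includes swap; Committed_AS is what's promised so far
            if line.startswith(('CommitLimit', 'Committed_AS')):
                key, value = line.split(':', 1)
                limits[key] = value.strip()
    return mode, limits

if __name__ == '__main__':
    print(overcommit_info())
>>>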
Attached patch: Die json!
With the json web status disabled, an idle master maxes out at 396,376 KB according to 'ps -o rss -p <pid>'.

With the json web status enabled, we max out at 11,673,00 KB.
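
A quick note on units: both 'ps -o rss' and the VmRSS field in /proc/<pid>/status report kilobytes. A throwaway sketch for sampling a master's RSS over time (the PID below is a placeholder):

<<<
import time

def rss_kb(pid):
    # VmRSS in /proc/<pid>/status is reported in kB, like 'ps -o rss'
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

if __name__ == '__main__':
    pid = 12345  # hypothetical buildbot master PID
    while True:
        print(rss_kb(pid))
        time.sleep(60)
>>>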
Attachment #529205 - Flags: review?(dustin)
Comment on attachment 529205 [details] [diff] [review]
Die json!

> 11,673,00 KB.

I think you mean 1,167,300 KB - not 11 GB :)  Still, a sizeable savings!
Attachment #529205 - Flags: review?(dustin) → review+
Blocks: 649734
For posterity, here is a screenshot of a week of memory usage on the masters [1] that have had this patch deployed since Saturday.

Memory usage dropped on all of them, by different amounts; some of the masters have crept back up, but not as far as they were originally.

I am curious to see what this will look like once this week's business days come into play.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering