Closed Bug 498393 Opened 13 years ago Closed 12 years ago

Linux builds failing with "Memory exhausted"

Categories

(Release Engineering :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jfkthame, Unassigned)

References

Details

(Whiteboard: [buildduty])

Attachments

(1 file)

This tinderbox is repeatedly failing with a "Memory exhausted" message. The issue is not occurring on mozilla-central, and is not specific to a particular patch being tested, so it looks like a machine problem rather than bad code.

Example logs showing the failure:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245080253.1245082924.28865.gz

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245079948.1245083306.29743.gz

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245075628.1245078314.18503.gz

and several more currently on Tinderbox.
Duplicate of this bug: 498414
Rebooted linux builders where unittest failures on build occurred due to
"Memory Exhaustion":

try-linux-slave19
try-linux-slave18
try-linux-slave17
try-linux-slave16
try-linux-slave15
try-linux-slave12

The other try slaves are either having green runs or orange-but-complete runs, or their failures were for other reasons, such as the patch itself failing.

Will monitor these slaves in the next few hours (or overnight) to see if the
reboot leads to some successful builds and determine if try slaves should
auto-reboot after builds like talos builders do.
Assignee: nobody → lsblakk
Checked in this morning: try-linux-slave19 had a post-reboot build with warnings, plus errors in xpcshell-tests, reftest, mochitest-plain, mochitest-chrome, mochitest-browser-chrome, and mochitest-a11y:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245144748.1245150245.21127.gz

The next build had a "Memory Exhausted" compile fail:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245159149.1245161702.14323.gz

try-linux-slave15 had an orange run following reboot:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245156687.1245162733.18577.gz

I strongly suspect that the next unit test build it does will fail as well.
Are there processes left over after the timeout? Are we using the old process-killing stuff from unit on the try server?
On try-linux-slave11, after failed builds on both unittest and hg build I looked at the running processes and found that ssltunnel and firefox-bin were still running on an idle slave.  Killed those processes and pushed some dummy patches to get builds and the next builds were orange for unittest with 5 fails on mochitest-chrome:

6666 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | button focus focus event fired - got false, expected true
6667 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | popup skipped in focus navigation focus event fired - got false, expected true
6668 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | focus removed when popup opened - got 0, expected 1
6707 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | blur events fired within popup - got 2, expected 3
6708 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | focus removed when popup closed - got 3, expected 4


and two fails on mochitest-plain (which I wonder whether NO_FAIL_ON_ERRORS = 1 from bug 445611 would fix):
82093 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier.html | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIUrlClassifierDBService.beginUpdate]"  nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)"  location: "JS frame :: http://localhost:8888/tests/toolkit/components/url-classifier/tests/mochitest/test_classifier.html :: doUpdate :: line 58"  data: no] - got 0, expected 1
82095 INFO Running /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html...
82096 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIUrlClassifierDBService.beginUpdate]"  nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)"  location: "JS frame :: http://localhost:8888/tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html :: doUpdate :: line 58"  data: no] - got 0, expected 1
82098 INFO Running /tests/toolkit/content/tests/widgets/test_bug359754.xul...


The hg build run went through green. 

Looks like we may want to do auto-rebooting after each build to ensure no processes are left running. There are also some cronjob zombie processes left over after the display is set; these are probably not hurting anything.
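Pending the auto-reboot change, the manual cleanup above ('killall firefox-bin ssltunnel' on each affected slave) can be scripted. A minimal sketch — the process names come from the observations in this bug, and the parsing of `ps` output is my own illustrative helper, not anything in our tooling:

```python
import os
import signal
import subprocess

# Process names the test harness is known to leave behind on these slaves
# (firefox-bin and ssltunnel, per the comments above); extend as needed.
LEFTOVERS = ("firefox-bin", "ssltunnel")

def find_leftovers(ps_lines, names=LEFTOVERS):
    """Return (pid, command) pairs for leftover harness processes.

    ps_lines is `ps -eo pid,comm` output split into lines, header included.
    Kept as a pure function over the text so it is easy to test.
    """
    hits = []
    for line in ps_lines[1:]:              # skip the "PID COMMAND" header
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        pid, comm = parts
        if comm.strip() in names:
            hits.append((int(pid), comm.strip()))
    return hits

def kill_leftovers():
    """Rough equivalent of the manual `killall firefox-bin ssltunnel`."""
    out = subprocess.check_output(["ps", "-eo", "pid,comm"]).decode()
    for pid, _comm in find_leftovers(out.splitlines()):
        os.kill(pid, signal.SIGTERM)
```

Something like this could run from a post-build step or cron until per-build reboots land.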
the mochitest-plain errors should be fixed by bug 497884
Duplicate of this bug: 500344
Summary: "Linux try hg unit test" repeatedly failing, Memory exhausted → Linux try builds repeatedly failing, Memory exhausted
(In reply to comment #8)
> *** Bug 500344 has been marked as a duplicate of this bug. ***

I went round all the linux slaves and did a 'killall firefox-bin ssltunnel' where it was needed, which was try-linux-slave11,12,13,14,15,16,19. Typically there'd be a group of 4/5 processes associated with running one of the test suites, most commonly one of the mochitests but I saw crashtest at least once too. 

Curious that this seems to affect the newer slaves only.
When comparing mochitest-plain logs for mozilla-central and try, there are four (non-trivial) differences in the environment variables:
* we don't set MINIDUMP_STACKWALK on try (bug 483111)
* we don't use an OBJDIR on try
* TERM is xterm on m-c, linux on try
* buildbot has "using PTY: True" on m-c, False on try.
The last comes from "usepty = 0|1" in buildbot.tac on the slave. 

The signature for this problem is a timeout or problem in a test suite, then processes are not killed off properly. A subsequent compile will fail with the memory exhausted error. 

Looking at all the linux slaves, only try-linux-slave10 through 19 have usepty = 0; the rest of try linux and m-c linux set 1. Note that 10 through 19 are relatively new slaves whose buildbot.tac was generated by Buildbot 0.7.10, which changed the default for usepty from 1 to 0:
  http://buildbot.net/trac/ticket/158
(It also allows you to set usepty on a per-step basis.)

So I've flipped usepty from 0 to 1 on those 10 slaves and that should fix us up.

Incidentally, bug 478150 is the same problem on bm-xserve17.
The ref platform notes are fixed to include usepty = 1.

I've set that on try-w32-slave17 through 19, then noticed that they're not setting keepAlive = None either.  Lukas, please verify that all the win32 slaves for try and moz2 have the three settings at 
  https://wiki.mozilla.org/ReferencePlatforms/Win32#Setup_Slave
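For concreteness, the three settings those reference-platform notes call for live in each slave's buildbot.tac. A sketch of what the relevant lines look like (Buildbot 0.7.x template, Python 2 era — the surrounding tac code varies by version, so treat this as illustrative):

```python
# buildbot.tac excerpt (illustrative; Buildbot 0.7.x, Python 2)
usepty = 1        # run steps under a pty so leftover children die with the step
keepalive = None  # per the reference platform notes (was 600 on many slaves)
umask = 002       # Python 2 octal literal; group-writable output for the pool
```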
All the win32 try slaves have been updated now.
How many needed adjusting ?
Almost all of the try slaves needed keepalive set to None, a few needed usepty set, and most of the older slaves (and many in the moz2 pool) needed only the umask set.
All the moz2 win32 slaves have been adjusted as needed for parity. From slave 01-15, most needed keepalive changed from 600 to None; slave03 needed usepty set to 1; slave08 needed usepty set to 1 as well as umask corrected from 022 to 002. Most of the slaves from 17-29 needed umask set to 002.
We haven't heard anything about this on Try recently; possibly a result of patches or the above fixes.  But we're planning on bumping RAM on our linux slaves.  (Marking bug 501222 as a dependency.)
Depends on: 501222
Duplicate of this bug: 483199
The build slaves have been upgraded as part of bug 501222 and all that's left is the masters which requires downtime.  Closing this bug since we haven't seen a "Memory Exhausted" error since the upgrade which took place several days ago.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Just happened on production windows build machine:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248817814.1248820526.15431.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #19)
> Just happened on production windows build machine:
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248817814.1248820526.15431.gz

I think you mean production Linux machine. I see it in the logs for moz2-linux-slave15 and 24 recently. Both of these machines are on the same cluster, and both failed right around 15:40 PDT yesterday which is a little suspicious.

I've noticed that since we've done the RAM upgrade that our ESX hosts are in a 'warning' state sometimes. I'll ping IT about this, it's possible that we're running out of physical RAM rather than RAM in the VM itself.
Phong increased the sensitivity of the load balancer on our ESX clusters yesterday and we haven't seen any more failures since that. Let's have this run over the weekend before closing again, though.
Linux electrolysis unit test on moz2-slave-slave22 hit this on Sat Aug 1 at 22:53:50
Reassigning to nobody@mozilla and leaving this open to track occurrences.
Assignee: lsblakk → nobody
Assignee: nobody → armenzg
Assignee: armenzg → nobody
(In reply to comment #26)
This was moz2-linux-slave15

So are these two:
Linux tracemonkey build on 2009/09/01 17:30:38
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1251851438.1251854151.26859.gz

Linux tracemonkey unit test on 2009/09/01 16:40:47
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1251848447.1251851480.30511.gz

This machine appears to have some sort of memory leak as a process listing doesn't account for the memory reported in use. I'll reboot it in a bit.
(In reply to comment #27)
> This was moz2-linux-slave15
> 
> This machine appears to have some sort of memory leak as a process listing
> doesn't account for the memory reported in use. I'll reboot it in a bit.

Stopping and starting buildbot was enough to get back to 
             total       used       free     shared    buffers     cached
Mem:          2027        375       1651          0         51        186
-/+ buffers/cache:        137       1889
Swap:          509         72        437

Previously that 137 was more like 1300, even though ps claimed buildbot was not using nearly that much.
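For anyone decoding that `free -m` output: the "-/+ buffers/cache" row is derived from the first row by treating buffers and cache as reclaimable. A quick sketch of the arithmetic, using the numbers quoted above (`free` rounds KB down to MB per column, which is why the printed 137 and the computed 138 differ by one):

```python
def real_usage(used_mb, buffers_mb, cached_mb):
    """Memory genuinely in use once reclaimable buffers/cache are excluded.

    Mirrors the "-/+ buffers/cache: used" figure that older `free` prints.
    """
    return used_mb - buffers_mb - cached_mb

# Figures from the healthy post-restart state quoted above, in MB:
print(real_usage(375, 51, 186))  # 138, vs. free's per-column-rounded 137
```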
per bsmedberg in dev meeting today, "30-50% of Linux electrolysis builds are
failing due to out-of-memory debug linking libxul..." 

unknown when this failure pattern started.
bsmedberg: are you seeing this with electrolysis builds on try-server only, or also for build-triggered-on-checkin on electrolysis branch?
I don't know if anyone is doing tryserver/electrolysis: this is with standard build-on-checkin builds: some samples from recently:

Linux electrolysis build on 2009/09/08 03:26:15 http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252405575.1252408388.3815.gz

Linux electrolysis unit test on 2009/09/04 00:17:18 http://tinderbox.mozilla.org/showlog.cgi?tree=Electrolysis&errorparser=unittest&logfile=1252048638.1252051576.5758.gz&buildtime=1252048638&buildname=Linux%20electrolysis%20unit%20test&fulltext=

Linux electrolysis build on 2009/08/27 16:25:55 http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251415555.1251418442.14995.gz

Linux electrolysis unit test on 2009/08/26 18:50:56 http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251337856.1251340788.10042.gz

Linux electrolysis unit test on 2009/08/26 09:40:47 http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251304847.1251308856.7955.gz

That's less than 30%; more like 10% or so, but still enough to make life painful.
(In reply to comment #31)
> I don't know if anyone is doing tryserver/electrolysis: this is with standard
> build-on-checkin builds: some samples from recently:

ok, I've tweaked the summary to match. 



> Linux electrolysis build on 2009/09/08 03:26:15
> http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252405575.1252408388.3815.gz
> 
> Linux electrolysis unit test on 2009/09/04 00:17:18
> http://tinderbox.mozilla.org/showlog.cgi?tree=Electrolysis&errorparser=unittest&logfile=1252048638.1252051576.5758.gz&buildtime=1252048638&buildname=Linux%20electrolysis%20unit%20test&fulltext=
> 
> Linux electrolysis build on 2009/08/27 16:25:55
> http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251415555.1251418442.14995.gz
> 
> Linux electrolysis unit test on 2009/08/26 18:50:56
> http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251337856.1251340788.10042.gz
> 
> Linux electrolysis unit test on 2009/08/26 09:40:47
> http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251304847.1251308856.7955.gz
> 
> That's less than 30%; more like 10% or so, but still enough to make life
> painful.

Phew.  10% is still not good, and still needs resolution, but it's much better than the 30%-50% reported in the meeting!
Summary: Linux try builds repeatedly failing, Memory exhausted → Linux builds failing with "Memory exhausted"
Previous fixes for these memory issues involved reboots of the affected build slaves.

Marking dependent on bug 393411.
Depends on: 393411
Assignee: nobody → aki
We should now be rebooting build slaves after every 5 builds, so if a reboot fixes that problem, the 10% build failures mentioned in comment 32 should drop by at least 20% (to 8% =\ ) and maybe as much as 100%.

We are planning on moving this to a reboot per build.  At that point, if a reboot fixes this issue on electrolysis, the problem should disappear.

If the electrolysis issue manifests whether a slave has rebooted recently or not, then we will continue seeing this.
Had another one today on e10s around 2am.
It sucks to see a bunch of these even though we're rebooting after every build these days.
I didn't really have a solution beyond the reboots, which didn't fix the problem.
I'm open to suggestions.  One possibility is adding more memory to every slave (how much?), another is tracking down what's eating all the memory and fixing or working around that.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256300917.1256304781.1566.gz
Linux mozilla-central build on 2009/10/23 05:28:37  

/usr/bin/ld: final link failed: Memory exhausted
collect2: ld returned 1 exit status
make[4]: *** [libxul.so] Error 1
And again
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256605329.1256609375.22067.gz

That and comments 45 & 46 are all on moz2-linux-slave15, which is frequently failing to compile m-c opt and debug builds.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256605329.1256609375.22067.gz
Linux mozilla-central leak test build on 2009/10/26 18:02:09
moz2-linux-slave15
Good catch!

I've pulled moz2-linux-slave15 out of the pool, and will file a bug to nuke/reclone it.
Recloning/re-setting-up moz2-linux-slave15 in bug 524717 + bug 524719 .
Any recent occurrences of this?
Yes, it's still a problem for e10s.  libxul OOM happened 4 times in the last ~24 hours there.  I'd estimate it happens on maybe 1/4-1/5 of linux builds, though that may be confirmation bias.
Throwing back in the pool as it looks like I won't have sufficient time for this, and someone else may be able to work on this.

My current ideas: 1) set puppet to only run on boot (or check to see if a build is running); 2) add more memory to all the slaves.
Assignee: aki → nobody
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257536089.1257539648.17100.gz
Linux mozilla-central leak test build on 2009/11/06 11:34:49
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257554181.1257555690.8280.gz
Linux mozilla-central leak test build on 2009/11/06 16:36:21
Comment 55, comment 56, comment 57, comment 58 (and the one right after it I didn't bother pasting) and comment 59 are all moz2-linux-slave22, which seems like a significant pattern.
Depends on: 527688
(In reply to comment #60)
> Comment 55, comment 56, comment 57, comment 58 (and the one right after it I
> didn't bother pasting) and comment 59 are all moz2-linux-slave22, which seems
> like a significant pattern.

Yes, I'd just done the analysis here and filed bug 527688 to cover getting linux slaves 22 and 24 re-cloned.

Can I ask reporters to take the extra time to grep the slave name out of the tinderbox log when reporting recurrences? It will be listed under "environment:" in the footer of the brief log. Having the slave name reported will make establishing patterns easier.
Assignee: nobody → ccooper
Status: REOPENED → ASSIGNED
Priority: -- → P2
(In reply to comment #61)
> Can I ask reporter's to take the extra time to grep the slave name out of the
> tinderbox log when reporting recurrences?

This would be much easier if the slave name were just included in the tinderbox log header string. e.g. 
Linux mozilla-central leak test build / moz2-linux-slave22 on 2009/11/10 02:15:32
This will add the hostname to the tinderbox summary, in green, now that bug 454055 has been pushed to tinderbox.m.o in bug 526287.
Attachment #411466 - Flags: review?(ccooper)
Comment on attachment 411466 [details] [diff] [review]
Add the hostname to tinderbox summary

l10n already does this, FWIW. Can we switch the syntax to be the same, i.e. "s: slavename" for brevity's sake?

r+ with that fix.
Attachment #411466 - Flags: review?(ccooper) → review+
Comment on attachment 411466 [details] [diff] [review]
Add the hostname to tinderbox summary

http://hg.mozilla.org/build/buildbotcustom/rev/4c607d12e811

Slave names should be visible in the brief log now.
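For the record, the idea behind the patch is tiny. A toy version using the terse "s: slavename" syntax requested in review — the function name and call site here are hypothetical stand-ins; the real change is the buildbotcustom revision linked above:

```python
def annotate_summary(summary, slavename):
    """Append the build slave's name to a tinderbox brief-log summary.

    "s: <name>" follows the l10n convention mentioned in review; this is
    an illustrative stand-in for the actual buildbotcustom change.
    """
    return "%s s: %s" % (summary, slavename)

print(annotate_summary("Linux mozilla-central leak test build", "moz2-linux-slave22"))
# -> Linux mozilla-central leak test build s: moz2-linux-slave22
```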
Attachment #411466 - Flags: checked-in+
Putting back into the pool for buildduty monitoring.
Assignee: ccooper → nobody
Status: ASSIGNED → NEW
Whiteboard: [buildduty]
Maybe we should try adding --no-keep-memory to our linker flags (as per bug 494068)?
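If we experiment with that, the flag would go into the build mozconfigs roughly like so (illustrative only — how our tinderbox mozconfigs actually set LDFLAGS may differ):

```
# mozconfig fragment (illustrative): ask GNU ld to discard its per-input-file
# caches between passes, trading link speed for lower peak memory.
export LDFLAGS="-Wl,--no-keep-memory"
```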
I'm not so naive as to believe this is truly fixed. Please re-open to add new examples.
Status: NEW → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering