Closed
Bug 498393
Opened 15 years ago
Closed 15 years ago
Linux builds failing with "Memory exhausted"
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jfkthame, Unassigned)
References
Details
(Whiteboard: [buildduty])
Attachments
(1 file)
942 bytes, patch (coop: review+; checked-in+)
This tinderbox is repeatedly failing with a "Memory exhausted" message. The issue is not occurring on mozilla-central, and is not specific to a particular patch being tested, so it looks like a machine problem rather than bad code. Example logs showing the failure:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245080253.1245082924.28865.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245079948.1245083306.29743.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245075628.1245078314.18503.gz
and several more currently on Tinderbox.
Comment 2•15 years ago
Rebooted the Linux builders where unittest builds failed due to "Memory Exhaustion": try-linux-slave19, try-linux-slave18, try-linux-slave17, try-linux-slave16, try-linux-slave15, try-linux-slave12. The other try slaves are either having green runs or orange-but-complete runs, or their failures had other causes such as the patch itself failing. Will monitor these slaves over the next few hours (or overnight) to see if the reboot leads to some successful builds, and to determine whether try slaves should auto-reboot after builds like the talos builders do.
Updated•15 years ago
Assignee: nobody → lsblakk
Comment 3•15 years ago
Checked in this morning: try-linux-slave19 had a build after reboot with warnings, plus xpcshell-tests, reftest, mochitest-plain, mochitest-chrome, mochitest-browser-chrome, and mochitest-a11y errors:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245144748.1245150245.21127.gz
The next build had a "Memory Exhausted" compile fail:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245159149.1245161702.14323.gz
try-linux-slave15 had an orange run following reboot:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245156687.1245162733.18577.gz
I strongly suspect that the next unit test build it does will fail as well.
Comment 4•15 years ago
Are there processes left over after the timeout? Are we using the old process-killing stuff from unittest on the try server?
Comment 5•15 years ago
This has also happened on a regular hg try build: http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245156687.1245159986.9909.gz
Comment 6•15 years ago
On try-linux-slave11, after failed builds on both unittest and hg build, I looked at the running processes and found that ssltunnel and firefox-bin were still running on an idle slave. I killed those processes and pushed some dummy patches to get builds; the next builds were orange for unittest, with 5 fails on mochitest-chrome:

6666 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | button focus focus event fired - got false, expected true
6667 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | popup skipped in focus navigation focus event fired - got false, expected true
6668 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | focus removed when popup opened - got 0, expected 1
6707 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | blur events fired within popup - got 2, expected 3
6708 ERROR TEST-UNEXPECTED-FAIL | chrome://mochikit/content/chrome/toolkit/content/tests/chrome/test_panel_focus.xul | focus removed when popup closed - got 3, expected 4

and two fails on mochitest-plain (which I wonder if NO_FAIL_ON_ERRORS = 1 would fix, from bug 445611):

82093 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier.html | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIUrlClassifierDBService.beginUpdate]" nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)" location: "JS frame :: http://localhost:8888/tests/toolkit/components/url-classifier/tests/mochitest/test_classifier.html :: doUpdate :: line 58" data: no] - got 0, expected 1
82095 INFO Running /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html...
82096 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIUrlClassifierDBService.beginUpdate]" nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)" location: "JS frame :: http://localhost:8888/tests/toolkit/components/url-classifier/tests/mochitest/test_classifier_worker.html :: doUpdate :: line 58" data: no] - got 0, expected 1
82098 INFO Running /tests/toolkit/content/tests/widgets/test_bug359754.xul...

The hg build run went through green. Looks like we may want to do auto-rebooting after each build to ensure no processes are left running. There are also some cronjob zombie processes left over after the display is set; these are probably not hurting anything.
Comment 7•15 years ago
The mochitest-plain errors should be fixed by bug 497884.
Updated•15 years ago
Summary: "Linux try hg unit test" repeatedly failing, Memory exhausted → Linux try builds repeatedly failing, Memory exhausted
Comment 9•15 years ago
(In reply to comment #8)
> *** Bug 500344 has been marked as a duplicate of this bug. ***

I went around all the Linux slaves and did a 'killall firefox-bin ssltunnel' where it was needed, which was try-linux-slave11,12,13,14,15,16,19. Typically there'd be a group of 4-5 processes associated with running one of the test suites, most commonly one of the mochitests, but I saw crashtest at least once too. Curious that this seems to affect only the newer slaves.
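The manual cleanup above can be sketched as a small script. The process names (ssltunnel, firefox-bin) come from this comment; the function, field layout, and sample listing are illustrative only:

```python
# Sketch: scan a ps listing for test-harness processes (ssltunnel,
# firefox-bin) that survived a killed buildbot step on an idle slave.
LEFTOVER_NAMES = ("ssltunnel", "firefox-bin")

def find_leftovers(ps_output):
    """Return (pid, command) pairs for stray test processes."""
    stray = []
    for line in ps_output.strip().splitlines()[1:]:   # skip the header row
        fields = line.split(None, 3)                  # PID TTY TIME CMD
        pid, command = fields[0], fields[-1]
        if any(name in command for name in LEFTOVER_NAMES):
            stray.append((int(pid), command))
    return stray

sample = """\
PID TTY      TIME     CMD
1234 ?       00:00:01 /tools/buildbot/bin/python
5678 ?       00:14:02 obj-dir/dist/bin/ssltunnel
5680 ?       00:09:55 obj-dir/dist/bin/firefox-bin -P default
"""
print(find_leftovers(sample))
```

On a real slave the listing would come from `ps` and the stray PIDs would then be killed, which is what the 'killall' pass did by hand.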
Comment 10•15 years ago
When comparing mochitest-plain logs for mozilla-central and try there are four (non-trivial) differences in the environment variables:
* we don't set MINIDUMP_STACKWALK on try (bug 483111)
* we don't use an OBJDIR on try
* TERM is xterm on m-c, linux on try
* buildbot has "using PTY: True" on m-c, False on try
The last comes from "usepty = 0|1" in buildbot.tac on the slave. The signature for this problem is a timeout or problem in a test suite, after which processes are not killed off properly; a subsequent compile then fails with the memory-exhausted error. Looking at all the Linux slaves, only try-linux-slave10 through 19 have usepty = 0; the rest of the try Linux and m-c Linux slaves set 1. Note that 10 through 19 are relatively new slaves, which had their buildbot.tac generated by Buildbot 0.7.10, and that release changed the default for usepty from 1 to 0: http://buildbot.net/trac/ticket/158 (it also allows you to set usepty on a per-step basis). So I've flipped usepty from 0 to 1 on those 10 slaves, and that should fix us up. Incidentally, bug 478150 is the same problem on bm-xserve17.
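For context, usepty lives in the generated slave-side buildbot.tac. A minimal sketch of the relevant part follows; the host name and password are placeholders, and the surrounding boilerplate assumes the Buildbot 0.7.x slave template:

```python
# buildbot.tac on a build slave (Buildbot 0.7.x era) -- illustrative sketch
from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = '/builds/slave'
buildmaster_host = 'production-master.build.mozilla.org'  # placeholder
port = 9010
slavename = 'try-linux-slave15'
passwd = 'XXXXXXXX'
keepalive = None   # per the reference platform notes
usepty = 1         # Buildbot 0.7.10 flipped the default to 0 (trac #158);
                   # without a PTY, child processes can outlive a killed step
umask = 002

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask)
s.setServiceParent(application)
```

With a PTY, killing a build step tears down the whole process group, which is why usepty = 0 left ssltunnel and firefox-bin running.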
Comment 11•15 years ago
The ref platform notes are fixed to include usepty = 1. I've set that on try-w32-slave17 through 19, then noticed that they're not setting keepAlive = None either. Lukas, please verify that all the win32 slaves for try and moz2 have the three settings at https://wiki.mozilla.org/ReferencePlatforms/Win32#Setup_Slave
Comment 12•15 years ago
All the win32 try slaves have been updated now.
Comment 13•15 years ago
How many needed adjusting?
Comment 14•15 years ago
Almost all of the try slaves needed keepalive set to None, a few needed usepty set, and most of the older slaves (and many in the moz2 pool) needed just the umask set.
Comment 15•15 years ago
All the win32 slaves in the moz2 pool have been adjusted as needed for parity. Of slaves 01-15, most needed keepalive changed from 600 to None; slave03 needed usepty set to 1; slave08 needed usepty set to 1 as well as umask corrected from 022 to 002. Most of the slaves from 17-29 needed umask set to 002.
Comment 16•15 years ago
We haven't heard anything about this on Try recently; possibly a result of patches or the above fixes. But we're planning on bumping RAM on our linux slaves. (Marking bug 501222 as a dependency.)
Depends on: 501222
Comment 18•15 years ago
The build slaves have been upgraded as part of bug 501222 and all that's left is the masters which requires downtime. Closing this bug since we haven't seen a "Memory Exhausted" error since the upgrade which took place several days ago.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 19•15 years ago
Just happened on production windows build machine: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248817814.1248820526.15431.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 20•15 years ago
(In reply to comment #19)
> Just happened on production windows build machine:
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1248817814.1248820526.15431.gz

I think you mean a production Linux machine. I see it in the logs for moz2-linux-slave15 and 24 recently. Both of these machines are on the same cluster, and both failed right around 15:40 PDT yesterday, which is a little suspicious. I've noticed that since we did the RAM upgrade our ESX hosts are sometimes in a 'warning' state. I'll ping IT about this; it's possible that we're running out of physical RAM on the host rather than RAM in the VM itself.
Comment 21•15 years ago
Phong increased the sensitivity of the load balancer on our ESX clusters yesterday and we haven't seen any more failures since that. Let's have this run over the weekend before closing again, though.
Comment 22•15 years ago
Linux electrolysis unit test on moz2-linux-slave22 hit this on Sat Aug 1 at 22:53:50.
Comment 23•15 years ago
Reassigning to nobody@mozilla and leaving this open to track occurrences.
Assignee: lsblakk → nobody
Updated•15 years ago
Assignee: nobody → armenzg
Comment 24•15 years ago
http://production-master02.build.mozilla.org:8010/builders/Linux%20electrolysis%20nightly/builds/16
Comment 25•15 years ago
.
Updated•15 years ago
Assignee: armenzg → nobody
Still seeing this on electrolysis builds:
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251817564.1251821239.11484.gz
Comment 27•15 years ago
(In reply to comment #26)
This was moz2-linux-slave15. So are these two:

Linux tracemonkey build on 2009/09/01 17:30:38
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1251851438.1251854151.26859.gz

Linux tracemonkey unit test on 2009/09/01 16:40:47
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1251848447.1251851480.30511.gz

This machine appears to have some sort of memory leak, as a process listing doesn't account for the memory reported in use. I'll reboot it in a bit.
Comment 28•15 years ago
(In reply to comment #27)
> This was moz2-linux-slave15
>
> This machine appears to have some sort of memory leak as a process listing
> doesn't account for the memory reported in use. I'll reboot it in a bit.

Stopping and starting buildbot was enough to get back to:

             total       used       free     shared    buffers     cached
Mem:          2027        375       1651          0         51        186
-/+ buffers/cache:         137       1889
Swap:          509         72        437

Previously that 137 was more like 1300, even though ps claimed buildbot was not using nearly that much.
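The "-/+ buffers/cache" row can be reconstructed from the Mem row: buffers and page cache are reclaimable, so free(1) subtracts them from "used" to estimate memory actually held by processes. A quick sketch of that arithmetic, using the values quoted above (free rounds each field independently, which explains the off-by-one against the log):

```python
# Values in MB from the `free -m` output in the comment above.
total, used, free_mem = 2027, 375, 1651
buffers, cached = 51, 186

used_by_processes = used - buffers - cached     # the "-/+" used column
effectively_free = free_mem + buffers + cached  # the "-/+" free column

print(used_by_processes, effectively_free)      # ~137 and ~1889 in the log
```

A healthy slave has most of its "used" attributable to reclaimable cache; the pre-restart state (~1300 MB in the "-/+" used column with no process to account for it) was the anomaly.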
Comment 29•15 years ago
Per bsmedberg in the dev meeting today, "30-50% of Linux electrolysis builds are failing due to out-of-memory debug linking libxul..." It's unknown when this failure pattern started.
Comment 30•15 years ago
bsmedberg: are you seeing this with electrolysis builds on try-server only, or also for build-triggered-on-checkin on electrolysis branch?
Comment 31•15 years ago
I don't know if anyone is doing tryserver/electrolysis; this is with standard build-on-checkin builds. Some samples from recently:

Linux electrolysis build on 2009/09/08 03:26:15
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252405575.1252408388.3815.gz

Linux electrolysis unit test on 2009/09/04 00:17:18
http://tinderbox.mozilla.org/showlog.cgi?tree=Electrolysis&errorparser=unittest&logfile=1252048638.1252051576.5758.gz&buildtime=1252048638&buildname=Linux%20electrolysis%20unit%20test&fulltext=

Linux electrolysis build on 2009/08/27 16:25:55
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251415555.1251418442.14995.gz

Linux electrolysis unit test on 2009/08/26 18:50:56
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251337856.1251340788.10042.gz

Linux electrolysis unit test on 2009/08/26 09:40:47
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1251304847.1251308856.7955.gz

That's less than 30%; more like 10% or so, but still enough to make life painful.
Comment 32•15 years ago
(In reply to comment #31)
> I don't know if anyone is doing tryserver/electrolysis: this is with standard
> build-on-checkin builds: some samples from recently:

OK, I've tweaked the summary to match.

> That's less than 30%; more like 10% or so, but still enough to make life
> painful.

Phew. 10% is still not good, and still needs resolution, but it's much better than the 30%-50% reported in the meeting!
Summary: Linux try builds repeatedly failing, Memory exhausted → Linux builds failing with "Memory exhausted"
Comment 33•15 years ago
Previous fixes for these memory issues involved reboots of the affected build slaves. Marking dependent on bug 393411.
Depends on: 393411
Updated•15 years ago
Assignee: nobody → aki
Comment 34•15 years ago
We should now be rebooting build slaves after every 5 builds, so if a reboot fixes that problem, the 10% build failures mentioned in comment 32 should drop by at least 20% (to 8% =\ ) and maybe as much as 100%. We are planning on moving this to a reboot per build. At that point, if a reboot fixes this issue on electrolysis, the problem should disappear. If the electrolysis issue manifests whether a slave has rebooted recently or not, then we will continue seeing this.
Comment 35•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1253014168.1253016910.5520.gz&fulltext=1 So rebooting every 5 builds hasn't fixed it.
Comment 36•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?tree=Electrolysis&errorparser=unix&logfile=1253271156.1253274116.10511.gz&buildtime=1253271156&buildname=Linux%20electrolysis%20build&fulltext= Linux electrolysis build on 2009/09/18 03:52:36
Comment 37•15 years ago
Had another one today on e10s around 2am.
Comment 38•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1254860259.1254864179.15994.gz
Comment 39•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&buildname=Linux%20mozilla-central%20build&buildtime=1255084445&logfile=1255084445.1255087452.20638.gz&errorparser=unix
Comment 40•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?tree=Electrolysis&errorparser=unix&logfile=1255690920.1255694458.26973.gz&buildtime=1255690920&buildname=Linux%20electrolysis%20nightly&fulltext=
Comment 41•15 years ago
It sucks to see a bunch of these even though we're rebooting after every build these days.
Comment 42•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1255971287.1255974808.25626.gz Linux mozilla-central unit test on 2009/10/19 09:54:47
Comment 43•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1256055153.1256058494.16175.gz
Comment 44•15 years ago
I didn't really have a solution beyond the reboots, which didn't fix the problem. I'm open to suggestions. One possibility is adding more memory to every slave (how much?); another is tracking down what's eating all the memory and fixing or working around that.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256300917.1256304781.1566.gz
Linux mozilla-central build on 2009/10/23 05:28:37

/usr/bin/ld: final link failed: Memory exhausted
collect2: ld returned 1 exit status
make[4]: *** [libxul.so] Error 1
Comment 46•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256423835.1256427671.24505.gz
Comment 47•15 years ago
And again http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256605329.1256609375.22067.gz That and comments 45 & 46 are all on moz2-linux-slave15, which is frequently failing to compile m-c opt and debug builds.
Comment 48•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1256605329.1256609375.22067.gz Linux mozilla-central leak test build on 2009/10/26 18:02:09 moz2-linux-slave15
Comment 49•15 years ago
Good catch! I've pulled moz2-linux-slave15 out of the pool, and will file a bug to nuke/reclone it.
Comment 50•15 years ago
Recloning/re-setting-up moz2-linux-slave15 in bug 524717 + bug 524719.
Comment 51•15 years ago
Any recent occurrences of this?
Yes, it's still a problem for e10s. libxul OOM happened 4 times in the last ~24 hours there. I'd estimate it happens on maybe 1/4-1/5 of linux builds, though that may be confirmation bias.
Comment 53•15 years ago
Throwing this back in the pool, as it looks like I won't have sufficient time and someone else may be able to work on it. My current ideas: 1) set puppet to only run on boot (or to check whether a build is running); 2) add more memory to all the slaves.
Assignee: aki → nobody
Comment 54•15 years ago
Linux electrolysis build on 2009/11/04 08:54:22 http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1257353662.1257357221.4696.gz
Comment 55•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257536089.1257539648.17100.gz Linux mozilla-central leak test build on 2009/11/06 11:34:49
Comment 56•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257554181.1257555690.8280.gz Linux mozilla-central leak test build on 2009/11/06 16:36:21
Comment 57•15 years ago
Monday, November 9th @ 7:15am, http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257777924.1257779505.13872.gz
Comment 58•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257830384.1257832632.10170.gz#err0 Linux mozilla-central leak test build on 2009/11/09 21:19:44
Reporter
Comment 59•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1257848132.1257848940.1059.gz Linux mozilla-central leak test build (again)
Comment 60•15 years ago
Comment 55, comment 56, comment 57, comment 58 (and the one right after it I didn't bother pasting) and comment 59 are all moz2-linux-slave22, which seems like a significant pattern.
Comment 61•15 years ago
(In reply to comment #60)
> Comment 55, comment 56, comment 57, comment 58 (and the one right after it I
> didn't bother pasting) and comment 59 are all moz2-linux-slave22, which seems
> like a significant pattern.

Yes, I'd just done the analysis here and filed bug 527688 to cover getting Linux slaves 22 and 24 re-cloned. Can I ask reporters to take the extra time to grep the slave name out of the tinderbox log when reporting recurrences? It will be listed in the "environment:" section in the footer of the brief log. Having the slave name reported will make establishing patterns easier.
Assignee: nobody → ccooper
Status: REOPENED → ASSIGNED
Priority: -- → P2
Comment 62•15 years ago
(In reply to comment #61)
> Can I ask reporters to take the extra time to grep the slave name out of the
> tinderbox log when reporting recurrences?

This would be much easier if the slave name were just included in the tinderbox log header string, e.g.:
Linux mozilla-central leak test build / moz2-linux-slave22 on 2009/11/10 02:15:32
Comment 63•15 years ago
This will add the hostname to the tinderbox summary, in green, now that bug 454055 has been pushed to tinderbox.m.o in bug 526287.
Attachment #411466 -
Flags: review?(ccooper)
Comment 64•15 years ago
Comment on attachment 411466 [details] [diff] [review]
Add the hostname to tinderbox summary

l10n already does this, FWIW. Can we switch the syntax to be the same, i.e. "s: slavename", for brevity's sake? r+ with that fix.
Attachment #411466 -
Flags: review?(ccooper) → review+
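For illustration, the agreed "s: slavename" syntax would render roughly like this. The helper function and the exact field placement are hypothetical; the real change lives in the buildbotcustom patch attached to this bug:

```python
def brief_log_summary(builder, timestamp, slavename):
    # Hypothetical formatter: tack the slave name onto the tinderbox
    # brief-log summary using the "s: <slavename>" syntax agreed above.
    return "%s on %s s: %s" % (builder, timestamp, slavename)

print(brief_log_summary("Linux mozilla-central leak test build",
                        "2009/11/10 02:15:32", "moz2-linux-slave22"))
```

With the slave name in every summary, spotting a single bad slave (like moz2-linux-slave22 in comment 60) no longer requires opening each full log.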
Comment 65•15 years ago
Comment on attachment 411466 [details] [diff] [review]
Add the hostname to tinderbox summary

http://hg.mozilla.org/build/buildbotcustom/rev/4c607d12e811
Slave names should be visible in the brief log now.
Attachment #411466 -
Flags: checked-in+
Comment 66•15 years ago
Putting back into the pool for buildduty monitoring.
Assignee: ccooper → nobody
Status: ASSIGNED → NEW
Whiteboard: [buildduty]
Comment 67•15 years ago
Maybe we should try adding --no-keep-memory to our linker flags (as per bug 494068)?
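A sketch of what that suggestion might look like in a mozconfig. The flag spelling comes from GNU ld; whether LDFLAGS is the right hook for this era's build system is an assumption:

```shell
# mozconfig fragment (illustrative): ask GNU ld to discard its caches of
# input-file data during the link, trading link speed for a lower peak
# memory footprint on the big libxul.so link (see ld's --no-keep-memory).
export LDFLAGS="-Wl,--no-keep-memory"
```

This targets the failure mode in comment 45 directly (ld itself running out of memory), independent of whatever is leaking on the slaves between builds.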
Comment 68•15 years ago
I'm not so naive as to believe this is truly fixed. Please re-open to add new examples.
Status: NEW → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering