We tried to get it to run to completion (at least sometimes) by disabling the frequently failing browser_thumbnails_background.js in https://hg.mozilla.org/mozilla-central/rev/e7439c6c1e81, but that didn't particularly green it up. Based on what's failing (primarily a whole lot of toolkit/mozapps/extensions/ stuff) and the way those tests have failed before, I vaguely suspect OOM. One possible savior is bug 819963 - we're very close to the point where Mac debug browser-chrome starts turning red by taking over 9000 seconds (over 9000!) and being killed by buildbot, which we'll probably fix by ramming that bug through rather than fixing the runtime. Meantime, ASan b-c has gotten so horrible (the last time it was actually green on inbound was 9 hours ago) that I'm hiding it. We'll stop expecting it to run, only I'll see it and star the failures against this bug, and we'll stop bombarding people - mostly add-ons manager people - with failure bugspam that they can only fix by fixing the fact that we were already teetering on the OOM cliff before their tests even started running.
AN CLUE: while I was typing up the bureaucratic bug in the Tinderboxpushlog component about "Unhide when it gets better," I was about to type the "hidden on..." line including mozilla-aurora, since we merged to there from a horribly broken tree this morning, when I realized that I haven't had to star any ASan b-c failures there since the merge. Perfectly green, four runs in a row, which is three more in a row than we've managed in days, if not weeks, on trunk. So: does --disable-profiling, or one of the other changes between trunk and aurora that seem unlikely to have this effect, make this problem disappear?
To bolster my OOM claim, https://tbpl.mozilla.org/php/getParsedLog.php?id=29817157&full=1&branch=mozilla-inbound#error0 is ASan saying that it's OOM.
Maybe it's not profiling, since https://tbpl.mozilla.org/?showall=1&tree=Profiling builds with --disable-profiling, and while it's a bit better than inbound/central/fx-team, it's not quite perfect, nor as good as aurora.
Do we have a regression range for this? When we started with ASan b-c, it was perfectly green all the time. If it's been troublesome for weeks now, why didn't we investigate earlier? Is there a particular test going OOM, or is it random? I'd guess it's either a test or a build option that makes the difference between aurora and central, but a regression range would really help.
It seems to have deteriorated slowly. One of the problems is that ASan browser-chrome is (obviously) only one of many browser-chrome jobs run per push, so even a drastic deterioration in its failure rate gets averaged out by all the rest when you're jumping from tree to tree, starring in only-unstarred mode.
(The sheriffs are more than eager to back things out when they're spotted increasing a failure rate, and do so quite regularly, though on occasion the rate increased so gradually that this wasn't possible.)
Assuming that memory pressure increased in general, here's a try run with a smaller quarantine size for the 2-4 GB memory builders: https://tbpl.mozilla.org/?tree=Try&rev=758951544a82 I'm still curious, though, what makes the difference between aurora and inbound in this case.
Assigning to myself. Philor: I was trying to figure out the influence of profiling, but as far as I can see, ASan doesn't build with --enable-profiling (the regular nightly config has it; the nightly-asan config does not). Did we make this (or any other) config change somewhere else, outside of the mozconfigs? That said, mochitest-bc was always close to the memory limits; it could be that some changes just increased memory usage a little, causing random failures. The try push I made should help in that case.
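For reference, here's a hypothetical side-by-side of the two mozconfigs being compared - the file paths and the exact option lists are assumptions from memory, not copied from the tree, but they illustrate why ASan builds were already effectively running without profiling:

```shell
# Hypothetical contents, for illustration only.
#
# browser/config/mozconfigs/linux64/nightly (regular nightly):
ac_add_options --enable-profiling

# browser/config/mozconfigs/linux64/nightly-asan (ASan build):
ac_add_options --enable-address-sanitizer
ac_add_options --disable-jemalloc
# note: no --enable-profiling line here
```

If the difference between trunk and aurora were profiling-related, the ASan config would have had to pick the option up from somewhere outside these files.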
Assignee: nobody → choller
If you want a clearer example of why it's hard to spot things like this at the moment they happen, take a look at bug 920976. Pick the instance that (in hindsight) seems to be the start, either the first one yesterday or the double-starred one on the 21st, open the log, click the "push abc123" link at the top to get to tbpl for that push, then click the down-arrow. Was it actually because of sunfish's self-reviewed followup to bug 925729? Was it something in a merge from fx-team that only became broken when it met something else on inbound? Was it one of the pushes that hardly ran any tests at all?
Oops, bug 920978, which is a rather neat example, not 976 whatever it is.
Created attachment 824129 [details] [diff] [review] asan-bc-oom.patch Decrease the quarantine_size for mochitests running under ASan by 20% to solve the memory issues on mochitest-bc.
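For illustration, here's a minimal sketch of the kind of change involved, assuming ASan's byte-valued quarantine_size option and its usual 256 MB default; the variable names and the exact mechanism are hypothetical and not taken from the actual patch:

```shell
# Hypothetical sketch: shrink ASan's free-memory quarantine to cut peak
# memory during mochitest-bc. quarantine_size is in bytes; 256 MB is the
# commonly cited default. A 20% reduction leaves ~80% of that.
DEFAULT_QUARANTINE=$((1 << 28))                      # 268435456 bytes (256 MB)
REDUCED_QUARANTINE=$((DEFAULT_QUARANTINE * 4 / 5))   # 20% smaller

# The test harness would export this before launching the browser.
export ASAN_OPTIONS="quarantine_size=${REDUCED_QUARANTINE}"
echo "$ASAN_OPTIONS"
```

The trade-off is that a smaller quarantine recycles freed memory sooner, which slightly reduces ASan's ability to catch long-delayed use-after-free bugs in exchange for a lower memory ceiling.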
Attachment #824129 - Flags: review?(ted)
Comment on attachment 824129 [details] [diff] [review] asan-bc-oom.patch rs=ted on IRC.
Attachment #824129 - Flags: review?(ted) → review+
That's done a great job of exposing https://tbpl.mozilla.org/php/getParsedLog.php?id=29854349&tree=Mozilla-Inbound, which must have landed sometime between the parent of your try push and yesterday afternoon on inbound, where it's visible if you know to look past all the other failures.
Whiteboard: [leave open]
Should we try to back this out now?
Fwiw, I've checked the last pushes to mozilla-inbound and mochitest-bc was always green. Maybe the failure in comment 11 is a rare/unrelated intermittent? I suggest we unhide ASan mochitest-bc again, given that it's green now. The only thing we should maybe check before doing so is whether the OOM patch I made is still necessary, as Ryan pointed out. I'm currently doing a try push with a backout of my OOM patch to check that.
It's not at all rare; it's just not quite permaorange. But apparently people think that since ASan b-c is hidden for one bustage followed by another bustage, it's fine to break it again, so I'm unhiding it to stop the pile-on.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WORKSFORME