Bug 823989 - Permaorange browser-chrome on Aurora Linux64 and Win7 nightly builds since merge of Firefox 20 to Aurora

Status: RESOLVED FIXED
Keywords: crash, intermittent-failure
Product: Testing
Classification: Components
Component: BrowserTest
Version: 20 Branch
Hardware: All / OS: All
Importance: -- critical (1 vote)
Assigned To: Nobody; OK to take it and work on it
Duplicates: 825246
Depends on: 831854, 832702, 841029
Blocks: 784681

Reported: 2012-12-21 09:50 PST by Richard Newman [:rnewman]
Modified: 2013-04-25 02:11 PDT
CC: 16 users

Attachments
Work-around (1.03 KB, patch) - 2013-01-17 10:54 PST, :Ehsan Akhgari - mak77: review+
Extreme measures (680 bytes, patch) - 2013-01-17 19:47 PST, Phil Ringnalda (:philor, back in August) - no flags

Description Richard Newman [:rnewman] 2012-12-21 09:50:43 PST
+++ This bug was initially created as a clone of Bug #818990 +++

*** Start BrowserChrome Test Results ***
TEST-INFO | checking window state
TEST-INFO | unknown test url | must wait for focus
TEST-INFO | (browser-test.js) | Console message: PAC file installed from data:text/plain,function%20FindProxyForURL(url,%20host){%20%20var%20origins%20=%20['http://127.0.0.1:80',%20'…
INFO | automation.py | Application ran for: 0:05:36.670928
INFO | automation.py | Reading PID log: /tmp/tmpQDy7Vppidlog
Downloading symbols from: http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux64/1356092418/firefox-19.0a2.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Crash dump filename: /tmp/tmpXMOrul/minidumps/451a7cee-12ba-4cd4-039396d9-6fa8d400.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGABRT
Crash address: 0x1f4000008a1

Thread 0 (crashed)
 0  libc-2.11.so + 0xd4aa3
    rbx = 0x00007f3b2a4d4d00   r12 = 0x00000000ffffffff
    r13 = 0x00000034d4ce5160   r14 = 0x0000000000000008
    r15 = 0x00007f3b456596d8   rip = 0x00000034d2ed4aa3
    rsp = 0x00007fff2068fa40   rbp = 0x0000000000000008
    Found by: given as instruction pointer in context
 1  libxul.so!PollWrapper [nsAppShell.cpp:4287525881ec : 35 + 0xd]
    rip = 0x00007f3b42b8409e   rsp = 0x00007fff2068fa70
    Found by: stack scanning
 2  libglib-2.0.so.0.2200.2 + 0x3c9fb
    rbx = 0x00007f3b456596d0   r12 = 0x00007f3b42b84070
    rip = 0x00000034d4a3c9fc   rsp = 0x00007fff2068fa90
    rbp = 0x00007f3b2a4d4d00
    Found by: call frame info
 3  libglib-2.0.so.0.2200.2 + 0x2e4ac7
    rip = 0x00000034d4ce4ac8   rsp = 0x00007fff2068fa98
    Found by: stack scanning
 4  libglib-2.0.so.0.2200.2 + 0x2e4aff
    rip = 0x00000034d4ce4b00   rsp = 0x00007fff2068faa0
    Found by: stack scanning
 5  libpthread-2.11.so + 0x8daf
    rip = 0x00000034d3608db0   rsp = 0x00007fff2068faf8
    Found by: stack scanning
 6  libglib-2.0.so.0.2200.2 + 0x3cd39
    rip = 0x00000034d4a3cd3a   rsp = 0x00007fff2068fb10
    Found by: stack scanning
 7  libxul.so!nsAppShell::ProcessNextNativeEvent(bool) [nsAppShell.cpp:4287525881ec : 135 + 0xa]
    rip = 0x00007f3b42b8405f   rsp = 0x00007fff2068fb40
    Found by: stack scanning
 8  libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(bool, unsigned int) [nsBaseAppShell.cpp:4287525881ec : 139 + 0x5]
    rip = 0x00007f3b42b89ec9   rsp = 0x00007fff2068fb50
    Found by: call frame info
 9  libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool, unsigned int) [nsBaseAppShell.cpp:4287525881ec : 298 + 0x4]
    rbx = 0x00007f3b37b53080   r12 = 0x00000000002aa076
    rip = 0x00007f3b42b8a081   rsp = 0x00007fff2068fb80
    rbp = 0x00007f3b45625d40
    Found by: call frame info
10  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:4287525881ec : 600 + 0x7]
    rbx = 0x00007f3b45625d40   r12 = 0x0000000000000001
Comment 1 Ed Morley [:emorley] 2012-12-21 09:56:05 PST
Don't suppose you have the log/tbpl url? :-)
Comment 2 Richard Newman [:rnewman] 2012-12-21 10:34:07 PST
Sorry, haven't had my coffee yet!

https://tbpl.mozilla.org/php/getParsedLog.php?id=18162696&tree=Mozilla-Aurora
Comment 3 Ed Morley [:emorley] 2012-12-21 10:36:13 PST
Thank you :-)
Comment 4 Treeherder Robot 2012-12-21 18:11:27 PST
heycam
https://tbpl.mozilla.org/php/getParsedLog.php?id=18185213&tree=Mozilla-Inbound
Rev3 Fedora 12x64 mozilla-inbound opt test mochitest-1 on 2012-12-21 17:33:55
slave: talos-r3-fed64-040

TEST-UNEXPECTED-FAIL | /tests/content/events/test/test_bug508479.html | application timed out after 330 seconds with no output
PROCESS-CRASH | /tests/content/events/test/test_bug508479.html | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 5 Treeherder Robot 2012-12-22 20:10:03 PST
mbrubeck
https://tbpl.mozilla.org/php/getParsedLog.php?id=18202686&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-22 06:10:41
slave: talos-r3-fed64-041

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 6 Treeherder Robot 2012-12-23 10:29:21 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18221672&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-23 07:03:04
slave: talos-r3-fed64-030

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 7 Treeherder Robot 2012-12-24 08:05:18 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18240125&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-24 06:14:29
slave: talos-r3-fed64-046

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 8 Treeherder Robot 2012-12-26 05:37:11 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18256038&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-25 06:13:20
slave: talos-r3-fed64-046

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 9 Treeherder Robot 2012-12-26 06:26:37 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18269330&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-26 06:09:37
slave: talos-r3-fed64-066

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 10 Jeff Hammel 2012-12-26 10:51:49 PST
As best I can tell, this isn't a talos bug?
Comment 11 Treeherder Robot 2012-12-27 14:43:03 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18293088&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-27 06:11:29
slave: talos-r3-fed64-019

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 12 Treeherder Robot 2012-12-28 09:15:24 PST
jdm
https://tbpl.mozilla.org/php/getParsedLog.php?id=18318609&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-28 06:50:30
slave: talos-r3-fed64-068

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 13 Treeherder Robot 2012-12-29 07:05:54 PST
RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=18336711&tree=Mozilla-Aurora
Rev3 Fedora 12x64 mozilla-aurora pgo test mochitest-browser-chrome on 2012-12-29 06:21:16
slave: talos-r3-fed64-025

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
PROCESS-CRASH | automation.py | application crashed [@ libc-2.11.so + 0xd4aa3]
Thread 0 (crashed)
Comment 14 Phil Ringnalda (:philor, back in August) 2012-12-29 13:36:50 PST
Charmingly enough, it's not just Linux64 and it's not particularly intermittent either.

https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=2f801d18884d was the first rev to build nightlies after 19 merged to Aurora, and it hit this. Since then, there have been only 3 or 4 Linux browser-chrome runs that should have been running against the nightly build which have not hit this (which may mean it's nearly permaorange rather than actually permaorange, or may mean that tests against nightlies don't always actually run on the nightly, I didn't investigate them).

The "crash" signature is different between Linux64 and Linux32, but I don't think that's significant, it's just where they happen to be sitting idling when the timeout kills them. The visible and possibly significant differences between working runs on dep jobs and failing runs on nightlies seem to be that testpilot is enabled and installed, and that the nightlies have that line as shown abbreviated in comment 0 about "TEST-INFO | (browser-test.js) | Console message: PAC file installed from data:text/plain,function%20FindProxyForURL".
Comment 15 Phil Ringnalda (:philor, back in August) 2012-12-29 13:37:37 PST
*** Bug 825246 has been marked as a duplicate of this bug. ***
Comment 16 Phil Ringnalda (:philor, back in August) 2012-12-30 08:09:23 PST
Repros on try (https://tbpl.mozilla.org/?tree=Try&rev=5ab7bc28b255) with "export MOZ_UPDATE_CHANNEL=aurora" so that testpilot winds up installed (and failed to repro when I took a different and failed approach to getting testpilot installed by just hacking at extension/Makefile.in). Not sure if there are other byproducts of setting MOZ_UPDATE_CHANNEL, though.

https://tbpl.mozilla.org/php/getParsedLog.php?id=18352661&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=18352610&tree=Mozilla-Aurora
Comment 21 Phil Ringnalda (:philor, back in August) 2013-01-04 09:12:42 PST
Not quite perma, since I only got the one https://tbpl.mozilla.org/php/getParsedLog.php?id=18463151&tree=Mozilla-Aurora out of https://tbpl.mozilla.org/?tree=Mozilla-Aurora&onlyunstarred=1&rev=32dba69af0fa and the linux32 one does appear to have downloaded the nightly.
Comment 24 Phil Ringnalda (:philor, back in August) 2013-01-08 15:28:14 PST
And not 19, since Aurora 20 is affected, so I'll bet what I really meant was "any build which includes testpilot, but the only ones of those where I see the tests are Aurora nightlies." 

https://tbpl.mozilla.org/php/getParsedLog.php?id=18605589&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=18603373&tree=Mozilla-Aurora
Comment 26 Ed Morley [:emorley] 2013-01-16 09:02:57 PST
akeybl, this is permaorange on Aurora nightlies. Aurora has now been closed since no one has been forthcoming in fixing it. Could you find someone with some cycles that could take a look?
Comment 27 Phil Ringnalda (:philor, back in August) 2013-01-16 09:27:12 PST
Along with the Linux permaorange that came in on the 19 merge, Mac and Windows b-c became permaorange on the 20 merge - Mac with "leaking until shutdown" like https://tbpl.mozilla.org/php/getParsedLog.php?id=18817908&tree=Mozilla-Aurora and Windows with those plus failures a la https://tbpl.mozilla.org/php/getParsedLog.php?id=18822117&tree=Mozilla-Aurora which remind me a great deal of the Aurora bustage that turned out to be OOM from too many threads from bug 802239. 

Does testpilot spin up a huge number of threads?

Do we really want to keep this situation where we run tests with testpilot included, but only on Aurora nightlies and not anywhere else?
Comment 28 Alex Keybl [:akeybl] 2013-01-16 14:16:24 PST
I'm going to reach out to fx-team, given the reproducible steps in comment 16.
Comment 29 :Ehsan Akhgari 2013-01-17 08:18:10 PST
This is easily reproducible locally as well.
Comment 30 :Ehsan Akhgari 2013-01-17 10:42:18 PST
So here's what happens.  During testpilot init, we get to this code: <http://mxr.mozilla.org/mozilla-central/source/browser/app/profile/extensions/testpilot@labs.mozilla.com/modules/interface.js#87>.  As part of BrowserToolboxCustomizeDone, the content area gets focused: <http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#3681>.

Later on, when we want to start running the tests, we get to this point: <http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#281>.  waitForWindowsState calls waitForFocus here: <http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#154> and we attempt to wait for focus on the window, but the focus event is never dispatched since the element to be focused is already focused, but the focus manager's activeWindow property returns null, so we can't detect that case.

nsFocusManager::WindowRaised seems to be responsible for updating mActiveWindow, and when I turn on focus manager logging, WindowRaised is called way after this stuff.
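The race Ehsan describes can be sketched in a few lines. This is a hypothetical, simplified model, not the real browser-test.js or nsFocusManager code; all names here are stand-ins:

```javascript
// Simplified model of the hang: waitForFocus fires its callback immediately
// if the focus manager says the window is already active, and otherwise
// waits for a "focus" event.
function waitForFocus(win, focusManager, onFocused) {
  // Intended fast path: window already focused, run the callback right away.
  if (focusManager.activeWindow === win) {
    onFocused();
    return;
  }
  // Slow path: wait for a focus event. If the window was focused *before*
  // this listener was attached, the event never fires and we hang here.
  win.addEventListener("focus", onFocused);
}

// Failing scenario from the comment above: the content area is already
// focused (testpilot's toolbar customization focused it), but
// nsFocusManager::WindowRaised has not run yet, so activeWindow is null.
const win = {
  listeners: [],
  addEventListener(type, cb) { this.listeners.push(cb); },
};
const staleFocusManager = { activeWindow: null };

let testsStarted = false;
waitForFocus(win, staleFocusManager, () => { testsStarted = true; });
console.log(testsStarted ? "tests started" : "hang: no focus event will ever fire");
```

Neither path can succeed: the fast path misses because activeWindow is stale, and the slow path waits forever, which matches the 330-second no-output timeouts in the logs.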
Comment 31 :Ehsan Akhgari 2013-01-17 10:54:24 PST
Created attachment 703412 [details] [diff] [review]
Work-around

This patch works around the problem by preventing the testpilot extension from trying to customize the toolbar and hence screwing with focus.
Comment 32 :Ehsan Akhgari 2013-01-17 10:58:42 PST
Filed bug 831854 as a follow-up to fix this issue for real.  I think I'll go ahead and push the patch pending post-landing review.  I've already wasted enough time on this.
Comment 34 Ed Morley [:emorley] 2013-01-17 15:16:00 PST
I've filed bug 832050 for making sure that Nightly-only test breakage is more obvious (they are currently indistinguishable from pgo test results, meaning a later green pgo result implies it was only an intermittent).
Comment 35 Phil Ringnalda (:philor, back in August) 2013-01-17 18:49:00 PST
After Ehsan's fix and finally getting runs on every platform, our status now is that Linux32, Mac, and WinXP only have "leaked until shutdown" errors from devtools tests, but Win7 and Linux64 have those plus things just like the symptoms of bug 798849 (timeouts in devtools tests, yeah, but get them out of the way and you have to deal with pdfjs timeouts, get them out of the way and you have to deal with addonmgr timeouts and browser_bug666317.js and a host of others) that bug 802239 fixed. Whether testpilot uses (or leaks) a ton of memory, or we're right on the threshold anyway and it pushes us over, or it's something else, we *look* exactly like we do when we're OOM.
Comment 36 Phil Ringnalda (:philor, back in August) 2013-01-17 19:47:23 PST
Created attachment 703732 [details] [diff] [review]
Extreme measures

Couple of choices:

You can pass this bug around through your top generalists, khuey and bz and roc and dbaron and bsmedberg and billm and karlt and I'll think of the next set of people who aren't afraid to look at something that could be coming from any part of the codebase when I need to, until you hit on one who wants to land something on aurora badly enough to borrow a slave (since they're unlikely to have a sufficiently hobbled machine to let them repro OOM) and figure out what's going wrong remotely.

Or you can just land this patch, stop building testpilot on the only tree where we actually look at the results of testing with it built, and reopen aurora in a few hours.

Personally, I can't quite decide which choice I'd take, if I were in the unfortunate position to choose.
Comment 37 Phil Ringnalda (:philor, back in August) 2013-01-17 21:41:37 PST
I installed testpilot, and when I went looking for the active tests that would explain why we are shipping it, it looks like there's one (either active or forgotten) for Thunderbird, and the last active Firefox tests were in the spring of 2011.

I take it back, I *can* decide whether I'd take a weeks-long closure of aurora while burning the time of some of our most expensive developers or stop shipping an addon that hasn't done anything for a year and a half.
Comment 38 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 02:11:32 PST
So do we have a good regression window for this?  It looks like it started before the last merge (which was January 7?)... which makes me puzzled as to why it's not happening on beta now too.
Comment 39 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 02:21:47 PST
I've been scrolling way down on https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=Rev3%20Fedora%2012x64%20mozilla-aurora%20pgo%20test%20mochitest-browser-chrome (though I suppose I could have pulled the nightly changeset hashes off FTP); hopefully I'll have an answer at some point.
Comment 40 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 02:31:20 PST
Actually, I'm guessing it's not showing up on Beta because we don't do nightlies on Beta (or at least there aren't any on tbpl).

And as I scrolled further down, I realized it probably was the previous merge (when 19 merged to aurora), so I pulled:
https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-19-04-20-13-mozilla-aurora/firefox-18.0a2.en-US.linux-x86_64.txt
https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-20-04-20-14-mozilla-aurora/firefox-19.0a2.en-US.linux-x86_64.txt

which led to:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=edc2aedfaed5
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=5f19747d3410

But I guess philor found that already in comment 14; I should have read more closely.  Why don't I put it in the summary where it belongs so others don't do the same, at least.
Comment 41 Marco Bonardo [::mak] (Away 6-20 Aug) 2013-01-18 02:48:05 PST
Comment on attachment 703412 [details] [diff] [review]
Work-around

Review of attachment 703412 [details] [diff] [review]:
-----------------------------------------------------------------

nit: would have been better to keep the testpilot prefs near each other (there was already one some rows below)
Comment 42 :Ehsan Akhgari 2013-01-18 07:28:11 PST
(In reply to Marco Bonardo [:mak] from comment #41)
> Comment on attachment 703412 [details] [diff] [review]
> Work-around
> 
> Review of attachment 703412 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> nit: would have been better to keep the testpilot prefs near each other
> (there was already one some rows below)

(Landed on trunk as https://hg.mozilla.org/integration/mozilla-inbound/rev/7d5fdfc2b165, totally not worth backporting to Aurora)
Comment 43 :Ehsan Akhgari 2013-01-18 08:20:41 PST
Comment on attachment 703732 [details] [diff] [review]
Extreme measures

FWIW I'd take this if it gives us all green nightly builds.  AFAIK we're not actually running any user studies through this extension.
Comment 44 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 08:38:54 PST
So I'd missed comment 27, though I think the new failures likely belong in another bug.

However, to test philor's theory in comment 36 that all of the failures are due to testpilot, which after discussion appears not to have been confirmed, I did two try runs off of aurora, one with testpilot:
https://tbpl.mozilla.org/?tree=Try&rev=09687ee6aec9
and one without:
https://tbpl.mozilla.org/?tree=Try&rev=54bfcb35a934
(at least assuming I did it correctly).
Comment 45 Phil Ringnalda (:philor, back in August) 2013-01-18 10:44:18 PST
(In reply to David Baron [:dbaron] from comment #44)
> I did two try runs off of aurora, one with testpilot:
> https://tbpl.mozilla.org/?tree=Try&rev=09687ee6aec9

That didn't actually get testpilot for you, because you have to export the env var before http://mxr.mozilla.org/mozilla-aurora/source/browser/config/mozconfigs/linux32/nightly#1 (or redo that line after you export it in the override).
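The ordering pitfall philor points out can be illustrated like this (a hypothetical shell sketch; everything except MOZ_UPDATE_CHANNEL is a stand-in, not the real mozconfig):

```shell
# Simulate a mozconfig that captures MOZ_UPDATE_CHANNEL when it is sourced.
unset MOZ_UPDATE_CHANNEL                             # clean slate for the demo

mozconfig_channel="${MOZ_UPDATE_CHANNEL:-default}"   # value baked in at source time

export MOZ_UPDATE_CHANNEL=aurora                     # exported too late: already captured

echo "channel the build saw: ${mozconfig_channel}"   # "default", not "aurora"
```

This is why the try push in comment 44 didn't actually build testpilot: the export has to happen before the mozconfig line that reads the variable, or that line has to be re-evaluated afterwards.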
Comment 46 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 11:24:58 PST
Indeed.  New pair, overriding the configure option directly:

As things are now on aurora, with aurora update channel:
https://tbpl.mozilla.org/?tree=Try&rev=3186beecad30

Plus removal of testpilot:
https://tbpl.mozilla.org/?tree=Try&rev=0d401b02bc4a
Comment 47 :Ehsan Akhgari 2013-01-18 12:28:59 PST
(In reply to comment #46)
> Indeed.  New pair, overriding the configure option directly:
> 
> As things are now on aurora, with aurora update channel:
> https://tbpl.mozilla.org/?tree=Try&rev=3186beecad30
> 
> Plus removal of testpilot:
> https://tbpl.mozilla.org/?tree=Try&rev=0d401b02bc4a

Seems like the second push is burning at least on Linux and Mac.
Comment 48 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-18 14:32:56 PST
New second try push:
https://tbpl.mozilla.org/?tree=Try&rev=bd8b84c0b9b1
in which:
https://hg.mozilla.org/users/dbaron_mozilla.com/patches-aurora/raw-file/2a9366c139d9/no-aurora-testpilot
replaces attachment 703732 [details] [diff] [review] from comment 36.
Comment 49 Phil Ringnalda (:philor, back in August) 2013-01-18 20:32:52 PST
Current status:

Aurora is closed because Linux64 and Win7 browser-chrome against nightlies fail in a way which stops the test suite from finishing, making it impossible to tell whether any new failures have been added.

On those two platforms we get a complex of test failures which looks exactly like the bug 798849 OOM failures that we hit in both June/July and October 2012. We have no idea what fixed them in June/July; in October it turned out that we were winding up with ~300 storage threads.
Comment 50 Phil Ringnalda (:philor, back in August) 2013-01-18 20:39:32 PST
Comment on attachment 703732 [details] [diff] [review]
Extreme measures

I don't know why I'm surprised that this awful method of enabling or disabling building an extension based on overloading an env var instead of using configure leads to confusion and bustage.
Comment 51 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-19 00:21:11 PST
So this pair of try runs:

> As things are now on aurora, with aurora update channel:
> https://tbpl.mozilla.org/?tree=Try&rev=3186beecad30
> 
> Plus removal of testpilot:
> https://tbpl.mozilla.org/?tree=Try&rev=bd8b84c0b9b1

seems to show that disabling testpilot fixes all of the browser-chrome failures.

(Ignore the android builds; the mechanism I used to override the update channel setting and simulate a nightly on try didn't apply to them anyway due to a build system bug that I'll prepare an m-c patch for shortly; I'm not sure why they're orange, though.)
Comment 52 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-19 01:11:15 PST
So let me try to summarize the current state of what's going on here:

Our continuous integration testing on TBPL generates builds for pushes, and runs tests on them, occasionally coalescing them.  This happens on all of our active development branches.  On mozilla-central and mozilla-aurora (but not mozilla-beta or mozilla-release, I think), the nightly builds we generate also show up on TBPL, and have unit tests run on them.

This bug covers a set of permanent test failures (perma-oranges) that occur *only* on the unit tests of nightly builds (which differ in some ways from the other builds, most notably by setting the update channel) and not on the unit tests of the push-generated builds.  Furthermore, these test failures are happening only on Aurora, and the Aurora tree is currently closed for those failures.

Disabling the testpilot extension fixes *all* of these failures (see comment 51); the patch to disable it is the patch linked in comment 48.   Since building testpilot is conditional on the update channel being aurora or beta, the only place we run unit tests on builds with testpilot is the unit tests we run of nightly builds on mozilla-aurora.

These failures (again, all fixed by disabling testpilot) were introduced at separate points:

 (a) when Firefox 19 merged to aurora, we introduced a focus-related perma-orange on the browser-chrome tests on Linux.  This permaorange was worked around yesterday by https://hg.mozilla.org/releases/mozilla-aurora/rev/12f52471747d and bug 831854 covers fixing it better.

 (b) when Firefox 20 merged to aurora, additional browser-chrome failures were introduced.  These failures were similar to failures previously observed twice before (see comment 49)

 (c) There was also a set of leaks from devtools tests, investigated in bug 824016 rather than this bug, which I believe (but am not sure) were also introduced when Firefox 20 merged to aurora.  These tests have been disabled in https://hg.mozilla.org/releases/mozilla-aurora/rev/a8d6394508a3 after a set of attempts to fix them failed.  Since that fix was not included in the with-and-without testpilot comparative try runs in comment 51 (though the previous attempts to fix those failures were), these devtools leaks also appear related to testpilot.



I am aware of three options going forward:

 (1) Decide that our push-based testing is sufficient test coverage and that we're ok reopening the aurora tree with permanent test failures in the tests of *nightly* builds, and reopen mozilla-aurora.  (jlebar and I were advocating this in the thread on dev-platform; ehsan was against, as I think were some others; this was before option (2) was confirmed to be an available solution.)

 (2) Disable the testpilot extension on aurora using the patch in comment 48, and reopen mozilla-aurora.  comment 43 says that we're not currently running any studies using testpilot (and also that ehsan supports this solution).

 (3) Continue to hold mozilla-aurora closed for further investigation of the group (b) failures above.  This does not provide a clear path to reopening or to shipping Firefox 20.
Comment 53 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-19 01:16:54 PST
(In reply to David Baron [:dbaron] from comment #52)
>  (c) There was also a set of leaks from devtools tests, investigated in bug
> 824016 rather than this bug, which I believe (but am not sure) were also
> introduced when Firefox 20 merged to aurora.

philor confirms that these were indeed introduced when Firefox 20 merged to aurora.
Comment 54 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2013-01-19 01:21:15 PST
One other point to add to the summary, actually:  tests of nightlies aren't currently distinguished on tbpl from tests of pgo builds (bug 832050 covers fixing this).  This meant that *all* of the failures described in this bug appeared to be intermittent failures rather than permanent failures unless they were examined very closely.  That's one of the reasons it took so long for these failures to lead to the tree being closed.
Comment 55 :Ehsan Akhgari 2013-01-19 06:51:48 PST
I support option 2 in comment 52.
Comment 56 :Ms2ger (⌚ UTC+1/+2) 2013-01-19 10:45:54 PST
https://hg.mozilla.org/mozilla-central/rev/1b1be4ac343f
Comment 57 Alex Keybl [:akeybl] 2013-01-19 16:33:14 PST
(In reply to :Ehsan Akhgari from comment #55)
> I support option 2 in comment 52.

Agreed - Cheng and Jinghua (the main creators of testpilot surveys) hopefully don't have any urgent surveys in the short term while we continue our investigation. a=akeybl on option 2.
Comment 58 Alex Keybl [:akeybl] 2013-01-19 16:35:06 PST
When I say I'm in support of option 2, I am assuming that we'll continue to investigate and find a final resolution allowing testpilot surveys on Aurora soon, of course.
Comment 59 Phil Ringnalda (:philor, back in August) 2013-01-19 17:02:02 PST
Landed the patch to turn off building testpilot on aurora in https://hg.mozilla.org/releases/mozilla-aurora/rev/c489c87349b5
Comment 60 Phil Ringnalda (:philor, back in August) 2013-01-19 17:22:33 PST
Filed bug 832702 - Reenable building testpilot on mozilla-aurora when it no longer causes test failures, dependent on bug 832703 - testpilot causes browser-chrome leaks on Mac and Linux and bug 832705 - Complex of OOM failures in Linux64 and Win7 browser-chrome tests with testpilot enabled.
Comment 61 Phil Ringnalda (:philor, back in August) 2013-01-19 20:16:30 PST
Aurora's reopened.
Comment 62 Phil Ringnalda (:philor, back in August) 2013-01-20 17:14:12 PST
And once the light of future merges dawned on me, pushed to m-c in https://hg.mozilla.org/mozilla-central/rev/4919e8091542
Comment 63 Gregg Lind (User Advocacy - Heartbeat - Test Pilot) 2013-02-04 11:21:21 PST
We use Test Pilot all the time, and continually deploy new tests on it.  

The situation with its code is bad, and we are trying to decide what the best way to handle this going forward is...

Fix 1.2?  Build 2.0?
Comment 64 Alex Keybl [:akeybl] 2013-02-11 16:40:36 PST
(In reply to Gregg Lind (User Research - Test Pilot) from comment #63)
> We use Test Pilot all the time, and continually deploy new tests on it.  
> 
> The situation with its code is bad, and we are trying to decide what the
> best way to handle this going forward is...
> 
> Fix 1.2?  Build 2.0?

Do you have an ETA on owners for bug 832703 and bug 832705?
Comment 65 Lukas Blakk [:lsblakk] use ?needinfo 2013-02-20 13:27:26 PST
Gregg: this is tracking for Firefox 20 which is now on Beta and will ship in 6 weeks - anything you can do to advance the investigation here (put additional pressure on the 6 day old bug about getting a 64 bit machine?)?
Comment 66 Lukas Blakk [:lsblakk] use ?needinfo 2013-02-25 14:58:52 PST
Moving this over to FF21 tracking (current Aurora) as I don't believe there is anything to do here for FF20.
Comment 67 Phil Ringnalda (:philor, back in August) 2013-02-25 16:01:58 PST
Well, yes and no - there's absolutely no reason to believe that 20-on-beta isn't leaking and OOMing just because we don't actually run the tests (or to be more painfully accurate, just because we run the tests, on release builds, but absolutely positively not one person ever looks at the results of the tests) that would tell us that it is.

As far as I know we haven't done any investigation about whether any of the test failures, those two or the screwy focus that we "fixed" by insisting that the addon stop customizing the toolbar, were actually things that users would also see.
Comment 68 Gregg Lind (User Advocacy - Heartbeat - Test Pilot) 2013-02-26 11:09:54 PST
(Ask:  Real-time help me build and test on Linux-64-opt) 

I am blocked on this, honestly.  I need some real-time help building this on Unix and running the tests.  I have a build host, and have done mach builds on OSX, but unless I can get someone to walk me through the simplest testing / patching path, I really am failing at doing this.  

I want to fix this, and have time authorized to fix this, but the cost of re-figuring out the build/test process without guidance is very very expensive.  Help me lower it :)
Comment 69 Lukas Blakk [:lsblakk] use ?needinfo 2013-02-27 09:31:47 PST
Taking Gregg off this bug for now, since assigning Neil to bug 831854 looks to be the next steps here.  Also marking this tracking again for FF20 since, as philor calls out, we do need this test suite running prior to FF20 release to ensure we are not leaking and OOMing.
Comment 70 :Ehsan Akhgari 2013-02-27 12:12:36 PST
(In reply to comment #69)
> Taking Gregg off this bug for now, since assigning Neil to bug 831854 looks to
> be the next steps here.  Also marking this tracking again for FF20 since, as
> philor calls out, we do need this test suite running prior to FF20 release to
> ensure we are not leaking and OOMing.

Note that it might be possible to work around the focus issue in testpilot in case we won't have an immediate fix for bug 831854.
Comment 71 Robert Kaiser 2013-03-12 13:18:16 PDT
What's the status here? The doors are closing pretty soon on 20 and this is still marked for tracking that one...
Comment 72 Alex Keybl [:akeybl] 2013-03-14 10:00:29 PDT
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #71)
> What's the status here? The doors are closing pretty soon on 20 and this is
> still marked for tracking that one...

Is there any risk here outside of Test Pilot? If not, we can untrack for FF20 at this point (releasing in 2 weeks).
Comment 73 Phil Ringnalda (:philor, back in August) 2013-03-14 10:17:50 PDT
Nothing outside of testpilot - it needed the flag that doesn't exist, tracking-the-20-betas, since they may or may not have leaked and OOMed, but we don't build or ship testpilot with releases, so at this point it's 20-whatever and on to shipping 21 betas that may or may not leak and OOM.
Comment 74 :Ehsan Akhgari 2013-03-14 14:10:36 PDT
Yeah, what philor said.
Comment 75 Alex Keybl [:akeybl] 2013-04-03 12:57:27 PDT
We're untracking in favor of bug 840108.
