Closed Bug 870002 Opened 7 years ago Closed 6 years ago

Intermittent test_peerConnection_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html,test_peerConnection_throwInCallbacks.html | Exited with code -2147483645 during test run | application crashed [Unknown top frame]

Categories

(Core :: WebRTC: Audio/Video, defect)

x86
Windows 7
defect
Not set

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox22 - affected
firefox23 - affected

People

(Reporter: RyanVM, Assigned: roc)

References

Details

(Keywords: crash, intermittent-failure, regression, Whiteboard: [WebRTC][blocking-webrtc-][leave-open][qa-automation-blocked][webrtc-uplift])

Attachments

(3 files, 1 obsolete file)

Maybe related to this push?
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=c9737a4136cf

https://tbpl.mozilla.org/php/getParsedLog.php?id=22727216&tree=Mozilla-Inbound

Rev3 WINNT 6.1 mozilla-inbound opt test mochitest-3 on 2013-05-08 05:04:13 PDT for push 5971dba36391
slave: talos-r3-w7-107

05:21:48     INFO -  18171 INFO TEST-INFO | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideo.html | Got media stream: audio (local)
05:21:48     INFO -  18172 INFO TEST-INFO | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideo.html | Call getUserMedia for {"video":true,"fake":true}
05:21:48     INFO -  0[ae82080]: [CCAPP Task|def] ccapi.c:1161: SIPCC-CC_API: 1/4, cc_int_feature2: UI -> GSM: ADDSTREAM
05:21:48     INFO -  0[ae821d0]: [GSM Task|def] dcsm.c:532: SIPCC-DCSM: dcsm_process_event: DCSM 23  :(DCSM_READY:ADDSTREAM )
05:21:48     INFO -  0[ae821d0]: [GSM Task|fsm_sm] sm.c:46: SIPCC-FSM: sm_process_event: DEF 4   : 6C281E04x: sm entry: (IDLE:ADDSTREAM)
05:21:48     INFO -  0[ae821d0]: [GSM Task|fsm_sm] fsmdef.c:3535: SIPCC-FSM: fsmdef_ev_addstream: Entered.
05:21:48     INFO -  0[ae821d0]: [GSM Task|def] sm.c:65: SIPCC-GSM: 1/4, sm_process_event: DEF   :(IDLE:ADDSTREAM )
05:23:05  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideo.html | Exited with code -2147483645 during test run
05:23:07     INFO -  INFO | automation.py | Application ran for: 0:15:26.393000
05:23:07     INFO -  INFO | zombiecheck | Reading PID log: c:\users\cltbld\appdata\local\temp\tmpsmm_wbpidlog
05:23:07     INFO -  ==> process 1540 launched child process 956
05:23:07     INFO -  ==> process 1540 launched child process 3564
05:23:07     INFO -  ==> process 1540 launched child process 404
05:23:07     INFO -  ==> process 1540 launched child process 976
05:23:07     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 956
05:23:07     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 3564
05:23:07     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 404
05:23:07     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 976
05:23:07     INFO -  mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1368011389/firefox-23.0a1.en-US.win32.crashreporter-symbols.zip
05:23:08     INFO -  Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1368011389/firefox-23.0a1.en-US.win32.crashreporter-symbols.zip
05:23:21  WARNING -  PROCESS-CRASH | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideo.html | application crashed [Unknown top frame]
Summary: Intermittent test_peerConnection_basicAudioVideo.html | Exited with code -2147483645 during test run | application crashed [Unknown top frame] → Intermittent test_peerConnection_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Exited with code -2147483645 during test run | application crashed [Unknown top frame]
https://tbpl.mozilla.org/php/getParsedLog.php?id=22757204&tree=Mozilla-Inbound
Summary: Intermittent test_peerConnection_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Exited with code -2147483645 during test run | application crashed [Unknown top frame] → Intermittent test_peerConnection_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html,test_peerConnection_throwInCallbacks.html | Exited with code -2147483645 during test run | application crashed [Unknown top frame]
Attachment #747282 - Attachment is obsolete: true
Comment on attachment 747287 [details] [diff] [review]
move data-processing debugs in MSG to level 5 to allow granular logging

This is just a patch to move a bunch of high-volume debugs in MSG to level 5 from 4 (PR_LOG_DEBUG), so we can turn on debugging of things like adding tracks in automation without generating 20MB+ log files.  The dom/media/tests/mochitests logs drop from 22MB to 800K with mediastreamgraph:4 logging (500K without any MSG logging)

For this sort of patch (just changing debug levels) I'll take whomever can review first.

We could locally define in MediaStreamGraph.h a LOG_MSG_DETAILS or some such and use that instead of PR_LOG_DEBUG+1; I don't really care either way.
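To make the effect of moving messages from level 4 to level 5 concrete, here is a minimal sketch of threshold-based log filtering. It is a simplified Python model, not the NSPR implementation; the only values taken from this bug are PR_LOG_DEBUG being level 4 and the "mediastreamgraph:4" setting. The function names and sample messages are hypothetical.

```python
# Simplified model of level-based log filtering (illustrative, not NSPR code).
PR_LOG_DEBUG = 4  # the comment above identifies level 4 as PR_LOG_DEBUG


def should_log(module_level: int, message_level: int) -> bool:
    """A message is emitted when the module's configured level is at least
    the level the message was logged at."""
    return module_level >= message_level


def filter_messages(module_level, messages):
    """Keep only the messages that would be emitted at `module_level`."""
    return [text for level, text in messages if should_log(module_level, level)]


# Hypothetical messages: one structural event at level 4, one high-volume
# data-processing message moved to level 5 (PR_LOG_DEBUG + 1).
messages = [
    (PR_LOG_DEBUG, "Adding media stream to the graph"),
    (PR_LOG_DEBUG + 1, "per-iteration data-processing detail"),
]

print(filter_messages(4, messages))  # structural events only
print(filter_messages(5, messages))  # everything, including high-volume output
```

With the module set to 4, only the structural events survive, which is why the log size drops so sharply once the noisy messages sit at level 5.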
Attachment #747287 - Flags: review?(tterribe)
Attachment #747287 - Flags: review?(roc)
Attachment #747287 - Flags: review?(paul)
Attachment #747287 - Flags: review?(ehsan)
Attachment #747287 - Flags: review?(cpearce)
Attachment #747287 - Flags: review?(adam)
Attachment #747287 - Flags: review?(roc) → review+
Comment on attachment 747283 [details] [diff] [review]
enable MediaStreamGraph logging to try to hunt down bug 870002

Once the other bug here to change debugs in MSG is approved, this will turn on more "what's going on" debugging for MSG without blowing up the logs (a few hundred K more roughly).  I'll take whomever feels they can r+ this.  This is intended to be backed out as soon as we've figured out bug 870002 (or decided this debug change doesn't help find it).
Attachment #747283 - Flags: review?(ted)
Attachment #747283 - Flags: review?(ryanvm)
Attachment #747283 - Flags: review?(philringnalda)
Attachment #747283 - Flags: review?(emorley)
Whiteboard: [leave-open]
Attachment #747283 - Flags: review?(ted)
Attachment #747283 - Flags: review?(ryanvm)
Attachment #747283 - Flags: review?(philringnalda)
Attachment #747283 - Flags: review?(emorley)
Attachment #747283 - Flags: review+
Attachment #747287 - Flags: review?(tterribe)
Attachment #747287 - Flags: review?(paul)
Attachment #747287 - Flags: review?(ehsan)
Attachment #747287 - Flags: review?(cpearce)
Attachment #747287 - Flags: review?(adam)
Ok, we got a hit on the retriggers!

So we see this sequence from MSG right before the crash:
09:01:58     INFO -  3288[f8654c8]: Adding media stream 1301ea30 to the graph
09:01:58     INFO -  3288[f8654c8]: Adding media stream 12e9d1e0 to the graph
09:01:58     INFO -  3288[f8654c8]: Adding MediaInputPort 1620d580 (from 12e9d1e0 to 1301ea30) to the graph
09:01:58     INFO -  3288[f8654c8]: SourceMediaStream 12e9d1e0 creating track 1, rate 1000000, start 0, initial end 33333

Normally in all the other instances above in the log, it has this line following it:
09:01:58     INFO -  3288[f8654c8]: TrackUnionStream 1301e0c8 adding track 1 for input stream c401870 track 1, start ticks 0

but that's missing here.

Roc: any ideas?  Debugs/asserts to add?
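The check described above (a "creating track" line with no TrackUnionStream "adding track" line after it) can be automated over a full log. This is a rough sketch, assuming the two lines normally appear back to back; the `window` parameter is a hypothetical knob for logs where other threads interleave. The first two sample lines are an illustrative "normal" pair built from the line shapes quoted in this comment, not an excerpt from a real log.

```python
import re

# Patterns follow the shape of the MediaStreamGraph log lines quoted above.
CREATE_RE = re.compile(r"SourceMediaStream (\w+) creating track (\d+)")
ADD_RE = re.compile(r"TrackUnionStream \w+ adding track \d+ for input stream")


def unmatched_track_creations(lines, window=1):
    """Return (stream, track) pairs whose 'creating track' line has no
    TrackUnionStream 'adding track' line within the next `window` lines."""
    missing = []
    for i, line in enumerate(lines):
        created = CREATE_RE.search(line)
        if created is None:
            continue
        following = lines[i + 1 : i + 1 + window]
        if not any(ADD_RE.search(nxt) for nxt in following):
            missing.append((created.group(1), int(created.group(2))))
    return missing


# Hypothetical sample: a normal pair, then the sequence seen before the crash.
sample = [
    "SourceMediaStream c401870 creating track 1, rate 1000000, start 0, initial end 33333",
    "TrackUnionStream 1301e0c8 adding track 1 for input stream c401870 track 1, start ticks 0",
    "SourceMediaStream 12e9d1e0 creating track 1, rate 1000000, start 0, initial end 33333",
]
print(unmatched_track_creations(sample))
```

Running this over several green logs as well would show whether the missing line is actually specific to the crashing runs.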
Flags: needinfo?(roc)
Hmmmm.  The hit from m-c has the missing TrackUnion line. :-(

Back to "any debugs/asserts we can add?"  roc?  abr/ehugg?  Anything stick out to you about where it's failing?
Whiteboard: [leave-open] → [WebRTC][blocking-webrtc?][leave-open]
Can we get the minidumps being produced by these crashes? That might help...
Flags: needinfo?(roc)
The minidumps are 0 bytes.... :-(

ted	jesup: either that or figure out how to get windows to generate a minidump for you, and disable breakpad
ted	since windows does it out-of-process
ted	jesup: http://msdn.microsoft.com/en-us/library/windows/desktop/bb787181%28v=vs.85%29.aspx
ted	maybe take a slave out of service, configure that, run the test repeatedly till it fails?
ted	running under a debugger might change the failure mode
jesup	ted: do the slaves run one mochitest at a time, or multiple?
ted	multiple
catlee-buildduty	in sequence
ted	we split the run into 5 chunks, each test run runs all the tests in that chunk
jesup	So we may need to emulate that to produce the timings needed to force the failure
ted	you can just take a build that has displayed the failure and run the same chunk
ted	file a bug to get a test slave set aside
Added dependencies to possible sources
Depends on: 863224, 866514, 868406
I'm going to guess the tracking nom here means there's agreement this is a blocker as a crash regression.
Whiteboard: [WebRTC][blocking-webrtc?][leave-open] → [WebRTC][blocking-webrtc+][leave-open]
Assignee: nobody → roc
I ran 200 iterations of the dom/media mochitests on my Windows laptop, with no failures.
(In reply to TinderboxPushlog Robot from comment #57)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=22925320&tree=Mozilla-Beta
> Rev3 WINNT 6.1 mozilla-beta pgo test mochitest-3 on 2013-05-13 19:29:15
> slave: talos-r3-w7-051
> 
> 19:34:15  WARNING -  TEST-UNEXPECTED-FAIL |
> /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.
> html | Exited with code -2147483645 during test run
> 19:34:33  WARNING -  PROCESS-CRASH |
> /tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.
> html | application crashed [Unknown top frame]
> 19:34:39    ERROR - Return code: 1

Looks like bug 866514 is indeed at fault. "Yay"
Good. Flagging the regressing bug then.
Blocks: 866514
No longer depends on: 863224, 866514, 868406
Keywords: regression
Ted, is there anything we can do to diagnose the empty minidump? Maybe more diagnostics in the code that creates the minidump?
Flags: needinfo?(ted)
We simply call into a Microsoft library function: MinidumpWriteDump. The most common cause for an empty dump is running out of virtual memory, whether due to actual exhaustion or fragmentation. If you'd like to print out memory stats right after we write the minidump (or fail to), we already gather some to send with the crash report, you could put some logging statements here:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#551
Flags: needinfo?(ted)
I filed bug 872786 on gathering more information when minidump collection fails.
In https://tbpl.mozilla.org/?tree=Try&rev=da5eabb5aafe I have a try push with my patch for bug 872786, to try to gather more data when minidump creation fails. I'll try retriggering this test to see if we can collect some useful data there.
I retriggered roc's try a bunch more times, and got a hit:

https://tbpl.mozilla.org/php/getParsedLog.php?id=23021195&tree=Try&full=1

00:43:17     INFO -  out of memory: 0x0000000000070800 bytes requested
00:43:18     INFO -  Minidump creation for thread 328 failed with GetLastError() -2147024865!
00:43:18     INFO -  * EXCEPTION_RECORD Code=80000003 Flags=0 Address=7329113f Information[0]=0 Information[1]=-2067537872 Information[2]=3
00:43:18     INFO -  * CONTEXT Eax=0 Ebx=0 Ecx=72933896 Edx=3 Esi=728e1ec6 Edi=7293379c Ebp=11e2f6a0 Esp=11e2f698 Eip=7329113f EFlags=202 SegCs=1b SegSs=23 SegDs=23 SegEs=23 SegFs=3b SegGs=0
00:43:18     INFO -  * Memory at 732910bf:
There are two failures. Both crashed at the same address with the same OOM message.
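The negative numbers in these logs are easier to read as unsigned 32-bit hex. A small sketch of the conversion, using only values that appear in this bug: the exit code -2147483645 becomes 0x80000003, which matches the EXCEPTION_RECORD Code=80000003 above (the Windows breakpoint exception code). The GetLastError() value becomes 0x8007001f; if that is an HRESULT-wrapped Win32 error, which is an assumption here, its low 16 bits give Win32 error 31 (ERROR_GEN_FAILURE).

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


exit_code = -2147483645    # "Exited with code ..." from the test log
last_error = -2147024865   # GetLastError() value from the minidump failure

print(hex(to_unsigned32(exit_code)))    # 0x80000003
print(hex(to_unsigned32(last_error)))   # 0x8007001f

# Assumption: the value follows the HRESULT layout, so the low 16 bits
# carry the underlying Win32 error code.
print(to_unsigned32(last_error) & 0xFFFF)  # 31
```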

http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/2f458521-4315-4295-9c85-336d693d55cc describes the same error when calling MiniDumpWriteDump, but apart from mentioning the memory allocation issue, does not help.

I wonder what generates that "out of memory: 0x0000000000070800 bytes requested" message.
Oh, that message comes from mozalloc_handle_oom.
See Also: → 872996
Running the tests locally, 0x70800 is from allocating a PlanarYCbCrImage --- 640*480 pixels at 1.5 bytes per pixel. The allocation is made infallibly, which is probably a mistake.

This suggests maybe we're leaking temporarily, or something.
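The 0x70800 figure can be checked with a few lines of arithmetic. This sketch assumes a planar layout with 4:2:0 chroma subsampling (one full-resolution luma plane plus two quarter-resolution chroma planes), which is what 1.5 bytes per pixel implies; the function name is hypothetical.

```python
def yuv420_frame_bytes(width: int, height: int) -> int:
    """Bytes for one 8-bit planar YCbCr frame with 4:2:0 subsampling."""
    luma = width * height                      # Y plane, 1 byte per pixel
    chroma = 2 * (width // 2) * (height // 2)  # Cb and Cr planes, quarter size each
    return luma + chroma


size = yuv420_frame_bytes(640, 480)
print(size, hex(size))  # 460800 0x70800
```

So the failed request is exactly one 640x480 video frame, which fits the theory that whatever allocates next is just the victim once memory is already exhausted.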
I need to sleep now, but I want to look into the patch in bug 866514 and see if the media streams are being cleaned up properly. If we were temporarily leaking MediaStreams and their cached video frames, but cleaning them up on shutdown, that might cause this.
Right, I suspect the actual OOM point of failure isn't very interesting here, it's just whatever sucker tries to allocate memory at that point. The issue is "what's actually eating up all our memory".
"thread 328" is intriguing...  Why so many?  Something not getting cleaned up?
Interestingly, bug 866514 (or some other change around there) has made us clean up MediaStreams *earlier* when I just run the dom/media tests. Which doesn't help explain this bug at all.
Bug 872996 looks like this bug. However, in bug 872996 I would not expect the code changed in bug 866514 to have run yet. Very mysterious :-(.
We could try backing out 866514 and relanding it one little piece at a time. I don't have any better ideas.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #94)
> We could try backing out 866514 and relanding it one little piece at a time.
> I don't have any better ideas.

We could try the Microsoft AppVerifier - Ethan?  How easy would it be to try running it on a mochitest set?

Ted: I assume the out-of-memory could be some type of heap corruption?
Do we have in-tbpl ASAN mochitest runs for mac/linux at all?  Could we retrigger them a bunch of times, or if we don't, could we do an ASAN Try build and retrigger?

Try runs are known to hit it (if retriggered enough) so we can submit Try pushes with different pieces landed and then use that to bisect the patch.  Weekend is coming and infra is more lightly loaded :-)
Flags: needinfo?(ted)
Flags: needinfo?(ethanhugg)
(In reply to Randell Jesup [:jesup] from comment #96)
> Ted: I assume the out-of-memory could be some type of heap corruption?
> Do we have in-tbpl ASAN mochitest runs for mac/linux at all?  Could we
> retrigger them a bunch of times, or if we don't, could we do an ASAN Try
> build and retrigger?

We do not have them on TBPL, but you can run them on Try. I don't know how well it works on Mochitests.
Flags: needinfo?(ted)
(In reply to Randell Jesup [:jesup] from comment #96)
> We could try the Microsoft AppVerifier - Ethan?  How easy would it be to try
> running it on a mochitest set?
> 
I will try this on AppVerif today.  I haven't run the mochitests on AppVerif yet, only the unittests and the by-hand demos.
Flags: needinfo?(ethanhugg)
I did not find the smoking gun I was looking for but I thought I'd document some AppVerif results here.  
These will happen with any page that uses a PeerConnection with default AppVerif checks.

spl_init.c:106
LOCK: EnterCriticalSection() called on Uninitialized CS.
The CS is actually initialized by hand statically three lines earlier in the file. AppVerif complains because InitializeCriticalSection() was not called.

rw_lock_win.cc:55
SRWLOCK: AcquireLockShared() fails on PC shutdown, perhaps lock already destroyed.  
Get this several times when navigating away from a page that uses a peer connection

Stack:
 	vrfcore.dll!_VerifierStopMessageEx()	Unknown
 	vfbasics.dll!_AVrfpVerifySRWLockAcquire@12()	Unknown
 	vfbasics.dll!_AVrfpRtlAcquireSRWLockShared@4()	Unknown
>	xul.dll!webrtc::RWLockWin::AcquireLockShared() Line 56	C++
 	xul.dll!webrtc::voe::ChannelManagerBase::GetItem(int itemId) Line 158	C++
 	xul.dll!webrtc::voe::ChannelManager::GetChannel(const int channelId) Line 77	C++
 	xul.dll!webrtc::voe::ScopedChannel::ScopedChannel(webrtc::voe::ChannelManager & chManager, int channelId) Line 111	C++
 	xul.dll!webrtc::VoEBaseImpl::StopPlayout() Line 1471	C++
 	xul.dll!webrtc::VoEBaseImpl::StopPlayout(int channel) Line 1113	C++
 	xul.dll!mozilla::WebrtcAudioConduit::~WebrtcAudioConduit() Line 102	C++
 	xul.dll!mozilla::WebrtcAudioConduit::`scalar deleting destructor'(unsigned int)	C++
 	xul.dll!mozilla::MediaSessionConduit::Release() Line 139	C++
 	xul.dll!mozilla::RefPtr<mozilla::MediaSessionConduit>::unref(mozilla::MediaSessionConduit * t) Line 172	C++
 	xul.dll!mozilla::RefPtr<mozilla::MediaSessionConduit>::~RefPtr<mozilla::MediaSessionConduit>() Line 121	C++
 	xul.dll!mozilla::ConduitDeleteEvent::~ConduitDeleteEvent()	C++
 	xul.dll!mozilla::ConduitDeleteEvent::`scalar deleting destructor'(unsigned int)	C++
 	xul.dll!nsRunnable::Release() Line 31	C++
 	xul.dll!nsCOMPtr<nsIRunnable>::~nsCOMPtr<nsIRunnable>() Line 523	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait, bool * result) Line 635	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * thread, bool mayWait) Line 238	C++
 	xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Line 82	C++
 	xul.dll!MessageLoop::RunInternal() Line 220	C++
 	xul.dll!MessageLoop::RunHandler() Line 213	C++
 	xul.dll!MessageLoop::Run() Line 187	C++
 	xul.dll!nsBaseAppShell::Run() Line 165	C++
 	xul.dll!nsAppShell::Run() Line 113	C++
 	xul.dll!nsAppStartup::Run() Line 269	C++
 	xul.dll!XREMain::XRE_mainRun() Line 3877	C++
 	xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 3944	C++
 	xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4145	C++
 	firefox.exe!do_main(int argc, char * * argv, nsIFile * xreDirectory) Line 272	C++
 	firefox.exe!NS_internal_main(int argc, char * * argv) Line 632	C++
 	firefox.exe!wmain(int argc, wchar_t * * argv) Line 105	C++
 	firefox.exe!__tmainCRTStartup() Line 533	C
 	firefox.exe!wmainCRTStartup() Line 377	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

If I turn off lock and srwlock checking I don't get errors.  AppVerif has caught heap errors like use-after-free in Firefox for me before.
Suggestions from bsmedberg:

[12:34]	bsmedberg	jesup: well can you dump about:memory to the testing log at the beginning of this/these tests?
[12:34]	jesup	I'm pretty sure we can
[12:36]	bsmedberg	What we really want is external crash reporting, but that's not a simple project
[12:38]	jesup	yeah. We need to find some way to solve this in the next week or so, which rules that out
[12:38]	bsmedberg	jesup: you could also try hacking the tests so that it disables the crash reporter and launches the process using procdump
[12:38]	bsmedberg	The test harness has changed enough that I don't know where we do that stuff nowadays.
[12:39]	jesup	ok; I don't know what's involved with that, but I can probably ping ted to help with that

ted: can you help with his suggestions?  (either/both)  Roc is at a work week in Taiwan, so will only be iffily available; I'll help as much as I can: putting together try runs (probably based on roc's try that found the OOM issue), retriggering, and analyzing anything we can get from them.
Flags: needinfo?(ted)
I'm in SFO this week, so timezones are not fantastic and I don't have my full complement of machines, but I'll see if I can figure something out here.

I don't think external crash reporting is really going to help us, we've determined that this is just "crashing on OOM". What we really need to find out is *what* is eating the memory.
Flags: needinfo?(ted)
Could this somehow be related to bug 837835?
I hacked up some code to dump about:memory from a Mochitest:
http://pastebin.mozilla.org/2432760

It's terrible, but it seems to work.

(In reply to Henrik Skupin (:whimboo) from comment #124)
> Could this somehow be related to bug 837835?

It's possible, but we found the root cause for most of that spike in empty dumps and it was fixed.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #126)
> I hacked up some code to dump about:memory from a Mochitest:
> http://pastebin.mozilla.org/2432760
> 
> It's terrible, but it seems to work.

Ted, who's in the best position to add this to the tests?
Flags: needinfo?(ted)
I was hoping jesup would, but he seems to be busy with other things. I've been in SF this whole week so I don't have my full build environment handy, and I'm travelling tomorrow, so I won't have time for this until Monday at the earliest.
Flags: needinfo?(ted)
This crash can be seen constantly on try for my upcoming datachannel tests on bug 796894. So it might block its landing.
Blocks: 796894
Status: NEW → ASSIGNED
Whiteboard: [WebRTC][blocking-webrtc+][leave-open] → [WebRTC][blocking-webrtc+][leave-open][qa-automation-blocked]
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #144)
> I was hoping jesup would, but he seems to be busy with other things. I've
> been in SF this whole week so I don't have my full build environment handy,
> and I'm travelling tomorrow, so I won't have time for this until Monday at
> the earliest.

I can handle it
Whiteboard: [WebRTC][blocking-webrtc+][leave-open][qa-automation-blocked] → [WebRTC][blocking-webrtc+][leave-open][qa-automation-blocked][webrtc-uplift]
We're not getting any more useful info out of the MSG logging, and it's causing problems with M-1 log sizes (bug 876545)
Comment on attachment 754928 [details] [diff] [review]
remove mediastreamgraph:4 logging

r=me if you need it.  ;-)
Attachment #754928 - Flags: review+
Sorry, I've been a bit out of it with the Taiwan work week and since then, FirefoxOS stuff.

(In reply to Henrik Skupin (:whimboo) from comment #150)
> This crash can be seen constantly on try for my upcoming datachannel tests
> on bug 796894. So it might block its landing.

Can you reproduce that crash locally? If you can, that could really really help!
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #172)
> Can you reproduce that crash locally? If you can, that could really really
> help!

I cannot fully remember if I hit it locally but it was constantly failing on try. I can try if I can get it reproduced locally. Once I have it I can provide a better stack trace via gdb.
Sorry, I finally got around to hooking up my about:memory dumping code to these mochitests, I pushed a try run:
https://tbpl.mozilla.org/?tree=Try&rev=25f0a25a7a29
Someone helpfully retriggered 30 more Windows 7 mochitest-3 jobs on my Try push, and none of them were orange. I triggered 10 more, we'll see if anything happens. I am theorizing that perhaps opening and closing about:memory in a tab for every test changes our GC/CC behavior so as to make an OOM not happen. If I don't see any orange on these runs I'll fiddle the patch tomorrow to only open one about:memory tab.
Should be disabled now.
Whiteboard: [WebRTC][blocking-webrtc+][leave-open][qa-automation-blocked][webrtc-uplift] → [WebRTC][blocking-webrtc-][leave-open][qa-automation-blocked][webrtc-uplift]
(In reply to Jason Smith [:jsmith] from comment #191)
> Should be disabled now.

Meant to say - disabled per https://bugzilla.mozilla.org/show_bug.cgi?id=866514#c29.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #190)
> anything happens. I am theorizing that perhaps opening and closing
> about:memory in a tab for every test changes our GC/CC behavior so as to
> make an OOM not happen. If I don't see any orange on these runs I'll fiddle
> the patch tomorrow to only open one about:memory tab.

That's most likely the case. But instead of opening and closing the about:memory tab I wonder if we could directly call any API method. Nicholas, what is getting executed when you open about:memory?
Flags: needinfo?(n.nethercote)
> Nicholas, what is getting executed when you open about:memory?

toolkit/components/aboutmemory/contents/aboutMemory.js.
Flags: needinfo?(n.nethercote)
I did look at that, but I don't think it's straightforward to use that from a Mochitest. (The use case here is a little weird.)
It's interesting how few hits this has gotten since mid-last-week (when we had about 10 in a day)...
The lack of failures on retriggers with about:memory might be that the intermittent has become rare (the only one since 5/30 was on Birch)...  So I'd suggest retriggering some win7 opt/debug builds from a random inbound push to see if you see it there - if you don't, then about:memory isn't hiding the bug.  

Makes me concerned that what caused it to go away might just be luck
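How many green retriggers it takes to conclude anything depends on the underlying failure rate. A rough sketch, assuming each run fails independently with a fixed probability; the rates below are hypothetical, not measured from this bug.

```python
import math


def prob_all_green(failure_rate: float, runs: int) -> float:
    """Chance of seeing zero failures in `runs` independent runs."""
    return (1.0 - failure_rate) ** runs


def runs_for_confidence(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs with at least `confidence` chance of one failure."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Hypothetical per-run failure rates for an intermittent crash.
for rate in (0.05, 0.01):
    print(rate, round(prob_all_green(rate, 40), 3), runs_for_confidence(rate))
```

At a 5% failure rate, 40 green runs in a row still happen roughly one time in eight, and at 1% they are the expected outcome, so a few dozen green retriggers alone cannot separate "about:memory hides the bug" from "the bug became rare".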
After chatting with jesup I realized that we did make a large change in our test infra--we switched all the Windows test slaves to the new IX machines. You'll note that there are no failures on IX machines (comment 188 appears to be a mis-star).
I verified that Beta is still on Talos-* slaves, but the number of pushes there is low enough we may not see hits from a moderate/low freq intermittent.

It certainly does seem tied to the hardware change.  Ted and I speculated it might be garbage building up and (if the odds are right and enough other stuff is running on the slave, perhaps) it runs out of memory.  The new hardware apparently has more RAM (and timings will be different).
That hit on beta with bug 866514 shows it wasn't caused by that patch.  We've relanded it.
We should consider removing this from tracking given the latest info
> I did look at that, but I don't think it's straightforward to use that from
> a Mochitest. (The use case here is a little weird.)

If you can explain exactly what you need I might be able to help further.
It's not terribly important now, but I was just trying to get a dump of about:memory into the Mochitest logs to try to get some diagnostics on memory usage during the tests.
I think we ought to be able to close this now.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME