Closed Bug 1244904 Opened 8 years ago Closed 6 years ago

Intermittent browser_windowRestore_perwindowpb.js | application crashed [@ mozilla::::RunWatchdog]

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: KWierso, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
It appears failures here started when browser_windowStateContainer.js was disabled in bug 1373116.
Blocks: 1373116
Whiteboard: [stockwell needswork]
This is a shutdown crash which happens after the sessionstore tests have run -- it is not strongly associated with browser_windowRestore_perwindowpb.js.

[task 2017-06-23T15:28:19.342163Z] 15:28:19     INFO - Crash reason:  SIGSEGV
[task 2017-06-23T15:28:19.343780Z] 15:28:19     INFO - Crash address: 0x0
[task 2017-06-23T15:28:19.345419Z] 15:28:19     INFO - Process uptime: not available
[task 2017-06-23T15:28:19.347019Z] 15:28:19     INFO - 
[task 2017-06-23T15:28:19.348591Z] 15:28:19     INFO - Thread 16 (crashed)
[task 2017-06-23T15:28:19.350440Z] 15:28:19     INFO -  0  libxul.so!RunWatchdog [nsTerminator.cpp:629f02cdaee7 : 160 + 0x2]
[task 2017-06-23T15:28:19.352133Z] 15:28:19     INFO -     eip = 0xf37e618d   esp = 0xbe7fe2f0   ebp = 0xbe7fe318   ebx = 0xf559d774
[task 2017-06-23T15:28:19.354177Z] 15:28:19     INFO -     esi = 0x0000003f   edi = 0xae6f15b4   eax = 0x00000000   ecx = 0x00000000
[task 2017-06-23T15:28:19.355926Z] 15:28:19     INFO -     edx = 0xf46fc60c   efl = 0x00010246
[task 2017-06-23T15:28:19.358223Z] 15:28:19     INFO -     Found by: given as instruction pointer in context

...we are hitting:

    // Shutdown is apparently dead. Crash the process.
    MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");
Assignee: nobody → gbrown
I have tried to find a set of sessionstore tests that can be skipped to avoid this failure, but I am having trouble reducing the set to less than about 20 tests -- unsatisfactory.

https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&fromchange=80a4e638401c74d2b1071edda185eda8b428e17f&tochange=ba2cfc2d66747b6e8301c5eca2115ed91834cb32
All failures are linux32-opt and non-e10s; a limited skip of the entire directory on that platform might be an option.
This is a very frequent intermittent that needs attention as soon as possible.

Although this is marked as dependent on bug 1358898, I don't think it is strongly related: 1358898 is about shutdown hangs associated with AsyncFreeSnowWhite, following a variety of tests. The crash reports for this bug generally do not show AsyncFreeSnowWhite in the stack, and hangs follow the sessionstore tests.

We tried disabling the last sessionstore test in bug 1373116, but that did not help -- shutdown hangs continued and the failure reports moved to this bug. It seems like many (20 or so) sessionstore tests must be skipped before there is a significant decrease in failure frequency. All failures are on linux32-opt and non-e10s; so we might consider skipping the whole sessionstore directory on that specific platform - but that seems drastic.

These shutdown hangs are easily reproduced on try (more than 50% of runs hang), on linux32-opt, by running the sessionstore mochitest-bc tests. Crash reports are consistent, showing the hang in ShutdownXPCOM, like:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=110230014&lineNumber=3349

[task 2017-06-28T00:56:55.328838Z] 00:56:55     INFO - Thread 0
[task 2017-06-28T00:56:55.330626Z] 00:56:55     INFO -  0  libpthread-2.23.so + 0xc16b
[task 2017-06-28T00:56:55.332662Z] 00:56:55     INFO -     eip = 0xf779516b   esp = 0xffbfa83c   ebp = 0x04000001   ebx = 0xf712a664
[task 2017-06-28T00:56:55.334656Z] 00:56:55     INFO -     esi = 0x00000001   edi = 0xf712a660   eax = 0x00000001   ecx = 0x00000085
[task 2017-06-28T00:56:55.338864Z] 00:56:55     INFO -     edx = 0x00000001   efl = 0x00000202
[task 2017-06-28T00:56:55.340681Z] 00:56:55     INFO -     Found by: given as instruction pointer in context
[task 2017-06-28T00:56:55.342355Z] 00:56:55     INFO -  1  firefox!_fini + 0x6e84
[task 2017-06-28T00:56:55.344185Z] 00:56:55     INFO -     eip = 0x08073248   esp = 0xffbfa84c   ebp = 0x04000001
[task 2017-06-28T00:56:55.345979Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.347858Z] 00:56:55     INFO -  2  firefox!mozilla::detail::ConditionVariableImpl::notify_one [ConditionVariable_posix.cpp:d9489f5df7be : 101 + 0xb]
[task 2017-06-28T00:56:55.350976Z] 00:56:55     INFO -     eip = 0x0804e0ad   esp = 0xffbfa850   ebp = 0x04000001
[task 2017-06-28T00:56:55.352815Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.354563Z] 00:56:55     INFO -  3  libxul.so!_fini + 0x139b9c0
[task 2017-06-28T00:56:55.357131Z] 00:56:55     INFO -     eip = 0xf5621f98   esp = 0xffbfa860   ebp = 0xffbfa94c
[task 2017-06-28T00:56:55.362765Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.364494Z] 00:56:55     INFO -  4  0xdd192f20
[task 2017-06-28T00:56:55.366282Z] 00:56:55     INFO -     eip = 0xdd192f20   esp = 0xffbfa954   ebp = 0xf2413215
[task 2017-06-28T00:56:55.368060Z] 00:56:55     INFO -     Found by: previous frame's frame pointer
[task 2017-06-28T00:56:55.369941Z] 00:56:55     INFO -  5  libxul.so + 0x25d19a0
[task 2017-06-28T00:56:55.371729Z] 00:56:55     INFO -     eip = 0xf24149a0   esp = 0xffbfa958   ebp = 0xf2413215
[task 2017-06-28T00:56:55.373478Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.375283Z] 00:56:55     INFO -  6  libxul.so!_fini + 0x2b8fb8
[task 2017-06-28T00:56:55.377048Z] 00:56:55     INFO -     eip = 0xf453f590   esp = 0xffbfa968   ebp = 0xf2413215
[task 2017-06-28T00:56:55.378847Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.383010Z] 00:56:55     INFO -  7  firefox!mozilla::TimeStamp::Now [TimeStamp_posix.cpp:d9489f5df7be : 77 + 0x5]
[task 2017-06-28T00:56:55.384802Z] 00:56:55     INFO -     eip = 0x08050bd2   esp = 0xffbfa970   ebp = 0xf2413215
[task 2017-06-28T00:56:55.386640Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.388356Z] 00:56:55     INFO -  8  libxul.so + 0x465e80
[task 2017-06-28T00:56:55.390306Z] 00:56:55     INFO -     eip = 0xf02a8e80   esp = 0xffbfa984   ebp = 0xf2413215
[task 2017-06-28T00:56:55.392083Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.394928Z] 00:56:55     INFO -  9  libxul.so!nsIdleService::SetTimerExpiryIfBefore [nsIdleService.cpp:d9489f5df7be : 787 + 0xa]
[task 2017-06-28T00:56:55.396690Z] 00:56:55     INFO -     eip = 0xf2412f00   esp = 0xffbfa99c   ebp = 0xf2413215
[task 2017-06-28T00:56:55.398399Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.402954Z] 00:56:55     INFO - 10  libxul.so!_fini + 0x139b9c0
[task 2017-06-28T00:56:55.404734Z] 00:56:55     INFO -     eip = 0xf5621f98   esp = 0xffbfa9a0   ebp = 0xf2413215
[task 2017-06-28T00:56:55.406499Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.408297Z] 00:56:55     INFO - 11  libxul.so!nsIdleService::ReconfigureTimer [nsIdleService.cpp:d9489f5df7be : 898 + 0x19]
[task 2017-06-28T00:56:55.410874Z] 00:56:55     INFO -     eip = 0xf2413589   esp = 0xffbfa9b0   ebp = 0xffbfaa08
[task 2017-06-28T00:56:55.412717Z] 00:56:55     INFO -     Found by: stack scanning
[task 2017-06-28T00:56:55.415040Z] 00:56:55     INFO - 12  libxul.so!nsIdleService::IdleTimerCallback [nsIdleService.cpp:d9489f5df7be : 717 + 0x8]
[task 2017-06-28T00:56:55.418554Z] 00:56:55     INFO -     eip = 0xf2413ff1   esp = 0xffbfaa10   ebp = 0xffbfab18   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.420308Z] 00:56:55     INFO -     esi = 0xf26511ff   edi = 0x000012bd
[task 2017-06-28T00:56:55.421989Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.423764Z] 00:56:55     INFO - 13  libxul.so!nsTimerImpl::Fire [nsTimerImpl.cpp:d9489f5df7be : 520 + 0x19]
[task 2017-06-28T00:56:55.425577Z] 00:56:55     INFO -     eip = 0xf02aa9e9   esp = 0xffbfab20   ebp = 0xffbfaba8   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.429545Z] 00:56:55     INFO -     esi = 0xffffffff   edi = 0x00000001
[task 2017-06-28T00:56:55.431518Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.433403Z] 00:56:55     INFO - 14  libxul.so!nsTimerEvent::Run [TimerThread.cpp:d9489f5df7be : 286 + 0x18]
[task 2017-06-28T00:56:55.435263Z] 00:56:55     INFO -     eip = 0xf0294e78   esp = 0xffbfabb0   ebp = 0xffbfac08   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.436959Z] 00:56:55     INFO -     esi = 0x00000000   edi = 0xf715b340
[task 2017-06-28T00:56:55.438712Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.443037Z] 00:56:55     INFO - 15  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:d9489f5df7be : 1422 + 0x8]
[task 2017-06-28T00:56:55.444869Z] 00:56:55     INFO -     eip = 0xf02af5fa   esp = 0xffbfac10   ebp = 0xffbface8   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.446645Z] 00:56:55     INFO -     esi = 0x00000000   edi = 0xf715b340
[task 2017-06-28T00:56:55.448414Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.450202Z] 00:56:55     INFO - 16  libxul.so!NS_ProcessPendingEvents [nsThreadUtils.cpp:d9489f5df7be : 416 + 0x15]
[task 2017-06-28T00:56:55.452187Z] 00:56:55     INFO -     eip = 0xf02acc35   esp = 0xffbfacf0   ebp = 0xffbfad38   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.453993Z] 00:56:55     INFO -     esi = 0xf715b340   edi = 0x00000000
[task 2017-06-28T00:56:55.458949Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.460824Z] 00:56:55     INFO - 17  libxul.so!mozilla::ShutdownXPCOM [XPCOMInit.cpp:d9489f5df7be : 900 + 0x5]
[task 2017-06-28T00:56:55.462649Z] 00:56:55     INFO -     eip = 0xf02c52d9   esp = 0xffbfad40   ebp = 0xffbfad88   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.464346Z] 00:56:55     INFO -     esi = 0xf715b340   edi = 0xffffffff
[task 2017-06-28T00:56:55.466023Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.467872Z] 00:56:55     INFO - 18  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup [nsAppRunner.cpp:d9489f5df7be : 1467 + 0xa]
[task 2017-06-28T00:56:55.469682Z] 00:56:55     INFO -     eip = 0xf386617b   esp = 0xffbfad90   ebp = 0xffbfadb8   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.471382Z] 00:56:55     INFO -     esi = 0xf715f344   edi = 0xf715f344
[task 2017-06-28T00:56:55.474410Z] 00:56:55     INFO -     Found by: call frame info
[task 2017-06-28T00:56:55.476340Z] 00:56:55     INFO - 19  libxul.so!XREMain::XRE_main [UniquePtr.h:d9489f5df7be : 528 + 0x8]
[task 2017-06-28T00:56:55.478923Z] 00:56:55     INFO -     eip = 0xf386c5ea   esp = 0xffbfadc0   ebp = 0xffbfae18   ebx = 0xf5621f98
[task 2017-06-28T00:56:55.480722Z] 00:56:55     INFO -     esi = 0x00000000   edi = 0xf715f344
[task 2017-06-28T00:56:55.482857Z] 00:56:55     INFO -     Found by: call frame info


I'm not sure who to direct this to for follow-up. Mike, do you have an idea?
Assignee: gbrown → nobody
Flags: needinfo?(mdeboer)
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=linux32%20opt%20browser-chrome-1&tochange=8fdb9e30b6a77b46cf81232bcff642d939f7aa51&fromchange=4229107601dd76168852410c0858a5ca42519acc

strongly suggests that this shutdown hang started (back when it was being reported as bug 1108587) with the change for bug 1368649, which modified the optimization flags used for Linux opt.

:glandium -- Any idea how your optimization change might have triggered this shutdown hang? (Note also that there's mozglue code in the stack for the hung ShutdownXPCOM thread, which you may be familiar with -- see comment 12 for the stack.)
Flags: needinfo?(mh+mozilla)
If the code is racy, optimization changes can make the race more likely to be lost...  the stack trace is also not very helpful, because it's mostly based on stack scanning, and looks completely wrong.
Flags: needinfo?(mh+mozilla)
(In reply to Carsten Book [:Tomcat] from comment #16)
> just a note: this crash seems to be somehow generic like we had 
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=autoland&bugfiler&noautoclassify&filter-
> searchStr=Linux%20opt%20Mochitests%20executed%20by%20TaskCluster%20test-
> linux32%2Fopt-mochitest-browser-chrome-7%20tc-
> M(bc7)&fromchange=99dfc776b001dca817851839f42a21c6f55ca1a5&selectedJob=110681
> 821
> 
> which turns out its not this bug here and more caused by Bug 1372072  which
> got backed out

A crash in RunWatchdog is pretty generic, in that it often indicates that shutdown is hung. We have tracked down and fixed particular causes of such shutdown hangs, like smaug's recent fix in bug 1358898. *This* @RunWatchdog crash is distinct in that it happens:
 - on linux32-opt, non-e10s
 - following the browser/components/sessionstore tests
 - in the crash report, one of the non-crashing threads has ShutdownXPCOM calling NS_ProcessPendingEvents

I am almost certain that this type of shutdown hang started with :glandium's change.

Bug 1372072 was not at fault in this case; see https://bugzilla.mozilla.org/show_bug.cgi?id=1372072#c20.
(In reply to Mike Hommey [:glandium] from comment #15)
> If the code is racy, optimization changes can make the race more likely to
> be lost... 

Absolutely.

> the stack trace is also not very helpful, because it's mostly
> based on stack scanning, and looks completely wrong.

The top frames are muddled, but maybe suggest:

...
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/mozglue/misc/TimeStamp_posix.cpp#77
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/mozglue/misc/TimeStamp_posix.cpp#207
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#788

?

...and the rest looks consistent to me:

https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#898
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#717
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsTimerImpl.cpp#520
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/TimerThread.cpp#286
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsThread.cpp#1422
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsThreadUtils.cpp#416
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/build/XPCOMInit.cpp#900


Other logs (we have lots of logs!) have a variation, with less stack scanning:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=110720627&lineNumber=3496

[task 2017-06-29T14:34:34.439503Z] 14:34:34     INFO - Thread 0
[task 2017-06-29T14:34:34.441541Z] 14:34:34     INFO -  0  libpthread-2.23.so + 0xc16b
[task 2017-06-29T14:34:34.443905Z] 14:34:34     INFO -     eip = 0xf76ea16b   esp = 0xfffaa67c   ebp = 0x04000001   ebx = 0xf712a664
[task 2017-06-29T14:34:34.447891Z] 14:34:34     INFO -     esi = 0x00000001   edi = 0xf712a660   eax = 0x00000001   ecx = 0x00000085
[task 2017-06-29T14:34:34.449659Z] 14:34:34     INFO -     edx = 0x00000001   efl = 0x00000202
[task 2017-06-29T14:34:34.451350Z] 14:34:34     INFO -     Found by: given as instruction pointer in context
[task 2017-06-29T14:34:34.453972Z] 14:34:34     INFO -  1  firefox!_fini + 0x6e84
[task 2017-06-29T14:34:34.455918Z] 14:34:34     INFO -     eip = 0x08073248   esp = 0xfffaa68c   ebp = 0x04000001
[task 2017-06-29T14:34:34.457780Z] 14:34:34     INFO -     Found by: stack scanning
[task 2017-06-29T14:34:34.459998Z] 14:34:34     INFO -  2  firefox!mozilla::detail::ConditionVariableImpl::notify_one [ConditionVariable_posix.cpp:d2aff6fc075d : 101 + 0xb]
[task 2017-06-29T14:34:34.461969Z] 14:34:34     INFO -     eip = 0x0804e0ad   esp = 0xfffaa690   ebp = 0x04000001
[task 2017-06-29T14:34:34.464042Z] 14:34:34     INFO -     Found by: stack scanning
[task 2017-06-29T14:34:34.466023Z] 14:34:34     INFO -  3  libxul.so!_fini + 0x139ee44
[task 2017-06-29T14:34:34.472131Z] 14:34:34     INFO -     eip = 0xf557621c   esp = 0xfffaa6a0   ebp = 0xfffaa6d8
[task 2017-06-29T14:34:34.473925Z] 14:34:34     INFO -     Found by: stack scanning
[task 2017-06-29T14:34:34.475888Z] 14:34:34     INFO -  4  libxul.so!nsTimerImpl::InitCommon [nsTimerImpl.cpp:d2aff6fc075d : 204 + 0x20]
[task 2017-06-29T14:34:34.477742Z] 14:34:34     INFO -     eip = 0xf01ebcf2   esp = 0xfffaa6e0   ebp = 0xfffaa708
[task 2017-06-29T14:34:34.479535Z] 14:34:34     INFO -     Found by: previous frame's frame pointer
[task 2017-06-29T14:34:34.481329Z] 14:34:34     INFO -  5  libxul.so!nsTimerImpl::InitWithFuncCallbackCommon [nsTimerImpl.cpp:d2aff6fc075d : 252 + 0x1d]
[task 2017-06-29T14:34:34.483070Z] 14:34:34     INFO -     eip = 0xf01ebda4   esp = 0xfffaa710   ebp = 0xfffaa758   ebx = 0xf557621c
[task 2017-06-29T14:34:34.487793Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.489624Z] 14:34:34     INFO -  6  libxul.so!nsTimerImpl::InitWithNamedFuncCallback [nsTimerImpl.cpp:d2aff6fc075d : 241 + 0x25]
[task 2017-06-29T14:34:34.491381Z] 14:34:34     INFO -     eip = 0xf01ec116   esp = 0xfffaa760   ebp = 0xfffaa788   ebx = 0xf557621c
[task 2017-06-29T14:34:34.493187Z] 14:34:34     INFO -     esi = 0xd239bd92   edi = 0x0132f3e9
[task 2017-06-29T14:34:34.494865Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.496830Z] 14:34:34     INFO -  7  libxul.so!nsIdleService::SetTimerExpiryIfBefore [nsIdleService.cpp:d2aff6fc075d : 839 + 0x30]
[task 2017-06-29T14:34:34.498639Z] 14:34:34     INFO -     eip = 0xf23626d5   esp = 0xfffaa790   ebp = 0xfffaa7e8   ebx = 0xf557621c
[task 2017-06-29T14:34:34.507725Z] 14:34:34     INFO -     esi = 0xd239bd92   edi = 0x0132f3e9
[task 2017-06-29T14:34:34.508693Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.512014Z] 14:34:34     INFO -  8  libxul.so!nsIdleService::ReconfigureTimer [nsIdleService.cpp:d2aff6fc075d : 898 + 0x19]
[task 2017-06-29T14:34:34.513039Z] 14:34:34     INFO -     eip = 0xf2362a49   esp = 0xfffaa7f0   ebp = 0xfffaa848   ebx = 0xf557621c
[task 2017-06-29T14:34:34.516280Z] 14:34:34     INFO -     esi = 0x00000000   edi = 0x00004465
[task 2017-06-29T14:34:34.517331Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.518277Z] 14:34:34     INFO -  9  libxul.so!nsIdleService::IdleTimerCallback [nsIdleService.cpp:d2aff6fc075d : 717 + 0x8]
[task 2017-06-29T14:34:34.519251Z] 14:34:34     INFO -     eip = 0xf23634b1   esp = 0xfffaa850   ebp = 0xfffaa958   ebx = 0xf557621c
[task 2017-06-29T14:34:34.523365Z] 14:34:34     INFO -     esi = 0xcaf6bbed   edi = 0x00004465
[task 2017-06-29T14:34:34.524341Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.531725Z] 14:34:34     INFO - 10  libxul.so!nsTimerImpl::Fire [nsTimerImpl.cpp:d2aff6fc075d : 520 + 0x19]
[task 2017-06-29T14:34:34.532647Z] 14:34:34     INFO -     eip = 0xf01edcb9   esp = 0xfffaa960   ebp = 0xfffaa9e8   ebx = 0xf557621c
[task 2017-06-29T14:34:34.533459Z] 14:34:34     INFO -     esi = 0xffffffff   edi = 0x00000001
[task 2017-06-29T14:34:34.534311Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.535928Z] 14:34:34     INFO - 11  libxul.so!nsTimerEvent::Run [TimerThread.cpp:d2aff6fc075d : 286 + 0x18]
[task 2017-06-29T14:34:34.541055Z] 14:34:34     INFO -     eip = 0xf01d8148   esp = 0xfffaa9f0   ebp = 0xfffaaa48   ebx = 0xf557621c
[task 2017-06-29T14:34:34.541913Z] 14:34:34     INFO -     esi = 0x00000000   edi = 0xf715b340
[task 2017-06-29T14:34:34.542736Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.547739Z] 14:34:34     INFO - 12  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:d2aff6fc075d : 1422 + 0x8]
[task 2017-06-29T14:34:34.555603Z] 14:34:34     INFO -     eip = 0xf01f28ca   esp = 0xfffaaa50   ebp = 0xfffaab28   ebx = 0xf557621c
[task 2017-06-29T14:34:34.556452Z] 14:34:34     INFO -     esi = 0x00000000   edi = 0xf715b340
[task 2017-06-29T14:34:34.557246Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.558113Z] 14:34:34     INFO - 13  libxul.so!NS_ProcessPendingEvents [nsThreadUtils.cpp:d2aff6fc075d : 416 + 0x15]
[task 2017-06-29T14:34:34.558984Z] 14:34:34     INFO -     eip = 0xf01eff05   esp = 0xfffaab30   ebp = 0xfffaab78   ebx = 0xf557621c
[task 2017-06-29T14:34:34.563972Z] 14:34:34     INFO -     esi = 0xf715b340   edi = 0x00000000
[task 2017-06-29T14:34:34.564811Z] 14:34:34     INFO -     Found by: call frame info
[task 2017-06-29T14:34:34.565707Z] 14:34:34     INFO - 14  libxul.so!mozilla::ShutdownXPCOM [XPCOMInit.cpp:d2aff6fc075d : 900 + 0x5]
[task 2017-06-29T14:34:34.566561Z] 14:34:34     INFO -     eip = 0xf02085a9   esp = 0xfffaab80   ebp = 0xfffaabc8   ebx = 0xf557621c
[task 2017-06-29T14:34:34.567613Z] 14:34:34     INFO -     esi = 0xf715b340   edi = 0xffffffff
[task 2017-06-29T14:34:34.576032Z] 14:34:34     INFO -     Found by: call frame info

So then:

...
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsTimerImpl.cpp#204
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsTimerImpl.cpp#252
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsTimerImpl.cpp#241 (I don't understand this one...maybe #262?)
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#839
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#898
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/widget/nsIdleService.cpp#717
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsTimerImpl.cpp#520
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/TimerThread.cpp#286
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsThread.cpp#1422
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/threads/nsThreadUtils.cpp#416
https://dxr.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136/xpcom/build/XPCOMInit.cpp#900
The entire stack is dubious. libxul.so!nsTimerImpl::InitCommon is supposedly found from the previous frame pointer, but I don't see it calling _fini. mozilla::detail::ConditionVariableImpl::notify_one is also not something that I'd expect to block (it should unblock another thread and return).
Hi Geoff, sorry for not getting back to this sooner, but I think you already suspected this: I haven't got a clue why this is happening so often. I also don't know where to begin investigating; a regression range might be a good start but my gut tells me that this must be originating from a lower level failure than JS sessionstore. Something on the platform side of things is causing linux-only crashes and sessionstore might certainly trigger it, but so might something else.
Flags: needinfo?(mdeboer)
FWIW, it seems that this started to permafail with a merge to mozilla-central made on the 2nd of July. 
Almost all the try builds on central after this one have a failure for this test on linux 32 opt.

try run on central for suspect merge: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=4d3de12dcdc539f14fcb06539da39fa7176c8955
corresponding changelog: https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=4d3de12dcdc539f14fcb06539da39fa7176c8955
(In reply to Julian Descottes [:jdescottes] from comment #27)
> FWIW, it seems that this started to permafail with a merge to
> mozilla-central made on the 2nd of July. 
> Almost all the try builds on central after this one have a failure for this
> test on linux 32 opt.
> 
> try run on central for suspect merge:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=4d3de12dcdc539f14fcb06539da39fa7176c8955
> corresponding changelog:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?changeset=4d3de12dcdc539f14fcb06539da39fa7176c8955

My bad, ignore that, the test simply was running in a different bucket before that, so I missed it. Nothing related to the changelog I mentioned in the end. Sorry about the noise.
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=2492e1f5b1a72d2596b282e3e58f736e5a1ba364 I found that it is easy to reproduce the hang with MOZ_LOG logging. I believe the log verifies that the hang is in nsTimerImpl::InitCommon. 

I can dig deeper next week.
Flags: needinfo?(gbrown)
On the activity-stream side of things, the code seems to trigger this crash relatively consistently. For those looking to track down the source of this crash, preffing on activity-stream might help shift it from intermittent to perma-red.

You'll want to apply turn it on in firefox.js:

diff --git a/browser/app/profile/firefox.js b/browser/app/profile/firefox.js
--- a/browser/app/profile/firefox.js
+++ b/browser/app/profile/firefox.js
@@ -1273 +1273 @@
-pref("browser.newtabpage.activity-stream.enabled", false);
+pref("browser.newtabpage.activity-stream.enabled", true);

In our investigation, removing a setTimeout with no time delay that runs on "sessionstore-windows-restored" significantly reduces the crash rate of potentially a related/same crash. Where mconley's current thought is that Timer.jsm might need to observe shutdown to clear timeouts from the Map. This might match up with gbrown's investigation re: nsTimerImpl::InitCommon ?

setTimeout removed https://hg.mozilla.org/try/rev/a7d0531396ccf51207cad47c8dc3d11aa1e71ab3
I pushed to try some cleaning up of timers on "quit-application-granted" as mconley suggested, and it seems to have fixed this perma-crash with activity-stream enabled:

with cleaning:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c499241e208fcc1246ada6cbb84d5a7cddbb3886&filter-searchStr=linux32%2Fopt%20bc3&group_state=expanded

without cleaning:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9066a2eda2e6c45ef051a9f7ce87ca90c5f31051&selectedJob=113113820&filter-searchStr=linux32%2Fopt%20bc3&group_state=expanded

I also added some debugging, and at a glance, there might be some oddness:
- two Timers.jsm scopes active and only one gets cleaned up?
- sessionstore is calling clearTimer with null during startup
- sessionstore clears timers that have triggered (unnecessary but not dangerous)
- sessionstore timers are being added during quit-application to save state
See Also: → 1379781
(In reply to Geoff Brown [:gbrown] from comment #31)
> In https://treeherder.mozilla.org/#/jobs?repo=try&revision=2492e1f5b1a72d2596b282e3e58f736e5a1ba364 I found that
> it is easy to reproduce the hang with MOZ_LOG logging. I believe the log
> verifies that the hang is in nsTimerImpl::InitCommon. 
> 
> I can dig deeper next week.

MOZ_LOGs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=425575f4c2b21c5d3fb2596aafde7f3bb349da9f indicate nsTimerImpl::InitCommon calls TimerThread::AddTimer, which blocks indefinitely on the call to mMonitor.Notify().

https://dxr.mozilla.org/mozilla-central/rev/91c943f7373722ad4e122d98a2ddd6c79708b732/xpcom/threads/TimerThread.cpp#568


2017-07-10 22:43:18.453531 UTC - [Main Thread]: D/nsTimerImpl >>> InitCommon before AddTimer
2017-07-10 22:43:18.453542 UTC - [Main Thread]: D/nsTimerImpl TimerThread::Init [1]
2017-07-10 22:43:18.453553 UTC - [Main Thread]: D/nsTimerImpl AddTimer calls Notify on monitor
2017-07-10 22:43:18.453991 UTC - [Main Thread]: D/nsTimerImpl AddTimer back from Notify
2017-07-10 22:43:18.454025 UTC - [Timer]: D/nsTimerImpl Timer thread woke up 0.487414ms from when it was supposed to
2017-07-10 22:43:18.454047 UTC - [Timer]: D/nsTimerImpl waiting for 589
2017-07-10 22:43:18.454067 UTC - [Main Thread]: D/nsTimerImpl <<< InitCommon after AddTimer
2017-07-10 22:43:18.454083 UTC - [Main Thread]: D/nsTimerImpl [this=0xdb913080] Took 0.597563ms to fire timer callback
2017-07-10 22:43:18.454152 UTC - [Main Thread]: D/nsTimerImpl [this=0xe585b0c0] time between PostTimerEvent() and Fire(): 0.111107ms
2017-07-10 22:43:18.454168 UTC - [Main Thread]: D/nsTimerImpl [this=0xdb913080] expected delay time    0ms
2017-07-10 22:43:18.454178 UTC - [Main Thread]: D/nsTimerImpl [this=0xdb913080] actual delay time      0ms
2017-07-10 22:43:18.454188 UTC - [Main Thread]: D/nsTimerImpl [this=0xdb913080] (mType is 0)       -------
2017-07-10 22:43:18.454210 UTC - [Main Thread]: D/nsTimerImpl [this=0xdb913080]     delta              0ms
2017-07-10 22:43:18.454233 UTC - [Main Thread]: D/nsTimerImpl >>> InitCommon before AddTimer
2017-07-10 22:43:18.454244 UTC - [Main Thread]: D/nsTimerImpl TimerThread::Init [1]
2017-07-10 22:43:18.454256 UTC - [Main Thread]: D/nsTimerImpl AddTimer calls Notify on monitor
Flags: needinfo?(gbrown)
Sorry for the false lead in comment 34. Turns out it wasn't the clearing of timeouts that avoided the test failure. I had added some dump statements that included the stack, and turns out dumping 10 lines works just as well to avoid the crash..........

with dump: https://treeherder.mozilla.org/#/jobs?repo=try&revision=08e27814ceaf0db02ab5f649c67f5ccc49071b51&filter-searchStr=bc3
w/o dumps: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d7843a184d921b9dd7881c1975358bcc3e0e33&filter-searchStr=bc3
Component: Session Restore → XPCOM
Product: Firefox → Core
Can we find an owner for this, it is blocking Activity Stream being preffed on on Nightly.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(erahm)
Priority: P3 → --
Thanks for the additional attention here!

To re-cap what we know so far:

This is a persistent, frequent intermittent failure in linux32-opt, non-e10s only. It likely started with bug 1368649, which modified the optimization flags used for Linux opt (comment 15). It is a shutdown hang which follows the browser/components/sessionstore tests, but it does not appear to be strongly associated with any one test; disabling 20 or so sessionstore tests reduces the frequency of hangs. Crash reports are available, but disputed (comment 15, comment 19). It is easily reproducible on try, even with MOZ_LOG logging. Logging indicates nsTimerImpl::InitCommon calls TimerThread::AddTimer, which blocks indefinitely on the call to mMonitor.Notify() (comment 35). Enabling activity-stream seems to make this failure more frequent and :mardak identified some suspicious timer conditions at shutdown (comment 33, comment 34, comment 37).
Those stacks are in nsIdleService where it calls |ReconfigureTimer| [1], presumably it's doing something bad during shutdown. The actual hang is in a pthread_cond_signal call, which should not block. That seems to indicate the cond var is corrupt, possibly due to nsTimerThread being destroyed already?

We should probably move this to core :: widget, but that might be a dead-end as none of the owners/peers are active.

[1] http://searchfox.org/mozilla-central/rev/31311070d9860b24fe4a7a36976c14b328c16208/widget/nsIdleService.cpp#714-719
I don't have anything to add beyond what erahm said, sorry.
Flags: needinfo?(n.nethercote)
How would one go about tracking down what timer is trying to be added when it blocks? I tried having Timers.jsm not actually make timers after quit-application-granted, but that didn't seem to help avoid the crash.
Whereas:
 - this is a very frequent intermittent
 - we don't seem to be making progress on fixing it
 - it only affects linux non-e10s
 - we intend to stop running most non-e10s tests in a few weeks

Can we just skip the sessionstore tests on linux non-e10s?

:erahm, others - any objections?
Attachment #8886635 - Flags: review?(mdeboer)
Comment on attachment 8886635 [details] [diff] [review]
skip sessionstore tests on non-e10s linux

Review of attachment 8886635 [details] [diff] [review]:
-----------------------------------------------------------------

OK, all of them! Well, I'm not opposed for all the valid reasons stated in comment 46.
Attachment #8886635 - Flags: review?(mdeboer) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/82f7afb4f46b
Skip all browser/components/sessionstore tests on non-e10s Linux; r=mikedeboer
Keywords: leave-open
Whiteboard: [stockwell needswork] → [stockwell disabled]
Flags: needinfo?(erahm)
The leave-open keyword is there and there is no activity for 6 months.
:froydnj, maybe it's time to close this bug?
Flags: needinfo?(nfroyd)
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(nfroyd)
Resolution: --- → FIXED
Assignee: nobody → gbrown
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: