Closed Bug 1220430 Opened 9 years ago Closed 8 years ago

Intermittent PROCESS-CRASH | tsvgx | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_P

Categories: Core :: DOM: Content Processes (defect)
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
Tracking Status: firefox45 --- affected
Reporter: philor (Unassigned)
Keywords: intermittent-failure

https://treeherder.mozilla.org/logviewer.html#?job_id=16627810&repo=mozilla-inbound

 PROCESS-CRASH | tsvgx | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_POINTERS *,MDRawAssertionInfo *,bool),void *)] 
 PROCESS-CRASH | tsvgx | application crashed [@ KiFastSystemCallRet + 0x0]
Blocks: 1220489
Blocks: 1220933
Blocks: 1220934
Interesting set of bugs. Looking at this in slightly more detail, I see that we complete the test successfully, and it appears that we make it past the code which parses and stores the results, but we then fail on the check for crashes. In fact the browser exits with error code 0, so what we are seeing is minidumps left sitting around.
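
For context, that check essentially amounts to scanning the minidump directory after the run: if any .dmp files are left behind, the harness reports PROCESS-CRASH even though the browser exited with code 0. Below is a minimal sketch of the idea; the real check lives in the Python harness, and the directory name, test name, and output format here are only illustrative:

    // Sketch of the post-run crash check: the browser exited cleanly, but any
    // minidumps left in the dump directory are still reported as PROCESS-CRASH.
    // Paths and output format are illustrative only.
    #include <filesystem>
    #include <iostream>

    namespace fs = std::filesystem;

    // Returns true if leftover .dmp files were found (i.e. the run should fail).
    bool CheckForLeftoverMinidumps(const fs::path& dumpDir, const char* testName) {
      bool foundCrash = false;
      if (!fs::exists(dumpDir)) {
        return false;
      }
      for (const auto& entry : fs::directory_iterator(dumpDir)) {
        if (entry.path().extension() == ".dmp") {
          // The real harness symbolicates the dump to get the top stack frame;
          // here we only report the file name.
          std::cout << "PROCESS-CRASH | " << testName << " | application crashed ["
                    << entry.path().filename().string() << "]\n";
          foundCrash = true;
        }
      }
      return foundCrash;
    }

    int main() {
      int browserExitCode = 0;  // the browser itself exited cleanly
      bool crashed = CheckForLeftoverMinidumps("minidumps", "tsvgx");
      // The run fails if minidumps were found, even though the exit code was 0.
      return (browserExitCode != 0 || crashed) ? 1 : 0;
    }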

The question I have is: should we take action on these, or ignore them? Talos proper works just fine. If we do surface these errors and file bugs for them, we should have somebody looking at them. I would vote for not checking for crashes when the run itself succeeds, but I know others might have reasons to care about the errors.

As a note, the related bugs here all seem to follow the same pattern.

:wlach, what are your thoughts on this?
Flags: needinfo?(wlachance)
Interesting question: is Talos meant to report only performance changes? It seems like the browser crashed here, which is probably worth investigating, but maybe not under the Talos bug category?
yeah, talos is designed to measure performance and it is possible to do that even with these *crashes*.  If we could find somebody who can make these crashes actionable, that would be great, otherwise we are missing data and creating more noise for the sheriffs.
Firefox crashing during normal operation is super serious, not something we can ignore. We should get someone to look into these problems when they happen. I'm not sure who that should be, but we should find out IMO.
Flags: needinfo?(wlachance)
(In reply to Joel Maher (:jmaher) from comment #3)
> yeah, talos is designed to measure performance and it is possible to do that
> even with these *crashes*.  If we could find somebody who can make these
> crashes actionable, that would be great, otherwise we are missing data and
> creating more noise for the sheriffs.

There is a stack trace of the crash; that will probably help in investigating the issue, even if it is not easily reproducible.
Ted, given the dump from comment 0, where do we look for the failure? It shows Breakpad at the top of the stack.
Flags: needinfo?(ted)
If you look down to frame 4 you'll see:
 17:03:56 INFO - 4 xul.dll!mozilla::dom::ContentParent::ForceKillTimerCallback(nsITimer *,void *) [ContentParent.cpp:fe6809fd4d43 : 3543 + 0xd] 

This is the chrome process detecting that the content process is not responding, writing a pair of minidumps for itself+the content process and then killing the content process.
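
To illustrate the mechanism, here is a rough sketch of that parent-side watchdog pattern: arm a timer when the content process is asked to shut down, and if it never responds, write paired minidumps and hard-kill it. This is not Gecko's actual ContentParent code; the types and helper names (ContentProcessHandle, WritePairedMinidumps, KillChildProcess) are hypothetical stand-ins:

    // Simplified sketch of the force-kill watchdog described above, not the
    // real ContentParent/ForceKillTimerCallback code. The helpers below are
    // hypothetical stand-ins for Breakpad and OS process-control calls.
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    struct ContentProcessHandle {
      int pid = 0;  // hypothetical identifier for the child process
    };

    void WritePairedMinidumps(const ContentProcessHandle&) { /* dump parent + child */ }
    void KillChildProcess(const ContentProcessHandle&)     { /* hard-kill the child */ }

    class ForceKillWatchdog {
     public:
      // Arm the watchdog when the parent asks the child to shut down.
      void Arm(ContentProcessHandle child, std::chrono::seconds timeout) {
        mWorker = std::thread([this, child, timeout] {
          std::unique_lock<std::mutex> lock(mMutex);
          // If the child never acknowledges shutdown, the wait times out and we
          // capture minidumps for diagnosis, then kill the unresponsive child.
          if (!mCv.wait_for(lock, timeout, [this] { return mChildExited; })) {
            WritePairedMinidumps(child);
            KillChildProcess(child);
          }
        });
      }

      // Called when the child shuts down normally, cancelling the force-kill.
      void ChildExited() {
        { std::lock_guard<std::mutex> lock(mMutex); mChildExited = true; }
        mCv.notify_one();
        if (mWorker.joinable()) mWorker.join();
      }

      ~ForceKillWatchdog() { ChildExited(); }  // make sure the worker is joined

     private:
      std::thread mWorker;
      std::mutex mMutex;
      std::condition_variable mCv;
      bool mChildExited = false;
    };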

Up a bit you can see:
 17:03:44 INFO - 2015-10-31 17:03:44,348 INFO : Browser exited with error code: 0 

The main process actually exited successfully after doing this.

If you look down to the next PROCESS-CRASH line you can see the stack for the content process. Some relevant lines are:
17:04:02     INFO -   9  xul.dll!mozilla::ipc::MessageChannel::WaitForSyncNotify(bool) [WindowsMessageLoop.cpp:fe6809fd4d43 : 1080 + 0x5]
17:04:02     INFO -      eip = 0x646bff7c   esp = 0x002ded20   ebp = 0x002ded68
17:04:02     INFO -      Found by: call frame info
17:04:02     INFO -  10  xul.dll!mozilla::ipc::MessageChannel::Send(IPC::Message *,IPC::Message *) [MessageChannel.cpp:fe6809fd4d43 : 946 + 0xa]
17:04:02     INFO -      eip = 0x646c7bb0   esp = 0x002ded70   ebp = 0x002dedc0
17:04:02     INFO -      Found by: call frame info
17:04:02     INFO -  11  xul.dll!mozilla::dom::PBrowserChild::SendGetInputContext(int *,int *,int *) [PBrowserChild.cpp:fe6809fd4d43 : 963 + 0x10]
17:04:02     INFO -      eip = 0x647a4dce   esp = 0x002dedc8   ebp = 0x002dee0c
17:04:02     INFO -      Found by: call frame info
17:04:02     INFO -  12  xul.dll!mozilla::widget::PuppetWidget::GetInputContext() [PuppetWidget.cpp:fe6809fd4d43 : 680 + 0x14]
17:04:02     INFO -      eip = 0x65309dad   esp = 0x002dee14   ebp = 0x002dee28
17:04:02     INFO -      Found by: call frame info

It's hung up doing a synchronous IPC call in some IME code.
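
To make the hang concrete, here is a minimal sketch of a blocking synchronous IPC send like the one in the stack above (GetInputContext -> SendGetInputContext -> Send -> WaitForSyncNotify): the child's main thread waits for the parent's reply, and if that reply never arrives the child is stuck until the parent's force-kill timer fires. The class and method names below are illustrative, not the real MessageChannel API:

    // Minimal sketch of a blocking synchronous IPC round trip. Names are
    // illustrative; the real implementation is mozilla::ipc::MessageChannel.
    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <string>

    struct Message { std::string payload; };

    class SyncChannel {
     public:
      // Child side: send a request and block until a reply arrives. If the
      // parent never answers, this waits forever and the child's main thread
      // is hung (the WaitForSyncNotify analogue in the stack above).
      Message Send(const Message& request) {
        std::unique_lock<std::mutex> lock(mMutex);
        mPending = request;  // conceptually handed to the parent process
        mCv.wait(lock, [this] { return mReply.has_value(); });
        Message reply = *mReply;
        mReply.reset();
        return reply;
      }

      // Parent side (another thread/process in reality): deliver the reply,
      // waking up the blocked sender.
      void Reply(const Message& reply) {
        { std::lock_guard<std::mutex> lock(mMutex); mReply = reply; }
        mCv.notify_one();
      }

     private:
      std::mutex mMutex;
      std::condition_variable mCv;
      std::optional<Message> mPending;
      std::optional<Message> mReply;
    };

    // The IME query from the stack is one such synchronous round trip: the
    // widget code asks the parent for the current input context and cannot
    // make progress until the answer comes back.
    Message SendGetInputContext(SyncChannel& channel) {
      return channel.Send({"GetInputContext"});
    }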
Flags: needinfo?(ted)
awesome, thanks ted!
Blocks: 1227716
Blocks: 1228035
Component: Talos → DOM: Content Processes
Product: Testing → Core
Blocks: 1234429
Using this bug to track the issue. We have a unique bug for each Talos test, which means we are seeing about 120 failures/week across them :(

Luckily this is about 95% Win7 e10s, with some Win8, all on trunk. We should investigate why the content process is hanging.
Flags: needinfo?(jmaher)
OK, my plan to look at recent failures, work out a failure rate, and collect info on where we are forcing this crash (timeout, etc.) seems silly now that none of these errors have happened since Jan 28th. I think I need to wait this out and see what comes up in the weekly reports; this might have been fixed by something else.
Flags: needinfo?(jmaher)
This failure hasn't been seen in 5+ weeks.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME