Firefox shutdown crash in testHomeButton.js

RESOLVED FIXED

Status

Mozilla QA
Mozmill Tests
P1
normal
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: lizzard, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

I sorted this list of test reports on failures and went through the failed runs: http://mozmill-release.blargon7.com/#/functional/reports?app=All&branch=31.0&platform=All&from=2014-06-10&to=2014-06-11 Some of the individual reports for the failed tests are blank, like this one:

http://mozmill-release.blargon7.com/#/functional/report/3ed251269efbb25c65b0fc5d9d384e59

This may be from Firefox crashing (process-crash here: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/5968/console)

"10:32:33 PROCESS-CRASH | c:\jenkins\workspace\ondemand_functional\data\mozmill-tests\firefox\tests\functional\testToolbar\testHomeButton.js | application crashed [Unknown top frame]"

whimboo explained to me that the .extra file is missing, so we can't put this in crash-stats.

I'm not sure how to dig further into this right now, but here is a placeholder for this problem.
(Reporter)

Comment 1

4 years ago
This appears to have crashed/failed in several other locales, across several platforms.
(Reporter)

Updated

4 years ago
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
(Reporter)

Comment 2

4 years ago
Here are the process-crash tests that come out with blank reports. 

Mozmill ondemand_functional testrun for Firefox 31.0 ms on mm-win-7-64-2 (2014-06-11_10-36-48) completed with 1 failures: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/6003/console

Mozmill ondemand_functional testrun for Firefox 31.0 zu on mm-win-7-64-3 (2014-06-11_10-36-10) completed with 1 failures. http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/6002/

Mozmill ondemand_functional testrun for Firefox 31.0 lij on mm-win-7-64-4 (2014-06-11_10-22-04) completed with 1 failures. http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/5971/

Mozmill ondemand_functional testrun for Firefox 31.0 ff on mm-win-7-32-1 (2014-06-11_10-21-43) completed with 1 failures. http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/5966/

Mozmill ondemand_functional testrun for Firefox 31.0 kn on mm-win-7-32-2 (2014-06-11_10-21-46) completed with 1 failures. http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_functional/5967/

Comment 3

This bug should actually be about the crash when running our tests, not about the report that is not visible in the dashboard. For the latter, an issue has to be filed on the GitHub repository (http://github.com/whimboo/mozmill-dashboard).

Sadly, none of these crashes caused Firefox to create the .extra file for the minidump, so we will not be able to send the report to crash-stats. :( We will have to analyze the minidump with WinDbg on Windows. Anthony also mentioned crashes on OS X, so gdb should be helpful there too. But Liz didn't add those in her last comment, so a look into the email archive might be necessary.
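
To see quickly which dumps are affected, a small script can list the .dmp files in a directory that have no matching .extra file. This is only a sketch: the function name `dumps_missing_extra` is made up for illustration, and it assumes the .extra file is saved next to the .dmp with the same base name.

```python
from pathlib import Path


def dumps_missing_extra(dump_dir):
    """Return sorted names of .dmp files in dump_dir that lack a sibling .extra file."""
    missing = []
    for dump in Path(dump_dir).glob("*.dmp"):
        # Assumption: the metadata file shares the dump's base name,
        # e.g. 1234.dmp goes with 1234.extra.
        if not dump.with_suffix(".extra").exists():
            missing.append(dump.name)
    return sorted(missing)
```

Calling `dumps_missing_extra(...)` on the "Crash Reports\pending" directory of the test machine would then print the candidates that cannot be submitted as they are.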

For how to work with WinDbg we have good docs on MDN:
https://developer.mozilla.org/en-US/docs/How_to_get_a_stacktrace_with_WinDbg

Andrei, can you have a look at this please? I would appreciate that. Thanks.
Flags: needinfo?(andrei.eftimie)
Summary: Blank report in mozmill from crash in testHomeButton.js → Firefox shutdown crash in testHomeButton.js
Version: Version 2 → unspecified

Comment 4

4 years ago
The MDN docs are for hooking up to a process and reproducing the crash.

I've set up Visual Studio, installed the correct symbols and WinDBG, and tried investigating the minidump itself, but it fails with the following message:

> Failure when opening dump file
> 'C:\Users\mozauto\AppData\Roaming\Mozilla\Firefox\Crash
> Reports\pending\d55ccfe6-5a92-4c5f-b926-e52f89e3416b.dmp',
> HRESULT 0x80004005
> It may be corrupt or in a format not understood by the debugger.
>
> Unspecified error

I can open up older dmp files, even those without the .extra file.
It might be that the recent crash dumps are corrupt.

I don't know how to investigate this further...
Flags: needinfo?(andrei.eftimie)

Comment 5

(In reply to Andrei Eftimie from comment #4)
> I've set up Visual Studio, installed the correct symbols and WinDBG, and
> tried investigating the minidump itself, but it fails with the following
> message:
> 
> > Failure when opening dump file
> > 'C:\Users\mozauto\AppData\Roaming\Mozilla\Firefox\Crash
> > Reports\pending\d55ccfe6-5a92-4c5f-b926-e52f89e3416b.dmp',
> > HRESULT 0x80004005
> > It may be corrupt or in a format not understood by the debugger.
> >
> > Unspecified error
> 
> I can open up older dmp files, even those without the .extra file.
> It might be that the recent crash dumps are corrupt.

That's very strange. Maybe we produce invalid minidumps sometimes? Benjamin or Ted, do you have an idea what could be the problem here? Would it help to give you the minidump?

Andrei, while we are waiting for feedback from Benjamin and Ted, I would suggest that you upload the minidump as an attachment to this bug.

Comment 6

4 years ago
Created attachment 8439105 [details]
e60bc6ea-77e8-48fc-8c15-67554252cc85.dmp

There were 2 crashes yesterday. I'm uploading both .dmp files

Comment 7

4 years ago
Created attachment 8439106 [details]
d55ccfe6-5a92-4c5f-b926-e52f89e3416b.dmp

Comment 8

4 years ago
Also a crash on staging: http://mm-ci-staging.qa.scl3.mozilla.com:8080/job/mozilla-central_functional/2316/console

Why does the "WIFI GEO: shutdown called" message appear here? It also appears earlier (a couple of tests _after_ the Geolocation test).
> 07:52:49 TEST-START | testToolbar\testHomeButton.js | teardownModule
> 07:52:49 TEST-END | testToolbar\testHomeButton.js | finished in 260ms
> 07:52:52 *** WIFI GEO: shutdown called
> 07:52:52 
> 07:52:52 ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
> 07:52:52 
> 07:52:52 
> [...] // 12 other messages like this
> 07:52:52 
> 07:52:52 
> 07:52:52 ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
> 07:52:52 
> 07:52:53 PROCESS-CRASH | c:\jenkins\workspace\mozilla-central_functional\data\mozmill-tests\firefox\tests\functional\testToolbar\testHomeButton.js | application crashed [Unknown top frame]
> 07:52:53 Crash dump filename: c:\jenkins\workspace\mozilla-central_functional\data\profile\minidumps\77cd8f88-2e08-4756-83ea-70005144343a.dmp
> 07:52:53 No symbols path given, can't process dump.
> 07:52:53 MINIDUMP_STACKWALK not set, can't process dump.
> 07:52:53 mozcrash INFO | Saved minidump as C:\Users\mozauto\AppData\Roaming\Mozilla\Firefox\Crash Reports\pending\77cd8f88-2e08-4756-83ea-70005144343a.dmp

Comment 9

Both of those minidumps are malformed. They have a valid header but are missing the data directory which actually points to the contents. I'm not sure why this would be happening. If it's crashing during shutdown, it's possible that it's a race condition of some sort.
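
The kind of damage described here can be checked without a debugger. The following is a rough sketch, assuming that "data directory" refers to the stream directory that the header points to. It relies on the publicly documented MINIDUMP_HEADER and MINIDUMP_DIRECTORY layouts; the function name `check_minidump` and the problem strings are made up for illustration, and this is not how WinDbg or Breakpad validate dumps.

```python
import struct

# MINIDUMP_HEADER (32 bytes, little-endian): Signature, Version,
# NumberOfStreams, StreamDirectoryRva, CheckSum, TimeDateStamp, Flags (64-bit).
HEADER = struct.Struct("<4sIIIIIQ")
# MINIDUMP_DIRECTORY entry (12 bytes): StreamType, DataSize, Rva.
DIR_ENTRY = struct.Struct("<III")


def check_minidump(data):
    """Return a list of problems found in the raw bytes of a minidump."""
    if len(data) < HEADER.size:
        return ["file shorter than the minidump header"]
    sig, _ver, nstreams, dir_rva, _sum, _ts, _flags = HEADER.unpack_from(data)
    if sig != b"MDMP":
        return ["bad signature %r" % sig]
    problems = []
    if nstreams == 0:
        problems.append("header lists zero streams")
    # The directory is an array of nstreams entries starting at dir_rva.
    if dir_rva + nstreams * DIR_ENTRY.size > len(data):
        problems.append("stream directory lies outside the file")
        return problems
    for i in range(nstreams):
        stype, size, rva = DIR_ENTRY.unpack_from(data, dir_rva + i * DIR_ENTRY.size)
        if rva + size > len(data):
            problems.append("stream %d (type %d) points outside the file" % (i, stype))
    return problems
```

Running `check_minidump(open(path, "rb").read())` on one of the attached .dmp files would show whether the header is intact while the directory or the streams it refers to are missing.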

Comment 10

Created attachment 8439233 [details]
jenkins_esr24.txt

This also fails with esr24, so it's not a bug specific to beta. It also failed only on nodes where we had made Flash and Java updates. I ran the testrun 10 times on a Win 7 node with only the Flash update, and then with the Java update too, but I couldn't reproduce the crash.

Comment 11

Comment on attachment 8439233 [details]
jenkins_esr24.txt

>04:59:49 Crash dump filename: c:\jenkins\workspace\mozilla-esr24_functional\data\profile\minidumps\55933401-5ca0-402b-b4ad-1cb49cbe32d9.dmp
>04:59:49 No symbols path given, can't process dump.
>04:59:49 MINIDUMP_STACKWALK not set, can't process dump.
>04:59:49 mozcrash INFO | Saved minidump as C:\Users\mozauto\AppData\Roaming\Mozilla\Firefox\Crash Reports\pending\55933401-5ca0-402b-b4ad-1cb49cbe32d9.dmp
>04:59:49 mozcrash INFO | Saved app info as C:\Users\mozauto\AppData\Roaming\Mozilla\Firefox\Crash Reports\pending\55933401-5ca0-402b-b4ad-1cb49cbe32d9.extra

This log shows an .extra file. Have you done any investigation for that crash, Cosmin?

Comment 12

Here is the crash report: bp-fe4a1171-4af8-45a6-aa1d-63f412140612

So this is actually bug 980938. Cosmin, have you installed the release version of Flash on all those Windows boxes? This is not what we should do! You should know that from all the trouble we had with those Flash crashes!
Flags: needinfo?(cosmin.malutan)

Comment 13

(In reply to Andrei Eftimie from comment #4)
> The MDN docs are for hooking up to a process and reproducing the crash.
> 
> I've set up Visual Studio, installed the correct symbols and WinDBG, and
> tried investigating the minidump itself, but it fails with the following
> message:

Can I ask where you have installed Visual Studio? I see that mm-win-7-32-3 got massive amounts of new packages installed today. I really hope you haven't done this on the production machine!
Flags: needinfo?(cosmin.malutan) → needinfo?(andrei.eftimie)

Comment 14

Liz, it would be great if you could re-build all the crashed jobs. I downgraded Flash on all the boxes to 13.0.0.124 debug, so we should not see those crashes anymore. Please let us know about the results. Thanks.
(Reporter)

Comment 15

4 years ago
whimboo, OK, I will do that this afternoon!
(Reporter)

Comment 16

4 years ago
OK, I rebuilt them all, including the Mac failures, and they aren't failing now! So I think you found the culprit.

Comment 17

That's great to hear. So I'm going to close this bug now, given that the debug version of Flash fixed the issue.

Ted, regarding the broken minidumps, where should I file that bug? I think it would be good to get this investigated and fixed if it is a core issue.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Flags: needinfo?(ted)
Resolution: --- → FIXED

Comment 18

Here's the thing: on Windows we use a Microsoft API to write minidumps, so we don't have any control over that code. It's possible that the bug here is in the way we inject Breakpad into Adobe's Flash processes; bsmedberg could comment on that.
Flags: needinfo?(ted)

Comment 19

4 years ago
(In reply to Henrik Skupin (:whimboo) from comment #13)
> (In reply to Andrei Eftimie from comment #4)
> > The MDN docs are for hooking up to a process and reproducing the crash.
> > 
> > I've set up Visual Studio, installed the correct symbols and WinDBG, and
> > tried investigating the minidump itself, but it fails with the following
> > message:
> 
> Can I ask where you have installed Visual Studio? I see that mm-win-7-32-3
> got massive amounts of new packages installed today. I really hope you
> haven't done this on the production machine!

Of course I've done it on mm-win-7-32-3.
We had some crashes which we didn't know if we could reproduce, and a couple of dmp files.
If those dmp files led to nothing, we might have needed to actually use windbg _while_ generating the crash. Where else would be the best place to have everything set up than the machine where the crash happened in the first place?

Now that we sorted things out, I've cleaned the machine (removed Visual Studio and all its dependencies).
I've left the symbols on FS1-QA since they can be linked from a network drive, so if we need them later, they are there.
Flags: needinfo?(andrei.eftimie)

Comment 20

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #18)
> Here's the thing: on Windows we use a Microsoft API to write minidumps, so
> we don't have any control over that code. It's possible that the bug here is
> in the way we inject Breakpad into Adobe's Flash processes, bsmedberg could
> comment on that.

Benjamin or Georg, what do you both think about that? Is this something we can improve and that I should file a bug for? Or is there already one around, and the problem is known? Hm, maybe this could be the underlying issue for bug 981641, and we can cover it there?
Flags: needinfo?(georg.fritzsche)
Flags: needinfo?(benjamin)

Comment 21

(In reply to Andrei Eftimie from comment #19)
> Of course I've done it on mm-win-7-32-3.
> We had some crashes which we didn't know if we could reproduce and a couple
> of dmp files.

When it comes to analyzing crashes, you don't have to do this on the exact same machine. Any other installation will be sufficient. Also, in the case of WinDbg, you do not have to install the whole of Visual Studio; the debugging tools are enough.

So next time, please check first whether you can do this locally. If there is no other way, use one of the CI machines, preferably a staging one rather than a production one.

> If those dmp files led to nothing we might have needed to actually use
> windbg _while_ generating the crash. Where else would be the best place to
> have everything set up than the machine where the crash happened in the
> first place?

I'm not worried about the debugging tools being installed, but Visual Studio can leave traces behind, so I don't feel good about doing those things on production machines.

> Now that we sorted things out, I've cleaned the machine (removed Visual
> Studio and all its dependencies).
> I've left the symbols on FS1-QA since they can be linked from a network
> drive, so if we need them later, they are there.

Thanks

Comment 22

(In reply to Henrik Skupin (:whimboo) from comment #20)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #18)
> > Here's the thing: on Windows we use a Microsoft API to write minidumps, so
> > we don't have any control over that code. It's possible that the bug here is
> > in the way we inject Breakpad into Adobe's Flash processes, bsmedberg could
> > comment on that.
> 
> Benjamin, or Georg what do you both think about that? Something we can
> improve and I should file a bug for? Or is there already one around, and the
> problem known. Hm, maybe this could be the underlying issue for bug 981641,
> and we can cover it there?

Are there clear STR? Can you file a bug on it with steps?

Comment 23

4 years ago
Given the nature of the bug here, I don't think this is high value, and debugging/fixing it will probably require significant effort. So if you have specific STR and want to file, that's fine, but expect it to sit.
Flags: needinfo?(georg.fritzsche)
Flags: needinfo?(benjamin)

Comment 24

(In reply to Georg Fritzsche [:gfritzsche] from comment #22)
> > Benjamin, or Georg what do you both think about that? Something we can
> > improve and I should file a bug for? Or is there already one around, and the
> > problem known. Hm, maybe this could be the underlying issue for bug 981641,
> > and we can cover it there?
> 
> Are there clear STR? Can you file a bug on it with steps?

It happens on and off. There are no clear steps yet, but we have a high reproduction rate for it. Let's move this discussion over to bug 981641.