[Crash] [@ ? | mozilla::MediaStreamGraphImpl::EnsureRunInStableState ]

RESOLVED WORKSFORME

4 years ago
4 years ago

People

(Reporter: ntroast, Unassigned)

Tracking

({crash})

unspecified
ARM
Gonk (Firefox OS)
crash

Firefox Tracking Flags

(blocking-b2g:2.2+)

Details

(Whiteboard: [b2g-crash][caf-crash 606][caf priority: p3][CR 819433])

Attachments

(4 attachments)

(Reporter)

Description

4 years ago
We observed the following crash signature during testing.

[@ ? | mozilla::MediaStreamGraphImpl::EnsureRunInStableState | mozilla::MediaStreamGraphImpl::AppendMessage | mozilla::MediaStreamGraphImpl::UpdateConsumptionState ]

Cafbot will upload the decoded minidump and extra file.

This crash was produced during stability tests, which involve monkey testing for several hours, so there is no clear STR. If the issue cannot be identified from the provided logs, please feel free to provide us with a debug patch that adds logging to help identify it.
Created attachment 8589761 [details]
EXTRA file attachment -
Created attachment 8589762 [details]
decoded minidump -
Whiteboard: [CR 819433]
Whiteboard: [CR 819433] → [caf priority: p1][CR 819433]
Whiteboard: [caf priority: p1][CR 819433] → [b2g-crash][caf-crash 606][caf priority: p1][CR 819433]
Keywords: crash
Created attachment 8589817 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.123
Created attachment 8589818 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.123

Comment 6

4 years ago
Hi Maire,

Please have Paul or someone else on your team pick this up. It's affecting our ability to reach the fxOS 2.2 MTBF goal per CAF's tests. If this isn't code your team handles please help route this to the team that does.

Thanks!
Mike
Flags: needinfo?(mreavy)
Paul and I talked about the MSG crashes in our 1:1 earlier today.  The current theory (per my conversation with Paul) is that all the crashes have the same root cause.  These include bug 1152439, bug 1152431, bug 1152439 and this bug.  Paul -- if you have new info that changes this theory, please chime in.

What's not clear is whether Rob (roc) or Paul should take the lead on fixing the issue.  I know they have been talking intently about the possible root cause.

Paul, Rob -- Who makes the most sense to take the lead on this? I was thinking Rob, but I don't know what else Rob is working on.  Also, can we dupe the other 3 MSG crasher bugs to one bug or is that premature?

Mike -- We REALLY need a regression range if we can get it.  Per Paul, this issue does NOT repro in Nightly -- so it is likely something that got fixed (perhaps when we did a refactor).  And it is not easy for Paul to repro.  I believe he was able to repro it twice on tbpl/treeherder.  If it is happening regularly for someone else, can we get a regression range?
Flags: needinfo?(roc)
Flags: needinfo?(padenot)
Flags: needinfo?(mreavy)
Flags: needinfo?(mlee)

Comment 8

4 years ago
Thanks Maire.

(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #7)
> 
> Mike -- We REALLY need a regression range if we can get it.  Per Paul, this
> issue does NOT repro in Nightly -- so it is likely something that got fixed
> (perhaps when we did a refactor).  And it is not easy for Paul to repro.  I
> believe he was able to repro it twice on tbpl/treeherder.  If it is
> happening regularly for someone else, can we get a regression range?

Nick,
Can CAF provide a regression range for this issue?

Thanks,
Mike
Flags: needinfo?(mlee) → needinfo?(ntroast)
Keywords: regressionwindow-wanted
blocking-b2g: 2.2? → 2.2+
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #7)
> Mike -- We REALLY need a regression range if we can get it.  Per Paul, this
> issue does NOT repro in Nightly -- so it is likely something that got fixed
> (perhaps when we did a refactor).  And it is not easy for Paul to repro.  I
> believe he was able to repro it twice on tbpl/treeherder.

How? If we can ever reproduce anything like this on our test machines, that would be a big help, but I'll need to know how.

(In reply to cafbot (PoC: ggrisco) from comment #5)
> Created attachment 8589818 [details]
> decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.123

This decoded stack looks corrupt. MediaStreamGraphImpl::UpdateConsumptionState doesn't call AppendMessage, and in fact AppendMessage should not be called on the MSG thread along any path.
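To make the threading rule above concrete, here is a minimal sketch of a thread-affinity check. It is not Gecko code: `ToyGraph`, `AppendMessageFrom`, `AssertOnOwnerThread` and `PendingCount` are hypothetical names, and the sketch simply assumes that messages may only be appended from the thread that created the graph. A stack that shows an append arriving from any other thread would then be a sign that the stack (or the memory behind it) is not trustworthy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical toy graph, NOT Gecko's MediaStreamGraphImpl.
// Assumption: only the thread that constructed the graph may append messages.
class ToyGraph {
public:
  ToyGraph() : mOwnerThread(std::this_thread::get_id()) {}

  // Core check, with the caller's thread id passed in explicitly so the
  // rejection path can be exercised without spawning a real thread.
  bool AppendMessageFrom(std::thread::id aCaller, std::string aMessage) {
    if (aCaller != mOwnerThread) {
      return false;  // wrong thread: refuse and queue nothing
    }
    mPending.push_back(std::move(aMessage));
    return true;
  }

  // Convenience wrapper that uses the id of the calling thread.
  bool AppendMessage(std::string aMessage) {
    return AppendMessageFrom(std::this_thread::get_id(), std::move(aMessage));
  }

  // Debug-build style check, in the spirit of a thread assertion.
  void AssertOnOwnerThread() const {
    assert(std::this_thread::get_id() == mOwnerThread);
  }

  std::size_t PendingCount() const { return mPending.size(); }

private:
  const std::thread::id mOwnerThread;
  std::vector<std::string> mPending;
};
```

In this sketch, `AppendMessage` succeeds on the constructing thread, while `AppendMessageFrom` with any other thread id (for example a default-constructed `std::thread::id`, which represents no thread) is rejected and leaves the queue unchanged.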

(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #7)
> What's not clear is whether Rob (roc) or Paul should take the lead on fixing
> the issue.  I know they have been talking intently about the possible root
> cause.
> 
> Paul, Rob -- Who makes the most sense to take the lead on this? I was
> thinking Rob, but I don't know what else Rob is working on.  Also, can we
> dupe the other 3 MSG crasher bugs to one bug or is that premature?

It's probably premature.

I don't really care who takes the lead. The problem is that there is very little data to go on. It looks like memory corruption, maybe limited to corruption of the MSG graph, maybe something more general.

ntroast: it would be helpful to know if memory gets low before we crash. It would also be helpful to know which process crashed, i.e. which FirefoxOS app. I can't see that in any of the crash dumps.
Flags: needinfo?(roc)
(Reporter)

Comment 10

4 years ago
(In reply to Mike Lee [:mlee] from comment #8)
> Thanks Maire.
> 
> (In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #7)
> > 
> > Mike -- We REALLY need a regression range if we can get it.  Per Paul, this
> > issue does NOT repro in Nightly -- so it is likely something that got fixed
> > (perhaps when we did a refactor).  And it is not easy for Paul to repro.  I
> > believe he was able to repro it twice on tbpl/treeherder.  If it is
> > happening regularly for someone else, can we get a regression range?
> 
> Nick,
> Can CAF provide a regression range for this issue?
> 
> Thanks,
> Mike

This particular issue was first seen on April 8th.
Taking all of the MSG issues into consideration, I see critical mass around March 16th, but please take that with a grain of salt.
Flags: needinfo?(ntroast)

Comment 11

4 years ago
Thanks Nick. Is it possible to provide more detailed build and commit information for last known good and first failed runs similar to what's provided in comment 3?

(In reply to cafbot (PoC: ggrisco) from comment #3)
> Observed on: 
> 
> Device: msm8909
> Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.123
> Moz BuildID: 20150405002503
> Manifest:
> https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.123.xml?h=release
> Gecko Version: 37.0
> Gaia:
> http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=a6351e1197d54f8624523c2db9ba1418f2aa046f
> Gecko:
> http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=6bb2afcce9872a7cbc65b4a58f752e2d5ac02345
> Patches: bug 1145724, bug 1143694, bug 1146987, bug 1133398, bug 1152095,
> bug 1150924, bug 1133147, bug 1150271, bug 1150916
Flags: needinfo?(ntroast)
ntroast: it would be helpful to know if memory gets low before we crash. It would also be helpful to know which process crashed, i.e. which FirefoxOS app. I can't see that in any of the crash dumps.
> It would also be helpful to know which process crashed, i.e. which FirefoxOS app. I can't see that in
> any of the crash dumps.

Right now this is the most useful thing I can think of. Anyone know how to get it?
(Reporter)

Comment 14

4 years ago
Unfortunately there were no additional logs collected other than what I already provided. If more logs become available I will make sure to put them here.

Also, for Mike's question, the info in comment 3 is the first failed run. The last known good would be AU 122, which is the AU just before 123.
Flags: needinfo?(ntroast)
Nick - (Referencing comment 13 from Rob) Do you know which FirefoxOS app was running when you saw this crash?  Is it the ringtone app? (Same as bug 1152439?)
Flags: needinfo?(ntroast)
The regression-window request was directed at a specific party that's not QAnalysts (see comment 8). Also, this seems to be happening on a device that we don't have. Currently we have ringtone-related bugs on the Flame device: bug 1147386 and bug 1139157.

Adding keyword to exclude this in our queries.
QA Whiteboard: QAExclude
Flags: needinfo?(ktucker)
QA Whiteboard: QAExclude → [QAnalyst-Triage+] QAExclude
Flags: needinfo?(ktucker)
(Reporter)

Comment 17

4 years ago
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #15)
> Nick - (Referencing comment 13 from Rob) Do you know which FirefoxOS app was
> running when you saw this crash?  Is it the ringtone app? (Same as bug
> 1152439?)

Sorry, there was no additional information included about this crash. It happened in automation, so I don't know which app was running.
Flags: needinfo?(ntroast)
Flags: needinfo?(padenot)

Comment 18

4 years ago
Hi roc,

If the process is killed by the low memory killer or the OOM killer, it won't have a minidump on FxOS. I am guessing that it could be related to bug 1152439. As you mentioned in [1], if a MediaStream is connected between different MSGs, there could be a problem. Could it be possible that the MediaStream in [2] has been deleted and for some reason the memory address is allocated to another object, another MSG? Then we get the weird call stack. I also found that on the main thread, it is doing some releasing jobs.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1152439#c24
[2] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/media/MediaStreamGraph.cpp;h=4a50ea3185be2f8f01ea44d2e52444e119c2e11a;hb=6bb2afcce9872a7cbc65b4a58f752e2d5ac02345#l149
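
The concern from [1], that connecting a MediaStream between different MSGs could cause trouble, can be illustrated with a small ownership check. This is only a sketch under stated assumptions: `ToyGraph`, `ToyStream`, `CreateStream`, `Connect` and `ConnectionCount` are hypothetical names, not Gecko APIs, and nothing in this bug says such a guard exists or was added. The idea is that each stream remembers the graph that created it, and a graph refuses to connect streams it does not own.

```cpp
#include <cassert>
#include <memory>
#include <vector>

class ToyGraph;

// Hypothetical stream that records which graph owns it.
class ToyStream {
public:
  explicit ToyStream(const ToyGraph* aGraph) : mGraph(aGraph) {}
  const ToyGraph* Graph() const { return mGraph; }

private:
  const ToyGraph* mGraph;
};

// Hypothetical graph, NOT Gecko's MediaStreamGraphImpl.
class ToyGraph {
public:
  // Streams are created by, and kept alive by, the graph that owns them.
  std::shared_ptr<ToyStream> CreateStream() {
    auto stream = std::make_shared<ToyStream>(this);
    assert(stream->Graph() == this);
    mStreams.push_back(stream);
    return stream;
  }

  // Connect two streams only if both belong to this graph.
  bool Connect(const ToyStream& aSource, const ToyStream& aDest) {
    if (aSource.Graph() != this || aDest.Graph() != this) {
      return false;  // cross-graph connection rejected
    }
    ++mConnections;
    return true;
  }

  int ConnectionCount() const { return mConnections; }

private:
  std::vector<std::shared_ptr<ToyStream>> mStreams;
  int mConnections = 0;
};
```

With two graphs, connecting two streams created by the same graph succeeds, while connecting a stream from one graph to a stream from the other returns `false` and leaves the connection count unchanged. Note that a raw owner pointer like this cannot by itself detect the address-reuse case raised above; it only catches streams that are alive and owned by a different graph.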
See Also: → bug 1152439
Whiteboard: [b2g-crash][caf-crash 606][caf priority: p1][CR 819433] → [b2g-crash][caf-crash 606][caf priority: p3][CR 819433]

Comment 19

4 years ago
ni :roc per comment 18.
Flags: needinfo?(roc)
(In reply to StevenLee[:slee] from comment #18)
> If the process is killed by the low memory killer or the OOM killer, it
> won't have a minidump on FxOS.

You mean, this crash can't be caused by OOM?

> I am guessing that it could be related to bug 1152439. As
> you mentioned in [1], if a MediaStream is connected between different MSGs,
> there could be a problem. Could it be possible that the MediaStream in [2]
> has been deleted and for some reason the memory address is allocated to
> another object, another MSG? Then we get the weird call stack.

Which weird call stack?

> I also found that on
> main-thread, it is doing some releasing jobs. 

I'm not sure what you mean by that.
Flags: needinfo?(roc)
NI :greg, to confirm if he is still hitting this.

Greg, I saw you closing a couple of related crashes and wanted to check if you guys are still hitting this one?
Flags: needinfo?(ggrisco)

Comment 22

4 years ago
(In reply to bhavana bajaj [:bajaj] from comment #21)
> NI :greg, to confirm if he is still hitting this.
> 
> Greg, I saw you closing a couple of related crashes and wanted to check if
> you guys are still hitting this one?

Thanks for the ni? on this.  We aren't seeing this crash in the past few builds either.  I'm ok with closing it if you are.
Flags: needinfo?(bbajaj)

Updated

4 years ago
Flags: needinfo?(ggrisco)
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WORKSFORME
"Closing issue which has not been seen since 04/05/15 18:41"
Flags: needinfo?(bbajaj)
Keywords: regressionwindow-wanted