Closed Bug 1062541 Opened 10 years ago Closed 10 years ago

DeviceStorageFile::CalculateSizeAndModifiedDate crashes b2g process

Categories

(Core :: DOM: Device Interfaces, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla35
blocking-b2g 2.0+
Tracking Status
firefox33 --- wontfix
firefox34 --- wontfix
firefox35 --- fixed
b2g-v2.0 --- wontfix
b2g-v2.1 --- wontfix
b2g-v2.2 --- fixed

People

(Reporter: tkundu, Assigned: dhylands)

References

Details

(Keywords: crash, Whiteboard: [caf-crash 339][caf priority: p1][CR 717877][b2g-crash])

Attachments

(4 files)

STR:

Run random stability test with USB tethering. 


B2g is crashing after few hours.
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Whiteboard: [CR 717877]
Component: Gaia::System → DOM: Device Interfaces
Product: Firefox OS → Core
With this being such a specific crash, there must have been an assert failure? or back trace or something? Otherwise how do you know that DeviceStorageFile::CalculateSizeAndModifiedDate crashed?

Please provide the log or whatever so I can determine what type of crash we're dealing with here.
Flags: needinfo?(tkundu)
Attached file mozilla_logs.tar.bz2 —
logs attached
Flags: needinfo?(tkundu)
Flags: needinfo?(dhylands)
Keywords: crash
Whiteboard: [CR 717877] → [CR 717877][b2g-crash]
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> So mFile is null at
> http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/dom/devicestorage/
> nsDeviceStorage.cpp#1202.

Yeah - I'm trying to understand how it got to be null.

GetRootDirectoryForType assigns mFile.

So this implies that one of the root directories is null.

It seems reasonable to have an if check in CalculateSizeAndModifiedDate

I'm just trying to understand what else might fail.
Flags: needinfo?(dhylands)
I went through the places where DeviceStorageFile::mFile might be null and added checks.
Attachment #8483867 - Flags: review?(khuey)
(In reply to Dave Hylands [:dhylands] from comment #7)
> Created attachment 8483867 [details] [diff] [review]
> Add additional checks for mFile being null
> 
> I went through the places where DeviceStorageFile::mFile might be null and
> added checks.

Could you please rebase this patch for FFOS v2.0 branch?
Flags: needinfo?(dhylands)
Attachment rebased on 2.0 branch
Flags: needinfo?(dhylands)
Assignee: nobody → dhylands
Comment on attachment 8483953 [details] [diff] [review]
Add additional checks for mFile being null. 2.0 Branch

Approval Request Comment
[Feature/regressing bug #]: not sure - I think its been this way since it was written
[User impact if declined]: crashes stability test
[Describe test coverage new/current, TBPL]:
[Risks and why]: It might change a crash into a non-crash which might be harder to track down.
[String/UUID change made/needed]: none

This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483953 - Flags: approval-mozilla-aurora?
Pushed to master
https://hg.mozilla.org/integration/b2g-inbound/rev/6a316cc10211

The master patch should work on 2.1

There is a 2.0 specific patch.
Comment on attachment 8483953 [details] [diff] [review]
Add additional checks for mFile being null. 2.0 Branch

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): I think its been this way since it was coded
User impact if declined: crashes stability test
Testing completed: none (just the TBPL tests)
Risk to taking this patch (and alternatives if risky): It might change a crash into a non-crash which might be harder to track down.
String or UUID changes made by this patch: none

This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483953 - Flags: approval-mozilla-aurora? → approval-mozilla-b2g32?
Comment on attachment 8483867 [details] [diff] [review]
Add additional checks for mFile being null

Approval Request Comment
[Feature/regressing bug #]: not sure - I think its been this way since it was written
[User impact if declined]: crashes stability test
[Describe test coverage new/current, TBPL]:
[Risks and why]: It might change a crash into a non-crash which might be harder to track down.
[String/UUID change made/needed]: none

This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483867 - Flags: approval-mozilla-aurora?
Whiteboard: [CR 717877][b2g-crash] → [caf priority: p1][CR 717877][b2g-crash]
Whiteboard: [caf priority: p1][CR 717877][b2g-crash] → [caf-crash 339][caf priority: p1][CR 717877][b2g-crash]
Observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.074
Moz BuildID: 20140826000204
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=a51633e29a7826b6bf07cb1c5ad81b3217b9820a
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=8f121f8fa5fa1a3e82967f91e2afe337465fc593
See Also: → 1063126
I've added some logging over in bug 1063126, and the patch applies cleanly in the 2.0 branch.
https://hg.mozilla.org/mozilla-central/rev/6a316cc10211
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
blocking-b2g: 2.0? → 2.0+
(In reply to Dave Hylands [:dhylands] from comment #13)
> Comment on attachment 8483867 [details] [diff] [review]
> Add additional checks for mFile being null
> 
> Approval Request Comment
> [Feature/regressing bug #]: not sure - I think its been this way since it
> was written
> [User impact if declined]: crashes stability test
> [Describe test coverage new/current, TBPL]:
> [Risks and why]: It might change a crash into a non-crash which might be
> harder to track down.
> [String/UUID change made/needed]: none
> 
> This patch is really just a preventative patch. I still don't understand the
> root cause.
Hmm, I am worried this may not be fully fixed and may come bite us again. Let's ask CAF to try the patch out for feedback before landing on the branch which we continue to investigate in 1063126
Flags: needinfo?(ikumar)
Due to recent policy changes, all patches need approval for uplift regardless of blocking status. Please request b2g32 and aurora approval on these when you get a chance. Sorry for the inconvenience :(
Flags: needinfo?(dhylands)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #18)
> Due to recent policy changes, all patches need approval for uplift
> regardless of blocking status. Please request b2g32 and aurora approval on
> these when you get a chance. Sorry for the inconvenience :(

Now I'm confused. One of the patches is currently sitting in approval-mozilla-aurora? and the other is in approval-mozilla-b2g32? (see comments 12 & 13).

Are you saying I need to do something else? Or did you just miss the fact that I already asked for approval?
Flags: needinfo?(dhylands) → needinfo?(ryanvm)
Sorry about that, one too many bugs in that mass change.
Flags: needinfo?(ryanvm)
> [Risks and why]: It might change a crash into a non-crash which might be
> harder to track down.

I also agree with this statement.

I think that root cause of this issue is in bug 1062550. I have a local fix for that .I am verifying it now (it is a gaia change in networkmanager.js) . I will update here again if we really need this fix or not .
Dave -- based on the feedback that the issue was identified somewhere else and that your change has a risk of masking the real issue, may be we should backout the change?
Flags: needinfo?(ikumar) → needinfo?(dhylands)
Perhaps. Based on what I'm seeing in some other bugs, I think that what's going on is that Nuwa is interfering, and/or a race is happening.

nsVolumeService sends a request to the parent to send out a list of volumes:
http://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/nsVolumeService.cpp#93-97
This triggers the parent to send out an NS_VOLUME_STATE_CHANGED to establish the initial state of the volumes used by DeviceStorage. Any further changes get picked up async NS_VOLUME_STATE_CHANGED messages.

It turns out that this happens very early in the life of the child process.

This was written before Nuwa existed, and I'm also perfectly happy to recode this to do it a better way. Perhaps there is a way to know when Nuwa is ready in the child, and have the child do the request later.

Technically, the child is doing an async request with an async reply, so there may be a race condition happening where the child asks for the volume state before the async reply is received. So maybe the right thing is to make the child do a sync request, and rather than do it in the constructor, defer it until its actually needed (so we don't pay a penalty for apps which don't use device storage).

The fact that CalculateSizeAndModifiedDate is crashing tells me that the child process hasn't got the volume information from the parent (however it happened).

So it may be that under the stress test, the race condition is triggering things, or it may be some combination of a race with Nuwa (depending on when the nsVolumeService constructor is called).
Flags: needinfo?(dhylands)
(In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #23)
> Perhaps. Based on what I'm seeing in some other bugs, I think that what's
> going on is that Nuwa is interfering, and/or a race is happening.
> 
> nsVolumeService sends a request to the parent to send out a list of volumes:
> http://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/
> nsVolumeService.cpp#93-97
> This triggers the parent to send out an NS_VOLUME_STATE_CHANGED to establish
> the initial state of the volumes used by DeviceStorage. Any further changes
> get picked up async NS_VOLUME_STATE_CHANGED messages.
> 
> It turns out that this happens very early in the life of the child process.


My guess is that b2g crashes many times and it is happening while trying on/off usb tethering very quickly. Some RACE might have triggered this crash during that usb tethering crash. I asked our test team to reproduce it usb tethering fix.
I put together a patch over in bug 1063877 which changes the way the volume service works. With the patched version it uses a sync IPC call to get the initial set of volumes.

I think that the root cause of this bug is that the child process doesn't have the list of volumes (which is one of the few reasons that mFile can be null).
(In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #25)
> I put together a patch over in bug 1063877 which changes the way the volume
> service works. With the patched version it uses a sync IPC call to get the
> initial set of volumes.
> 
> I think that the root cause of this bug is that the child process doesn't
> have the list of volumes (which is one of the few reasons that mFile can be
> null).

Thanks for the good work. We will also test with fix from bug 1063877 comment 2 as it seems to be fixing some different RACE which is root cause of this bug :) 

BTW I am seeing that you workaround attachment 8483867 [details] [diff] [review] has already landed and you already pointed us to potential risk of this patch in Comment 13 . Do you want to revert your fix (attachment 8483867 [details] [diff] [review]) now ?
Flags: needinfo?(dhylands)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #26)
> (In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #25)
> > I put together a patch over in bug 1063877 which changes the way the volume
> > service works. With the patched version it uses a sync IPC call to get the
> > initial set of volumes.
> > 
> > I think that the root cause of this bug is that the child process doesn't
> > have the list of volumes (which is one of the few reasons that mFile can be
> > null).
> 
> Thanks for the good work. We will also test with fix from bug 1063877
> comment 2 as it seems to be fixing some different RACE which is root cause
> of this bug :) 
> 
> BTW I am seeing that you workaround attachment 8483867 [details] [diff] [review]
> [review] has already landed and you already pointed us to potential risk of
> this patch in Comment 13 . Do you want to revert your fix (attachment
> 8483867 [details] [diff] [review]) now ?

There's nothing wrong with the patch (and it makes the code more consistent internally), so I don't see any reason to revert it on master.

It hasn't been approved yet for 2.0. You may wish to revert it in your tree.

If the crash goes away with the other patches in place and this one removed, then that validates that we got the root cause. Once we've established that we've actually fixed the root cause, then I would still argue that the patch should be applied since it makes things more robust.

I feel that it's better for an app to misbehave than for the phone to crash. But from a testing standpoint, the crash is probably easier to detect.
Flags: needinfo?(dhylands)
Flags: needinfo?(tkundu)
Tapas, any update on testing patch in https://bugzilla.mozilla.org/show_bug.cgi?id=1063877 ?
Comment on attachment 8483867 [details] [diff] [review]
Add additional checks for mFile being null

These may not need to land on 2.0/2.1 so clearing approvals here.

Patch from 1063877 may be the final fix here, but we are waiting for final confirmation from CAF.
Attachment #8483867 - Flags: approval-mozilla-aurora?
Attachment #8483953 - Flags: approval-mozilla-b2g32?
We are not seeing this issue so far from our test team report even without fix from bug 1063877.

We landed a fix for bug 1062550 and we are not seeing this crash anymore. We need to wait for more testing hours.

Meanwhile, could you please uplift bug 1063877 for 2.0 branch if you are sure that this is real fix ?
Flags: needinfo?(tkundu) → needinfo?(bbajaj)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #30)
> We are not seeing this issue so far from our test team report even without
> fix from bug 1063877.
> 
> We landed a fix for bug 1062550 and we are not seeing this crash anymore. We
> need to wait for more testing hours.
> 
> Meanwhile, could you please uplift bug 1063877 for 2.0 branch if you are
> sure that this is real fix ?

sure, I think that's recently landed on master, dave would you agree to uplift https://bugzilla.mozilla.org/show_bug.cgi?id=1063877 to 2.0 in this case?
Flags: needinfo?(bbajaj) → needinfo?(dhylands)
It landed on master, but I backed it out (minor glitch). I'll be relanding today. I'll put in approval requests to uplift to 2.1 and 2.0.
Flags: needinfo?(dhylands)
Bug 1063877 is uplifted to Aurora and b2g32. Ready for requests? :)
Flags: needinfo?(dhylands)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #33)
> Bug 1063877 is uplifted to Aurora and b2g32. Ready for requests? :)

I don't think that the patches attached to this bug need to be uplifted.

I believe that bug 1063877 addresses the root cause so that the code in the patch attached to this bug won't be needed (which is why I called it a preventative patch).
Flags: needinfo?(dhylands)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: