Closed
Bug 1062541
Opened 10 years ago
Closed 10 years ago
DeviceStorageFile::CalculateSizeAndModifiedDate crashes b2g process
Categories
(Core :: DOM: Device Interfaces, defect, P1)
Tracking
()
People
(Reporter: tkundu, Assigned: dhylands)
References
Details
(Keywords: crash, Whiteboard: [caf-crash 339][caf priority: p1][CR 717877][b2g-crash])
Attachments
(4 files)
219.28 KB,
application/x-bzip
|
Details | |
6.38 KB,
text/plain
|
Details | |
8.88 KB,
patch
|
khuey
:
review+
|
Details | Diff | Splinter Review |
9.42 KB,
patch
|
Details | Diff | Splinter Review |
STR: Run random stability test with USB tethering. B2g is crashing after few hours.
Reporter | ||
Comment 1•10 years ago
|
||
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Whiteboard: [CR 717877]
Component: Gaia::System → DOM: Device Interfaces
Product: Firefox OS → Core
Assignee | ||
Comment 2•10 years ago
|
||
With this being such a specific crash, there must have been an assert failure? or back trace or something? Otherwise how do you know that DeviceStorageFile::CalculateSizeAndModifiedDate crashed? Please provide the log or whatever so I can determine what type of crash we're dealing with here.
Flags: needinfo?(tkundu)
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(dhylands)
So mFile is null at http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/dom/devicestorage/nsDeviceStorage.cpp#1202.
Assignee | ||
Comment 6•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5) > So mFile is null at > http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/dom/devicestorage/ > nsDeviceStorage.cpp#1202. Yeah - I'm trying to understand how it got to be null. GetRootDirectoryForType assigns mFile. So this implies that one of the root directories is null. It seems reasonable to have an if check in CalculateSizeAndModifiedDate I'm just trying to understand what else might fail.
Flags: needinfo?(dhylands)
Assignee | ||
Comment 7•10 years ago
|
||
I went through the places where DeviceStorageFile::mFile might be null and added checks.
Attachment #8483867 -
Flags: review?(khuey)
Reporter | ||
Comment 8•10 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #7) > Created attachment 8483867 [details] [diff] [review] > Add additional checks for mFile being null > > I went through the places where DeviceStorageFile::mFile might be null and > added checks. Could you please rebase this patch for FFOS v2.0 branch?
Flags: needinfo?(dhylands)
Updated•10 years ago
|
Assignee: nobody → dhylands
Attachment #8483867 -
Flags: review?(khuey) → review+
Assignee | ||
Comment 10•10 years ago
|
||
Comment on attachment 8483953 [details] [diff] [review] Add additional checks for mFile being null. 2.0 Branch Approval Request Comment [Feature/regressing bug #]: not sure - I think its been this way since it was written [User impact if declined]: crashes stability test [Describe test coverage new/current, TBPL]: [Risks and why]: It might change a crash into a non-crash which might be harder to track down. [String/UUID change made/needed]: none This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483953 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 11•10 years ago
|
||
Pushed to master https://hg.mozilla.org/integration/b2g-inbound/rev/6a316cc10211 The master patch should work on 2.1 There is a 2.0 specific patch.
Assignee | ||
Comment 12•10 years ago
|
||
Comment on attachment 8483953 [details] [diff] [review] Add additional checks for mFile being null. 2.0 Branch NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. [Approval Request Comment] Bug caused by (feature/regressing bug #): I think its been this way since it was coded User impact if declined: crashes stability test Testing completed: none (just the TBPL tests) Risk to taking this patch (and alternatives if risky): It might change a crash into a non-crash which might be harder to track down. String or UUID changes made by this patch: none This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483953 -
Flags: approval-mozilla-aurora? → approval-mozilla-b2g32?
Assignee | ||
Comment 13•10 years ago
|
||
Comment on attachment 8483867 [details] [diff] [review] Add additional checks for mFile being null Approval Request Comment [Feature/regressing bug #]: not sure - I think its been this way since it was written [User impact if declined]: crashes stability test [Describe test coverage new/current, TBPL]: [Risks and why]: It might change a crash into a non-crash which might be harder to track down. [String/UUID change made/needed]: none This patch is really just a preventative patch. I still don't understand the root cause.
Attachment #8483867 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Whiteboard: [CR 717877][b2g-crash] → [caf priority: p1][CR 717877][b2g-crash]
Updated•10 years ago
|
Whiteboard: [caf priority: p1][CR 717877][b2g-crash] → [caf-crash 339][caf priority: p1][CR 717877][b2g-crash]
Comment 14•10 years ago
|
||
Observed on: Device: msm8226 Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.074 Moz BuildID: 20140826000204 B2G Version: 2.0 Gecko Version: 32.0 Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=a51633e29a7826b6bf07cb1c5ad81b3217b9820a Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=8f121f8fa5fa1a3e82967f91e2afe337465fc593
Assignee | ||
Comment 15•10 years ago
|
||
I've added some logging over in bug 1063126, and the patch applies cleanly in the 2.0 branch.
https://hg.mozilla.org/mozilla-central/rev/6a316cc10211
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Updated•10 years ago
|
blocking-b2g: 2.0? → 2.0+
Comment 17•10 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #13) > Comment on attachment 8483867 [details] [diff] [review] > Add additional checks for mFile being null > > Approval Request Comment > [Feature/regressing bug #]: not sure - I think its been this way since it > was written > [User impact if declined]: crashes stability test > [Describe test coverage new/current, TBPL]: > [Risks and why]: It might change a crash into a non-crash which might be > harder to track down. > [String/UUID change made/needed]: none > > This patch is really just a preventative patch. I still don't understand the > root cause. Hmm, I am worried this may not be fully fixed and may come bite us again. Let's ask CAF to try the patch out for feedback before landing on the branch which we continue to investigate in 1063126
Updated•10 years ago
|
Flags: needinfo?(ikumar)
Comment 18•10 years ago
|
||
Due to recent policy changes, all patches need approval for uplift regardless of blocking status. Please request b2g32 and aurora approval on these when you get a chance. Sorry for the inconvenience :(
status-b2g-v2.0:
--- → affected
status-b2g-v2.1:
--- → affected
status-b2g-v2.2:
--- → fixed
status-firefox33:
--- → wontfix
status-firefox34:
--- → affected
status-firefox35:
--- → fixed
Flags: needinfo?(dhylands)
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #18) > Due to recent policy changes, all patches need approval for uplift > regardless of blocking status. Please request b2g32 and aurora approval on > these when you get a chance. Sorry for the inconvenience :( Now I'm confused. One of the patches is currently sitting in approval-mozilla-aurora? and the other is in approval-mozilla-b2g32? (see comments 12 & 13). Are you saying I need to do something else? Or did you just miss the fact that I already asked for approval?
Flags: needinfo?(dhylands) → needinfo?(ryanvm)
Comment 20•10 years ago
|
||
Sorry about that, one too many bugs in that mass change.
Flags: needinfo?(ryanvm)
Reporter | ||
Comment 21•10 years ago
|
||
> [Risks and why]: It might change a crash into a non-crash which might be > harder to track down. I also agree with this statement. I think that root cause of this issue is in bug 1062550. I have a local fix for that .I am verifying it now (it is a gaia change in networkmanager.js) . I will update here again if we really need this fix or not .
Comment 22•10 years ago
|
||
Dave -- based on the feedback that the issue was identified somewhere else and that your change has a risk of masking the real issue, may be we should backout the change?
Flags: needinfo?(ikumar) → needinfo?(dhylands)
Assignee | ||
Comment 23•10 years ago
|
||
Perhaps. Based on what I'm seeing in some other bugs, I think that what's going on is that Nuwa is interfering, and/or a race is happening. nsVolumeService sends a request to the parent to send out a list of volumes: http://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/nsVolumeService.cpp#93-97 This triggers the parent to send out an NS_VOLUME_STATE_CHANGED to establish the initial state of the volumes used by DeviceStorage. Any further changes get picked up async NS_VOLUME_STATE_CHANGED messages. It turns out that this happens very early in the life of the child process. This was written before Nuwa existed, and I'm also perfectly happy to recode this to do it a better way. Perhaps there is a way to know when Nuwa is ready in the child, and have the child do the request later. Technically, the child is doing an async request with an async reply, so there may be a race condition happening where the child asks for the volume state before the async reply is received. So maybe the right thing is to make the child do a sync request, and rather than do it in the constructor, defer it until its actually needed (so we don't pay a penalty for apps which don't use device storage). The fact that CalculateSizeAndModifiedDate is crashing tells me that the child process hasn't got the volume information from the parent (however it happened). So it may be that under the stress test, the race condition is triggering things, or it may be some combination of a race with Nuwa (depending on when the nsVolumeService constructor is called).
Flags: needinfo?(dhylands)
Reporter | ||
Comment 24•10 years ago
|
||
(In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #23) > Perhaps. Based on what I'm seeing in some other bugs, I think that what's > going on is that Nuwa is interfering, and/or a race is happening. > > nsVolumeService sends a request to the parent to send out a list of volumes: > http://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/ > nsVolumeService.cpp#93-97 > This triggers the parent to send out an NS_VOLUME_STATE_CHANGED to establish > the initial state of the volumes used by DeviceStorage. Any further changes > get picked up async NS_VOLUME_STATE_CHANGED messages. > > It turns out that this happens very early in the life of the child process. My guess is that b2g crashes many times and it is happening while trying on/off usb tethering very quickly. Some RACE might have triggered this crash during that usb tethering crash. I asked our test team to reproduce it usb tethering fix.
Assignee | ||
Comment 25•10 years ago
|
||
I put together a patch over in bug 1063877 which changes the way the volume service works. With the patched version it uses a sync IPC call to get the initial set of volumes. I think that the root cause of this bug is that the child process doesn't have the list of volumes (which is one of the few reasons that mFile can be null).
Reporter | ||
Comment 26•10 years ago
|
||
(In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #25) > I put together a patch over in bug 1063877 which changes the way the volume > service works. With the patched version it uses a sync IPC call to get the > initial set of volumes. > > I think that the root cause of this bug is that the child process doesn't > have the list of volumes (which is one of the few reasons that mFile can be > null). Thanks for the good work. We will also test with fix from bug 1063877 comment 2 as it seems to be fixing some different RACE which is root cause of this bug :) BTW I am seeing that you workaround attachment 8483867 [details] [diff] [review] has already landed and you already pointed us to potential risk of this patch in Comment 13 . Do you want to revert your fix (attachment 8483867 [details] [diff] [review]) now ?
Flags: needinfo?(dhylands)
Assignee | ||
Comment 27•10 years ago
|
||
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #26) > (In reply to Dave Hylands [:dhylands][on PTO - back Sep 11] from comment #25) > > I put together a patch over in bug 1063877 which changes the way the volume > > service works. With the patched version it uses a sync IPC call to get the > > initial set of volumes. > > > > I think that the root cause of this bug is that the child process doesn't > > have the list of volumes (which is one of the few reasons that mFile can be > > null). > > Thanks for the good work. We will also test with fix from bug 1063877 > comment 2 as it seems to be fixing some different RACE which is root cause > of this bug :) > > BTW I am seeing that you workaround attachment 8483867 [details] [diff] [review] > [review] has already landed and you already pointed us to potential risk of > this patch in Comment 13 . Do you want to revert your fix (attachment > 8483867 [details] [diff] [review]) now ? There's nothing wrong with the patch (and it makes the code more consistent internally), so I don't see any reason to revert it on master. It hasn't been approved yet for 2.0. You may wish to revert it in your tree. If the crash goes away with the other patches in place and this one removed, then that validates that we got the root cause. Once we've established that we've actually fixed the root cause, then I would still argue that the patch should be applied since it makes things more robust. I feel that it's better for an app to misbehave than for the phone to crash. But from a testing standpoint, the crash is probably easier to detect.
Flags: needinfo?(dhylands)
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(tkundu)
Comment 28•10 years ago
|
||
Tapas, any update on testing patch in https://bugzilla.mozilla.org/show_bug.cgi?id=1063877 ?
Comment 29•10 years ago
|
||
Comment on attachment 8483867 [details] [diff] [review] Add additional checks for mFile being null These may not need to land on 2.0/2.1 so clearing approvals here. Patch from 1063877 may be the final fix here, but we are waiting for final confirmation from CAF.
Attachment #8483867 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Attachment #8483953 -
Flags: approval-mozilla-b2g32?
Reporter | ||
Comment 30•10 years ago
|
||
We are not seeing this issue so far from our test team report even without fix from bug 1063877. We landed a fix for bug 1062550 and we are not seeing this crash anymore. We need to wait for more testing hours. Meanwhile, could you please uplift bug 1063877 for 2.0 branch if you are sure that this is real fix ?
Flags: needinfo?(tkundu) → needinfo?(bbajaj)
Comment 31•10 years ago
|
||
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #30) > We are not seeing this issue so far from our test team report even without > fix from bug 1063877. > > We landed a fix for bug 1062550 and we are not seeing this crash anymore. We > need to wait for more testing hours. > > Meanwhile, could you please uplift bug 1063877 for 2.0 branch if you are > sure that this is real fix ? sure, I think that's recently landed on master, dave would you agree to uplift https://bugzilla.mozilla.org/show_bug.cgi?id=1063877 to 2.0 in this case?
Flags: needinfo?(bbajaj) → needinfo?(dhylands)
Assignee | ||
Comment 32•10 years ago
|
||
It landed on master, but I backed it out (minor glitch). I'll be relanding today. I'll put in approval requests to uplift to 2.1 and 2.0.
Flags: needinfo?(dhylands)
Comment 33•10 years ago
|
||
Bug 1063877 is uplifted to Aurora and b2g32. Ready for requests? :)
Flags: needinfo?(dhylands)
Assignee | ||
Comment 34•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #33) > Bug 1063877 is uplifted to Aurora and b2g32. Ready for requests? :) I don't think that the patches attached to this bug need to be uplifted. I believe that bug 1063877 addresses the root cause so that the code in the patch attached to this bug won't be needed (which is why I called it a preventative patch).
Flags: needinfo?(dhylands)
Comment 35•10 years ago
|
||
wontfix per comment 34 in that case :)
You need to log in
before you can comment on or make changes to this bug.
Description
•