Closed Bug 998255 Opened 6 years ago Closed 6 years ago

Unsafe usage of asyncStorage.getItem in ftu_launcher.js

Categories

(Firefox OS Graveyard :: Gaia::System, defect, critical)

x86
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: zcampbell, Unassigned)

References

Details

Attachments

(5 files, 2 obsolete files)

Bugs 937336 and bug 990580 and Vicamo's investigation has found an unsafe use of async_storage.getItem in the ftu_launcher.js:
https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/ftu_launcher.js

This is causing device-based UI tests to fail so it is high priority.
What's the proposed way to use async storage?
Summary: Unsafe usage of async_storage.getItem in ftu_launcher.js → Unsafe usage of asyncStorage.getItem in ftu_launcher.js
Add a new method that works in the same way before bug 936703 -- return result value in IDBRequest.onsuccess event.
Add a debug version of current |asyncStorage.getItem|.
With attachment 8408884 [details] [diff] [review] and attachment 8408883 [details] [diff] [review], you can replace any |asyncStorage.getItem| calls to one of them for further debug.
This log file is captured with |asyncStorage.getItem| call in "apps/system/js/ftu_launcher.js" replaced by |asyncStorage.getItemUnsafe|.  The test steps are given in bug 990580 comment 76.  All 11 runs passed.
Captured with |asyncStorage.getItem| call in "apps/system/js/ftu_launcher.js" replaced by |asyncStorage.getItemDebug|.  Test process hang and adb disconnects.
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #1)
> What's the proposed way to use async storage?

To callback/return values in IDBTransaction oncomplete handler as landed bug 936703 does.  However, 936703 causes Gaia UI test test_cold_launch.py to fail at a very high rate.  The only victim so far is "apps/system/js/ftu_launcher.js".  Actually I'm still trying to figure out what happened because from the two log files we see all IndexedDB calls ends successfully and completely.  I still don't know why the timing is so important that one succeeded and the other failed.
Blocks a blocker.
blocking-b2g: --- → 2.0?
Keywords: qablocker
Severity: normal → critical
We really need an assignee here - this is causing our UI automation suite to nearly fall over right now.

Gregor - Can you find someone to take a look at this?
Flags: needinfo?(anygregor)
Gregor mentioned that the large majority of his devs are out today, so I'm redirecting this to Tim to see any of his team members have cycles to look into this.
Flags: needinfo?(anygregor) → needinfo?(timdream)
FTU... This looks like this may be the same underlying problem that us keeping us from landing the fix in bug 980549...
Attached file Potential patch - pull request (obsolete) —
This may be a way to solve the issue. It seems like there are probably race conditions between several asyncStorage calls in ensurePing and the ftu launcher callback. By delaying this asyncStorage getter until after ensurePing has made the setItem calls, it should avoid some race conditions.

Can you guys try this patch out to see if it resolves the issue? I have just verified that things continue to work, I have not yet tried verifying any tests. Thanks!
This url removes whitespace diffs and is easier to read.
Attachment #8409736 - Attachment is obsolete: true
Stephen - Can you have someone test Kevin's patch in comment 13?
Flags: needinfo?(timdream) → needinfo?(stephen.donner)
Also quick patch link if it's easier/not known: https://github.com/mozilla-b2g/gaia/pull/18494.diff
(In reply to Jason Smith [:jsmith] from comment #15)
> Stephen - Can you have someone test Kevin's patch in comment 13?

Sure, I've asked No-Jun to help, and he's testing the patch now.
Flags: needinfo?(stephen.donner)
(In reply to Kevin Grandon :kgrandon from comment #14)
> Created attachment 8409737 [details]
> Potential patch - pull request
> 
> This url removes whitespace diffs and is easier to read.

Tested patch, and with this patch it makes time.sleep(10) workaround in gaia_test.py obsolete.  tests_killall.py is passing consistently with this patch without the workaround.  test_cleanup_gaia.py is still failing, but there is a bug for that (bug 991262, or bug 990580)   Stephen and I think this patch fixes at least some of the issues that we are seeing in automation.
Comment on attachment 8409737 [details]
Potential patch - pull request

Flagging a few people for review here. Marshall touched this last, so if you'd like to redirect/flag him we can do that as well.
Attachment #8409737 - Flags: review?(timdream)
Attachment #8409737 - Flags: review?(alive)
(In reply to npark from comment #18)
> (In reply to Kevin Grandon :kgrandon from comment #14)
> > Created attachment 8409737 [details]
> > Potential patch - pull request
> > 
> > This url removes whitespace diffs and is easier to read.
> 
> Tested patch, and with this patch it makes time.sleep(10) workaround in
> gaia_test.py obsolete.  tests_killall.py is passing consistently with this
> patch without the workaround.  test_cleanup_gaia.py is still failing, but
> there is a bug for that (bug 991262, or bug 990580)   Stephen and I think
> this patch fixes at least some of the issues that we are seeing in
> automation.

Tested, too.  test_cold_launch.py is still failing.
Comment on attachment 8409737 [details]
Potential patch - pull request

It seems this is not a valid fix according to discussion in the bug.

We really should be fixing the underlining issue instead of tweaking call sequences this way.
Attachment #8409737 - Flags: review?(timdream)
Attachment #8409737 - Flags: review?(alive)
Turns out something else ended up fixing bug 990580, so we don't need to block on this anymore.
blocking-b2g: 2.0? → ---
Keywords: qablocker
Comment on attachment 8409737 [details]
Potential patch - pull request

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #21)
> Comment on attachment 8409737 [details]
> Potential patch - pull request
> 
> It seems this is not a valid fix according to discussion in the bug.

The only valid fix is to stop using async_storage.js, which would be a much larger effort. If you guys want to take up this work, please do so, but this is way too much to do to hold of on fixing this bug.

The current problem is that this code heavily nests async_storage.js usage, which does not guarantee callback ordering execution. If we need to guarantee callback ordering (which we do in this case), then we need to protect ourselves. Could we fix this in the asycs_storage.js library? Possibly, but this would come at a performance regression in other areas (e.g., removal of parallel get requests).

I'm opening the review again as this is the best solution I see in the nearterm. We can also open a bug to remove the usage of async_storage.js in the system app if you'd like, or we can just use it correctly.
Attachment #8409737 - Flags: review?(timdream)
Attachment #8409737 - Flags: review?(alive)
Comment on attachment 8409737 [details]
Potential patch - pull request

I don't think I understand the bug well enough to do any meaningful review.

From reading the code it looks like you are trying to avoid using mozSettings database and asyncStorage at the same time, but I don't know what it solves.
Attachment #8409737 - Flags: review?(timdream) → review?(21)
Sorry, I know this bug was a bit confusing and I probably haven't done a good job at describing the root cause. It only happens because we interweave parallel get and set IndexedDB transactions. Ideally this should be solved by using a single transaction, but we'd have to remove async_storage.js to do this.
Blocks: 980549
(In reply to Kevin Grandon :kgrandon from comment #25)
> It only happens because we interweave
> parallel get and set IndexedDB transactions. 

Can I naively ask: would this be fixed in the IndexedDB impl? IndexedDB spec does not explicitly forbid usage like this.

If we are hitting this behavior in Gaia, certainly someone on the web will hit by it again.
We can certainly try to push for a spec change here, though to be honest I doubt we will get very far. Introducing this at the indexedDB seems like it has the potential to cause performance issues if it can not do things in parallel. One solution for web developers who don't like this behavior may be to try to use the synchronous indexedDB implementation that we are working on in the future, though that has its own pitfalls.
Comment on attachment 8409737 [details]
Potential patch - pull request

I have the same opinion and not sure what does ftu_ping do.
Attachment #8409737 - Flags: review?(alive)
Comment on attachment 8409737 [details]
Potential patch - pull request

It makes sense to me that set/get does not returns in order. But before reviewing I would like to understand what in this code is dependent on the order of callbacks ? (this is not obvious to me). 

Also can't we simply move the FtuPing.ensurePing() call inside the asyncStorage.getItem call ?

Please r? with those questions answered :)
Attachment #8409737 - Flags: review?(21)
There is still a race condition in the code, but seems like QA is unblocked and I haven't seen reports of this on an actual device. I don't feel like driving this too hard as there's lots of things to do.

I'm going to close this as works for me, but I can pick it back up if it's actually an issue.
No longer blocks: 980549
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.