Closed Bug 1112577 Opened 10 years ago Closed 9 years ago

CallLogDB transactions are not atomic

Categories

(Firefox OS Graveyard :: Gaia::Dialer, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(b2g-v2.0 affected, b2g-v2.1 affected, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S6 (20feb)
Tracking Status
b2g-v2.0 --- affected
b2g-v2.1 --- affected
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: jlorenzo, Assigned: gsvelto)

References

Details

(Whiteboard: [planned-sprint c=3])

Attachments

(2 files, 8 obsolete files)

46 bytes, text/x-github-pull-request
Details | Review
40.20 KB, patch
thills
: review+
Details | Diff | Splinter Review
Like discussed on IRC, we're having a lot of intermittent failures on the UI tests checking the CallLog.

In bug 1093608, I see more failures now that I delete the database after each test[1]. :gsvelto explained that the CallLogDB relies on the `onsuccess` event to determine if a transaction is finished. `oncomplete` is more appropriated in this case.

That could explain why we have more failures when inserting and deleting entries in the database.

[1] http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk.ui.adhoc/449/console
A little more context: the call log DB code is currently starting IndexedDB transactions from within other IndexedDB transactions which would be bad enough in and by itself but in addition to this it also doesn't properly check if transactions have succeeded or not. The user-provided callback for a database operation is usually invoked from the `onsuccess' handler of the transaction's last request rather than from the `oncomplete' handler of the transaction itself.
This is a first attempt at fixing all the races and abuses of the IndexedDB API. Unfortunately it breaks making outgoing calls to unknown numbers as I can't keep a single transaction alive across the function call used to retrieve the contact information. I need to find a way to solve that issue while keeping the operation atomic.
Status: NEW → ASSIGNED
Whiteboard: [planned-sprint c=1]
Target Milestone: --- → 2.2 S3 (9jan)
I've fixed the problems I was having with outgoing calls and now this seems to work OK. The idea there is that I look for an existing group of calls, if I don't find it I retrieve the contact info and then - before adding the new group - I check again if it's already there. This is needed because we must use two separate transactions (due to the asynchronous contact query in the middle) yet we don't want to add the same group twice if they come in rapid sequence.

Johan, could you check if this patch fixes the problems you were having with the integration tests?
Attachment #8539262 - Attachment is obsolete: true
Attachment #8540663 - Flags: review?(drs.bugzilla)
Attachment #8540663 - Flags: feedback?(jlorenzo)
Is there perhaps also some relationship to bug 1105168, where we have to use a time.sleep(1) when we'd rather not do that?
Comment on attachment 8540663 [details] [diff] [review]
[PATCH] Ensure that all call log database operations are atomic

Like discussed on IRC last week, I'm still having issues even after pushing this patch on my phone. The test in bug 1096385 still fails: A phone call is placed, then another is received but nothing happens on the call log.
Attachment #8540663 - Flags: feedback?(jlorenzo) → feedback-
Comment on attachment 8540663 [details] [diff] [review]
[PATCH] Ensure that all call log database operations are atomic

Clearing review until comment 6 is addressed.
Attachment #8540663 - Flags: review?(drs.bugzilla)
Blocks: 1117804
Target Milestone: 2.2 S3 (9jan) → 2.2 S4 (23jan)
Rebased patch as the attached one had bit-rotted.
Attachment #8540663 - Attachment is obsolete: true
Quick update, according to :jlorenzo the issue in comment 1096385 is not reproducing anymore. He's confirming that this is really the case; if it is I'll ask for review on the attached patch.
Bug 1096385 doesn't seem to be happening anymore but then again http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk.ui.adhoc/621/console is full of timeouts; if it was a race-condition it might have disappeared because the runner is slow.

Bug 1093608 is still failing though, can you try it with my patch applied and see what happens?
Blocks: 1112242
I've addressed an issue that popped up during testing. I've also tried to reproduce the call log intermittent errors :jlorenzo pointed out locally but couldn't. Since this is somewhat inconclusive (they often show up only in automation) I'll be waiting before asking for review again.
Attachment #8553730 - Attachment is obsolete: true
Target Milestone: 2.2 S4 (23jan) → 2.2 S5 (6feb)
Comment on attachment 8554479 [details] [diff] [review]
[PATCH v3] Ensure that all call log database operations are atomic

I ran test_dialer_dsds_incoming_call_log_recorded_sim.py (bug 1093608) 50 times on each SIM card. After these 100 runs, not a single failure was noticed.

I also run all the dialer test cases (DSDS and non-DSDS), all are passing except that test_call_log_groups (bug 1117804) has never passed (tested 3 times). On the call log screen I see that "Today" and "Yesterday" are displayed after 2014 and 2013. Gabriele, do you have an idea of what it could be?
Flags: needinfo?(gsvelto)
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #13)
> I ran test_dialer_dsds_incoming_call_log_recorded_sim.py (bug 1093608) 50
> times on each SIM card. After these 100 runs, not a single failure was
> noticed.

Excellent!

> I also run all the dialer test cases (DSDS and non-DSDS), all are passing
> except that test_call_log_groups (bug 1117804) has never passed (tested 3
> times). On the call log screen I see that "Today" and "Yesterday" are
> displayed after 2014 and 2013. Gabriele, do you have an idea of what it
> could be?

No, but as noted there's numerous other problems with this code including bug 1112242. I'll ask for review on this one so we can land it and then proceed to fix the other problems.
Flags: needinfo?(gsvelto)
Attachment #8554479 - Flags: review?(drs.bugzilla)
Comment on attachment 8554479 [details] [diff] [review]
[PATCH v3] Ensure that all call log database operations are atomic

Redirecting to Tamara.
Attachment #8554479 - Flags: review?(drs.bugzilla) → review?(thills)
QA Contact: jlorenzo
Just fixed a silly typo which would trigger a null pointer error after displaying the call log. The PR was also updated.
Attachment #8554479 - Attachment is obsolete: true
Attachment #8554479 - Flags: review?(thills)
Attachment #8555842 - Flags: review?(thills)
Comment on attachment 8555842 [details] [diff] [review]
[PATCH v4] Ensure that all call log database operations are atomic

Clearing the review flag as I've just found another issue that needs to be addressed.
Attachment #8555842 - Flags: review?(thills)
Target Milestone: 2.2 S5 (6feb) → 2.2 S7 (6mar)
Target Milestone: 2.2 S7 (6mar) → 2.2 S5 (6feb)
We should be good now though I have to check the integration tests first. Johan can you give this patch one more spin on your tests?

As for the changes here they boild down to three things:

- Lots of IndexedDB operations were returning to the calling code as soon as the last request called its `onsuccess' handler. This is wrong, as the last request can succeed even though the transaction ultimately fails. All those bits have been changed to use the transaction `oncomplete' handler to confirm that the operation was successful. Some refactoring was needed for this to work.

- The contact cache was horribly racy. When the app started it assumed that the cache was valid, and only realized that it was not after partially populating the call log with stale data. I've changed this to assume that the cache is invalid until we've proven otherwise. Additionally the procedure to update the contact cache was creating new IndexedDB transactions from within another IndexedDB transaction. This was bound to fail in spectacular ways and in fact did fail from time to time. The racyness of it all however masked this most of the time.

- Other functions were also racy, in that they launched a transaction, then based on its result updated the DB using another transaction. Because of that tTwo overlapping operations would usually lead to wrong data in the database as the second one overwrote the result of the first one. I've replaced these either with a single atomic transaction, or two transactions but having the second one verifying the condition on which it needs to act before going forward; and bailing out in case that condition is not met anymore.

- Finally I had to patch the unit tests to pretend the contact cache is always valid. This is necessary because all the unit-tests relied on it even though it was never initialized explicitly in the code.
Attachment #8555842 - Attachment is obsolete: true
Flags: needinfo?(jlorenzo)
Attachment #8556509 - Flags: review?(thills)
Due to bug 1126260, I am not able to run test_dialer_dsds_incoming_call_log_recorded_sim.py properly. We need to get bug 1126260 fixed to validate this patch on master.

Clearing the NI in the meantime.
Depends on: 1126260
Flags: needinfo?(jlorenzo)
Updated patch that splits the CallLogDb.invalidateContactsCache() function and forces the CallLog._removeContact() and CallLog._updateContact() function to wait until we're sure if the contact cache is up to date or not. I've pushed a separate patch on top of the PR for ease of review.
Attachment #8556509 - Attachment is obsolete: true
Attachment #8556509 - Flags: review?(thills)
Attachment #8559214 - Flags: review?(thills)
Comment on attachment 8559214 [details] [diff] [review]
[PATCH v6] Ensure that all call log database operations are atomic

Hi Gabriele,

I think there is a problem with the _validateContactsCache function.  It gets called when _updateContact is called when a new contact is added.  I believe the promise is not being returned and hence the call log entries don't get updated with the new contact on an add Contact.  

I debugged a bit and it looks like this check here on Line133 is failing and hence the promise never gets resolved:

https://github.com/gabrielesvelto/gaia/commit/a966e11787a1c3d86ae1285c93ac30d9d04c3e8b#diff-f70ff14b804b9dcffe477bb56b6a7e97R133

Thanks,

-tamara
Attachment #8559214 - Flags: review?(thills) → review-
(In reply to Tamara Hills [:thills] from comment #21)
> I think there is a problem with the _validateContactsCache function.  It
> gets called when _updateContact is called when a new contact is added.  I
> believe the promise is not being returned and hence the call log entries
> don't get updated with the new contact on an add Contact.  

Yes, after much pondering I realized that this isn't the correct approach. The correct approach should be to wait for the cache to be valid before installing the system message / event handlers so those methods cannot be called in an inconsistent situation. I'll revise my patch to do that.
Whiteboard: [planned-sprint c=1] → [planned-sprint c=?]
Target Milestone: 2.2 S5 (6feb) → 2.2 S6 (20feb)
Whiteboard: [planned-sprint c=?] → [planned-sprint c=3]
Yet another iteration of the patch, I've pushed the incremental changes on top of the previous one on the PR but in this case it's easier to describe what happened:

In the previous patch I modified CallLog._updateContact() and CallLog._removeContact() to wait for the contact cache to be valid. I've reverted those changes in this patch and instead made CallLog.init() wait for the contact cache to be validated before trying to render the list and before installing oncontactchange handlers. This ensures that nobody is going to touch it until we're sure whether it's valid or not.
Attachment #8559214 - Attachment is obsolete: true
Attachment #8562151 - Flags: review?(thills)
Comment on attachment 8562151 [details] [diff] [review]
[PATCH v7] Ensure that all call log database operations are atomic

Hi Gabriele,

I'm seeing a bunch of odd behaviors in the call log having to do with the call log disappearing from the DOM but some remnants remaining in the DB. Here's what I observed:

1.  When I have a few entries in the call log, then I go and delete one of the contacts that is associated with one of the entries in the Call Log, the entire call log gets wiped out.  It does not come back after a reboot.  But, I think there is some remaining info in the Call Log DB after the reboot because even though the call log is empty, and I make a call to a number, it shows as the second call to that number after the reboot.

2.  I start with an empty call log and then I make an incoming call to my device.  There is a contact associated with the number.   I see the contact on the call screen on the incoming call, but when I go to the call log, I see just the number.  Then, I close the app and the entire call log is gone again.

3.  I'm also having troubles with the button on the call log that you press to delete the call logs.  It seems to not work when I press it.  I reverted back and tried on latest master and it seems to work.  This seems to happen after the call log gets wiped out in #1 and then I try making additional calls.

4.  Just in general, the call log seems to be wiped out occasionally.  I had it wipe out after an incoming call and my incoming call wasn't there.  This was with no contacts at all.

Thanks,

-tamara
Attachment #8562151 - Flags: review?(thills) → review-
After testing I noticed a silly mistake in the CallLogDB._gatherAllGroups() function which prevented it from running correctly: I hadn't declared the 'self' variable which I was using for accessing the 'this' reference. I'm fixing up the patch and I've done a bunch of tests which seem all to be working (fingers crossed).
OK, if you look at the interdiff of this patch and the previous version you'll see that there's only one line worth of changes plus a comment I've adjusted to be clearer.

In the PR however I've removed the last two commits I made because they were super-confusing since one did some stuff and the other reverted it and then changed some. Instead I've split my changes in two more logical commits: the first one deals with splitting _invalidateContactsCache() in three functions to improve its readability, the second one deals with making operations wait before touching the contact cache.

The unit tests are all passing correctly and my manual tests didn't turn out any errors but you never know with changes this big...
Attachment #8562151 - Attachment is obsolete: true
Attachment #8563537 - Flags: review?(thills)
Comment on attachment 8563537 [details] [diff] [review]
[PATCH v8] Ensure that all call log database operations are atomic

Hi Gabriele,

Looking good.  I believe that this patch may have also fixed bug 976990 as I can't repro that behavior anymore!

Nice work!
-tamara
Attachment #8563537 - Flags: review?(thills) → review+
(In reply to Tamara Hills [:thills] from comment #27)
> Looking good.  I believe that this patch may have also fixed bug 976990 as I
> can't repro that behavior anymore!

Thanks for the review and for checking that out! I believe this might have squashed a number of other call log bugs, we'll have to go through them after this lands. I've squashed my patches on the PR and pushed again to trigger the tests. Let's wait and see what happens.
Try is finally green, merged to gaia/master 6ac71434ff4c741f31babe45df9fe10c17c93021

https://github.com/mozilla-b2g/gaia/commit/6ac71434ff4c741f31babe45df9fe10c17c93021
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment on attachment 8563537 [details] [diff] [review]
[PATCH v8] Ensure that all call log database operations are atomic

[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): This issue has been present in the code since its inception
[User impact] if declined: Interaction with the call log often gives unexpected results, bug 1112242 for example which is 2.2+. Also most integration tests can't run reliably because of the race conditions present in the code.
[Testing completed]: Tested on a device and on the emulator, we have some integration tests covering this part and they're all passing.
[Risk to taking this patch] (and alternatives if risky): There is a measurable amount of risk in taking this patch as it touches many parts of the call log database code. That being said the existing code is riddled with race conditions which should be fixed by this patch, so the risk of taking it should be weighted against the fact that the existing code is quite buggy on its own.
[String changes made]: None
Attachment #8563537 - Flags: approval-gaia-v2.2?(release-mgmt)
See Also: → 1015861
(In reply to Gabriele Svelto [:gsvelto] from comment #32)
> Comment on attachment 8563537 [details] [diff] [review]
> [PATCH v8] Ensure that all call log database operations are atomic
> 
> [Approval Request Comment]
> [Bug caused by] (feature/regressing bug #): This issue has been present in
> the code since its inception
> [User impact] if declined: Interaction with the call log often gives
> unexpected results, bug 1112242 for example which is 2.2+. Also most
> integration tests can't run reliably because of the race conditions present
> in the code.
> [Testing completed]: Tested on a device and on the emulator, we have some
> integration tests covering this part and they're all passing.
> [Risk to taking this patch] (and alternatives if risky): There is a
> measurable amount of risk in taking this patch as it touches many parts of
> the call log database code. That being said the existing code is riddled
> with race conditions which should be fixed by this patch, so the risk of
> taking it should be weighted against the fact that the existing code is
> quite buggy on its own.
> [String changes made]: None

Johan, we may need to watch this closely for any fallouts, hence Ni'ing you to help keep an eye. If it ends up being too risky to manage for 2.2, we'd back this this out.
Flags: needinfo?(jlorenzo)
Attachment #8563537 - Flags: approval-gaia-v2.2?(release-mgmt) → approval-gaia-v2.2+
(In reply to bhavana bajaj [:bajaj] from comment #35)
> Johan, we may need to watch this closely for any fallouts, hence Ni'ing you
> to help keep an eye. If it ends up being too risky to manage for 2.2, we'd
> back this this out.

Right. 

Also, like Gabriele said during a Dialer stand up, this area won't be touched until a certain amount of time. This will help identify the regressions.

This patch was mainly intended to fix automation intermittent failures (in bug 1093608, bug 1096385 and bug 1117804). As first results, I couldn't see either an improvement or a worsening after running the 3 tests impacted. So I would prefer to stay prudent and stick with the previous implementation in 2.2. Gabriele, what do think of that?
Flags: needinfo?(jlorenzo) → needinfo?(gsvelto)
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #37)
> Gabriele, what do think of that?

We need this in 2.2 because it fixes bug 976990 and bug 1112242 as well. While I'm disappointed that it didn't squash our automation issues this is still needed to resolve those bugs.
Flags: needinfo?(gsvelto)
No longer blocks: 1112242
Fair enough. I verified both bug 976990 and bug 1112242. I'll execute some manual testing around this feature to spot quickly any potential regression.
Depends on: 1135023
Hit bug 1062799 while testing.
See Also: → 1062799
Depends on: 1135044
No longer depends on: 1126260
Depends on: 1141578
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: