Closed Bug 1072773 Opened 10 years ago Closed 10 years ago

[dolphin][flame]Contact which is added in the last will be lost if cut off power in 5s.

Categories

(Core Graveyard :: DOM: Contacts, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(tracking-b2g:backlog, b2g-v1.4 affected, b2g-v2.0 affected)

RESOLVED FIXED
mozilla36
tracking-b2g backlog
Tracking Status
b2g-v1.4 --- affected
b2g-v2.0 --- affected

People

(Reporter: shiwei.zhang, Assigned: reuben)

References

Details

(Whiteboard: [sprd355296] BlockPTR2 [systemsfe])

Attachments

(2 files, 5 obsolete files)

1. Add some different contacts in Contacts app.

2. Cut off power in 5s and reboot the phone. 

3. Get into Contacts app again and check what you have added.

4. The last contact we added is lost.
[Blocking Requested - why for this release]: 
BlockPTR2
blocking-b2g: --- → 1.4?
We meet this issue both on Dophin and Flame.

gaia : remote="mozillaorg" revision="efa2b8cb095407df942fee7732a5547c7034ef9b" upstream="v1.4"

gecko : remote="mozillaorg" revision="a040c4e10e140fc234e6ee0bcfd53c0967dc7ba3" upstream="v1.4"
We also found this issue on v2.0
Component: DOM: Apps → DOM: Contacts
Messages have the same problem.
Steps to reproduce:

1.Open Messages.
2.Cut off power in 5s after receiving a message.
3.Reboot the phone and check what we have received in Messages.
4.The Message which is received before power off is lost.

I seriously doubt these issues come down to the stability of indexDB which is incompatible with power fails.
Ben, can you look at this here? The same issues in 2 different API using IndexedDB, I'd say it's an IndexedDB issue caching stuff?
Flags: needinfo?(bent.mozilla)
Component: DOM: Contacts → DOM: IndexedDB
This is how databases work.  If we haven't finished writing the data to the disk then the change will be reverted.
But do we try to write asap or are we waiting for more operation?

I'm not saying we should change anything, I'm just trying to understand what happens :)
I am wondering how databases work, it was just lost the last contact, this is more like database is waiting for more operation, such as waiting for a recovery of state machine.
I suppose it's possible that we're not flushing to disk until the next transaction starts or something.
If we deliver a transaction complete event then the data has been written to disk. Otherwise all bets are off.
Flags: needinfo?(bent.mozilla)
This is more like writing a contact to disk depond on the finish operation of previous one. If save a data, the previous datas are saved completely, but this data depends.
I am trying to understand gecko/dom/indexedDB, it seems that the operation of writing to disdisk is not here, but we can add some preparations,in this way, success can be attained whenever we save data.
Check out bug 1052989 for a patch you can apply to have the very-useful IndexedDB profile markers output to logcat.  That should help shed some light on what IndexedDB is actually being told to do by the mozContacts API.  You can also set DEBUG to true in dom/contacts/fallback/ContactDB.jsm, dom/contacts/fallback/ContactService.jsm, and dom/contacts/ContactManager.js.  The logcat labels will be "Gecko" (for IndexedDB) and "GeckoDump" (for contacts).
(In reply to Andrew Sutherland [:asuth] from comment #13)
Thanks, I am afraid that researching IndexedDB log is a complicated task to me, but I'll have a try.
In addition,I found a way to avoid the missing of the last contact:

In form.js
var doSave = function doSave(contact, noTransition) {
... ...
    var request = navigator.mozContacts.save(utils.misc.toMozContact(contact));

Add "navigator.mozContacts.getAll()" in "request.onsuccess = function onsuccess() {}"
That is, if we save a contact, mozContacts.getAll() is called.

Obviously,it is not a good idea,but some special operations in mozContacts.getAll() are actually what we want.
Hi Andrew- 

Could you check Shiwei's feedback on Comment#14 and give some suggestions?

Thanks
Flags: needinfo?(bugmail)
My advice is the same as in comment 13.  Enable IndexedDB logging to see what the state of the actual IndexedDB transaction is.  It doesn't seem like the code added in comment 14 should be impacting anything unless the STR now includes waiting for mozContacts.getAll() to complete, in which case any races in persisting the data to disk durably are obviously changed so they can finish since IndexedDB and the disk subsystem have more time to write things out.
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #16)
Hi Andrew
  I think mozContacts.getAll() can NOT prolong the writing time too much, it can't be more than 5s, do you agree?
  I follow the methods in comment13 and had some try, but IndexedDB is so unfamiliar to me, I made some modifications in IDBObjectStore.cpp and IDBTransaction.cpp, but nothing changed.
  Could you help to check the IndexedDB transaction state or give some ideas? I can try it out by myself.
Flags: needinfo?(bugmail)
Flags: needinfo?(bugmail)
Flags: needinfo?(bugmail)
Hm, so, I just tried to look into the contacts scenario on trunk on a Flame device and it turns out the IndexedDB-PBackground patch broke the ProfilerHelpers stuff too.  So I checked out Gecko prior to the most recent landing of that patch (git checkout d081140cb01c4f850766901db9558a27f123e463^).  On there I'm seeing 2 "readwrite" IndexedDB transactions and they're both running to completion relatively quickly, where completion in this case is a durable SQLite commit which transfers responsibility to the device's flash and its/the kernel's honoring of fsync/etc./etc.  I'm defining relatively quickly as about ~120ms with only a few contacts and ~190ms with ~2000 contacts using the reference workload.  And I'm defining completion as the profiler-derived logcat messages showing "IndexedDB Transaction ###: Complete (rv = 0)"

Obviously this is trunk and not v2.0 or v1.4.  If you can reproduce on trunk that'd be interesting, but I don't believe the explicit use case of "a user who enters a contact and then uncleanly shuts down the device within 5 seconds and expects the contact to still be there" merits blocker status or investigation on old branches if it doesn't reproduce on the trunk.  It seems like it would be a blocker only if it still holds true for less-artificially small time intervals.  Also, especially if we're not talking about reproducing on a flame device, it's worth first finding out what the filesystem configuration is on the device and whether it is properly honoring fsync()/fdatasync().

If you do reproduce, please provide much more extensive STRs, specifically for contacts:
- How many contacts are in the database (roughly)?
- How quickly did you add the contact(s) after opening the app?  (In my 2000 contacts case on trunk, I notice from the IndexedDB logs that a *ton* of (read) requests were flying by, all as parts of one or more long-running transactions.  I know the Contacts UI had serious scaling problems for a while; it seems conceivable especially on older versions that this could potentially impact writes.)
- How are you cutting off the power?  And is it possible that the mechanism used to do this could also lead to improper operation of the device.
- Are you seeing filesystem fsck's or any results like that when restarting the device afterwards?
Flags: needinfo?(bugmail)
Dear Andrew,
   I am not quite sure the issue has something to do with trunk, but only 2 contacts can reproduce it.
   I cut the power by removing the battery. A video which recording the STRs of reproducing the issue is attached in our bugzilla, please check the comment6 at 

http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=355296

  Thanks.
Flags: needinfo?(bugmail)
Hi Andrew- 

I upload the reproduced video on the youtube:

https://www.youtube.com/watch?v=_mjCSnIy82c&feature=youtu.be

Thanks

Vance
Can you reproduce it on a flame?
Flags: needinfo?(vchen)
According to the following comment, it can be reproduced on Flame:

https://bugzilla.mozilla.org/show_bug.cgi?id=1070820#c8
Flags: needinfo?(vchen)
That's testing for a different issue, though.  If we revert the commit mentioned in bug 1070820 comment 11, does this happen on the flame?
Flags: needinfo?(shiwei.zhang)
Hi Kyle -

Test again with 2.0 and master branch, issue can be reproduced. Since the commit mentioned in bug 1070820 comment 11 is only in 1.4 branch, it should has nothing to do with the contact lost issue here

Also update the reproduced steps here:

1. Create a new contact(e.g. AAA)
2. Create a new contact(e.g. BBB)
3. Remove the battery within 5 seconds after you create BBB
4. Power on the device again, enter Contact App, you will see BBB is lost

Cannot reproduce this issue if removing the battery after 5 seconds in the step 3
Flags: needinfo?(khuey)
I am able to reproduce this on the flame and debugged it.  This is a contacts bug.  The contacts API uses two object stores, the actual store of contacts and the "getallcache".  If you pull the battery at the right time the database gets left in a state where the actual store of contacts is correct but the cache is wrong.  Once the cache gets rebuilt things are correct.  If you try to create another BBB after rebooting the device you'll see two entries in the UI.
Component: DOM: IndexedDB → DOM: Contacts
Flags: needinfo?(khuey)
Kyle, I just want to be sure of what is needed to fix this.

I see that "invalidateCache" is called in the "onsuccess" handler in saveContact (and other locations, but let's discuss saving contacts here).

Would starting a transaction for both datastores in the start of saveContact work? So that both changes are in the same transaction?
Flags: needinfo?(khuey)
I think that making the changes to the "real" object store and the cache in the same transaction would fix this.
Flags: needinfo?(khuey)
I can't test this on my device (bug 1065010, or whatever the real problem is), but this should do the trick. Thanks for diagnosing this, Kyle. I suspected this was a result of my IDB code…

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=c71e354dafd4
(In reply to Julien Wajsberg [:julienw] (PTO until 7th October, ask :steveck or :azasypkin for SMS stuff) from comment #26)
> Would starting a transaction for both datastores in the start of saveContact
> work? So that both changes are in the same transaction?

FWIW, this was already happening*, but ContactDB creates a new transaction for clearing the cache store. 

* IndexedDBHelper passes all stores to IDBDatabase.transaction when you pass a single store to newTxn, it just doesn't return the actual IDBStore objects to the callback. I guess some callers are doing txn.objectStore("store_that_wasnt_in_the_list").
Hi Reuben,
  I test attachment 8501358 [details] [diff] [review] in dolphin, unfortunately, issue still exist.
  Can we directly call getAllStore.clear() in saveContact()?
Flags: needinfo?(shiwei.zhang) → needinfo?(reuben.bmo)
Reuben, shouldn't we do the same for removeContact/removeObjectFromCache?
(In reply to Shiwei Zhang from comment #30)
> Hi Reuben,
>   I test attachment 8501358 [details] [diff] [review] in dolphin,
> unfortunately, issue still exist.
>   Can we directly call getAllStore.clear() in saveContact()?

That's what he's doing in this patch or am I missing something?
(In reply to Julien Wajsberg [:julienw] (PTO until 7th October, ask :steveck or :azasypkin for SMS stuff) from comment #32)

> That's what he's doing in this patch or am I missing something?

I saw getAllStore.clear() is used in the patch.

I just test can the patch fix this issue, but the result is it cannot.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #25)
> I am able to reproduce this on the flame and debugged it.  This is a
> contacts bug.  The contacts API uses two object stores, the actual store of
> contacts and the "getallcache".  If you pull the battery at the right time
> the database gets left in a state where the actual store of contacts is
> correct but the cache is wrong.  Once the cache gets rebuilt things are
> correct.  If you try to create another BBB after rebooting the device you'll
> see two entries in the UI.

I test the patch in comment #28, issue still exist.
I suppose this is an indexedDB bug instead of ContactDB. Cache or "real" object store, no matter where the contact stored, if we pull the battery and reboot device, the last contact is gone, we can never find it by getAll().
Flags: needinfo?(khuey)
With a single transaction for both stores, the specific problem of the cache store being outdated when you reboot should be fixed, but there's a possibility that you're shutting down the phone before or in the middle of the transaction.

I'll update the patch to make sure all the ContactDB operations happen within a single transaction.
Flags: needinfo?(reuben.bmo)
Does the contacts app update the UI before the transaction's oncomplete event fires?

If I repeat the steps with your patch I see a single contact in the (IDB) database if I pull the battery and 2 if I don't.  The value in the cache appears to be in sync though.
Flags: needinfo?(khuey)
This appears to be a dupe of 1070820
This appears to be a dupe of bug 1070820
In bug 1070820 it was especially decided to file this bug, because there are 2 different issues.
Yes, this issue is not a dupe of 1070820, please pay attention to comment 24 and comment 23.
[Blocking Requested - why for this release]:

Suggest to nominate this one to 2.1 blocker since this is a common issue that can be reproduced on Flame. Although the RSP is not a command use case(need to remove battery immediately after creating a new contact data), user will suffer from data lost if issue happens
blocking-b2g: 1.4? → 2.1?
Attachment #8501358 - Attachment is obsolete: true
Attachment #8505159 - Flags: review?(anygregor)
(In reply to Reuben Morais [:reuben] from comment #42)
> Created attachment 8505159 [details] [diff] [review]
> Make sure ContactDB remove and save happen within a single transaction
。
Issue still exist,the last added contact can't be writed in ***csotncta.sqlite if cut off power in 5s.
The triage group decided that while this is bad behaviour (the data loss), the affected population of people who are adding contacts *right* when their battery dies isn't huge and it's not a regression.  I've moved the nom to backlog; hope that's okay.

We can always uplift if we get a patch.
blocking-b2g: 2.1? → backlog
(In reply to Andrew Overholt [:overholt] from comment #44)
> The triage group decided that while this is bad behaviour (the data loss),
> the affected population of people who are adding contacts *right* when their
> battery dies isn't huge and it's not a regression.  I've moved the nom to
> backlog; hope that's okay.
> 
> We can always uplift if we get a patch.

In fact, Messages App have the same issue,I'm afraid it is a common defect.
Since Messages App have the same issue (comment #4), we should ensure that the changes are written to disk while complete a transaction.
Similar issue is found in contacts deleting, 
Steps to reproduce:

1.Delete two contacts. 
2.Cut off power in 5s after deleting operation completed.
3.Reboot the phone and check contacts list.
4.The last deleted contact appears.
I think this issue has much to do with sqlite-journal,if I delete the ***csotncta.sqlite-journal before open Contacts App when reboot completed, everthing went normally!! the last added contact is no longer lost.
I think this is how journaling works. But maybe we don't commit properly to the storage...
Comment on attachment 8505159 [details] [diff] [review]
Make sure ContactDB remove and save happen within a single transaction

Review of attachment 8505159 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/contacts/fallback/ContactDB.jsm
@@ +980,2 @@
>        if (DEBUG) debug("Going to update" + JSON.stringify(contact));
> +      let store = stores[0]; // main store

Can't we access them by name?
Attachment #8505159 - Flags: review?(anygregor) → review+
(In reply to Julien Wajsberg [:julienw] from comment #49)
> I think this is how journaling works. But maybe we don't commit properly to
> the storage...

I think we don't commit properly to the journal instead of sqlite db file, and db is restored by journal when rebooting.
Flags: needinfo?(anygregor)
Flags: needinfo?(anygregor)
Whiteboard: [sprd355296] BlockPTR2 → [sprd355296] BlockPTR2 [systemsfe]
I think this patch can help handling the issue, but I am not sure if it is entirely feasible.
Please give a review.
Thanks.
Attachment #8508585 - Flags: review?(felash)
Attachment #8508585 - Flags: review?(anygregor)
Comment on attachment 8508585 [details] [diff] [review]
mozillabug1072773_contact_lost.patch

Review of attachment 8508585 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not suitable to review this, but I think this is unreasonnable as this will impair performance a lot.
Attachment #8508585 - Flags: review?(felash)
Comment on attachment 8508585 [details] [diff] [review]
mozillabug1072773_contact_lost.patch

I can't comment on the sqlite change, but the contacts change undoes the getAll performance optimizations, so it's definitely undoable.
Attachment #8508585 - Flags: review-
Attached patch Patch? (obsolete) — Splinter Review
Dr. Hipp, could you take a look at this patch?  We do not understand why when journal_mode=TRUNCATE sqlite makes a truncate call without following it with an fsync, but in zeroJournalHdr the truncate call is followed by an fsync.  Without this fsync call the rollback journal is still present and valid if we pull the power and when we restart the system it is used to roll back a transaction that has completed.
Attachment #8508986 - Flags: review?(drh)
Attachment #8508585 - Attachment is obsolete: true
Attachment #8508585 - Flags: review?(anygregor)
Comment on attachment 8508986 [details] [diff] [review]
Patch?

Shiwei, please run a test with this patch on your device.  It fixes the issue on my flame.
Attachment #8508986 - Flags: feedback?(shiwei.zhang)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #55)
> Created attachment 8508986 [details] [diff] [review]
> Patch?
> 
> Dr. Hipp, could you take a look at this patch?

(1) Kyle:  Please call me "Richard".  :-)

(2) Ben Turner described the problem on a private chat and I then implemented a patch that is very similar to what you have, without first looking at your patch.  So, yes, I think your patch is probably the right answer.  But we are going to what to do a lot of testing on our end before we make the change official.  I'll post a follow-up message in the unlikely event that we find your patch to be inadequate in some way.

(3) My patch does go one step further in that it only invokes the fsync() after ftruncate() if PRAGMA synchronous=FULL (see https://www.sqlite.org/pragma.html#pragma_synchronous).  Synchronous=FULL is the default setting in TRUNCATE mode.  But by disabling the extra fsync with synchronous=NORMAL, that gives applications that only care about consistency and not durability the opportunity to avoid the extra fsync().
(In reply to D. Richard Hipp from comment #57)
> (3) My patch does go one step further in that it only invokes the fsync()
> after ftruncate() if PRAGMA synchronous=FULL

This sounds fine to me.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #56)
> Comment on attachment 8508986 [details] [diff] [review]
> Patch?
> 
> Shiwei, please run a test with this patch on your device.  It fixes the
> issue on my flame.

Yes, issue is fixed on dolphin, Thanks.
Flags: needinfo?(bugmail)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #56)
Hi Kyle
  Is rollback journal invalidated in the fsync call? I could not understand why it take 5s in synchronization without this patch.
  Thank you again for your timely help.
(In reply to Shiwei Zhang from comment #61)
>   Is rollback journal invalidated in the fsync call? 

The rollback journal is invalidated by ftruncate().  However, the ftruncate() does not survive the power outage if it is not followed by an fsync().  So after the power outage, the full original text of the rollback journal reappears, which prompts SQLite to rollback the transaction.  The extra fsync() call forces the ftruncate() to persist across the power outage.
SQLite version 3.8.6.1 is now available at https://www.sqlite.org/2014/sqlite-amalgamation-3080601.zip

Version 3.8.6.1 is simply version 3.8.6 with a patch like Kyle's and with the version number increased and a record of having been officially tested.
Thank you for testing Shiwei, and thank you for the official patch Richard.
Attached patch Part 1: Update sqlite to 3.8.6.1 (obsolete) — Splinter Review
Attachment #8508986 - Attachment is obsolete: true
Attachment #8508986 - Flags: feedback?(shiwei.zhang)
Attachment #8509528 - Flags: review?(bent.mozilla)
Comment on attachment 8509528 [details] [diff] [review]
Part 1: Update sqlite to 3.8.6.1

Marco usually stamps these.
Attachment #8509528 - Flags: review?(bent.mozilla) → review?(mak77)
Attachment #8509528 - Flags: feedback+
Comment on attachment 8509528 [details] [diff] [review]
Part 1: Update sqlite to 3.8.6.1

Review of attachment 8509528 [details] [diff] [review]:
-----------------------------------------------------------------

you should bump the version in configure.in and in db/sqlite3/README.MOZILLA

I'd also prefer if this would happen in its own bug in Toolkit/Storage and block this bug please, so we can better track it.
Attachment #8509528 - Flags: review?(mak77) → review-
Comment on attachment 8509528 [details] [diff] [review]
Part 1: Update sqlite to 3.8.6.1

Spun off to bug 1088143.
Attachment #8509528 - Attachment is obsolete: true
Keywords: leave-open
Comment on attachment 8514466 [details] [diff] [review]
IDB patch to enable the SQLite fix

Review of attachment 8514466 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/indexedDB/ActorsParent.cpp
@@ +2208,5 @@
>  
> +  NS_NAMED_LITERAL_CSTRING(kNormalSynchronousPragma, "PRAGMA synchronous = NORMAL;");
> +  NS_NAMED_LITERAL_CSTRING(kFullSynchronousPragma, "PRAGMA synchronous = FULL;");
> +
> +  rv = aConnection->ExecuteSimpleSQL(IndexedDatabaseManager::FullSynchronous() ?

Since the default is NORMAL I don't think we need to set it to normal again. Just set to full if the pref is set.
Attachment #8514466 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/acd029ee3d13
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Vance, please work with the Devices team to backport the patches in this bug and in bug 1088143 to whatever branches the vendor requires.  Note that the vendor must set the "dom.indexedDB.fullSynchronous" pref to true to enable this fix.
Flags: needinfo?(vchen)
Thanks for the reminder Kyle, already inform partners about the patch
Flags: needinfo?(vchen)
Depends on: 1114520
Depends on: 1116930
Depends on: 1124117
No longer depends on: 1124117
blocking-b2g: backlog → ---
Hi Kyle Huey,
  I found this modification did NOT fix this issue, contact will still be lost.

@@ -42202,6 +43517,14 @@ static int pager_end_transaction(Pager *pPager, int hasMaster, int bCommit){
         rc = SQLITE_OK;
       }else{
         rc = sqlite3OsTruncate(pPager->jfd, 0);
+        if( rc==SQLITE_OK && pPager->fullSync ){
+          /* Make sure the new file size is written into the inode right away.
+          ** Otherwise the journal might resurrect following a power loss and
+          ** cause the last transaction to roll back.  See
+          ** https://bugzilla.mozilla.org/show_bug.cgi?id=1072773
+          */
+          rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags);
+        }
       }
       pPager->journalOff = 0;
     }else if( pPager->journalMode==PAGER_JOURNALMODE_PERSIST


in sqlite v3.8.6.1, also in v3.8.9.

but change the "if" condition

+ if( rc==SQLITE_OK && pPager->fullSync ){

to 

+ if( rc==SQLITE_OK ){

the patch will work normally.

Just let you know.

Thank you.
Flags: needinfo?(khuey)
Did you set the preference, as I mention in comment 76?
Flags: needinfo?(khuey) → needinfo?(shiwei.zhang)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #79)
> Did you set the preference, as I mention in comment 76?

No, I did not, I can't find dom.indexedDB.fullSynchronous in gecko/gaia.

could you tell me the path of this preference file? 

Thank you.
Flags: needinfo?(shiwei.zhang)
You just add an entry for that preference whereever you are putting your other OEM customizations.  Your TAM (Vance) ought to be able to point you in the right direction.
Thank you Kyle.

Dear Vance,
  Could you tell me how to set the preference to, as comment 79?
  Where is this preference file in v2.1s?

Thank you.
Flags: needinfo?(vchen)
Hi Shiwei -

Not sure if SPRD has its own customization mechanism, but normal you can just add the following preference in gaia/build/config/custom-prefs.js

pref("dom.indexedDB.fullSynchronous", true);

Please give it a try and let me know if that works

Thanks

Vance
Flags: needinfo?(vchen)
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: