2nd run failure on emulator-x86-kk

RESOLVED FIXED

Status

P1
normal
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: swu, Assigned: jdai)

Tracking

unspecified
x86_64
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(tracking-b2g:backlog)

Details

Attachments

(6 attachments)

(Reporter)

Description

4 years ago
Created attachment 8577944 [details]
Error log.

After building emulator-x86-kk, we are only able to run run-emulator.sh for the first time.  Attached is the error log, looks like it's related to device storage or indexedDB.
(Reporter)

Updated

4 years ago
Blocks: 1090358
Might be the same as bug 1143703.
See Also: → bug 1143703
(In reply to Shian-Yow Wu [:swu] from comment #0)
> Created attachment 8577944 [details]
> Error log.
> 
> After building emulator-x86-kk, we are only able to run run-emulator.sh for
> the first time.  Attached is the error log, looks like it's related to
> device storage or indexedDB.

I hit the same IndexedDB error. I can get only a black screen emulator.
(Assignee)

Comment 3

4 years ago
From my investigation it appears that 2nd run emulator those files become empty.
data/local/storage/permanent/chrome/.metadata
data/local/storage/default/12+f+app+++keyboard.gaiamobile.org/.metadata
data/local/storage/default/21+f+app+++verticalhome.gaiamobile.org/.metadata
data/local/storage/default/30+f+app+++system.gaiamobile.org/.metadata

I backup those files from 1st run emulator and 2nd run emulator restore it back, I am able to boot up emulator.
(Assignee)

Updated

4 years ago
See Also: → bug 1107608
(Assignee)

Comment 4

4 years ago
The emulator-x86-kk cannot run 2nd time only happen at force close emulator. e.g., ctrl+c, force close emulator, kill emulator process. I close the emulator through emulator grahic panel button, those .metadata files are exist and not empty.
I've had some success with running `adb shell sync` before halting the emulator.  Obviously not a real fix (and doesn't work if, say, the adb server wasn't running when the emulator was started), but it might be a better workaround, at least.
(Assignee)

Comment 6

4 years ago
Created attachment 8595942 [details] [review]
[platform_system_core] pull request 9 [Checkin: comment 28]
Attachment #8595942 - Flags: feedback?(jld)
(Assignee)

Comment 7

4 years ago
Created attachment 8595943 [details] [review]
[device_generic_goldfish] pull request 27 [Checkin: comment 28]
Attachment #8595943 - Flags: feedback?(jld)
(Assignee)

Updated

4 years ago
Assignee: nobody → jdai
(Assignee)

Updated

4 years ago
Attachment #8595942 - Flags: feedback?(jld) → review?(tzimmermann)
(Assignee)

Updated

4 years ago
Attachment #8595943 - Flags: feedback?(jld) → review?(tzimmermann)
(Assignee)

Comment 8

4 years ago
Hi Thomas,

Would you mind help me review this patches? If I assigned to the wrong reviewer person, please help me forward to the correct person.

Thanks.
Created attachment 8596362 [details] [diff] [review]
Gecko patch: fsync new QuotaManager metadata files.

I tried to see if there's a less dramatic way to deal with this than mounting all of /data sync.  This patch seems to avoid bug 1107608 when interrupting and restarting the emulator, although it's not a complete fix for that bug as I understand it.  It would be interesting to see if it helps other people who've run into these second-boot problems.


I also encountered another failure mode, where the Gaia system app didn't have the various privileged navigator properties defined; this basically breaks the entire system.  I couldn't reproduce it in order to find out exactly what went wrong, but I'd guess there's someplace that app permissions are stored that's affected.

Also, the profiles.ini apparently isn't fsync()ed, so it can be truncated to 0 length the same way as the QuotaManager metadata, depending on when the first boot is interrupted.  But this seems to be harmless — it results in Gecko ignoring the original "default" profile directory and creating a new one, and the emulator seems to work normally.
Attachment #8596362 - Flags: feedback?(swu)
Attachment #8596362 - Flags: feedback?(jdai)
Comment on attachment 8595942 [details] [review]
[platform_system_core] pull request 9 [Checkin: comment 28]

Hi,

Thanks for working on this problem. In general, I'm OK with this fix if it solves the reported problem without much overhead. I just have a few general comments and questions.

  - Your commit messages just contain the reported bug. It'd be better to state what the patch does.

  - Does 'sync' increase the overhead significantly?

  - Could the same effect be achieved by using ext4's 'data=ordered'?
Attachment #8595942 - Flags: review?(tzimmermann) → review+
Comment on attachment 8595943 [details] [review]
[device_generic_goldfish] pull request 27 [Checkin: comment 28]

Same as comment 10.
Attachment #8595943 - Flags: review?(tzimmermann) → review+
(Reporter)

Comment 12

4 years ago
Comment on attachment 8596362 [details] [diff] [review]
Gecko patch: fsync new QuotaManager metadata files.

With this patch, there is no more error message as in comment 0.  But the 2nd run failure problem still happens with different error messages, probably it's same as the other failure case you mentioned.  After booted to homescreen at the 1st run, I hit ctrl-c to close the emulator.
Attachment #8596362 - Flags: feedback?(swu)
(Reporter)

Comment 13

4 years ago
Created attachment 8596416 [details]
Error log for fsync metadata file patch

Here is the error log for comment 12.
Created attachment 8596867 [details] [diff] [review]
Gecko patch: don't opt in to permission DB corruption and data loss.

What's going on here is that we're not syncing writes to the permissions database (/data/local/permissions.sqlite).  On first boot or after an update, the code in dom/apps/Webapps.jsm (in loadAndUpdateApps and installSystemApps; also isFirstRun in AppsUtils.jsm) copies permissions from the apps' webapps.manifest files into the permissions DB, and then sets preferences to indicate that that's been done.

The prefs.js file, however, is fsync()ed; if the profiles.ini file is lost (as described above) then that doesn't matter because it will never be read, but it's possible for the emulator to be stopped when the profiles.ini has been flushed to storage but the permissions.sqlite hasn't.  In that case, the next boot sees the prefs that indicate the installed apps have already been scanned, but the permissions DB is empty, which breaks the system app ("navigator.mozSettings is null" and so on).

Adding fflush() and fsync() calls to nsToolkitProfileService::Flush (and applying the QuotaManager patch) makes this reliably reproducible.

I'll file a separate bug for this, because this should be fixed regardless of what happens with the B2G emulator.
I also, in one case, saw corruption of permissions.sqlite, not just data loss — that is, running the sqlite3 utility on it, all commands fail with “Error: database disk image is malformed”.  I wasn't able to reproduce this, but the sqlite documentation warns that it's a possibility if `PRAGMA synchronous = OFF` is used and the OS isn't shut down cleanly.
Probably not useful, but in case it is: for the permissions DB issue, it should be possible to “unbrick” the emulator by using edit_prefs.sh to set dom.apps.reset-permissions to false.
(Assignee)

Comment 17

4 years ago
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #10)
> Comment on attachment 8595942 [details] [review]
> [platform_system_core] pull request 9
> 
> Hi,
> 
> Thanks for working on this problem. In general, I'm OK with this fix if it
> solves the reported problem without much overhead. I just have a few general
> comments and questions.
> 
>   - Your commit messages just contain the reported bug. It'd be better to
> state what the patch does.
> 
I changed the commit message for more descriptions.

>   - Does 'sync' increase the overhead significantly?
> 
No, it didn't. But I don't know any tool which can measure emulator performance.
If you know any tool, please tell me.

>   - Could the same effect be achieved by using ext4's 'data=ordered'?
Using ext4's 'data=ordered' can't solve this issue, because 'data=ordered' is the default mode in ext4[1].
[1] http://man7.org/linux/man-pages/man5/ext4.5.html
[Tracking Requested - why for this release]:
tracking-b2g: --- → backlog
Comment on attachment 8596362 [details] [diff] [review]
Gecko patch: fsync new QuotaManager metadata files.

Bug 1157979 is going to fix this more thoroughly.
Attachment #8596362 - Flags: feedback?(jdai)

Comment 20

4 years ago
All,

About the permission DB(permissions.sqlite) issue, there are two cases as follow which have been identified. Please refer to my comment at https://bugzilla.mozilla.org/show_bug.cgi?id=1115619#c23.

Case 1: App permission records in table are not completed(bug 1115619)
  Already land a patch to guarantee all app permissions can be loaded to permissions.sqlite completely and successfully.
Case 2: corrupted permissions.sqlite(bug 1124126)
  The patch of handling corrupted permissions.sqlite is under review.
(In reply to John Dai[:johnz][:jdai] from comment #17)
> (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #10)
> > Comment on attachment 8595942 [details] [review]
> > [platform_system_core] pull request 9
> > 
> > Hi,
> > 
> > Thanks for working on this problem. In general, I'm OK with this fix if it
> > solves the reported problem without much overhead. I just have a few general
> > comments and questions.
> > 
> >   - Your commit messages just contain the reported bug. It'd be better to
> > state what the patch does.
> > 
> I changed the commit message for more descriptions.
> 
> >   - Does 'sync' increase the overhead significantly?
> > 
> No, it didn't. But I don't know any tool which can measure emulator
> performance.
> If you know any tool, please tell me.
> 

I have no idea which is the best tool for measuring the performance overhead. I just did some experiments based on existing indexDB mochitests. I specifically ran test_deleteDatabase_interactions.html and test_unique_index_update.html . The average time it took for the 1st test before and after applying John's patch was 372s and 374s. For the 2nd test, before and after: 372s and 377s.

Any better idea of examining the possible overhead resulting from the patch, Thomas?

Updated

4 years ago
Flags: needinfo?(tzimmermann)
Thanks Hsin-Yi! It was just a question to rule out some obvious problem. But AFAIK syncs in the emulator don't propagate to the actual hard drive.

Comment 14 sounds scary to me. We can break a device by modifying files in such a trivial way. (?) Syncing the FS is probably OK for now, but we might want to revert this change when these DB bugs have been resolved.
Flags: needinfo?(tzimmermann)
Something like Bonnie [1] might be useful for performance testing.

[1] http://www.textuality.com/bonnie/
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #22)
> Thanks Hsin-Yi! It was just a question to rule out some obvious problem. But
> AFAIK syncs in the emulator don't propagate to the actual hard drive.
> 
> Comment 14 sounds scary to me. We can break a device by modifying files in
> such a trivial way. (?) Syncing the FS is probably OK for now, but we might
> want to revert this change when these DB bugs have been resolved.

Agreed!
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #22)
> Comment 14 sounds scary to me. We can break a device by modifying files in
> such a trivial way. (?) Syncing the FS is probably OK for now, but we might
> want to revert this change when these DB bugs have been resolved.

Comment #14 is now bug 1157950.  To summarize: we definitely need to use a journaling mode where the effects of unexpectedly halting the OS are limited to nondurability, and that's going to happen.  The synchronization problem with Webapps.jsm is harder; most users of the permissions manager don't need durability, and it doesn't even try to provide it, not even optionally: it's doing updates asynchronously on a background thread, on top of whatever transaction loss we could get from SQLite itself.  So that needs more work if we want to truly fix it; just syncing will make the race condition harder to hit by accident, but it will still exist.
Thanks for this summary.

Updated

4 years ago
Depends on: 1162624
No longer depends on: 1157979

Updated

4 years ago
Blocks: 1177234

Updated

4 years ago
No longer blocks: 1177234

Comment 27

3 years ago
Hi John,
Can you help to land your patch which already r+?
Thanks!
Status: NEW → ASSIGNED
Flags: needinfo?(jdai)
Priority: -- → P1
(Assignee)

Updated

3 years ago
Flags: needinfo?(jdai)
Keywords: checkin-needed
Attachment #8595942 - Attachment description: [platform_system_core] pull request 9 → [platform_system_core] pull request 9 [Checkin: comment 28]
Attachment #8595943 - Attachment description: [device_generic_goldfish] pull request 27 → [device_generic_goldfish] pull request 27 [Checkin: comment 28]
(Assignee)

Comment 29

3 years ago
This issue is verified fixed on the latest builds.

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.