Closed Bug 1645907 Opened 4 years ago Closed 2 months ago

Handle corrupt ExtensionPermission database

Categories

(WebExtensions :: Storage, defect, P3)

defect

Tracking

(firefox125 fixed)

RESOLVED FIXED
125 Branch
Tracking Status
firefox125 --- fixed

People

(Reporter: agi, Assigned: rpl)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [addons-jira])

Attachments

(4 files, 2 obsolete files)

While working on Bug 1645181 I noticed that if the database becomes corrupt for some reason we don't really handle it all that well. I'm gonna attach a patch soon.

See Also: → 1645181

This patch makes it so whenever we encounter a corrupt database file we recrate
it and backup the corrupt file to a folder in the profile.

Severity: -- → S3
Priority: -- → P1
Blocks: 1646182
Depends on: 1646451
Pushed by asferro@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cafdc157b491
Recreate ExtensionPermission db on corrupt file. r=mixedpuppy

Backed out for failures on test_extension_permissions_corrupt.js

backout: https://hg.mozilla.org/integration/autoland/rev/613e8419e9b5f5705e4898065c5cc6454f0e16b6

push: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedTaskRun=aCh_9chrTjGPzOs_XDFsXQ.0&revision=cafdc157b491c28f54af655e677e8a10dace3536&group_state=expanded

failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307899823&repo=autoland&lineNumber=2028

[task 2020-06-29T18:00:42.330Z] 18:00:42 INFO - TEST-START | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_extension_permissions_corrupt.js
[task 2020-06-29T18:00:42.750Z] 18:00:42 WARNING - TEST-UNEXPECTED-FAIL | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_extension_permissions_corrupt.js | xpcshell return code: -11
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - TEST-INFO took 416ms
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - >>>>>>>
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - PID 2170 | [2170, Main Thread] WARNING: Couldn't get the user appdata directory. Crash events may not be produced.: file /builds/worker/checkouts/gecko/toolkit/crashreporter/nsExceptionHandler.cpp, line 2924
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - (xpcshell/head.js) | test MAIN run_test pending (1)
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - (xpcshell/head.js) | test run_next_test 0 pending (2)
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - (xpcshell/head.js) | test MAIN run_test finished (2)
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - running event loop
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - PID 2170 | [2170, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, kKnownEsrVersion) failed with result 0x80004002 (NS_NOINTERFACE): file /builds/worker/checkouts/gecko/toolkit/components/resistfingerprinting/nsRFPService.cpp, line 673
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_extension_permissions_corrupt.js | Starting check_remote
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - (xpcshell/head.js) | test check_remote pending (2)
[task 2020-06-29T18:00:42.750Z] 18:00:42 INFO - TEST-PASS | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_extension_permissions_corrupt.js | check_remote - [check_remote : 1] useRemoteWebExtensions matches - false == false
[task 2020-06-29T18:00:42.752Z] 18:00:42 INFO - TEST-PASS | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_extension_permissions_corrupt.js | check_remote - [check_remote : 1] testing from extension process - true == true
[task 2020-06-29T18:00:42.752Z] 18:00:42 INFO - (xpcshell/head.js) | test run_next_test 0 finished (2)

Flags: needinfo?(agi)

Looks like we need to wait for Bug 1597898

Depends on: 1597898
Flags: needinfo?(agi)

This patch makes it so whenever we encounter a corrupt database file we recrate
it and backup the corrupt file to a folder in the profile.

Attachment #9166675 - Attachment is obsolete: true

Hey Agi,
I see that Bug 1646182 (enabling the ExtensionPermissions rkv-based backend on all channels) is currently marked as P3, is P1 still an appropriate priority from your perspective?

Besides that, I was wondering:

  • is the phabricator revision originally as approved by Shane ready or does it need some more tweaks to leverage changes introduced by Bug 1597898? (I see that Bug 1646451 may be actually be another blocker, but based on a quick look that shouldn't require changes to the patch attached to this bug).
  • is the patch attached to this issue still something that you still plan to finalize and land?
Flags: needinfo?(agi)

Hey Luca! IIRC this was waiting on support from rkv. The problem was/is that rkv does not allow reloading the database in the event of a corrupted db file. Victor was working on a fix but I think he's not at mozilla anymore and I'm not sure what the state of that is :( I can double check on Monday (leaving the ni to do that).

Reset for triage again.

Assignee: agi → nobody
Severity: S3 → N/A
Priority: P1 → --
Flags: needinfo?(agi)

Hi Lina,
I did a little bit of digging to see if this issue was still relevant or if something was already done on the kvstore or rkv side and this bugzilla issue may need to be updated accordingly, and it seems that rkv SafeMode does now actually take care of it internally when it does determine that the data being read is corrupted (since when https://github.com/mozilla/rkv/pull/207 got merged and so that should have been already also happening in mozilla-central too, since when the rkv crate in mozilla-central have been updated to a version that included that change).

And so I wanted to confirm with your help:

  • is there anything else that should be done on the mozilla-central side? The patch originally attached to Bug 1646451 and then marked as obsolete (https://phabricator.services.mozilla.com/D80050) seems to be doing on the kvstore side what https://github.com/mozilla/rkv/pull/207 did for the rkv SafeMode on the rkv store side (and so maybe Bug 1646451 could as well be closed now)

  • would it be reasonable to consider renaming the corrupted file into an unique .corrupt file so that we would be able to investigate corruption issues? (similarly to what we do in case of corrupted json files in JSONFile.sys.mjs)
    if that would sound a reasonable follow up to you to, would that be something that fits better in rkv or the kvsore layer? (mainly to determine where the bug tracking that followup would be filed, so that we can consider that as a blocker to fully close this issue on our side)

Let us know what you think.

Flags: needinfo?(lina)
Whiteboard: [addons-jira]

Hi Jan,
would you be able to take over the needinfo currently assigned to Lina in comment 9?

Thanks in advance for your help!

Flags: needinfo?(jrediger)

Oops, I'm so sorry, I thought I replied to this, but it was actually bug 1807010. 😬

  1. I don't see any uses of set_discard_if_corrupted in m-c, so I think we might still need to adopt it in kvstore?
  2. That sounds reasonable to me! Since rkv already has logic to delete corrupted files on load, I wonder if we can generalize set_discard_if_corrupted to something like set_corruption_recovery_strategy, where the options are Discard (the current behavior), Rename (to rename the corrupted file into a .corrupt file, like you suggested), and Ignore (bubble the error up). We can then add another variant of nsIKeyValueService.getOrCreate, like nsIKeyValueService.getOrCreateWithOptions, to let Firefox JS callers control the behavior.

WDYT?

Flags: needinfo?(lina)

(In reply to Lina Butler [:lina] from comment #11)

Oops, I'm so sorry, I thought I replied to this, but it was actually bug 1807010. 😬

hehe, no worries at all, I was actually sorry to have been filling your queue of multiple rkv/kvstore related questions all at once and so I thought to spread the remaining ones around a little bit :-P

  1. I don't see any uses of set_discard_if_corrupted in m-c, so I think we might still need to adopt it in kvstore?
    ah, I knew that I was potentially missing to notice one piece of the puzzle, now I see that the only call in mozilla-central for set_discard_if_corrupted is from a unit test part of the rkv crate itself, me facepalming myself for having missed to notice that before.

And so this seems to confirm that Bug 1646451 is still relevant and should be wiring that up, eventually along or after we also took care of point (2) described below.

  1. That sounds reasonable to me! Since rkv already has logic to delete corrupted files on load, I wonder if we can generalize set_discard_if_corrupted to something like set_corruption_recovery_strategy, where the options are Discard (the current behavior), Rename (to rename the corrupted file into a .corrupt file, like you suggested), and Ignore (bubble the error up). We can then add another variant of nsIKeyValueService.getOrCreate, like nsIKeyValueService.getOrCreateWithOptions, to let Firefox JS callers control the behavior.

WDYT?

yep, personally I think that would work perfectly for us, we would change our calls to pass the new options and enable our preferred behavior by opt-in into the Rename strategy.

Let's also confirm with Jan what would be his perspective, so that we can then also update Bug 1646451 to reflect the final resolution about what we want to aim for the work that will belong to fixing that bug.

Flags: needinfo?(jrediger)

Hi Jan,
what do you think about the strategy described in point (2) from Lina's comment 11?

if that sounds good to you too, I'm happy to update Bug 1646451 to make sure we captured what is the current status, what we are aiming for and what is missing to get there.

Flags: needinfo?(jrediger)

(*Jan-Erik)

Sorry for the delay.
The set_corruption_recovery_strategy option sounds good to me.
I don't see any usage of set_discard_if_corrupted in glean or appservices, so we wouldn't even break them.

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #14)

(*Jan-Erik)

Sorry for the delay.
The set_corruption_recovery_strategy option sounds good to me.
I don't see any usage of set_discard_if_corrupted in glean or appservices, so we wouldn't even break them.

Thanks a lot for the confirmation!

are there already some bugzilla and github issues that we can track as blockers of this issue on the ExtensionPermissions.jsm side?
would you mind to file the bug in the relevant bug tracker if they are not already filed?

(I guess that a github issue would be tracking the changes needed on the rkv side. and bugzilla issues for tracking vendoring the new rkv version and using the new options on the kvstore side)

This bugzilla issue will still be tracking the changes needed for make ExtensionPermissions.jsm kvstore backend to opt-in to the "handling corrupted mode" option in the call to create the kvstore instance (and we will be filing another one to do the same for ExtensionScriptingStore.jsm), and so we'd like to add as blockers for this bug the bugzilla (and/or Github issue) tracking the changes that needs to happen on the rkv and kvstore sides to unblock this work.

Thanks again for your help on this.

Flags: needinfo?(jrediger)
Flags: needinfo?(jrediger)

I filed https://github.com/mozilla/rkv/issues/234.
You can file the issues for tracking in the respective components.

I see if I can get around to tackling the rkv side.

Severity: N/A → S3
Priority: -- → P3

Ping after several months (🥲): https://github.com/mozilla/rkv/pull/235 is now merged and is published as v0.19.0.

Blocks: 1872499

This initial patch only includes an additional test case to confirm which is
the default recovery strategy used by the kvstore (as its own patch to confirm
that the changes introduced by the child patch to the kvstore internals is
going to keep that behavior unchanged).

Assignee: nobody → lgreco
Status: NEW → ASSIGNED
Attachment #9157385 - Attachment is obsolete: true

I've been looking to a android test-verify failure hit by the test case updated by the last patch in the stack (test_ext_scripting_persistAcrossSessions.js) in my last push to try, but a push to try without any of the changes attached to this bug and only an inline comment added to that test case to force it to be run in test-verify has hit the exact same failure and so that looks an unrelated issue (one hit becuase the test file has been changed to add an additional test case):

And so I'll be pushing the attached stack of patches to autoland shortly.

Pushed by luca.greco@alcacoop.it:
https://hg.mozilla.org/integration/autoland/rev/171b3c244e87
Test kvstore default recovery strategy. r=janerik
https://hg.mozilla.org/integration/autoland/rev/1c271afe8f96
Added a new kvstore getOrCreateWithOptions method to allow setting a custom recovery strategy. r=janerik
https://hg.mozilla.org/integration/autoland/rev/d1562e742107
Use kvstore rename recovery strategy on corrupted db in ExtensionPermissions. r=willdurand
https://hg.mozilla.org/integration/autoland/rev/01672912dc97
Use kvstore rename recovery strategy on corrupted db in ExtensionScriptingStore. r=willdurand
Regressions: 1881876
No longer regressions: 1881876
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: