TLS randomly stops working in GeckoView
Categories
(Core :: Security: PSM, defect, P1)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox-esr60 | --- | unaffected |
| firefox66 | --- | unaffected |
| firefox67 | --- | unaffected |
| firefox68 | + | fixed |
People
(Reporter: snorp, Assigned: keeler)
References
(Depends on 1 open bug)
Details
(Keywords: regression, Whiteboard: [geckoview:fenix:m4][necko-triaged][psm-assigned])
Attachments
(3 files, 1 obsolete file)
We've had some problems recently where TLS loads return an error every time. It seems that something is going wrong in the profile, because clearing that resolves the issue.
| Reporter | ||
Comment 1•6 years ago
|
||
| Reporter | ||
Comment 2•6 years ago
|
||
:keeler, this seems like it could be certdb related, can you have a look at the logs?
| Reporter | ||
Updated•6 years ago
|
| Reporter | ||
Comment 3•6 years ago
|
||
| Reporter | ||
Comment 4•6 years ago
|
||
Near as I can tell, GetFirstEVPolicy() always returns false here[0], which results in CertVerifier::VerifyCert failing with Result::ERROR_UNKNOWN_ERROR. I am not sure what about this profile is causing this.
| Reporter | ||
Comment 5•6 years ago
|
||
Hmm, not sure what changed, but I can't seem to repro anymore with the attached profile. I thought I had it narrowed down to a stale/invalid state_security/lock.mdb, but not sure now.
Comment 6•6 years ago
|
||
This might be related to Fennec "Secure Connection Failed" bug 1541354.
| Assignee | ||
Comment 7•6 years ago
|
||
If we can't initialize cert_storage for whatever reason, this will happen. This raises some questions: why can't we initialize this component? should we fail a bit more "open" if we can't initialize it? (right now it stores some certificate revocation information, so we should try to avoid situations where we'll just skip these checks).
| Reporter | ||
Comment 8•6 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #7)
If we can't initialize
cert_storagefor whatever reason, this will happen. This raises some questions: why can't we initialize this component? should we fail a bit more "open" if we can't initialize it? (right now it stores some certificate revocation information, so we should try to avoid situations where we'll just skip these checks).
Is this a recent-ish change? I'm having trouble reproducing the problem at the moment, but I'll see if the cert_storage stuff is the root cause when I'm able.
| Assignee | ||
Comment 9•6 years ago
|
||
It landed in bug 1429796. See also bug 1535662 and related bugs.
Comment 10•6 years ago
|
||
Bug 1541811 ("Can't sign into Google Play account") might be related to these TLS issues. Softvision's bisection identified the following pushlog as a possible source of the regression:
| Reporter | ||
Comment 11•6 years ago
|
||
Some progress: we appear to be failing to open the cert_storage lmdb, as it's failing with SecurityStateError { message: "lmdb error: MDB_INVALID: File is not an LMDB file" }. I don't know why this is occuring yet.
| Reporter | ||
Comment 12•6 years ago
|
||
The trouble appears to begin here: https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c#3712
Somehow the database appears corrupt on the device, but inspecting the files on my dev machine reveals they are not. Investigating further...
| Reporter | ||
Comment 13•6 years ago
|
||
OK, I think we've got it figured out. LMDB 0.9.x databases are not compatible between 32-bit and 64-bit architectures. It looks like in this case, the databases were created on a 64-bit machine, but then later read on a 32-bit one (and my emulator is using i686). This is problematic, because we plan to migrate to 64-bit builds for most Android users in the coming months. Myk, can we migrate to the newer LMDB to avoid this issue? (Also, thanks for helping me debug this!)
Updated•6 years ago
|
Comment 14•6 years ago
|
||
[Tracking Requested - why for this release]:
This is a regression in Fennec/GV 67 after updating from 32-bit to 64-bit Fennec. An HTTPS certificate database created in 32-bit is not forward compatible with 64-bit.
| Reporter | ||
Comment 16•6 years ago
|
||
Updated•6 years ago
|
Updated•6 years ago
|
Comment 17•6 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #13)
OK, I think we've got it figured out. LMDB 0.9.x databases are not compatible between 32-bit and 64-bit architectures. It looks like in this case, the databases were created on a 64-bit machine, but then later read on a 32-bit one (and my emulator is using i686). This is problematic, because we plan to migrate to 64-bit builds for most Android users in the coming months. Myk, can we migrate to the newer LMDB to avoid this issue?
It's possible, but it's risky. I prototyped an upgrade to the unstable master branch of LMDB in this change:
https://hg.mozilla.org/try/rev/0426bfd006fbf1f60141da0ae783542602014c41
And it passes tests while performing more or less equivalently on Talos:
But there are known perf problems on Windows in the unstable branch (http://www.openldap.org/lists/openldap-bugs/201902/msg00009.html). And there's no compatibility guarantee on that branch, so we risk having to manage an even harder compatibility change in the future.
Thus I think we shouldn't upgrade to the unstable branch of LMDB to resolve this issue. We should rather drop and recreate databases when that's feasible (as it is in the case of the cert_storage database) and migrate them when it isn't (as with the Notification Store database).
| Reporter | ||
Comment 18•6 years ago
|
||
Ok, Dana would you mind taking this over? I think it'll go quicker that way.
Updated•6 years ago
|
Comment 20•6 years ago
|
||
67 is close to flipping to release. Dana, will the fix for this be ready (and safe) for a beta uplift?
Also, have we identified the fix which introduced this regression in 67?
| Assignee | ||
Comment 21•6 years ago
|
||
My understanding was we determined cert_storage is the cause of this. However, that landed in 68, so I don't see how this could affect 67.
In any case, if we can't land a fix for this before the merge, bug 1547877 will make it so we use the old implementation (that doesn't use cert_storage) on beta/release, so we shouldn't have this bug.
| Assignee | ||
Updated•6 years ago
|
Comment 22•6 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)
My understanding was we determined
cert_storageis the cause of this. However, that landed in 68, so I don't see how this could affect 67.
In that case, I will change the bug status flags to 67=unaffected.
If this bug is solely caused by 64-bit Fennec reading a 32-bit cert_storage database, note that ARM64 Fennec is enabled in 67 Beta, but it won't ride to Release until 68. x86_64 Fennec, however, will ride to Release in 67.
| Assignee | ||
Comment 23•6 years ago
|
||
It turns out that an rkv database created on a 32-bit platform cannot be used on
a 64-bit platform and vice-versa. To work around this for now, we delete and
recreate the DB backing cert_storage and set flags to let our consumers know
to re-load all known data.
Comment 24•6 years ago
|
||
Comment 25•6 years ago
|
||
| bugherder | ||
| Reporter | ||
Comment 26•6 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)
My understanding was we determined
cert_storageis the cause of this. However, that landed in 68, so I don't see how this could affect 67.In any case, if we can't land a fix for this before the merge, bug 1547877 will make it so we use the old implementation (that doesn't use
cert_storage) on beta/release, so we shouldn't have this bug.
Agreed, yeah, it should only affect 68.
Updated•6 years ago
|
Updated•6 years ago
|
Description
•