Closed Bug 1546361 Opened 6 months ago Closed 6 months ago

TLS randomly stops working in GeckoView

Categories

(Core :: Security: PSM, defect, P1)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- unaffected
firefox66 --- unaffected
firefox67 --- unaffected
firefox68 + fixed

People

(Reporter: snorp, Assigned: keeler)

References

(Depends on 1 open bug)

Details

(Keywords: regression, Whiteboard: [geckoview:fenix:m4][necko-triaged][psm-assigned])

Attachments

(3 files, 1 obsolete file)

We've had some problems recently where TLS loads return an error every time. It seems that something is going wrong in the profile, because clearing that resolves the issue.

:keeler, this seems like it could be certdb related, can you have a look at the logs?

Flags: needinfo?(dkeeler)
Attachment #9060071 - Attachment mime type: application/octet-stream → text/plain

Near as I can tell, GetFirstEVPolicy() always returns false here[0], which results in CertVerifier::VerifyCert failing with Result::ERROR_UNKNOWN_ERROR. I am not sure what about this profile is causing this.

[0] https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/security/certverifier/CertVerifier.cpp#595

Hmm, not sure what changed, but I can't seem to repro anymore with the attached profile. I thought I had it narrowed down to a stale/invalid state_security/lock.mdb, but not sure now.

This might be related to Fennec "Secure Connection Failed" bug 1541354.

OS: Unspecified → Android
Priority: -- → P1
See Also: → 1541354
Whiteboard: [geckoview:fenix:m4]

If we can't initialize cert_storage for whatever reason, this will happen. This raises some questions: why can't we initialize this component? should we fail a bit more "open" if we can't initialize it? (right now it stores some certificate revocation information, so we should try to avoid situations where we'll just skip these checks).

Flags: needinfo?(dkeeler)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #7)

If we can't initialize cert_storage for whatever reason, this will happen. This raises some questions: why can't we initialize this component? should we fail a bit more "open" if we can't initialize it? (right now it stores some certificate revocation information, so we should try to avoid situations where we'll just skip these checks).

Is this a recent-ish change? I'm having trouble reproducing the problem at the moment, but I'll see if the cert_storage stuff is the root cause when I'm able.

Flags: needinfo?(dkeeler)

It landed in bug 1429796. See also bug 1535662 and related bugs.

Flags: needinfo?(dkeeler)

Bug 1541811 ("Can't sign into Google Play account") might be related to these TLS issues. Softvision's bisection identified the following pushlog as a possible source of the regression:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=06f0c5e35c3ab64490bcdfcb25f7ac6ff531ba5e&tochange=fd67a4332060ea5389d448b3453ea23064e50b5e

See Also: → 1541811
Summary: TLs randomly stops working in GeckoView → TLS randomly stops working in GeckoView

Some progress: we appear to be failing to open the cert_storage lmdb, as it's failing with SecurityStateError { message: "lmdb error: MDB_INVALID: File is not an LMDB file" }. I don't know why this is occuring yet.

The trouble appears to begin here: https://searchfox.org/mozilla-central/rev/d143f8ce30d1bcfee7a1227c27bf876a85f8cede/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c#3712

Somehow the database appears corrupt on the device, but inspecting the files on my dev machine reveals they are not. Investigating further...

OK, I think we've got it figured out. LMDB 0.9.x databases are not compatible between 32-bit and 64-bit architectures. It looks like in this case, the databases were created on a 64-bit machine, but then later read on a 32-bit one (and my emulator is using i686). This is problematic, because we plan to migrate to 64-bit builds for most Android users in the coming months. Myk, can we migrate to the newer LMDB to avoid this issue? (Also, thanks for helping me debug this!)

Flags: needinfo?(myk)
Blocks: 1541354
See Also: 1541354

[Tracking Requested - why for this release]:

This is a regression in Fennec/GV 67 after updating from 32-bit to 64-bit Fennec. An HTTPS certificate database created in 32-bit is not forward compatible with 64-bit.

Assignee: nobody → snorp
Duplicate of this bug: 1541354

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #13)

OK, I think we've got it figured out. LMDB 0.9.x databases are not compatible between 32-bit and 64-bit architectures. It looks like in this case, the databases were created on a 64-bit machine, but then later read on a 32-bit one (and my emulator is using i686). This is problematic, because we plan to migrate to 64-bit builds for most Android users in the coming months. Myk, can we migrate to the newer LMDB to avoid this issue?

It's possible, but it's risky. I prototyped an upgrade to the unstable master branch of LMDB in this change:

https://hg.mozilla.org/try/rev/0426bfd006fbf1f60141da0ae783542602014c41

And it passes tests while performing more or less equivalently on Talos:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7335537b61e6773c8cc8e36fcbb66865ec5785ed&newProject=try&newRevision=2a407d0491b5f5987db77820b42e50f4035a4203

But there are known perf problems on Windows in the unstable branch (http://www.openldap.org/lists/openldap-bugs/201902/msg00009.html). And there's no compatibility guarantee on that branch, so we risk having to manage an even harder compatibility change in the future.

Thus I think we shouldn't upgrade to the unstable branch of LMDB to resolve this issue. We should rather drop and recreate databases when that's feasible (as it is in the case of the cert_storage database) and migrate them when it isn't (as with the Notification Store database).

Flags: needinfo?(myk)

Ok, Dana would you mind taking this over? I think it'll go quicker that way.

Flags: needinfo?(dkeeler)

Sure - I'll see what I can do.

Flags: needinfo?(dkeeler)
Whiteboard: [geckoview:fenix:m4] → [geckoview:fenix:m4][necko-triaged]

67 is close to flipping to release. Dana, will the fix for this be ready (and safe) for a beta uplift?
Also, have we identified the fix which introduced this regression in 67?

Flags: needinfo?(dkeeler)

My understanding was we determined cert_storage is the cause of this. However, that landed in 68, so I don't see how this could affect 67.

In any case, if we can't land a fix for this before the merge, bug 1547877 will make it so we use the old implementation (that doesn't use cert_storage) on beta/release, so we shouldn't have this bug.

Flags: needinfo?(dkeeler) → needinfo?(snorp)
Assignee: snorp → dkeeler
Component: Networking: HTTP → Security: PSM
Whiteboard: [geckoview:fenix:m4][necko-triaged] → [geckoview:fenix:m4][necko-triaged][psm-assigned]

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)

My understanding was we determined cert_storage is the cause of this. However, that landed in 68, so I don't see how this could affect 67.

In that case, I will change the bug status flags to 67=unaffected.

If this bug is solely caused by 64-bit Fennec reading a 32-bit cert_storage database, note that ARM64 Fennec is enabled in 67 Beta, but it won't ride to Release until 68. x86_64 Fennec, however, will ride to Release in 67.

It turns out that an rkv database created on a 32-bit platform cannot be used on
a 64-bit platform and vice-versa. To work around this for now, we delete and
recreate the DB backing cert_storage and set flags to let our consumers know
to re-load all known data.

Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a977984e7862
recreate cert_storage data as necessary r=jcj,myk
Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)

My understanding was we determined cert_storage is the cause of this. However, that landed in 68, so I don't see how this could affect 67.

In any case, if we can't land a fix for this before the merge, bug 1547877 will make it so we use the old implementation (that doesn't use cert_storage) on beta/release, so we shouldn't have this bug.

Agreed, yeah, it should only affect 68.

Flags: needinfo?(snorp)

untracking for 67 as per comment #26

Attachment #9060820 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.