Closed Bug 1158855 Opened 9 years ago Closed 9 years ago

Pinned Gmail/Twitter tabs sometimes do not restore: sec_error_ocsp_unknown_cert, sec_error_unknown_issuer

Categories

(Firefox :: Session Restore, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
Firefox 41
Tracking Status
firefox38 --- wontfix
firefox38.0.5 --- wontfix
firefox39 --- fixed
firefox40 --- fixed
firefox41 --- fixed
firefox-esr31 --- wontfix
firefox-esr38 --- wontfix

People

(Reporter: blassey, Assigned: ttaubert)

References

Details

(Keywords: reproducible)

Attachments

(3 files, 1 obsolete file)

I keep Twitter open in an app tab and restart my Nightly daily to update. For the last 1-2 weeks my Twitter app tab hasn't restored. It just comes up blank after restart. If I hit the refresh button it reloads and renders.
Summary: twitter app tab never restored → Pinned Gmail/Twitter tabs sometimes do not restore
Does this happen without any add-ons in your profile? Does this only happen with e10s? Do you have a large number of tabs/windows? Can you maybe create a copy of your profile and try to close a few tabs and remove a few add-ons to narrow it down? I can't reproduce it here.
Flags: needinfo?(cpeterson)
Flags: needinfo?(blassey.bugs)
I have disabled all of my addons and closed all but my pinned tabs and it still reproduces. Anything else I can do to help diagnose?

Here's what's in my browser console (except for some debug spew that I've got enabled for Loop): 

TypeError: aWindow.QueryInterface is not a function PrivateBrowsingUtils.jsm:49:11
Sending message that cannot be cloned. Are you trying to send an XPCOM object? SiteSpecificUserAgent.js:50:0
The character encoding of a framed document was not declared. The document may appear different if viewed without the document framing it. xti.php
Using //@ to indicate sourceURL pragmas is deprecated. Use //# instead eed3767b0b98f24f24fd45c47377d56dcalendarjs_doozercompiled__en.js:232:0
Sending message that cannot be cloned. Are you trying to send an XPCOM object? SiteSpecificUserAgent.js:50:0
OpenGL compositor Initialized Succesfully.
Version: 2.1 NVIDIA-8.26.29 310.40.55f01
Vendor: NVIDIA Corporation
Renderer: NVIDIA GeForce GT 750M OpenGL Engine
FBO Texture Target: TEXTURE_2D
Flags: needinfo?(blassey.bugs)
I have the same problem with Gmail in Firefox 37.0.2 stable (on Windows 8.1 x64).
To reproduce:
in a clean FF profile,
- open mail.google.com and login. Pin this tab.
- enable "show my windows and tabs from last time" 
- restart Firefox a few times.


The pinned Gmail tab will either load OK,
or [case #1] it will show the error "Secure Connection Failed" (see screenshot 1).
If you then press "Try again", Gmail loads OK.

If you then close and restart Firefox a few more times,
the "Secure Connection Failed" error will occur most of the time,
or [case #2] loading will hang with an empty and idle progress bar (see screenshot 2).

Once this happens, pressing "reload current page" (or F5) doesn't fix it:
the tab title spinner rotates for a while, but the progress bar remains empty and idle.
And if you close Firefox and restart it,
Gmail loading will hang every single time.
The only way to fix it is to force reload (Ctrl+F5).
Attached image Gmail 1 - error.png
Attached image Gmail 2 - hang.png
Thanks, Kostas! Using the STR from comment 4, I was able to reproduce the stuck loading screen (~20% of the time using Nightly 40) in a relatively-clean profile with no add-ons enabled. Kostas is using Firefox 37.0.2 so this is not an e10s problem.

I was not able to reproduce the stuck loading screen with Firefox 31 (ESR), but after a couple browser restarts, I hit the "Secure Connection Failed (Error code: sec_error_ocsp_unknown_cert)" error.

With Firefox 37.0.2, after a few rapid restarts, I hit an HTTPS error: "mail.google.com uses an invalid security certificate. The certificate is not trusted because the issuer certificate is unknown. (Error code: sec_error_unknown_issuer)".
Flags: needinfo?(cpeterson)
Summary: Pinned Gmail/Twitter tabs sometimes do not restore → Pinned Gmail/Twitter tabs sometimes do not restore: sec_error_ocsp_unknown_cert, sec_error_unknown_issuer
Tim, do you need any more information about this bug? I can reproduce this problem pretty easily with Kostas's STR in comment 4. This is a very frustrating bug because it is not fixed by reloading or force-reloading the pinned tab on my machine.
Flags: needinfo?(ttaubert)
Keywords: reproducible
I can't reproduce this here...

The odd thing is that Kostas is hitting an OCSP error (thought they'd staple?) just like you in comment #7. But then there's also a problem with the root? Not sure what's happening.
Flags: needinfo?(ttaubert)
Aha, just saw ocsp_unknown_cert...
Looks like there are two failing OCSP requests. Sometimes one fails and then a subsequent one succeeds and overrides the cached response. Sometimes both of them fail and we see a warning.
In other runs it's the exact same request that fails and then succeeds a few milliseconds later.
Brian, David, do you maybe have an idea about how to debug this further? Is one of Google's OCSP responders faulty? Should we handle errors here differently?
More complete, here are two request/response pairs for the same cert:

# Failing
request: http://www.lapo.it/asn1js/#30493047304530433041300906052B0E03021A05000414F2E06AF9858A1D8D709B4919237AA9B51A287E6404144ADD06161BBCF668B576F581B6BB621ABA5A812F02086B8FE9B75EED2966
response: http://www.lapo.it/asn1js/#308201CB0A0100A08201C4308201C006092B0601050507300101048201B1308201AD308196A21604144ADD06161BBCF668B576F581B6BB621ABA5A812F180F32303135303532313037313930305A306B30693041300906052B0E03021A05000414F2E06AF9858A1D8D709B4919237AA9B51A287E6404144ADD06161BBCF668B576F581B6BB621ABA5A812F020822CACBB89757909F8000180F32303135303532313037313930305AA011180F32303135303532383037313930305A300D06092A864886F70D01010505000382010100705F057A675469F7A9DC9DBD4D4D368F9A5C3CF955E71C2510ACC61A819FEEEAB42AE9BED7FAB8FADC62113FE1EDED7EB6A0CC5D845DFDC86CCC1CC45C6BB0FF55ABE77D2B77B64BA5FCDF1D2427C275E15ACF78A71285AB90D3DF3F48017E3ED06879A15B6DD1C5C2EF4B9E8F168AB514084829282A43679C39591357C0CD341B42D22F0BBB6B27F14D8B4800B62AE4355B952854A0FE61084092E2A6879CF943C8A86F86FDECDF829D612F9602D613B8A8CDA61939FD483941A1D1BB10AE8A22A162D2EF5178B1DF294AF5D51AE947BE58C4E98617AC570D809E455AC801E86DFD3551143E747CB347FC20C7A2BD691016B8693207948D0F61E95B717CD281

# Succeeding
request: http://www.lapo.it/asn1js/#30493047304530433041300906052B0E03021A05000414F2E06AF9858A1D8D709B4919237AA9B51A287E6404144ADD06161BBCF668B576F581B6BB621ABA5A812F02086B8FE9B75EED2966
response: http://www.lapo.it/asn1js/#308201CB0A0100A08201C4308201C006092B0601050507300101048201B1308201AD308196A21604144ADD06161BBCF668B576F581B6BB621ABA5A812F180F32303135303532313037303133345A306B30693041300906052B0E03021A05000414F2E06AF9858A1D8D709B4919237AA9B51A287E6404144ADD06161BBCF668B576F581B6BB621ABA5A812F02086B8FE9B75EED29668000180F32303135303532313037303133345AA011180F32303135303532383037303133345A300D06092A864886F70D0101050500038201010044A9B181EE2C314F86FF2FA39A92471C645E7C17949FB1105BFF1E141E066583D7BD9C8447B2701A33D5FAC523E6E4B2B5DB8EDD3F4167B0F084F02506C1C1522FBE6C6DEFB504C1F152BFF099C8F38007A0CA28907758F14AE06FB7670E7542E0042AA67241686E9DB630C3C4B50A84EE19AF77E14221B9A264429752F4BBDC0BF3A44579486886FB652F8F3914AF6712E79C5D0F8C4EF2460C8A9C9FFC851D194CF11FF93E5D11864F6BE70F7A11C7FD184BD56FD784EFE5CD1ACBDAB0CBC7EC32C503D59C6494B141A730B461F75ABBE0AB6BF4F0A984EED28E2A986CC8C14CF2F90682A6D88C993EC4C88FA110628F46B453928F3819662CAC4B496F06D3

In the failing request/response pair the serial numbers don't match up.
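To make the mismatch easy to check, here is a minimal Python sketch that pulls the serialNumber out of the CertID in the request and in each response. It assumes the CertID bytes copied from the hex dumps above (the shared prefix and the three serials are taken verbatim from them), only handles short-form DER lengths, and uses helper names (`read_tlv`, `certid_serial`, `serials_match`) that are made up for this illustration. It is not how Firefox parses OCSP responses.

```python
# CertID bytes shared by the request and both responses quoted above:
# SEQUENCE header, SHA-1 AlgorithmIdentifier, issuerNameHash, issuerKeyHash.
CERTID_PREFIX = (
    "3041"
    "300906052B0E03021A0500"
    "0414F2E06AF9858A1D8D709B4919237AA9B51A287E64"
    "04144ADD06161BBCF668B576F581B6BB621ABA5A812F"
)
REQUESTED = CERTID_PREFIX + "02086B8FE9B75EED2966"   # CertID in the request
FAILING = CERTID_PREFIX + "020822CACBB89757909F"     # CertID in the failing response
SUCCEEDING = CERTID_PREFIX + "02086B8FE9B75EED2966"  # CertID in the succeeding response


def read_tlv(buf: bytes, pos: int):
    """Read one DER tag/length/value at pos (short-form lengths only)."""
    tag, length = buf[pos], buf[pos + 1]
    if length & 0x80:
        raise ValueError("long-form DER lengths are not handled in this sketch")
    start = pos + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated DER value")
    return tag, buf[start:end], end


def certid_serial(certid_hex: str) -> str:
    """Return the serialNumber of a CertID as upper-case hex."""
    tag, body, _ = read_tlv(bytes.fromhex(certid_hex), 0)
    if tag != 0x30:
        raise ValueError("CertID must be a SEQUENCE")
    fields = []
    pos = 0
    while pos < len(body):
        field_tag, value, pos = read_tlv(body, pos)
        fields.append((field_tag, value))
    # Expected order: hashAlgorithm, issuerNameHash, issuerKeyHash, serialNumber.
    if len(fields) != 4 or fields[3][0] != 0x02:
        raise ValueError("unexpected CertID layout")
    return fields[3][1].hex().upper()


def serials_match(request_hex: str, response_hex: str) -> bool:
    """True if the response entry is for the certificate that was asked about."""
    return certid_serial(request_hex) == certid_serial(response_hex)


if __name__ == "__main__":
    print("requested :", certid_serial(REQUESTED))
    print("failing   :", certid_serial(FAILING), serials_match(REQUESTED, FAILING))
    print("succeeding:", certid_serial(SUCCEEDING), serials_match(REQUESTED, SUCCEEDING))
```

The issuer hashes are identical in all three, so only the serial distinguishes the certificate the responder answered for.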
We currently seem to handle a missing response for a cert the same as a response=unknown. Shouldn't we rather handle this as a server failure in that it ignored one of our requests?
Looks like bug 1060112.
Depends on: 1060112
(In reply to Tim Taubert [:ttaubert] from comment #12)
> We're sending this OCSP request:
> 
> http://www.lapo.it/asn1js/
> #30493047304530433041300906052B0E03021A05000414F2E06AF9858A1D8D709B4919237AA9
> B51A287E6404144ADD06161BBCF668B576F581B6BB621ABA5A812F020879652CE5F5CA6419
> 
> to http://clients1.google.com/ocsp and it's responding with "unknown cert"
> twice.

Just to be clear, your change in bug 1060112 will change the error reported sometimes, but it won't actually fix this problem, right?

I'm guessing that this is what is happening: We have a cached HTTP response. Between the time we wrote the response to the cache and the time we retrieved the response from the cache, the certificate expired. The OCSP responder then responds "don't know about that cert," which is reasonable. Then Firefox gives up. Instead, Firefox should either throw away the cached HTTP response, or it should update the certificate in the response with a new certificate that was obtained from a new network request.

This theory could very well be wrong. IIRC Firefox doesn't validate certificates when reading responses from the cache, so I would be surprised if it is correct. But, it fits the observed behavior.
(In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment #20)
> Just to be clear, your change in bug 1060112 will change the error reported
> sometimes, but it won't actually fix this problem, right?

With the patch I can't reproduce it anymore. Current Firefox queries the OCSP responder and sometimes receives the erroneous response containing no entry for the requested cert. Verification will fail with SEC_ERROR_OCSP_UNKNOWN_CERT and we put that in the cache. If subsequent requests fail too we show an SSL error page, which happens only intermittently probably due to Google's load balancing.

With the new error code we wouldn't cache the OCSP response; we only do that for "Success", "ERROR_REVOKED_CERTIFICATE", and "ERROR_OCSP_UNKNOWN_CERT".
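As a rough illustration of that caching rule, here is a small Python sketch. It assumes a response can be reduced to a mapping from serial number to status, and the name `RESPONSE_FOR_CERT_MISSING` is only a placeholder for the new error code; the actual name and the real cache live in the patch on bug 1060112, which this does not reproduce.

```python
from enum import Enum


class OcspResult(Enum):
    SUCCESS = "Success"
    REVOKED = "ERROR_REVOKED_CERTIFICATE"
    UNKNOWN_CERT = "ERROR_OCSP_UNKNOWN_CERT"
    # Placeholder name for the new error code (an assumption, not the real one).
    RESPONSE_FOR_CERT_MISSING = "response has no entry for the requested cert"


# Only these three results are written to the cache, per the comment above.
CACHEABLE = frozenset(
    {OcspResult.SUCCESS, OcspResult.REVOKED, OcspResult.UNKNOWN_CERT}
)

STATUS_TO_RESULT = {
    "good": OcspResult.SUCCESS,
    "revoked": OcspResult.REVOKED,
    "unknown": OcspResult.UNKNOWN_CERT,
}


def classify(requested_serial: str, entries: dict) -> OcspResult:
    """Map a response (serial -> status) to a result for the requested cert."""
    status = entries.get(requested_serial)
    if status is None:
        # The responder answered for some other cert: treat it as a server
        # failure rather than as an explicit "unknown" status.
        return OcspResult.RESPONSE_FOR_CERT_MISSING
    return STATUS_TO_RESULT[status]


def maybe_cache(cache: dict, serial: str, result: OcspResult) -> bool:
    """Store the result only if it is one of the cacheable kinds."""
    if result in CACHEABLE:
        cache[serial] = result
        return True
    return False


if __name__ == "__main__":
    cache = {}
    bad = classify("6B8FE9B75EED2966", {"22CACBB89757909F": "good"})
    print(bad, maybe_cache(cache, "6B8FE9B75EED2966", bad))
    good = classify("6B8FE9B75EED2966", {"6B8FE9B75EED2966": "good"})
    print(good, maybe_cache(cache, "6B8FE9B75EED2966", good))
```

With this split, an erroneous response never poisons the cache, so the next request gets a fresh chance at a correct answer.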
Are any of you three brave enough to give the patch in bug 1060112 a try? It would be great to know if that stops it from reproducing for you as well. You might have to clobber for that; NSS is sometimes weird when building.
Flags: needinfo?(rick3162)
Flags: needinfo?(cpeterson)
Flags: needinfo?(blassey.bugs)
Assignee: nobody → ttaubert
Status: NEW → ASSIGNED
(In reply to Tim Taubert [:ttaubert] from comment #21)
> (In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment
> #20)
> > Just to be clear, your change in bug 1060112 will change the error reported
> > sometimes, but it won't actually fix this problem, right?
> 
> With the patch I can't reproduce it anymore. Current Firefox queries the
> OCSP responder and sometimes receives the erroneous response containing no
> entry for the requested cert. Verification will fail with
> SEC_ERROR_OCSP_UNKNOWN_CERT and we put that in the cache. If subsequent
> requests fail too we show an SSL error page, which happens only
> intermittently probably due to Google's load balancing.

Yes, but it seems like that patch only fixes the special case of where the OCSP responder does the wrong thing: sending a "Good" response for a different certificate instead of sending an "Unknown" response. Won't things still be broken for the case where the OCSP responder does the right thing?

> With the new error code we wouldn't cache the OCSP response, we only do that
> for "Success", "ERROR_REVOKED_CERTIFICATE", and "ERROR_OCSP_UNKNOWN_CERT".

I'm guessing that the real problem is that we're asking for an OCSP response for a certificate that is expired. Why?

Remember that recently the OCSP request code was changed to support short-lived certificates and maybe for other reasons. I suspect a regression in one of those changes.
(In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment #24)
> Yes, but it seems like that patch only fixes the special case of where the
> OCSP responder does the wrong thing: sending a "Good" response for a
> different certificate instead of sending an "Unknown" response. Won't things
> still be broken for the case where the OCSP responder does the right thing?

When the responder responds with "cert unknown" we will block and I thought that was a deliberate decision? If it's not we should probably address that in another bug? It feels like you're right and we should probably treat it like an aborted connection and block only when the responder tells us the cert was revoked.

> I'm guessing that the real problem is that we're asking for an OCSP response
> for a certificate that is expired. Why?

It might be an expired certificate used somewhere in Gmail but I didn't check that. Maybe the OCSP responders aren't properly synced and one of them is missing some information? Not sure what's more likely.

> Remember that recently the OCSP request code was changed to support
> short-lived certificates and maybe for other reasons. I suspect a regression
> in one of those changes.

I didn't know, that's certainly possible.
(In reply to Tim Taubert [:ttaubert] from comment #23)
> Is anyone of you three brave enough to give the patch in bug 1060112 a try?
> Would be great to know if that stops it from reproducing for you as well.
> You might have to clobber for that, NSS is sometimes weird when building.

I can give it a try
Flags: needinfo?(blassey.bugs)
The patch on bug 1060112 didn't fix this for me.
Flags: needinfo?(rick3162)
Flags: needinfo?(cpeterson)
The OCSP warning was definitely one problem that is fixed now and I don't see that anymore. I can still reproduce the frozen loading bar and I figured out what it is:

This seems to be a variation (or regression?) of bug 450595, probably caused by bug 936271. Gmail uses an <iframe> to load JavaScript and turns that into a wyciwyg:// page by calling document.write() some time after the load event. After restarting the browser we restore the history for all those frames and somehow now the <iframe> isn't static anymore but session history reports it as dynamic. I couldn't find out how they do that, would otherwise have written a test for it.

Dynamic frames are very error-prone and we shouldn't try to collect/restore history entries for them. This would be in line with the docShell's behavior [1] too. I suggest we don't collect children if one or more of those were dynamically added and simply load the top entry when restoring a page.

[1] https://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#1459
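
The suggested collection rule, sketched in Python to show the idea. The `Entry` class, its `dynamic` flag, and `collect` are hypothetical stand-ins for session history entries and the session restore collection code; the URLs are invented examples. This is not the patch itself.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical stand-in for a session history entry."""
    url: str
    dynamic: bool = False  # True if the frame was dynamically added
    children: list = field(default_factory=list)


def collect(entry: Entry) -> dict:
    """Serialize an entry; drop all children if any child is dynamic."""
    data = {"url": entry.url}
    if entry.children and not any(child.dynamic for child in entry.children):
        data["children"] = [collect(child) for child in entry.children]
    # With no "children" key, restoring just loads the top entry and lets
    # the page recreate its subframes.
    return data


if __name__ == "__main__":
    static_page = Entry("https://example.org/", children=[
        Entry("https://example.org/frame.html"),
    ])
    dynamic_page = Entry("https://mail.example.org/", children=[
        Entry("https://mail.example.org/static.html"),
        Entry("wyciwyg://0/https://mail.example.org/js", dynamic=True),
    ])
    print(collect(static_page))
    print(collect(dynamic_page))
```

Note that a single dynamic child is enough to skip all siblings, which matches "if one or more of those were dynamically added".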
Attachment #8610173 - Flags: review?(dteller)
Blocks: 450595, 936271
(In reply to Tim Taubert [:ttaubert] from comment #28)
> This seems to be a variation (or regression?) of bug 450595, probably caused
> by bug 936271. Gmail uses an <iframe> to load JavaScript and turns that into
> a wyciwyg:// page by calling document.write() some time after the load
> event. After restarting the browser we restore the history for all those
> frames and somehow now the <iframe> isn't static anymore but session history
> reports it as dynamic.

I should've added that on the third load we will then override <iframe id="js_frame"> with something else, I think a mostly blank HTML page. If I manually kick off loading the correct URL on the stuck loading screen the app starts loading and works fine.
Forgot to remove some code we don't need anymore.
Attachment #8610173 - Attachment is obsolete: true
Attachment #8610173 - Flags: review?(dteller)
Attachment #8610208 - Flags: review?(dteller)
Comment on attachment 8610208 [details] [diff] [review]
0001-Bug-1158855-Don-t-collect-children-of-SHEntries-if-o.patch, v2

Review of attachment 8610208 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds sane.
Attachment #8610208 - Flags: review?(dteller) → review+
https://hg.mozilla.org/mozilla-central/rev/1e8841b048f4
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 41
Comment on attachment 8610208 [details] [diff] [review]
0001-Bug-1158855-Don-t-collect-children-of-SHEntries-if-o.patch, v2

Approval Request Comment
[Feature/regressing bug #]: bug 936271
[User impact if declined]: Gmail and Twitter tabs might seem stuck after restoring a session; this can be fixed by force-reloading the tab.
[Describe test coverage new/current, TreeHerder]: No test coverage for the bug itself, but for the changes introduced.
[Risks and why]: Risk is rather low, we simply stopped collecting child frames for dynamic sites and will simply reload subframe URLs from cache or network.
[String/UUID change made/needed]: None.
Attachment #8610208 - Flags: approval-mozilla-beta?
Attachment #8610208 - Flags: approval-mozilla-aurora?
Comment on attachment 8610208 [details] [diff] [review]
0001-Bug-1158855-Don-t-collect-children-of-SHEntries-if-o.patch, v2

User facing issue. Taking the fix as it is early in the beta cycle.
Attachment #8610208 - Flags: approval-mozilla-beta?
Attachment #8610208 - Flags: approval-mozilla-beta+
Attachment #8610208 - Flags: approval-mozilla-aurora?
Attachment #8610208 - Flags: approval-mozilla-aurora+