1555306 - Collision on NS find_objects()/sqlite3PagerSharedLock() when fetching OSCP causes extereme variance in page load times

Andrew Creskey [:acreskey]

Reporter

Description

•

6 years ago

•

Edited

Attached image Loadtimes of a few sites -- note the long tail on washingtonpost — Details

It was observed that there is extreme variance in the page load times for https://www.washingtonpost.com

e.g. results from ~2000ms to ~5000ms when testing against a recorded http session.

Finding the root cause may help us pinpoint and reduce noise in our runtime.

Notes:
Browsertime was used against static sites recorded with WebPageReplay.
The attached graph shows the distribution of 2 sets of 25 cold loads on the 2017 reference laptop.
Both sets of results, straight_wpr_0 and straight_wpr_1 were made with identical configurations. i.e. outside of noise you would expect the same results.

Andrew Creskey [:acreskey]

Reporter

Comment 1

•

6 years ago

Attached file geckoProfiles from a browsertime session — Details

jesup:
I've attached captured profiles through browsertime.

geckoProfile-7 and geckoProfile-25 are of the more common fast load times (~2100ms)
geckoProfile-19 and geckoProfile-22 are two of the slowest loads (~5000ms)

Added Socket,JS to the thread list and also turned off responsiveness.

Andrew Creskey [:acreskey]

Reporter

Comment 2

•

6 years ago

There's a video which autoplays on this recording, by the way.

Andrew Creskey [:acreskey]

Reporter

Comment 3

•

6 years ago

I noticed that in the slow loads the network requests visible in the content process are generally "waiting for socket thread" for 4800+ms.

Andrew Creskey [:acreskey]

Reporter

Comment 4

•

6 years ago

Added 'SSL,DNS' to the thread list.

Fast load (2164ms)
http://bit.ly/30PixDL

Slow load (4781ms)
http://bit.ly/30PYDbQ

I think it's the OCSP requests in the SSL cert threads that have to wait for main thread?

Andrew Creskey [:acreskey]

Reporter

Comment 5

•

6 years ago

Some profiles from live sites to see if the OCSP delays are a biproduct of the webpagereplay recording:

Faster live site: 3357ms
http://bit.ly/2W2XcDv

Slow live site: 6281ms (why are the no samples from the parent process from the first 5seconds?)
http://bit.ly/2VZliPd

Andrew Creskey [:acreskey]

Reporter

Comment 6

•

6 years ago

Note that I did see long, eg 862ms, DoOCSPRequest calls in the SSL Cert threads on the slow live site.

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

6 years ago

Depends on: 1555801

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 7

•

6 years ago

SSL threads are colliding with each other for an sqlite DB access critical section, and with the SocketThread.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 8

•

6 years ago

Clip from #necko:
(base profile: http://bit.ly/2KbbJdI)
[15:14] jesup Socket thread is 90% in ssl3_HandleHandshakeMessage(), split evenly between CERT_NewTempCertificate(), HandshakeCallback() and tls13_HandlePostHelloHandshakeMessage() basically
...
[ lots of blocking involving sqlite3PagerSharedLock() ]
...
[15:30] jesup both the socketthread and the SSL Cert#2 thread are calling find_objects and blocking on a critical section
[15:33] mayhemer I'm honestly lost in the profile, hard to figure out where delays are in the graphs and where to actually start
[15:36] jesup Summary: the socketthread and the SSL Cert#2 thread are calling find_objects() and blocking on a critical section. This eats the first 1/3. then SSL cert #2 dispatches to mainthread (which likely hands off to Socket Thread) the ssl connection to get the OCSP, and 1/2 of that is CERT_NewTempCertificate() (and waiting on the DB), and 1/2 is tls13_HandlePostHelloHandshakeMessage() (which...
[15:36] jesup ...blocks on find_objects() again)
[15:38] jesup let me find who owns that find_objects critical section...
[15:40] jesup aha, the lock on the sqlite is within find_objects also, but apparently not in the critical section
[15:40] jesup aha, it's fighting with SSL Cert #5
[15:41] jesup Look at this profile, with find_objects highlighted: http://bit.ly/2Ke3qOA
[15:43] jesup Note that in the two sections where SocketThread is blocked on the critical section in find_objects(), SSL cert #5 is in find_objects. And SSL Cert #6 is also fighting for it

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 9

•

6 years ago

[15:46] mayhemer in other words, we are doing IO under the lock!
[15:46] mayhemer I like the busy waits sqlite does.. 2x 100ms sleep
[15:48] acreskey jesup: good find! I think longterm that sqllite DB is going away
[15:50] mayhemer I just wonder how to fix this. this seems to me like a nss bug, since it calls a sync API to read data (always a bad thing)
[15:51] mayhemer note that this may also go to HW security tokens, not just disk

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 10

•

6 years ago

Moving to Networking; this might be more an NSS issue/design issue, but it's at least partly an issue in how networking is interfacing to NSS.

Component: Performance → Networking

Summary: Extreme variance in page load times on www.washingtonpost.com site - recorded http session, reference laptop → Collision on NS find_objects()/sqlite3PagerSharedLock() when fetching OSCP causes extereme variance in page load times

Whiteboard: [qf:investigate] → [qf:p1:pageload]

Kershaw Chang [:kershaw]

Comment 11

•

6 years ago

This looks like a nss bug to me.

Component: Networking → Security: PSM

Honza Bambas (:mayhemer)

Comment 12

•

6 years ago

The socket thread is blocked by calling SSL_ForceHandshake from nsNSSSocketInfo::DriveHandshake:

RtlEnterCriticalSection
struct nssCryptokiInstanceStr * * find_objects(struct NSSTokenStr *, struct nssSessionStr *, struct CK_ATTRIBUTE *, unsigned long, unsigned int, <unnamed-tag> *)
nssToken_FindObjectsByTemplate
nssToken_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByIssuerAndSerialNumber
NSSTrustDomain_FindCertificateByEncodedCertificate
CERT_NewTempCertificate
ssl3_HandleHandshakeMessage
ssl3_HandleNonApplicationData
ssl3_HandleRecord
ssl3_GatherCompleteHandshake
SSL_ForceHandshake
nsNSSSocketInfo::DriveHandshake()
bool mozilla::net::nsHttpConnection::EnsureNPNComplete(nsresult & const, unsigned int & const)
mozilla::net::nsHttpConnection::OnSocketWritable()

Cache I/O thread is deserializing security info:

find_objects
nssToken_FindObjectsByTemplate
nssToken_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByEncodedCertificate
CERT_NewTempCertificate
CERT_DecodeCertFromPackage
nsNSSCertificate::InitFromDER(char*, int)
nsNSSCertificate::Read(nsIObjectInputStream*)
nsBinaryInputStream::ReadObject(bool, nsISupports**)
NS_ReadOptionalObject(nsIObjectInputStream*, bool, nsISupports**)
mozilla::psm::TransportSecurityInfo::Read(nsIObjectInputStream*)
nsBinaryInputStream::ReadObject(bool, nsISupports**)
NS_DeserializeObject(nsTSubstring<char> const&, nsISupports**)
mozilla::net::CacheEntry::GetSecurityInfo(nsISupports**)
mozilla::net::nsHttpChannel::OpenCacheInputStream(nsICacheEntry*, bool, bool)
mozilla::net::nsHttpChannel::OnCacheEntryCheck(nsICacheEntry*, nsIApplicationCache*, unsigned int*)
mozilla::net::CacheEntry::InvokeCallback(mozilla::net::CacheEntry::Callback&)
mozilla::net::CacheEntry::InvokeCallbacks(bool)
mozilla::net::CacheEntry::InvokeCallbacks()
mozilla::net::CacheEntry::OnFileReady(nsresult, bool)
non-virtual thunk to mozilla::net::CacheEntry::OnFileReady(nsresult, bool)
mozilla::net::CacheFile::OnMetadataRead(nsresult)
non-virtual thunk to mozilla::net::CacheFile::OnMetadataRead(nsresult)
mozilla::net::CacheFileMetadata::OnDataRead(mozilla::net::CacheFileHandle*, char*, nsresult)
mozilla::net::ReadEvent::Run()
mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)
mozilla::net::CacheIOThread::ThreadFunc()

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 13

•

6 years ago

We're seeing this sort of problem crop up regularly in test runs; here's one from foxnews:
https://perfht.ml/2X4bwjw
2s OCSP response - 1s "waiting for Socket Thread", 1s waiting for DNS. From previous investigations, we'd likely find (like above) that we have a bunch of SSL threads and the Socket thread fighting over the sqlite DB via find_names().
Also this profile https://perfht.ml/2QIC4RE (simpleperf on an geckoview_example startup) shows Cache2 IO hitting find_names() as well, for deserialization of security info stored in cache entries.

Is there any way to reduce the delay when we collide (change the 100ms backoff in sqlite perhaps (per mayhemer)), or to make these async operations in some manner? Or even serialize all of these onto a single thread (even if it blocks callers) to avoid the 100ms backoff behavior?

I hear the sqlite DB is to be replaced with an rkv store - when would this occur? Bug#? Would that impact this pattern/delay?

Flags: needinfo?(dkeeler)

Dana Keeler (she/her) [:keeler]

Comment 14

•

6 years ago

The NSS sqlite DB is not being replaced with rkv. You're probably thinking of intermediate preloading (bug 1535662), which we're still in the process of shipping (it's only enabled on Nightly right now). Intermediate preloading helps a bit with this contention by potentially avoiding a call to find_objects (bug 1557092).

I think the issue here is the transaction created by the intermediate caching code importing certificates[0] is blocking readers. Setting PRAGMA journal_mode = WAL; made a test program I wrote that pits a reader thread against a writer thread go orders of magnitude faster, but I'm not sure if we can set and ship that in NSS (for one, the sqlite documentation page says it doesn't work with DBs on networked drives - I don't know if that means "it just won't do WAL and falls back to DELETE" or if it means "it'll fail in byzantine ways if you try"). To reduce the long sleeps we could override the default sqlite busy handler, but that would still block any threads trying to read while the transaction completes. Since that involves syncing to the underlying filesystem, that'll always be slow. Another option would be to stop caching intermediates altogether. In theory once intermediate preloading ships intermediate caching is less important (it's mostly a band-aid anyway to help with misconfigured servers).

[0] See bug 1529044 for the most recent implementation, although the feature has been around for a long time. Also, any code that imports new certificates would have this problem. Luckily, it's rare that users do this.

Flags: needinfo?(dkeeler)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 15

•

6 years ago

I see this is currently done off an Idle runnable... is that causing this collision?

I do see all the SSL threads and the SocketThread calling VerifyCert() https://perfht.ml/2RbtLhH and this profile shows SSL Thread 4 (https://perfht.ml/2R9WjYM) running PK11_ImportCerts() throughout this time period.

My concern with using 'idle' priority here is that it doesn't mean we're really idle... this is off-main-thread, so unless there's other SSL runnables queued to this SSL thread, those are empty, and the only thing delaying us running the idle runnable is GetIdlePeriodHint(), which looks at upcoming RefreshDriver and timer executions. The net result is that we'll be pretty likely to run it even if we're super-busy on MainThread.

Idle as currently designed was intended for MainThread usage, unfortunately. You could dispatch an Idle runnable to mainThread that dispatched a runnable to an SSL thread; that would ensure we run it in a real MainThread idle period (or very close to). This wouldn't stop all collisions, but I suspect would make them much less likely.

Perhaps even better would be running this off some "deep idle" callback, with a (soft) time limit to avoid failing to import forever, but waiting for a longer idle period.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 16

•

6 years ago

Attached patch Non-compiling patch — Details — Splinter Review

So unique_ptr<> and lambdas hate each other, and this doesn't compile (nor any mutation of it I've tried); one could stop trying to abuse lambdas and just create a runnable class that does "if (NS_IsMainThread()) { thread->Dispatch(this, NS_DISPATCH_NORMAL); return; }", and send that runnable to the MainThread idle queue to start. I'd advise that instead of twisting the code in a knot to make the 3-level-lambda thing work.

I was trying this to a) ensure it works, and b) see if the collisions (mostly) disappear

f?'s to put it on both of your radars. Andrew, if we can get a working patch we can see if this makes the noise go away or lessen.

You could "best case" test it by turning off intermediate cert import like you tried before

Assignee: nobody → rjesup

Attachment #9071884 - Flags: feedback?(dkeeler)

Attachment #9071884 - Flags: feedback?(acreskey)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 17

•

6 years ago

So, trying to run this on "Idle" on an SSL thread is exactly the same as just Dispatching it; non-MainThreads don't implement PrioritizedEventQueue

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 18

•

6 years ago

When an intermediate cert save is needed, we could Dispatch to Socket Thread to start (or reset if running) a Timer. When the timer fires, either Dispatch to SSL Cert threadpool to save, or Dispatch to MainThread Idle queue, and when that runs that Dispatch to SSL Cert to save. Perhaps use a 10s or 60s timer, with perhaps a deadline that we won't reset the timer to more than N seconds from the oldest dirty data waiting (example: 30s timer, with a max 2 min dirty time). It depends on how much we care about possible not writing it out. (We could also force a write as part of shutdown if dirty, perhaps)

Dana Keeler (she/her) [:keeler]

Comment 19

•

6 years ago

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #15)

I see this is currently done off an Idle runnable... is that causing this collision?

I don't think that's the issue - it's that we have many threads wanting to read the sqldb and some threads wanting to occasionally write to the sqldb (the fact that NSS decoding a certificate results in reading the sqldb is an unfortunate consequence of design decisions made in NSS).

Perhaps even better would be running this off some "deep idle" callback, with a (soft) time limit to avoid failing to import forever, but waiting for a longer idle period.

Right - we basically want to do this when the network is idle. Maybe something like that already exists?
Also, it's not really a problem if importing gets delayed indefinitely (until shutdown, basically) because as long as we hold those certificates alive in memory, we'll be able to find them during verification, which is why we're saving them in the first place. Of course, saving a bunch of intermediates at shutdown might have performance characteristics we want to avoid.

One thing to keep in mind is that when we ship intermediate preloading, we have the option of saving certificates in cert_storage (rkv), which won't have this problem. (The issue is there's some crash bugs and performance issues we have to solve before we can ship that.)

Dana Keeler (she/her) [:keeler]

Comment 20

•

6 years ago

Comment on attachment 9071884 [details] [diff] [review] Non-compiling patch Review of attachment 9071884 [details] [diff] [review]: ----------------------------------------------------------------- Like I said, we want to do this when the network is idle, if we can. I think this is the right sort of thing, though.

Attachment #9071884 - Flags: feedback?(dkeeler) → feedback+

Honza Bambas (:mayhemer)

Comment 21

•

6 years ago

Necko currently can't provide any net-idle notification (like that there are no requests happening for a while) but we have a mechanism that we could relatively easily base such a notification on.

Andrew Creskey [:acreskey]

Reporter

Comment 22

•

6 years ago

I can try out some of these variations in the next bit.
I'd like to find a scenario in CI where I can see the impact of this -- the tp6 mobile cold load tests should do it.
But definitely I will need to do some local performance measurement for sites like foxnews.com.

Andrew Creskey [:acreskey]

Reporter

Comment 24

•

6 years ago

Comment on attachment 9071884 [details] [diff] [review] Non-compiling patch Review of attachment 9071884 [details] [diff] [review]: ----------------------------------------------------------------- At one point I had a tried out dispatching the import to the main thread (where it was enventually done on idle). For a new starting point, I made a comparison of our current import cert behaviour against "don't import at all": https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8f91877992a12b1dcd514bf80c11d1a6638446b9&newProject=try&newRevision=133b78c7f0d1eab73b209f3421f5626e2121dae4&framework=10 What I find interesting is that, at least on Moto G5 android, I'm not seeing a performance improvement from skipping the import. (This does line up with measurements I had taken a few months back). We now have desktop cold loads, so I'm triggering more jobs there. we have a limited selection of sites.

Attachment #9071884 - Flags: feedback?(acreskey) → feedback+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 25

•

6 years ago

You may need to do the test in browsertime where it was first seen, not in raptor, especially if there are network effects here (like really fast OCSP responses, or something)

Dana Keeler (she/her) [:keeler]

Updated

•

6 years ago

Priority: -- → P1

Whiteboard: [qf:p1:pageload] → [psm-assigned][qf:p1:pageload]

Andrew Creskey [:acreskey]

Reporter

Comment 26

•

6 years ago

Ah, from a slack conversation (thank you Dana) I realized why I'm not see intermediate import in our raptor tests.
We import the mitmproxy CA cert into firefox so we can connect directly to mitmproxy to return the archived http responses.
And so no intermediates are encountered during testing. (Same thing when using WebPageReplay)

Andrew Creskey [:acreskey]

Reporter

Comment 27

•

6 years ago

FYI, I just found out that we don't do many OCSP request on android because by pref we only make them for EV certificates:
https://searchfox.org/mozilla-central/rev/5e660d3dfcba897c8501e3fda1d415565a096e7e/mobile/android/app/mobile.js#421

Steven Singer

Comment 28

•

6 years ago

A lock convoy through CERT_NewTempCertificate() and find_objects() seem to be implicated in bug 1577944 (see comment 2) via a new (non OCSP) path to CERT_NewTempCertificate() that was added in bug 1551177.

Dana Keeler (she/her) [:keeler]

Comment 29

•

4 years ago

I believe bug 1694542 addressed this - Randell, can you confirm?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 30

•

4 years ago

forwarding to andrew

Flags: needinfo?(rjesup) → needinfo?(acreskey)

Andrew Creskey [:acreskey]

Reporter

Comment 31

•

4 years ago

Let me just double check this on the reference laptop. I'll need a day for it to stabilize from OS updates.

Flags: needinfo?(acreskey)

Andrew Creskey [:acreskey]

Reporter

Comment 32

•

4 years ago

I haven't been able to find any cases where the OCSP requests are blocked.

Trying the usual sites and I can see the requests but they are completing quite quickly.

Dana Keeler (she/her) [:keeler]

Comment 33

•

4 years ago

Great - thanks!

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → DUPLICATE

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P1

Keywords: perf:pageload

Whiteboard: [psm-assigned][qf:p1:pageload] → [psm-assigned]

Loadtimes of a few sites -- note the long tail on washingtonpost 6 years ago Andrew Creskey [:acreskey] 218.51 KB, image/png		Details
geckoProfiles from a browsertime session 6 years ago Andrew Creskey [:acreskey] 7.02 MB, application/x-zip-compressed		Details
Non-compiling patch 6 years ago Randell Jesup [:jesup] (needinfo me) 2.85 KB, patch	keeler : feedback+ acreskey : feedback+	Details \| Diff \| Splinter Review