Closed Bug 1686930 Opened 4 years ago Closed 1 year ago

Facebook spends most of its time under `get PerformanceResourceTiming` functions.

Categories

(Core :: Security, defect, P2)


Tracking


RESOLVED INACTIVE

People

(Reporter: nbp, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(3 obsolete files)

While investigating Bug 1674295, which can be reproduced by:

  • Create an Ubuntu Server 20.10 VM
  • Install lightdm, i3, and Firefox Nightly from the Mozilla website.
  • Start Firefox.
  • Log in to Facebook.
  • Scroll far down.
  • Check about:performance to see a high CPU impact, with a slow memory increase.

I was able to collect the following profile: https://share.firefox.dev/2MZ85qQ

This profile mostly reports what seems to be an infinite loop calling get PerformanceRessourceTiming.responseEnd and get PerformanceRessourceTiming.requestStart, each of which seems to take a few milliseconds (likely emphasized by some Gecko Profiler instrumentation).

This investigation is in the context of a slow and steady memory leak (Bug 1674295).

Component: DOM: Core & HTML → Performance
Component: Performance → Security
Whiteboard: [qf]

(In reply to Nicolas B. Pierron [:nbp] from comment #0)

I was able to collect the following profile: https://share.firefox.dev/2MZ85qQ

This profile mostly reports what seems to be an infinite loop calling get PerformanceRessourceTiming.responseEnd and get PerformanceRessourceTiming.requestStart, each of which seems to take a few milliseconds (likely emphasized by some Gecko Profiler instrumentation).

Most of the time there is really just overhead from the profiler's native allocation recording. https://share.firefox.dev/35Hzjc2 shows what remains (705 samples out of 3355 samples that call PerformanceRessourceTiming.*) when excluding the profiler functions related to allocation tracking.

The same happens when I disable the native allocation reporting in the profiler; the functions still take 1 ms or more each:
https://share.firefox.dev/3nOecv1

I've seen a similar report internally, triggered by the same code. We've been experimenting with reducing the cap on performance entries, which should help. How many performance entries do you have when this problem triggers? performance.getEntries().length

The code in question does performance.getEntriesByName(url, 'resource').reduce(...) and it looks for the max entry by responseEnd that matches a url. getEntriesByName isn't spec'ed to be sorted by time, so an exhaustive linear search is correct. It looks like it's used by the video and/or live streaming code.

It looks like the work is done lazily on every access of these fields. Is that an appropriate implementation?

Summary: Facebook spends most of its time under `get PerformanceRessourceTiming` functions. → Facebook spends most of its time under `get PerformanceResourceTiming` functions.

The implementation isn't that unreasonable when the default buffer size is used.

One could also keep separate lists of entries in a hashtable keyed on the name, so the filtering would happen on type only, or vice versa. But there would still be O(n) behavior.

Caching the most recent result, and possibly updating it explicitly when new entries match the filters it used, might work in some cases, but not in general.

The entries should definitely use atoms for name and entryType. That alone would already speed things up quite a bit.
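
For illustration, here is a rough JavaScript sketch of the name-keyed hashtable variant described above. This is not Gecko's code (the real lookup would live in the C++ Performance implementation); the class and field names are hypothetical.

// Illustrative only: index entries by name so that getEntriesByName(name, type)
// scans just the entries sharing that name instead of the whole buffer.
class EntryIndex {
  constructor() {
    this.byName = new Map(); // name -> array of entries, in insertion order
  }

  add(entry) {
    let list = this.byName.get(entry.name);
    if (!list) {
      list = [];
      this.byName.set(entry.name, list);
    }
    list.push(entry);
  }

  // The name filter is folded into the hashtable key; only the type filter
  // remains, and each lookup is still linear in the number of matching entries.
  getEntriesByName(name, type) {
    const list = this.byName.get(name) || [];
    return type ? list.filter(e => e.entryType === type) : list.slice();
  }
}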

But somehow the profile in comment 3 enters the very slowest paths of bug 1655683.

(In reply to Benoit Girard (:BenWa) from comment #4)

I've seen a similar report internally, triggered by the same code. We've been experimenting with reducing the cap on performance entries, which should help. How many performance entries do you have when this problem triggers? performance.getEntries().length

Looking at the profiles, searching for getEntries reports all the getEntriesByName calls, and the stack chart view gives us some temporal hint of the number of calls:

I can't reproduce the responseEnd loop locally, including scrolling past/stopping on videos. It sounds like, if they are looking at a bunch of timestamps from different times in sequence, we will be thrashing the LRU cache of jitter seeds, which are SHA-256 hashes. That will cause us to do a bunch of SHA-256 hashes (and memory allocations).

The comment here is the same set of steps I'd recommend to pinpoint this if one can reproduce it.

To resolve it, depending on the cause, we could remove the lock, replace SHA-256 with something faster but less secure, or look at disabling jitter entirely.
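
For context, here is a minimal conceptual sketch of timer-precision reduction with per-slice jitter. This is not Gecko's actual implementation; the precision value, seed handling, and cache policy are simplified assumptions. It just shows why code that reads many distinct timestamps in a row can pay for a hash (and allocations) on each access.

// Conceptual sketch only (Node.js used for the SHA-256 call); not Gecko code.
const crypto = require("crypto");

const PRECISION_MS = 1;      // illustrative clamp bucket size
const seedCache = new Map(); // stand-in for the LRU cache of jitter values

function jitteredTimestamp(rawMs, secretSeed) {
  const slice = Math.floor(rawMs / PRECISION_MS);
  let jitter = seedCache.get(slice);
  if (jitter === undefined) {
    // Cache miss: derive this time slice's jitter from a secret seed.
    // Walking responseEnd/requestStart across many different entries touches
    // many different slices, so it can miss (and hash) over and over.
    const digest = crypto.createHash("sha256")
      .update(secretSeed)
      .update(String(slice))
      .digest();
    jitter = digest.readUInt32LE(0) / 0x100000000; // fraction of one bucket
    seedCache.set(slice, jitter);
  }
  return (slice + jitter) * PRECISION_MS;
}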

(In reply to Tom Ritter [:tjr] (ni? for response to sec-[advisories/bounties/ratings/cves]) from comment #10)

The comment here is the same set of steps I'd recommend to pinpoint this if one can reproduce it.

I will test in an hour.

Flags: needinfo?(nicolas.b.pierron)

(In reply to Tom Ritter [:tjr] (ni? for response to sec-[advisories/bounties/ratings/cves]) from comment #10)

I can't reproduce the responseEnd loop locally, including scrolling past/stopping on videos. It sounds like, if they are looking at a bunch of timestamps from different times in sequence, we will be thrashing the LRU cache of jitter seeds, which are SHA-256 hashes. That will cause us to do a bunch of SHA-256 hashes (and memory allocations).

The comment here is the same set of steps I'd recommend to pinpoint this if one can reproduce it.

Testing with:

  • privacy.resistFingerprinting.reduceTimerPrecision.jitter set to false (was set to true before)
  • privacy.reduceTimerPrecision set to true

does not reproduce the excessive time under the get PerformanceResourceTiming functions; it in fact disappears from the profiles:
https://share.firefox.dev/3sIrrB7

The loop remains, and it leaks even faster now (Bug 1674295).

Flags: needinfo?(nicolas.b.pierron)

(In reply to Benoit Girard (:BenWa) from comment #5)

The code in question does performance.getEntriesByName(url, 'resource').reduce(...) and it looks for the max entry by responseEnd that matches a url.

Is it possible that the reduce callback function is accessing responseEnd even for entries which don't match the URL?

Flags: needinfo?(b56girard)

Looking at the network chart in the profiles, it seems that every stream chunk has a different URL. And yet, responseEnd seems to be called many times per chunk.

(In reply to Markus Stange [:mstange] from comment #13)

Is it possible that the reduce callback function is accessing responseEnd even for entries which don't match the URL?

Oh, nevermind, that would only be possible if getEntriesByName returned entries that don't match the URL.

I think we need to find out how many calls we get per network request, by adding logging or profiler markers in getEntriesByName (log URL + number of results) and in GetResponseEnd.
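
For illustration, an exact per-URL count could also be gathered from the page side by wrapping the API in the web console. This is only a rough sketch of that idea, not the Gecko-side markers suggested above:

// Rough console sketch (hypothetical): count getEntriesByName calls per URL.
const geCounts = new Map();
const origGetEntriesByName = Performance.prototype.getEntriesByName;
Performance.prototype.getEntriesByName = function (name, type) {
  geCounts.set(name, (geCounts.get(name) || 0) + 1);
  return origGetEntriesByName.apply(this, arguments);
};
// Later, inspect the counts, e.g. the ten most-queried URLs:
// [...geCounts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10)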

Flags: needinfo?(b56girard)

(In reply to Nicolas B. Pierron [:nbp] from comment #9)

Looking at the profiles, searching for getEntries reports all the getEntriesByName calls, and the stack chart view gives us some temporal hint of the number of calls:

I'm not exactly sure what you're describing here, but it seems like you're trying to get the number of calls, which is impossible to deduce from sampling-based data.

(In reply to Olli Pettay [:smaug] from comment #8)

But somehow the profile in comment 3 enters the very slowest paths of bug 1655683.

Would it make sense to reduce the timer precision only once per entry? At the moment we do it on every field access:
https://searchfox.org/mozilla-central/rev/d5e98892f9553f9113e69c585308008e681e19c9/dom/performance/PerformanceTiming.cpp#555
Maybe we should do it during construction, and store the DOMHighResTimeStamp instead of the TimeStamp on PerformanceTimingData.

That sounds very reasonable. And if we need the higher-accuracy values for something, store them as well but don't expose them to the web.
But isn't some of that already happening (see https://searchfox.org/mozilla-central/rev/d5e98892f9553f9113e69c585308008e681e19c9/dom/performance/PerformanceResourceTiming.cpp#43)? I guess we need it in more places. Cache, or store only the DOMHighResTimeStamp.
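
To illustrate the compute-once idea in JavaScript only (the real change would be in the C++ PerformanceTimingData / PerformanceResourceTiming code; the names below are hypothetical):

// Hypothetical sketch of "reduce precision at construction time".
// reduceTimerPrecision stands in for the potentially expensive clamp+jitter
// step that currently runs on every getter access.
function makeResourceTimingEntry(rawTimings, reduceTimerPrecision) {
  const cached = {
    requestStart: reduceTimerPrecision(rawTimings.requestStart),
    responseStart: reduceTimerPrecision(rawTimings.responseStart),
    responseEnd: reduceTimerPrecision(rawTimings.responseEnd),
  };
  return {
    // The getters become trivial reads, so repeated access in reduce() loops
    // (like Facebook's getLatestResourceTimingEntry) stays cheap.
    get requestStart() { return cached.requestStart; },
    get responseStart() { return cached.responseStart; },
    get responseEnd() { return cached.responseEnd; },
  };
}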

(In reply to Markus Stange [:mstange] from comment #15)

(In reply to Markus Stange [:mstange] from comment #13)

Is it possible that the reduce callback function is accessing responseEnd even for entries which don't match the URL?

Oh, nevermind, that would only be possible if getEntriesByName returned entries that don't match the URL.

I think we need to find out how many calls we get per network request, by adding logging or profiler markers in getEntriesByName (log URL + number of results) and in GetResponseEnd.

The VM on which I am testing is not capable of building Firefox, and I suspect that installing mozilla-central build dependencies would make the bug disappear (by installing some missing libraries).

I can try it if you make a try-push that I can download, or if you manage to give me a BPF file which has the uprobe instrumentation (if someone can tell me how to unstrip firefox binaries from the crash symbols…).

(In reply to Markus Stange [:mstange] from comment #16)

(In reply to Nicolas B. Pierron [:nbp] from comment #9)

Looking at the profiles, searching for getEntries reports all the getEntriesByName calls, and the stack chart view gives us some temporal hint of the number of calls:

I'm not exactly sure what you're describing here, but it seems like you're trying to get the number of calls, which is impossible to deduce from sampling-based data.

Yes, this way of counting only gives us a lower bound estimate of the number of calls, given that we have no coroutines in Firefox and assuming these are not generators/async functions (are they?).

(In reply to Nicolas B. Pierron [:nbp] from comment #19)

I can try it if you make a try-push that I can download

I've started a try build here: https://treeherder.mozilla.org/jobs?repo=try&revision=744e02d1bd61dd24b283d9a1b3ddba383dfef7aa

Yes, this way of counting only gives us a lower bound estimate of the number of calls

Oh, I see. Yes, it does let you get a lower bound.

Oh, just for reference, here's the "prettified" code from the Facebook bundle:

function getLatestResourceTimingEntry(name) {
  if (!performance.getEntriesByName) return null;
  var entries = performance.getEntriesByName(name, "resource");
  return entries.reduce(function (a, b) {
    if (a == null) return b;
    return typeof b.responseEnd === "number" &&
      typeof b.responseStart === "number" &&
      typeof b.startTime === "number" &&
      typeof a.responseEnd === "number" &&
      typeof a.responseStart === "number" &&
      typeof a.requestStart === "number" &&
      typeof a.startTime === "number"
      ? b.responseEnd < a.responseEnd
        ? a
        : b
      : null;
  }, null);
}

(In reply to Markus Stange [:mstange] from comment #20)

I've started a try build here: https://treeherder.mozilla.org/jobs?repo=try&revision=744e02d1bd61dd24b283d9a1b3ddba383dfef7aa

Nicolas, can you reproduce the problem again in this try build and grab a profile?

Flags: needinfo?(nicolas.b.pierron)

I've written an experimental patch to do the timestamp resolution work eagerly, for all fields, even if none of the performance APIs are called. I'm not sure if that's a good idea but I'll put the patches up for reference.

(In reply to Markus Stange [:mstange] from comment #22)

(In reply to Markus Stange [:mstange] from comment #20)

I've started a try build here: https://treeherder.mozilla.org/jobs?repo=try&revision=744e02d1bd61dd24b283d9a1b3ddba383dfef7aa

Nicolas, can you reproduce the problem again in this try build and grab a profile?

The profile made with this try build is available at https://share.firefox.dev/3iUcQOn

Flags: needinfo?(nicolas.b.pierron)

(In reply to Nicolas B. Pierron [:nbp] from comment #28)

The profile made with this try build is available at https://share.firefox.dev/3iUcQOn

Thanks! In this profile we're spending much less time in Facebook's getLatestResourceTimingEntry function, probably because Facebook has reduced the size of the resource timing buffer in the meantime.
You can see lots of resourcetimingbufferfull events being fired: https://share.firefox.dev/2KVVWm4
Moreover, we now have between 25 and 30 entries for each URL in the buffer. That's still a lot! But likely a lot less than in comment 3.

It indeed looks like Facebook requests the same URLs over and over again. Here's the network chart: https://share.firefox.dev/2NDSfCt

On the unfiltered marker chart (https://share.firefox.dev/2NDU0zz), you can execute the following code in the web console to get a breakdown by URL:

// Collect the URLs of all network "Load ..." markers and of all GetEntriesByName markers:
let loadUrlList = filteredMarkers.filter(m => m.name.startsWith("Load ")).map(m => m.data.URI);
let getEntryUrlList = filteredMarkers.filter(m => m.name === "GetEntriesByName").map(m => m.data.name);
let urlSet = new Set(loadUrlList.concat(getEntryUrlList));
// For each distinct URL, report how often it was loaded and how often getEntriesByName was called for it:
[...urlSet.values()].map(url => `Loaded ${loadUrlList.filter(u => u == url).length} times, getEntriesByName ${getEntryUrlList.filter(u => u == url).length} times: ${url}`).join("\n")

This gives:

Loaded 341 times, getEntriesByName 682 times: https://video-cdt1-1.xx.fbcdn.net/v/t42.1790-2/138786069_210069640784604_7754146941036574614_n.mp4?_nc_cat=110&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfYXVkaW9fYWFjcF82NF9mcmFnXzJfYXVkaW8ifQ%3D%3D&_nc_ohc=sUc_JHGqWScAX_oLvac&_nc_ht=video-cdt1-1.xx&oh=e184e3e87c1cec7665ad84ec0452853a&oe=60101148&bytestart=82483&byteend=83654
Loaded 282 times, getEntriesByName 846 times: https://video-cdg2-1.xx.fbcdn.net/v/t39.25447-2/140846455_245865773926867_8619276549445379178_n.mp4?_nc_cat=108&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTVfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=HOBUcEQlJwwAX-X_zWb&_nc_ht=video-cdg2-1.xx&oh=05ccc735d463fdef51962f97f5a3262d&oe=60347A0A&bytestart=477058&byteend=675982
Loaded 341 times, getEntriesByName 680 times: https://video-cdg2-1.xx.fbcdn.net/v/t42.1790-2/140721819_1063062194157692_801491828212844150_n.mp4?_nc_cat=1&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfYXVkaW9fYWFjcF82NF9mcmFnXzJfYXVkaW8ifQ%3D%3D&_nc_ohc=Yk5oKPMOKwMAX94i15y&_nc_ht=video-cdg2-1.xx&oh=32b7b424a728186b3ef412c068fbbb55&oe=60100FAA&bytestart=50108&byteend=51263
Loaded 282 times, getEntriesByName 846 times: https://video-cdt1-1.xx.fbcdn.net/v/t39.25447-2/143164686_1945404802274609_6497816041632441805_n.mp4?_nc_cat=101&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTFfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=WvzbpuK4vX4AX_TLJ9Y&_nc_ht=video-cdt1-1.xx&oh=d673838b57bb1826b3e59b92d71c8716&oe=6036A61A&bytestart=451118&byteend=555248
Loaded 340 times, getEntriesByName 680 times: https://video-cdg2-1.xx.fbcdn.net/v/t42.1790-2/132891872_3396079503847270_6185489711801712447_n.mp4?_nc_cat=1&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfYXVkaW9fYWFjcF82NF9mcmFnXzJfYXVkaW8ifQ%3D%3D&_nc_ohc=gN7uBOoJykUAX8T0Odn&_nc_ht=video-cdg2-1.xx&oh=d6ac64db567c09da5a0ba39dafc9a770&oe=6010180F&bytestart=66485&byteend=75767
Loaded 282 times, getEntriesByName 846 times: https://video-cdg2-1.xx.fbcdn.net/v/t39.25447-2/135241769_402607857487670_748316234532690871_n.mp4?_nc_cat=104&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTFfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=4m83SaqjXCMAX_-k8e1&_nc_ht=video-cdg2-1.xx&oh=321cc688df865419acf0957846280296&oe=6033D33F&bytestart=251851&byteend=320829
Loaded 340 times, getEntriesByName 680 times: https://video-cdg2-1.xx.fbcdn.net/v/t42.1790-2/137594403_736319993929240_874913058885026076_n.mp4?_nc_cat=1&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfYXVkaW9fYWFjcF82NF9mcmFnXzJfYXVkaW8ifQ%3D%3D&_nc_ohc=C-iTSoxbhqMAX8BYnq-&_nc_ht=video-cdg2-1.xx&oh=16331f01ee37393b9af46db746131a93&oe=601018E1&bytestart=115821&byteend=124431
Loaded 281 times, getEntriesByName 843 times: https://video-cdg2-1.xx.fbcdn.net/v/t39.25447-2/138627024_151850170041264_2957618579105685603_n.mp4?_nc_cat=102&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTVfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=fvzSwSea1aUAX_DxUhc&_nc_ht=video-cdg2-1.xx&oh=1180695c22721bb0a617b2249d6e5fc7&oe=603664A3&bytestart=2150726&byteend=2983886
Loaded 340 times, getEntriesByName 682 times: https://video-cdg2-1.xx.fbcdn.net/v/t42.1790-2/138432457_772652486998580_7882847971516699134_n.mp4?_nc_cat=1&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfYXVkaW9fYWFjcF82NF9mcmFnXzJfYXVkaW8ifQ%3D%3D&_nc_ohc=ArBoDLSyBiIAX8gfCAu&_nc_ht=video-cdg2-1.xx&oh=c9216545b42c4f17a508117185ea699e&oe=60100E16&bytestart=82483&byteend=83654
Loaded 281 times, getEntriesByName 843 times: https://video-cdt1-1.xx.fbcdn.net/v/t39.25447-2/140610631_226298955711491_3545714115869898952_n.mp4?_nc_cat=106&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTVfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=cbu8TMrePBUAX96lGlD&_nc_ht=video-cdt1-1.xx&oh=a1fc0206d88ded792f8fd941f1451519&oe=6034244F&bytestart=153246&byteend=324140
Loaded 281 times, getEntriesByName 845 times: https://video-cdt1-1.xx.fbcdn.net/v/t39.25447-2/140251749_146227323977327_631208250021167542_n.mp4?_nc_cat=101&ccb=2&_nc_sid=5aebc0&efg=eyJ2ZW5jb2RlX3RhZyI6ImRhc2hfdjRfNXNlY2dvcF9ocTVfZnJhZ18yX3ZpZGVvIn0%3D&_nc_ohc=KKgfgYZYVaIAX8LTAg2&_nc_ht=video-cdt1-1.xx&oh=b2bbad97ff449e09b66ef1577be3c580&oe=6034BFFC&bytestart=164301&byteend=337532
Loaded 1 times, getEntriesByName 0 times: https://www.facebook.com/ajax/bnzai?__a=1&__beoa=0&__ccg=EXCELLENT&__comet_req=1&__hsi=6922016835067559057-0&__pc=EXP1%3Acomet_pkg&__req=4h&__rev=1003221632&__s=tb5u9c%3As9srfk%3A7uum4d&__spin_b=trunk&__spin_r=1003221632&__spin_t=1611657635&__user=723363057&dpr=1&fb_dtsg=AQG0HfUkvT5o%3AAQFk28bL_ICd&jazoest=21993&ph=C3
Loaded 2 times, getEntriesByName 0 times: https://www.facebook.com/ajax/webstorage/process_keys/?state=0

Benoit, are the repeated loads expected?

Flags: needinfo?(b56girard)
Depends on: 1688943

Comment on attachment 9199178 [details]
Bug 1686930 - Enforce and document that mTimingData is always non-null.

Revision D103007 was moved to bug 1688941. Setting attachment 9199178 [details] to obsolete.

Attachment #9199178 - Attachment is obsolete: true

Comment on attachment 9199179 [details]
Bug 1686930 - Convert PerformanceResourceTiming timestamps eagerly.

Revision D103008 was moved to bug 1688943. Setting attachment 9199179 [details] to obsolete.

Attachment #9199179 - Attachment is obsolete: true

Canceling Benoit's needinfo again - I just saw that bug 1674295 comment 40 has basically the same question on the same topic, and Benoit is already needinfo'd over there.

Flags: needinfo?(b56girard)

Removing [qf] since this is now more like a meta bug, and other bugs which this depends on are being fixed.

Severity: -- → S2
Priority: -- → P2
Whiteboard: [qf]

This doesn't look like an S2 bug, moving to S3. Feel free to revert if you disagree.

Florian, is this interesting from a power usage perspective?

Severity: S2 → S3
Flags: needinfo?(florian)

(In reply to Marco Castelluccio [:marco] from comment #34)

This doesn't look like an S2 bug, moving to S3. Feel free to revert if you disagree.

Florian, is this interesting from a power usage perspective?

It has power use implications (as any bug causing excessive CPU use does), but I think it makes more sense to look at it in terms of responsiveness, as the user experience seems poor. Given this bug was potentially a duplicate of bug 1655683, which has been closed, maybe we should close this one too. Nicolas, feel free to reopen if this can still be reproduced.

Status: NEW → RESOLVED
Closed: 1 year ago
Flags: needinfo?(florian)
Resolution: --- → INACTIVE
Attachment #9199177 - Attachment is obsolete: true