1303296 - Implement Expected Queueing Time (EQT) for Telemetry/Profiler

Reporter

Description

•

9 years ago

In the context of Progressive Web Metrics, Estimated Input Latency (EIL) quantifies the risk to input responsiveness for any time window after the page first becomes interactive. EIL is based on the Expected Queueing Time (EQT), which estimates how long the main thread would take to respond to user input, and the probability that new input will require a responsive main thread. Chromium specs can be found here: https://docs.google.com/document/d/1b9slyaB9yho91YTOkAQfpCdULFkZM9LqsipcX3t7He8/edit#

Andrew Overholt [:overholt]

Comment 1

•

9 years ago

Can you start a discussion on dev-platform, Dominik?

Flags: needinfo?(dstrohmeier)

Priority: -- → P3

:Dominik Strohmeier [:dstrohmeier]

Reporter

Comment 2

•

9 years ago

Andrew, yes, we are late with this. Email will go out to dev-platform at latest on Monday hopefully starting a discussion on the topic of Progressive Web Metrics, in general, and the individual components what we see critical to be added to Gecko.

Flags: needinfo?(dstrohmeier)

Jean-Yves Perrier [:teoli]

Updated

•

9 years ago

Keywords: dev-doc-needed

:Harald Kirschner :digitarald

Comment 3

•

9 years ago

:billm mentioned that the DOM team started to land probes. This would fit right into this task set and we need to start baselining.

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Updated

•

9 years ago

Assignee: nobody → afarre

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 4

•

9 years ago

I was thinking we could use some other measurements for this but now I'm not so sure. Let me work on this some more.

Assignee: afarre → nobody

:Harald Kirschner :digitarald

Comment 5

•

9 years ago

Clarified the minimal scope for this bug, which is only telemetry and profiler. Developer facing numbers depend on validation and spec stabilizing.

Keywords: dev-doc-needed

Summary: Implement Estimated Input Latency (EIL) → Implement Estimated Input Latency (EIL) for Telemetry/Profiler

Andrew Overholt [:overholt]

Comment 6

•

9 years ago

William's going to take a look here.

Assignee: nobody → wchen

:Harald Kirschner :digitarald

Updated

•

9 years ago

Blocks: 1299118

:Harald Kirschner :digitarald

Comment 7

•

9 years ago

Latest status: wchen and I had a short kick off meeting last week. He didn't see any glaring issues with the spec apart from minor chrome specific parts, so we might have a test implementation soon that we can then verify in real life scenarios.

:Harald Kirschner :digitarald

Comment 8

•

9 years ago

We can take a first shot at this by actually doing the analysis on JSON profiles. This way we can statically infer the numbers for any profile and then see if the in-engine analysis came to the same conclusion.

Andreas Farre [:farre]

Comment 9

•

9 years ago

This will probably come in handy for Bug 1305153, especially measuring latency occurring due to content JS.

Chris Peterson [:cpeterson]

Comment 10

•

9 years ago

Harald says the new Estimated Queuing Latency (EQL) metric has replaced Estimated Input Latency (EIL).

Summary: Implement Estimated Input Latency (EIL) for Telemetry/Profiler → Implement Estimated Queuing Latency (EQL) for Telemetry/Profiler

Chris Peterson [:cpeterson]

Comment 11

•

9 years ago

William, are you still working on this EQL probe for Hasal performance testing?

status-firefox53: --- → affected

Flags: needinfo?(wchen)

:Harald Kirschner :digitarald

Updated

•

9 years ago

Summary: Implement Estimated Queuing Latency (EQL) for Telemetry/Profiler → Implement Expected Queueing Time (EQT) for Telemetry/Profiler

:Harald Kirschner :digitarald

Comment 12

•

9 years ago

:wchen, is this on your backlog to be on the trains for Q1. Chrome is also planning this for Q1, so we can compare notes with them.

Martin Best (:mbest)

Comment 13

•

9 years ago

This is a core metric for the Quantum work. What's the current status and is there someone owning this?

William Chen [:wchen]

Comment 14

•

9 years ago

I'm working on this and I'll have patches for this up soon.

Flags: needinfo?(wchen)

:Harald Kirschner :digitarald

Comment 15

•

8 years ago

Can I help by arranging a meeting to talk about any details that are unclear and might be blocking you?

Flags: needinfo?(wchen)

:Harald Kirschner :digitarald

Comment 16

•

8 years ago

Can we have a strawman implementation by end of Q1 that allows us to assess how useful this metric is for tracking performance?

Flags: needinfo?(overholt)

Naveed Ihsanullah [:naveed]

Updated

•

8 years ago

Whiteboard: [qf:p3]

William Chen [:wchen]

Comment 17

•

8 years ago

Attached patch Implement expected queuing time metric. — Details — Splinter Review

billm: I see that you recently reviewed some code related to queuing events. Can you review this patch? This patch implements expected queuing as described in the EQT document [1] and roughly implemented as described in second EQT document[2]. The way we calculate the EQT in Gecko isn't quite the same as described in the document because we queue input events in a different way and can't make the same assumptions as Chrome. Here are the things that happen in this patch: - Create a circular buffer for the main thread that records event-related actions in the event queue that handles input events. Actions are pushing and popping of nested event queues, an event being queue, an event that has started running and when an event has stopped running. - When we want to compute the EQT, we use the actions in the buffer to create time slices that are used to determine when a hypothetical input event would run if it were to arrive during the time slice. - For now I have only enabled this for non-release/beta since it touches the event queue and I don't want to risk performance issues unless we actually need more data collected from release/beta. [1] https://docs.google.com/document/d/1b9slyaB9yho91YTOkAQfpCdULFkZM9LqsipcX3t7He8 [2] https://docs.google.com/document/d/13ck0H1It8YChVAsgsAsZnZRtc3JQkZ-QRBSl944X6nA

Flags: needinfo?(wchen)

Attachment #8848367 - Flags: review?(wmccloskey)

William Chen [:wchen]

Comment 18

•

8 years ago

EQT is an instantaneous measurement, at what time do you want to sample it for telemetry?

Flags: needinfo?(hkirschner)

:Harald Kirschner :digitarald

Comment 19

•

8 years ago

We need it for bug 1299118, TTI would be the main consumer for now. With the profiler running we could mark times when EQT is bad, but those times could be also deferred from the information in the profile.

Flags: needinfo?(hkirschner)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 20

•

8 years ago

Comment on attachment 8848367 [details] [diff] [review] Implement expected queuing time metric. Review of attachment 8848367 [details] [diff] [review]: ----------------------------------------------------------------- I can look this over, but changes to nsThread that are this significant should be reviewed by an XPCOM peer.

Attachment #8848367 - Flags: review?(wmccloskey) → review?(nfroyd)

Nathan Froyd [:froydnj]

Comment 21

•

8 years ago

Comment on attachment 8848367 [details] [diff] [review] Implement expected queuing time metric. Review of attachment 8848367 [details] [diff] [review]: ----------------------------------------------------------------- Um. Lots of comments here. The actual nsThread integration seems reasonable enough, but I'd like some context on when exactly we're calling into all this machinery. ::: xpcom/threads/nsThread.cpp @@ +665,5 @@ > } > #endif > } > > +#define EVENT_ACTION_BUFFER_LENGTH 1000 Any rationale for why this is 1000 rather than 60 or 500 or 3000? Please document why this number and not some other one. Probably better to make this: static const kEventActionBufferLength = ...; because variables are better than macros where possible. Since we're using this for a circular buffer that gets touched on every event, it should be a power-of-two so `%` is as cheap as we can make it. @@ +749,5 @@ > } > +#ifndef RELEASE_OR_BETA > + void* eventPtr = event.get(); > +#endif > + bool putInNormalQueue = false; This variable reads better as `placedInNormalQueue`. See comments on the header file part of the patch. @@ +1627,5 @@ > + TimeStamp mStart; > + TimeStamp mEnd; > + // Pointer is only used to identify a event and should never > + // be dereferenced. > + void* mRunsAfter; Maybe we should use mozilla::Opaque<uintptr_t> here, just to make sure nobody does bad things with the pointer? We can initialize it to the obvious value in Init(). I'm not sure if doing so makes things clearer or not; asking for your feedback here. @@ +1637,5 @@ > +}; > +} // anonymous namespace > + > +double > +nsThread::GetExpectedQueueTime() How often is this function getting called? @@ +1645,5 @@ > + TimeStamp now = TimeStamp::Now(); > + > + // Create a local copy of the event action buffer so > + // that we don't have to keep holding the lock. > + nsTArray<TimedEventAction> actions(EVENT_ACTION_BUFFER_LENGTH); Presumably this should be an nsAutoTArray, so we don't have to heap allocate in the common case? @@ +1659,5 @@ > + uint32_t startIndex = actionIndex; > + uint32_t endIndex = actionIndex; > + while (actions[startIndex].mTimeStamp.IsNull() || > + (now - actions[startIndex].mTimeStamp).ToMilliseconds() > 1000) { > + startIndex = (startIndex + 1) % EVENT_ACTION_BUFFER_LENGTH; Is this really correct in the case that `actions` is shorter than EVENT_ACTION_BUFFER_LENGTH? Also, actions[] accesses are not going to be valid if `actions` is shorter than EVENT_ACTION_BUFFER_LENGTH on the first iteration of this loop and the ones below. @@ +1666,5 @@ > + break; > + } > + } > + > + nsTArray<QueuingDuration> queuingDurations; nsAutoTArray, same rationale as previously? @@ +1675,5 @@ > + // A nullptr means that there is currently no queued task, > + // thus any arriving input events do not have to wait in > + // the queue. > + nsTArray<void*> lastQueuedTask; > + lastQueuedTask.AppendElement(nullptr); nsAutoTArray, same rationale as previously? @@ +1680,5 @@ > + > + // Create queuing durations for the recorded actions. > + for (uint32_t currentIndex = startIndex; > + currentIndex != endIndex; > + currentIndex = (currentIndex + 1) % EVENT_ACTION_BUFFER_LENGTH) { Same question here as when we were iterating to determine `startIndex`. @@ +1681,5 @@ > + // Create queuing durations for the recorded actions. > + for (uint32_t currentIndex = startIndex; > + currentIndex != endIndex; > + currentIndex = (currentIndex + 1) % EVENT_ACTION_BUFFER_LENGTH) { > + TimedEventAction currentAction = actions[currentIndex]; const TimedEventAction&, so we're not needlessly copying here? @@ +1692,5 @@ > + > + // End the previous duration if it has not been marked as ended. > + if (!queuingDurations.IsEmpty() && > + queuingDurations.LastElement().mEnd.IsNull()) { > + queuingDurations.LastElement().mEnd = currentAction.mTimeStamp; We do this check-and-set three or four times in this function; I think a small helper is warranted. Additionally, consider helper methods for calculating this loop and the subsequent one. The comments do help make this a lot clearer than it would be otherwise (thank you!), but splitting things up a bit would also be helpful. @@ +1782,5 @@ > + uint32_t currentIndex = startIndex; > + // For each queuing duration, compute the queue time of a hypothetical > + // input event arriving during the duration. > + for (uint32_t i = 0; i < queuingDurations.Length(); i++) { > + QueuingDuration& duration = queuingDurations[i]; Use const? @@ +1788,5 @@ > + TimeStamp runTime = duration.mEnd; > + if (duration.mRunsAfter) { > + // Find the time at which |duration.mRunsAfter| is finished running. > + while (currentIndex != endIndex) { > + TimedEventAction& action = actions[currentIndex]; Use const? @@ +1789,5 @@ > + if (duration.mRunsAfter) { > + // Find the time at which |duration.mRunsAfter| is finished running. > + while (currentIndex != endIndex) { > + TimedEventAction& action = actions[currentIndex]; > + currentIndex = (currentIndex + 1) % EVENT_ACTION_BUFFER_LENGTH; Same concern as above when we're iterating `actions`. ::: xpcom/threads/nsThread.h @@ +169,4 @@ > mozilla::MutexAutoLock& aProofOfLock); > > + void PutEvent(nsIRunnable* aEvent, > + bool* aPutInNormalQueue, I think we should call this `aPlacedInNormalQueue` to make its only-outparamness a little clearer. With the current name, it sounds like it could be an in-out param. Why are we not trying to account for EQT of the high priority queue, even as a separate metric? Surely that is important, too? @@ +177,4 @@ > } > > void PutEvent(already_AddRefed<nsIRunnable> aEvent, > + bool* aPutInNormalQueue, Same comment on this function parameter as previously. @@ +306,5 @@ > + // Pointer is only used to identify a event and should > + // never be dereferenced. > + void* mEvent; > + EventAction mAction; > + mozilla::TimeStamp mTimeStamp; Please put mAction after mTimeStamp so this might pack a little better on some platforms.

Attachment #8848367 - Flags: review?(nfroyd) → feedback+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 22

•

8 years ago

I have a couple higher-level comments: 1. I think that, as much as possible, this code should be split into a separate data structure that just records the events and computes the EQT. That should go in a separate file from nsThread. And we'll need unit tests for that data structure. This code is complex enough that it should be tested in isolation (with GTests or something). 2. I think you may be misinterpreting what PushEventQueue/PopEventQueue are for. They're only called on worker threads for a specific purpose (that I've never really understood). There are nested event loops on the main thread and these Push/Pop functions are never called for them. If you want to detect nested event loops, you'll need to look at the recursion counter in nsThread. 3. I had a hard time following what the code was doing. There will need to be a high-level comment describing what's actually being computed as well as more comments in the code that does the EQT computation. I can kind of imagine how you would adapt Chromium's EQT measurement for our event loop (where input events are not prioritized). But you need to actually write out what you're measuring in detail.

Andrew Overholt [:overholt]

Updated

•

8 years ago

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 23

•

8 years ago

Do we care about this anymore?

Assignee: wchen → nobody

Flags: needinfo?(dstrohmeier)

Chris Peterson [:cpeterson]

Comment 24

•

8 years ago

(In reply to Andrew Overholt [:overholt] from comment #23) > Do we care about this anymore? Harald? ^ (Dominik is out on paternity leave.)

Flags: needinfo?(dstrohmeier) → needinfo?(hkirschner)

:Harald Kirschner :digitarald

Comment 25

•

8 years ago

Not for 57. We might revisit Time to Interactive later. It came up in the past as possible metric for event queue and idle times by the DOM team; is that still relevant?

Flags: needinfo?(hkirschner)

Firefox Bug Husbandry Bot

Updated

•

8 years ago

Keywords: perf

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P3

Whiteboard: [qf:p3]

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Hsin-Yi Tsai (she/her) | away [:hsinyi]

Updated

•

4 months ago

Status: NEW → RESOLVED

Closed: 4 months ago

Resolution: --- → INCOMPLETE