Closed Bug 1384300 Opened 7 years ago Closed 7 years ago

StartupRefreshDriverTimer::ScheduleNextTick runs a lot

Categories

(Core :: Layout, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: billm, Assigned: mchang)

Details

Attachments

(1 file, 1 obsolete file)

StartupRefreshDriverTimer::ScheduleNextTick is consistently showing up in telemetry as a very frequent runnable. Currently it is the #2 most frequent unlabeled runnable. My understanding is that this runnable should only be used before the vsync IPC channel is created. So it should be very rare.

Mason, could you please audit this code and see if we're running this timer when we shouldn't be? This seems like it could be an easy perf win.
Flags: needinfo?(mchang)
Assignee: nobody → mchang
Flags: needinfo?(mchang)
I looked at this and I can't reproduce this at all. The background IPC channel is already open and I don't even get to the startup refresh driver timer at all. If we can't create the vsync child actor, we crash and we should be seeing those. The only way I see this happening is that it takes a super long time to startup and this runs for a long time. How are you getting the telemtry data saying it's running a lot?
Flags: needinfo?(wmccloskey)
The keyed histogram MAIN_THREAD_RUNNABLE_MS keeps track of how many times and for how long each runnable ran for. If you look for the key "StartupRefreshDriverTimer::ScheduleNextTick", you can see how often it was run and compare it to other runnables. Unfortunately, this histogram seems to overwhelm telemetry.m.o. I use analysis.telemetry.m.o to compute this stuff. I've only been focused on unlabeled runnables lately, so I don't have any recent data on the absolute frequency. I'll start a job for this tomorrow, unless you could do it yourself.
Here is a list of the top 50 most commonly occurring runnables in the content process on Nightly. The StartupRefreshDriver one is at #33. That puts it ahead of all sorts of runnables that I would expect to be pretty common.

[(u'PCompositorBridge::Msg_DidComposite(labeled)', 143375719772),
 (u'TimeoutExecutor Runnable(labeled)', 59410770755),
 (u'IdleRunnable', 34257922172),
 (u'PVsync::Msg_Notify', 33948097183),
 (u'CollectorRunner(labeled)', 31343122253),
 (u'AbstractThread::Runner for AutoTaskDispatcher::TaskGroupRunnable(labeled)',
  17418447566),
 (u'StorageNotifierService::Broadcast(labeled)', 9019051456),
 (u'AsyncEventDispatcher(labeled)', 8462330949),
 (u'LayerActivityTracker(labeled)', 7589551799),
 (u'nsDocument::NotifyIntersectionObservers(labeled)', 6811539546),
 (u'IdleRunnableWrapper for AsyncFreeSnowWhite', 6528715258),
 (u'CompleteResumeRunnable(labeled)', 5730208830),
 (u'PBrowser::Msg_RealMouseMoveEvent(labeled)', 5192647800),
 (u'Anonymous_interface_timer', 4214070339),
 (u'HTMLMediaElement::nsAsyncEventRunner(labeled)', 4198639643),
 (u'AbstractThread::Runner for MediaStreamGraphStableStateRunnable(labeled)',
  2856709011),
 (u'MediaResource::Destroy(labeled)', 2759945798),
 (u'dom::PostMessageEvent(labeled)', 2694500197),
 (u'MediaCache::UpdateEvent(labeled)', 2285536644),
 (u'AbstractThread::Runner for ChannelMediaDecoder::ResourceCallback::NotifyBytesConsumed(labeled)',
  2253276060),
 (u'TelemetryIPCAccumulator::IPCTimerFired(labeled)', 2240688692),
 (u'LogMessageRunnable(labeled)', 2169344152),
 (u'nsIDocument::SelectorCache(labeled)', 1678813352),
 (u'ProgressTracker::AsyncNotifyRunnable(labeled)', 1387951363),
 (u'layers::DestroyTextureData(labeled)', 1297439042),
 (u'PTexture::Msg___delete__(labeled)', 1165871372),
 (u'IdleRequestExecutor', 1117665004),
 (u'net::HttpChannelChild::OnTransportAndData(labeled)', 1046300164),
 (u'PHttpChannel::Msg_DeleteSelf(labeled)', 967719132),
 (u'PHttpChannel::Msg_OnStartRequest(labeled)', 959539211),
 (u'AsyncWaitRunnable(labeled)', 957573724),
 (u'PAPZ::Msg_RequestContentRepaint(labeled)', 936242521),
 (u'StartupRefreshDriverTimer::ScheduleNextTick', 882093868),
 (u'DummyEvent', 806975553),
 (u'PContent::Msg_NotifyVisited', 788808267),
 (u'PContent::Msg_DispatchLocalStorageChange', 705935444),
 (u'HTMLMediaElement::nsResolveOrRejectPendingPlayPromisesRunner(labeled)',
  669375536),
 (u'PredictorLearnRunnable(labeled)', 653087008),
 (u'PPluginInstance::Msg_ShowDirectDXGISurface(labeled)', 607093586),
 (u'ProxyReleaseEvent for NS_ReleaseOnMainThreadSystemGroup(labeled)',
  584824970),
 (u'IDecodingTask::NotifyDecodeComplete(labeled)', 569424075),
 (u'XMLHttpRequest(labeled)', 555698410),
 (u'AbstractThread::Runner for MozPromise::ThenValueBase::ResolveOrRejectRunnable(labeled)',
  541376591),
 (u'StorageNotifierRunnable(labeled)', 505226138),
 (u'PVRManager::Msg_UpdateDisplayInfo', 487101114),
 (u'PBrowser::Msg_AsyncMessage(labeled)', 460034963),
 (u'PContent::Msg_DataStoragePut', 381876790),
 (u'PresShell::UpdateApproximateFrameVisibility(labeled)', 378435183),
 (u'RefreshDriverVsyncObserver::NormalPriorityNotify', 375128720),
 (u'non-nsINamed ThrottledEventQueue runnable(labeled)', 358344235)]

The notebook I used is here: https://gist.github.com/bill-mccloskey/897f18b79adf252a820d875558d76d9e
Flags: needinfo?(wmccloskey)
I was just looking at the code myself. Suppose we somehow manage to call StartupRefreshDriverTimer::ScheduleNextTick. Then we create a timer that will fire pretty soon. When that happens, TimerTick is called. TimerTick is passed the StartupRefreshDriverTimer as the closure parameter. We call Tick on that parmeter, which calls Tick with different arguments. That function immediately calls ScheduleNextTick.

So, as far as I can tell, if we ever create the StartupRefreshDriverTimer, then it will tick forever. I'm not really sure what path we would need to take in order for the vsync actor not to have been created. But I think it's probably just a matter of thread switching being slow.

Can you fix this Mason? Could we override SwapRefreshDrivers on the StartupRefreshDriverTimer to cancel the timer?
Flags: needinfo?(mchang)
Thanks! I think this was happening because the SimplerRefreshDriverTimer wasn't virtual and deleting a derived class cased as a base class is undefined. I still added the explicit StopTimer() and an assertion when we tick one to make sure its ok. Thanks for finding this!
Flags: needinfo?(mchang)
Attachment #8891055 - Flags: review?(wmccloskey)
Attachment #8891058 - Flags: review?(wmccloskey) → review+
Attachment #8891055 - Attachment is obsolete: true
Attachment #8891055 - Flags: review?(wmccloskey)
Couldn't land the patch with the assertion that all content and root refresh drivers are empty when we tick. This is because during shutdown, we could still have a vsync event on the main thread loop that can run AFTER we already removed all the refresh drivers and stopped the timer. 

New try - https://treeherder.mozilla.org/#/jobs?repo=try&revision=4eb07203d01aa79065cece82937d1f8f510fa104
Pushed by mchang@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee5586d41e4e
Stop refresh driver timers after swapping refresh drivers. r=billm
https://hg.mozilla.org/mozilla-central/rev/ee5586d41e4e
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Note: This runnable becomes top #2 unlabeled runnable in latest telemetry analysis:
https://gist.github.com/bevis-tseng/d8a918cb53112a82357c872ce15ee790
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bill.mccloskey)
You need to log in before you can comment on or make changes to this bug.