Closed Bug 1614861 Opened 9 months ago Closed 8 months ago

Slow startup notification bar can somehow get negative samples

Categories

(Firefox :: General, defect, P3)

defect

Tracking

()

RESOLVED FIXED
Firefox 75
Tracking Status
firefox75 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxperf])

Attachments

(1 file)

The slow startup notification bar samples the time between process start and shortly after browser-delayed-startup-finished fires, and after 5 such samples, checks to see if the average time was over 20s. If so, a slow startup notification shows.

In https://www.reddit.com/r/firefox/comments/eul2fg/mozilla_firefox_seems_slow_to_start_umm_no_it/, a user reports that they're seeing this notification bar, despite the fact that Firefox is starting quite quickly. I reached out to this user, and this is some of the values that were stored for startup samples:

browser.slowStartup.averageTime: -1607100339,
browser.slowStartup.samples: 2,
browser.slowStartup.maxSamples: 5,
browser.slowStartup.timeThreshold: 20000

That averageTime is really suspect. How are we getting this very negative value?

Based on:

https://searchfox.org/mozilla-central/rev/1db5ef59eba65d32d6a29a494e87b6078453e559/browser/components/BrowserGlue.jsm#1258-1270,1288

    let currentTime = Date.now() - Services.startup.getStartupInfo().process;
    let averageTime = 0;
    let samples = 0;
    try {
      averageTime = Services.prefs.getIntPref(
        "browser.slowStartup.averageTime"
      );
      samples = Services.prefs.getIntPref("browser.slowStartup.samples");
    } catch (e) {}

    let totalTime = averageTime * samples + currentTime;
    samples++;
    averageTime = totalTime / samples;

averageTime could be negative if either samples or totalTime is negative. But we know samples isn't. So for averageTime to be negative, totalTime must be negative. Ignoring samples again, either currentTime must be (very) negative, or averageTime is already negative.
So logically, currentTime must be negative at some point. When samples is 0 (ie first startup), if Date.now() yields a lower number than the startup info, that could happen. It looks like we compute the startup number in complicated ways... so perhaps that's plausible? :gsvelto, looks like you know about this based on hg annotate output around https://searchfox.org/mozilla-central/source/mozglue/misc/TimeStamp.cpp#69 ? It does look to me like that's basically computing a "normal" date object from a high-res timestamp which could be error-prone in all kinds of interesting ways, AIUI.

Note also that:

    Services.prefs.setIntPref("browser.slowStartup.averageTime", averageTime);

can overflow to be negative, cf. Services.prefs.setIntPref("test.foo", 1000000000000). This could happen e.g. if the process number is 0/undefined (again, unsure if that can happen), because then we end up saving Date.now() here, which definitely overflows - whether it's negative depends on the actual input number...

Flags: needinfo?(gsvelto)

Blast from the past, this goes back to bug 793735. So, the very, very complicated way the process startup time is calculated in getStartupInfo() is due to the fact that there's no simple way to obtain a number that's consistent with the other entries in that structure. I had to deploy that complex magic dance to make sure that all the entries returned by getStartupInfo() were always in the right order. Before fixing that bug we could have scenarios where the process startup time would appear to have happened later than other events which caused all sort of issues.

So what happened after bug 793735 is that TimeStamp::Now() - TimeStamp()::ProcessCreation() would always return a non-negative value and most of the time the actual process uptime. But. Once we get into the startup info we have Date objects, not timestamps. The TimeStamp class is a relative, monotonic timer unaffected by timezone changes, NTP adjustments and the like. Date is not, it represents an absolute, timezone-dependent time so Date.now() is never guaranteed to be monotonic. So getStartupInfo() is cheating in order to generate a Date.now()-compatible version of the process start time. It's taking a timestamp with PR_Now() - which back in the day was how we implemented Date.now() - and substracting the difference computed via TimeStamp::Now() - TimeStamp()::ProcessCreation().

Long story short, we're comparing two different timebases here, with different properties which is known to be problematic. Unfortunately we don't really have an option because of getStartupInfo() design. getStartupInfo() provides absolute time values for events which cannot be guaranteed to be monotonic WRT Date.now() because the latter is not monotonic.

So the real fix here would involve two things: change getStartupInfo() to return intervals, not absolute times. And then go through the code and find all instances of Date.now() - <Date instance> and switch them to Performance.now() - <DOMHighResTimeStamp instance>. A quick grep reveals there's at least 282 of those.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

So the real fix here would involve two things: change getStartupInfo() to return intervals, not absolute times. And then go through the code and find all instances of Date.now() - <Date instance> and switch them to Performance.now() - <DOMHighResTimeStamp instance>. A quick grep reveals there's at least 282 of those.

Yikes, that sounds like a bunch of work. Mike, based on your comments in bug 1601993, do you want to keep the negative numbers internally? As a stopgap, we could just enforce that the values we store and/or use to determine whether to show the bar are positive, so at least people don't get the notification if their startup was really quick?

Intuitively, I don't think the bug here could happen unless startup was relatively quick and the machine clock got adjusted (e.g. by the OS having done an NTP check inbetween the start of startup and the check here being hit), so I think we should probably at least not show the notification bar in this case... That'd be fairly straightforward to fix, I think?

Though I suppose there's no reason that the adjustment has to be negative, so equally the adjustment could kick us into "oh look, we took 45 seconds to start up" territory just because the NTP clock adjustment moved the clock forward by 43 seconds... and there's no "quick" fix for that.

I guess we could iteratively improve things here by just also storing the highres timestamps on the startup info object, in addition to the Date instance ones, so then we can iteratively switch consumers over to the new stuff? It does look like [the number of calls to getStartupInfo()](https://searchfox.org/mozilla-central/search?q=getStartupInfo() is actually relatively minimal, so maybe that on its own is not super hard to fix...

Flags: needinfo?(mconley)
Priority: -- → P3
See Also: → 1601993
Whiteboard: [fxperf]

I had a look at all the getStartupInfo() consumers and they all seem to care about intervals, not absolute values. So one might change the interface to return the appropriate intervals (e.g. uptime(), timeToFirstPaint(), etc...) and use those. Since the implementation of getStartupInfo() is already using the TimeStamp class internally this should be easy. That would be the "nice" fix that does the right thing w/o rewriting half of the codebase.

But we could also paper over it by calling Math.max(Date.now() - Services.startup.getStartupInfo().process, 0) :-)

Note that we also have bug 1007647 on file which also touches on this problem.

More data from the Reddit user - I asked them to tell me the result of running Services.startup.getStartupInfo().process in the Browser Console - they responded:

The value it returned with was the local time and date in my country. It was correct about everything except for the time- it was 12 minutes early, compared to the actual real time.

I suppose this might account for the negative value, if an NTP request goes out each time the OS starts, and the new time isn't somehow written to the underlying BIOS.

Fascinating stuff. Telling time is hard. And it seems we have lots of ways of doing it.

We also have Cu.now(), which was added in bug 969490, and apparently it's a high-resolution monotonic timer with a time origin of process start. Maybe we should be using that for the slow startup notification bar instead. I suggest we consider Date.now() - processStart an anti-pattern across the codebase, and promote using Cu.now() instead.

Are there any gotchas with Cu.now() we should know about, gsvelto? Or does your bug 969490 comment 14 still stand?

Flags: needinfo?(mconley) → needinfo?(gsvelto)

The slow startup notification bar sampler code was using
Date.now() - Services.startup.getStartupInfo().process.

Services.startup.getStartupInfo().process is a high-resolution timestamp from a
monotonic clock that has been converted to a Date object. Date.now() is a timestamp
from a clock that is not guaranteed to be monotonic, so even though they're technically
the same type, Date.now() ultimately has looser guarantees on its stability.

Date.now() can, for example, change if the OS communicates with a clock on a network,
and updates itself - this can move the clock forward or backward. Timezone changes can
also impact Date.now(). These are edge-cases, but users do hit them nonetheless.

This patch switches the sampler to use Cu.now(), which is a high-resolution interval
from process start to the current time using a monotonic clock.

Assignee: nobody → mconley
Status: NEW → ASSIGNED
See Also: → 1615314

Cu.now() seems to be doing things the right way. It's basically the process uptime in milliseconds so it should be monotonic, relatively high-precision and stable. Naturally it won't be possible to mix the timings with timings coming from other processes but I guess that's fine. It's also going to be a lot faster than calling getPerformanceInfo().

Flags: needinfo?(gsvelto)
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/49925756d93c
Make slow startup notification bar sampler use monotonic clock. r=gsvelto
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 75
You need to log in before you can comment on or make changes to this bug.