Consider always using the first taken timestamp as the process creation time
Categories: Core :: mozglue, task
People: Reporter: gsvelto; Assignee: gsvelto
Attachments: 1 obsolete file
In order to produce an instance of the TimeStamp class that represents the process creation time we do the following:
- We take a regular timestamp
- We use platform-dependent code to compute the uptime of the current process in microseconds
- We subtract the process uptime from the current timestamp
- We compare this newly created timestamp against the very first one we sampled. If it falls after the first sampled timestamp (which should be impossible), we discard it, use the first sampled timestamp instead, and record the inconsistency.
This process suffers from several issues:
- We require platform-dependent code that is either complex, fiddly or both.
- The process creation timestamp "wobbles" over time: no two measurements are the same.
- Because the process uptime is computed from a different timebase we often have to fall back on the first taken timestamp. In my testing on an unloaded Linux machine the computed timestamp appears to have happened ~10µs after the first sampled timestamp, so for half the processes we create, we discard the computed timestamp and use the first sampled timestamp instead.
- The code using TimeStamp::ProcessCreation() should check and report this inconsistency, but in practice only 3 out of ~30 call sites seem to care.
- When a machine is loaded the process creation timestamp appears 10 or several tens of milliseconds before the first timestamp. This is because the process was waiting to be scheduled for one or more scheduler time slices. Measuring this time isn't particularly interesting because we have no control over it.
- Last but not least when we restart a process we disregard this time entirely, always relying on the first sampled timestamp instead.
Because of the above I propose to remove the platform-dependent TimeStamp::ComputeProcessUptime() entirely and always rely on the first taken timestamp instead. This provides a consistent & stable measure, a simpler interface, simpler code and fewer platform-specific dependencies, with no particular downsides.
I'll upload a patch and then ask for feedback both here and on dev-platform next year given this is a change that would impact all architectures. I don't want to make this change without sufficient consensus.
Comment 1 (Assignee) • 3 years ago
Comment 2 (Paul Adenot [:padenot]) • 3 years ago
This sounds good. Uptime.h works like this, and it's been working well, and is indeed a lot simpler. There is some room for dedup in Uptime.h after this lands.
Comment 3 (Assignee) • 3 years ago
(In reply to Paul Adenot (:padenot) from comment #2)
> This sounds good. Uptime.h works like this, and it's been working well, and is indeed a lot simpler. There is some room for dedup in Uptime.h after this lands.
Thank you for your feedback. Do you think it would be possible to phase process creation time out eventually and only use uptime instead? I haven't looked at all the code that uses the process creation time but my guess is that it's only to compute uptime.
Comment 4 • 3 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #3)
> I haven't looked at all the code that uses the process creation time but my guess is that it's only to compute uptime.
Talking selfishly about the profiler:
The process creation time is used as the basis for all exported timestamps (so effectively they are uptimes at the time of each recorded event), and then sub-profiles from different child processes are synchronized by computing the wall-clock date & time of that creation time.
Anyway, we don't need that "process creation time" to be precisely the actual process creation time, any early-enough never-changing timestamp would do. 👍
Comment 5 • 3 years ago
It's worth looking back at why we cared about the actual process creation time. Back then, Firefox was directly linked against libxul. That meant there was a large portion of the startup runtime (loading libraries, running static initializers) that wasn't covered by taking a timestamp in main(). That's not the case anymore, but we still record the "start" telemetry probe for the startup timeline as the difference between the process creation time and the time we reach main(). The "main" probe is for XRE_main, so after loading libxul.
So if you run mkdir /tmp/new-profile; gdb --args firefox -profile /tmp/new-profile, then at the gdb prompt do start, wait a few seconds at the prompt, and then do continue, and finally go to about:telemetry#simple-measurements-tab, you'll see that start is not 0 (it will match how long you waited between start and continue). Whether start is still valuable as a measurement would be good to check in telemetry: do we have a lot of values larger than 0, and why? I would also expect antivirus software on Windows to make us report non-0 values. So removing process creation entirely could make us oblivious to those kinds of issues (although they probably also show up to some extent in main).
Comment 6 (Assignee) • 3 years ago
This is interesting, thanks Mike, I'll have a look at our telemetry. Note that in this case the first timestamp would be taken when the static initializers are run, so somewhere in between when the executable is loaded and when main() is called.
Comment 7 (Assignee) • 3 years ago
Alright, so I've looked at the telemetry and start still looks like a Rayleigh distribution, especially on Windows. On Linux it looks a lot more like what I described in comment 0, with the vast majority of the samples being 0 and most of the rest being 10 ms (i.e. a single scheduling slice). macOS seems very well behaved, with most samples being around 0, but even then some samples are remarkably far off (hundreds of milliseconds). So computing the real process creation time is still valuable. I guess this means I'll have to close this as invalid and find another way around the issue of computing the uptime while creating unrooted JS objects.
Comment 8 (Assignee) • 3 years ago
Alright, closing this as INVALID since we still want these measurements. I'll find another way around the issue in bug 1678152.