Open Bug 1689204 Opened 4 years ago Updated 4 years ago

Android sampler may rapidly "catch up" if sampling takes took long, seems undesirable

Tracking

()

Status:

NEW

People

(Reporter: mcomella, Unassigned)

Details

Attachments

(1 file)

validate-scheduleAtFixedRate-fenix.patch 4 years ago Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla] 1.61 KB, patch		Details \| Diff \| Splinter Review

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Reporter

Description

•

4 years ago

Attached patch validate-scheduleAtFixedRate-fenix.patch — Details — Splinter Review

The Android sampler's interval is determined by the ScheduledExecutorService.scheduleAtFixedRate method (source). The Runnable is documented to never run concurrently (docs). This leads to a question: what happens if a Runnable takes longer than specified interval?

In local testing, it appears that 'two or more executions will occur in rapid succession to "catch up."' (this is the same behavior as Timer.scheduleAtFixedRate (docs), which is where I took this quote from). This seems undesirable for the Profiler: if the sampling rate is 1ms and one sample takes 5ms, after that sample I don't think we would want to take 4 samples as rapidly as possible to "catch up" and only then return to the sampling interval. I'd expect this to return not very useful data and potentially cause the profiler to slow down further, especially considering Thread.getStackTrace() is known to be a slow method. On the contrary, I think we'd want to take one sample immediately afterward and return to the interval as specified. I think this is similar to the problems game developers face in creating a 60 FPS game loop, where I think scheduleAtFixedRate is discouraged.

I wonder if the current implementation exacerbates issues I've seen like negative sample timings (bug 1688856) or excessively long durations between samples (#2962).

To visualize, we're doing:

| _   _   _   _               _ _ _ _ _   _   _   _ |

But I think we should do:

| _   _   _   _               _   _   _   _ |

To validate the behavior of scheduleAtFixedRate, I used the attached fenix patch. First we see an intentional five second delay:

01-27 13:06:16.304 E/lol     (14851): running 1611781576304
01-27 13:06:21.305 E/lol     (14851): running 1611781581304

Then we catch up ~25 times:

01-27 13:06:21.305 E/lol     (14851): running 1611781581305
01-27 13:06:21.305 E/lol     (14851): running 1611781581305
01-27 13:06:21.306 E/lol     (14851): running 1611781581306
01-27 13:06:21.306 E/lol     (14851): running 1611781581306
01-27 13:06:21.307 E/lol     (14851): running 1611781581307
01-27 13:06:21.308 E/lol     (14851): running 1611781581307
<snip>

And then we return to the desired 100ms interval:

01-27 13:06:21.319 E/lol     (14851): running 1611781581318
01-27 13:06:21.403 E/lol     (14851): running 1611781581403
01-27 13:06:21.503 E/lol     (14851): running 1611781581503
01-27 13:06:21.603 E/lol     (14851): running 1611781581603
01-27 13:06:21.703 E/lol     (14851): running 1611781581703

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Reporter

Updated

•

4 years ago

Summary: Android sampler will "catch up" if sampling take took long, seems undesirable → Android sampler may rapidly "catch up" if sampling takes took long, seems undesirable

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 1

•

4 years ago

Coincidentally, bug 1685937 recently changed the native (C++) sampler so that it doesn't try to catch up when too much time has passed between two samplings.

NI:Nazım, please have a look when possible.

Severity: -- → S3

Flags: needinfo?(canaltinova)

Priority: -- → P2

Julien Wajsberg [:julienw]

Comment 2

•

4 years ago

•

Edited

In that case, should we use scheduleAtFixedDelay (docs) instead? I'd say that it doesn't show the same behavior.

Nazım Can Altınova [:canova][:canaltinova on phabricator]

Comment 3

•

4 years ago

Thanks for filing this, looks like something we shoould definitely fix.
I'm also wary of using scheduleWithFixedDelay because this will definitely make our intervals more than the given interval amount. Like, if users sets 1ms and the sampling takes 0.5ms, then the real sampling interval will be 1.5ms. Which is not something user wanted.

Before using the ScheduledExecutorService.scheduleAtFixedRate, we were just using Thread.sleep(mInterval) to handle the sampling interval (old code, Bug 1612251).
And now I'm thinking whether we should revert to something like it. If we can revert back to it, we can actually use the same heuristics with the C++ code Gerald mentioned.
The main reason to switch to the ScheduledExecutorService.scheduleAtFixedRate was the performance impact, but I'm not sure if that still applies with modern devices. It would be good if we can test different options and see we can catch up with sampling.

Flags: needinfo?(canaltinova)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Reporter

Comment 4

•

4 years ago

I'm also wary of using scheduleWithFixedDelay

I agree.

The way I've implemented these loops in the past is by sending a message to an Android Handler to trigger an opportunity to execute code and, before exiting the method, dispatching a message to be handled at the next calculated interval, e.g.:

Handler sampleHandler = new Handler(); // I don't remember how to completely init these
SampleCallback callback = new SampleCallback();
callback.setHandler(sampleHandler);

class SampleCallback extends Handler.Callback {
    private Handler handler;

    SampleCallback() {}

    void setHandler(Handler handler) {
        this.handler = handler;
    }

    @Override
    public void handleMessage(Message msg) {
        // do sampling work
        nextSampleMillis = ... // using current time, calculate time of next sampling interval
        handler.sendEmptyMessageAtTime(0, nextSampleMillis); // this will call this method again
    }
}

They have similar methods to post Runnables in Handler and you can probably do something similar with Executors and schedule. There's also Thread.sleep, as you mentioned. I'm not sure which techniques have the least overhead and highest accuracy. I know there are some cases where the Android APIs are more aware of the Android platform's constraints than the Java APIs are (see this comparison of time APIs where the Java uptimeMillis API does not continue to tick during deep sleep, unlike Android's elapsedRealtime). In this case, they may have implemented Android Handler rather than using Executors for their event loop to improve performance or it might just have been to have more direct control over the API. 🤷

Note: I haven't looked at what C++ is doing to understand how that might fit in here.

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Android sampler may rapidly "catch up" if sampling takes took long, seems undesirable

Categories

(Core :: Gecko Profiler, defect, P2)

Tracking

()

People

(Reporter: mcomella, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Attachment

General

Description

File Name

Content Type