Closed Bug 1638236 Opened 4 years ago Closed 2 years ago

Profiler should handle timeout when gathering child process profiles

Categories

(Core :: Gecko Profiler, defect, P1)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mozbugz, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Some tests expect for a full async profile, e.g.: https://searchfox.org/mozilla-central/rev/0688ffdef223dac527c2fcdb25560118c4e4df51/browser/components/tests/startupRecorder.js#212,217

Sometimes the test times out.
:florian was able to force-capture a shutdown profile of the main process: https://perfht.ml/3cAAZp8

Notice the 3s gap in samples, starting around the 10s mark, very visible in the Compositor track, this is (currently) normal while the main process is processing the profile buffer in the main thread.

Now, looking at the Marker Table, filtered with "1476":

  • At 10.334s, there is a "PProfiler::Msg_GatherProfile — sent to 1476", this is the main process requesting a profile from sub-process 1476.
  • At 6.853s, this is the last message received from 1476.
    So the profiler waits for a profile that will never arrives!

Filtering the marker table with "GatherProfile" shows that 3 other processes provided their profile around 13.4s.

Some possible actions I can think of:

A. The test itself could have a timeout, in which case it could do what :florian did manually, i.e., set the MOZ_PROFILER_SHUTDOWN=... env-var and force a Firefox shutdown to capture some profiling for later analysis.
Bug 1528859 should help capturing other processes than just the parent; though if one process is really frozen it probably won't respond anyway -- but it would be really nice if there was a way to force a shutdown-profile dump, maybe with a signal?

B. The profile-gathering logic inside Firefox could have its own timeout, so it would provide a partial profile and the test would end normally. However this may hide freezing issues! So it would be good to somehow record that some processes did not respond.

(Other suggestions welcome.)

(In reply to Gerald Squelart [:gerald] (he/him) from comment #0)

B. The profile-gathering logic inside Firefox could have its own timeout, so it would provide a partial profile

I think this would be really nice for cases where Firefox is misbehaving and a content process is not responding, outside of tests.

and the test would end normally. However this may hide freezing issues! So it would be good to somehow record that some processes did not respond.

We would probably still want tests to fail, so it would be nice to at least have a warning or an error reported somehow. But one test in the folder reads the startup profile to look at main thread I/O done by content processes, so if there's no content process at all in the profile, we'll already have a test failure.

This seems to happen more frequently these days, so I'm increasing the priority.

Severity: -- → S3
Type: task → defect
Priority: P3 → P1
Assignee: nobody → gsquelart

(In reply to Gerald Squelart [:gerald] (he/him) from comment #0)

B. The profile-gathering logic inside Firefox could have its own timeout, so it would provide a partial profile and the test would end normally. However this may hide freezing issues! So it would be good to somehow record that some processes did not respond.

Bug 1673513 just landed, it added a better timeout mechanism (waiting for as long as child processes keep progressing in their profile-gathering work).
For now, frozen processes are just silently dropped. Bug 1752930 will add information about these.

Depends on: 1752930

Bug 1673513 added good handling of timeouts when gathering child profiles.
Bug 1673513 added logging about child profiles that are discarded.
(And bug 1752930 now focuses on exit profiles, which are not relevant to this bug's initial goals.)

So I'm calling this bug effectively fixed.

Status: NEW → RESOLVED
Closed: 2 years ago
No longer depends on: 1752930
Resolution: --- → FIXED
See Also: → 1752930
You need to log in before you can comment on or make changes to this bug.