Open Bug 2024449 Opened 7 days ago Updated 5 hours ago

Stream resource monitor profile data incrementally to support crash/timeout recovery in CI

Categories

(Testing :: Mozbase, enhancement)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: florian, Assigned: florian)

References

(Regressed 1 open bug)

Details

Attachments

(2 files)

Currently, SystemResourceMonitor only writes the profile file at the end when as_profile() is called. If a CI worker gets killed due to a timeout, no profile data is saved.

This adds incremental streaming of profile data to a JSON Lines file while the monitor is running. The first line contains the meta object, and subsequent lines contain one marker each (test markers, resource measurements, events). On normal shutdown, the file is replaced with the full serialized JSON profile as before.
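As a rough illustration of the file layout described above, here is a minimal sketch of a JSON Lines writer. The class name JsonLinesStream, the payload keys, and the sample values are hypothetical and are not taken from the patch; only the "meta first, then one marker per line" ordering and the "type" field come from this bug.

```python
import io
import json


class JsonLinesStream:
    """Hypothetical helper: writes one JSON object per line, flushing each time."""

    def __init__(self, fh, meta):
        self._fh = fh
        # The first line carries the meta object.
        self._write({"type": "meta", "meta": meta})

    def add_marker(self, marker):
        # Every later line carries exactly one marker.
        self._write({"type": "marker", "marker": marker})

    def _write(self, obj):
        self._fh.write(json.dumps(obj) + "\n")
        # Flush so the line leaves Python's buffer before any later kill.
        self._fh.flush()


# An in-memory buffer stands in for the real file (assumption for the demo).
buf = io.StringIO()
stream = JsonLinesStream(buf, {"product": "example"})
stream.add_marker({"name": "test", "startTime": 0.0, "endTime": 1.5})
lines = buf.getvalue().splitlines()
```

Because every line is a complete JSON document, whatever was flushed before a timeout can still be parsed line by line.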

Changes:

  • SystemResourceMonitor gains a start_streaming(path) method that begins writing profile data incrementally
  • Measurement data from the child process is now sent through the pipe immediately and drained periodically by a timer, instead of being buffered until stop()
  • Each streamed JSON line has a type field ("meta" or "marker") for future extensibility
  • mozharness calls start_streaming so CI test runs produce partial profile data even on timeout
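The "sent immediately, drained periodically" change can be sketched as a non-blocking drain of a multiprocessing pipe. This is an assumption-laden illustration, not the patch's _drain_pipe(): the function name drain_pipe, the message shape, and the use of a single process for both pipe ends are all hypothetical.

```python
import multiprocessing


def drain_pipe(conn):
    """Return every message currently waiting on conn, without blocking."""
    messages = []
    # poll() with no timeout returns immediately, so the loop ends
    # as soon as the pipe is empty.
    while conn.poll():
        messages.append(conn.recv())
    return messages


# Both pipe ends live in one process here, purely for illustration;
# in the monitor the sending end would belong to the child process.
reader, writer = multiprocessing.Pipe(duplex=False)
writer.send({"cpu_percent": 12.5})
writer.send({"cpu_percent": 40.0})
drained = drain_pipe(reader)
```

A timer in the parent process could call such a function at a fixed interval and append the results to the streamed file, so little is lost if the worker is killed between two drains.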

Pure refactor: extract _build_meta(), _build_thread(), _measurement_markers(),
_format_percent(), _drain_pipe(), and class-level category constants from
as_profile() and stop() into reusable methods. No behavior change.

Currently, SystemResourceMonitor only writes the profile file at the end when
as_profile() is called. If a CI worker gets killed due to a timeout, no profile
data is saved.

This adds incremental streaming of profile data to a JSON Lines file while the
monitor is running. The first line contains the meta object, then a thread
object, then one line per marker (test markers, resource measurements, events).
On normal shutdown, the file is replaced with the full serialized JSON profile
as before.
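For the recovery side, a consumer of a partially written file might look like the sketch below. It is not the profiler front-end's implementation. The function name read_partial_profile and the "thread" type value are assumptions (the bug only names "meta" and "marker" as type values), and the sample data is invented.

```python
import json


def read_partial_profile(text):
    """Group streamed records by their "type", skipping lines that fail to parse."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # A killed writer may leave a half-written line; ignore it.
            continue
        if not isinstance(obj, dict):
            continue
        records.setdefault(obj.get("type", "unknown"), []).append(obj)
    return records


# Invented sample: the last line is cut off, as after a timeout kill.
sample = (
    '{"type": "meta", "meta": {}}\n'
    '{"type": "thread", "thread": {}}\n'
    '{"type": "marker", "marker": {"name": "a"}}\n'
    '{"type": "marker", "marker": {"na'
)
recovered = read_partial_profile(sample)
```

Only complete lines are kept, so a truncated final marker costs one data point rather than the whole profile.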

The profiler front-end needs to support the new format; a deploy preview is available at https://deploy-preview-5901--perf-html.netlify.app/
A try push at https://treeherder.mozilla.org/jobs?repo=try&revision=d015b5e0c9f74a6536a28e2ef3b7246adeb7cd63&selectedTaskRun=OpdbI0z2THOqvSq6unfEAA.0 includes jobs that failed with a timeout and would normally not have uploaded resource usage profiles.

Pushed by fqueze@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/2225d9a5f0b8
https://hg.mozilla.org/integration/autoland/rev/17c1d5d6a0ec
Extract helper methods from SystemResourceMonitor.as_profile(). r=jcristau
https://github.com/mozilla-firefox/firefox/commit/9c37112d20d6
https://hg.mozilla.org/integration/autoland/rev/1e3e3b079f34
Stream resource monitor profile data incrementally to support crash/timeout recovery in CI. r=jcristau
Regressions: 2026407