Closed
Bug 1241265
Opened 8 years ago
Closed 8 years ago
TELEMETRY_ARCHIVE_EVICTING_DIRS_MS regression with Nightly buildid >=20160112
Categories
(Toolkit :: Telemetry, defect, P1)
Toolkit
Telemetry
Tracking
()
RESOLVED
WORKSFORME
Tracking | Status | |
---|---|---|
firefox46 | --- | affected |
People
(Reporter: gfritzsche, Assigned: Dexter)
Details
(Whiteboard: [measurement:client])
Attachments
(1 file)
82.02 KB,
image/png
|
Details |
Alert details: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1319/alerts/?from=2016-01-12&to=2016-01-12 Changeset for 20160112030227: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6020a4cb41a77a09484c24a5875bb221714c0e6a&tochange=e790bba372f14241addda469a4bdb7ab00786ab3
Reporter | ||
Comment 1•8 years ago
|
||
There are some FHR removal patches in there, but i'm not sure how the FHR removal patches could make us drop that timing. Maybe if now-removed code had a performance impact on the directory removal? TELEMETRY_ARCHIVE_EVICTING_DIRS_MS histogram is the time it took to remove old (>180 days) ping archive directories. It significantly dropped (across all platforms) from buildid 20160112 on: https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_EVICTING_DIRS_MS&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0 ... but i don't see any change in how many old directories we evict: https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=1 ... or in directory ages: https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_OLDEST_DIRECTORY_AGE&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0 Looking at the code: https://dxr.mozilla.org/mozilla-central/rev/b67316254602a63bf4e568198a5c7d3288a9db27/toolkit/components/telemetry/TelemetryStorage.jsm#846 https://dxr.mozilla.org/mozilla-central/rev/b67316254602a63bf4e568198a5c7d3288a9db27/toolkit/components/telemetry/TelemetryStorage.jsm#880 ... this suggests that we don't get errors thrown from OS.File.removeDir or similar. Looking through the other archive measures, i see an unexpectedly sharp rise in archive size: https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_SIZE_MB&min_channel_version=nightly%252F44&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0 I don't see any obvious changes in the file histories right now. E.g. the last change here has pushdate 2016-01-14: https://hg.mozilla.org/mozilla-central/filelog/tip/toolkit/components/telemetry/TelemetryStorage.jsm
Reporter | ||
Updated•8 years ago
|
Assignee: nobody → gfritzsche
Reporter | ||
Comment 2•8 years ago
|
||
Looking closer, the sharp rise of TELEMETRY_ARCHIVE_SIZE_MB overlaps in dates with the sharp decline of submissions (viewing buy buildid): https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_SIZE_MB&min_channel_version=nightly%252F44&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0 Similar with TELEMETRY_ARCHIVE_SCAN_PING_COUNT: https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_SCAN_PING_COUNT&min_channel_version=nightly%252F44&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0 Roberto, this looks more like the median spiking due to bias to few heavy-weighted submissions? Any idea how to confirm that (besides watching it for a while)?
Flags: needinfo?(rvitillo)
Comment 3•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #2) > Roberto, this looks more like the median spiking due to bias to few > heavy-weighted submissions? > Any idea how to confirm that (besides watching it for a while)? You could run a server side analysis to confirm that.
Flags: needinfo?(rvitillo)
Comment 4•8 years ago
|
||
Note that the median should be pretty robust to outliers though.
Reporter | ||
Comment 5•8 years ago
|
||
Comparing yesterdays graphs to today, the spike is moving: http://i.imgur.com/zDwAycP.png http://i.imgur.com/JMskr1O.png
Reporter | ||
Comment 6•8 years ago
|
||
The spike seems to be an artifact. We'll keep monitoring, but it doesn't look like its a real issue. For the scan time regression/improvement, we can probably explain this with FHR not doing competing file I/O anymore. Let's confirm this with simple scenarios like: * create 6 month archive directories * fill them up with 100+ pings * run startup 10-30 times (until we get stable numbers) both before & after the FHR removal * compare the eviction timings
Assignee: gfritzsche → alessio.placitelli
Reporter | ||
Updated•8 years ago
|
Points: --- → 2
Assignee | ||
Comment 7•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #6) > The spike seems to be an artifact. We'll keep monitoring, but it doesn't > look like its a real issue. > > For the scan time regression/improvement, we can probably explain this with > FHR not doing competing file I/O anymore. > Let's confirm this with simple scenarios like: > * create 6 month archive directories > * fill them up with 100+ pings > * run startup 10-30 times (until we get stable numbers) both before & after > the FHR removal > * compare the eviction timings As discussed over IRC, I morphed this test: * create 12 month archive directories * fill them with 1000 pings With 6 directories and 100 pings, there was no real difference on my machine for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine is quite fast. With the bigger archive, after 15 runs, I got the following averages: 2279ms for builds without FHR against 2374ms for builds with FHR. This is roughly a 4% improvement. I honestly expected a bigger improvement, but still this seems to confirm the trend we're seeing. Georg, can you think of any other test here? Or it's ok to keep watching the data for a while and eventually close this bug?
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 8•8 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #7) > With 6 directories and 100 pings, there was no real difference on my machine > for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine > is quite fast. If this holds true, then we should probably see a change in the DIRS_MS distribution on t.m.o (shifting to smaller DIRS_MS times).
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 9•8 years ago
|
||
Assignee | ||
Comment 10•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #8) > (In reply to Alessio Placitelli [:Dexter] from comment #7) > > With 6 directories and 100 pings, there was no real difference on my machine > > for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine > > is quite fast. > > If this holds true, then we should probably see a change in the DIRS_MS > distribution on t.m.o (shifting to smaller DIRS_MS times). That seems to be the case: in the weeks following the removal of FHR, the distribution slowly shifted. The most recent data (see the attached picture for a comparison and the percentiles) seem to back this trend as well.
Reporter | ||
Comment 11•8 years ago
|
||
Thanks, this looks reasonable then. Furthermore, the spikes from comment 2 keep shifting with the latest submission dates, so that looks like an artifact. Viewing those by date they look rather normal.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•