Closed Bug 1241265 Opened 8 years ago Closed 8 years ago

TELEMETRY_ARCHIVE_EVICTING_DIRS_MS regression with Nightly buildid >=20160112

Categories

(Toolkit :: Telemetry, defect, P1)

Points:
2

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox46 --- affected

People

(Reporter: gfritzsche, Assigned: Dexter)

Details

(Whiteboard: [measurement:client])

Attachments

(1 file)

There are some FHR removal patches in there, but I'm not sure how they could make that timing drop.
Maybe the now-removed code had a performance impact on the directory removal?

The TELEMETRY_ARCHIVE_EVICTING_DIRS_MS histogram records the time it took to remove old (>180 days) ping archive directories.
It dropped significantly (across all platforms) starting with buildid 20160112:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_EVICTING_DIRS_MS&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0

... but I don't see any change in how many old directories we evict:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=1

... or in directory ages:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_OLDEST_DIRECTORY_AGE&min_channel_version=nightly%252F46&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0

Looking at the code:
https://dxr.mozilla.org/mozilla-central/rev/b67316254602a63bf4e568198a5c7d3288a9db27/toolkit/components/telemetry/TelemetryStorage.jsm#846
https://dxr.mozilla.org/mozilla-central/rev/b67316254602a63bf4e568198a5c7d3288a9db27/toolkit/components/telemetry/TelemetryStorage.jsm#880
... this suggests that we don't get errors thrown from OS.File.removeDir or similar.

Looking through the other archive measures, I see an unexpectedly sharp rise in archive size:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2016-01-17&keys=!__none__!__none__&max_channel_version=nightly%252F46&measure=TELEMETRY_ARCHIVE_SIZE_MB&min_channel_version=nightly%252F44&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-12-14&trim=1&use_submission_date=0

I don't see any obvious changes in the file histories right now. E.g. the last change here has pushdate 2016-01-14:
https://hg.mozilla.org/mozilla-central/filelog/tip/toolkit/components/telemetry/TelemetryStorage.jsm
Assignee: nobody → gfritzsche
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)

> Roberto, this looks more like the median spiking due to bias to few
> heavy-weighted submissions?
> Any idea how to confirm that (besides watching it for a while)?

You could run a server side analysis to confirm that.
Flags: needinfo?(rvitillo)
Note that the median should be pretty robust to outliers though.
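To illustrate the point about robustness, here is a minimal sketch with made-up timings (the numbers are hypothetical, not taken from the Telemetry data): a couple of extreme submissions barely move the median, while the mean jumps.

```python
from statistics import mean, median

# Hypothetical eviction timings in ms; not real Telemetry data.
typical = [40, 42, 45, 47, 50, 52, 55]
# The same sample plus two extreme submissions.
with_outliers = typical + [5000, 8000]

# The median only moves to the next value in the sorted sample.
print("median:", median(typical), "->", median(with_outliers))
# The mean is dragged far away by the two extreme values.
print("mean:  ", round(mean(typical), 1), "->", round(mean(with_outliers), 1))
```

A sustained median spike would therefore need a large share of heavy submissions, not just a handful.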
Comparing yesterday's graphs to today's, the spike is moving:
http://i.imgur.com/zDwAycP.png
http://i.imgur.com/JMskr1O.png
The spike seems to be an artifact. We'll keep monitoring, but it doesn't look like it's a real issue.

For the scan time regression/improvement, we can probably explain this with FHR not doing competing file I/O anymore.
Let's confirm this with simple scenarios like:
* create 6 month archive directories
* fill them up with 100+ pings
* run startup 10-30 times (until we get stable numbers) both before & after the FHR removal
* compare the eviction timings
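
The setup steps above could be scripted roughly as follows. This is only a sketch under assumptions: the `YYYY-MM` directory naming, the dummy file names and the `{}` payload are illustrative placeholders, and `build_archive` / `time_removal` are hypothetical helpers. It prepares a fixture and gives a rough standalone removal time; it does not reproduce what Firefox records in TELEMETRY_ARCHIVE_EVICTING_DIRS_MS.

```python
import shutil
import tempfile
import time
import uuid
from datetime import date
from pathlib import Path


def month_dirs(count, today):
    """Return `count` YYYY-MM names, walking back from the month before `today`."""
    names = []
    year, month = today.year, today.month
    for _ in range(count):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        names.append(f"{year:04d}-{month:02d}")
    return names


def build_archive(root, months, pings_per_month, today):
    """Create `months` monthly directories under `root`, each holding dummy ping files."""
    root = Path(root)
    for name in month_dirs(months, today):
        directory = root / name
        directory.mkdir(parents=True)
        for i in range(pings_per_month):
            # Placeholder name and payload (assumption); real pings look different.
            (directory / f"{i}.{uuid.uuid4()}.main.json").write_text("{}")
    return root


def time_removal(root):
    """Remove every directory under `root` and return the elapsed time in ms."""
    start = time.perf_counter()
    for directory in sorted(Path(root).iterdir()):
        shutil.rmtree(directory)
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    archive = build_archive(Path(tempfile.mkdtemp()) / "archived", 6, 100, date.today())
    print(f"removed 6 directories in {time_removal(archive):.1f} ms")
```

For the actual comparison, the generated directories would have to be placed in a test profile and the histogram read from builds before and after the FHR removal.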
Assignee: gfritzsche → alessio.placitelli
Points: --- → 2
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> The spike seems to be an artifact. We'll keep monitoring, but it doesn't
> look like it's a real issue.
> 
> For the scan time regression/improvement, we can probably explain this with
> FHR not doing competing file I/O anymore.
> Let's confirm this with simple scenarios like:
> * create 6 month archive directories
> * fill them up with 100+ pings
> * run startup 10-30 times (until we get stable numbers) both before & after
> the FHR removal
> * compare the eviction timings

As discussed over IRC, I morphed this test:

* create 12 month archive directories
* fill them with 1000 pings

With 6 directories and 100 pings, there was no real difference on my machine for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine is quite fast.

With the bigger archive, after 15 runs, I got the following averages: 2279ms for builds without FHR against 2374ms for builds with FHR. This is roughly a 4% improvement.
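
For reference, the 4% figure follows from the two averages above; `relative_improvement` is just an illustrative helper name:

```python
def relative_improvement(before_ms, after_ms):
    """Fraction by which `after_ms` is faster than `before_ms`."""
    return (before_ms - after_ms) / before_ms


with_fhr_ms = 2374     # average over 15 runs, builds with FHR
without_fhr_ms = 2279  # average over 15 runs, builds without FHR

print(f"{relative_improvement(with_fhr_ms, without_fhr_ms):.1%}")
```

The absolute difference is 95ms per eviction run on this machine.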

I honestly expected a bigger improvement, but still this seems to confirm the trend we're seeing.

Georg, can you think of any other test here? Or is it OK to keep watching the data for a while and eventually close this bug?
Flags: needinfo?(gfritzsche)
(In reply to Alessio Placitelli [:Dexter] from comment #7)
> With 6 directories and 100 pings, there was no real difference on my machine
> for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine
> is quite fast.

If this holds true, then we should probably see a change in the DIRS_MS distribution on t.m.o (shifting to smaller DIRS_MS times).
Flags: needinfo?(gfritzsche)
Attached image DIR_MS_Compare.png
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> (In reply to Alessio Placitelli [:Dexter] from comment #7)
> > With 6 directories and 100 pings, there was no real difference on my machine
> > for TELEMETRY_ARCHIVE_EVICTING_DIRS_MS. I guess this is because my machine
> > is quite fast.
> 
> If this holds true, then we should probably see a change in the DIRS_MS
> distribution on t.m.o (shifting to smaller DIRS_MS times).

That seems to be the case: in the weeks following the removal of FHR, the distribution slowly shifted. The most recent data (see the attached picture for a comparison and the percentiles) seem to back this trend as well.
Thanks, this looks reasonable then.
Furthermore, the spikes from comment 2 keep shifting with the latest submission dates, so that looks like an artifact.
Viewing those by date they look rather normal.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME