Open Bug 899398 Opened 11 years ago Updated 2 years ago

review telemetry from thumbnail service

Categories

(Firefox :: General, defect)

People

(Reporter: markh, Unassigned)

Bug 870104 added telemetry for thumbnailing, and the results are starting to come in!  We should look at those results and see if anything stands out.
I took a look, summarized below.  Numbers are approximations.

Significant Telemetry Dashboard options were:

* Operating System: All
* Channel: nightly
* Submission Reason: save-session
* Date is "Year to date" unless otherwise implied

The number of samples has steadily dropped over the past few months.  Just eyeballing, it was about 40k per day in late July, down to about 4k in late September.

20% of captures timed out (FX_THUMBNAILS_BG_CAPTURE_DONE_REASON).  The Evolution dashboard doesn't helpfully display discrete values like the completion reason, but looking at the histogram over various periods, 20% seems to be stable since the bg service was turned on.  Don't know why.  Could be long page loads, could be content process crashes.  The page-load time data only includes successful captures, and we unfortunately don't record content process crashes as a completion reason (but we should).
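To make the 20% figure easier to re-check from raw data, here is a minimal sketch of turning per-reason sample counts into shares.  The reason labels ("ok", "timeout") and the counts are made up for illustration; they are not the real enumeration values or dashboard numbers for FX_THUMBNAILS_BG_CAPTURE_DONE_REASON.

```python
def reason_shares(counts):
    """Return each completion reason's share of all samples.

    counts: dict mapping a reason label to its sample count.
    The labels are hypothetical; the real histogram stores enumerated values.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: n / total for reason, n in counts.items()}


# Illustrative counts only, chosen to mirror the ~20% timeout rate above.
sample_counts = {"ok": 800, "timeout": 200}
shares = reason_shares(sample_counts)
print(shares)
```

If content process crashes were recorded as their own completion reason, they would show up as one more key in the same dict.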

Drawing the window to the canvas (FX_THUMBNAILS_BG_CAPTURE_CANVAS_DRAW_TIME_MS) takes twice as long as it does in the fg service (FX_THUMBNAILS_CAPTURE_TIME_MS).  But the bg draw time has steadily declined, from 73ms in early August to 50ms in late September.

The average page load time for successful captures (FX_THUMBNAILS_BG_CAPTURE_PAGE_LOAD_TIME_MS) jumped from 3s to 4s around the start of August.  The longest times -- the 95th percentile -- jumped from 10s to 18s.
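For reference, a rough sketch of how a percentile like the 95th can be estimated from a bucketed histogram.  It assumes the histogram is available as a mapping from bucket lower bound (ms) to sample count, which is my assumption about the data shape and not the dashboard's actual format; the function name and the bucket values are illustrative only.

```python
def bucket_percentile(buckets, pct):
    """Estimate a percentile from a bucketed histogram.

    buckets: dict mapping bucket lower bound (ms) -> sample count.
    pct: integer percentile, e.g. 95.
    Returns the lower bound of the first bucket at which the cumulative
    count reaches pct percent of all samples.  The true value lies
    somewhere inside that bucket, so this is only an approximation.
    """
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("histogram has no samples")
    running = 0
    for lower in sorted(buckets):
        running += buckets[lower]
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= pct * total:
            return lower
    return max(buckets)


# Made-up buckets, not real telemetry.
example = {1000: 90, 5000: 5, 10000: 5}
p95 = bucket_percentile(example, 95)
print(p95)
```

Because only bucket bounds are known, a jump like 10s to 18s could partly reflect samples crossing a bucket boundary, so the bucket layout is worth checking before reading too much into it.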

Captures that successfully complete (FX_THUMBNAILS_BG_CAPTURE_SERVICE_TIME_MS) are surprisingly quick: 3.6s from dequeue and start to completion.  ("Surprisingly" because I always use a debug build, where captures take forever, especially with popular, real-world sites.)  But std dev is 4.3s, and this result showed a jump corresponding to the page-load jump.
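A std dev larger than the mean is what you'd expect from a heavily right-skewed distribution.  Here is a small sketch of how the mean and std dev can be approximated from bucket counts, treating every sample as if it sat at its bucket's lower bound.  That simplification, the function name, and the example buckets are all assumptions for illustration, not how the dashboard computes its numbers.

```python
import math


def bucket_stats(buckets):
    """Approximate mean and population std dev from a bucketed histogram.

    buckets: dict mapping bucket lower bound (ms) -> sample count.
    Each sample is treated as equal to its bucket's lower bound, so both
    results are approximations (the mean is biased low).
    """
    n = sum(buckets.values())
    if n == 0:
        raise ValueError("histogram has no samples")
    mean = sum(value * count for value, count in buckets.items()) / n
    variance = sum(count * (value - mean) ** 2
                   for value, count in buckets.items()) / n
    return mean, math.sqrt(variance)


# Made-up, right-skewed buckets: most captures are fast, a few are very slow.
skewed = {1000: 80, 5000: 15, 30000: 5}
mean_ms, std_ms = bucket_stats(skewed)
print(mean_ms, std_ms)
```

With buckets like these, a handful of very slow captures is enough to push the std dev above the mean, which matches the shape reported above.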

Queue size at the time of a new capture request (FX_THUMBNAILS_BG_QUEUE_SIZE_ON_CAPTURE) is kind of a nice long-tail distribution, but the average size has increased from 6 captures in late July to 15 in late September.

Time that captures spent in-queue (FX_THUMBNAILS_BG_CAPTURE_QUEUE_TIME_MS) shows a strange, non-normal distribution: 15s on average, huge std dev of 38s.  The average time has jumped up twice: first from 10s to 17s around the start of August (like the page-load jump), and then to 22s the second week of September.  This seems to roughly correspond to the increase in queue size.
(In reply to Drew Willcoxon :adw from comment #1)
> and we unfortunately don't record content process crashes as a completion
> reason (but we should).

bug 924651
Severity: normal → S3