Closed Bug 1665966 Opened 4 years ago Closed 4 years ago

Several performance timing distribution metrics contain unexpected histogram keys

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: esmyth, Assigned: mdroettboom)

Details

Attachments

(3 files)

I found several cases of clients whose timing distributions recorded samples under histogram keys larger than the 600000000000 ns maximum. In a few cases, the value is larger than BigQuery's INT64 type can handle, which broke a query.

SELECT
  client_info.client_id,
  CAST(key AS NUMERIC) AS histogram_key,
  SUM(value) AS histogram_value,
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
  CROSS JOIN UNNEST(metrics.timing_distribution.gfx_content_paint_time.values)
WHERE DATE(submission_timestamp) >= '2020-08-18'
  AND DATE(submission_timestamp) < '2020-09-18'
GROUP BY 1, 2
HAVING histogram_value > 0
  AND histogram_key > 600000000000
ORDER BY histogram_key DESC

The following probes show similar issues:

  • performance_page_non_blank_paint
  • performance_time_response_start
  • performance_time_dom_interactive
  • performance_time_dom_content_loaded_start
  • performance_time_dom_content_loaded_end
  • performance_interaction_keypress_present_latency
  • geckoview_page_load_time
  • geckoview_page_reload_time
  • javascript_gc_slice_time
  • javascript_gc_mark_time

The performance_time_load_event_end distribution somehow includes the key 30370004h:

SELECT
  client_info.client_id,
  metrics.timing_distribution.performance_time_load_event_end.time_unit,
  key AS histogram_key,
  SUM(value) AS histogram_value,
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
  CROSS JOIN UNNEST(metrics.timing_distribution.performance_time_load_event_end.values)
WHERE DATE(submission_timestamp) >= '2020-08-18'
  AND DATE(submission_timestamp) < '2020-09-18'
  AND NOT REGEXP_CONTAINS(key, r'^\d+$')
GROUP BY 1, 2, 3
ORDER BY histogram_key DESC
Assignee: nobody → mdroettboom

First, the maximum expected value is 6e17 (not 6e11) for metrics where the input is defined in ms, which is the case for many GeckoView metrics. With that corrected maximum, the query returns only a single invalid value, MAXINT64, and every occurrence comes from a single client. The fact that it's a single client makes me think there is just something broken or unusual about that specific client.
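
The arithmetic behind the corrected bound can be sketched as follows. This is a minimal illustration, assuming the 600000000000 cap is applied in the metric's own time unit and that keys are then stored in nanoseconds. The names `BASE_CAP`, `NS_PER_UNIT`, and `max_expected_key_ns` are hypothetical and not part of Glean.

```python
# Hypothetical helper: the names and the unit-scaling assumption are
# illustrative, not taken from the Glean SDK.
BASE_CAP = 600_000_000_000  # the 6e11 limit quoted in the report
NS_PER_UNIT = {
    "nanosecond": 1,
    "microsecond": 1_000,
    "millisecond": 1_000_000,
}
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def max_expected_key_ns(time_unit: str) -> int:
    """Return the largest expected histogram key, in ns, for a time unit."""
    return BASE_CAP * NS_PER_UNIT[time_unit]


for unit in NS_PER_UNIT:
    bound = max_expected_key_ns(unit)
    print(unit, bound, "fits in INT64" if bound <= INT64_MAX else "overflows INT64")
```

Under these assumptions the millisecond bound works out to 6e17, which is still below the signed 64-bit maximum, so a key equal to MAXINT64 is flagged as out of range.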

Same for the second issue of outright garbage characters in the key -- that only happens for a single client.

I plan to (a) drill down on what might be special about these specific clients and (b) consider adding code at the edge schema to reject pings with these errors.
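
One possible shape for check (b) is sketched below. It is not the edge schema's actual mechanism; it only illustrates the two conditions discussed in this bug (non-numeric keys and keys above the expected maximum). The function name `find_invalid_keys`, the `max_key` parameter, and the sample payload are assumptions made for the example.

```python
import re

# Accept only ASCII decimal digits; Python's \d would also match other
# Unicode digits, so the character class is spelled out explicitly.
NUMERIC_KEY = re.compile(r"[0-9]+")


def find_invalid_keys(values: dict, max_key: int) -> list:
    """Return (key, reason) pairs for keys that should cause a rejection."""
    problems = []
    for key in values:
        if NUMERIC_KEY.fullmatch(key) is None:
            problems.append((key, "non-numeric"))
        elif int(key) > max_key:
            problems.append((key, "above maximum"))
    return problems


# Made-up payload: one plausible key, the garbage key quoted in this bug,
# and a key equal to the signed 64-bit maximum.
sample = {"1024": 3, "30370004h": 1, str(2**63 - 1): 1}
print(find_invalid_keys(sample, max_key=600_000_000_000_000_000))
```

A ping whose timing distribution yields a non-empty list here would be a candidate for rejection; whether to drop the whole ping or only the offending metric is a separate design decision.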

Attached file invalid_key.json

This is the single ping found with non-numeric data in the timing_distribution keys, as extracted from the payload_bytes_decoded table. There doesn't seem to be anything else wrong with the file. Just some random event, perhaps?

Attached file GitHub Pull Request

Fix for the second half of the bug where there are outright invalid characters...

Priority: P3 → P1
Whiteboard: [telemetry:glean-rs:m?]

Closing this, as with the correct maximum used, it boils down to a single erroneous client.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED