Closed Bug 1638877 Opened 4 years ago Closed 4 years ago

org.everit.json.schema.ValidationException: #/client_info: required key [app_build]

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: mdroettboom)

References

Details

Attachments

(1 file)

org-mozilla-fennec-aurora org.everit.json.schema.ValidationException: #/client_info: required key [app_build] not found baseline 927

We're seeing a bunch of failures like the above across all the app ids. We should figure out why that's happening.

Priority: -- → P3
Whiteboard: [telemetry:glean-rs:m?]
Priority: P3 → P2

Maybe this has something to do with the API we use to get this info being deprecated? It's marked deprecated since API level 28, so I wonder if this is mostly newer devices?

(In reply to Travis Long [:travis_] from comment #1)

Maybe this has something to do with the API we use to get this info being deprecated? It's marked deprecated since API level 28, so I wonder if this is mostly newer devices?

Maybe, but I'd be very surprised. Deprecated doesn't mean removed :)

Assignee: nobody → mdroettboom
Priority: P2 → P1

There are more missing elements than just app_build -- it just seems that it's the first one checked by the schema validator, so it comes up first. It actually looks like anywhere between 1 and 9 elements are missing from client_info.

Looking at the first 500 of these errors: https://sql.telemetry.mozilla.org/queries/72154

The missing elements always match the order in which they are set in initializeCoreMetrics -- that is, the missing ones are always a contiguous run at the end of the order in which they are set in that function. This suggests there is a race condition between initializeCoreMetrics and ping collection.

The Python script below confirms this observation:

import base64
import csv
import gzip
import json

# Export of the first 500 validation errors from the Redash query above.
rows = csv.reader(open("Glean_Validation_Errors_2020_06_22.csv"))

# The header row tells us which column holds the gzipped, base64-encoded payload.
header = next(rows)
idx = header.index("payload")

entries = []
key_sets = set()

for row in rows:
    payload = base64.b64decode(row[idx])
    payload = gzip.decompress(payload)
    payload = json.loads(payload)
    entries.append(payload)

    # Record the distinct combinations of client_info keys that appear.
    key_sets.add(tuple(sorted(payload["client_info"].keys())))

# The order in which initializeCoreMetrics sets the core metrics.
order = [
    "android_sdk_version",
    "os_version",
    "device_manufacturer",
    "device_model",
    "architecture",
    "app_channel",
    "app_build",
    "app_display_version",
]

# If the keys that are present always form a prefix of `order`, their indices
# are exactly 0..n-1; print any combination that breaks that pattern.
for keys in key_sets:
    indices = sorted(order.index(k) for k in keys if k in order)
    if indices != list(range(len(indices))):
        print(keys)

Now to find a way to prevent the race condition.

Blocks: 1645133

More info: the behavior described in the previous comment is only present in the dirty_startup ping. The foreground and background pings either have all core metrics or none.

I cast my vote for not saving or sending a dirty_startup ping if it doesn't have all of its core metrics (and recording an error metric in that case, so we can monitor the number of pings we're not sending as a result).

This suggests that Fenix is stopping while initializeCoreMetrics is running. On the next startup, the dirty_startup ping is sent with incomplete data (the core metrics are only filled in again after that).

We have a few imperfect options to resolve this:

  1. Don't send a dirty_startup ping if some of these values are missing in the database -- if the previous session was really that short, maybe the data isn't interesting anyway. From a data analysis perspective, this is the same situation we have now, since the pings are being dropped at the endpoint. (A sketch of this approach follows the list below.)

  2. Fill in the missing values with "unknown" values so they pass the schema validation and get entered into the database.

  3. Fill in the core metrics before sending the dirty_startup ping. This wouldn't strictly be correct, because we could have updated the app, causing a number of the values to change since we last ran. But for baseline pings, does that edge case matter?
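
For illustration, here is a minimal Kotlin sketch of option 1, assuming a hypothetical completeness check over the persisted client_info. The names (ClientInfoStore, DirtyStartupSender, skippedCount) are invented for this sketch and are not real Glean SDK API.

// Hypothetical sketch of option 1: skip the dirty_startup ping when the
// persisted client_info is missing core metrics, and count how often that
// happens. None of these names exist in the real Glean SDK.

data class ClientInfoStore(val values: Map<String, String>) {
    companion object {
        // Core metrics written by initializeCoreMetrics, in the order they are set.
        val REQUIRED_KEYS = listOf(
            "android_sdk_version",
            "os_version",
            "device_manufacturer",
            "device_model",
            "architecture",
            "app_channel",
            "app_build",
            "app_display_version"
        )
    }

    // True only if every core metric made it into the persisted store.
    fun isComplete(): Boolean = REQUIRED_KEYS.all { it in values }
}

class DirtyStartupSender(private val send: (Map<String, String>) -> Unit) {
    // Stand-in for the proposed error metric: how many pings were dropped.
    var skippedCount = 0
        private set

    fun maybeSendDirtyStartup(store: ClientInfoStore) {
        if (store.isComplete()) {
            send(store.values)
        } else {
            // The previous session died mid-initializeCoreMetrics; the ping
            // would fail schema validation at the pipeline anyway, so drop it
            // locally and record that we did.
            skippedCount++
        }
    }
}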

(In reply to Michael Droettboom [:mdroettboom] from comment #6)

More info: the behavior described in the previous comment is only present in the dirty_startup ping. The foreground and background pings either have all core metrics or none.

This is probably indicating a second problem: maybe Fenix had a startup crash?

(In reply to Alessio Placitelli [:Dexter] from comment #9)

(In reply to Michael Droettboom [:mdroettboom] from comment #6)

More info: the behavior described in the previous comment is only present in the dirty_startup ping. The foreground and background pings either have all core metrics or none.

This is probably indicating a second problem: maybe Fenix had a startup crash?

Hmm... I wonder if there's any way to correlate with the Sentry data (probably not). I don't think we have any way of distinguishing a crash from a "the OS just suddenly shut us down because it can" event.

Another option (4): we don't set the dirty bit until Glean has successfully completed initialization. That way, if we "crash" or "stop" during Glean initialization, we won't get a dirty_startup ping the next time.
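
A rough Kotlin sketch of option 4, assuming a hypothetical persisted flag; DirtyFlagStore and the function names are illustrative only, not the real SDK implementation.

// Hypothetical sketch of option 4: the dirty flag is only set once
// initialization (including the core metrics) has completed, so dying during
// init cannot produce a dirty_startup ping with incomplete client_info on the
// next run. None of these names are real Glean SDK API.

interface DirtyFlagStore {
    fun isDirty(): Boolean
    fun setDirty(value: Boolean)
}

fun startUp(
    flag: DirtyFlagStore,
    initializeCoreMetrics: () -> Unit,
    submitDirtyStartupPing: () -> Unit
) {
    // If the flag is still set, the previous run finished initialization but
    // never shut down cleanly, so its persisted client_info is complete and
    // the dirty_startup ping is safe to send.
    if (flag.isDirty()) {
        submitDirtyStartupPing()
    }

    // Collect the core metrics first...
    initializeCoreMetrics()

    // ...and only then mark the session dirty. If the process is killed before
    // this point, the next startup sees a clean flag and sends nothing.
    flag.setDirty(true)
}

fun shutDownCleanly(flag: DirtyFlagStore) {
    flag.setDirty(false)
}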

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED