Closed Bug 1751739 Opened 3 years ago Closed 1 year ago

Experiments showing up as missing columns for org_mozilla_firefox

Categories

(Data Platform and Tools :: General, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ascholtz, Assigned: frank)

References

Details

(Whiteboard: [dataquality])

Attachments

(2 files)

We are seeing experiment slugs in ping_info.experiments for org_mozilla_firefox. Slugs should only appear in a key-value store

So I did some investigation it appears that Android clients to send and enrollmentId object that does not conform to the schemas:

{"ping_info":{"experiments":{"<some slug>":{"extra":{"enrollmentId":"<some enrollment ID>"}}}}}

The schemas expect extra to have a "type" property which is missing here.

@tlong: I'm not sure if this is a problem in the Glean SDK or the way it's implemented in Android?

Flags: needinfo?(tlong)

What you shared in the comment above is the "old" schema before we enforced types on the event_extras, the current schema should have the "type" field but I don't know what we did to preserve backwards compatibility here. Pinging a Jan-Erik for visibility on this too.

Flags: needinfo?(tlong) → needinfo?(jrediger)

ping_info.experiment are not events.
They should not be affected by schema changes we made to event_extras..

We haven't touched the experiments data format in a long time.
{"<some slug>":{"extra":{"enrollmentId":"<some enrollment ID>"}} is as documented.

The definition is here: https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/f3b9a483a93d318e7f83956b1c2f910b4d2d7637/templates/include/glean/glean.1.schema.json#L41-L55
Looking at an early commit here it looks like it was always listing a type property? But maybe that was ok because it doesn't have a "additionalProperties": false in there? Though I'm not 100% sure there.

Anna, can you be specific on what data we see where? From the bug as is I'm not even sure what exactly the problem is.

Flags: needinfo?(jrediger) → needinfo?(ascholtz)

Of course, for some more context: in our weekly data health check meeting we have some checks that look at additional_properties. If some field is present in a valid received ping, but is not present in the ping's schema, it doesn't have its own column to be placed into during ingestion those fields are placed in the additional_properties column of the ping's table or view. During our last check we noticed enrollmentId showing up in additional_properties for the org_mozilla_firefox namespace for baseline, metrics and events pings.

Looking into additional_properties:

SELECT * FROM org_mozilla_firefox.events   -- same for org_mozilla_firefox.baseline, org_mozilla_firefox.metrics
WHERE additional_properties LIKE '%enrollmentId%' AND DATE(submission_timestamp) = '2022-01-30' 
LIMIT 10

revealed a lot of entries shaped like:

{"ping_info":{"experiments":{"<some slug>":{"extra":{"enrollmentId":"<some enrollment ID>"}}}}}

When looking at the most recent schema version, it looks like the structure the pings are currently sending is not conforming to the schema. The current schema would expect a format like:

{"ping_info":{"experiments":{"<some slug>":{"extra":{"type":"<something something>"}}}}}

Looking at the Glean SDK build versions, most of these pings are coming from 42.1.0 (which looks like a pretty recent one):

SELECT client_info.telemetry_sdk_build, count(*) FROM org_mozilla_firefox.baseline
WHERE additional_properties LIKE '%enrollmentId%' AND DATE(submission_timestamp) = '2022-01-30' 
group by 1
order by 1 desc
Flags: needinfo?(ascholtz)

Thanks for the context!

:travis, can you take a look at it this week?

Flags: needinfo?(tlong)

Yep, I can fit it in this week.

Assignee: nobody → tlong
Flags: needinfo?(tlong)

Okay, interestingly enough I see this in Firefox iOS and Focus Android also (but not Focus iOS because they haven't ran any Nimbus experiments yet). So whenever Nimbus started adding the enrollmentId to the experiment extras, we probably started seeing this problem. I agree with :janerik that the format I expect doesn't mention the type

Jan-Erik, I assume we want to update the schema to match what is documented? If so, I can probably do that either tomorrow or first part of next week.

Flags: needinfo?(jrediger)

Yes, as we didn't change the client the schema is wrong, so updating it is necessary.

Do I understand it right that for a long time we simply didn't have any extra fields in experiments and thus it fit the schema (because an empy extra object is allowed)?

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #8)

Yes, as we didn't change the client the schema is wrong, so updating it is necessary.

Do I understand it right that for a long time we simply didn't have any extra fields in experiments and thus it fit the schema (because an empy extra object is allowed)?

That's correct, I know mako didn't use the extras, and I can't think of any other consumer of the experiment API that would have until Nimbus came along and started putting the enrollmentId there.

Fixing the schema it is then.
Will we need to backfil data?

I have a PR up to correct the schema, I've pinged Anna for review. As to whether or not we need to backfill, I'll leave that to whoever was looking at the data. Nothing I have worked on would require it but the Nimbus analysis stuff might require it.

I don't think we need to backfill (the Nimbus analysis stuff doesn't require it so far) and no one else has complained about the data missing so far.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

this was reverted in Bug 1754502 and mozilla-pipeline-schemas#725 because mozilla-pipeline-schemas#724 introduced an invalid schema evolution.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

To summarize, the ping_info.experiments[].value.extra has BigQuery type STRUCT<type STRING> whereas what we want is a repeated STRUCT<key STRING, value STRING>. We do not currently have any way to change the type of a field once it has been deployed.

The only potential remedy that comes to mind here is to introduce a coordinated change in schema generation and in the pipeline to rename this field to extra2 or something along those lines, so that we can deploy the new field with the correct structure. This is similar to the remedy in https://bugzilla.mozilla.org/show_bug.cgi?id=1737656 where we had to rename a few metric types.

See Also: → 1737656

I have to untake this in order to focus on other things

Assignee: tlong → nobody
Component: Datasets: General → General

I'm marking this as a P4 since according to https://bugzilla.mozilla.org/show_bug.cgi?id=1774272#c0 Nimbus does not rely on this field, and is the only customer of the Experiment API.

We can re-evaluate if there's an end user who needs to use this field. Workaround right now is to query it in additional_properties.

Severity: -- → S3
Priority: -- → P4

Instead of dealing with the general problem here - that experiment extra fields have the wrong type - we can enable this field directly in the schema, as a compatible change.

Assignee: nobody → fbertsch
Priority: P4 → P1
Whiteboard: [data-quality] → [dataquality]
Status: REOPENED → RESOLVED
Closed: 3 years ago1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: