Add campaign field and build metadata to core ping

RESOLVED FIXED

Status

Data Platform and Tools
Datasets: Mobile
P1
normal
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: frank, Assigned: frank)

Tracking

(Blocks: 1 bug)

Details

Attachments

(1 attachment, 2 obsolete attachments)

Created attachment 8855852 [details]
new_core_config
I've tested this and the output looks good. Here are the changes:

1. Partition by normalizedChannel
2. Added new "campaign" field (binary UTF8)
3. Added buildId and appName to metadata
Flags: needinfo?(whd)
In the last version from https://bug1347609.bmoattachments.org/attachment.cgi?id=8850024 the searches group had an int32 value. This looks like a possibly incompatible change; is that expected?
Created attachment 8855896 [details]
updated_core_config

A few changes:

1. appName is no longer a metadata field, but a partition. We need to take all the existing files and put them in "app_name=Fennec", since they are all Fennec pings.

2. "submission" partition name changed to "submission_date". Is it feasible to change this for the historical files? If not feel free to keep it as "submission".
Attachment #8855852 - Attachment is obsolete: true
Created attachment 8855899 [details]
changed_search_field_core_config

Updated search field type.
Attachment #8855896 - Attachment is obsolete: true
https://github.com/mozilla-services/puppet-config/pull/2554

(In reply to Frank Bertsch [:frank] from comment #4)

> 2. "submission" partition name changed to "submission_date". Is it feasible
> to change this for the historical files? If not feel free to keep it as
> "submission".

I've changed this to "submission_date_s3" to be more consistent with our batch jobs (which append _s3 to avoid the problem where a Parquet file contains a field that also exists as an S3 partition).
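As an illustration of the naming problem described above, here is a minimal sketch that builds a Hive-style partition prefix and reports partition names that clash with columns stored inside the files. The helper names, the column list, and the date value are hypothetical and are not taken from the actual core ping config.

```python
def partition_prefix(partitions):
    """Build a Hive-style prefix such as 'k1=v1/k2=v2' from (name, value) pairs."""
    return "/".join(f"{name}={value}" for name, value in partitions)


def colliding_names(partitions, column_names):
    """Return partition names that also appear as columns inside the files."""
    partition_names = {name for name, _ in partitions}
    return sorted(partition_names & set(column_names))


# Hypothetical columns stored inside each Parquet file.
file_columns = ["submission_date", "campaign"]

# Hypothetical partition values; only the partition names come from the thread.
plain = [("app_name", "Fennec"), ("submission_date", "20170410")]
suffixed = [("app_name", "Fennec"), ("submission_date_s3", "20170410")]

print(partition_prefix(suffixed))
print(colliding_names(plain, file_columns))
print(colliding_names(suffixed, file_columns))
```

With the `_s3` suffix, the partition name no longer matches any column name, so a query engine does not see the same name coming from two places.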

This required a copy of all existing data, which is complete. However, it appears there are already different app_names sending data so we may need to do an actual backfill to properly categorize historical data.

After the next p2h run the data should be available using the new partitioning scheme from re:dash.
Flags: needinfo?(whd)
Great, thanks whd. We got a few pings today with Focus data, but I'm not overly concerned and there's no need to run backfill as long as the future data is properly partitioned by appName.
:whd, unfortunately my change wasn't backwards compatible. Bug 1352521 will fix this for our Presto instance, but not for Athena, so we're going to have to version bump the new data.
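The thread does not say which part of the change was incompatible. As a generic sketch only, a check like the following would flag a column whose type differs between two schema versions, which is the kind of change that usually calls for a version bump. The field names and types below are hypothetical.

```python
def incompatible_columns(old_schema, new_schema):
    """Return columns present in both schemas whose declared types differ."""
    return sorted(
        name
        for name, old_type in old_schema.items()
        if name in new_schema and new_schema[name] != old_type
    )


# Hypothetical schemas, written as {column name: type name}.
schema_v1 = {"searches_value": "int32"}
schema_v2 = {"searches_value": "int64", "campaign": "binary"}

# Added columns are ignored; only changed types are reported.
changed = incompatible_columns(schema_v1, schema_v2)
print(changed)
```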
Flags: needinfo?(whd)
I've bumped the version to 2. I'm running backfill for 2017 as I assume that's desired.
Flags: needinfo?(whd)
2017 is now fully backfilled for v2.
Thanks whd. Closing this out.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
:whd, the name should have been `campaignId` not `campaign` :/

What's the cost of a backfill? I need to decide whether or not it's worth it to backfill again.
Status: RESOLVED → REOPENED
Flags: needinfo?(whd)
Resolution: FIXED → ---
In terms of compute, it's $0.42/hr for about 5 hours on a c3.2xlarge per backfilled day, which for 110 days comes out to about $250. There are other factors such as network and S3 API costs, but the cost will be dominated by compute. I estimate the AWS cost of backfill to be < $500.
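For reference, the compute estimate can be reproduced from the figures quoted above. This is a sketch of the arithmetic only; it ignores the network and S3 API costs mentioned in the comment, and the function name is made up for illustration.

```python
HOURLY_RATE_USD = 0.42   # c3.2xlarge rate quoted in the comment
HOURS_PER_DAY = 5        # approximate run time per backfilled day
DAYS_TO_BACKFILL = 110   # number of days quoted in the comment


def backfill_compute_cost(rate_usd, hours_per_day, days):
    """Compute-only cost: hourly rate times hours per day times days."""
    return round(rate_usd * hours_per_day * days, 2)


cost = backfill_compute_cost(HOURLY_RATE_USD, HOURS_PER_DAY, DAYS_TO_BACKFILL)
print(f"Estimated compute cost: ${cost:.2f}")
```

The product is about $231, which is in the same range as the rough figure in the comment and well under the $500 upper bound.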

In terms of my time, an hour or so, as this requires a production deploy, backfill setup, and some context switching.
Flags: needinfo?(whd)
Let's not backfill for now. Can you change the name to campaignId rather than campaign, and we'll have this data moving forward? I'll take care of the schema in mozilla-pipeline-schemas.
Flags: needinfo?(whd)

Updated

a year ago
Component: Metrics: Pipeline → Datasets: Mobile
Product: Cloud Services → Data Platform and Tools
Changes available in STMO.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED