Drop in stub installer submissions from March 11th through March 19th
Categories
(Data Platform and Tools Graveyard :: Operations, defect)
Tracking
(Not tracked)
People
(Reporter: whd, Assigned: robotblake)
References
Details
(Whiteboard: [dataquality])
:rtestard reported via Slack that he observed a very significant drop in stub installs being reported starting March 11th (https://sql.telemetry.mozilla.org/queries/70574#177660).
I checked payload_bytes_raw.stub_installer and determined that the raw volume dropped off at that time, ruling out the recent stub installer pipeline family deduping changes as the cause (a sketch of this check follows the scaling events below). I then checked AWS, since we have a separate teeing mechanism for stub installs that :robotblake set up during the final stages of the GCP migration, and observed that on the 11th there was an instance node failure and a new instance was brought up:
Successful: Launching a new EC2 instance: i-0baee62d15afd692a
  At 2021-03-11T16:44:47Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.
  Start: 2021 March 11, 08:44:49 AM -08:00  End: 2021 March 11, 08:45:51 AM -08:00

Successful: Terminating EC2 instance: i-02d90c246711c7e4a
  At 2021-03-11T16:44:26Z an instance was taken out of service in response to an EC2 instance status checks failure.
  Start: 2021 March 11, 08:44:26 AM -08:00  End: 2021 March 11, 08:56:41 AM -08:00
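For reference, the payload_bytes_raw check mentioned above boils down to a per-day row count; a minimal sketch using the bq CLI (the full table path and the submission_timestamp column are assumptions, not verified against the actual schema):

# Per-day raw submission counts around the reported drop. The project
# and column name below are assumptions; adjust to the real schema.
bq query --use_legacy_sql=false '
SELECT
  DATE(submission_timestamp) AS day,
  COUNT(*) AS raw_submissions
FROM `moz-fx-data-shared-prod.payload_bytes_raw.stub_installer`
WHERE DATE(submission_timestamp) BETWEEN "2021-03-08" AND "2021-03-14"
GROUP BY day
ORDER BY day'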
I'm fairly certain the issue is that the new instance isn't proxying its traffic to GCP (but it is still responding to clients with 200s, so the ELB doesn't consider it unhealthy). I'm guessing this means we have the logs on disk and can backfill from them, but for now I've manually removed the new instance from the ELB. This technically leaves us vulnerable to an additional instance failure, but that would cause only a small gap in data while a replacement comes up, followed by no data until we get the proxying working again.
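For posterity, manually pulling an instance out of a classic ELB looks roughly like the following; the load balancer name is hypothetical, and if the ASG is attached to the ELB it may re-register the instance, in which case aws autoscaling enter-standby is the more durable option:

# Stop routing client traffic to the non-proxying instance; the ASG
# keeps the instance itself alive so its on-disk logs are preserved.
# "dsmo-elb" is a placeholder for the real load balancer name.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name dsmo-elb \
  --instances i-0baee62d15afd692a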
We should determine why the new instance isn't proxying traffic to GCP. I suspect it's because :robotblake set up the running instances with a different configuration than the ASG launch template that spawned the new one. We still have a dsmo-backfill node running that we could possibly use to perform the backfill, but I'll need :robotblake to weigh in here.
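One way to confirm a drift between the hand-configured nodes and the launch template is to dump what the template actually provisions; a sketch, with a hypothetical template name:

# Decode the user data from the latest launch template version to see
# what configuration a freshly launched node actually gets.
# "dsmo-proxy" is a placeholder for the real launch template name.
aws ec2 describe-launch-template-versions \
  --launch-template-name dsmo-proxy \
  --versions '$Latest' \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
  --output text | base64 --decode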
Separately, we should file a bug for migrating this proxy out of AWS. Technically we could point dsmo at the ingestion endpoint directly (since dsmo doesn't use HTTPS), but that would be a change in behavior: the current DSMO ELB doesn't listen on HTTPS, so somebody hitting https://dsmo for some reason would see a cert error.
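The HTTPS behavior difference is easy to check from a client; a sketch with a placeholder hostname, since the real DSMO domain isn't spelled out here:

# Against the current ELB (no HTTPS listener) this fails to connect;
# pointed directly at the ingestion endpoint it would instead complete
# the handshake with a certificate that doesn't match the DSMO name.
curl -sv --connect-timeout 5 https://dsmo.example.mozilla.org/ -o /dev/null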
Reporter
Comment 1•4 years ago
I didn't disable log rotation (which was set to 10 days) on this node until today, which means we likely lost some data for the 11th. We can probably recover it from the ELB logs; however, it may be less effort to simply note this issue on https://docs.telemetry.mozilla.org/concepts/analysis_gotchas.html#notable-historic-events.
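If we do fall back to the ELB logs, note that ELB access logs (when enabled) are delivered to S3 under a well-known prefix; a sketch with placeholder bucket name and account id:

# List ELB access logs for the first affected day. Bucket name and
# account id are placeholders; the AWSLogs/... layout is the standard
# ELB access log delivery structure.
aws s3 ls s3://dsmo-elb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2021/03/11/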
Comment 2•4 years ago
It looks like this has been restored recently. Do you know what the inclusive date window is for affected data?
Reporter
Comment 3•4 years ago
This data is expected to be backfilled soon, but I've updated the bug description with the affected date window.
Assignee
Comment 4•4 years ago
The requests to the affected node have been backfilled and are now showing up in the live tables. I'll check back tomorrow to make sure everything finishes populating correctly.
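A quick sanity check while the backfill settles is a per-day count against the live table (table name as used later in this bug; the submission_timestamp column is an assumption):

# Per-day live-table counts; the affected days should climb back toward
# normal volume as the backfilled records land.
bq query --use_legacy_sql=false '
SELECT
  DATE(submission_timestamp) AS day,
  COUNT(*) AS n
FROM `moz-fx-data-shared-prod.firefox_installer_live.install_v1`
WHERE DATE(submission_timestamp) BETWEEN "2021-03-11" AND "2021-03-21"
GROUP BY day
ORDER BY day'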
Comment 5•4 years ago
Once all the records flow to the live tables, we'll need DE to run the copy_deduplicate task for the affected days to propagate the backfilled records from the live table to the stable table.
Comment 6•4 years ago
Ran the following (from a local clone of bigquery-etl):
for x in $(seq 11 21); do
  ./script/copy_deduplicate \
    --project_id=moz-fx-data-shared-prod \
    --billing_project moz-fx-data-shared-prod \
    --only "firefox_installer_live.install_v1" \
    --parallelism=1 \
    --date=2021-03-${x}
done
The counts in the stable table are now higher for 2021-03-11 through 2021-03-19. There was no change for 2021-03-20. See https://sql.telemetry.mozilla.org/queries/78998/source for a comparison of pre-backfill and post-backfill counts.
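The linked comparison amounts to the same per-day count run against the stable table; a sketch, assuming the stable dataset follows the usual _stable naming convention:

# Post-backfill per-day counts in the stable table, for comparison with
# the pre-backfill numbers captured in the linked query.
bq query --use_legacy_sql=false '
SELECT
  DATE(submission_timestamp) AS day,
  COUNT(*) AS n
FROM `moz-fx-data-shared-prod.firefox_installer_stable.install_v1`
WHERE DATE(submission_timestamp) BETWEEN "2021-03-11" AND "2021-03-20"
GROUP BY day
ORDER BY day'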
Reporter
Comment 7•4 years ago
I messed up my UTC conversion and the 19th should indeed be the last affected date. Closing this.