Closed Bug 1699645 Opened 4 years ago Closed 4 years ago

Drop in stub installer submissions starting March 11th to March 19th

Categories

(Data Platform and Tools Graveyard :: Operations, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: robotblake)

Details

(Whiteboard: [dataquality])

:rtestard reported via Slack that he observed a very significant drop in reported stub installs starting March 11th (https://sql.telemetry.mozilla.org/queries/70574#177660).

I checked payload_bytes_raw.stub_installer and determined that the raw volume dropped off at that time, which rules out the recent deduping changes to the stub installer pipeline family. I then checked AWS, since we have a separate teeing mechanism for stub installs that :robotblake set up during the final stages of the GCP migration. I observed that on the 11th an instance failed and a new instance was brought up:

Status	Description	Cause	Start time	End time
Successful	Launching a new EC2 instance: i-0baee62d15afd692a	At 2021-03-11T16:44:47Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.	2021 March 11, 08:44:49 AM -08:00	2021 March 11, 08:45:51 AM -08:00
Successful	Terminating EC2 instance: i-02d90c246711c7e4a	At 2021-03-11T16:44:26Z an instance was taken out of service in response to an EC2 instance status checks failure.	2021 March 11, 08:44:26 AM -08:00	2021 March 11, 08:56:41 AM -08:00

I'm fairly certain that the issue is that the new instance isn't proxying its traffic to GCP. It is still responding to clients with 200s, so the ELB doesn't consider it unhealthy. I'm guessing this means that we have the logs on disk and can backfill from them, but for now I've manually removed the new instance from the ELB. This technically leaves us vulnerable to an additional instance failure, but that would cause only a small gap in data (followed by no data until we get the proxying to work again).

We should determine why the new instance isn't proxying traffic to GCP. I suspect it's because :robotblake set the running instances up with a different configuration than the ASG launch template that spawned the new instance. We still have a dsmo-backfill node running that we can possibly use to perform the backfill, but I'll need :robotblake to weigh in here.

Separately, we should file a bug for migrating this proxy out of AWS. Technically we could point dsmo at the ingestion endpoint directly (since it doesn't use HTTPS), but that would be a change in behavior: the current DSMO ELB doesn't listen on HTTPS, and somebody hitting https://dsmo for some reason would then see a cert error.

See Also: → 1699822
Whiteboard: [data-quality]

I didn't disable log rotation (which was set to 10 days) on this node until today, which means we likely lost some data for the 11th. We can probably recover this from ELB logs; however, it may be less effort to simply make a note of this issue on https://docs.telemetry.mozilla.org/concepts/analysis_gotchas.html#notable-historic-events.

It looks like this has been restored recently. Do you know what the inclusive date window is for affected data?

Flags: needinfo?(whd)

This data is expected to be backfilled soon but I've updated the bug description with the affected date window.

Flags: needinfo?(whd)
Summary: Drop in stub installer submissions starting March 11th → Drop in stub installer submissions starting March 11th to March 20th
Assignee: nobody → bimsland

The requests to the affected node have been backfilled and are currently showing up in the live tables. I'll check back tomorrow to make sure everything finishes populating correctly.

Once all the records flow to live tables, we'll need DE to run the copy_deduplicate task for the affected days in order to get these backfill records propagated from the live table to the stable table.

Ran the following (from a local clone of bigquery-etl):

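# Re-run copy_deduplicate for each day from 2021-03-11 through 2021-03-21
# so backfilled records propagate from the live table to the stable table.
for x in $(seq 11 21); do
  ./script/copy_deduplicate \
    --project_id=moz-fx-data-shared-prod \
    --billing_project moz-fx-data-shared-prod \
    --only "firefox_installer_live.install_v1" \
    --parallelism=1 \
    --date="2021-03-${x}"
done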

The counts in the stable table are now higher for 2021-03-11 through 2021-03-19. There was no change for 2021-03-20. See https://sql.telemetry.mozilla.org/queries/78998/source for comparison of pre-backfill and post-backfill counts.

I messed up my UTC conversion and the 19th should indeed be the last affected date. Closing this.
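
For anyone double-checking a date window like this, here is a minimal sketch of mapping a local timestamp to its UTC calendar date. It assumes GNU `date` (the `-d` option is not available in BSD/macOS `date`), and the helper name `utc_date` is hypothetical. The first input is the instance launch time from the ASG activity log above; the second is a made-up time chosen only to show the date rolling over.

```shell
# Hypothetical helper: print the UTC calendar date for a timestamp
# that carries an explicit UTC offset. Assumes GNU date (coreutils).
utc_date() {
  date -u -d "$1" +%Y-%m-%d
}

# Instance launch time from the ASG activity log (-08:00 local time);
# 08:44:49 -0800 is 16:44:49 UTC, so the UTC date is still the 11th.
utc_date "2021-03-11 08:44:49 -0800"

# Illustrative (made-up) late-afternoon local time: 16:30 -0800 is
# 00:30 UTC on the following day, so it lands on the next UTC date.
utc_date "2021-03-11 16:30:00 -0800"
```

Events that happen after 16:00 local time at a -08:00 offset fall on the following UTC date, which is an easy way to end up off by one day when stating an inclusive window.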

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Summary: Drop in stub installer submissions starting March 11th to March 20th → Drop in stub installer submissions starting March 11th to March 19th
Product: Data Platform and Tools → Data Platform and Tools Graveyard
Whiteboard: [data-quality] → [dataquality]