Closed Bug 1729069 Opened 3 years ago Closed 3 years ago

Country attribution is skewed towards US starting 2021-08-31

Categories: Data Platform and Tools :: General, task, P1
Points: 3
Tracking: (Not tracked)
Status: RESOLVED FIXED
Reporter: klukas
Assignee: klukas
Whiteboard: [dataquality]
Attachments: (1 file)

There's a slight bump on 2021-08-31, but the effect really ramps up for 09-01 and 09-02. This seems to be affecting all doctypes.

This graph compares country attribution counts for 2021-09-02 vs. one week previous for two different tables (activity_stream.sessions and telemetry.events): https://sql.telemetry.mozilla.org/queries/81777/source

Attribution to the US is about 3x while every other country sees attribution at about 0.6x compared to the previous week, so we are almost certainly resolving many clients to the US that are actually in other parts of the world.

We should look into when the geo database was last updated.

I was initially alerted to this by reporting on new tab metrics that showed a huge increase in the number of clients in the US. See:

https://sql.telemetry.mozilla.org/queries/80904/source#200710
https://sql.telemetry.mozilla.org/queries/81775/source#202676

Those graphs show the magnitude of the spike increasing over the past 3 days.

:whd almost immediately realized this is likely related to https://bugzilla.mozilla.org/show_bug.cgi?id=1666498, where we have been changing how telemetry traffic flows.

This is not due to an update of the GeoIP database.

:whd has changed DNS entries to redirect all traffic back to the known working stack. That should propagate in a few hours, so data going forward should have correct geo.

:whd is also increasing retention on all payload_bytes_raw tables from 14 days to 30 days to minimize risk of data rolling off before we have a chance to backfill.

:wlach identified that the Numbers That Matter dashboard shows issues back to August 17th, which coincides with when :whd moved the stub installer pipeline family to the new network configuration. That is outside the 14-day retention window that was set on the table, so we initially thought the data was lost.

But it appears that BQ time travel (which lets us look at snapshots up to 7 days old) gives us an extra 7 days of visibility, so we can recover this. First, we need an appropriate snapshot timestamp:

SELECT UNIX_MILLIS(CURRENT_TIMESTAMP() - INTERVAL 6 day - INTERVAL 22 hour)
-- 1630100775333

Then we can craft a bq cp invocation to save off that snapshot; it will look something like:

bq cp moz-fx-data-shared-prod:payload_bytes_raw.stub_installer@1630100775000  moz-fx-data-shared-prod:payload_bytes_raw.stub_installer_snapshot_bug1729069 

I'm going to run the following on prod for all standard pipeline families while we sort this out:

 bq update --time_partitioning_expiration 2592000 moz-fx-data-shared-prod:payload_bytes_raw.stub_installer
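
For reference, a minimal sketch of the same retention change expressed as BigQuery DDL; the bq command above is what was actually run, and 2592000 seconds is 30 days:

-- Equivalent DDL sketch, not what was actually run.
ALTER TABLE `moz-fx-data-shared-prod.payload_bytes_raw.stub_installer`
SET OPTIONS (partition_expiration_days = 30);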

I've copied the snapshot for stub installer pings into moz-fx-data-shared-prod.payload_bytes_raw.stub_installer_snapshot_bug1729069.

We should now be in a stable state where new data will have correct geo. We also now have at least 14 days to perform a backfill before affected raw data rolls off.

Immediate follow-ups:

  • Send comms to fx-data-dev about this incident

Basic list of follow-ups for early next week:

  • Code change for the Decoder to handle this new case
  • Prep backfill process
  • Run a small sample backfill to validate country distribution looks correct
  • Kick off Dataflow jobs for large scale backfill on the 5 affected days
  • Validate backfill and copy into place
  • Clear DAGs in Airflow to rerun ETL for affected days

Further follow-ups:

  • Schedule a retrospective

The primary affected window is August 30th to September 3rd UTC, with increasing numbers of submissions affected over the course of the week as traffic was shifted.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1666498#c24 the affected window for telemetry and structured data includes a limited window (< 1/251) from August 23-29. From discussion with :klukas we think it's best to avoid doing a full backfill for those days as the cost and time would be substantial, and instead make a note of it on https://github.com/mozilla/data-docs/blob/main/src/concepts/analysis_gotchas.md#notable-historic-events. NI :mreid for approval.

The main issue here is that a new static value (the GCLB IP) is now added via nginx to the XFF proxy chain, and the logic we use in gcp-ingestion didn't account for it. This was caused by a logging change at some point from the cloudops-infra skeleton that I did not properly test for in later iterations of the edge deployment. The resolution we're planning to make on Tuesday is to perform the geoip lookup using the same rule as when x_pipeline_proxy is set whenever the static value 35.227.207.240 is the second-to-last entry of XFF. Essentially, the presence of x_pipeline_proxy means we should skip an entry in XFF (that's the AWS tee), and the presence of 35.227.207.240 means we should also skip an entry (this is the GCLB, which wasn't logged on the old GCP edge). EDIT: this value should probably be configurable to account for multiple deployments such as stage or another migration, but in practice we will never want to backfill stage.
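
To make the two skip conditions concrete, here is a minimal SQL sketch; the actual fix lands in the gcp-ingestion Decoder code rather than in a query, and the payload_bytes_raw.telemetry table and the x_forwarded_for / x_pipeline_proxy column names are assumptions used only for illustration:

-- Illustration only: table and column names are assumptions; the real change
-- is made in the Decoder, not in SQL.
WITH parsed AS (
  SELECT
    x_pipeline_proxy,
    ARRAY(SELECT TRIM(ip) FROM UNNEST(SPLIT(x_forwarded_for, ',')) AS ip) AS xff
  FROM
    `moz-fx-data-shared-prod.payload_bytes_raw.telemetry`
  WHERE
    DATE(submission_timestamp) = '2021-09-02'
)
SELECT
  -- Skip one XFF entry when the AWS tee marker is present ...
  x_pipeline_proxy IS NOT NULL AS skip_entry_for_aws_tee,
  -- ... and one more when the GCLB address sits second to last in XFF.
  ARRAY_LENGTH(xff) >= 2
    AND xff[SAFE_OFFSET(ARRAY_LENGTH(xff) - 2)] = '35.227.207.240' AS skip_entry_for_gclb,
  xff
FROM
  parsed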

stub installer will need a special case to backfill, since the namespace it writes to (firefox-installer) is shared with the full installer ping from the structured pipeline family. We need to reprocess from August 17th to September 3rd for this table, but only from August 30th from payload_bytes_raw.structured. We should be able to select pings from the decoded table before August 30th that have installer_type = full and combine them with data from stub_installer_snapshot_bug1729069 (roughly as sketched below).
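
A very rough sketch of what that combine could look like; the table names below are placeholders rather than real tables, and it assumes the already-correct full-installer rows and the re-decoded stub rows end up with matching schemas, so treat it as the general shape rather than the actual query:

-- Sketch only: placeholder table names and a hypothetical installer_type field path.
SELECT
  *
FROM
  `moz-fx-data-shared-prod.placeholder_dataset.firefox_installer_decoded`  -- full installer pings whose geo was already correct
WHERE
  DATE(submission_timestamp) BETWEEN '2021-08-17' AND '2021-08-29'
  AND installer_type = 'full'
UNION ALL
SELECT
  *
FROM
  `moz-fx-data-shared-prod.placeholder_dataset.stub_installer_redecoded`  -- output of re-decoding stub_installer_snapshot_bug1729069
WHERE
  DATE(submission_timestamp) BETWEEN '2021-08-17' AND '2021-08-29'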

I will separately re-verify ctxsvc iprepd logic later today.

Flags: needinfo?(mreid)

I've also extended the retention on pioneer's payload_bytes_raw since it was affected as well. The backfill for that will need to happen within the service perimeter and via the beam project that has access to KMS, so I will likely need to be involved in setting that up.

I'm able to easily replicate the issue via the following:

# old endpoint
curl -H "X-Debug-ID: whd" -vvv -k -X POST "https://stage.ingestion.nonprod.dataops.mozgcp.net/submit/telemetry/$(uuidgen)/sync/Firefox/77.0a1/default/20200415104457?v=4" -d '{"payload": {}}'
# new endpoint, has an extra entry in XFF
curl -H "X-Debug-ID: whd" -vvv -k -X POST "https://stage.ingestion-edge.nonprod.dataops.mozgcp.net/submit/telemetry/$(uuidgen)/sync/Firefox/77.0a1/default/20200415104457?v=4" -d '{"payload": {}}'
select geo_city from `moz-fx-data-shar-nonprod-efed.payload_bytes_error.telemetry` where date(submission_timestamp) = '2021-09-03' and x_debug_id = 'whd'
-- results:
--   1  <Actual city I live in>
--   2  Kansas City

So the decoder change described in comment #8 should have the desired effect.

I also determined that even though there is inconsistent handling of XFF between the ctxsvc iprepd and standard paths, iprepd is being queried correctly via openresty, and the fraud pipeline is using the correct IP value for the new edge stack (since it uses GCLB logs instead of nginx or app logs). I will amend the nginx configuration next week to make nginx logging consistent between ctxsvc and non-ctxsvc and to make sure the XFF index continues to be set correctly for iprepd lookups.

(In reply to Wesley Dawson [:whd] from comment #8)

> Per https://bugzilla.mozilla.org/show_bug.cgi?id=1666498#c24 the affected window for telemetry and structured data includes a limited window (< 1/251) from August 23-29. From discussion with :klukas we think it's best to avoid doing a full backfill for those days as the cost and time would be substantial, and instead make a note of it on https://github.com/mozilla/data-docs/blob/main/src/concepts/analysis_gotchas.md#notable-historic-events. NI :mreid for approval.

I think it is acceptable to skip a backfill, but want to clarify my understanding, which is that 1 out of 251 incoming requests were being sent through the affected stack from Aug 23-29. Of those, some (all?) records were mis-resolved to a static location in Kansas City.

If that's right, I approve skipping the backfill and making a note on the "gotchas" page. One other action I recommend is verifying that this would not affect the results of any in-flight experiments.

Flags: needinfo?(mreid) → needinfo?(whd)

> which is that 1 out of 251 incoming requests were being sent through the affected stack from Aug 23-29. Of those, some (all?) records were mis-resolved to a static location in Kansas City.

That's correct, "some" being everything except submissions that also went to the fraud pipeline (ctxsvc).

Flags: needinfo?(whd)
Whiteboard: [data-quality]

I've been investigating how to do this backfill for pioneer, since it's a more complicated situation due to VPC-SC and KMS. I'm going to create a separate beam-like backfill project for pioneer that exists within a service perimeter and grant it temporary access to BQ+KMS for the backfill. Since GCR is protected within the perimeter I will copy the flex template image from the standard location into the project, and potentially other GCS assets if VPC-SC restrictions cause issues.

EDIT: https://github.com/mozilla-services/cloudops-infra/pull/3343

Telemetry backfill completed yesterday, and after some parameter tuning the structured backfill is expected to complete later tonight. :klukas did some validation work on telemetry and verified that, apart from some pings from CN being unexpectedly unaffected, the geoip information in the backfilled tables looks correct.

Assuming structured completes tonight and looks good, we're planning the following for tomorrow:

  1. Run the last query to combine firefox-installer, since this namespace is handled by both the stub_installer and structured pipeline families. EDIT: done
  2. Run the _live partition replacements for telemetry, structured, and firefox-installer (a special case using the _combined backfill table) into shared-prod
  3. Re-run copy-dedup etc. from Airflow for the 5 mainly affected days
  4. Manually run stub installer copy-deduplicate for the 08-17 to 08-29 window (unknown if there are downstream jobs that will need to be rerun)
  5. Merge https://github.com/mozilla/gcp-ingestion/pull/1815 (note: CI failing) and https://github.com/mozilla-services/cloudops-infra/pull/3340/files, and re-enable schemas and beam deploys, restoring production to a fully operational state

At a later point when I'm back from PTO:

  1. Finalize the tee decommission. We will want to investigate CN specifically, since there appears to be a significant DNS propagation delay or similar for that endpoint.

I also completed the pioneer backfill today, notes in https://github.com/mozilla-services/cloudops-infra/pull/3343.

As mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1729069#c10 I've verified that the contextual_services_live.*_click tables were not affected by this geo issue. We route these clicks through a different code path at the telemetry edge in order to get a fraud score from iprepd which happened to also shield them from the geo regression. The impressions data is, however, affected.

I've completed sanity checking for a selection of telemetry and structured doctypes.

To check overall document counts, I used queries like:


WITH
  prod AS (
    SELECT
      DATE(submission_timestamp) AS submission_date,
      COUNT(DISTINCT document_id) AS n_prod
    FROM
      `moz-fx-data-shared-prod.org_mozilla_firefox_live.baseline_v1`
    WHERE
      DATE(submission_timestamp) BETWEEN '2021-08-30' AND '2021-09-03'
    GROUP BY
      1
  ),
  bkfill AS (
    SELECT
      DATE(submission_timestamp) AS submission_date,
      COUNT(DISTINCT document_id) AS n_bkfill
    FROM
      `moz-fx-data-backfill-6.org_mozilla_firefox_live.baseline_v1`
    WHERE
      DATE(submission_timestamp) BETWEEN '2021-08-30' AND '2021-09-03'
    GROUP BY
      1
  )
SELECT
  *,
  n_prod / n_bkfill AS skew
FROM
  prod
JOIN
  bkfill
USING
  (submission_date)
ORDER BY
  1 DESC

The skew values are around 0.999992, which likely reflects just a difference in how many duplicates were introduced in the loading to BQ phase.
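
One way to double-check that explanation would be to compare raw row counts against distinct document_id counts on each side, which shows how much duplication each load introduced per day. A sketch (not a query run as part of this validation), shown here for the backfill table and equally applicable to the prod table:

SELECT
  DATE(submission_timestamp) AS submission_date,
  COUNT(*) AS n_rows,
  COUNT(DISTINCT document_id) AS n_docs,
  COUNT(*) / COUNT(DISTINCT document_id) AS duplication_factor
FROM
  `moz-fx-data-backfill-6.org_mozilla_firefox_live.baseline_v1`
WHERE
  DATE(submission_timestamp) BETWEEN '2021-08-30' AND '2021-09-03'
GROUP BY
  1
ORDER BY
  1 DESC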

I checked country distributions via queries like:

DECLARE dt DATE DEFAULT '2021-09-03';

WITH
  prod AS (
    SELECT
      metadata.geo.country,
      COUNT(DISTINCT document_id) AS n_prod
    FROM
      `moz-fx-data-shared-prod.org_mozilla_firefox_live.baseline_v1`
    WHERE
      DATE(submission_timestamp) = dt
    GROUP BY
      metadata.geo.country
  ),
  bkfill AS (
    SELECT
      metadata.geo.country,
      COUNT(DISTINCT document_id) AS n_bkfill
    FROM
      `moz-fx-data-backfill-6.org_mozilla_firefox_live.baseline_v1`
    WHERE
      DATE(submission_timestamp) = dt
    GROUP BY
      metadata.geo.country
  )
SELECT
  *,
  n_prod / n_bkfill AS skew
FROM
  prod
JOIN
  bkfill
USING
  (country)
ORDER BY
  4 DESC

These generally show a skew above 1 for US, very close to 1 for CN, and below 1 for other countries. This aligns with our expectations for how backfill and existing prod data should differ.

So, we should be good to move this data into place.

Data was moved into place today and most ETL is complete, with some remaining jobs to be run tomorrow morning. I expect to merge gcp-ingestion/cloudops-infra PRs tomorrow and restore production to standard state, including an out-of-band schemas deploy. Additional cleanup of backfill resources (e.g. removal of pbr.stub_installer_snapshot_bug1729069) will likely take place next week.

ETL is now fully complete save for a few bqetl_public_data_json tasks retrying right now.

The relevant code and infra changes were deployed today, and an out-of-band schemas deploy happened as well. Production should be in its standard state, and some further cleanup of backfill resources will happen early next week before we close this out.

I've cleaned up all BQ and GCS resources in the backfill project. We still need to delete the stub installer raw snapshot table.

The documentation for this backfill is now merged: https://github.com/mozilla/bigquery-backfill/pull/15

?ni :whd for a final review next week of whether the cleanup steps all look complete

Note that there was a very similar geo incident (internal) that happened in a separate piece of infrastructure yesterday. It's not clear to me whether there's any correlation between these two incidents happening close together in time.

Flags: needinfo?(whd)
See Also: → 1731609

> ?ni :whd for a final review next week of whether the cleanup steps all look complete

It looks like the raw snapshot table was deleted, so I'm going to close this out.

> Note that there was a very similar geo incident (internal) that happened in a separate piece of infrastructure yesterday. It's not clear to me whether there's any correlation between these two incidents happening close together in time.

I don't think there is, but I'm going to schedule something with :jbuck, since cloudops-infra history doesn't seem to include the config changes associated with their deploy on the 16th, and there seems to be a gap in how we're managing IP addresses and GeoIP across SRE.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(whd)
Resolution: --- → FIXED
Component: Pipeline Ingestion → General
Whiteboard: [data-quality] → [dataquality]