Bug 1579056 (Closed) · Opened 5 years ago · Closed 5 years ago

Estimate cost of backfilling BigQuery from AWS Data Lake

Categories

Product/Component: Data Platform and Tools :: General
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: mreid, Assigned: whd)

Details

This should include data transfer costs as well as processing.

:whd put together this document with cost estimates and a plan for the data transfer.

It's been noted that we should consider using BigQuery slots for the load side of this work.

:whd and I discussed today the cost of deduplicating once we load into BigQuery. It may be possible to do this efficiently in the Dataflow job; we're investigating in https://github.com/mozilla/gcp-ingestion/issues/821
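For illustration, here's a minimal sketch of what in-pipeline deduplication could look like, written against the Beam Python SDK. The dict-shaped messages and the document_id field name are assumptions for this sketch; gcp-ingestion is a Java/Beam codebase, so this mirrors the idea rather than the actual implementation tracked in the issue above.

import apache_beam as beam

def dedupe_by_document_id(messages):
    # Keep exactly one message per document_id.
    # Assumes each element is a dict with a 'document_id' key
    # (hypothetical shape for this sketch).
    return (
        messages
        | "KeyByDocId" >> beam.Map(lambda msg: (msg["document_id"], msg))
        | "GroupByDocId" >> beam.GroupByKey()
        | "TakeFirst" >> beam.Map(lambda kv: next(iter(kv[1])))
    )

Deduplicating in the batch job this way pays a shuffle cost up front but avoids a separate dedup pass over the tables after load; whether that nets out cheaper is the open question.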

We have our first successful tests of running a Dataflow job to load Heka data. If I'm reading billing correctly, vCPU was the dominant cost, at around $200 per job (processing a full day of all doctypes). The charges for the Dataflow Shuffle service were less than $1 per job, and it made the job complete more quickly, probably using less vCPU, so I'm inclined to keep using the Shuffle service in further tests (which requires running in us-central1). As a very early estimate, then, the Dataflow jobs will cost ~$300 per day of data processed.
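For planning purposes the figure extrapolates linearly. A back-of-envelope sketch, with the backfill window left as a placeholder since the actual date range isn't stated in this bug:

COST_PER_DAY_PROCESSED = 300  # USD; vCPU-dominated, Shuffle service < $1/job
days_to_backfill = 365        # hypothetical window, not from this bug

total = COST_PER_DAY_PROCESSED * days_to_backfill
print(f"Estimated Dataflow backfill cost: ${total:,}")  # -> $109,500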

Processing cost estimates have been added to the document linked in comment 1. We're good to go.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Assignee: nobody → whd
Component: Datasets: General → General