Estimate cost of backfilling BigQuery from AWS Data Lake
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: mreid, Assigned: whd)
References
Details
This should include data transfer costs as well as processing costs.
Reporter
Comment 1•5 years ago
:whd put together this document with cost estimates and a plan for the data transfer.
Assignee
Comment 2•5 years ago
It's been noted that we should consider using BigQuery slots for the load side of this work; a sketch of the load path follows below.
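For reference, a minimal sketch (not the actual plan) of what batch loading into BigQuery could look like with the Python client; the project, bucket path, dataset, and table names here are hypothetical. Batch load jobs draw on slot capacity (the free shared pool by default, or reserved slots under flat-rate pricing) rather than streaming-insert pricing, which is where slots would come into play:

    from google.cloud import bigquery

    # Hypothetical project and table names, for illustration only.
    client = bigquery.Client(project="example-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Batch load from GCS; load jobs consume slots (shared pool or
    # reserved capacity) rather than the streaming-insert pricing path.
    load_job = client.load_table_from_uri(
        "gs://example-bucket/backfill/20190101/*.ndjson",
        "example_dataset.example_table",
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes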
Comment 3•5 years ago
:whd and I discussed today the cost of deduplicating records once we load them into BigQuery. It may be possible to do this efficiently within the Dataflow job itself; we're investigating in https://github.com/mozilla/gcp-ingestion/issues/821
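As a rough illustration of the in-job approach (this is not what gcp-ingestion implements; the document_id field name is an assumption, and the linked issue has the real design discussion), deduplication inside a Beam pipeline could look like:

    import apache_beam as beam

    def remove_duplicates(messages):
        """Keep one message per document_id (hypothetical field name)."""
        return (
            messages
            | "KeyByDocumentId" >> beam.Map(lambda msg: (msg["document_id"], msg))
            | "GroupByDocumentId" >> beam.GroupByKey()
            | "TakeFirstPerKey" >> beam.Map(lambda kv: next(iter(kv[1])))
        )

Note that the GroupByKey forces a shuffle, so deduplicating in the job interacts directly with the Shuffle service costs discussed below.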
Comment 4•5 years ago
We have our first successful tests of running a Dataflow job to load Heka data. If I'm reading billing correctly, vCPU was the dominant cost, at around $200 per job (processing a full day of all doctypes). The charges associated with the Dataflow Shuffle service were less than $1 per job, and it led to the job completing more quickly and probably using less vCPU, so I'm inclined to keep using the Shuffle service in further tests (which requires that we run in us-central1). As a very early estimate, then, running the Dataflow jobs looks like it will cost ~$300 per day of data processed.
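For reference, a minimal sketch of how the Shuffle service could be enabled via pipeline options, assuming the standard experiment flag; the project and bucket here are hypothetical:

    from apache_beam.options.pipeline_options import PipelineOptions

    # shuffle_mode=service moves shuffle off worker disks onto the
    # Dataflow Shuffle service, which (per the test above) requires
    # running in us-central1.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="example-project",            # hypothetical
        region="us-central1",
        temp_location="gs://example-bucket/tmp",  # hypothetical
        experiments=["shuffle_mode=service"],
    )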
Reporter
Comment 5•5 years ago
Processing cost estimates have been added to the document linked in comment 1. We're good to go.