Closed Bug 1715271 Opened 5 years ago Closed 10 months ago

estimate carbon emissions for project

Categories

(Data Platform and Tools :: General, task, P4)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jessed, Unassigned)

Details

Hi, in the published report for the Mozilla Regrets Reporter research project, we hope to provide an estimate of carbon emitted in the storage and processing of data for the project.

I have found this methodology: https://www.cloudcarbonfootprint.org/docs/gcp

In order to apply it I would need access to detailed billing data from GCP. Alternatively, it may be possible for someone with adequate privileges to create an appropriate extract table for me.

I believe all our queries run in moz-fx-data-bq-regrets-report, but the data is ingested and stored in shared-prod, in datasets matching regrets_reporter*.

Please let me know what you think the best approach is; I'm happy to collaborate on figuring things out.

From the method I link to:

Ensure you have a GCP Service Account with permission to start BigQuery jobs and read BigQuery job results. Learn more about GCP Service Accounts here.

Create and download a JSON private key file for this Service Account to your local filesystem, and make sure to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. Learn more about this authentication method here.

Note: make sure you use the full path for this environment variable, e.g. /Users/<user>/path/to/credential
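
A minimal sketch of a pre-flight check for the note above, assuming Python is available locally. The helper name `check_credentials_env` is hypothetical and is not part of the Cloud Carbon Footprint tool; it only checks that the variable is set to an absolute path that points at an existing file, and does not validate the key itself.

```python
import os


def check_credentials_env(env=None):
    """Return a list of problems with GOOGLE_APPLICATION_CREDENTIALS (empty if none)."""
    # Accept a dict for testing; fall back to the real environment.
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute: " + path)
    if not os.path.isfile(path):
        problems.append("no file at: " + path)
    return problems
```

Calling `check_credentials_env()` with no arguments inspects the current environment; an empty list means the variable looks usable.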

Set up Google Cloud billing data to export to BigQuery. The instructions are here: https://cloud.google.com/billing/docs/how-to/export-data-bigquery

Just want to add that this is for a report that needs to be ready to go by the 16th of this month. If it's not possible to get this resolved by then, we just won't include the calculation, but if it is at all possible, we'd appreciate the support.

From the data infra cost dashboard, I see only $0.50 worth of cost associated with the regrets-reporter project itself, so that's the cost of queries there. I don't know if that's comprehensive. The total data size in the regrets_reporter_update_v1 stable table is 3 GB; the cost of ingestion, etc. would scale with that data volume.
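
To give a sense of scale, here is a rough sketch of a usage-based storage estimate (usage to energy to emissions). It is a simplification for illustration, not the linked tool's implementation. The function name and all coefficient values are assumptions: only the 3 GB table size comes from this bug, and the placeholders would need to be replaced with figures from whichever methodology is chosen.

```python
def storage_emissions_kg(size_tb, hours, wh_per_tb_hour, pue, grid_kg_per_kwh):
    """Estimate kg CO2e for keeping `size_tb` terabytes stored for `hours` hours."""
    energy_kwh = size_tb * hours * wh_per_tb_hour / 1000.0  # Wh -> kWh
    # Scale by data-center overhead (PUE), then convert energy to emissions.
    return energy_kwh * pue * grid_kg_per_kwh


# 3 GB is the table size quoted above; every other value is a placeholder.
estimate = storage_emissions_kg(
    size_tb=3 / 1000,      # 3 GB expressed in TB (decimal units)
    hours=24 * 365,        # placeholder: one year of storage
    wh_per_tb_hour=1.0,    # placeholder storage energy coefficient
    pue=1.0,               # placeholder power usage effectiveness
    grid_kg_per_kwh=0.5,   # placeholder grid emissions factor
)
print(f"{estimate:.4f} kg CO2e")
```

This covers storage only; query compute and the shared ingestion pipeline would need their own terms.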

It's super tiny. In the context of the telemetry pipeline, which exists anyway, it's a rounding error. So I think there's an argument that this research project is essentially free, since it's piggybacking on existing infrastructure and is unlikely to have increased the amount of shared compute used for ingestion, etc. But that's probably not very helpful for making a statement about the carbon impact.

?ni :jason - Does anything jump to your mind that would shed more light on this question? I assume we already have appropriate billing data being extracted, but not sure how much effort it would be to turn that into something meaningful for regrets-reporter specifically.

Flags: needinfo?(jthomas)

Thanks Jeff. I agree it's going to be quite minimal. I think that turning the billing data into something useful may be fairly easy though. I have found a tool that I believe can make the computation automatically, but it needs access to the billing extract tables (and I think one would have to be created specifically for the regrets reporter project). It also needs a service account to run.

Documentation is here: https://www.cloudcarbonfootprint.org/docs/gcp/

Is this something that someone would be able to assist with?

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #2)

?ni :jason - Does anything jump to your mind that would shed more light on this question? I assume we already have appropriate billing data being extracted, but not sure how much effort it would be to turn that into something meaningful for regrets-reporter specifically.

We do have billing data exports set up for GCP, but getting estimates for regrets-reporter specifically around the ingestion pipeline would require more investigation, since it is just a very small portion of the ingested data. Jeff already pointed out the dashboard associated with BigQuery query costs and the actual table size.

(In reply to Jesse McCrosky from comment #3)

Thanks Jeff. I agree it's going to be quite minimal. I think that turning the billing data into something useful may be fairly easy though. I have found a tool that I believe can make the computation automatically, but it needs access to the billing extract tables (and I think one would have to be created specifically for the regrets reporter project). It also needs a service account to run.

Documentation is here: https://www.cloudcarbonfootprint.org/docs/gcp/

Is this something that someone would be able to assist with?

Is this something we need to run once, or does it need to run on a continuous basis? Either way, it would require investigation and setup, and I don't think we can turn this around by the 16th due to our other commitments. Specifically, I would like to see how the emissions are calculated; a quick look at the code shows that it might be using usage amount and cost information [1]. I have some concerns that if it is actually using costs, it could potentially expose any discount we have with GCP.

I will bring this up later in our Data SRE staff meeting to see if anyone has additional cycles to work on this.

[1] https://github.com/cloud-carbon-footprint/cloud-carbon-footprint/blob/trunk/packages/gcp/src/lib/BillingExportTable.ts#L347
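
One way to address the discount concern would be an extract that carries usage amounts but no price fields. The sketch below only builds the query text, assuming the standard billing export schema (`service.description`, `sku.description`, `usage.amount`, `usage.unit`, `project.id`). The function name, the table name, and the `@project_id` parameter are hypothetical, and whether the tool can work from usage alone has not been verified here.

```python
def usage_only_query(table, project_id_param="project_id"):
    """Build a query that aggregates usage amounts and selects no price fields."""
    return f"""
SELECT
  service.description AS service_name,
  sku.description AS sku_name,
  usage.unit AS usage_unit,
  SUM(usage.amount) AS usage_amount
FROM `{table}`
WHERE project.id = @{project_id_param}
GROUP BY service_name, sku_name, usage_unit
ORDER BY service_name, sku_name
""".strip()


# Hypothetical table name; the real export table is not named in this bug.
EXAMPLE_TABLE = "example-project.example_dataset.gcp_billing_export_v1_EXAMPLE"
print(usage_only_query(EXAMPLE_TABLE))
```

Running the query would still need a BigQuery client and a value for the `@project_id` parameter, neither of which is shown.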

Flags: needinfo?(jthomas)

Thanks Jason,

I checked in with Brandi and the carbon calculations don't need to go through the full review process that the rest of the report does, so this could be added as late as June 25th.

We just need a one-time calculation of the carbon emissions attributable to the data processing for the regrets reporter project.

Let me know if I can do anything else to help.

Priority: -- → P4

As of October 2021, Google Cloud apparently has a built-in "Carbon Footprint" reporting feature (currently in preview), which can also export to BigQuery: https://cloud.google.com/carbon-footprint

Hello,

The Mozilla Data Engineering organization is currently going through our extensive backlog, consisting of hundreds of issues stretching back nearly 10 years. We've done a pass through all of the open Bugzilla bugs and have identified and tagged the ones that we think are relevant enough to still need attention. The rest, including the bug with which this comment is associated, we are closing as "WONTFIX" in a single bulk operation.

If you feel we have closed this (or any) issue in error, please feel free to take the following actions:

  • Reopen the bug.
  • Edit the bug to add the string [dataplatform] (including the brackets) to the Whiteboard field. (Note that you must edit the Whiteboard, not the similarly named QA Whiteboard.)

Doing this will ensure that we see the bug in our weekly triage process, where we will decide how to proceed.

Thank you.

Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → WONTFIX