Closed Bug 1914369 Opened 8 months ago Closed 7 months ago

Airflow task graphics_telemetry.graphics_dashboard failed for exec_date 2024-08-21

Categories

(Data Platform and Tools :: General, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: akomar, Assigned: kik, NeedInfo)

Details

(Whiteboard: [airflow-triage])

Airflow task graphics_telemetry.graphics_dashboard failed for exec_date 2024-08-21

Task link:
https://workflow.telemetry.mozilla.org/dags/graphics_telemetry.graphics_dashboard/grid?tab=logs&dag_run_id=scheduled__2024-08-21T03%3A00%3A00%2B00%3A00&task_id=run_dataproc_pyspark

Failed Spark job logs are in https://console.cloud.google.com/dataproc/jobs/graphics-dashboard_b60d87bc/monitoring?project=airflow-dataproc&region=us-west1
Log extract:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/conda/default/lib/python3.7/site-packages/moztelemetry/histogram.py", line 131, in __init__
    name, histograms_definition[proper_name])
KeyError: 'GRAPHICS_SANITY_TEST_REASON'

This seems to be an error in ancient, archived Python library: https://github.com/mozilla/python_moztelemetry. IIRC last time we were updating it it required some special care.

I don't see any non-DE folks listed in https://github.com/mozilla/telemetry-airflow/blob/main/dags/graphics_telemetry.py
:kik - is there any chance this DAG might be ready for deprecation? If not, have we looked how hard would it be to migrate it out of Spark to BQ?
Also should it be Tag.ImpactTier.tier_1?

Flags: needinfo?(kignasiak)

The issue is likely somewhere in the job code in https://github.com/mozilla/python_mozetl/blob/a49c3a045221997bf444c1b37fc251998fc4b036/mozetl/graphics/graphics_telemetry_dashboard.py or https://github.com/FirefoxGraphics/telemetry/blob/master/analyses/bigquery_shim/bigquery_shim/dashboard.py so I don't think changes to moztelemetry are needed.

I was looking for the original bug for moving this job off of databricks and found bug 1755307 which has some discussion around moving this to bigquery. I remember looking into that four years ago and it wasn't worth the effort back then

So this is the line that throws the error:
https://github.com/mozilla/python_moztelemetry/blob/f579e95fd985a5a070421e6c8daf793f518db51a/moztelemetry/histogram.py#L131

I'm not sure if we can specifically say the logic in python_moztelemetry is right or wrong as nothing has changed. So what seems have happened is that now we the data received includes a new histogram key without updating the histogram definition at a specific URL. If I'm undestanding this correctly, it's fetched from here:
https://github.com/mozilla/python_moztelemetry/blob/f579e95fd985a5a070421e6c8daf793f518db51a/moztelemetry/histogram.py#L26

So I'd argue process failed here as whoever made this addition forgot to update the json file or made a change there that broke this job without knowing about it.

Flags: needinfo?(kignasiak)

If I am correct in saying, the source of the problem we started needs to be addressed here:
https://github.com/mozilla/gecko-dev/blob/master/toolkit/components/telemetry/Histograms.json

Looks like this specific commit resulted in the failures:
https://github.com/mozilla/gecko-dev/commit/7ed00c97b6d112a7699172f862e75dcf13cee67a

Now the important question, is where is the key specified and coming from to begin with. It appears that the change in the commit linked above was intentional and there's nothing wrong with it in itself. My assumption here would be that this should have not broken anything, so could there be some configuration still laying around that needs to be deleted?

Flags: needinfo?(florian)

Indeed the key (GRAPHICS_SANITY_TEST_REASON) does not exist inside:
https://hg.mozilla.org/mozilla-central/raw-file/tip/toolkit/components/telemetry/Histograms.json

which is where is appears we try to find the key.

Could this be the source of the error:
https://github.com/mozilla/python_mozetl/blob/main/mozetl/graphics/graphics_telemetry_dashboard.py#L100

:akomar :benwu do you think removing it from that PropertyList would do the trick? The thing I'm a bit unsure is that it says "# List of keys for properties on session pings that we care about.". I have zero insight into this, but could also no longer be true since this code base has not really been developed for a long time now...

Flags: needinfo?(florian)
Flags: needinfo?(bewu)
Flags: needinfo?(akomarzewski)

From what I can tell, sanity test reason isn't used anywhere and the GRAPHICS_SANITY_TEST_REASON value isn't actually exported to json so I'm guessing it's safe to remove and worth trying since the dashboard is already broken.

This code for the graph is commented out https://github.com/FirefoxGraphics/telemetry/blob/bb0cc53a61a30acb22b3973f32f1721c572d02bc/www/chart-graphs.js#L531

And these values in the job are unused https://github.com/mozilla/python_mozetl/blob/main/mozetl/graphics/graphics_telemetry_dashboard.py#L637-L640

:jrmuizel, is https://firefoxgraphics.github.io/telemetry/ still used? And do you know if GRAPHICS_SANITY_TEST_REASON is used anywhere in it?

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bewu)
Flags: needinfo?(akomarzewski)

Looks like we have another histogram missing:

name, histograms_definition[proper_name])
KeyError: 'PLUGIN_DRAWING_MODEL'

This is the error message for the first run after merging the PR to remove GRAPHICS_SANITY_TEST_REASON from the list of histograms to process.

So this is the current list of histograms that we try to process:

# List of keys for properties on session pings that we care about.
GfxKey = "environment/system/gfx"
MonitorsKey = "environment/system/gfx/monitors"
ArchKey = "environment/build/architecture"
FeaturesKey = "environment/system/gfx/features"
UserPrefsKey = "environment/settings/userPrefs"
DeviceResetReasonKey = "payload/histograms/DEVICE_RESET_REASON"
SANITY_TEST = "payload/histograms/GRAPHICS_SANITY_TEST"
STARTUP_TEST_KEY = "payload/histograms/GRAPHICS_DRIVER_STARTUP_TEST"
WebGLSuccessKey = "payload/histograms/CANVAS_WEBGL_SUCCESS"
WebGL2SuccessKey = "payload/histograms/CANVAS_WEBGL2_SUCCESS"
PluginModelKey = "payload/histograms/PLUGIN_DRAWING_MODEL"
MediaDecoderKey = "payload/histograms/MEDIA_DECODER_BACKEND_USED"
LayersD3D11FailureKey = "payload/keyedHistograms/D3D11_COMPOSITING_FAILURE_ID"
LayersOGLFailureKey = "payload/keyedHistograms/OPENGL_COMPOSITING_FAILURE_ID"
WebGLAcclFailureKey = "payload/keyedHistograms/CANVAS_WEBGL_ACCL_FAILURE_ID"
WebGLFailureKey = "payload/keyedHistograms/CANVAS_WEBGL_FAILURE_ID"

This list has been extracted from here: https://github.com/mozilla/python_mozetl/blob/main/mozetl/graphics/graphics_telemetry_dashboard.py#L72-L88


Comparing the above list (under the payload path only) with the histograms definition in gecko-dev we see the following still missing:

  • PLUGIN_DRAWING_MODEL
  • D3D11_COMPOSITING_FAILURE_ID -
  • OPENGL_COMPOSITING_FAILURE_ID

All three of the above appear to have been removed via: removed via: https://github.com/mozilla/gecko-dev/commit/7f25ee929ca325e057bdfc256bb978954b5fed0d#diff-6c55b27a94376254370c09aa7b397a80d4b2f7c3cb28b690f33557e92f9f98b8

All of them appear to have expires_in_version set to never at the time they got removed. For this reason, it is unclear to me if this was needed or not. I tried searching for all three via https://probes.telemetry.mozilla.org/?search= but I got no results.

(In reply to Ben Wu [:benwu] from comment #8)

From what I can tell, sanity test reason isn't used anywhere and the GRAPHICS_SANITY_TEST_REASON value isn't actually exported to json so I'm guessing it's safe to remove and worth trying since the dashboard is already broken.

This code for the graph is commented out https://github.com/FirefoxGraphics/telemetry/blob/bb0cc53a61a30acb22b3973f32f1721c572d02bc/www/chart-graphs.js#L531

And these values in the job are unused https://github.com/mozilla/python_mozetl/blob/main/mozetl/graphics/graphics_telemetry_dashboard.py#L637-L640

:jrmuizel, is https://firefoxgraphics.github.io/telemetry/ still used? And do you know if GRAPHICS_SANITY_TEST_REASON is used anywhere in it?

I still use it. I think we should probably just unexpire GRAPHICS_SANITY_TEST_REASON.

We don't need PLUGIN_DRAWING_MODEL, D3D11_COMPOSITING_FAILURE_ID and OPENGL_COMPOSITING_FAILURE_ID

Flags: needinfo?(jmuizelaar)

Florian, instead of removing the telemetry for GRAPHICS_SANITY_TEST_REASON in https://github.com/mozilla/gecko-dev/commit/7ed00c97b6d112a7699172f862e75dcf13cee67a can you set its expiry to never?

Flags: needinfo?(florian)

Opened a PR to revert the change on our side to remove GRAPHICS_SANITY_TEST_REASON from python_mozetl:
https://github.com/mozilla/python_mozetl/pull/404

(In reply to Jeff Muizelaar [:jrmuizel] from comment #11)

Florian, instead of removing the telemetry for GRAPHICS_SANITY_TEST_REASON in https://github.com/mozilla/gecko-dev/commit/7ed00c97b6d112a7699172f862e75dcf13cee67a can you set its expiry to never?

Given it expired in Firefox 70, at this point if you want this probe again I think you should open a new bug, attach a reverting patch and request a new data review.

Flags: needinfo?(florian)
Flags: needinfo?(jmuizelaar)

In the meantime whilst we wait for the GRAPHICS_SANITY_TEST_REASON to be added back to the histograms specs, here's a PR to remove PLUGIN_DRAWING_MODEL, D3D11_COMPOSITING_FAILURE_ID and OPENGL_COMPOSITING_FAILURE_ID from the job code:
https://github.com/mozilla/python_mozetl/pull/405

:benwu could you please take a look?

Flags: needinfo?(bewu)
Flags: needinfo?(bewu)

The core issue reported in this bug has been reported and the job now completes successfully.

[:jrmuizel] If you need GRAPHICS_SANITY_TEST_REASON please follow Florian's advice:

Given it expired in Firefox 70, at this point if you want this probe again I think you should open a new bug, attach a reverting patch and request a new data review.

Once that is done and it has been added back to the configuration please let us know and we can add it back to the job.

Thanks!

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.