Closed Bug 1776767 Opened 3 years ago Closed 2 years ago

Permafailing GeckoMediaPlugins.CDMStorageBasic | Value of: false | gtest | timed out after 300 seconds without output

Categories

(Core :: Audio/Video: GMP, defect)

defect

Tracking

()

RESOLVED FIXED
120 Branch
Tracking Status
firefox-esr115 --- fixed
firefox118 --- wontfix
firefox119 --- wontfix
firefox120 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: aryx)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])

Attachments

(3 files)

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=382664966&repo=mozilla-beta
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Cw88jNALRXabk_Lww1nVDg/runs/0/artifacts/public/logs/live_backing.log


[task 2022-06-27T20:05:41.044Z] 20:05:41     INFO -  TEST-START | GeckoMediaPlugins.CDMStorageBasic
[task 2022-06-27T20:06:11.315Z] 20:06:11  WARNING -  TEST-UNEXPECTED-FAIL | GeckoMediaPlugins.CDMStorageBasic | Value of: false
[task 2022-06-27T20:06:11.320Z] 20:06:11     INFO -    Actual: false
[task 2022-06-27T20:06:11.321Z] 20:06:11     INFO -  Expected: true @ /builds/worker/checkouts/gecko/dom/media/gtest/TestCDMStorage.cpp:418
[task 2022-06-27T20:11:12.389Z] 20:11:12     INFO -  gtest INFO | gtest | process wait complete, returncode=572
[task 2022-06-27T20:11:12.392Z] 20:11:12  WARNING -  gtest TEST-UNEXPECTED-FAIL | gtest | timed out after 300 seconds without output
[task 2022-06-27T20:11:12.392Z] 20:11:12     INFO -  gtest INFO | rungtests.py exits with code 1
[task 2022-06-27T20:11:12.544Z] 20:11:12    ERROR - Return code: 1
[task 2022-06-27T20:11:12.548Z] 20:11:12    ERROR - No tests run or test summary not found
[task 2022-06-27T20:11:12.548Z] 20:11:12     INFO - TinderboxPrint: gtest-gtest<br/><em class="testfail">T-FAIL</em>
[task 2022-06-27T20:11:12.548Z] 20:11:12  WARNING - # TBPL FAILURE #
[task 2022-06-27T20:11:12.548Z] 20:11:12  WARNING - setting return code to 2
[task 2022-06-27T20:11:12.548Z] 20:11:12  WARNING - The gtest suite: gtest ran with return status: FAILURE
[task 2022-06-27T20:11:12.550Z] 20:11:12     INFO - Running post-action listener: _package_coverage_data
[task 2022-06-27T20:11:12.550Z] 20:11:12     INFO - Running post-action listener: _resource_record_post_action
[task 2022-06-27T20:11:12.551Z] 20:11:12     INFO - Running post-action listener: process_java_coverage_data
[task 2022-06-27T20:11:12.551Z] 20:11:12     INFO - [mozharness: 2022-06-27 20:11:12.551917Z] Finished run-tests step (success)
[task 2022-06-27T20:11:12.551Z] 20:11:12     INFO - Running post-run listener: _resource_record_post_run
[task 2022-06-27T20:11:14.081Z] 20:11:14     INFO - instance_metadata.json not found; unable to determine instance type
[task 2022-06-27T20:11:14.086Z] 20:11:14     INFO - Validating Perfherder data against Z:\task_165635762731293\mozharness\external_tools\performance-artifact-schema.json
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - PERFHERDER_DATA: {"framework": {"name": "job_resource_usage"}, "suites": [{"name": "gtest.gtest.overall", "extraOptions": ["buildbot-unknown"], "subtests": [{"name": "cpu_percent", "value": 7.433168058455118}, {"name": "io_write_bytes", "value": 2922785792}, {"name": "io.read_bytes", "value": 579670016}, {"name": "io_write_time", "value": 202}, {"name": "io_read_time", "value": 134}]}, {"name": "gtest.gtest.start-pulseaudio", "subtests": [{"name": "time", "value": 0.002039670944213867}]}, {"name": "gtest.gtest.install", "subtests": [{"name": "time", "value": 10.099332094192505}, {"name": "cpu_percent", "value": 12.95138888888889}]}, {"name": "gtest.gtest.stage-files", "subtests": [{"name": "time", "value": 0.8117291927337646}]}, {"name": "gtest.gtest.run-tests", "subtests": [{"name": "time", "value": 949.8705515861511}, {"name": "cpu_percent", "value": 7.378471488912357}]}]}
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - Total resource usage - Wall time: 960s; CPU: 7%; Read bytes: 579670016; Write bytes: 2922785792; Read time: 134; Write time: 202
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: CPU usage<br/>7.4%
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: I/O read bytes / time<br/>579,670,016 / 134
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: I/O write bytes / time<br/>2,922,785,792 / 202
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: CPU idle<br/>7,112.7 (92.4%)
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: CPU system<br/>227.6 (3.0%)
[task 2022-06-27T20:11:14.098Z] 20:11:14     INFO - TinderboxPrint: CPU user<br/>340.6 (4.4%)
[task 2022-06-27T20:11:14.099Z] 20:11:14     INFO - start-pulseaudio - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2022-06-27T20:11:14.099Z] 20:11:14     INFO - install - Wall time: 10s; CPU: 13%; Read bytes: 188416; Write bytes: 269877248; Read time: 0; Write time: 21
[task 2022-06-27T20:11:14.099Z] 20:11:14     INFO - stage-files - Wall time: 1s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2022-06-27T20:11:14.106Z] 20:11:14     INFO - run-tests - Wall time: 950s; CPU: 7%; Read bytes: 578692096; Write bytes: 2571033600; Read time: 134; Write time: 176
[task 2022-06-27T20:11:14.307Z] 20:11:14  WARNING - returning nonzero exit status 2
[taskcluster 2022-06-27T20:11:14.399Z]    Exit Code: 2
[taskcluster 2022-06-27T20:11:14.399Z]    User Time: 15.625ms
[taskcluster 2022-06-27T20:11:14.399Z]  Kernel Time: 31.25ms
[taskcluster 2022-06-27T20:11:14.399Z]    Wall Time: 19m47.4203963s
[taskcluster 2022-06-27T20:11:14.399Z]       Result: FAILED
[taskcluster 2022-06-27T20:11:14.399Z] === Task Finished ===
[taskcluster 2022-06-27T20:11:14.399Z] Task Duration: 19m47.4243962s
[taskcluster 2022-06-27T20:11:14.501Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2023-06-27T18:59:21.419Z
[taskcluster 2022-06-27T20:11:14.929Z] Uploading artifact public/test_info/resource-usage.json from file build\blobber_upload_dir\resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2023-06-27T18:59:21.419Z
[taskcluster 2022-06-27T20:11:15.131Z] Uploading artifact public/test_info/system-info.log from file build\blobber_upload_dir\system-info.log with content encoding "gzip", mime type "text/plain" and expiry 2023-06-27T18:59:21.419Z
[taskcluster 2022-06-27T20:11:15.305Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2023-06-27T18:59:21.419Z
[taskcluster:error] exit status 2
Component: GTest → Audio/Video: GMP
Product: Testing → Core
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
See Also: → 1832020

Do you think we can close this ?

Flags: needinfo?(aryx.bugmail)

It's still a frequent failure. Example. Bug 1832020 should be reopened.

Flags: needinfo?(aryx.bugmail) → needinfo?(lissyx+mozillians)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #34)

It's still a frequent failure. Example. Bug 1832020 should be reopened.

It makes no sense, I pushed retries in sets of 200-300 and it was constantly green.

Flags: needinfo?(lissyx+mozillians)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #34)

It's still a frequent failure. Example. Bug 1832020 should be reopened.

This one is on ASAN, maybe it's too slow? There's no indication of IPC timeout there though.

Should the test be excluded on ASan or will the failure be handled by a code change.

I have no idea, I dont know this part.

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #37)

Should the test be excluded on ASan or will the failure be handled by a code change.

Jim, please discuss this in the next triage meeting.

Severity: S4 → --
Flags: needinfo?(jmathies)
Priority: P5 → --

Update

There have been 44 failures within the last 7 days:

  • 1 failure on OS X 10.15 WebRender opt
  • 37 failures on Windows 11 x64 22H2 asan WebRender opt
  • 6 failures on Windows 11 x64 22H2 CCov WebRender opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=421173454&repo=mozilla-central&lineNumber=12758

[task 2023-06-30T05:38:16.161Z] 05:38:16     INFO -  TEST-START | GeckoMediaPlugins.CDMStorageBasic
[task 2023-06-30T05:38:46.156Z] 05:38:46  WARNING -  TEST-UNEXPECTED-FAIL | GeckoMediaPlugins.CDMStorageBasic | Value of: false
[task 2023-06-30T05:38:46.159Z] 05:38:46     INFO -    Actual: false
[task 2023-06-30T05:38:46.159Z] 05:38:46     INFO -  Expected: true @ /builds/worker/checkouts/gecko/dom/media/gtest/TestCDMStorage.cpp:420
[task 2023-06-30T05:43:46.163Z] 05:43:46  WARNING -  gtest TEST-UNEXPECTED-FAIL | gtest | timed out after 300 seconds without output
[task 2023-06-30T05:43:46.174Z] 05:43:46     INFO -  mozcrash INFO | Writing a dump to Z:\task_168810209882326\build\tests\gtest\d8060beb-f2c1-4480-b2b9-606c6afe2e20.dmp for [3284]
[task 2023-06-30T05:43:46.257Z] 05:43:46     INFO -  gtest INFO | gtest | process wait complete, returncode=1
[task 2023-06-30T05:43:46.257Z] 05:43:46     INFO -  mozcrash checking Z:\task_168810209882326\build\tests\gtest for minidumps...
[task 2023-06-30T05:43:46.276Z] 05:43:46     INFO -  mozcrash INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AeeRTk1UTqyycfHrRjUHxQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2023-06-30T05:43:51.571Z] 05:43:51     INFO -  mozcrash INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AeeRTk1UTqyycfHrRjUHxQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2023-06-30T05:44:01.839Z] 05:44:01     INFO -  mozcrash INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AeeRTk1UTqyycfHrRjUHxQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2023-06-30T05:44:25.786Z] 05:44:25     INFO -  mozcrash INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AeeRTk1UTqyycfHrRjUHxQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2023-06-30T05:45:06.928Z] 05:45:06     INFO -  mozcrash INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AeeRTk1UTqyycfHrRjUHxQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2023-06-30T05:45:07.273Z] 05:45:07     INFO -  gtest ERROR | HTTP Error 404: Not Found
[task 2023-06-30T05:45:07.279Z] 05:45:07     INFO -  gtest INFO  | rungtests.py exits with code 1
[task 2023-06-30T05:45:07.350Z] 05:45:07     INFO - Return code: 1
[task 2023-06-30T05:45:07.363Z] 05:45:07    ERROR - No tests run or test summary not found
[task 2023-06-30T05:45:07.363Z] 05:45:07     INFO - TinderboxPrint: gtest-gtest<br/><em class="testfail">T-FAIL</em>
Whiteboard: [stockwell needswork:owner]

I made a change to how TimeStamp comparisons are implemented (making them much simpler and faster) and I'm seeing a noticeable uptick in the rate at which GTest-1proc is failing for the reason captured in this bug. (It was failing quite a bit before my change too but there's a definite increase with it.)

Before:
https://treeherder.mozilla.org/jobs?repo=try&revision=bd1f6ae63071271510c592b9600fe53498d39cf3&selectedTaskRun=HG2pnEPkQ9yLbt0PiDkmKQ.0

After:
https://treeherder.mozilla.org/jobs?repo=try&revision=127269ef4c0c78f37f4461e24e754baa0a471563&selectedTaskRun=HX0OwcwWRbGVAITBnam45g.0

Is this test really sensitive to timing? Or does it do some logic based on comparisons of TimeStamp values?

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Update

There have been 43 failures within the last 7 days:

  • 1 failure on OS X 10.15 WebRender opt
  • 33 failures on Windows 11 x64 22H2 asan WebRender opt
  • 9 failures on Windows 11 x64 22H2 CCov WebRender opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=422465494&repo=mozilla-central&lineNumber=34064

Blocks: media-triage
Flags: needinfo?(jmathies)

Update

There have been 35 total failures within the last 7 days:

  • 3 failures on Windows 11 x64 22H2 CCov WebRender opt
  • 32 failures on Windows 11 x64 22H2 asan WebRender opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=424026900&repo=autoland&lineNumber=12742

Update

There have been 54 total failures within the last 7 days:

  • 9 failures on windows11-64-2009-ccov-qr
  • 44 failures on Windows 11 x64 22H2 asan WebRender opt
  • 1 failure on OS X 10.15 WebRender opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=425065620&repo=mozilla-central&lineNumber=12749

[task 2023-08-05T22:21:51.223Z] 22:21:51     INFO -  TEST-START | GeckoMediaPlugins.CDMStorageBasic
[task 2023-08-05T22:22:21.232Z] 22:22:21  WARNING -  TEST-UNEXPECTED-FAIL | GeckoMediaPlugins.CDMStorageBasic | Value of: false
[task 2023-08-05T22:22:21.235Z] 22:22:21     INFO -    Actual: false
[task 2023-08-05T22:22:21.235Z] 22:22:21     INFO -  Expected: true @ /builds/worker/checkouts/gecko/dom/media/gtest/TestCDMStorage.cpp:420
[task 2023-08-05T22:27:21.247Z] 22:27:21  WARNING -  gtest TEST-UNEXPECTED-FAIL | gtest | timed out after 300 seconds without output
[task 2023-08-05T22:27:21.260Z] 22:27:21     INFO -  mozcrash INFO | Writing a dump to Z:\task_169127273656953\build\tests\gtest\1199ef74-852d-41df-8d61-170b93f65b26.dmp for [7976]
[task 2023-08-05T22:27:21.372Z] 22:27:21     INFO -  gtest INFO | gtest | process wait complete, returncode=None

This has reached our disable-recommended queue.
Justin, are you still looking into this?
Thank you.

Flags: needinfo?(jlink)

No concerns either way from me at this point. I'm not sure why my change consistently increased the failure rate before but, after rebasing my change onto newer code, the increase in failures went away and I landed my change.

Flags: needinfo?(jlink)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
No longer blocks: media-triage
Severity: -- → S3

Update

There have been 46 failures within the last 7 days:

  • 42 failures on Windows 11 x64 22H2 asan WebRender opt
  • 4 failures on Windows 11 x64 22H2 CCov WebRender opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=427131282&repo=autoland&lineNumber=12746

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

The test fails permanently in the Windows ASan configuration since bug 1841352
landed. With no fix in sight, this should be disabled until fixed.

Assignee: nobody → aryx.bugmail
Pushed by rvandermeulen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8068f7fac7d4 disabled GeckoMediaPlugins.CDMStorageBasic on Windows asan for permanent failures r=media-playback-reviewers,alwu
Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 120 Branch
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 120 Branch → ---
Flags: needinfo?(aryx.bugmail)
Summary: Intermittent GeckoMediaPlugins.CDMStorageBasic | Value of: false | gtest | timed out after 300 seconds without output → Permafailing GeckoMediaPlugins.CDMStorageBasic | Value of: false | gtest | timed out after 300 seconds without output

The patch should be checking for MOZ_ASAN rather than ASAN. Fixing that makes CDMStorageForgetThisSite start failing, though.

Flags: needinfo?(aryx.bugmail)

So far, every test I disable just pushes the failures to the next test. I'm inclined to just skip all CDMStorageTests on Windows ASAN.

...which in turn is just pushing the failures to GeckoMediaPlugins.GMPTestCodec

(In reply to Ryan VanderMeulen [:RyanVM] from comment #102)

So far, every test I disable just pushes the failures to the next test. I'm inclined to just skip all CDMStorageTests on Windows ASAN.

TBH, the ASAN behavior would just be coherent with the rest of what I investigated back then where there was a huge latency on doing things early in the startup process and that we could not figure out, leading to the problem. Given how ASAN can slow things, it would likely trigger the same behavior. Disabling tests and failing on the next one just confirms this, since we just postpone the time we need to start things and they get slowish.

(In reply to :gerard-majax from comment #105)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #102)

So far, every test I disable just pushes the failures to the next test. I'm inclined to just skip all CDMStorageTests on Windows ASAN.

TBH, the ASAN behavior would just be coherent with the rest of what I investigated back then where there was a huge latency on doing things early in the startup process and that we could not figure out, leading to the problem. Given how ASAN can slow things, it would likely trigger the same behavior. Disabling tests and failing on the next one just confirms this, since we just postpone the time we need to start things and they get slowish.

Yep, makes sense. Guess I'll just keep hammering at this on Try until I skip my way to victory.

Until we can resolve the underlying bug which is causing GMP launch
slowdowns on Windows ASAN builds, just skip the impacted tests.

Pushed by rvandermeulen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ff0c4e0dc579 Skip failing GMP tests on Windows ASAN. r=alwu
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 120 Branch
See Also: → 1860717

as per bug 1860717, the failures are on win/ccov. I am seeing this as a perma fail while migrating win/ccov tests from 22H2 -> 24H2.

I find that skipping each test until I skipped all the win/asan tests on ccov as well is what was needed to get green.

:alwu, given that do you have concerns with skipping those tests on ccov?

Flags: needinfo?(alwu)

I'm good to skip them.

Flags: needinfo?(alwu)
Attachment #9444988 - Attachment description: Bug 1776767 - Skip some cppunittest on win11/24h2 ccov. r=alwu → Bug 1776767 - Skip some gtests on win11/24h2 ccov. r=alwu
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ab9ac5b46a26 Skip some gtests on win11/24h2 ccov. r=alwu

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: