Closed Bug 1721745 Opened 3 years ago Closed 3 years ago

Permanent awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: heap-unclassified was negative: -107216

Categories

(Testing :: AWSY, defect, P1)

All
Windows 10
defect

Tracking

(firefox-esr78 unaffected, firefox-esr91 unaffected, firefox93 unaffected, firefox94 fixed, firefox95 wontfix)

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- fixed
firefox95 --- wontfix

People

(Reporter: intermittent-bug-filer, Assigned: mccr8)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: assertion, intermittent-failure, regression, Whiteboard: [stockwell unknown])

Attachments

(1 file)

Filed by: apavel [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=345953449&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/B8BW9-D9Smu50SqufAchcg/runs/0/artifacts/public/logs/live_backing.log


[task 2021-07-21T23:09:51.706Z] 23:09:51     INFO - starting checkpoint TabsClosedExtraProcesses...
[task 2021-07-21T23:09:57.714Z] 23:09:57     INFO - checkpoint created, stored in Z:\\task_1626907236\\build\\tests\\results\\memory-report-TabsClosedExtraProcesses-0.json.gz
[task 2021-07-21T23:09:57.715Z] 23:09:57     INFO - closing preloaded browser
[task 2021-07-21T23:09:57.729Z] 23:09:57     INFO - starting checkpoint TabsClosed...
[task 2021-07-21T23:10:03.085Z] 23:10:03     INFO - checkpoint created, stored in Z:\\task_1626907236\\build\\tests\\results\\memory-report-TabsClosed-0.json.gz
[task 2021-07-21T23:10:33.090Z] 23:10:33     INFO - starting checkpoint TabsClosedSettled...
[task 2021-07-21T23:10:33.909Z] 23:10:33     INFO - checkpoint created, stored in Z:\\task_1626907236\\build\\tests\\results\\memory-report-TabsClosedSettled-0.json.gz
[task 2021-07-21T23:10:33.909Z] 23:10:33     INFO - starting checkpoint TabsClosedForceGC...
[task 2021-07-21T23:10:35.131Z] 23:10:35     INFO - checkpoint created, stored in Z:\\task_1626907236\\build\\tests\\results\\memory-report-TabsClosedForceGC-0.json.gz
[task 2021-07-21T23:10:35.132Z] 23:10:35     INFO - setting results
[task 2021-07-21T23:10:35.132Z] 23:10:35     INFO - tearing down!
[task 2021-07-21T23:10:35.133Z] 23:10:35     INFO - tearing down webservers!
[task 2021-07-21T23:10:35.133Z] 23:10:35     INFO - mozproxy MitmproxyDesktop stop!!
[task 2021-07-21T23:10:35.134Z] 23:10:35     INFO - mozproxy Mitmproxy stop!!
[task 2021-07-21T23:10:35.135Z] 23:10:35     INFO - mozproxy Stopping mitmproxy playback, killing process 3904
[task 2021-07-21T23:10:35.135Z] 23:10:35     INFO - mozproxy Sending CTRL_BREAK_EVENT to mitmproxy
[task 2021-07-21T23:10:37.133Z] 23:10:37     INFO - mozproxy Successfully killed the mitmproxy playback process
[task 2021-07-21T23:10:37.133Z] 23:10:37     INFO - mozproxy Turning off the browser proxy
[task 2021-07-21T23:10:37.134Z] 23:10:37     INFO - mozproxy writing: Z:\task_1626907236\build\application\firefox\distribution\policies.json
[task 2021-07-21T23:10:37.134Z] 23:10:37     INFO - processing data in Z:\task_1626907236\build\tests\results!
[task 2021-07-21T23:10:38.920Z] 23:10:38     INFO - TEST-UNEXPECTED-ERROR | awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: heap-unclassified was negative: -107216
[task 2021-07-21T23:10:38.920Z] 23:10:38     INFO - Traceback (most recent call last):
[task 2021-07-21T23:10:38.920Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 235, in run
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -     self.tearDown()
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -   File "Z:\task_1626907236\build\tests\awsy\awsy\test_memory_usage.py", line 159, in tearDown
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -     AwsyTestCase.tearDown(self)
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\awsy_test_case.py", line 111, in tearDown
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -     self.perf_extra_opts(),
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\process_perf_data.py", line 207, in create_perf_data
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -     extra_opts,
[task 2021-07-21T23:10:38.921Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\process_perf_data.py", line 150, in create_suite
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -     memory_report_path, node, name_filter
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\parse_about_memory.py", line 119, in calculate_memory_report_values
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -     totals = path_total(data, data_point_path)
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\parse_about_memory.py", line 89, in path_total
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -     path_totals[k] += heap_unclassified(k)
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -   File "z:\task_1626907236\build\venv\lib\site-packages\awsy\parse_about_memory.py", line 70, in heap_unclassified
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO -     assert unclassified >= 0, "heap-unclassified was negative: %d" % unclassified
[task 2021-07-21T23:10:38.922Z] 23:10:38     INFO - TEST-INFO took 754517ms
[task 2021-07-21T23:10:38.928Z] 23:10:38     INFO - 
[task 2021-07-21T23:10:38.928Z] 23:10:38     INFO - SUMMARY
[task 2021-07-21T23:10:38.928Z] 23:10:38     INFO - -------
[task 2021-07-21T23:10:38.928Z] 23:10:38     INFO - passed: 0
[task 2021-07-21T23:10:38.928Z] 23:10:38     INFO - failed: 1
[task 2021-07-21T23:10:38.929Z] 23:10:38     INFO - todo: 0
[task 2021-07-21T23:10:38.929Z] 23:10:38     INFO - 
[task 2021-07-21T23:10:38.929Z] 23:10:38     INFO - FAILED TESTS
[task 2021-07-21T23:10:38.929Z] 23:10:38     INFO - -------
[task 2021-07-21T23:10:38.930Z] 23:10:38     INFO - test_memory_usage.py test_memory_usage.TestMemoryUsage.test_open_tabs
[task 2021-07-21T23:10:38.930Z] 23:10:38     INFO - SUITE-END | took 754s
[task 2021-07-21T23:10:43.048Z] 23:10:43    ERROR - Return code: 10
[task 2021-07-21T23:10:43.049Z] 23:10:43    ERROR - Got 1 unexpected statuses
[task 2021-07-21T23:10:43.050Z] 23:10:43     INFO - AWSY exited with return code 10: WARNING
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE

This has started permafailing on beta since mozilla-central was merged into beta, and was first spotted in the beta simulation from Comment 5.

Dave, could you please redirect this to someone who can take a look?

Flags: needinfo?(dave.hunt)

I'll take a look tomorrow.

Flags: needinfo?(dave.hunt) → needinfo?(continuation)

TabsOpen, TabsOpenForceGC and TabsOpenSettled have negative heap-unclassified in google.com processes. In the set of logs I looked at, the negative h-u happend for Google presentation in all 3 logs, and in Google docs for the last two, but the heap-unclassified was very low for Google docs in all three.

m-c:

242.98 MB (100.0%) -- explicit
├──161.04 MB (66.28%) ++ window-objects
├───26.25 MB (10.80%) ++ heap-overhead
├───25.24 MB (10.39%) ++ gfx
├───13.41 MB (05.52%) ++ js-non-window
├────7.01 MB (02.89%) ++ layout
├────6.36 MB (02.62%) -- images
│ ├──6.36 MB (02.62%) -- content
│ │ ├──6.34 MB (02.61%) -- vector/used/progress=18f
│ │ │ ├──3.00 MB (01.23%) ── image(74x4641, https://ssl.gstatic.com/docs/common/material_common_sprite185_grey_medium.svg)/source
│ │ │ ├──3.00 MB (01.23%) ── image(74x4641, https://ssl.gstatic.com/docs/common/material_common_sprite185.svg)/source
│ │ │ └──0.35 MB (00.14%) ── image(29x919, https://ssl.gstatic.com/docs/documents/share/images/sprite-22.svg)/source
│ │ └──0.02 MB (00.01%) ++ raster/used/<non-notable images>
│ └──0.00 MB (00.00%) ── cache/overhead
└────3.67 MB (01.51%) ++ (20 tiny)

Beta:
240.41 MB (100.0%) -- explicit
├──161.03 MB (66.98%) ++ window-objects
├───26.31 MB (10.95%) ++ heap-overhead
├───25.25 MB (10.50%) ++ gfx
├───13.39 MB (05.57%) ++ js-non-window
├────7.01 MB (02.92%) ++ layout
├────5.29 MB (02.20%) -- images
│ ├──5.29 MB (02.20%) -- content
│ │ ├──5.27 MB (02.19%) -- vector/used/progress=18f
│ │ │ ├──2.48 MB (01.03%) ── image(74x4641, https://ssl.gstatic.com/docs/common/material_common_sprite185.svg)/source
│ │ │ ├──2.48 MB (01.03%) ── image(74x4641, https://ssl.gstatic.com/docs/common/material_common_sprite185_grey_medium.svg)/source
│ │ │ └──0.30 MB (00.13%) ── image(29x919, https://ssl.gstatic.com/docs/documents/share/images/sprite-22.svg)/source
│ │ └──0.02 MB (00.01%) ++ raster/used/<non-notable images>
│ └──0.00 MB (00.00%) ── cache/overhead
└────2.13 MB (00.89%) ++ (19 tiny)

The 1.3MB drop is almost entirely in heap-unclassified, and the amount of image data went up by 1MB, so I assume there's some difference in the vector image stuff in beta. That said, the heap-unclassified is also extremely low even in m-c. It is around 0.22% in the Google Docs process, as compared to around 6.5% in the CNN process (which doesn't have any vector graphics).

It seems suspicious that in the category that seems like the source of overreporting that there are two identical images. Maybe they are sharing memory and that isn't being accounted for properly.

Timothy, do you know if we might be failing to account for some kind of sharing of vector images in the memory reporter?

I have a DMD report from try for Windows WebRender Fission running. I'm not sure if that will show us anything, if this is due to an error in a counter-based reporter, but it is worth checking.

There was some recent-ish work in memory reporting for images for WebRender, such as bug 1711948.

Flags: needinfo?(continuation) → needinfo?(tnikkel)

In a different log, the numbers for those two images were slightly different, so maybe it isn't just a simple double counting of the image.

Unfortunately the DMD report does not seem to be very useful. It says there are no twice-reported blocks, which is consistent with my theory that the overreporting is due to an error in the image reporter, which is counter based. Though we do have other counter based reporters.

The DMD report for the Google docs process has a total of 126MB of allocations, whereas the memory report has 275MB of explicit. There's 51MB of DMD stuff in the about:memory report, which might account for some of it, but that still seems like an awful large gap. (The stacks in the DMD report also appear to be bogus, FWIW.)

I'm going to move this to ImageLib, as that seems like the most likely cause of the overreporting, based on my findings in comment 11. As I said, the misreporting is likely also happening on mozilla-central, but it somehow increases when things are built as beta.

Component: AWSY → ImageLib
OS: Unspecified → Windows 10
Product: Testing → Core
Hardware: Unspecified → All

aosmond has been more involved with the imagelib memory reporters recently adding ni on him.

Flags: needinfo?(aosmond)

This fails permanently on Windows 10 x64 2004 WebRender Shippable in beta configurations (= on beta since last week) since bug 1727158 increased the process count for isolated web processes (Try push with the backout and job succeeding, beta config not shown because it already got pushed earlier without the backout).

Keywords: regression
Regressed by: 1727158
Summary: Intermittent awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: heap-unclassified was negative: -107216 → Permanent awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: heap-unclassified was negative: -107216
Depends on: 1735556

For beta, we can just disable the assertion. I filed bug 1735556 to fix the underlying apparent issue with the image memory reporting.

Flags: needinfo?(tnikkel)
Flags: needinfo?(aosmond)
Assignee: nobody → continuation
Component: ImageLib → AWSY
Product: Core → Testing

I think we've just been on thin ice for awhile with the memory reporter, and the change to process allocation just happened to tweak things enough that it started to fail. As I said above, even not as beta the heap-unclassified in the Google Docs process is weirdly low, indicating an error. Honestly, we might want to consider a tighter assertion for heap-unclassified.

Comment on attachment 9245735 [details]
Bug 1721745 - Disable negative heap-unclassified assertion in Beta for now.

Beta/Release Uplift Approval Request

  • User impact if declined: None. This disables an assertion on Beta that is permafailing, to give us more time to fix it on Nightly. I don't think anybody pays attention to AWSY on Beta anyways so it shouldn't be a big deal. I only want to land it on Beta for now, because that's the only branch it is permafailing.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): test only
  • String changes made/needed: none
Attachment #9245735 - Flags: approval-mozilla-beta?

Comment on attachment 9245735 [details]
Bug 1721745 - Disable negative heap-unclassified assertion in Beta for now.

This is test-only, approved.

Attachment #9245735 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

:mmcr8, can you land the same patch in mozilla-central? The test is failing too often atm. Alternatively, do you have an estimate of when this will be fixed in mozilla-central?

Flags: needinfo?(continuation)

I'll see if I can get an image person to fix bug 1735556 soon.

Flags: needinfo?(continuation)
Keywords: leave-open
Priority: -- → P1
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
See Also: → 1767048
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: