Bug 1296003 (Closed): opened 5 years ago, closed 5 years ago

[asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM

Categories

(Testing :: Marionette, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla51
Tracking Status: firefox51 --- fixed

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

This seems to be an ASAN-specific bug. In the gecko.log file I can see the following memory allocation failure.

I'm not sure if Jesse is still working on or with those builds, but maybe he could help us find someone to analyze this. This test is close to permafail.
Flags: needinfo?(jruderman)
It looks like the failure started with the following changeset:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=ccd0fcabfd0dba3f9d3838ce12aa6c9143a52e0f

Maybe this is related to bug 1294469 (Shrink the nursery if we run out of memory)?
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
I was wrong about the above changeset. Treeherder also shows this failure for earlier pushes. The first one I can actually see here is:

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f0067001c059ff57d6927c6da5a1605f1d29a449

Which includes:

Bug 1264642 - Reduce the contiguous address space needed for StructuredClone serialization
Flags: needinfo?(terrence)
Flags: needinfo?(kchen)
Flags: needinfo?(jruderman)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(continuation)
The "ERROR: AddressSanitizer failed to allocate" might actually be a side-effect. It looks like we hang due to:

> ###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

This means the socket connection between the Marionette client and server is dead.

By the way, this is all e10s-only.
tracking-e10s: --- → ?
Error code 12 is ENOMEM. According to https://llvm.org/bugs/show_bug.cgi?id=22026 this looks like an OOM crash. The IPC error is a side-effect, I think.
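As a quick sanity check of the claim above, the numeric error code can be mapped back to its POSIX name with Python's standard `errno` module:

```python
import errno

# Error code 12 corresponds to ENOMEM ("Cannot allocate memory") on Linux,
# which is consistent with the out-of-memory theory for this failure.
print(errno.ENOMEM)         # 12 on Linux
print(errno.errorcode[12])  # 'ENOMEM'
```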
Flags: needinfo?(kchen)
I think it's because the Marionette screenshot is sent over IPC in a JSON object via structured clone, and we somehow break ASAN because its LargeMmapAllocator cannot allocate more memory. I'm not sure of the screenshot's size, but after the patch we will split it into chunks of 4 KB. If the entire base64-encoded image is around 450 KB, that generates about 112 chunks; double that is 224. I don't think that is large enough to break ASAN on its own. Maybe the accumulated allocations can do that?
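A back-of-envelope check of the chunk math in the comment above, using the assumed figures (a ~450 KB base64 payload, 4 KB chunks):

```python
import math

# Assumed values from the comment above; the real screenshot size varies.
image_size = 450 * 1024  # ~450 KB base64-encoded screenshot
chunk_size = 4 * 1024    # 4 KB chunks after the patch in bug 1264642

chunks = math.ceil(image_size / chunk_size)
print(chunks)      # 113 (the comment rounds this to "about 112")
print(chunks * 2)  # 226 allocations if each chunk is copied once
```

Either way, a couple of hundred 4 KB allocations is tiny, which supports the suspicion that these chunks alone should not exhaust ASAN's allocator.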
I ended up backing out bug 1264642.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
I'm not familiar with the code that landed in bug 1264642.
Flags: needinfo?(continuation)
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> I think it's because the marionette screenshot is sent over IPC messages in
> a json object by structured clone. We somehow break ASAN because its
> LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> the screenshot but after the patch we will split it into chunks of size 4KB.
> If the entire base64 encoded image is around 450KB then that will generate
> about 112 chunks.. double that is 224. I don't think that is large enough to
> break ASAN. Maybe the accumulated allocations can do that?

I haven't touched any of that code yet, so I cannot really give feedback. I would ni? Andreas here, given that he is working a lot on the Marionette server.

Is there a fix needed in Marionette to handle multiple chunks, or does this all work transparently? I ask because we see the IPC error a lot, for various tests and at random times. I found a way to reproduce it via bug 1294456, but with an opt or debug build.
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #9)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> > I think it's because the marionette screenshot is sent over IPC messages in
> > a json object by structured clone. We somehow break ASAN because its
> > LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> > the screenshot but after the patch we will split it into chunks of size 4KB.
> > If the entire base64 encoded image is around 450KB then that will generate
> > about 112 chunks.. double that is 224. I don't think that is large enough to
> > break ASAN. Maybe the accumulated allocations can do that?
> 
> I haven't touched any of that code yet, so I cannot really give feedback. I
> would ni? Andreas here, given that he is working a lot on Marionette server.
> 
> Is there a fix needed in Marionette to handle multiple chunks or does this
> work all transparently? I ask because we see the IPC error a lot for various
> tests and random times. I found a way to reproduce it via bug 1294456, but
> with an opt or debug build.

It should be transparent to Marionette. I'll try to debug this.
I checked the gecko.log of this job and found the following:

###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

Given my investigation of other intermittent failures, I feel it might also be related to bug 1294540. But it hasn't happened again since the backout, so I would keep this bug closed for now and only add the dependency.
Depends on: 1294540
Target Milestone: --- → mozilla51
From resource-usage.json the peak memory usage is 98.3%, so we were really running out of memory. The test VM has only 3 GB of memory. I'm not sure if my patch makes the peak memory usage worse.
Looks like the test machine has been given more virtual memory. I can no longer reproduce this. https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6bcace337cf&selectedJob=26031144

The resource-usage.json now shows vmem_total: 7843348480. I'm not sure if we are automatically adjusting or rotating the test machines. I'll find out.
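To put the two figures side by side (the old ~3 GB VMs from comment 12 and the vmem_total reported above):

```python
# vmem_total as reported in resource-usage.json, in bytes.
vmem_total = 7843348480

# Convert to GiB: roughly 7.3 GiB, i.e. more than double the old 3 GB VMs.
print(round(vmem_total / 1024**3, 2))  # 7.3
```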
After some digging I found the ec2 configuration here: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg

Hi Chris, do you know why the ASAN tests seem to run on c3.xlarge or m3.xlarge instances, which have more RAM available? See also comment 12 and comment 13.
Flags: needinfo?(catlee)
Kan-ru, at the end of last week we switched the desktop-test workers to large instances via bug 1281241. So I think that is what you are now seeing here with the ASAN tests for Marionette. I don't think there is a way to force the old instances unless you use a worker type that is still running on them. Joel might be able to assist here.
Flags: needinfo?(catlee) → needinfo?(jmaher)
Kan-ru, we previously ran on m1.medium (a legacy, single-core instance type); our goal is to run everything on a supported instance type, which means multi-core and typically more memory/CPU. I don't see a new bug mentioned here; if you want to investigate with the old setup, change the instance size in the Taskcluster configs. An example would be:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#273

For ASAN, you would need to do this by test platform, as in the above example.
Flags: needinfo?(jmaher)
Thanks. Since we are moving to a larger instance and I can no longer reproduce the intermittent failure, I guess I can safely reland bug 1264642.
Flags: needinfo?(ato)
tracking-e10s: ? → ---
Summary: Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) → [asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM