Bug 1296003 (Closed): opened 5 years ago, closed 5 years ago

[asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM

Categories

(Testing :: Marionette, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla51
Tracking Status: firefox51 --- fixed

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

This seems to be an ASAN-specific bug. In the gecko.log file I can see the following memory allocation failure.

I'm not sure if Jesse is still working on or with those builds, but maybe he could help us find someone to analyze this. This test is close to permafail.
Flags: needinfo?(jruderman)
It looks like the failure started with the following changeset:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=ccd0fcabfd0dba3f9d3838ce12aa6c9143a52e0f

Maybe this is related to bug 1294469 (Shrink the nursery if we run out of memory)?
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
I was wrong about the above changeset. Treeherder also shows this failure for earlier pushes. The first one I can actually see here is:

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f0067001c059ff57d6927c6da5a1605f1d29a449

Which includes:

Bug 1264642 - Reduce the contiguous address space needed for StructuredClone serialization
Flags: needinfo?(terrence)
Flags: needinfo?(kchen)
Flags: needinfo?(jruderman)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(continuation)
The "ERROR: AddressSanitizer failed to allocate" might actually be a side-effect. It looks like we hang due to:

> ###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

This means the socket connection between the Marionette client and server is dead.

By the way, this is all e10s-only.
tracking-e10s: --- → ?
Error code 12 is ENOMEM. According to https://llvm.org/bugs/show_bug.cgi?id=22026 this looks like an OOM crash. The IPC error is a side-effect, I think.
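As a quick sanity check of the claim above, the numeric error code can be mapped back to its POSIX name with Python's standard `errno` module:

```python
import errno

# Error code 12 corresponds to ENOMEM ("Cannot allocate memory") on Linux,
# which is consistent with the out-of-memory theory for this failure.
print(errno.ENOMEM)         # 12 on Linux
print(errno.errorcode[12])  # 'ENOMEM'
```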
Flags: needinfo?(kchen)
I think it's because the Marionette screenshot is sent over IPC in a JSON object via structured clone, and we somehow break ASAN because its LargeMmapAllocator cannot allocate more memory. I'm not sure of the screenshot's size, but after the patch we will split it into chunks of 4 KB. If the entire base64-encoded image is around 450 KB, that generates about 112 chunks; double that is 224. I don't think that is large enough to break ASAN on its own. Maybe the accumulated allocations can do that?
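A back-of-envelope check of the chunk math in the comment above, using the assumed figures (a ~450 KB base64 payload, 4 KB chunks):

```python
import math

# Assumed values from the comment above; the real screenshot size varies.
image_size = 450 * 1024  # ~450 KB base64-encoded screenshot
chunk_size = 4 * 1024    # 4 KB chunks after the patch in bug 1264642

chunks = math.ceil(image_size / chunk_size)
print(chunks)      # 113 (the comment rounds this to "about 112")
print(chunks * 2)  # 226 allocations if each chunk is copied once
```

Either way, a couple of hundred 4 KB allocations is tiny, which supports the suspicion that these chunks alone should not exhaust ASAN's allocator.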
I ended up backing out bug 1264642.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
I'm not familiar with the code that landed in bug 1264642.
Flags: needinfo?(continuation)
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> I think it's because the marionette screenshot is sent over IPC messages in
> a json object by structured clone. We somehow break ASAN because its
> LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> the screenshot but after the patch we will split it into chunks of size 4KB.
> If the entire base64 encoded image is around 450KB then that will generate
> about 112 chunks.. double that is 224. I don't think that is large enough to
> break ASAN. Maybe the accumulated allocations can do that?

I haven't touched any of that code yet, so I cannot really give feedback. I would ni? Andreas here, given that he is working a lot on the Marionette server.

Is there a fix needed in Marionette to handle multiple chunks, or does this all work transparently? I ask because we see the IPC error a lot, for various tests and at random times. I found a way to reproduce it via bug 1294456, but with an opt or debug build.
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #9)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> > I think it's because the marionette screenshot is sent over IPC messages in
> > a json object by structured clone. We somehow break ASAN because its
> > LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> > the screenshot but after the patch we will split it into chunks of size 4KB.
> > If the entire base64 encoded image is around 450KB then that will generate
> > about 112 chunks.. double that is 224. I don't think that is large enough to
> > break ASAN. Maybe the accumulated allocations can do that?
> 
> I haven't touched any of that code yet, so I cannot really give feedback. I
> would ni? Andreas here, given that he is working a lot on Marionette server.
> 
> Is there a fix needed in Marionette to handle multiple chunks or does this
> work all transparently? I ask because we see the IPC error a lot for various
> tests and random times. I found a way to reproduce it via bug 1294456, but
> with an opt or debug build.

It should be transparent to Marionette. I'll try to debug this.
I checked the gecko.log of this job and found the following:

###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

Given my investigation of other intermittent failures, I feel it might also be related to bug 1294540. But it hasn't happened again since the backout, so I would keep this bug closed for now and only add the dependency.
Depends on: 1294540
Target Milestone: --- → mozilla51
From resource-usage.json the peak memory usage is 98.3%, so we were really running out of memory. The test VM has only 3 GB of memory. I'm not sure if my patch makes the peak memory usage worse.
Looks like the test machine has been given more virtual memory. I can no longer reproduce this. https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6bcace337cf&selectedJob=26031144

The resource-usage.json now shows vmem_total: 7843348480. I'm not sure if we are automatically adjusting or rotating the test machines. I'll find out.
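To put the two figures side by side (the old ~3 GB VMs from comment 12 and the vmem_total reported above):

```python
# vmem_total as reported in resource-usage.json, in bytes.
vmem_total = 7843348480

# Convert to GiB: roughly 7.3 GiB, i.e. more than double the old 3 GB VMs.
print(round(vmem_total / 1024**3, 2))  # 7.3
```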
After some digging I found the ec2 configuration here: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg

Hi Chris, do you know why the ASAN tests seem to run on c3.xlarge or m3.xlarge instances, which have more RAM available? See also comment 12 and comment 13.
Flags: needinfo?(catlee)
Kan-ru, at the end of last week we switched the desktop-test workers to large instances via bug 1281241. So I think that is what you are now seeing here with the ASAN tests for Marionette. I don't think there is a way to force the old instances unless you use a worker type that is still running on them. Joel might be able to assist here.
Flags: needinfo?(catlee) → needinfo?(jmaher)
Kan-ru, we previously ran on m1.medium (a legacy, single-core instance type); our goal is to run everything on a supported instance type, which means multi-core and typically more memory/CPU. I don't see a new bug mentioned here; if you want to investigate with the old setup, change the instance size in the Taskcluster configs. An example would be:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#273

For ASAN, you would need to do this by test platform, as in the above example.
Flags: needinfo?(jmaher)
Thanks. Since we are moving to a larger instance and I can no longer reproduce the intermittent failure, I guess I can safely reland bug 1264642.
Flags: needinfo?(ato)
tracking-e10s: ? → ---
Summary: Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) → [asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM