Closed
Bug 1296003
Opened 9 years ago
Closed 9 years ago
[asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM
Categories
(Testing :: Marionette Client and Harness, defect)
Tracking
(firefox51 fixed)
RESOLVED
FIXED
mozilla51
| | Tracking | Status |
|---|---|---|
| firefox51 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: intermittent-failure, regression)
Attachments
(1 file)
Comment 1•9 years ago
This seems to be an ASAN-specific bug. In the gecko.log file I can see a memory allocation failure.
Not sure if Jesse is still working on/with those builds but maybe he could help us to find a person to analyze that. This test is close to permafail.
Flags: needinfo?(jruderman)
Comment 2•9 years ago
It looks like the failure started with the following changeset:
https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=ccd0fcabfd0dba3f9d3838ce12aa6c9143a52e0f
Maybe this is related to bug 1294469 (Shrink the nursery if we run out of memory)?
Comment 3•9 years ago
I was wrong about the above changeset. Treeherder also shows this failure for earlier pushes. The first one I actually see is:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f0067001c059ff57d6927c6da5a1605f1d29a449
Which includes:
Bug 1264642 - Reduce the contiguous address space needed for StructuredClone serialization
Flags: needinfo?(terrence)
Flags: needinfo?(kchen)
Flags: needinfo?(jruderman)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(continuation)
Comment 4•9 years ago
The "ERROR: AddressSanitizer failed to allocate" might actually be a side-effect. It looks like we hang due to:
> ###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
It means our socket connection between Marionette client and server is dead.
Btw, this is all e10s-only.
tracking-e10s:
--- → ?
Comment 5•9 years ago
Error code 12 is ENOMEM. According to https://llvm.org/bugs/show_bug.cgi?id=22026 this looks like an OOM crash. The IPC error is the side-effect, I think.
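For reference, the mapping from error code 12 to ENOMEM can be checked with Python's standard `errno` module. This is only a sanity-check sketch, assuming a Linux host like the test VM; it does not reproduce the ASAN failure itself.

```python
import errno
import os

# Look up the symbolic name for error code 12 (the code mentioned above).
code = 12
name = errno.errorcode[code]

# os.strerror returns the platform's human-readable message for the code.
message = os.strerror(code)

print(f"{code} -> {name}: {message}")
```

The exact message text depends on the platform's C library, so only the symbolic name should be relied on.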
Flags: needinfo?(kchen)
Comment 6•9 years ago
I think it's because the Marionette screenshot is sent over IPC messages in a JSON object by structured clone. We somehow break ASAN because its LargeMmapAllocator cannot allocate more memory. Not sure what the size of the screenshot is, but after the patch we will split it into chunks of size 4KB. If the entire base64-encoded image is around 450KB then that will generate about 112 chunks; double that is 224. I don't think that is large enough to break ASAN. Maybe the accumulated allocations can do that?
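The chunk estimate above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the 450KB figure is the guess from this comment, 4KB is the chunk size stated for the patch, and `chunk_count` is a hypothetical helper, not code from bug 1264642.

```python
import math


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of fixed-size chunks needed to hold total_bytes."""
    return math.ceil(total_bytes / chunk_bytes)


KB = 1024
encoded_size = 450 * KB  # assumed size of the base64-encoded screenshot
chunk_size = 4 * KB      # chunk size mentioned in the comment

chunks = chunk_count(encoded_size, chunk_size)
print(chunks, 2 * chunks)
```

Rounding up gives 113 chunks (226 when doubled), which is in line with the rough 112 and 224 figures quoted here.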
I ended up backing out bug 1264642.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 8•9 years ago
I'm not familiar with the code that landed in bug 1264642.
Flags: needinfo?(continuation)
Updated•9 years ago
Keywords: regressionwindow-wanted
Comment 9•9 years ago
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> I think it's because the marionette screenshot is sent over IPC messages in
> a json object by structured clone. We somehow break ASAN because its
> LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> the screenshot but after the patch we will split it into chunks of size 4KB.
> If the entire base64 encoded image is around 450KB then that will generate
> about 112 chunks.. double that is 224. I don't think that is large enough to
> break ASAN. Maybe the accumulated allocations can do that?
I haven't touched any of that code yet, so I cannot really give feedback. I would ni? Andreas here, given that he works a lot on the Marionette server.
Is a fix needed in Marionette to handle multiple chunks, or does this all work transparently? I ask because we see the IPC error a lot, for various tests and at random times. I found a way to reproduce it via bug 1294456, but with an opt or debug build.
Flags: needinfo?(ato)
Comment 10•9 years ago
(In reply to Henrik Skupin (:whimboo) from comment #9)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> > I think it's because the marionette screenshot is sent over IPC messages in
> > a json object by structured clone. We somehow break ASAN because its
> > LargeMmapAllocator cannot allocate more memory. Not sure what's the size of
> > the screenshot but after the patch we will split it into chunks of size 4KB.
> > If the entire base64 encoded image is around 450KB then that will generate
> > about 112 chunks.. double that is 224. I don't think that is large enough to
> > break ASAN. Maybe the accumulated allocations can do that?
>
> I haven't touched any of that code yet, so I cannot really give feedback. I
> would ni? Andreas here, given that he is working a lot on Marionette server.
>
> Is there a fix needed in Marionette to handle multiple chunks or does this
> work all transparently? I ask because we see the IPC error a lot for various
> tests and random times. I found a way to reproduce it via bug 1294456, but
> with an opt or debug build.
It should be transparent to Marionette. I'll try to debug this.
Comment 11•9 years ago
I checked the gecko.log of this job and found the following:
###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
Based on my investigation of other intermittent failures, I suspect this might also be related to bug 1294540. But since the failure has not happened again since the backout, I would keep this bug closed for now and only add the dependency.
Comment 12•9 years ago
From resource-usage.json, the peak memory usage is 98.3%, so we really were running out of memory. The test VM has only 3G of memory. Not sure if my patch makes the peak memory usage worse.
Comment 13•9 years ago
Looks like the test machine has been given more virtual memory. I can no longer reproduce this. https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6bcace337cf&selectedJob=26031144
The resource-usage.json now shows vmem_total: 7843348480. I'm not sure if we are automatically adjusting or rotating the test machines. I'll find out.
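To put the `vmem_total` value in more familiar units, here is a small conversion sketch. Only the field name and the byte count come from the log; `bytes_to_gib` is a hypothetical helper, and nothing is assumed about the layout of resource-usage.json beyond that one field.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


vmem_total = 7843348480  # value reported in resource-usage.json
print(f"{bytes_to_gib(vmem_total):.2f} GiB")
```

That works out to roughly 7.3 GiB, well above the 3G noted in comment 12.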
Comment 14•9 years ago
After some digging I found the EC2 configuration here: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg
Hi Chris, do you know why the ASAN tests seem to run on c3.xlarge or m3.xlarge, which have more RAM available? See also comment 12 and comment 13.
Flags: needinfo?(catlee)
Comment 16•9 years ago
Kan-ru, at the end of last week we switched desktop-test workers to large instances in bug 1281241. So I think that is what you are now also seeing here with the ASAN tests for Marionette. I don't think there is a way to force the old instances unless you use a worker type that still runs on them. Joel might be able to assist here.
Flags: needinfo?(catlee) → needinfo?(jmaher)
Comment 17•9 years ago
Kan-ru, we previously ran on m1.medium (a legacy, single-core instance type). Our goal is to run everything on a supported instance type, which means multi-core and typically more memory/CPU. I don't see the mention of a new bug here; if you want to look at an old bug, change the instance size in the taskcluster configs. An example would be:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/desktop-test/tests.yml#273
For ASAN, you would need to do this by-test-platform, as in the above example.
Flags: needinfo?(jmaher)
Comment 18•9 years ago
Thanks. Since we are moving to a larger instance and I can no longer reproduce the intermittent failure, I guess I can safely reland bug 1264642.
Flags: needinfo?(ato)
Updated•9 years ago
tracking-e10s:
? → ---
Summary: Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) → [asan] Intermittent test_screenshot.py Content.test_viewport_after_scroll | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) due to OOM
Updated•3 years ago
Product: Testing → Remote Protocol
Comment 19•3 years ago
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing