Closed Bug 931198 Opened 11 years ago Closed 10 years ago

about_memory fails to retrieve all files from b2g emulator

Categories

(Firefox OS Graveyard :: General, defect)

Hardware: x86_64
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(blocking-b2g:1.3+, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed)

Status: RESOLVED FIXED
Target Milestone: 1.3 C3/1.4 S3(31jan)
blocking-b2g: 1.3+
Tracking Status
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: rwood, Assigned: johns)

References

Details

Attachments

(3 files)

Attached file aboutmemorylogcat.txt
B2G/tools/get_about_memory.py fails to pull all the files off the b2g emulator if there are a large number of files (i.e. over 30). The error message is "Unable to pull some files". More details can be found in the attached terminal output and logcat. Not pulling DMD or GC/CC logs.

Easiest way to reproduce:

- Start the b2g emulator with default settings
- After startup, swipe right
- Click on the first application icon (Camera), wait for the app to start up completely, then return to the home screen to minimize it
- Repeat this for EACH application on the emulator - including those shown when you swipe right again to the second full screen of icons; don't forget the main apps at the very bottom too (Phone, etc.)
- Then run get_about_memory.py

Around 33 files will be found, but it always gets stuck after pulling 28 of them and fails with the above-noted message.

If the emulator RAM is increased from 512MB to 1024MB then this problem doesn't happen; all of the memory files are pulled fine. However, this is not a viable solution: if the RAM is changed, the emulator is no longer a valid representation of a Firefox OS device.

Note that this is blocking AWSY Firefox OS automation.
Kyle, can you take a look at this, or suggest someone who can?
Flags: needinfo?(khuey)
Is this something you have time to look into, John?
Flags: needinfo?(khuey) → needinfo?(jschoenick)
The logcat suggests that some content processes are crashing, probably OOMing given that higher memory limits fix this:
> I/Gecko   (   45): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv

Collecting memory reports in many processes simultaneously is likely to spike memory usage; I'm not sure what the best mitigation is for that. When I get a chance I can figure out how to build the b2g emulator to verify.
Flags: needinfo?(jschoenick)
See https://developer.mozilla.org/en-US/docs/Mozilla/Firefox_OS?menu for instructions on how to build for B2G; the target you want here is 'emulator'.
> Collecting memory reports in many processes simultaneously is likely to
> spike memory usage; I'm not sure what the best mitigation is for that.

It shouldn't be that bad.  Collecting the measurements takes little if any memory, since it's mostly a matter of traversing lots of data structures.  Writing to file takes a moderate amount because the gzipping needs a buffer, which I think is 32KB?
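For intuition, here is a minimal Python sketch of streaming a report through gzip with a small fixed buffer -- not the actual Gecko reporter code, just an illustration of why the file-writing side stays cheap (the 32KB chunk size is an assumption based on the figure above):

import gzip

CHUNK = 32 * 1024  # assumed buffer size; only this much is held before each write

def write_report_gz(lines, path):
    # Stream report lines into a gzipped file without building the whole
    # output in memory; gzip keeps only its internal window plus one chunk.
    with gzip.open(path, 'wb') as out:
        buf = b''
        for line in lines:
            buf += line.encode('utf-8')
            if len(buf) >= CHUNK:
                out.write(buf)
                buf = b''
        if buf:
            out.write(buf)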
Did you have time to look into this, :johns?  It's completely blocking progress on B2G memory tests.  If you don't have time, let me know so I can try to find another owner.
Flags: needinfo?(jschoenick)
I'm working on a few higher priority things right now, and I need to set aside time to get a B2G emulator/builds up and running to debug this -- I'm happy to take this, but I won't be able to get to it for a week or two.
Flags: needinfo?(jschoenick)
Blocks: 941809
(In reply to John Schoenick [:johns] from comment #8)
> I'm working on a few higher priority things right now, and I need to set
> aside time to get a B2G emulator/builds up and running to debug this -- I'm
> happy to take this, but I won't be able to get to it for a week or two.

Thanks; I'm going to assign this to you so it doesn't fall through the cracks; please unassign if you feel you won't be able to get to it in a week or two.
Assignee: nobody → jschoenick
Blocks: 899194
No longer blocks: 941809
This is a serious memory failure and we would like a fix as soon as possible.
blocking-b2g: --- → 1.3+
Flags: needinfo?(jschoenick)
I set up a B2G build environment and am looking into this now.
Flags: needinfo?(jschoenick)
So this is indeed due to an OOM - after opening all processes we often have only a few megs of free memory, and asking every process to dump memory triggers a spike that results in some processes being killed:
> <4>select 524 (Twitter share), adj 14, size 3986, to kill
> <4>select 579 (Music2), adj 14, size 4529, to kill
> <4>send sigkill to 579 (Music2), adj 14, size 4529
> <4>select 143 (Calendar), adj 12, size 4761, to kill
> <4>select 524 (Twitter share), adj 14, size 3941, to kill
> <4>select 579 (Music2), adj 14, size 4471, to kill
> <4>send sigkill to 579 (Music2), adj 14, size 4471
> <4>select 143 (Calendar), adj 12, size 4890, to kill
> <4>select 524 (Twitter share), adj 14, size 3951, to kill
> <4>send sigkill to 524 (Twitter share), adj 14, size 3951

In this case it attempts to kill Music2 (twice?) and Twitter share - and only 30 of the 32 expected reports are pulled. I will see if I can find any low hanging fruit for minimizing the amount of memory needed to dump all reports, but it is probably unrealistic to expect to dump all memory reports without any allocations.
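To make these kills easier to spot while experimenting, a rough sketch like the following can diff the kernel log around a dump (assuming adb is on PATH and the emulator is the only attached device; the 'send sigkill' marker matches the lowmemorykiller lines quoted above):

import subprocess

def lmk_kills():
    # Return the lowmemorykiller 'send sigkill' lines from the emulator's kernel log.
    dmesg = subprocess.check_output(['adb', 'shell', 'dmesg']).decode('utf-8', 'replace')
    return [line for line in dmesg.splitlines() if 'send sigkill' in line]

before = set(lmk_kills())
# ... trigger the dump here, e.g. by running get_about_memory.py ...
after = set(lmk_kills())
for line in sorted(after - before):
    print('killed during dump: ' + line)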
> I will see if I can find any low
> hanging fruit for minimizing the amount of memory needed to dump all
> reports, but it is probably unrealistic to expect to dump all memory reports
> without any allocations.

You should read bug 943744. AFAIK the memory spike during dumping is almost entirely due to the notable string detection. I improved things on that front in that bug, though there is still a spike.
(In reply to Nicholas Nethercote [:njn] from comment #13)
> You should read bug 943744. AFAIK the memory spike during dumping is almost
> entirely due to the notable string detection. I improved things on that
> front in that bug, though there is still a spike.

Can this be reasonably backported to the b2g branches?
Flags: needinfo?(n.nethercote)
So dumping memory reports necessarily uses some memory; even with bug 943744 there will be the edge case where sending the dump signal OOMs a process. Is adjusting the harness to exclude these processes from the dumps the host expects to receive, and possibly noting in the resulting files that some processes died during memory reporting, enough to solve this for AWSY-B2G's purposes?
Flags: needinfo?(rwood)
(In reply to John Schoenick [:johns] from comment #15)
> So dumping memory reports necessarily uses some memory; even with bug 943744
> there will be the edge case where sending the dump signal OOMs a process. Is
> adjusting the harness to exclude these processes from the dumps the host
> expects to receive, and possibly noting in the resulting files that some
> processes died during memory reporting, enough to solve this for AWSY-B2G's
> purposes?

I think it's fine to do that; we already sometimes encounter this situation on low-memory devices like the Tarako.
> Can this be reasonably backported to the b2g branches?

I think so. Try it, and if the patches don't apply I'll take a look to see if the conflicts are bad.
Flags: needinfo?(n.nethercote)
Thanks guys. So if I'm understanding correctly, using the get_about_memory script may actually kill some running processes on the emulator? Which processes can be killed? The plan is to run the test for 6 hours straight each time, and grab about_memory logs at various intervals, without restarting the emulator.

I'm wondering, if some of the emulator/gaia processes are killed while dumping the memory reports, will that have an adverse effect on the emulator/gaia itself? Wouldn't this potentially make the resulting data inconsistent each time the test is performed? Is it ok to leave the emulator running after dumping the data? Or will I need to just grab the about_memory once at the end of the test duration? Thanks - just want to make sure I am understanding correctly. :)
Flags: needinfo?(rwood) → needinfo?(jschoenick)
PM triaged this bug and believes that it should be a blocker.
(In reply to Robert Wood [:rwood] from comment #18)
> Thanks guys. So if I'm understanding correctly, using the get_about_memory
> script may actually kill some running processes on the emulator? Which
> processes can be killed? The plan is to run the test for 6 hours straight
> each time, and grab about_memory logs at various intervals, without
> restarting the emulator.

When the device runs out of memory, the OOM killer picks an old background process to terminate. I'm not very familiar with this code, but it's my understanding that this happens often in normal use -- e.g. if the foreground app tried to do something memory-intensive at this moment, some background tasks would be killed to free up space. In this case, attempting to dump all memory reports is the trigger, rather than some user action.

> I'm wondering, if some of the emulator/gaia processes are killed while
> dumping the memory reports, will that have an adverse effect on the
> emulator/gaia itself?

Since old background tasks are culled in normal operation, this should be no more harmful than if a memory intensive app were opened at that point.

> Wouldn't this potentially make the resulting data
> inconsistent each time the test is performed?

I'm not sure how deterministic the OOM killer is, but desktop AWSY, for instance, is not very deterministic. So, it may be the case that more/fewer processes are still around when we get to the checkpoint, even before considering potential OOM deaths.

> Is it ok to leave the emulator
> running after dumping the data? Or will I need to just grab the about_memory
> once at the end of the test duration? Thanks - just want to make sure I am
> understanding correctly. :)

Since OOM killing is normal, there should be no detriment to this happening. The test simply needs to be aware that the act of measuring memory may OOM-kill some old background apps -- e.g. using the dumps to graph remaining free memory would be misleading, since taking the dump itself freed up a few megs in order to run. (You could still get that datapoint, however, by e.g. querying meminfo before taking a full dump.)
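For example, a small sketch of grabbing that datapoint first (a hypothetical helper, assuming adb is on PATH and the emulator is the only attached device):

import subprocess

def free_memory_kb():
    # Read MemFree from the emulator's /proc/meminfo, in kB, before any dump runs.
    out = subprocess.check_output(['adb', 'shell', 'cat', '/proc/meminfo']).decode()
    for line in out.splitlines():
        if line.startswith('MemFree:'):
            return int(line.split()[1])
    return None

print('MemFree before dump: %s kB' % free_memory_kb())
# ...then run get_about_memory.py to collect the full reports.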
Flags: needinfo?(jschoenick)
:johns Excellent, thank you for the info!
@rwood: So bug 943744, which reduces the overhead of these memory spikes (but doesn't 100% solve the issue, as noted), landed on 28 -- for AWSY-B2G's purposes, do we care about uplifting that to the B2G26 branches (1.2, 1.2f)?
Flags: needinfo?(rwood)
No, we don't intend to run the AWSY test on 1.2x.
Flags: needinfo?(rwood)
This inlines _wait_for_remote_files, which is only used in this spot, and checks inside the loop for child PIDs that vanished. The implications of children vanishing are discussed in comment 20.
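Not the attached patch itself, but a rough sketch of the idea; live_pids() and pids_with_reports() are hypothetical callables standing in for the script's device queries via adb:

import time

def wait_for_remote_files(expected_pids, live_pids, pids_with_reports, timeout=60):
    # Wait for a memory report per child, dropping children that vanish mid-dump
    # instead of blocking forever on a report that will never arrive.
    deadline = time.time() + timeout
    remaining = set(expected_pids)
    while remaining and time.time() < deadline:
        for pid in remaining - set(live_pids()):
            # The child was OOM-killed (or otherwise exited); stop waiting for it.
            print('Warning: Child %d exited during memory reporting' % pid)
            remaining.discard(pid)
        remaining -= set(pids_with_reports())
        if remaining:
            time.sleep(1)
    return remaining  # anything still here timed out rather than finishing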
Attachment #8366150 - Flags: review?(khuey)
Attachment #8366150 - Flags: review?(khuey) → review+
Keywords: checkin-needed
https://github.com/mozilla-b2g/B2G/commit/64ff91b94cfbddf7d1af813a4aa64f4817d63dde
Status: NEW → RESOLVED
Closed: 10 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
John, can you please assist with this uplift? :)
Flags: needinfo?(jhford)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #26)
> John, can you please assist with this uplift? :)

There are no uplifts for B2G, sadly.  The contents of this repository are shared between all branches.
Flags: needinfo?(jhford)
:johns

Hi, I've noticed lately while using get_about_memory with the emulator that Gaia on the emulator reports the homescreen has crashed (crash reporter) and the terminal notes "Warning: Child 155 exited during memory reporting". I'm assuming this is expected, as noted in comment 20? It does seem to recover ok. Just want to verify this is expected and not a new regression, thanks.
Flags: needinfo?(jschoenick)
(In reply to Robert Wood [:rwood] from comment #28)
> :johns
> 
> Hi, I've noticed lately while using get_about_memory with the emulator that
> Gaia on the emulator reports the homescreen has crashed (crash reporter) and
> the terminal notes "Warning: Child 155 exited during memory reporting". I'm
> assuming this is expected, as noted in comment 20? It does seem to recover
> ok. Just want to verify this is expected and not a new regression, thanks.

This sounds like a b2g bug that should be filed. It's expected that the phone kills old background tasks when it needs more RAM, but I doubt we intend to nuke the home screen with an error dialog :-/
Flags: needinfo?(jschoenick)
(In reply to John Schoenick [:johns] from comment #29)
> This sounds like a b2g bug that should be filed. It's expected that the
> phone kills old background tasks when it needs more RAM, but I doubt we
> intend to nuke the home screen with an error dialog :-/

The homescreen can get killed by the LMK when we're low on memory, but this won't trigger the crash screen. It sounds to me like we hit an actual bug here, maybe a defective memory reporter that segfaults the homescreen when computing memory usage. To figure out what's going on we could hook up gdb to the homescreen process, let it run, then run get_about_memory.py and see where things go boom.
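A rough sketch of that setup (assuming gdbserver is present on the emulator image and adb is on PATH; the process-name lookup assumes the classic Android toolbox ps column layout of USER PID PPID ... NAME, so adjust if yours differs):

import subprocess

def find_pid(name):
    # Find a b2g child process PID by its process name, e.g. 'Homescreen'.
    for line in subprocess.check_output(['adb', 'shell', 'ps']).decode().splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            return fields[1]
    return None

pid = find_pid('Homescreen')

# Attach gdbserver on the device and forward the debug port to the host.
subprocess.Popen(['adb', 'shell', 'gdbserver', ':5039', '--attach', pid])
subprocess.check_call(['adb', 'forward', 'tcp:5039', 'tcp:5039'])

# On the host: run gdb against the b2g binary, 'target remote :5039', 'continue',
# then kick off get_about_memory.py and wait for the crash.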
Thanks guys. I created Bug 979879 for this new issue.
Target Milestone: --- → 1.3 C3/1.4 S3(31jan)