Closed Bug 1593785 Opened 6 years ago Closed 6 years ago

Android PGO run tasks failing when build is optimized for size, "-Os"

Categories

(Firefox Build System :: Android Studio and Gradle Integration, defect)

ARM
Android
defect
Not set
normal

Tracking

(firefox73 fixed)

RESOLVED FIXED
mozilla73

People

(Reporter: acreskey, Assigned: aerickson)

Attachments

(1 file)

In Bug 1591725 we're looking at different build optimization flags for Android.

One of these options, -Os, is leading to PGO run failures such as this one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b33251696b529470c49b85788ffa3105c29f78&selectedJob=273808395

From the logs:

[task 2019-10-31T00:14:15.434Z] 00:14:15     INFO - Running main action method: install
[task 2019-10-31T00:15:55.071Z] 00:15:55     INFO - Failed to install /builds/worker/fetches/geckoview-androidTest.apk on None: ADBError install failed for /builds/worker/fetches/geckoview-androidTest.apk. Got: Performing Push Install
[task 2019-10-31T00:15:55.071Z] 00:15:55     INFO - /builds/worker/fetches/geckoview-androidTest.apk: 1 file pushed. 2.3 MB/s (155123944 bytes in 65.077s)
[task 2019-10-31T00:15:55.071Z] 00:15:55     INFO - 	pkg: /data/local/tmp/geckoview-androidTest.apk
[task 2019-10-31T00:15:55.071Z] 00:15:55     INFO - Failure [INSTALL_FAILED_CONTAINER_ERROR]
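
The push line in this log carries the APK size and transfer time, which makes it easy to compare builds. A minimal Python sketch for pulling those numbers out follows; the helper name `parse_push_line` and the regular expression are illustrative assumptions, not part of mozharness or adb.

```python
import re

# Matches the "(<bytes> bytes in <seconds>s)" suffix of an adb push summary line.
PUSH_RE = re.compile(r"\((\d+) bytes in ([\d.]+)s\)")


def parse_push_line(line):
    """Return (size_in_bytes, seconds) from a push summary line, or None if absent."""
    match = PUSH_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


line = "geckoview-androidTest.apk: 1 file pushed. 2.3 MB/s (155123944 bytes in 65.077s)"
size, seconds = parse_push_line(line)
print("size MiB:", round(size / 2**20, 1))
print("rate MiB/s:", round(size / seconds / 2**20, 1))
```

For the line above this works out to roughly 148 MiB pushed at about 2.3 MiB/s.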

Locally I've been able to install the artifact: geckoview-androidTest.apk

According to Stack Overflow, this may be mysteriously solvable by adding android:installLocation="internalOnly" to the manifest, or else by increasing the device's virtual memory.

Blocks: 1591725

(In reply to Andrew Creskey from comment #1)

> ... or else increasing the device's virtual memory.

The arm emulator's avd has

hw.ramSize=1024
vm.heapSize=128

I seem to recall that the ramSize is at its maximum for that emulator version + android 4.3 image...but I'm not sure.

I imagine this would not be a problem on the x86_64 emulator -- bug 1548962 -- which, in our standard setup, has much more memory and storage space available.

Summary: PGO run tasks failing when build is optimized for size, "-Os" → Android PGO run tasks failing when build is optimized for size, "-Os"

I did try a push with android:installLocation="internalOnly" in the manifest, but it gives the same result: INSTALL_FAILED_CONTAINER_ERROR.

(In reply to Geoff Brown [:gbrown] from comment #2)

> The arm emulator's avd has
>
> hw.ramSize=1024
> vm.heapSize=128
>
> I seem to recall that the ramSize is at its maximum for that emulator version + android 4.3 image...but I'm not sure.

Here is another report where the avd device memory sizes were increased to avoid INSTALL_FAILED_CONTAINER_ERROR:
https://github.com/flutter/flutter/issues/8824

I tried to increase the memory size in https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a4e44694f269155b7354d880ff8fdb4075ccd9d, but it did not work.

https://taskcluster-artifacts.net/D3t-3_wYSgC9mypNA8ACcw/0/public/build/blobber_upload_dir/emulator-2BZy8P.log

  hw.ramSize = 2048
...
emulator:    3: KEY='hw.ramSize' VALUE='2048'

so the request was recognized, but...

Truncating RAM at 00000000-7fffffff to -33ffffff (vmalloc region overlap).
...
Memory: 832MB = 832MB total
Memory: 840192KB available (2900K code, 707K data, 124K init)

it was truncated.
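
The 832MB figure follows directly from the truncated address range: a region from 0x00000000 through 0x33ffffff inclusive is 0x34000000 bytes. A small Python check of that arithmetic (the function name is hypothetical):

```python
def region_size_mib(end_addr_hex):
    """Size in MiB of a RAM region that starts at 0 and ends at end_addr_hex (inclusive)."""
    return (int(end_addr_hex, 16) + 1) // (1024 * 1024)


print(region_size_mib("7fffffff"))  # requested range: 2048
print(region_size_mib("33ffffff"))  # truncated range: 832
```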

Interesting -- so truncation is happening even with the default requested size, hw.ramSize=1024.

I did try to run the PGO steps locally but it looks like it requires a library that's only in the linux toolchain (I'm on OSX).

42:42.65 /Users/acreskey/ndk/android-ndk-r20/toolchains/arm-linux-androideabi-4.9/prebuilt/darwin-x86_64/lib/gcc/arm-linux-androideabi/4.9.x/../../../../arm-linux-androideabi/bin/ld: error: cannot open /Users/acreskey/.mozbuild/clang/lib/clang/9.0.0/lib/linux/libclang_rt.profile-arm-android.a: No such file or directory

Michael -- am I right that I need linux to build the android PGO-instrumented app?

Flags: needinfo?(mshal)

You don't need Linux, but you do need a clang toolchain that has the necessary runtime libraries for Android, which our bootstrapped toolchains don't.

You should be able to mach artifact toolchain --from-build linux64-clang9-android-cross to get a clang/ directory with the appropriate runtime libraries, and then copy them into the appropriate place under $HOME/.mozbuild/clang.

(In reply to Nathan Froyd [:froydnj] from comment #8)

> You don't need Linux, but you do need a clang toolchain that has the necessary runtime libraries for Android, which our bootstrapped toolchains don't.
>
> You should be able to mach artifact toolchain --from-build linux64-clang9-android-cross to get a clang/ directory with the appropriate runtime libraries, and then copy them into the appropriate place under $HOME/.mozbuild/clang.

That would be great.
What I'm seeing is this error:

./mach artifact toolchain --from-build linux64-clang9-android-cross
... Could not find a toolchain build named `linux64-clang9-android-cross`

Sigh, try --from-build linux64-clang-android-cross.

Progress... I'm getting this (at the tip of mozilla-central), so I can try the fresh checkout.

./mach artifact toolchain --from-build linux64-clang-android-cross
 0:05.21 Could not find artifacts for a toolchain build named `linux64-clang-android-cross`. Local commits and other changes in your checkout may cause this error. Try updating to a fresh checkout of mozilla-central to use artifact builds.

Thank you Nathan, ./mach artifact toolchain --from-build linux64-clang-android-cross worked great in a fresh repo.

Next issue is that the android MOZ_PGO build appears to be looking for fennec instead of the geckoview-androidTest.apk (surprising given Bug 1582221), but perhaps that's because my local builds are unsigned and have different binary names.
If I modify build/pgo/profileserver.py to use the binary that I just built, geckoview-withGeckoBinaries-debug-androidTest.apk, it proceeds further, until I hit Permission denied starting the Firefox Runner.

 3:24.63 ['/Users/acreskey/dev/firefox/src/build/obj-release-android/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk', 'data:text/html,<script>Quitter.quit()</script>', '-foreground', '-profile', '/tmp/tmpoDXDZH']
 3:24.63 Traceback (most recent call last):
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/build/pgo/profileserver.py", line 108, in <module>
 3:24.63     runner.start()
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/testing/mozbase/mozrunner/mozrunner/base/browser.py", line 85, in start
 3:24.63     BaseRunner.start(self, *args, **kwargs)
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 136, in start
 3:24.63     reraise(RunnerNotStartedError, "Failed to start the process: %s" % value, tb)
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 131, in start
 3:24.63     process.run(self.timeout, self.output_timeout)
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 811, in run
 3:24.63     self.proc = self.Process([self.cmd] + self.args, **args)
 3:24.63   File "/Users/acreskey/dev/firefox/src/mozilla-central/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 123, in __init__
 3:24.63     universal_newlines, startupinfo, creationflags)
 3:24.63   File "/Users/acreskey/.pyenv/versions/2.7.11/lib/python2.7/subprocess.py", line 710, in __init__
 3:24.63     errread, errwrite)
 3:24.63   File "/Users/acreskey/.pyenv/versions/2.7.11/lib/python2.7/subprocess.py", line 1335, in _execute_child
 3:24.63     raise child_exception
 3:24.63 mozrunner.errors.RunnerNotStartedError: Failed to start the process: [Errno 13] Permission denied
 3:25.11 make[1]: *** [profiledbuild] Error 1
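
One plausible reading of the Errno 13 above, offered as an assumption rather than a confirmed diagnosis: the runner is handed the .apk path as if it were a host executable, and the OS refuses to exec a file that has no execute permission. The sketch below reproduces that class of failure with a throwaway file on a POSIX host; the temporary file and the helper `try_exec` are hypothetical.

```python
import errno
import os
import subprocess
import tempfile


def try_exec(path):
    """Try to launch path as a program; return None on success or the errno on failure."""
    try:
        subprocess.Popen([path]).wait()
    except OSError as exc:
        return exc.errno
    return None


# Create a small non-executable file that merely looks like an APK (zip magic bytes).
with tempfile.NamedTemporaryFile(suffix=".apk", delete=False) as handle:
    handle.write(b"PK\x03\x04")
    apk_path = handle.name
os.chmod(apk_path, 0o644)  # readable, but no execute bit

result = try_exec(apk_path)
os.unlink(apk_path)
print("permission denied:", result == errno.EACCES)
```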

But I might be going down a rabbit hole in trying to build my own local PGO -Os build.

The problem is that even if I get this running, I won't be able to run high job count raptor performance tests with it, as I can on try.

I wonder if it would be possible to do a test using this android-7.0 emulator?
https://searchfox.org/mozilla-central/source/testing/config/tooltool-manifests/androidarm_7_0/mach-emulator.manifest
If I change the definition of the 4.3 device here ...

I wouldn't recommend that, based on the experience in bug 1519489 (but I'm not entirely sure).

I've had no end of frustration with Android arm emulators in general: They tend to be very slow and sometimes unreliable. Why not make the switch to x86_64, bug 1548962?

Ah, that's good to know.
I did do a hacked test, attempting to use the android 7 device, but it didn't start. Not an area that I know very much about, so this could be incorrectly set up.

So maybe using the x86_64 build to generate the PGO data is the best next step.

Personally I have enough experience with the PGO setup and automation to make this change. But maybe we want to prioritize this.

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #15)

> Ah, that's good to know.
> I did do a hacked test, attempting to use the android 7 device, but it didn't start. Not an area that I know very much about, so this could be incorrectly set up.
>
> So maybe using the x86_64 build to generate the PGO data is the best next step.
>
> Personally I have enough experience with the PGO setup and automation to make this change. But maybe we want to prioritize this.

Switching to the x86_64 emulator is not particularly hard (I have some patches ready to go), but it is blocked on bug 1545497. :pmoore, has that bug stalled? Any idea what's left to get that finalized so we can stop using the android 4.3 emulator for PGO?

Flags: needinfo?(mshal) → needinfo?(pmoore)

We do now have generic-worker multiuser engine on linux, so we can run tasks securely on a linux host machine, outside of docker.

These tasks would most likely run as non-privileged users on the host - is that also ok?

I'm assuming no containers need to be created in the tasks themselves, but if they do, we should probably consider using something like podman that supports running containers as non-privileged users on the host.

If this meets your requirements, the next steps would be setting up a dedicated generic-worker linux worker pool for these tasks, either in GCP or AWS.

Flags: needinfo?(pmoore)

(In reply to Pete Moore [:pmoore][:pete] from comment #17)

> We do now have generic-worker multiuser engine on linux, so we can run tasks securely on a linux host machine, outside of docker.
>
> These tasks would most likely run as non-privileged users on the host - is that also ok?

I think a non-privileged user is fine, as long as they have access to /dev/kvm. Any idea if that's possible? In docker, /dev/kvm access is only provided if run with --privileged. Outside of docker, I think it should work as long as the user that the task is running under is in whatever group has file permissions for /dev/kvm. (Eg: On Ubuntu, /dev/kvm is crw-rw---- 1 root kvm, so adding the user to the 'kvm' group lets them use kvm).
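
A quick way to confirm this from inside a task is to check whether the current user can open the device for reading and writing. A minimal sketch, assuming a Linux host; the helper name `can_use_kvm` is hypothetical:

```python
import os


def can_use_kvm(device="/dev/kvm"):
    """Return True if the device node exists and the current user may read and write it."""
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


print("kvm usable:", can_use_kvm())
```

If this prints False for a user that should have access, group membership for the device (for example the 'kvm' group mentioned above) is the first thing to check.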

> I'm assuming no containers need to be created in the tasks themselves, but if they do, we should probably consider using something like podman that supports running containers as non-privileged users on the host.

I don't believe we need to use a container specifically for Android PGO.

> If this meets your requirements, the next steps would be setting up a dedicated generic-worker linux worker pool for these tasks, either in GCP or AWS.

Should I file a separate bug blocking bug 1545497? Or does that bug cover it?

Flags: needinfo?(pmoore)

I was looking at the attempted -Os PGO builds and I noticed something strange:
The geckoview-androidTest.apk binary is exceptionally large -- 152MB.

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=277666123&revision=04c559e5d973da71fb7bd93963a412006374a013

For the current -Oz builds, I'm seeing a geckoview-androidTest.apk size of 117MB.
For the -O2 build, the geckoview-androidTest.apk is 92MB.

So I think that could explain the INSTALL_FAILED_CONTAINER_ERROR install error.

The difference in apk size comes from libxul.so -- I'm scratching my head as to why the instrumented -Os library is so much larger than the others.
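
To put the three sizes side by side, the relative differences can be computed as follows. This is a small Python sketch using only the sizes quoted above; the dictionary keys are just labels.

```python
# geckoview-androidTest.apk sizes in MB, as reported in this comment.
sizes_mb = {"-Os": 152, "-Oz": 117, "-O2": 92}


def ratio(flag_a, flag_b):
    """How many times larger the flag_a build is than the flag_b build."""
    return sizes_mb[flag_a] / sizes_mb[flag_b]


for flag in ("-Oz", "-O2"):
    print(f"-Os vs {flag}: {ratio('-Os', flag):.2f}x")
```

That is, the -Os APK is about 30% larger than the -Oz one and about 65% larger than the -O2 one.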

(In reply to Michael Shal [:mshal] from comment #18)

> > If this meets your requirements, the next steps would be setting up a dedicated generic-worker linux worker pool for these tasks, either in GCP or AWS.
>
> Should I file a separate bug blocking bug 1545497? Or does that bug cover it?

A separate bug is probably best. Regarding the setup of the linux host, most of this can be reasonably self-serve. I'd recommend you cargo cult the gwci-linux machine image definition, remove lines 40-49 (since docker is not required), add any steps you need for toolchains on the host etc., and place it in a new directory (e.g. /worker_types/android-pgo) together with the other files copied from the /worker_types/gwci-linux directory, adapted as needed. Submit a generic-worker PR once you believe you have everything on the host you need. Unfortunately the gnome3 desktop is currently required, so you'll need to leave that in for now. At some point, we'll drop the requirement for gnome3 to support headless tasks, but currently all generic-worker linux multiuser tasks run under a gnome3 graphical desktop environment.

Anyone with permissions to create AMIs in the production EC2 or GCP Compute Engine account(s) can run the worker_type.sh script to generate the machine images, and then, assuming this will run in the firefox-ci cluster, a ci-configuration patch will be required to update worker-images.yml and worker-pools.yml with the machine image ids etc.

Flags: needinfo?(pmoore)

In terms of using the existing emulator, I wonder if it would be possible to make the profile-generate build smaller so that it could fit on the device? Something that can be stripped out of the library just when doing the profiling run?
(assuming binary size is the problem).

Michael was able to reproduce the INSTALL_FAILED_CONTAINER_ERROR locally with the default AVD.

With the disk.dataPartition.size increased from 600M to 1200M in avd/test-1.avd/config.ini he was able to install and run the profiling build on the emulator.

Geoff, how feasible is it to increase the dataPartition size on this android 4.3 emulator?
(Even if for just a one off test).

Outside of changing the avd definition, I could only find an emulator option to increase the system partition.

Flags: needinfo?(gbrown)

There is an emulator command line option, -partition-size:

https://developer.android.com/studio/run/emulator-commandline

I think you could simply add '-partition-size', '1200' to the emulator arguments at

https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/testing/mozharness/configs/android/androidarm_4_3.py#22

If that's troublesome, we could update config.ini in an updated avd in tooltool.
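
As a sketch of that suggestion, the option and its value are two separate entries in the argument list. The list contents and the helper name below are placeholders for illustration and are not copied from androidarm_4_3.py.

```python
def with_partition_size(emulator_args, size_mb):
    """Return a copy of emulator_args with '-partition-size <size_mb>' present exactly once."""
    args = []
    skip_next = False
    for arg in emulator_args:
        if skip_next:
            # Drop the value that belonged to an existing -partition-size option.
            skip_next = False
            continue
        if arg == "-partition-size":
            skip_next = True
            continue
        args.append(arg)
    return args + ["-partition-size", str(size_mb)]


base_args = ["-show-kernel"]  # placeholder, not the real config contents
print(with_partition_size(base_args, 1200))
```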

Flags: needinfo?(gbrown)

I did try a larger -partition-size yesterday (2 gigs):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=651f3a9789c5a2b70022e7c61aca34f3384b9790&selectedJob=279448542

From the emulator logs, it looks like the setting is sticking:
https://firefoxci.taskcluster-artifacts.net/SdzQ1HmzS2iPY3Kb5Gkfmw/5/public/build/blobber_upload_dir/emulator-FWzJXJ.log

disk.systemPartition.size = 2g

(Otherwise I see disk.systemPartition.size = 221m)

However, I believe this applies only to the system partition, not to the user data partition where the apk would be installed.
That one still shows 600m:
disk.dataPartition.size = 600m

Having an updated avd config.ini would be great from my perspective, it looks like that would unblock us.
I wouldn't know how to do that though.

aerickson - Can you help? The relevant avd is

https://searchfox.org/mozilla-central/source/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest

We need to update only the config.ini in that avd, setting disk.dataPartition.size = 1200M.
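
A minimal sketch of that one-line change, assuming config.ini is a plain list of key=value lines like the hw.ramSize example earlier in this bug. The helper name and the sample contents are hypothetical, and the 'M' suffix is assumed from the 600M and 1200M values mentioned above.

```python
def set_avd_option(config_text, key, value):
    """Return config_text with key set to value, appending the key if it is missing."""
    out = []
    found = False
    for line in config_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            out.append(f"{key}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


sample = "hw.ramSize=1024\ndisk.dataPartition.size=600M\n"  # hypothetical contents
print(set_avd_option(sample, "disk.dataPartition.size", "1200M"), end="")
```

Rewriting the text in memory keeps every other AVD setting untouched, so only the data partition size differs in the repackaged avd.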

Flags: needinfo?(aerickson)

Yeah, absolutely.

Assignee: nobody → aerickson
Flags: needinfo?(aerickson)

I've packaged the new AVD but I'm not able to upload until I get a new taskcluster token for tooltool (tooltool tokens are now disabled).

Depends on: 1601790

I'm confirming that the attached patch allows us to build android at -Os in automation.
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=280129731&revision=eb5d285e277a8575a7900ac92865631163c4c358
Thank you Andrew and everyone who's been helping to move this along.

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/92126172bfbd update 4.3 emulator to have larger disk r=gbrown
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73