Closed
Bug 865311
Opened 12 years ago
Closed 10 years ago
Talos process checking is over-ambitious and wrong, Part Deux
Categories
(Testing :: Talos, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rnewman, Unassigned)
References
Details
(Whiteboard: [mozbase][mozharness+talos])
+++ This bug was initially created as a clone of Bug #700722 +++
jhammel: "I would advocate that the longer term fix is moving to mozrunner ( https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check for processes optional and off by default."
This process check is biting again, now that Fennec is accruing background services and content providers.
It's inaccurate to check for process lifetime on Android. The OS manages processes. Any check for processes should instead be a check for an activity or service.
Can we get some traction on the above change?
Comment 1•12 years ago
|
||
is this a fennec specific issue, or is this related to all of talos? According to the initial comment this is related to Fennec, but according to the bugs this is blocking this is related to desktop only. Please clarify and specify what we should be doing here? My understanding is there is no conflict in the process checking on mobile.
Reporter | ||
Comment 2•12 years ago
|
||
Sorry, blocked bugs came across in the bug clone. (I don't know why they're there on the clone source.)
This is an issue on mobile; see your discussion with gbrown today in #mobile.
No test harness should be checking for process existence on Android, unless you just uninstalled the application and want to make sure it's gone.
Comment 3•12 years ago
|
||
how could we check for an activity? Checking for a process has worked for years of android automation. I am interested in looking into this. While talking about the bug that gbrown found, we see that the scenario he observed/theorized is only seen on reftest and mochitest, not on talos. This is odd because talos is much different than reftest/mochitest. Here are some stats:
http://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04-24&tree=trunk
Reporter | ||
Comment 4•12 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #3)
> how could we check for an activity? Checking for a process has worked for
> years of android automation.
I guess my question is "why do you need to check?".
As far as I can tell, you can use
adb shell dumpsys | fgrep FocusedWindow
to see what the frontmost activity is. But if you kill it with `adb am`, it -- and all of its background processes -- should be dead.
> I am interested in looking into this. While
> talking about the bug that gbrown found, we see that the scenario he
> observed/theorized is only seen on reftest and mochitest, not on talos.
Feel free to morph this bug accordingly :D
> This is odd because talos is much different than reftest/mochitest. Here
> are some stats:
> http://brasstacks.mozilla.com/orangefactor/index.
> html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04-
> 24&tree=trunk
Note that test timeouts are not necessarily the result of this bug, as I'm sure you're aware!
Comment 5•12 years ago
|
||
We don't have direct adb access from the machine, but our sutagent can execute a dumpsys. Solving the bug that gbrown was talking about earlier today will require actually killing the browser, because we would still have the same foreground window.
In all our automation (desktop,fennec,b2g) we launch a process and keep track of it. I believe we could switch to log file polling.
Comment 6•12 years ago
|
||
Dumpsys outputs A LOT of information, I'm not sure if calling that is a good idea. Potentially we could add support inside the agent for getting this information though:
http://stackoverflow.com/questions/5446565/android-how-do-i-check-if-activity-is-running
Comment 7•12 years ago
|
||
adb shell dumpsys window input | wc -l
68
about the same amount of lines we have from 'adb shell ps'
Comment 8•12 years ago
|
||
in looking at using another method for this other than process checking, we need to know when the test is completed. Right now we determine that by ensuring:
1) the process is dead
2) no output in the log file for <timeout> duration <- this generates a failure
To detect if the process is dead, we poll the process and wait for it to not show up in the process list. On desktop, we have a handle to the process and if the handle is invalid we know the process has terminated.
Switching to 'FocusedWindow' method will not solve any bugs that would show up by polling the process list.
Reporter | ||
Comment 9•12 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #8)
> Switching to 'FocusedWindow' method will not solve any bugs that would show
> up by polling the process list.
Using FocusedWindow and looking for a Fennec activity will determine when the activity has finished, even if background services are still running in the same process.
It will also prevent your checks from being incorrect when you kill a process and Android restarts it (which is the mechanism by which gbrown was affected). Kill the process and then check whether the foreground activity is Fennec's, rather than kill the process and see if it's alive.
That you don't get a false positive in situations where Android has decided that the process should be running does indeed solve bugs that show up by polling the process list.
My fundamental point is that equating "process running" with "tests still running" is an invalid equation on Android, because Android is in charge of when a process with a given name should be alive.
> in looking at using another method for this other than process checking, we
> need to know when the test is completed.
It sounds like log file watching is the way to go. You already need to do 'smart' log watching to determine whether tests have output anything in a time window, so you might as well watch for "AND I'M DONE!".
Comment 10•12 years ago
|
||
To clarify my observations, see https://tbpl.mozilla.org/?tree=Try&rev=f24787dde0a1. With reference to the first Panda rc1 run, see https://tbpl.mozilla.org/php/getParsedLog.php?id=22172318&tree=Try&full=1, which has multiple logcats, due to the enhanced logging -- see the logcat that starts at 04-23 20:03:13.906.
remoteautomation.py's RProcess.wait loops for up to 3600 seconds (1 hour) dumping stdout until dm.processExist("org.mozilla.fennec") returns false. The logcat shows that the test ends successfully and apparently normally after a few minutes:
04-23 20:03:22.289 D/GeckoProfile( 3151): Found profile dir: ...
...
04-23 20:03:38.031 I/TestRunner( 3151): finished: testAboutPage(org.mozilla.fennec.tests.testAboutPage)
04-23 20:03:38.031 I/TestRunner( 3151): passed: testAboutPage(org.mozilla.fennec.tests.testAboutPage)
04-23 20:03:38.039 D/AndroidRuntime( 3140): Shutting down VM
04-23 20:03:38.039 I/ActivityManager( 1402): Force stopping package org.mozilla.fennec uid=10046
04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3197:org.mozilla.f3nn3c.UpdateService/10046: force stop
04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3151:org.mozilla.fennec/10046: force stop
04-23 20:03:38.062 I/AndroidRuntime( 3140): NOTE: attach of thread 'Binder Thread #3' failed
but then org.mozilla.fennec is restarted:
04-23 20:03:59.976 I/ActivityManager( 1402): Start proc org.mozilla.fennec for broadcast org.mozilla.fennec/org.mozilla.gecko.background.announcements.AnnouncementsStartReceiver: pid=3231 uid=10046 gids={3003, 1015, 1006}
04-23 20:04:00.093 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.browser: org.mozilla.gecko.db.BrowserProvider
04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.tabs: org.mozilla.gecko.db.TabsProvider
04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.formhistory: org.mozilla.gecko.db.FormHistoryProvider
and sutAgent messages in the logcat confirm that RProcess.wait is continuing to loop waiting for org.mozilla.fennec to end.
Device info dumped by my modification to RProcess.wait confirms that the device manager is finding org.mozilla.fennec with pid=3231 after 10 minutes of waiting:
{'uptime': ... 'process': [['10046', '3231', 'org.mozilla.fennec'],
['10044', '1882', 'com.mozilla.SUTAgentAndroid']...
For some reason (maybe just timing?), this problem only affects some runs of testAboutPage.
Note that these observations apply to:
- mochitest-robotium (rc1, rc2)
- Panda, Android 4.0
- remoteautomation / RProcess.wait
I suspect that there are similar issues in other test suites and remote platforms, but I don't know, and don't have any evidence one way or the other.
> My fundamental point is that equating "process running" with "tests still
> running" is an invalid equation on Android, because Android is in charge of when
> a process with a given name should be alive.
This resonates with me. At the same time, when I look at devicemanager / remoteautomation code, it seems "right" for that code to launch a process and then use a ps process list to determine if the process is still running.
Another option to consider: sutAgent currently keeps a process object for launched processes and waits for some period of time for it to complete. We don't rely on that currently, but perhaps we could with some sut + devicemanagerSUT changes.
Comment 11•12 years ago
|
||
Another option: remoteautomation could note the pid of the launched process, and wait for the pid to end, rather than the process name.
Comment 12•12 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #9)
> (In reply to Joel Maher (:jmaher) from comment #8)
> It sounds like log file watching is the way to go. You already need to do
> 'smart' log watching to determine whether tests have output anything in a
> time window, so you might as well watch for "AND I'M DONE!".
I wonder if it wouldn't be even easier just to have an API endpoint in the webserver which is called by client code upon completion. This would eliminate the need to continually poll the log file.
Comment 13•12 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #11)
> Another option: remoteautomation could note the pid of the launched process,
> and wait for the pid to end, rather than the process name.
That doesn't work well, because sutAgent's exec may wait until the launched process completes before returning; there isn't currently an opportunity for devicemanager or remoteautomation to look up the pid until that exec command completes, and that may be too late. It looks to me like any pid-based solution would require sutAgent changes.
Comment 14•12 years ago
|
||
An api endpoint would require us to modify the mochitest harness (fairly easy, but might raise a few eyebrows) and the same with the reftest harness. I am not sure about xpcshell. For talos we would have to modify the harness as well.
My goal is to keep harnesses as similar as possible. A SUTAgent/DM change might be the best choice. Polling the log file is hacky. Whatever we do, I would like to treat this as we do the desktop tests- if we make a major difference here, lets do it on desktop.
Comment 15•12 years ago
|
||
I'm still in favor of polling activities. IMO on Android, an activity is closer in concept to a desktop process. A "process" on Android is more of an implementation detail that we shouldn't rely on.
Can we use, for example, ActivityManager.getRunningTasks in SUTAgent? http://developer.android.com/reference/android/app/ActivityManager.html#getRunningTasks%28int%29
Comment 16•12 years ago
|
||
Try run showing Panda rc stability with announcements disabled: https://tbpl.mozilla.org/?tree=Try&rev=a8711d551a68 (evidence that our panda rc woes are caused by service restarts / process management).
Comment 17•12 years ago
|
||
I can think of two semi-hacky ways of working around the immediate issue that gbrown is seeing:
1) The first would be to parse the log files for the INFO SimpleTests FINISHED string.
2) The second would be to store the pid of the process on startup and poll the list of processes during the waitForFinish call to see if that pid still exists.
Both of these could be done with changes to remoteautomation.py.
Currently we are polling based upon the process name (e.g. org.mozilla.fennec) and so miss restarts. Polling based upon activities would not fix the issue gbrown found because the activity ends up being restarted, and we would still end up waiting until we timeout.
If Android kills the fennec process (due to OOM, I guess) and restarts it, will that test run still be valid? If it is still valid, we should go with the log file. Otherwise, we should look at the pids.
Reporter | ||
Comment 18•12 years ago
|
||
(In reply to Dan Minor [:dminor] from comment #17)
> Currently we are polling based upon the process name (e.g.
> org.mozilla.fennec) and so miss restarts. Polling based upon activities
> would not fix the issue gbrown found because the activity ends up being
> restarted, and we would still end up waiting until we timeout.
The *activity* isn't restarted. The *process* is restarted to continue to host background services and content providers.
Comment 19•12 years ago
|
||
(In reply to Dan Minor [:dminor] from comment #17)
> 2) The second would be to store the pid of the process on startup and poll
> the list of processes during the waitForFinish call to see if that pid still
> exists.
I don't think it is currently possible for remoteautomation to get the pid of the process on startup -- see my argument with myself in comment 13.
Comment 20•12 years ago
|
||
> The *activity* isn't restarted. The *process* is restarted to continue to
> host background services and content providers.
Ok, to make sure I follow you, there will be no fennec activity even though a fennec process is running?
Comment 21•12 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #19)
> (In reply to Dan Minor [:dminor] from comment #17)
> > 2) The second would be to store the pid of the process on startup and poll
> > the list of processes during the waitForFinish call to see if that pid still
> > exists.
>
> I don't think it is currently possible for remoteautomation to get the pid
> of the process on startup -- see my argument with myself in comment 13.
I had thought to get the PID shortly after startup through python calls, but maybe that wouldn't be reliable enough.
Comment 22•12 years ago
|
||
we can get a pid from the current ps we do to poll the process:
Trying 192.168.1.86...
Connected to 192.168.1.86.
Escape character is '^]'.
$>ps
10000 1520 android.process.media
10047 3992 org.mozilla.fennec
10044 1921 com.mozilla.SUTAgentAndroid
10007 1632 com.android.smspush
1001 1575 com.android.phone
10045 1533 net.rocboronat.android.wallpaper.npe
10014 1546 com.android.inputmethod.latin
1000 1426 system
10036 2029 com.android.quicksearchbox
10028 2012 com.svox.pico
10017 1997 com.android.gallery3d
10008 1982 com.android.musicfx
10027 1967 com.android.defcontainer
10001 1615 android.process.acore
10038 1819 com.android.providers.calendar
10021 1892 com.android.calendar
10030 1716 com.android.deskclock
10024 1598 com.android.launcher
1000 1873 com.android.settings
10031 1732 com.android.email
10033 1765 com.android.exchange
10043 1836 com.mozilla.watcher
1000 1504 com.android.systemui
$>
why don't we just ensure the pid:procname match up, that can be done in remoteautomation.py (and we should ensure this in talos as well). I would not like to poll the log file as other harnesses don't do that.
Comment 23•12 years ago
|
||
Possibly relevant: ahal's been working on refactoring the logic for launching applications for b2g in bug 865349. If he succeeds (and I think he will), we'll probably want to do a similar refactoring for Android. CC'ing him on this bug.
Reporter | ||
Comment 24•12 years ago
|
||
(In reply to Dan Minor [:dminor] from comment #20)
> > The *activity* isn't restarted. The *process* is restarted to continue to
> > host background services and content providers.
>
> Ok, to make sure I follow you, there will be no fennec activity even though
> a fennec process is running?
Correct. Android uses the process to run all of the code that ships in the Fennec APK -- content providers, background services, alarm handlers, etc.
The Fennec browser activity -- named "@ANDROID_PACKAGE_NAME@.App" -- is just one of those things.
The issue here is that the tools are using the process as a proxy for that activity. In trying to quit the activity they force-kill the process. Android says "tut tut" and fixes the mistake by restarting the process to handle the announcements service, shortly prior to the tools checking for success by seeing if the process is dead.
Ideally you would gracefully close the activity (e.g., by using a custom broadcast intent, calling Activity.finish() when it's done, etc.) rather than nuking the process.
Comment 25•12 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #22)
> why don't we just ensure the pid:procname match up, that can be done in
> remoteautomation.py (and we should ensure this in talos as well).
I think that won't work reliably, at least for robocop-via-sut, because:
- the process is launched via dmSUT.fireProcess: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/testing/mozbase/mozdevice/mozdevice/devicemanagerSUT.py#l466
- sutAgent's exec will wait for up to 300 seconds for the process to end before returning: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/mobile/sutagent/android/DoCommand.java#l3734
- by the time automation/remoteautomation gets back control: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/automation.py.in#l1225
it is too late -- we don't know if the process we launched is still running or completed and was restarted.
Comment 26•11 years ago
|
||
now that we use topactivity for finding the process, can we resolve this bug?
Updated•10 years ago
|
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•