Closed Bug 865311 Opened 12 years ago Closed 10 years ago

Talos process checking is over-ambitious and wrong, Part Deux

Categories

(Testing :: Talos, defect)

All
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rnewman, Unassigned)

References

Details

(Whiteboard: [mozbase][mozharness+talos])

+++ This bug was initially created as a clone of Bug #700722 +++ jhammel: "I would advocate that the longer term fix is moving to mozrunner ( https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check for processes optional and off by default." This process check is biting again, now that Fennec is accruing background services and content providers. It's inaccurate to check for process lifetime on Android. The OS manages processes. Any check for processes should instead be a check for an activity or service. Can we get some traction on the above change?
is this a fennec specific issue, or is this related to all of talos? According to the initial comment this is related to Fennec, but according to the bugs this is blocking this is related to desktop only. Please clarify and specify what we should be doing here? My understanding is there is no conflict in the process checking on mobile.
Sorry, blocked bugs came across in the bug clone. (I don't know why they're there on the clone source.) This is an issue on mobile; see your discussion with gbrown today in #mobile. No test harness should be checking for process existence on Android, unless you just uninstalled the application and want to make sure it's gone.
No longer blocks: 713055, 650887
how could we check for an activity? Checking for a process has worked for years of android automation. I am interested in looking into this. While talking about the bug that gbrown found, we see that the scenario he observed/theorized is only seen on reftest and mochitest, not on talos. This is odd because talos is much different than reftest/mochitest. Here are some stats: http://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04-24&tree=trunk
(In reply to Joel Maher (:jmaher) from comment #3) > how could we check for an activity? Checking for a process has worked for > years of android automation. I guess my question is "why do you need to check?". As far as I can tell, you can use adb shell dumpsys | fgrep FocusedWindow to see what the frontmost activity is. But if you kill it with `adb am`, it -- and all of its background processes -- should be dead. > I am interested in looking into this. While > talking about the bug that gbrown found, we see that the scenario he > observed/theorized is only seen on reftest and mochitest, not on talos. Feel free to morph this bug accordingly :D > This is odd because talos is much different than reftest/mochitest. Here > are some stats: > http://brasstacks.mozilla.com/orangefactor/index. > html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04- > 24&tree=trunk Note that test timeouts are not necessarily the result of this bug, as I'm sure you're aware!
We don't have direct adb access from the machine, but our sutagent can execute a dumpsys. Solving the bug that gbrown was talking about earlier today will require actually killing the browser, because we would still have the same foreground window. In all our automation (desktop,fennec,b2g) we launch a process and keep track of it. I believe we could switch to log file polling.
Dumpsys outputs A LOT of information, I'm not sure if calling that is a good idea. Potentially we could add support inside the agent for getting this information though: http://stackoverflow.com/questions/5446565/android-how-do-i-check-if-activity-is-running
adb shell dumpsys window input | wc -l 68 about the same amount of lines we have from 'adb shell ps'
in looking at using another method for this other than process checking, we need to know when the test is completed. Right now we determine that by ensuring: 1) the process is dead 2) no output in the log file for <timeout> duration <- this generates a failure To detect if the process is dead, we poll the process and wait for it to not show up in the process list. On desktop, we have a handle to the process and if the handle is invalid we know the process has terminated. Switching to 'FocusedWindow' method will not solve any bugs that would show up by polling the process list.
(In reply to Joel Maher (:jmaher) from comment #8) > Switching to 'FocusedWindow' method will not solve any bugs that would show > up by polling the process list. Using FocusedWindow and looking for a Fennec activity will determine when the activity has finished, even if background services are still running in the same process. It will also prevent your checks from being incorrect when you kill a process and Android restarts it (which is the mechanism by which gbrown was affected). Kill the process and then check whether the foreground activity is Fennec's, rather than kill the process and see if it's alive. That you don't get a false positive in situations where Android has decided that the process should be running does indeed solve bugs that show up by polling the process list. My fundamental point is that equating "process running" with "tests still running" is an invalid equation on Android, because Android is in charge of when a process with a given name should be alive. > in looking at using another method for this other than process checking, we > need to know when the test is completed. It sounds like log file watching is the way to go. You already need to do 'smart' log watching to determine whether tests have output anything in a time window, so you might as well watch for "AND I'M DONE!".
To clarify my observations, see https://tbpl.mozilla.org/?tree=Try&rev=f24787dde0a1. With reference to the first Panda rc1 run, see https://tbpl.mozilla.org/php/getParsedLog.php?id=22172318&tree=Try&full=1, which has multiple logcats, due to the enhanced logging -- see the logcat that starts at 04-23 20:03:13.906. remoteautomation.py's RProcess.wait loops for up to 3600 seconds (1 hour) dumping stdout until dm.processExist("org.mozilla.fennec") returns false. The logcat shows that the test ends successfully and apparently normally after a few minutes: 04-23 20:03:22.289 D/GeckoProfile( 3151): Found profile dir: ... ... 04-23 20:03:38.031 I/TestRunner( 3151): finished: testAboutPage(org.mozilla.fennec.tests.testAboutPage) 04-23 20:03:38.031 I/TestRunner( 3151): passed: testAboutPage(org.mozilla.fennec.tests.testAboutPage) 04-23 20:03:38.039 D/AndroidRuntime( 3140): Shutting down VM 04-23 20:03:38.039 I/ActivityManager( 1402): Force stopping package org.mozilla.fennec uid=10046 04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3197:org.mozilla.f3nn3c.UpdateService/10046: force stop 04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3151:org.mozilla.fennec/10046: force stop 04-23 20:03:38.062 I/AndroidRuntime( 3140): NOTE: attach of thread 'Binder Thread #3' failed but then org.mozilla.fennec is restarted: 04-23 20:03:59.976 I/ActivityManager( 1402): Start proc org.mozilla.fennec for broadcast org.mozilla.fennec/org.mozilla.gecko.background.announcements.AnnouncementsStartReceiver: pid=3231 uid=10046 gids={3003, 1015, 1006} 04-23 20:04:00.093 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.browser: org.mozilla.gecko.db.BrowserProvider 04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.tabs: org.mozilla.gecko.db.TabsProvider 04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.formhistory: org.mozilla.gecko.db.FormHistoryProvider and sutAgent messages in the logcat confirm that RProcess.wait is continuing to loop waiting for org.mozilla.fennec to end. Device info dumped by my modification to RProcess.wait confirms that the device manager is finding org.mozilla.fennec with pid=3231 after 10 minutes of waiting: {'uptime': ... 'process': [['10046', '3231', 'org.mozilla.fennec'], ['10044', '1882', 'com.mozilla.SUTAgentAndroid']... For some reason (maybe just timing?), this problem only affects some runs of testAboutPage. Note that these observations apply to: - mochitest-robotium (rc1, rc2) - Panda, Android 4.0 - remoteautomation / RProcess.wait I suspect that there are similar issues in other test suites and remote platforms, but I don't know, and don't have any evidence one way or the other. > My fundamental point is that equating "process running" with "tests still > running" is an invalid equation on Android, because Android is in charge of when > a process with a given name should be alive. This resonates with me. At the same time, when I look at devicemanager / remoteautomation code, it seems "right" for that code to launch a process and then use a ps process list to determine if the process is still running. Another option to consider: sutAgent currently keeps a process object for launched processes and waits for some period of time for it to complete. We don't rely on that currently, but perhaps we could with some sut + devicemanagerSUT changes.
Another option: remoteautomation could note the pid of the launched process, and wait for the pid to end, rather than the process name.
(In reply to Richard Newman [:rnewman] from comment #9) > (In reply to Joel Maher (:jmaher) from comment #8) > It sounds like log file watching is the way to go. You already need to do > 'smart' log watching to determine whether tests have output anything in a > time window, so you might as well watch for "AND I'M DONE!". I wonder if it wouldn't be even easier just to have an API endpoint in the webserver which is called by client code upon completion. This would eliminate the need to continually poll the log file.
(In reply to Geoff Brown [:gbrown] from comment #11) > Another option: remoteautomation could note the pid of the launched process, > and wait for the pid to end, rather than the process name. That doesn't work well, because sutAgent's exec may wait until the launched process completes before returning; there isn't currently an opportunity for devicemanager or remoteautomation to look up the pid until that exec command completes, and that may be too late. It looks to me like any pid-based solution would require sutAgent changes.
An api endpoint would require us to modify the mochitest harness (fairly easy, but might raise a few eyebrows) and the same with the reftest harness. I am not sure about xpcshell. For talos we would have to modify the harness as well. My goal is to keep harnesses as similar as possible. A SUTAgent/DM change might be the best choice. Polling the log file is hacky. Whatever we do, I would like to treat this as we do the desktop tests- if we make a major difference here, lets do it on desktop.
I'm still in favor of polling activities. IMO on Android, an activity is closer in concept to a desktop process. A "process" on Android is more of an implementation detail that we shouldn't rely on. Can we use, for example, ActivityManager.getRunningTasks in SUTAgent? http://developer.android.com/reference/android/app/ActivityManager.html#getRunningTasks%28int%29
Blocks: 663657
Try run showing Panda rc stability with announcements disabled: https://tbpl.mozilla.org/?tree=Try&rev=a8711d551a68 (evidence that our panda rc woes are caused by service restarts / process management).
I can think of two semi-hacky ways of working around the immediate issue that gbrown is seeing: 1) The first would be to parse the log files for the INFO SimpleTests FINISHED string. 2) The second would be to store the pid of the process on startup and poll the list of processes during the waitForFinish call to see if that pid still exists. Both of these could be done with changes to remoteautomation.py. Currently we are polling based upon the process name (e.g. org.mozilla.fennec) and so miss restarts. Polling based upon activities would not fix the issue gbrown found because the activity ends up being restarted, and we would still end up waiting until we timeout. If Android kills the fennec process (due to OOM, I guess) and restarts it, will that test run still be valid? If it is still valid, we should go with the log file. Otherwise, we should look at the pids.
(In reply to Dan Minor [:dminor] from comment #17) > Currently we are polling based upon the process name (e.g. > org.mozilla.fennec) and so miss restarts. Polling based upon activities > would not fix the issue gbrown found because the activity ends up being > restarted, and we would still end up waiting until we timeout. The *activity* isn't restarted. The *process* is restarted to continue to host background services and content providers.
(In reply to Dan Minor [:dminor] from comment #17) > 2) The second would be to store the pid of the process on startup and poll > the list of processes during the waitForFinish call to see if that pid still > exists. I don't think it is currently possible for remoteautomation to get the pid of the process on startup -- see my argument with myself in comment 13.
> The *activity* isn't restarted. The *process* is restarted to continue to > host background services and content providers. Ok, to make sure I follow you, there will be no fennec activity even though a fennec process is running?
(In reply to Geoff Brown [:gbrown] from comment #19) > (In reply to Dan Minor [:dminor] from comment #17) > > 2) The second would be to store the pid of the process on startup and poll > > the list of processes during the waitForFinish call to see if that pid still > > exists. > > I don't think it is currently possible for remoteautomation to get the pid > of the process on startup -- see my argument with myself in comment 13. I had thought to get the PID shortly after startup through python calls, but maybe that wouldn't be reliable enough.
we can get a pid from the current ps we do to poll the process: Trying 192.168.1.86... Connected to 192.168.1.86. Escape character is '^]'. $>ps 10000 1520 android.process.media 10047 3992 org.mozilla.fennec 10044 1921 com.mozilla.SUTAgentAndroid 10007 1632 com.android.smspush 1001 1575 com.android.phone 10045 1533 net.rocboronat.android.wallpaper.npe 10014 1546 com.android.inputmethod.latin 1000 1426 system 10036 2029 com.android.quicksearchbox 10028 2012 com.svox.pico 10017 1997 com.android.gallery3d 10008 1982 com.android.musicfx 10027 1967 com.android.defcontainer 10001 1615 android.process.acore 10038 1819 com.android.providers.calendar 10021 1892 com.android.calendar 10030 1716 com.android.deskclock 10024 1598 com.android.launcher 1000 1873 com.android.settings 10031 1732 com.android.email 10033 1765 com.android.exchange 10043 1836 com.mozilla.watcher 1000 1504 com.android.systemui $> why don't we just ensure the pid:procname match up, that can be done in remoteautomation.py (and we should ensure this in talos as well). I would not like to poll the log file as other harnesses don't do that.
Possibly relevant: ahal's been working on refactoring the logic for launching applications for b2g in bug 865349. If he succeeds (and I think he will), we'll probably want to do a similar refactoring for Android. CC'ing him on this bug.
(In reply to Dan Minor [:dminor] from comment #20) > > The *activity* isn't restarted. The *process* is restarted to continue to > > host background services and content providers. > > Ok, to make sure I follow you, there will be no fennec activity even though > a fennec process is running? Correct. Android uses the process to run all of the code that ships in the Fennec APK -- content providers, background services, alarm handlers, etc. The Fennec browser activity -- named "@ANDROID_PACKAGE_NAME@.App" -- is just one of those things. The issue here is that the tools are using the process as a proxy for that activity. In trying to quit the activity they force-kill the process. Android says "tut tut" and fixes the mistake by restarting the process to handle the announcements service, shortly prior to the tools checking for success by seeing if the process is dead. Ideally you would gracefully close the activity (e.g., by using a custom broadcast intent, calling Activity.finish() when it's done, etc.) rather than nuking the process.
(In reply to Joel Maher (:jmaher) from comment #22) > why don't we just ensure the pid:procname match up, that can be done in > remoteautomation.py (and we should ensure this in talos as well). I think that won't work reliably, at least for robocop-via-sut, because: - the process is launched via dmSUT.fireProcess: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/testing/mozbase/mozdevice/mozdevice/devicemanagerSUT.py#l466 - sutAgent's exec will wait for up to 300 seconds for the process to end before returning: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/mobile/sutagent/android/DoCommand.java#l3734 - by the time automation/remoteautomation gets back control: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/automation.py.in#l1225 it is too late -- we don't know if the process we launched is still running or completed and was restarted.
Depends on: 865944
Blocks: 873892
Blocks: 872477
now that we use topactivity for finding the process, can we resolve this bug?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.