Closed
Bug 627794
Opened 15 years ago
Closed 14 years ago
mozmill killed my firefox!
Categories
(Testing :: Mozbase, defect)
Testing
Mozbase
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: k0scist, Assigned: k0scist)
References
Details
(Whiteboard: [mozmill-1.5.2-][mozmill-2.0+])
Attachments
(1 file, 1 obsolete file)
The way we kill e.g. firefox is fairly retarded and may result in
non-mozmill related firefox processes getting killed:
https://github.com/mozautomation/mozmill/blob/master/mozrunner/mozrunner/runner.py#L191
https://github.com/mozautomation/mozmill/blob/master/mozprocess/mozprocess/pid.py#L46
Outside of the bad touch, what is going on here? (Unix case, windows
is probably okay). We kill *any* process with e.g. firefox anywhere
in the ps ax line if the PID is greater than the PID of the initial
process. PIDs are recycled. Linux, by default, only has 32767 of
them. This means if I have started Fx as PID=32000 and I run mozmill,
it will likely get a lower PID and kill my desktop Firefox!
Not to mention things like this:
(autobot)│ps ax | grep 'firefox'
2857 ? Ssl 2:42 /home/jhammel/firefox/firefox-bin
2919 ? Sl 0:57 /home/jhammel/firefox/plugin-container
/usr/lib/adobe-flashplugin/libflashplayer.so -omnijar
/home/jhammel/firefox/omni.jar 2857 true plugin
4586 pts/0 S 0:00
/home/jhammel/mozilla/src/bzconsole/bin/python /home/jhammel/bin/bz
new Mozmill Utilities mozmill killed my firefox!
4711 pts/2 S+ 0:00 grep --colour=auto firefox
So also we need to parse the actual friggin process name vs relying on
looking at the whole line. Should mozmill kill 4586 just because I
gave it an argument that happened to include the word Firefox?
This may not be an easy problem. But it is a critical problem.
Assignee | ||
Updated•15 years ago
|
Assignee: nobody → jhammel
Comment 1•15 years ago
|
||
This has been driving me mad, and it's caused me to lose data. Please mark this P1 critical.
Comment 2•15 years ago
|
||
Kris, can you please give more information, i.e on which platform you are on and how you execute Mozmill. Thanks.
Comment 3•15 years ago
|
||
Linux jg 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:32:37 CET 2010 x86_64 AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55 AuthenticAMD GNU/Linux
I generally run between two and four separate Firefox sessions (often different versions of Firefox). Running 'mozmill' from the command line, no matter the arguments, is enough to kill at least one of these sessions about half of the time. The version killed has no particular to the version run by mozmill.
But I don't think any of this information is necessary. Killing processes unrelated to those explicitly opened by Mozmill is patently unacceptable, and trying to guess which processes are related to those opened by Mozmill by looking at process names and PID counts is not even remotely reasonable. In my experience, sending SIGINT to the first spawned Firefox process is enough to take down the whole browser almost without fail. Sending SIGTERM and SIGKILL does so entirely without fail.
Comment 4•15 years ago
|
||
Thanks Kris. We also promote Mozmill tests via the Mozmill Crowd extension, which could also be affected by this issue. Clint and Jeff, what possibilities do we have for 1.5.2? Or should we follow-up with a short cycle 1.5.3? As more users we want to have as quicker we have to solve this bug.
Whiteboard: [mozmill-1.5.2?]
Assignee | ||
Comment 5•15 years ago
|
||
This is going to be a major change and will probably involve several different large API changes:
- MozMill + MozMillRestart should be merged. We handle stopping in completely different ways for both of the tests; we need to consolidate them to have solid stopping conditions
- we need to figure out a better way of killing processes. quite frankly, I don't know *why* we kill all processes named e.g. Firefox . I doubt anyone else does either. Were I to guess, it was something added to cover a case when Firefox doesn't get shut down, not documented as to why, and then forgotten about. we need to make this both precise and robust.
While I agree that this is a high priority -- in fact, this is my #1 priority for mozmill as of now -- I don't think this is a good idea for 1.5.2. I would recommend concentrate on getting 2.0 out (with as few new features as possible, mostly just making solid what we have) or branching off of master for an additional release branch once this refactor is done.
Whiteboard: [mozmill-1.5.2?] → [mozmill-1.5.2?][mozmill-2.0?]
Comment 6•15 years ago
|
||
I think that putting this off is an extremely bad idea. It doesn't affect me, because I've already removed this code entirely locally, but it has a huge potential to cause data loss for other users. Frankly, the way I see it, accidentally leaving around a useless Firefox instance is a *much* better outcome than accidentally killing one which has vital user data and still significantly better than killing one that a user does not expect to have killed. I don't think that either of these are acceptable scenarios to even allow to pass for a minor release.
As for the possibility of a Firefox instance not being killed, while I suppose it's possible that this code was implemented because a developer saw that in the past, I launch Firefox from the command line and kill it by sending SIGINT to the spawned process dozens of times a day, and have done for easily two years now, and I've never seen it fail to completely kill said instance.
Assignee | ||
Comment 7•15 years ago
|
||
Its not being put off, and, like i said, this and related bugs are my top priority. There's just no way to take for 1.5.x without breaking backwards compatability or changing the API. Its a big change and will require much testing. Mozmill 2. should be released by end of march.
Assignee | ||
Comment 8•15 years ago
|
||
Instead of iterating through the names, if Firefox is started with NO_EM_RESTART, http://mxr.mozilla.org/mozilla-central/search?string=NO_EM_RESTART , then killing the singular Firefox should be sufficient. All of this might change with electrolysis. This should at least be tried, as a simple solution. If this does *not* work, we should consider some e.g. plugin based solution, since within Fx you *should* be able to get an accurate list of interesting PIDs.
Robustly killing a multi-threaded application is not trivial. After the above has been investigated in elementary form, it is advisable to review existing methodologies for bridged subprocess collection. It is not an uncommon problem, although perhaps robust solutions are less common
Depends on: 584464
Assignee | ||
Comment 9•14 years ago
|
||
automation.py kills things in a much more precise way. In addition to using NO_EM_RESTART, a process log is used and PIDs read and killed from that.
I think the first stab is just using NO_EM_RESTART. If this proves less than sufficient, then we should immediately go with the process log approach. If not, it should be subsequently ticketed and probably taken for 2.0.
Its worth noting that automation.py handles processes a bit better across the board. While mozprocess has some advantages, automation.py is more mature and should probably mostly be used as a blueprint. Ultimately, they should be combined, but that is a longer-term goal
Assignee | ||
Comment 10•14 years ago
|
||
Preliminary testing looks good. However, since stopping is all kinds of edge cases, there could indeed be cases not caught here. Should ticket getting the process log and killing those in a second patch if we don't do it here, a la automation.py : http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#867
Attachment #512266 -
Flags: review?(ahalberstadt)
Comment 11•14 years ago
|
||
Comment on attachment 512266 [details] [diff] [review]
use NO_EM_RESTART and don't kill based on process name
Seems to work great in Firefox 4 but causes a hang in Firefox 3.6
Attachment #512266 -
Flags: review?(ahalberstadt) → review-
Assignee | ||
Comment 12•14 years ago
|
||
So its very strange....I would expect NO_EM_RESTART, if it was badly supported in 3.6, to cause the process not to be killed. Instead, the tests don't run at all, as :ahal describes :(
Assignee | ||
Comment 13•14 years ago
|
||
automation.py on 3.6 does use NO_EM_RESTART:
http://hg.mozilla.org/releases/mozilla-1.9.2/file/915a35e15cde/build/automation.py.in#l702
I don't get it
Assignee | ||
Comment 14•14 years ago
|
||
I can also confirm that the mozmill and jsbridge extensions are listed in the enabled addons
Assignee | ||
Comment 15•14 years ago
|
||
This may be a clue:http://mozilla-xp.com/mozilla.dev.apps.firefox/Rules-for-when-firefox-bin-restarts-it-s-process ; still confused, though :/
Assignee | ||
Comment 16•14 years ago
|
||
Attachment #512266 -
Attachment is obsolete: true
Attachment #512329 -
Flags: review?(ahalberstadt)
Assignee | ||
Comment 17•14 years ago
|
||
tested on 3.5, 3.6, and 4.0 on linux
Comment 18•14 years ago
|
||
Comment on attachment 512329 [details] [diff] [review]
use NO_EM_RESTART and don't kill based on process name and also a separate first run to register extensions
WFM in both 3.6 and 4.0. Looks good and seems to be the proper way to do things as per the discussion thread linked above. A little weird why it worked before in 4.0, but maybe they handle this automatically.
r+
Attachment #512329 -
Flags: review?(ahalberstadt) → review+
Comment 19•14 years ago
|
||
Actually one nit:
There's a whitespace error at the return
def wait(self, timeout=None):
- """Wait for the browser to exit."""
+ """Wait for the app to exit."""
+ if self.process_handler is None:
+ return
Assignee | ||
Comment 20•14 years ago
|
||
pushed to master: https://github.com/mozautomation/mozmill/commit/d9efd4b0ca6640148bf7f175cdfcd47c28dd7282
cleaned up the whitespace too, i hope; and yes, abict, Fx 4.0 just handles this better (though, FWIW, runreftest still does the -silent treatment)
I'll file a follow-up bug about the process list
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•14 years ago
|
Whiteboard: [mozmill-1.5.2?][mozmill-2.0?] → [mozmill-1.5.2?][mozmill-2.0+]
Comment 21•14 years ago
|
||
Jeff, what are the risks of backporting to 1.5.x? Would it be possible or should we better wait for 2.0?
Assignee | ||
Comment 22•14 years ago
|
||
This could be backported to 1.5 but I recommend against it for a few reasons:
- to paraphrase Rumsfeld, there are "unknown unknowns" with this patch. i've tested it pretty well, as did :ahal, but this does fundamentally change stopping conditions. since the issue will only show itself on edge cases, it is fundamentally hard to test well though
- this has been broken since forever. since the fault is that it is overaggressive at killing, IMHO this is better than not killing adequately
- mozmill 1.5.2 is already at a release candidate. i would hate to take anything that fundamentally alters functionality
I would be more comfortable taking if https://bugzilla.mozilla.org/show_bug.cgi?id=634141 was finished and backported, but personally I'd prefer to avoid time and possible headaches at fixing a bug that was broken for > 1 year for a dot release
Updated•14 years ago
|
Whiteboard: [mozmill-1.5.2?][mozmill-2.0+] → [mozmill-1.5.2-][mozmill-2.0+]
You need to log in
before you can comment on or make changes to this bug.
Description
•