Closed Bug 596132 Opened 14 years ago Closed 14 years ago

update talos remote testing to log to file instead of redirect to file

Categories

(Testing :: Talos, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(2 files, 6 obsolete files)

As we have found out Android devices don't have the ability to do a 'fennec > output.log'.  Currently all of talos depends on redirecting to a log and monitoring that log file for status changes and keywords to get the results we are looking for.

This patch will add MozillaFileLogging into the picture (works in IPC mode as well) for the mobile_profile which is used in fennec talos tests.

Essentially all places where there is a dump() have been changed to dumpLog() which outputs both a dump() and a file.write() to ensure full compatibility.  The log file name is set as a preference 'talos.logfile' and can be read by all code that needs to get the filename (similar to reftest)
Attachment #474977 - Flags: review?(anodelman)
Attachment #474977 - Flags: feedback?(jhammel)
Blocks: 579566
Comment on attachment 474977 [details] [diff] [review]
log output to a log file, not stdout

>Index: bcontroller.py
>===================================================================
>RCS file: /cvsroot/mozilla/testing/performance/talos/bcontroller.py,v
>retrieving revision 1.15
>diff -u -8 -p -r1.15 bcontroller.py
>--- bcontroller.py	15 Apr 2010 18:03:58 -0000	1.15
>+++ bcontroller.py	14 Sep 2010 04:01:06 -0000

<snip/>

>+  def __init__(self, command, log, mod, remoteLog = '', deviceManager = None):
Why not remoteLog=None?


>      self.command = command
>      self.log = log
>      self.mod = mod
>      self.endTime = -1
>      self.returncode = -1
>      self.deviceManager = deviceManager
>+     if (remoteLog == None or remoteLog == ''):
>+       self.remoteLog = self.log
>+     else:
>+       self.remoteLog = remoteLog
self.remoteLog = remoteLog or self.log


>   def run(self):
>     if self.mod:
>-      if (self.deviceManager.__class__.__name__ == "WinmoProcess"):
>+      if (self.deviceManager.__class__.__name__ == "RemoteProcess"):
>         if (self.mod == "str(int(time.time()*1000))"):
>           self.command += self.deviceManager.getCurrentTime()
>       else:
>         self.command = self.command + eval(self.mod)
> 
>-    if (self.deviceManager.__class__.__name__ == "WinmoProcess"):
>-      retVal = self.deviceManager.launchProcess(self.command, timeout=600)
>+    if (self.deviceManager.__class__.__name__ == "RemoteProcess"):

I'd save the class name in a variable so that this didn't have to be typed out and then when it changes it only has to be changed in one place

>+      if (len(self.remoteLog.split('/')) <= 1):
if (self.remoteLog.count('/') <=1)
>+        self.remoteLog = self.deviceManager.getDeviceRoot() + '/' + self.remoteLog
>+
>+      retVal = self.deviceManager.launchProcess(self.command, outputFile=self.remoteLog, timeout=600)

> class BrowserController:
> 
>   def __init__(self, command, mod, name, child_process, 
>-               timeout, log, host='', port=20701, root=''):
>+               timeout, log, host='', port=20701, root='', remoteLog=''):
As above, why not None?

>@@ -172,21 +180,23 @@ def main(argv=None):
<snip/>
>-   opts, args = getopt.getopt(argv[1:], "c:t:n:p:l:m:h:r:o", ["command=", "timeout=", "name=", "child_process=", "log=", "mod=", "host=", "deviceRoot=", "port="])
>+   opts, args = getopt.getopt(argv[1:], "c:t:n:p:l:m:h:r:o:d", ["command=", "timeout=", "name=", "child_process=", 
>+                                                              "log=", "mod=", "host=", "deviceRoot=", "port=", "remoteLog="])
Ah, the horror of getopt :(  I really hope someone cleans this up one day

>Index: devicemanager.py
>===================================================================
>RCS file: /cvsroot/mozilla/testing/performance/talos/devicemanager.py,v
>retrieving revision 1.3
>diff -u -8 -p -r1.3 devicemanager.py
>--- devicemanager.py	21 Jul 2010 18:35:55 -0000	1.3
>+++ devicemanager.py	14 Sep 2010 04:01:06 -0000
>@@ -53,17 +53,17 @@ class FileError(Exception):
>     self.msg = msg
> 
>   def __str__(self):
>     return self.msg
> 
> class DeviceManager:
>   host = ''
>   port = 0
>-  debug = 3
>+  debug = 2
So why is the debug level changed?  Should that go in this patch or a separate one? I'm probably missing a piece of the puzzle.


>@@ -383,24 +383,23 @@ class DeviceManager:
> 
>     #NOTE: we sleep for 30 seconds to allow the application to startup
>     time.sleep(30)
> 
>     self.process = self.processExist(appname)
>     if (self.debug >= 4): print "got pid: " + str(self.process) + " for process: " + str(appname)
> 
>   def launchProcess(self, cmd, outputFile = "process.txt", cwd = ''):
>-    if (outputFile == "process.txt"):
>-      outputFile = self.getDeviceRoot() + '/' + "process.txt"
>-
>     cmdline = subprocess.list2cmdline(cmd)
>-    self.fireProcess(cmdline + " > " + outputFile)
>-    handle = outputFile
>+    if (outputFile == "process.txt" or outputFile == None):
>+      outputFile = self.getDeviceRoot() + '/' + "process.txt"
>+      cmdline += " > " + outputFile
I'm not sure what expected behaviour is here.  Its hard to glean intent from these lines of code

 

>@@ -711,8 +710,34 @@ class DeviceManager:
<snip/>
>+  """
>+    return the current time on the device
>+  """
>+  def getCurrentTime(self):
>+    data = self.sendCMD(['clok'])
>+    if (data == None):
>+      return None
>+    return self.stripPrompt(data).strip('\n')
I have no idea what this does but I'll take your work for it


>+  """
>+    edit the user.js in the profile (on the host machine) and 
>+    add the xpconnect priviledges for the remote server
>+  """
The docstring should go inside the function

>+  def addRemoteServerPref(self, profile_dir, server):
>+    user_js_filename = os.path.join(profile_dir, 'user.js')
>+    user_js_file = open(user_js_filename, 'a+')
>+   
>+    #TODO: p2 is hardcoded, how do we determine what prefs.js has hardcoded?
>+    remoteCode = """
>+user_pref("capability.principal.codebase.p1.granted", "UniversalPreferencesWrite UniversalXPConnect UniversalPreferencesRead");
>+user_pref("capability.principal.codebase.p1.id", "http://%(server)s");
>+user_pref("capability.principal.codebase.p1.subjectName", "");
>+""" % { "server": server }
This is fine for three lines;  if this code needs to be touched again, I'd recommend doing it a bit more generally.

<snip/>

>Index: run_tests.py
>===================================================================
>RCS file: /cvsroot/mozilla/testing/performance/talos/run_tests.py,v
>retrieving revision 1.66
>diff -u -8 -p -r1.66 run_tests.py
>--- run_tests.py	4 Aug 2010 13:22:14 -0000	1.66
>+++ run_tests.py	14 Sep 2010 04:01:07 -0000
>@@ -422,16 +422,21 @@ def test_file(filename, to_screen):
>   if 'deviceroot' in yaml_config:
>       browser_config['deviceroot'] = yaml_config['deviceroot']
>   else:
>       browser_config['deviceroot'] = ''
>   if 'remote' in yaml_config:
>       browser_config['remote'] = yaml_config['remote']
>   else:
>       browser_config['remote'] = False
>+  if 'remote_log' in yaml_config:
>+      browser_config['remote_log'] = yaml_config['remote_log']
>+  else:
>+      browser_config['remote_log'] = ''
>+
Again, why not None?

<snip/>

>Index: mobile_profile/extensions/electrolysis@mozilla.org/chrome/content/MozillaFileLogger.js
>===================================================================
<snip/>
>+  try {
>+    netscape.security.PrivilegeManager.enablePrivilege("UniversalXPConnect");
>+  } catch (ex) {} //running in ipcMode-chrome
>+
>+  MozillaFileLogger._file = Cc[LF_CID].createInstance(Ci.nsILocalFile);
>+  MozillaFileLogger._file.initWithPath(path);
>+  MozillaFileLogger._foStream = Cc[FOSTREAM_CID].createInstance(Ci.nsIFileOutputStream);
>+//  MozillaFileLogger._foStream.init(this._file, PR_WRITE_ONLY | PR_CREATE_FILE | PR_APPEND,

Any reason you kept this comment in?

Overall, looks fine.
Attachment #474977 - Flags: feedback?(jhammel) → feedback+
cleaned up some feedback from ctalbert about devicemanager.py and a few bits and pieces from jhammel the feedback master.  This is the patch that is running on talos staging ;)

if this passes on staging and anode gives an r+, we can put this in the queue for landing on the next talos downtime (eta Sept 23rd)
Assignee: nobody → jmaher
Attachment #474977 - Attachment is obsolete: true
Attachment #475627 - Flags: review?(anodelman)
Attachment #474977 - Flags: review?(anodelman)
this passed green on talos staging.
Comment on attachment 475627 [details] [diff] [review]
log output to a log file, not stdout (1.1)

Any reason for the name change between WinmoProcess and RemoteProcess (considering that it still lives in a file called ffprocess_winmo.py

The main issue here is that you want to change totally over to dumpLog over dump.  To do that complete we'd have to update the pageloader bundle as well (http://hg.mozilla.org/build/pageloader/) which is installed per-run and allows us to cycle through pages.
Attachment #475627 - Flags: review?(anodelman) → review-
this patch was originally written for cvs and I don't know how to remove a file in cvs, so I never changed the name from ffprocess_winmo.py to ffprocess_remote.py.  Now that we are moving to HG, this can be done.

Currently we don't support remote testing for pageloader tests.  If you want me to modify page laoder to use a similar dumpLog method, I can do that.  This patch is for the talos repository.  I have another patch which we landed for the fennecmark toolset to use the filename instead stdout.

Maybe I don't understand why this patch cannot land without support for pageloader since pageloader is a different tool?
updated patch to be 'hg' patch.  Also I removed ffprocess_winmo.py and replaced it with ffprocess_remote.py!

Once I hear back regarding the direction on pageloader, I can r? this patch.
Attachment #475627 - Attachment is obsolete: true
second patch that fixes PerfConfigurator.py to be smarter, work with remote_log options and support android in a cleaner method.  This patch removes winmo.config and creates remote.config.  

Here is an example commandline that I run to generate a valid config:
python PerfConfigurator.py -v -e org.mozilla.fennec -t `hostname` -b mobile-browser --activeTests tzoom --sampleConfig remote.config --browserWait 60 --noChrome --output local.config --remoteDevice 192.168.1.115 --webServer 192.168.1.102/talos
Attachment #477391 - Flags: review?(anodelman)
Comment on attachment 477391 [details] [diff] [review]
fix perfconfigurator.py to work with android and remote logging (1.0)

Little weird to have one log called remove_log and the other called talos.log - might want to choose a format there and stick with it.  With that said, still r+.
Attachment #477391 - Flags: review?(anodelman) → review+
Oh, and you have the port set to a hardcoded default (20701) in both patches.  Could this be something that lives in the remote.config?
Re: comment #6, direction for pageloader.  I think that I'm still uncomfortable with how the code is set up as I don't believe that the dumpLog should be part of quit.js.  quit.js is for code that makes the browser stop, overloading how dump works shouldn't happen there.  I believe that this will generate confusion when people are reading code to figure out what is going on here.  I'd prefer that it would be packaged into a secondary script that could then be included by those tests that need access to it.
for the port being hardcoded to 20701, it is much more difficult and requires a lot more code to read it from the config file.  We need to collect information from the device (such as build id and deviceRoot) which requires the ip and port.  The solution to read it from the config file is to parse the config file and store it locally in an hash.  Then setup remote and do any variable validation cleanup.  Finally we would then write the new config out.

Currently we parse the variables (validate/cleanup) then in the write config step we read the sample.config and line by line edit on the fly as needed to create the new one.  

If we want to do this we can.  I would also change the variable input to be OptionParser or something like that so we can subclass.
updated patch to use scripts/MozillaFileLogger.js instead of hacking up quit.js.  The one problem here is that we need to include MozillaFileLogger.js before quit.js to have to event generation code (or we could duplicate or create a 3rd file)

Tested this in a remote scenario as well as a hacked up scenario where we do and don't redirect firefox output to a log file.
Attachment #477390 - Attachment is obsolete: true
Attachment #478066 - Flags: review?(anodelman)
addressed some of the concerns in previous comments.  Also pulled as much code as I could out of PerfConfigurator.py and into the existing remotePerfConfigurator.py.  Asking for another review since logic has changed.

this should help keep things separated as much as possible.

:bear, please run python remotePerfConfigurator.py now.
Attachment #477391 - Attachment is obsolete: true
Attachment #478069 - Flags: review?(anodelman)
forgot to add scripts/MozillaFileLogger.js to the patch...good catch anode!
Attachment #478066 - Attachment is obsolete: true
Attachment #478305 - Flags: review?(anodelman)
Attachment #478066 - Flags: review?(anodelman)
Attachment #478305 - Flags: review?(anodelman) → review+
Attachment #478069 - Flags: review?(anodelman) → review+
Let's get another green staging cycle out of this and then we can consider it ready to go.
small adjustment to a text string needed for pageloader manifest mods.
Attachment #478069 - Attachment is obsolete: true
Attachment #478389 - Flags: review+
What needs to happen to check in these patches?
these patches are r+ and pass in talos staging, bug 579566 is the tracking bug for the next talos downtime.  It is my understanding that is schedule for sometime this week (usually thursday) and these two patches would land then.
(In reply to comment #18)
> these patches are r+ and pass in talos staging, bug 579566 is the tracking bug
> for the next talos downtime.

That does not look like a tracking bug for downtime to me. When will this patch land?
Looks like a wrong bug number there - the downtime bug is bug 593081.  Downtimes are hard to come by these days as releng is currently blocked on completing code freeze for beta 7 - as you can see things in that bug are pretty backed up.

You'll have to talk to releng to get a better idea what their scheduling is looking like.
(In reply to comment #20)
> 
> You'll have to talk to releng to get a better idea what their scheduling is
> looking like.

No, I will not. This bug is not assigned to me. What will happen is that releng and/or ateam will answer the very simple question I just asked. We are not going to wait week to land this code.
(In reply to comment #19)
> (In reply to comment #18)
> > these patches are r+ and pass in talos staging, bug 579566 is the tracking bug
> > for the next talos downtime.
> 
> That does not look like a tracking bug for downtime to me. When will this patch
> land?
jmaher pointed you to the wrong bug. Bug#593081 is tracking scheduling of next downtime.




(In reply to comment #21)
> (In reply to comment #20)
> > You'll have to talk to releng to get a better idea what their scheduling is
> > looking like.
> No, I will not. This bug is not assigned to me. What will happen is that releng
> and/or ateam will answer the very simple question I just asked. We are not
> going to wait week to land this code.

To allow developers land beta7 fixes, RelEng is currently blocked from scheduling non-emergency treeclosure downtime. One possible downtime date being discussed is immediately after the "go to build" for beta7. I'm all-too-well-aware this yet-to-be-scheduled downtime is blocking 7 bugfixes from landing in production, including the one you care about. If you have additional info on why these need to be landed sooner, with a downtime interruption on other beta7 bugs, please let rel-drivers know.
(In reply to comment #22)
>
> If you have
> additional info on why these need to be landed sooner, with a downtime
> interruption on other beta7 bugs, please let rel-drivers know.

There are about 20 b7 blockers left. We can tolerate a downtime. I am disturbed that I got the runaround three bug comments in a row here. That doesn't mean any one commenter was wrong to point somewhere else, but it does indicate a broken organization that passes the buck too easily. Since I am a release driver, I guess I will mail myself.
(In reply to comment #23)
> (In reply to comment #22)
> >
> > If you have
> > additional info on why these need to be landed sooner, with a downtime
> > interruption on other beta7 bugs, please let rel-drivers know.
> 
> There are about 20 b7 blockers left. We can tolerate a downtime. I am disturbed
> that I got the runaround three bug comments in a row here. That doesn't mean
> any one commenter was wrong to point somewhere else, but it does indicate a
> broken organization that passes the buck too easily. Since I am a release
> driver, I guess I will mail myself.

There's no run around.  The only visibility we have into the IT/RelEng downtime schedule is "every thursday morning, permitting something doesn't come up".  I can speak for the Tools team that it's not so much "passing the buck" as not knowing.  We only know when our stuff is landing the day before the down time happens.  

I think we have a critical lack of visibility into what needs downtime and why across the organization.  I'll make sure the next time we're waiting on something it's called out in that tree management section of the Tuesday meeting so it doesn't come up as a surprise.
just as a note, these patches are HG patches, I was under the impression we would be on HG last week, but that was pushed out to this week (although it isn't scheduled)

If we do a downtime tomorrow and it is on talos cvs, I could redo these patches to ensure they apply cleanly to a cvs checkout.
(In reply to comment #24)
> 
> There's no run around.  The only visibility we have into the IT/RelEng downtime
> schedule is "every thursday morning, permitting something doesn't come up". 

It should have been today, then. But it didn't happen. Perhaps for good reason. I have mailed release-drivers to schedule time to get this patch and others in. This work is important, and we can't afford to let it sit for weeks.
Depends on: 556530
pushed to hg repo:
http://hg.mozilla.org/build/talos/pushloghtml?changeset=f97b65124838

we will ensure that the talos.zip which releng requires is ready to go along with additional testing for a smoother process.
Removing dependency on downtime - this is dependent on talos in hg landing which will happen in the downtime.  Doing this to clear up confusion in the releng-downtime mega-bug.
No longer blocks: releng-downtime
did this land during the downtime on monday?
yes, this did, I am not sure of the best way to indicate that in this bug. 

marking as resolved->fixed.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #23)
>
> There are about 20 b7 blockers left. We can tolerate a downtime. I am disturbed
> that I got the runaround three bug comments in a row here. That doesn't mean
> any one commenter was wrong to point somewhere else, but it does indicate a
> broken organization that passes the buck too easily. Since I am a release
> driver, I guess I will mail myself.

I gather that this message offended people. I am sorry. So, what happened? Near as I can tell, it went like this:

1) Engineering wanted Android automation
2) the team working on Android automation did the work (I don't know what the schedule looked like)
3) when the work was complete, Releng needed downtime to make the changes to accommodate Android automation
4) Engineering didn't accept any downtime, for unrelated reasons.

So, Engineering is making unreasonable and contradictory demands. However, it was not obvious that these demands were contradictory, and it took some rather impolitic prodding to figure out why everything was stopped up.

IMHO, that means there are two takeaways. The first problem is that engineering is incompetently micromanaging releng/automation logistics. The second problem is that releng/automation isn't making sure these very important projects go all the way to deployment.

These projects are critical to Mozilla's success, so we need to change. The worst result is mutual haplessness, where each group blames the other. The best outcome we can hope for is that each group has good stuff, and we need to coordinate the deployment of it all. Write good code and don't let it sit there.

The confusion in this bug never would have happened if everyone was making sure that their code was actually getting deployed properly. It's not cool to fix a bug in theory and then let some other group block you. In this case, the finger pointing actually formed a circle. That is unacceptable and unproductive.
This comment is intended to be constructive.

sayrer: I think RelEng's actions were entirely reasonable if Engineering did not tell them that this work was of particular urgency. Surely if Engineering has given an instruction that there be no non-emergency downtime, and this fix was not labelled as being of sufficient importance to require emergency downtime, then adding it to the queue of fixes for the next release was the obvious thing to do.

The problem would then be not Engineering micromanaging RelEng, but that (as you hint) Engineering not coordinating within itself. If RelEng received what seemed to be a consistent set of instructions from Engineering (i.e. _not_ "contradictory demands"), you can't blame them for following them. It's not their job to say "hey, we wrote this code for you, and you guys aren't letting it be deployed. What gives?" They made the not unreasonable assumption that Engineering knew what they were doing.

OTOH, if RelEng were told that this issue was of utter, vital, drop-everything importance, then raising the question of whether emergency downtime was required would have been a step they should, in hindsight, have taken.

But aside from an unfortunate typo with a bug number, and joduinn not knowing you were yourself a release-driver, I don't see them giving you the run-around in this bug.

Gerv
Hey Rob,

Thanks for the summary here.

(In reply to comment #31)
> (In reply to comment #23)
 
> IMHO, that means there are two takeaways. The first problem is that engineering
> is incompetently micromanaging releng/automation logistics. The second problem
> is that releng/automation isn't making sure these very important projects go
> all the way to deployment.
> 
> These projects are critical to Mozilla's success, so we need to change. The
> worst result is mutual haplessness, where each group blames the other. The best
> outcome we can hope for is that each group has good stuff, and we need to
> coordinate the deployment of it all. Write good code and don't let it sit
> there.
> 
> The confusion in this bug never would have happened if everyone was making sure
> that their code was actually getting deployed properly. It's not cool to fix a
> bug in theory and then let some other group block you. In this case, the finger
> pointing actually formed a circle. That is unacceptable and unproductive.

I agree with you here.  I think that this is where we fell down - getting the stuff fixed and then waiting to see if the code would land.  I'll start encouraging my automation peeps to start thinking that the bug is not fixed until the code is landed and running, and hopefully we can work out these kinks in the system more quickly next time.
You need to log in before you can comment on or make changes to this bug.