Closed Bug 892107 Opened 11 years ago Closed 11 years ago

Please loan a 10.8 mac tester to :smichaud

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: smichaud)

References

Details

(Whiteboard: [buildduty])

See bug 884471 comment 49 for context
I'm going to loan a 10.8 instead of a 10.7, since the 10.8 machines are still a managed class and :smichaud doesn't care one way or the other.
Summary: Please loan a 10.7 mac tester to :smichaud → Please loan a 10.8 mac tester to :smichaud
Depends on: 892146
Host is loaned, please see your e-mail for password and VPN information.

Please reassign this bug to "nobody@" when you are done with it, and releng will reclaim it.
Assignee: nobody → smichaud
Given a fresh box, you will want to download the Firefox binary and tests package.

I highly recommend looking in a log file from tbpl to get the exact steps.  Here is a full log (m-c 10.8 debug, m1):
https://tbpl.mozilla.org/php/getParsedLog.php?id=25165117&tree=Mozilla-Central&full=1

Inside the log, you can find the steps that are run by searching for 'Copy/paste'.  You can find what is downloaded by searching for 'Downloading'.
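
For what it's worth, here is a minimal sketch of that search in Python, assuming the full log has been saved locally as plain text. The function name `find_log_steps` is made up for illustration; only the two marker strings ('Copy/paste' and 'Downloading') come from the comment above.

```python
def find_log_steps(log_text, markers=("Copy/paste", "Downloading")):
    """Return a dict mapping each marker string to the log lines containing it.

    Lines are returned in the order they appear in the log, with
    surrounding whitespace stripped.
    """
    found = {marker: [] for marker in markers}
    for line in log_text.splitlines():
        for marker in markers:
            if marker in line:
                found[marker].append(line.strip())
    return found
```

To use it, read the saved log into a string (for example with `open(path, errors="replace").read()`) and print the lists stored under each marker. The 'Copy/paste' lines should then be the commands to rerun by hand, and the 'Downloading' lines the packages to fetch first.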

Please let me know if you are able to use the log to find the steps and get tests running.  It is easy to put that in a loop, but we might want to do something smarter.
I hope and assume that this *isn't* a fresh box.  I understood it's a production machine, on which bug 884471 has already happened, that's been taken offline.

But I do need to understand how the test machines work, and any help in that direction is appreciated.  I currently have zero knowledge of this, so it will take me several days (at least) to learn it on my own.
To be clear, this is not a "fresh box".  It was a production machine running production jobs before I pulled it for you.

However, I did not verify that this *specific* box hit the issue.  I did verify that this *class* of boxes hit the issue.

All machines in this "class" are imaged/set up identically and have identical hardware.
OK, then.  I think the first step should be to reproduce bug 884471 on my loaner, before I start trying to load my interpose library.

I'd like to be able to run tests continuously until bug 884471 happens, then have them stop.

The tests should, I suppose, be some subset of the mochitests in which the bug is known to happen -- e.g. mochitest-1, mochitest-other and so forth.
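
A rough sketch of such a "run until it fails, then stop" loop, in Python. This is only an illustration under stated assumptions: `run_until_failure` is a hypothetical helper, the command to run would be whatever the test log shows for the chosen mochitest chunk, and `signature` stands in for whatever text identifies bug 884471 in the output (not specified here).

```python
import subprocess


def run_until_failure(command, max_runs=100, signature=None):
    """Run `command` repeatedly and stop at the first run that looks like a failure.

    A run counts as a failure if it exits with a non-zero status, or if
    `signature` (when given) appears in its combined stdout/stderr.
    Returns (run_number, output) for the failing run, or None if all
    `max_runs` runs passed.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
        failed = result.returncode != 0
        if signature is not None and signature in result.stdout:
            failed = True
        if failed:
            return run_number, result.stdout
    return None
```

Stopping at the first hit leaves the machine in whatever state triggered the failure, which is the point of "then have them stop". The `max_runs` cap is there so that a bug that never reproduces doesn't tie up the loaner indefinitely.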
By the way, I'm having trouble getting VPN to work properly from Mountain Lion (on my side), using Tunnelblick.  Apparently you need to use a beta version of Tunnelblick to avoid it reconnecting every few minutes, and the beta uses a different settings format, and doesn't know how to convert from the old format (even though it claims to be able to).

So I'll need to fix *that* problem before I can start work on the loaner :-(
I've now fixed my VPN problems (at least to the extent of being able to ssh in to my loaner).  Thanks, Callek, for telling me to use Viscosity, that the "Mozilla VPN" requires different settings, and where to download these settings from.  (I still can't screen-share, but I hope that won't be necessary.)

Now I need to learn the basics about how our build system works, starting from no knowledge whatsoever.

Are there any docs on this (globally available, or only available via VPN)?

What software does the build system use (apart from what's included in the Mozilla tree)?  For each of these packages, where's the source, plus any docs that may exist?

I looked at https://tbpl.mozilla.org/php/getParsedLog.php?id=25165117&tree=Mozilla-Central&full=1, and see from it that /builds/slave/talos-slave/test/ seems to be its working directory.  I also see that my loaner already has this directory.

What program produced this log?  Is it feasible that I just run this program (whatever it is)?  Please give me whatever information you can about it, including where to find its source and whatever docs may exist.
I need answers to my questions in comment #8.  Without them I can't begin work on my loaner, or on bug 884471.

I don't even know who to ask, so I've picked three names more or less out of a hat.

Answer what questions you can, and needinfo whoever you think might be able to help with the rest.
Flags: needinfo?(jmaher)
Flags: needinfo?(gps)
Flags: needinfo?(bhearsum)
Flags: needinfo?(bhearsum)
buildbot has a slave which runs on the client and executes the commands you see in the log file.  That slave is the process which generates the log file you end up seeing.  99% of the time you can reproduce failures by running those commands by hand!
Flags: needinfo?(jmaher)
(In reply to comment #10)

I need more information than that!

Where is "buildbot"?  Where is "slave"?  How do I run one or both of them?  Where's the source for them?  Are there any docs, and if so where?
So it looks like I'll need to find the information I need on my own.

I Googled "buildbot site:mozilla.org" and found this:
https://wiki.mozilla.org/Buildbot

I'll keep digging.  In a day or two I should be ready.

Note that I'm on vacation next week (7-22 through 7-26), and won't be working then.
I don't think you will be able to launch commands from buildbot on that machine.  Earlier on, in the other bug, you were going to download and run the tests on it.  Why are you talking about doing something else?

All you need to do is cut/paste the commands that you see in the log and run them.  If you need help hacking the harness to repeat until failure, I would be happy to help with that.
As I've said multiple times here and in bug 884471, I need to run the tests *exactly* as they're run on the production test machines -- or as close to that as I can get.

I strongly suspect that something in our build infrastructure is the cause of bug 884471 -- something that isn't run by "ordinary" builders.
Flags: needinfo?(gps)
It is your call, but setting up a buildbot master and configuring it exactly like the ones that run the tests will take a lot of your time.

Why are you opposed to running the tests in the same environment (sans buildbot slave script) 100 times to look for the failure?  If there is no failure, then it would seem worth the many man hours to do it via buildbot.
> Why are you opposed to running the tests in the same environment
> (sans buildbot slave script) 100 times to look for the failure?

I'm not, in principle.  Though I think it's probably a fool's errand
-- if that was sufficient, why don't people see these failures when
running tests locally?

Is there an easy way to do this, that doesn't involve changing how my
loaner is set up?  If so, let me know and I'll try it.
I have already told you how to do this, and 99%+ of the time when I run a test on a loaner box by hand the error reproduces.  I have no more information to give you; this is becoming a circular argument.  If you need help hacking the harness, I will be more than happy to assist.  If you do have a question for me, please needinfo me!

Good luck and have fun!
> I have already told you how to do this

Where?
> Where?

You mean here, in comment #3?  I'll take a closer look, and see what I can glean from what you said.
I think I've found a way to run desktop_unittest.py on my loaner (which is a slave) without any connection to a master (running buildbot):
https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-2013?title=ReleaseEngineering/Mozharness

Anyone willing to comment?

Whether or not anyone is, I'll try it tomorrow and see what happens.
(Following up comment #20)

That strategy works, more or less.  But the test run aborts before any individual tests are run, with the following error:

INFO -  _RegisterApplication(), FAILED TO establish the default connection to the
        WindowServer, _CGSDefaultConnection() is NULL.

I take this to mean that I need to screen-share with my loaner, which I haven't yet been able to get working.  (Up to this point I've been ssh-ing in to the machine.)

Looks like I'll have to pick this up again the week after next, after I get back from vacation.
Component: Release Engineering: Machine Management → Release Engineering: Loan Requests
Thanks to Kim Moir, I've now got screen sharing and desktop_unittest.py working on my loaner!

Next I'll try to reproduce bug 884471 on it.  This may take a while -- if only because I'll have to rerun tests so many times.  But my path is no longer blocked, and I should eventually be able to try out my interpose library.
Product: mozilla.org → Release Engineering
I'm (basically) done with bug 884471.  But I'm going to hang on to this loaner for a bit longer (a week or two) to work on bug 898519.
(In reply to Steven Michaud from comment #23)
> I'm (basically) done with bug 884471.  But I'm going to hang on to this
> loaner for a bit longer (a week or two) to work on bug 898519.

(3 months later.)  Steven, are you still using it for bug 898519?  If you're not but plan to, I'd prefer to reclaim this machine into our pool and loan you a new one once you can devote time to it again.  But if you are actively using it, I'm happy to leave it in your hands.
Flags: needinfo?(smichaud)
(In reply to comment #24)

Oops, I'd completely forgotten about this :-(

Go ahead and reclaim this machine.  In any case I'm probably not going to be working on bug 898519 anytime soon:  I've got lots of other stuff to do, and bug 898519 is less urgent now that the tests that trigger it are disabled.
Flags: needinfo?(smichaud)
reclaiming
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Loan Requests → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard