Closed Bug 822423 Opened 12 years ago Closed 12 years ago

Create fake pandas for mozpool testing

Categories

(Testing Graveyard :: Mozpool, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Assigned: dustin)

References

Details

Attachments

(2 files)

Along the lines of fakerelay.py, we need fakepanda.py so we can do mozpool testing without real pandas.
I think this would be linked to or include fakerelay.py (for power control).

I actually have dreams of a shell where we could manually cause failures.  This could be kinda fun.

And, more importantly, it saves us the trouble and capacity cost of allocating a chassis or, worse, rack of pandas to mozpool development.
Totally.  We can have a really basic fake SUT agent on there as well.  We have a Python agent, although that might be overkill... we just need something to respond to the few commands that the SUT lifeguard checks will use (testroot, rmdr, mkdr, isdir, and push).  Even better would be, as you say, the ability to cause failures; for the SUT agent that would be (at least) not responding and a bad disk.
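
A fake agent could be little more than a line-oriented TCP server that pretends those commands succeed.  A rough sketch (Python 2 to match mozpool; the port, the "$>" prompt, and the canned responses are assumptions about the real agent, not taken from its source):

import socket

FAKE_SUT_PORT = 20701   # assumed SUT agent command port
PROMPT = '$>'           # assumed prompt printed after each command

def handle_connection(conn):
    conn.sendall(PROMPT)
    buf = ''
    while True:
        data = conn.recv(1024)
        if not data:
            return
        buf += data
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            parts = line.strip().split()
            if not parts:
                conn.sendall(PROMPT)
            elif parts[0] == 'testroot':
                conn.sendall('/mnt/sdcard\n' + PROMPT)
            elif parts[0] in ('mkdr', 'rmdr', 'isdir', 'push'):
                # pretend everything succeeds
                conn.sendall('OK\n' + PROMPT)
            elif parts[0] == 'quit':
                return
            else:
                conn.sendall('unknown command\n' + PROMPT)

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('0.0.0.0', FAKE_SUT_PORT))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        try:
            handle_connection(conn)
        finally:
            conn.close()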
We really need this, so I'll get cracking on it.

We can probably host this in a datacenter as a staging instance.  I just don't want to burn a chassis or more on staging.
Assignee: nobody → dustin
Attached patch bug822423.patch (Splinter Review)
This is in three patches - let's see how bugzilla handles that.  See also https://github.com/djmitche/mozpool/compare/bug822423

Basically this adds a config option that causes the daemon to scan the DB for devices to emulate, and then emulate them.  At startup, all devices are powered down with a blank SD card.  This implements probabilistic failures with rates that seem a bit higher than those of real pandas, giving lifeguard a nice workout.  It tries to be as realistic as possible, emulating the relay boards' TCP/IP interface and using mozpool's HTTP interface to submit events, get bootconfig, etc.
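
The probabilistic-failure part boils down to rolling against a per-state failure rate at each step of the emulation; roughly like this (a sketch with made-up names, and the rates are just examples):

import random

# chance that the fake device "breaks" while in a given state
FAILURE_RATES = {
    'b2g_downloading': 0.08,
    'b2g_extracting': 0.08,
    'b2g_rebooting': 0.08,
    'running_b2g': 0.05,
    # anything not listed: 0
}

def maybe_fail(state):
    """Return True if the fake device should fail in this state."""
    return random.random() < FAILURE_RATES.get(state, 0.0)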

I envision running this on a VM, perhaps mobile-imaging-stage1.build.scl1.mozilla.com, using the staging MySQL DB.  We'll manually populate that with a hundred or so fake pandas.  Then we can adjust schemas, stage new versions, and so on, before rolling them out to production.  If we decide to do so, we can also attach the prototype chassis to this imaging server.
Attachment #696123 - Flags: review?(mcote)
Blocks: 824816
Please continue to review, but a note for myself to include in the next revision:

 - README + test data to set up a full instance at home with simulated pandas and everything
Comment on attachment 696123 [details] [diff] [review]
bug822423.patch

Review of attachment 696123 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, though it took me some time to get it entirely working locally.  From a blank db, I had to do the following in addition to running testdata.py:

- change the devices' relayinfo to an unused port on localhost (I had previously made testdata.py set the relayinfo to the mozpool server's fqdn for lack of anything better).
- change the fake pxe config to include the magic "mobile-imaging-url=..." string (e.g. mobile-imaging-url=foo/b2g-second-stage.sh)
- change the devices' initial state from "free" to "off" to match FakeDevice's initial state (also see my note below).

I can fix testdata.py to do this all automatically.  In fact, now that we have fake pandas, I think that we should make testdata.py set up the data as though it will be run with the fake pandas.  So we can set the relayinfo port to 2101 (overridable via a command-line option, I guess), set up pxe configs for both android and b2g with the appropriate magic string, and set the devices' state to off.  In other words, replace a lot of the fake data with slightly more realistic fake data that will work with fakerelay/panda.  Then we should be able to have a full test system in just a couple of bash lines.  I'll attach my changes to this bug for your review.

Btw while testing I found that ^C often doesn't work to kill the server.  Not sure if that is specific to this patch or if I just had my system stuck in different states from previous testing.

A few comments and questions below.  Nothing blocking, so r+ing this.

::: mozpool/test/fakedevices.py
@@ +42,5 @@
> +        # *our* notion of current state; this is mostly used to set timeouts
> +        # and for debugging purposes, as this class does not implement a state
> +        # machine.  Some of the states align with those in devicemachine.py,
> +        # for developer sanity
> +        self.state = 'off'

Could we do something more intelligent based on the device state in the db?

@@ +69,5 @@
> +        'b2g_extracting': 0.08,
> +        'b2g_rebooting': 0.08,
> +        'running_b2g': 0.05
> +        # default: 0
> +    }

How useful do you think this will be?  I understand the idea of simulating random failures, but I don't see how this would be really useful unless there was a really long-running test.  Forcing failures seems much more useful for testing.

@@ +113,5 @@
> +            with open(filename) as f:
> +                cfg = f.read()
> +                mo = re.search('mobile-imaging-url=[^ ]*/([^ ]*).sh', cfg)
> +                if mo:
> +                    return mo.group(1)

Implied "return None" here is confusing.
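
For example, something along these lines (just an illustrative rewrite of the quoted snippet, not the actual code):

with open(filename) as f:
    mo = re.search('mobile-imaging-url=[^ ]*/([^ ]*).sh', f.read())
if mo:
    return mo.group(1)
return None  # explicit, so the no-match case is obvious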

@@ +278,5 @@
> +            # only emulate devices managed by this imaging sever
> +            if dev_dict['imaging_server'] != fqdn:
> +                continue
> +
> +            # only emulate devices with relay info starting with 'localhost' 

Trailing whitespace.

::: mozpool/test/fakerelay.py
@@ +21,5 @@
> +
> +    def set_status(self, status):
> +        self.status = status
> +        panda = 'off' if status else 'on'
> +        self.logger.info('set status %s (panda %s)' % (status, panda))

We should probably use something more generic than "panda".

@@ +31,5 @@
> +
> +
> +class RelayBoard(object):
> +
> +    # set this on an instance to control the time between reading a commnad and

Typo in "command".

@@ +43,5 @@
> +        relay on the board, and should return a Relay instance.  It defaults to
> +        the Relay class.
> +
> +        If record_actions is true, then relayboard.actions will contain a list
> +        of the actions that occurred on the board. 

Trailing whitespace.

@@ +103,5 @@
> +
> +            self.handle_commands(csock)
> +            csock.close()
> +
> +            # bail out now if we're only running hte loop once

Typo.
Attachment #696123 - Flags: review?(mcote) → review+
(In reply to Mark Côté ( :mcote ) from comment #6)
> lines.  I'll attach my changes to this bug for your review.

Thanks!  That's what I was talking about in comment 5, but if you're setting it up I'll leave you to it.

> Btw while testing I found that ^C often doesn't work to kill the server. 
> Not sure if that is specific to this patch or if I just had my system stuck
> in different states from previous testing.

In general I've found that to be due to the browser holding HTTP connections open.  It's possible I've missed daemonizing a thread, though.
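
If it is a missed thread, the fix is a one-liner where the thread gets created, something like (the target name is a placeholder):

import threading

t = threading.Thread(target=run_fake_device)  # placeholder target
t.daemon = True  # don't let this thread keep the process alive on ^C
t.start()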

> A few comments and questions below.  Nothing blocking, so r+ing this.
> 
> ::: mozpool/test/fakedevices.py
> @@ +42,5 @@
> > +        # *our* notion of current state; this is mostly used to set timeouts
> > +        # and for debugging purposes, as this class does not implement a state
> > +        # machine.  Some of the states align with those in devicemachine.py,
> > +        # for developer sanity
> > +        self.state = 'off'
> 
> Could we do something more intelligent based on the device state in the db?

I'm not sure there's much point - it's all fake anyway, right?

I was thinking, though, that if we configure the 'new' state to do a self-test or something natural like that, and set up testdata.py to add all fake pandas in the 'new' state, then we'd get some action right off the bat in any case.

> How useful do you think this will be?  I understand the idea of simulating
> random failures, but I don't see how this would be really useful unless
> there was a really long-running test.  Forcing failures seems much more
> useful for testing.

I'm hoping to use this long-term on the staging server, so I think that random failures have value there.  Forcing failures sounds good, too, but I'm not sure how best to implement it.  Perhaps a REST API uri?
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> (In reply to Mark Côté ( :mcote ) from comment #6)
> > lines.  I'll attach my changes to this bug for your review.
> 
> Thanks!  That's what I was talking about in comment 5, but if you're setting
> it up I'll leave you to it.

Ah right. :) I'll do testdata.py for now; you can do the README changes if you want.

> 
> > Btw while testing I found that ^C often doesn't work to kill the server. 
> > Not sure if that is specific to this patch or if I just had my system stuck
> > in different states from previous testing.
> 
> In general I've found that to be due to the browser holding HTTP connections
> open.  It's possible I've missed daemonizing a thread, though.

Interesting.  I'll try closing my browser the next time this happens and see if that fixes it.

> 
> > A few comments and questions below.  Nothing blocking, so r+ing this.
> > 
> > ::: mozpool/test/fakedevices.py
> > @@ +42,5 @@
> > > +        # *our* notion of current state; this is mostly used to set timeouts
> > > +        # and for debugging purposes, as this class does not implement a state
> > > +        # machine.  Some of the states align with those in devicemachine.py,
> > > +        # for developer sanity
> > > +        self.state = 'off'
> > 
> > Could we do something more intelligent based on the device state in the db?
> 
> I'm not sure there's much point - it's all fake anyway, right?

Really just a convenience.  If I want to test requests with fresh testdata, I have to (fake) image them first to get them into the right internal state.  Just takes more time and clicks/typing.

> I was thinking, though, that if we configure the 'new' state to do a
> self-test or something natural like that, and set up testdata.py to add all
> fake pandas in the 'new' state, then we'd get some action right off the bat
> in any case.

Ah, a self-test that, if successful, moves the device to the free state?  Cool, that'd work.

> 
> > How useful do you think this will be?  I understand the idea of simulating
> > random failures, but I don't see how this would be really useful unless
> > there was a really long-running test.  Forcing failures seems much more
> > useful for testing.
> 
> I'm hoping to use this long-term on the staging server, so I think that
> random failures have value there.  Forcing failures sounds good, too, but
> I'm not sure how best to implement it.  Perhaps a REST API uri?

Okay just wondering, since for real testing (e.g. with unit tests) the fake pandas won't be very useful.  I guess you want to do some big long activity generator to catch subtle bugs or leaks?

And yeah, I think a REST API is probably the best way.  We could also add a little utility script to make it easier than writing long curl commands.
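
A minimal version of that utility script could just wrap such an endpoint; for example (the /api/device/<name>/fail/ URL and the script itself are hypothetical - neither exists yet):

#!/usr/bin/env python
# fail-device: force a failure on a fake device via a *hypothetical*
# /api/device/<name>/fail/ endpoint, which would need to be added to mozpool.
import json
import sys
import urllib2

def main():
    if len(sys.argv) != 3:
        print 'usage: fail-device <device-name> <failure-type>'
        sys.exit(1)
    device, failure = sys.argv[1:3]
    url = 'http://localhost:8080/api/device/%s/fail/' % device
    req = urllib2.Request(url, json.dumps({'failure': failure}),
                          {'Content-Type': 'application/json'})
    urllib2.urlopen(req)

if __name__ == '__main__':
    main()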
This appears to work fine.  I just run "mozpool-db run testdata.py -d 10 -p 8080" to generate devices (since I run my local mozpool server on the default port 8080), then I can image them with b2g and request them.

I thought of two other things:

- with regard to defaulting the internal state to 'off', another inconvenience is that the fakepandas will be busted if you restart the server without refreshing the test data.  You need to force the db state back to off and then image them again to get them to a useful state.

- is it intentional that android images are not pingable in fakedevices.Device?  Imaging fakepandas with android won't work because of this.
Attachment #696818 - Flags: review?(dustin)
Comment on attachment 696818 [details] [diff] [review]
Update testdata.py for fakepandas

Looks good.  Can you push this to github and I'll cherry-pick it?
Attachment #696818 - Flags: review?(dustin) → review+
OK, I'll get this landed.

* I'll punt defaulting the internal state to something other than off to another bug.  It'll be tricky, since this isn't a state machine (can't just assign self.state = 'running_b2g' - the code will still be blocked in _wait_for_power_on)

* I fixed android pingability - that was an oversight

* README added

* Other small details in comment 6 addressed.
* selftest -> bug 819335
* initial state -> bug 825922
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard