Closed Bug 815785 Opened 12 years ago Closed 12 years ago

SUT LifeGuard tests

Categories

(Testing Graveyard :: Mozpool, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Unassigned)

References

Details

Attachments

(1 file, 2 obsolete files)

LifeGuard will need to do a simple verification that the SUT agent has come up after b2g has been imaged and booted, and that the SD card is writable. This is blocked on having the pandas on the network after imaging.
Depends on: 818264
No longer depends on: 810045
Depends on: 819576
I have tested the checks that you suggested with socket and it was easy to check for the SUT's port.

Do you have an ETA for this? I can make our tools handle the issue if you think it's best.
I should have this done by the end of the week, hopefully earlier.  If you want to go ahead with your scripts, though, go for it.  It won't hurt anything, and we can always remove it later. :)
Assignee: nobody → mcote
Status: NEW → ASSIGNED
This adds states for rebooting via SUT (for real phones) and two SUT verification states (for both pandas and phones).

I didn't add a SUT verification for android imaging.  As far as I know, android images don't include the SUT agent--but please tell me if you've heard otherwise.

I have tested this with a real phone here, and it works perfectly.  I haven't tried it with a B2G panda, though.  Since this requires no db changes, it would be great if we could try this out on one rack first, if possible.

UI changes for phones (well, for images other than b2g, really) will be forthcoming in another patch; it's less critical, since I added phone support in mainly to ease testing.
Attachment #694220 - Flags: review?(dustin)
Comment on attachment 694220 [details] [diff] [review]
Add SUT reboot and verification states, support for phone hardware

Review of attachment 694220 [details] [diff] [review]:
-----------------------------------------------------------------

On the topic of SUT on Android - I'm surprised to hear the Panda Android image doesn't contain SUT (but I don't have any reason to think you're wrong).  IMHO, it should, as we've already had a lot of trouble with pandas that will be solved by this bug.  It'd be a shame to have those troubles continue for pandas running Android.

As for what uses SUT for reboots and checks: there are really three different concepts here:
* PXE-bootability (a property of the hardware type)
* power-controllability (a property of the device, specifically relay_info)
* SUT accessibility (a property of the image)

We should plan to make schema adjustments to add this data (hardware_types.can_pxe_boot and images.has_sutagent), but not necessarily in this patch.  But this patch should probably *anticipate* the distinction between the three concepts above; they're currently blurred in the please_pxe_boot and please_power_cycle states.

My vision for this, is that devices would go through the same set of states regardless of capabilities, just using different operations, and failing if the operation is impossible (power-cycling a device with relay_info=NULL and running an image with NOT images.has_sutagent; or pxe-booting a device with NOT hardware_types.can_pxe_boot).  Specifically:
* the power-cycling steps (PowerCycleMixin) would use sutagent if available, falling back to power cycling if necessary and available; and
* the post-image or post-boot steps (*_rebooting, *_pinging) for either would try to run SUT verification steps if those are available for the new image, falling back to ping if not.  These steps could use some refactoring.

Ideally the 'free' and 'ready' states will also use ping and/or SUTagent checks, but that should *definitely* wait for another bug.

::: mozpool/sut/cli.py
@@ +54,5 @@
> +    return success
> +
> +def reboot(device_fqdn):
> +    logger.info('Rebooting device via SUT agent.')
> +    addr = socket.gethostbyname(config.get('server', 'fqdn').split(':')[0])

We have server.ipaddr in the config -- no need for a DNS lookup here.  (and no need for an IP address at all; see below)

@@ +59,5 @@
> +    port_manager = PortManager(addr)
> +    port = int(port_manager.reserve())
> +    dm = DeviceManagerSUT(device_fqdn)
> +    try:
> +        print 'reboot callback server on %s:%d' % (addr, port)

probably should log this, if anything, rather than print

@@ +60,5 @@
> +    port = int(port_manager.reserve())
> +    dm = DeviceManagerSUT(device_fqdn)
> +    try:
> +        print 'reboot callback server on %s:%d' % (addr, port)
> +        dm.reboot(addr, port_manager.use(port))

Does this automatically time out?  The run_async method takes care of the risk that a callback will happen after the step has timed out, but there's still two worries:
1. that we'd get a lot of threads "hung" waiting for a reboot that will never succeed;
2. that the reboot will complete, but after the state machine has timed out, delivering spurious events to the state machine and possibly rebooting the device at an unexpected time (a possible source of the dreaded "randomly reboots during tests")

Same issues for check_sdcard and sut_verify.

@@ +100,5 @@
> +        prevent others from using the port.
> +        '''
> +        while True:
> +            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +            sock.bind((self.ipaddr, 0))

You can change self.ipaddr to '' here.

@@ +112,5 @@
> +        returning the port.
> +        '''
> +        sock = self.reserved_ports[port]
> +        sock.close()
> +        return port

Well, that's a big 'ol race condition, but I assume this code is from elsewhere, so no need to fix it here.  Better would be for DeviceManagerSUT.reboot to open a socket with bind(('', 0)), and use the port number that socket is assigned.
Attachment #694220 - Flags: review?(dustin) → review-
Android images do have SUT agent running.  I've been telneting to it for debugging.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Comment on attachment 694220 [details] [diff] [review]
> Add SUT reboot and verification states, support for phone hardware
> 
> Review of attachment 694220 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> On the topic of SUT on Android - I'm surprised to hear the Panda Android
> image doesn't contain SUT (but I don't have any reason to think you're
> wrong).  IMHO, it should, as we've already had a lot of trouble with pandas
> that will be solved by this bug.  It'd be a shame to have those troubles
> continue for pandas running Android.

Ah Jake says the SUT agent is built in.  I'll correct this assumption in my
code.

> 
> As for what uses SUT for reboots and checks: there are really three
> different concepts here:
> * PXE-bootability (a property of the hardware type)
> * power-controllability (a property of the device, specifically relay_info)
> * SUT accessibility (a property of the image)

Yeah been thinking about this.  That's a good breakdown.

> 
> We should plan to make schema adjustments to add this data
> (hardware_types.can_pxe_boot and images.has_sutagent), but not necessarily
> in this patch.  But this patch should probably *anticipate* the distinction
> between the three concepts above; they're currently blurred in the
> please_pxe_boot and please_power_cycle states.
> 
> My vision for this, is that devices would go through the same set of states
> regardless of capabilities, just using different operations, and failing if
> the operation is impossible (power-cycling a device with relay_info=NULL and
> running an image with NOT images.has_sutagent; or pxe-booting a device with
> NOT hardware_types.can_pxe_boot).  Specifically:
> * the power-cycling steps (PowerCycleMixin) would use sutagent if available,
> falling back to power cycling if necessary and available; and
> * the post-image or post-boot steps (*_rebooting, *_pinging) for either
> would try to run SUT verification steps if those are available for the new
> image, falling back to ping if not.  These steps could use some refactoring.

Yeah I was worried about the states exploding and getting overly complicated.

I may as well implement the extra columns now.

Just to clarify--so you don't think there should be extra states for sut_verifying?
I feel like, for clarity, any step that waits on something specific should be its
own state, and some processes will take more steps than others.  I don't like the
idea of a single state doing two or more distinct operations--but perhaps I'm
misunderstanding you.

In any case, I think there should still be specified failure states (e.g. failed_sut_verifying)
to make it easy to figure out where a device died.


> ::: mozpool/sut/cli.py
> @@ +59,5 @@
> > +    port_manager = PortManager(addr)
> > +    port = int(port_manager.reserve())
> > +    dm = DeviceManagerSUT(device_fqdn)
> > +    try:
> > +        print 'reboot callback server on %s:%d' % (addr, port)
> 
> probably should log this, if anything, rather than print

Oops yeah that was for debugging.

> 
> @@ +60,5 @@
> > +    port = int(port_manager.reserve())
> > +    dm = DeviceManagerSUT(device_fqdn)
> > +    try:
> > +        print 'reboot callback server on %s:%d' % (addr, port)
> > +        dm.reboot(addr, port_manager.use(port))
> 
> Does this automatically time out?  The run_async method takes care of the
> risk that a callback will happen after the step has timed out, but there's
> still two worries:
> 1. that we'd get a lot of threads "hung" waiting for a reboot that will
> never succeed;
> 2. that the reboot will complete, but after the state machine has timed out,
> delivering spurious events to the state machine and possibly rebooting the
> device at an unexpected time (a possible source of the dreaded "randomly
> reboots during tests")

It does eventually give up, but yeah, the timeout in the state should be at
least as long as that timeout, and I don't think it is right now... though
actually determining this is difficult, since connection time outs can
vary.  I am thinking that there should be a method of last resort to
forcibly kill a thread.

> 
> Same issues for check_sdcard and sut_verify.
> 
> @@ +100,5 @@
> > +        prevent others from using the port.
> > +        '''
> > +        while True:
> > +            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > +            sock.bind((self.ipaddr, 0))
> 
> You can change self.ipaddr to '' here.

Hm, I did think to prevent the possibility of it opening a socket on
the wrong interface.  I guess it's unlikely on production machines, but
I wanted to be sure.

> 
> @@ +112,5 @@
> > +        returning the port.
> > +        '''
> > +        sock = self.reserved_ports[port]
> > +        sock.close()
> > +        return port
> 
> Well, that's a big 'ol race condition, but I assume this code is from
> elsewhere, so no need to fix it here.  Better would be for
> DeviceManagerSUT.reboot to open a socket with bind(('', 0)), and use the
> port number that socket is assigned.

Yeah that is one of many problems in DeviceManagerSUT.  We're fixing them
slowly.  At the moment there isn't any good way to completely eliminate
the possibility of a race condition as far as I know.
I wasn't clear - I think there should be a *_pinging state and a separate *_sut_verify state; the sut_verify state would just skip to the next state if the image doesn't have SUT on it.  I mostly objected to having an entirely separate sequence of steps for SUT vs. relays.  Sorry about that confusion.

As for connect timeouts -- that's why I implemented that asyncore abomination in mozpool/bmm/pxe.py.  That's a bit impractical for mozdevice.  Killing threads isn't supported in Python's threading library, from what I can tell.  Maybe this is just something we should log profusely (maybe if dm.reboot or the like return after the timeout), bear in mind as a possible cause of random mid-test reboots, and address if it becomes a problem?

Binding to '' will bind to all interfaces, so it shouldn't be a problem.
huh.. i just learned about socket. settimeout, which would have saved me a lot of extra code in pxe.py..
Depends on: 826117
This patch applies on top of head.  I'll fix it up for your self-test changes once that lands.

I fixed up the state transitions as you suggested.  I also made a few other changes:

* I put in specific timeouts where I use the DeviceManagerSUT class and went through the code to calculate the maximum amount of time that it should take.  I fixed a bug in DeviceManagerSUT too, where it wasn't respecting timeouts when connecting.  Regardless, I realized that we don't really need, or probably even want, state timeouts for these things.  DeviceManagerSUT has its own timeouts and should never hang on a dropped/bad connection/bad device/bad agent.  I change it to only retry once, though, so we rely on the state machine for retries, which I think is cleaner and more consistent with the rest of mozpool.

* I removed the use of DeviceManagerSUT's reboot detection (via callback server).  I think it's more trouble than it's worth.  Instead we just ping the device until it's back up, as we would with a panda.  The only gotcha here, which probably applies to pandas is well, is that we can't be 100% certain that the device *actually* rebooted (I've run into this problem on a real phone which refused to reboot but otherwise behaved normally).  We should probably have a generic uptime check or something to verify this.

* For testing purposes, I decided to implement support for can_reuse.  I decided this should be done in the state machine and not in the data functions, since I doubt the logic to prefer a matching device but fall back to any device could be done particularly well in a SQL statement.  Instead we get all the free devices and leave it to the caller to figure out which are better than others.

* Also, like the previous patch but which I forgot to mention, this fixes the UI indicator that a request has specific boot_config flags associated with the image.

New request UI for android images will be in a separate patch.
Attachment #694220 - Attachment is obsolete: true
Attachment #698836 - Flags: review?(dustin)
Comment on attachment 698836 [details] [diff] [review]
Add SUT reboot and verification states, support for phone hardware, v2

Review of attachment 698836 [details] [diff] [review]:
-----------------------------------------------------------------

The timeout issue is easy to fix, but significant.  Since you need to re-draft atop the self-test changes, it's probably best to remove the can_pxe_boot column, too.  Other than that, this looks great.

::: mozpool/lifeguard/devicemachine.py
@@ +101,5 @@
> +                                   args['image'],
> +                                   args['boot_config'])
> +            self.machine.goto_state(start_pxe_boot)
> +        else:
> +            self.logger.error('no capability to image this device')

Actually, probably a better indicator than a boolean can_pxe_boot is looking for the image_pxe_configs row corresponding to the requested image and the device's hardware type.  We may be able to put Android on some devices, but not B2G, for example.  The request logic should handle that for now, and a manual please_image request will just bomb out later, so this isn't something that must be fixed right away.  On the other hand, it involves a schema change, so maybe it should happen in this bug?

@@ +335,5 @@
> +class sut_verifying(statemachine.State):
> +    """
> +    If the image has a SUT agent, verify that we can establish a connection.
> +    If no SUT agent, just proceed to next state.
> +    No timeout here because DeviceManagerSUT methods have their own timeouts,

> * I put in specific timeouts where I use the DeviceManagerSUT class and went
> through the code to calculate the maximum amount of time that it should
> take.  I fixed a bug in DeviceManagerSUT too, where it wasn't respecting
> timeouts when connecting.  Regardless, I realized that we don't really need,
> or probably even want, state timeouts for these things.  DeviceManagerSUT
> has its own timeouts and should never hang on a dropped/bad connection/bad
> device/bad agent.  I change it to only retry once, though, so we rely on the
> state machine for retries, which I think is cleaner and more consistent with
> the rest of mozpool.

We do, in case the mozpool daemon restarts.  Without a timeout, lifeguard would just leave the board in the power-cycling state, without actually power-cycling it.

@@ +372,5 @@
> +    """
> +    If the image has a SUT agent, verify that we can write a test file to the
> +    device's SD card.
> +    If no SUT agent, just proceed to next state.
> +    No timeout here because DeviceManagerSUT methods have their own timeouts,

Needs a timeout, as described above.

::: mozpool/mozpool/requestmachine.py
@@ +182,5 @@
> +            if image_is_reusable:
> +                devices_with_image = [x for x in free_devices
> +                                      if x['image'] == request['image'] and
> +                                         x['boot_config'] ==
> +                                         request['boot_config']]

This means that we need to be careful to set boot_config = {} for images that don't require it.  I *think* that's OK, but we should probably be explicit about it.

@@ +254,5 @@
> +                    return {}
> +            device_config = data.device_config(request_config['assigned_device'])
> +            if (device_config['image'] == request_config['image'] and
> +                from_json(device_config['boot_config']) ==
> +                from_json(request_config['boot_config'])):

Why does this convert from JSON, when the pre-check above does not?
Attachment #698836 - Flags: review?(dustin) → review-
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Actually, probably a better indicator than a boolean can_pxe_boot is looking
> for the image_pxe_configs row corresponding to the requested image and the
> device's hardware type.  We may be able to put Android on some devices, but
> not B2G, for example.  The request logic should handle that for now, and a
> manual please_image request will just bomb out later, so this isn't
> something that must be fixed right away.  On the other hand, it involves a
> schema change, so maybe it should happen in this bug?

Good idea.  Removing the can_pxe_boot column.

> We do, in case the mozpool daemon restarts.  Without a timeout, lifeguard
> would just leave the board in the power-cycling state, without actually
> power-cycling it.

Aha right, okay.  I will still leave the callback timeout out, but I'll add in a timeout to the states that is a bit longer than the maximum time that the SUT operations should take.

> This means that we need to be careful to set boot_config = {} for images
> that don't require it.  I *think* that's OK, but we should probably be
> explicit about it.
> ...
> Why does this convert from JSON, when the pre-check above does not?

An oversight. :) I moved the from_json() function to data.py (seems like the best place).  I guess sometimes we want the raw JSON string, so we can just convert on demand.  I fixed up the first bit to use that function, so blank boot_configs are still acceptable.
Updated with requested changes and unbitrotted against mozpool head.  Also added a timeout to the new state, as discussed on IRC.
Attachment #698836 - Attachment is obsolete: true
Attachment #699377 - Flags: review?(dustin)
Comment on attachment 699377 [details] [diff] [review]
Add SUT reboot and verification states, support for phone hardware, v3

Review of attachment 699377 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great.

Sorry for the delay - I thought I submitted this this afternoon!

::: mozpool/db/data.py
@@ +345,5 @@
>      return config
>  
> +def device_hardware_type(device):
> +    """
> +    Get the hardware type, model, and sut-agent capability for this device.

This doesn't return anything about SUT..
Attachment #699377 - Flags: review?(dustin) → review+
Fixed the comment.

https://github.com/djmitche/mozpool/commit/259a9ff52c06b23af2d1633253c7106fcbd293ec
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 888945
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: