Closed Bug 819576 Opened 12 years ago Closed 12 years ago

More flexible image configuration

Component: Testing Graveyard :: Mozpool (defect)
Severity: normal
Status: RESOLVED FIXED
Reporter: mcote (Unassigned)
Attachments: 1 file, 1 obsolete file

Currently, we require an "image" argument, and the only image we support is "b2g", which in turn requires a "b2gbase" argument as well.  The pxe config required for the b2g image is specified in the config file, in the [mozpool] section.

A more flexible solution involves creating an "images" table in the database, with these columns:

- id
- name
- pxe_config_id (for now; see image_pxe_configs below)
- boot_config_keys (JSON list)
- can_reuse (can we assign an existing device with the same image, or do we have to reimage every time)

The requests table should be expanded with an image_id column.  The mozpool web handler will identify the requested image and verify that the arguments make sense by ensuring that all the keys in 'boot_config_keys' have been provided.  This allows us to fail immediately on a bad request (a check sketched below, after the examples).

images.boot_config_keys example:
- b2g: '["b2gbase"]'
- android: '[]'
- b2g_proprietary: '["b2gbase", "proprietary_overlay"]'

requests.boot_config example:
- b2g: '{"b2gbase": "http://..."}'
- android: '{}'
- b2g_proprietary: '{"b2gbase": "http://...", "proprietary_overlay": "..."}'


Later, we can expand to a new table to match images, hardware types, and pxe configs:

table image_pxe_configs
- image_id
- hardware_type_id
- pxe_config_id
(primary key image_id, hardware_type_id)
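
In SQLAlchemy terms, the two tables might look roughly like this (a sketch only; column types are assumptions, and pxe_configs/hardware_types are the existing tables):

  from sqlalchemy import (MetaData, Table, Column, Integer, String, Text,
                          Boolean, ForeignKey)

  metadata = MetaData()

  images = Table('images', metadata,
      Column('id', Integer, primary_key=True),
      Column('name', String(32), nullable=False),
      Column('pxe_config_id', Integer, ForeignKey('pxe_configs.id')),
      # JSON list of required boot_config keys, e.g. '["b2gbase"]'
      Column('boot_config_keys', Text, nullable=False),
      # true if a device already running this image can be reassigned
      Column('can_reuse', Boolean, nullable=False))

  image_pxe_configs = Table('image_pxe_configs', metadata,
      # composite primary key (image_id, hardware_type_id)
      Column('image_id', Integer, ForeignKey('images.id'),
             primary_key=True),
      Column('hardware_type_id', Integer, ForeignKey('hardware_types.id'),
             primary_key=True),
      Column('pxe_config_id', Integer, ForeignKey('pxe_configs.id')))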
Blocks: 815785
Assignee: nobody → mcote
Status: NEW → ASSIGNED
I decided to go ahead and implement the proposed image_pxe_configs table in this patch.  It's not much more work, and will be one less schema change down the road.

Before this is deployed, we'll need to add the 'images' and 'image_pxe_configs' tables.  The requests table also now has a foreign key to the 'images' table, which will need to be added.

Maybe we can deploy it on just one rack and test there first?
Attachment #692825 - Flags: review?(dustin)
Attachment #692825 - Attachment description: Tables for images and images/pxe configs/hardware types → Use tables for images and images/pxe configs/hardware types
Comment on attachment 692825 [details] [diff] [review]
Use tables for images and images/pxe configs/hardware types

Review of attachment 692825 [details] [diff] [review]:
-----------------------------------------------------------------

As far as landing this, all of the servers have the same DB, so we can't change the schema for just one of them.

Aside from dropping the last_pxe_config_id column, these schema changes could be made backward-compatible.  I'll need to think carefully about how to do that.  The tricky bit is the `requests.image_id` column, which Mozpool-1.1.1 doesn't include in its inserts.  We could populate the image table before adding that column, and set a default value pointing to one of those images.
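
Sketched, roughly (the statements and the b2g row values here are illustrative, not the final DDL):

  # hypothetical migration order: populate images before adding the
  # column, so the default can point at an existing row
  conn.execute("insert into images (name, boot_config_keys, can_reuse) "
               "values ('b2g', '[\"b2gbase\"]', 0)")
  b2g_id = conn.execute("select id from images where name = 'b2g'").scalar()
  conn.execute("alter table requests add column image_id integer unsigned "
               "not null default %d" % b2g_id)
  conn.execute("alter table requests add foreign key (image_id) "
               "references images(id) on delete restrict")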

::: mozpool/db/data.py
@@ +70,5 @@
> +    if not args:
> +        return stmt
> +    def argval(a):
> +        if isinstance(a, str) or isinstance(a, unicode):
> +            return '"%s"' % a

This is an SQL injection vulnerability!  I see it already existed in dump_devices and dump_requests.  In the former case, the input only comes from inventory; in the latter, it's an integer ID assigned to the state machine.  So what's in place is safe, but still not a good idea.

Why not just add actual expression objects?
  exprs = [ (col == a) for a in args ]
and use reduce with or_, rather than the conditional?

In fact, a much easier solution is the following:
  if args:
    stmt = stmt.where(col.in_(args))

Note that either case will fail for len(args) >> 100, as the DB only allows SQLAlchemy to use a certain number of placeholders (the number varies).  From a look at the code, len(args) <= 1, so I think this restriction is fine.  In fact, the calling convention for dump_devices suggests that dump_devices('panda1', 'panda2') would give me a list of [ {dict for panda1}, {dict for panda2} ], when in fact the ordering of the results is not defined.  So maybe it'd be better to limit these functions to 0 (meaning fetch all) or 1 argument?
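
For reference, both safe shapes as small helpers (a sketch; `stmt` and `col` as above, and reduce is the Python 2 builtin):

  from sqlalchemy import or_

  def filter_eq(stmt, col, args):
      # explicit expression objects; each value becomes a bound parameter
      if args:
          stmt = stmt.where(reduce(or_, [(col == a) for a in args]))
      return stmt

  def filter_in(stmt, col, args):
      # equivalent IN form, also using bound parameters
      if args:
          stmt = stmt.where(col.in_(args))
      return stmt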

@@ +111,5 @@
>      stmt = sqlalchemy.select(
>          [devices.c.id, devices.c.name, devices.c.fqdn, devices.c.inventory_id,
>           devices.c.mac_address, img_svrs.c.fqdn.label('imaging_server'),
> +         devices.c.relay_info, hw_types.c.type.label('hardware_type'),
> +         hw_types.c.model.label('hardware_model')],

Note that dump_devices is *only* used by inventorysync right now, and its return value must contain exactly the keys that are extracted from inventory (since it compares dicts using ==).

So, inventorysync.py will need an upgrade to handle these new fields.  o['server_model'] is what you want to look at.  For pandas, it's

{u'vendor': u'PandaBoard', u'description': u'', u'part_number': u'', u'model': u'ES', u'id': 529, u'resource_uri': u'/en-US/tasty/v3/server_model/529/'}

It seems best for inventorysync.py to have a mapping from (vendor, model) to (hardware_type, hardware_model); for tegras, we will want hardware_type = "Tegra", not "Nvidia", I think.
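
For instance (the inventory strings here are guesses):

  # hypothetical (vendor, model) -> (hardware_type, hardware_model) map
  HARDWARE_MAP = {
      (u'PandaBoard', u'ES'): ('panda', 'ES'),
      (u'Nvidia', u'Tegra 250'): ('Tegra', '250'),
  }

  def hardware_info(server_model):
      key = (server_model['vendor'], server_model['model'])
      # fall back to the raw inventory strings for unmapped hardware
      return HARDWARE_MAP.get(key, key)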

::: mozpool/mozpool/handlers.py
@@ +56,4 @@
>          try:
>              assignee = body['assignee']
>              duration = int(body['duration'])
> +            image_name = body['image']

Where does this come from?  It looks like this needs API.txt changes as well as JS frontend changes (and probably an API endpoint to list images, so the JS can make a select box?)

::: setup.py
@@ +18,4 @@
>            'requests',
>            'distribute',
>            'argparse',
> +          'mozdevice',

what's this for?

::: sql/schema.sql
@@ +130,5 @@
> +  -- short identifier
> +  name varchar(32) not null,
> +  -- required boot_config keys (JSON list)
> +  boot_config_keys text not null,
> +  -- true (1) if we can reuse an existing device with this image

I've forgotten this, so probably worth a comment: I think this means that the device can be re-used if the image *and* the boot config match, right?

@@ +154,5 @@
> +  foreign key (hardware_type_id) references hardware_types(id) on delete restrict,
> +  pxe_config_id integer unsigned,
> +  foreign key (pxe_config_id) references pxe_configs(id) on delete restrict,
> +
> +  unique index imagehardware_index (image_id, hardware_type_id)

This should probably be the primary key

@@ +238,5 @@
>    -- time (UTC) at which the request will expire (if not renewed)
>    expires datetime not null,
> +  -- image requested
> +  image_id integer unsigned not null,
> +  foreign key (image_id) references images(id) on delete restrict,

Should a similar column (`last_image_id`) be added to `devices`, and the `last_pxe_config_id` column removed?  You'll also need a `last_boot_config` column, most likely.  It's fine to do that in a subsequent patch, since the data isn't needed yet (it will be when can_reuse becomes useful).  However, it will probably be more natural for users to see an "Image" column in the UI.  If you want to get fancy, that column could have a suffix to remind users that there's boot_config to think about, too:

  Image
  -----
  Android v3
  Android v3
  B2G v4 †
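
In schema terms, that is probably just two columns on devices (a sketch, names as proposed above):

  from sqlalchemy import Column, Integer, Text, ForeignKey

  # sketched additions to the devices table, replacing last_pxe_config_id;
  # these would live inside the existing Table('devices', ...) definition
  last_image_id = Column('last_image_id', Integer, ForeignKey('images.id'))
  last_boot_config = Column('last_boot_config', Text)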
Attachment #692825 - Flags: review?(dustin) → review-
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> Comment on attachment 692825 [details] [diff] [review]
> Use tables for images and images/pxe configs/hardware types
> 
> Review of attachment 692825 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As far as landing this, all of the servers have the same DB, so we can't
> change the schema for just one of them.

Oh right.  I think we need to think about a staging server again.  I understand not wanting to have a bunch of pandas sit idle most of the time, but there are a lot of moving parts here, and not being able to fully try out the system before deployment makes me very nervous.  Maybe we can somehow set something up to take an entire rack out of production temporarily?  We could have them set to locked_out in the main db but have a separate staging db to track their states in testing.

> Aside from dropping the last_pxe_config_id column, these schema changes
> could be made backward-compatible.  I'll need to think carefully about how
> to do that.  The tricky bit is the `requests.image_id` column, which
> Mozpool-1.1.1 doesn't include in its inserts.  We could populate the image
> table before adding that column, and set a default value pointing to one of
> those images.

Yeah that's what I was thinking.

Just to be clear, you don't think any code changes are necessary here, right?

> 
> ::: mozpool/db/data.py
> @@ +70,5 @@
> > +    if not args:
> > +        return stmt
> > +    def argval(a):
> > +        if isinstance(a, str) or isinstance(a, unicode):
> > +            return '"%s"' % a
> 
> This is an SQL injection vulnerability!  I see it already existed in
> dump_devices and dump_requests.  In the former case, the input only comes
> from inventory; in the latter, it's an integer ID assigned to the state
> machine.  So what's in place is safe, but still not a good idea.

Ah yeah.  Okay, changing (and simplifying) to take one optional parameter (two in the case of requests, to keep 'include_closed'), defaulting to None.  If it's given, we add precisely one where clause.
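
That is, something like this (a sketch of the new convention; `devices` and `conn` stand in for the module's table object and connection):

  import sqlalchemy

  def dump_devices(device_name=None):
      stmt = sqlalchemy.select(
          [devices.c.id, devices.c.name, devices.c.fqdn])
      if device_name is not None:
          # precisely one where clause; the value is a bound parameter
          stmt = stmt.where(devices.c.name == device_name)
      return [dict(row) for row in conn.execute(stmt)]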

> 
> @@ +111,5 @@
> >      stmt = sqlalchemy.select(
> >          [devices.c.id, devices.c.name, devices.c.fqdn, devices.c.inventory_id,
> >           devices.c.mac_address, img_svrs.c.fqdn.label('imaging_server'),
> > +         devices.c.relay_info, hw_types.c.type.label('hardware_type'),
> > +         hw_types.c.model.label('hardware_model')],
> 
> Note that dump_devices is *only* used by inventorysync right now, and its
> return value must contain exactly the keys that are extracted from inventory
> (since it compares dicts using ==).
> 

Ah right.  Hm.  I am starting to think that this is a YAGNI issue, so I think I'll just remove this change, since, as you say, this function is only used in one place, which doesn't care about the extra fields.  We can expand it later if necessary.

> 
> ::: mozpool/mozpool/handlers.py
> @@ +56,4 @@
> >          try:
> >              assignee = body['assignee']
> >              duration = int(body['duration'])
> > +            image_name = body['image']
> 
> Where does this come from?  It looks like this needs API.txt changes as well
> as JS frontend changes (and probably an API endpoint to list images, so the
> JS can make a select box?)

Well, it was already in there, and it is already documented in API.txt (this is the big API change we talked about a couple weeks ago).  I just changed how we deal with it, since before it had to be 'b2g' and nothing else.  However you're right; we need an images API, which I will add.

I see that you added /environment/list/ to the bmm layer.  This is used by mozpool in building requests through the UI--maybe it should be at that layer?  Not that it makes much difference in our current setup.

> 
> ::: setup.py
> @@ +18,4 @@
> >            'requests',
> >            'distribute',
> >            'argparse',
> > +          'mozdevice',
> 
> what's this for?

Oops sorry, that's for SUT stuff I'm working on in parallel.  I'll leave it out of this patch.

> 
> ::: sql/schema.sql
> @@ +130,5 @@
> > +  -- short identifier
> > +  name varchar(32) not null,
> > +  -- required boot_config keys (JSON list)
> > +  boot_config_keys text not null,
> > +  -- true (1) if we can reuse an existing device with this image
> 
> I've forgotten this, so probably worth a comment: I think this means that
> the device can be re-used if the image *and* the boot config match, right?

Right.

> 
> @@ +154,5 @@
> > +  foreign key (hardware_type_id) references hardware_types(id) on delete restrict,
> > +  pxe_config_id integer unsigned,
> > +  foreign key (pxe_config_id) references pxe_configs(id) on delete restrict,
> > +
> > +  unique index imagehardware_index (image_id, hardware_type_id)
> 
> This should probably be the primary key

Right.

> 
> @@ +238,5 @@
> >    -- time (UTC) at which the request will expire (if not renewed)
> >    expires datetime not null,
> > +  -- image requested
> > +  image_id integer unsigned not null,
> > +  foreign key (image_id) references images(id) on delete restrict,
> 
> Should a similar column (`last_image_id`) be added to `devices`, and the
> `last_pxe_config_id` column removed?  You'll also need a `last_boot_config`
> column, most likely.  It's fine to do that in a subsequent patch, since the
> data isn't needed yet (it will be when can_reuse becomes useful).  However,
> it will probably be more natural for users to see an "Image" column in the
> UI.  If you want to get fancy, that column could have a suffix to remind
> users that there's boot_config to think about, too:
> 
>   Image
>   -----
>   Android v3
>   Android v3
>   B2G v4 †

I may as well do that now.  New patch should be up for review later today!  Thanks for your thoroughness.
(In reply to Mark Côté ( :mcote ) from comment #3)
> Oh right.  I think we need to think about a staging server again.  I
> understand not wanting to have a bunch of pandas sit idle most of the time,
> but there are a lot of moving parts here, and not being able to fully try
> out the system before deployment makes me very nervous.  Maybe we can
> somehow set something up to take an entire rack out of production
> temporarily?  We could have them set to locked_out in the main db but have a
> separate staging db to track their states in testing.

That's certainly possible - it's basically what we did two weeks ago.  It turned out to be a lot of work!  I don't see a big problem with this particular patch, and rollback is easy enough with due care taken in the schema.

The eventual destination of the prototype chassis is still not nailed down, so it's entirely possible that will end up as our staging environment.

> Yeah that's what I was thinking.
> 
> Just to be clear, you don't think any code changes are necessary here, right?

Correct.

> > Note that dump_devices is *only* used by inventorysync right now, and its
> > return value must contain exactly the keys that are extracted from inventory
> > (since it compares dicts using ==).
> > 
> 
> Ah right.  Hm.  I am starting to think that this is a YAGNI issue, so I
> think I'll just remove this change, since, as you say, this function is only
> used in one place, which doesn't care about the extra fields.  We can expand
> it later if necessary.

It does care about those fields, actually, since it needs to populate them for new devices.  But it could be a separate patch (and we can statically configure the hardware type for existing devices for now) if you'd like.

Also, TIL YAGNI.

> Well, it was already in there, and it is already documented in API.txt (this
> is the big API change we talked about a couple weeks ago).  I just changed
> how we deal with it, since before it had to be 'b2g' and nothing else. 
> However you're right; we need an images API, which I will add.

I'm a dummy.  I blame being away from the code for a week :)

> I see that you added /environment/list/ to the bmm layer.  This is used by
> mozpool in building requests through the UI--maybe it should be at that
> layer?  Not that it makes much difference in our current setup.

That seems the right place -- /images/list/, feeding into a select box
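
Something like this on that side, then (a web.py sketch; the handler and db helper names are made up):

  import json

  class image_list:
      def GET(self):
          # 'data' would be the db module; dump_images is a hypothetical
          # helper returning a list of image dicts for the select box
          return json.dumps({'images': data.dump_images()})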

> I may as well do that now.  New patch should be up for review later today! 
> Thanks for your thoroughness.

Always - well, I try anyway ;)
As discussed, this addresses the issues you raised and also changes the please_pxe_boot lifeguard API to please_image.  I updated the UI accordingly.

I have pushed the original patch and these changes to my github in the flexible-images branch, if you want to see the incremental changes over the last patch.
Attachment #692825 - Attachment is obsolete: true
Attachment #693113 - Flags: review?(dustin)
Comment on attachment 693113 [details] [diff] [review]
Use tables for images and images/pxe configs/hardware types

Looks great.  If you land this ASAP, I'll cut and deploy a new version.
Attachment #693113 - Flags: review?(dustin) → review+
https://github.com/djmitche/mozpool/commit/00fed3d32b1241ac422cbd61f5b647fea35fbbcc
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard