Closed Bug 948244 Opened 9 years ago Closed 9 years ago

creating a request takes too long

Categories

(Testing Graveyard :: Mozpool, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=31710912&tree=Mozilla-Inbound

> 16:46:09     INFO - retry: Calling <built-in function remove> with args: ('tmpfile_stdout',), kwargs: {}, attempt #1
> 16:46:19    ERROR - Try 1: Can't get http://mobile-imaging-007.p7.releng.scl1.mozilla.com/api/device/panda-0649/request/: HTTPConnectionPool(host='mobile-imaging-007.p7.releng.scl1.mozilla.com', port=80): Request timed out. (timeout=10)!

but

> request.693659 INFO - [2013-12-09 16:46:10,889] entering state finding_device
> request.693659 INFO - [2013-12-09 16:46:11,695] Finding device.
> request.693659 INFO - [2013-12-09 16:46:14,415] Assigning device panda-0649.
> request.693659 INFO - [2013-12-09 16:46:15,516] Request succeeded.
> request.693659 INFO - [2013-12-09 16:46:16,017] entering state contacting_lifeguard
> 127.0.0.1:46788 - - [09/Dec/2013 16:46:19] "HTTP/1.1 POST /api/device/panda-0649/request/" - 200 OK

so the HTTP transaction took slightly more than the 10s timeout in mozpoolclient.  HTTP requests really shouldn't be blocking for that long!

Here's what's going on between those lines

> (initial HTTP request)

mozpool/mozpool/handlers.py
 39         args, body = templeton.handlers.get_request_parms()
 40         try:
 41             assignee = body['assignee']
 42             duration = int(body['duration'])
 43             image_name = body['image']
 44             environment = body.get('environment', 'any')
 45         except (KeyError, ValueError):
 46             raise web.badrequest()
 47 
 48         try:
 49             image = self.db.images.get(image_name)
 50         except exceptions.NotFound:
 51             raise web.notfound()
 52 
 53         boot_config = {}
 54         for k in image['boot_config_keys']:
 55             try:
 56                 boot_config[k] = body[k]
 57             except KeyError:
 58                 raise web.badrequest()
 59 
 60         request_id = self.db.requests.add(device_name, environment, assignee,
 61                 duration, image['id'], boot_config)
 62         mozpool.mozpool.driver.handle_event(request_id, 'find_device', None)

mozpool/statedriver.py
104         machine = self._get_machine(machine_name)
105         machine.handle_event(event, args)

mozpool/statemachine.py
187         handler = self._event_methods.get(event)
188         if handler:
189             handler(self, args)

mozpool/mozpool/requestmachine.py
105         self.machine.goto_state(finding_device)

mozpool/statemachine.py
100         if isinstance(new_state_name_or_class, type) and issubclass(new_state_name_or_class, State):
101             new_state_name_or_class = new_state_name_or_class.state_name
102             
103         self.state.on_exit()
104     
105         # only log actual state changes, rather than re-entries of the same state
106         if self.state.state_name != new_state_name_or_class:
107             self.logger.info('entering state %s' % (new_state_name_or_class,))

> request.693659 INFO - [2013-12-09 16:46:10,889] entering state finding_device

109         self.state = self._make_state_instance(new_state_name_or_class)
110         self.write_state(new_state_name_or_class, self.state.TIMEOUT)
111         
112         self.state.on_entry()

mozpool/mozpool/requestmachine.py
125         self.find_device()

131         self.logger.info('Finding device.')

> request.693659 INFO - [2013-12-09 16:46:11,695] Finding device.

132         device_name = None
133         count = self.machine.increment_counter(self.state_name)
134         self.db.device_requests.clear(self.machine.request_id)
135         request = self.db.requests.get_info(self.machine.request_id)
136         image_is_reusable = self.db.images.is_reusable(request['image'])
137 
138         avail_devices = self.db.devices.list_available(
139                 environment=request['environment'],
140                 device_name=request['requested_device'])
141 
142         if avail_devices:
143             if image_is_reusable:
144                 devices_with_image = [x for x in avail_devices
145                                       if x['image'] == request['image'] and
146                                          util.from_json(x['boot_config']) ==
147                                          util.from_json(request['boot_config'])]
148                 if devices_with_image:
149                     avail_devices = devices_with_image
150 
151             # pick a device at random from the returned list
152             device_name = random.choice(avail_devices)['name']
153             self.logger.info('Assigning device %s.' % device_name)

> request.693659 INFO - [2013-12-09 16:46:14,415] Assigning device panda-0649.

154             if self.db.device_requests.add(self.machine.request_id,
155                                                 device_name):
156                 self.logger.info('Request succeeded.')

> request.693659 INFO - [2013-12-09 16:46:15,516] Request succeeded.

157                 self.machine.goto_state(contacting_lifeguard)

mozpool/statemachine.py
100         if isinstance(new_state_name_or_class, type) and issubclass(new_state_name_or_class, State):
101             new_state_name_or_class = new_state_name_or_class.state_name
102             
103         self.state.on_exit()
104     
105         # only log actual state changes, rather than re-entries of the same state
106         if self.state.state_name != new_state_name_or_class:
107             self.logger.info('entering state %s' % (new_state_name_or_class,))

> request.693659 INFO - [2013-12-09 16:46:16,017] entering state contacting_lifeguard

109         self.state = self._make_state_instance(new_state_name_or_class)
110         self.write_state(new_state_name_or_class, self.state.TIMEOUT)
111         
112         self.state.on_entry()

mozpool/mozpool/requestmachine.py
186         req = self.db.requests.get_info(self.machine.request_id)
187         device_name = req['assigned_device']
188         device_state = self.db.devices.get_machine_state(device_name)
189         if device_state != 'ready':
190             self.logger.error('Assigned device %s is in unexpected state %s '
191                               'when about to contact lifeguard.' %
192                               (device_name, device_state))
193             self.machine.goto_state(finding_device)
194             # note that there's still a small chance of a race here between
195             # mozpool and lifeguard: lifeguard begins self-testing the device
196             # after this check and before it receives the event below; the
197             # device eventually returns to the 'ready' state, but has not sent
198             # the request.  The timeout for 'pending' will catch this rare
199             # situation.
200             return
201 
202         # If the requested image is reusable and we got a device with that
203         # image and the requested bootconfig, just power cycle it.
204         # Otherwise, image it.  Note that there will be a failure if the
205         # image is not installed and the device is not imageable.
206         event = ''
207         device_request_data = {}
208         assigned_device_name = req['assigned_device']
209 
210         if self.db.images.is_reusable(req['image']):
211             dev = self.db.devices.get_image(req['assigned_device'])
212             if (dev['image'] == req['image'] and
213                 util.from_json(dev['boot_config']) ==
214                 util.from_json(req['boot_config'])):
215                 event = 'please_power_cycle'
216 
217         if not event:
218             # Use the device's hardware type and requested image to find the
219             # pxe config, if any.
220             event = 'please_image'
221             device_request_data['boot_config'] = req['boot_config']
222             device_request_data['image'] = req['image']
223 
224         # try to ask lifeguard to start imaging or power cycling
225         device_url = 'http://%s/api/device/%s/event/%s/' % (
226             self.db.devices.get_imaging_server(assigned_device_name),
227             assigned_device_name, event)
228         def posted(result):
229             if result.status_code != 200:
230                 self.logger.warn("got %d from Lifeguard" % result.status_code)
231                 return
232             mozpool.driver.handle_event(self.machine.request_id, 'lifeguard_contacted', {})
233         async.requests.post.start(posted, device_url,
234                 data=json.dumps(device_request_data))

> 127.0.0.1:46788 - - [09/Dec/2013 16:46:19] "HTTP/1.1 POST /api/device/panda-0649/request/" - 200 OK

That's an awful lot to be doing before returning, but not bad given Mozpool's assumptions that it's OK to block on DB operations.  The 0.5s between "Request succeeded." and "entering state contacting_lifeguard" must have been due to the inserts into the request_log table.  That sucks, but we should be resilient to slowish databases.

We could easily finish the HTTP request after "Request succeeded".  Probably the easiest way to accomplish that is to transition to a "contact_lifeguard" state which has no on_entry method but a 0s timeout leading to the "contacting_lifeguard" state.
Attachment #8345321 - Flags: review?(jwatkins)
Comment on attachment 8345321 [details] [diff] [review]
bug948244-p1.patch

lgtm r+
Attachment #8345321 - Flags: review?(jwatkins) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.