Closed Bug 1256375 Opened 8 years ago Closed 8 years ago

Rebooting through slaveapi fails with "Expecting value: line 2 column 1 (char 1)"

Categories: Infrastructure & Operations Graveyard :: CIDuty, task
Type: task, Priority: Not set, Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: philor, Assigned: kmoir
Attachments: 2 files, 1 obsolete file

e.g. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-095 though it affects all flavors of test and build slaves.

We've already lost 13% of the WinXP slaves, so if there aren't any events like a temporary slavealloc outage which takes out a huge swath of slaves, figure three or four days before it becomes a blocker.
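
For reference, the wording of the error in the bug title matches what Python 3's json module reports when it is handed a body that is not JSON. The sketch below is not slaveapi or slave_health code; the helper name and the sample response body are made up for illustration. It only shows one way to get exactly that message: a body that starts with a newline followed by non-JSON text.

```python
import json


def describe_json_failure(body):
    """Return the decoder's error message if body is not valid JSON, else None."""
    try:
        json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return str(exc)
    return None


# Hypothetical response body: a blank first line, then an HTML error page.
print(describe_json_failure("\n<html>Internal Server Error</html>"))
# prints: Expecting value: line 2 column 1 (char 1)
```

If that is what is happening here, the message says little about the real failure; the raw response body from the reboot request would be the thing to look at.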
I reverted http://hg.mozilla.org/build/slave_health/rev/49cab2f5cf4c since it looks like the change to slave health might be the problem.
That didn't make a difference, so relanding.
Attached patch bug1256375.patch
Remove references to mozpool, devices.json, etc. that are preventing reboots.
Attachment #8730454 - Flags: review?(coop)
Blocks: 1186617
Assignee: nobody → kmoir
I managed to connect to almost all of the 13 slaves mentioned above. Some were running jobs, so I only rebooted the idle ones.

Another problem here is that their status on the Slave Health dashboard does not change, because the /slave_health/json/test-t-xp32-ix.json file is not being updated (the same applies to other pools).

The json file reports generated: "2016-03-14T15:00:09.008130Z", which does not look right.
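
A quick way to check whether a report's "generated" stamp is stale is sketched below. It assumes the stamp is UTC in the format shown above; the one-hour threshold and the "now" value in the usage line are hypothetical, not taken from slave_health.

```python
from datetime import datetime, timedelta


def is_stale(generated, now, max_age=timedelta(hours=1)):
    """Return True if the report's 'generated' stamp is older than max_age.

    Both values are treated as naive UTC datetimes.
    """
    stamp = datetime.strptime(generated, "%Y-%m-%dT%H:%M:%S.%fZ")
    return now - stamp > max_age


# Hypothetical check time, the day after the stamp seen in the json file.
print(is_stale("2016-03-14T15:00:09.008130Z", datetime(2016, 3, 15, 12, 0, 0)))
```

In real use, `now` would be `datetime.utcnow()` or an equivalent UTC clock reading.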
Attachment #8730454 - Flags: review?(coop) → review+
Comment on attachment 8730454 [details] [diff] [review]
bug1256375.patch

Now following instructions here 
https://wiki.mozilla.org/ReleaseEngineering/Applications/SlaveAPI
to deploy to dev, then prod
Attachment #8730454 - Flags: checked-in+
Comment on attachment 8730454 [details] [diff] [review]
bug1256375.patch

actually I'm getting 

remote: abort: could not lock repository /repo/hg/mozilla/build/slaveapi: Permission denied
abort: unexpected response: empty string

and can't land it there
Attachment #8730454 - Flags: checked-in+ → checked-in-
Depends on: 1257283
Win7 slaves dead so far (with 3232 pending jobs): t-w732-ix-173 t-w732-ix-047 t-w732-ix-026 t-w732-ix-126 t-w732-ix-028 t-w732-ix-041 t-w732-ix-258 t-w732-ix-281 t-w732-ix-031 t-w732-ix-162 t-w732-ix-150
I added back buildfarm/mobile/devices.json until we can get my commit rights sorted out in bug 1257283.
Comment on attachment 8730454 [details] [diff] [review]
bug1256375.patch

really checked in this time but in git
Attachment #8730454 - Flags: checked-in- → checked-in+
Hit this error when I restarted production slaveapi so I have reverted the version of slaveapi puppet/production to 1.5.0

2016-03-17 11:26:08,442 - INFO - t-w732-ix-026 - Getting inventory info
2016-03-17 11:26:08,629 - ERROR - t-w732-ix-026 - Something went wrong while processing!
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 - Traceback (most recent call last):
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/processor.py", line 64, in _worker
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 -     res, msg = action(slave, *args, **kwargs)
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,630 - ERROR - t-w732-ix-026 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 32, in reboot
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 -     slave.load_inventory_info()
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 60, in load_inventory_info
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 -     info = Machine.load_inventory_info(self)
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/machines/base.py", line 44, in load_inventory_info
2016-03-17 11:26:08,631 - ERROR - t-w732-ix-026 -     if info["pdu_fqdn"]:
2016-03-17 11:26:08,632 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,632 - ERROR - t-w732-ix-026 - TypeError: 'NoneType' object has no attribute '__getitem__'
2016-03-17 11:26:08,632 - ERROR - t-w732-ix-026 - 
2016-03-17 11:26:08,632 - ERROR - t-w732-ix-026 -
Attached patch bug1256375fixbustage.patch (obsolete)
Found the problem: my previous patch removed the return statement from get_system by accident.
Attachment #8731833 - Flags: review?(coop)
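
A lookup function that falls off the end without returning hands None back to its caller, and subscripting None gives the TypeError in the traceback above (Python 3 words it as "'NoneType' object is not subscriptable"). The sketch below shows that failure mode and a guard against it. It is not the slaveapi source: the function bodies, the inventory dict and the PDU hostname are all hypothetical.

```python
def get_system_broken(inventory, name):
    """Hypothetical lookup with the return dropped: always yields None."""
    system = inventory.get(name)
    # Bug: no "return system" here, so the caller receives None.


def get_system(inventory, name):
    """Hypothetical lookup with the return in place."""
    return inventory.get(name)


def pdu_fqdn(lookup, inventory, name):
    """Fetch pdu_fqdn, failing with a clear message if the lookup gave nothing."""
    info = lookup(inventory, name)
    if info is None:
        raise LookupError("no inventory info for %s" % name)
    return info.get("pdu_fqdn")


# Made-up inventory data for illustration only.
inventory = {"t-w732-ix-026": {"pdu_fqdn": "pdu1.example.test"}}
print(pdu_fqdn(get_system, inventory, "t-w732-ix-026"))
```

With the guard, a broken lookup is reported as missing inventory info for the named slave instead of surfacing as a TypeError further down.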
I built a slaveapi 1.6.1 with this patch and bumped the slaveapi version in puppet. I deployed it to the production slaveapi instance and it all seems to be working now: I can reboot machines etc. Now all I have to do is land these patches on g.m.org once it is up again.
Attachment #8731833 - Attachment is obsolete: true
Attachment #8731833 - Flags: review?(coop)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
...err reopened until kim can land the patches
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8732148 - Flags: checked-in+
Landed
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard