Closed Bug 988281 Opened 10 years ago Closed 10 years ago

All trees closed - mozpool issues with "We could not request the device: mozpool status not ok, code 500"

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

Details

Bad return status from http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api/device/panda-0705/request/: 500!
We could not request the device: mozpool status not ok, code 500

seems we lost our pandas :( 

All trees closed
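For reference, the failing request above can be reproduced with a small client along these lines. This is a sketch only: the request-body fields and the error formatting are assumptions modeled on the message quoted in this bug, not mozpool's actual client code.

```python
import json
import urllib.error
import urllib.request

MOZPOOL_BASE = "http://mobile-imaging-008.p8.releng.scl1.mozilla.com/api"

def format_error(url, code):
    # Reproduces the two-line error message quoted in this bug.
    return ("Bad return status from %s: %s!\n"
            "We could not request the device: mozpool status not ok, code %s"
            % (url, code, code))

def request_device(device, body):
    # POST a device request to mozpool; the body fields (requester,
    # duration, image, etc.) are assumptions, not confirmed API fields.
    url = "%s/device/%s/request/" % (MOZPOOL_BASE, device)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as e:
        raise RuntimeError(format_error(url, e.code))
```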
Priority: -- → P1
The inventory data has been updated to a format that is unfortunately not compatible with what Mozpool expects, so Mozpool now believes only one panda is available. The problem surfaced after the automatic mozpool/inventory sync, which runs every 30 minutes.
we have two lines of attack to recover:

1) updating mozpool to handle new data format (:dustin working on this)
2) restoring data to previous format (:pmoore working on this)
I am currently trying to get access to the inventory database so I can query the mac addresses of all the pandas. Once we have that, we can fix it relatively quickly.
OK inventory is restored, mozpool database is updated, and >800 pandas are pxe-booting.

Hold your hat steady.
crond is currently disabled on mobile-imaging-001, and the inventory sync script hacked to read MAC addresses from /tmp/nics.txt.  The disabled cron will prevent puppet from running, too.  But we need to fix this expeditiously.
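The hacked sync script itself isn't shown in this bug; a minimal sketch of what reading /tmp/nics.txt could look like, assuming a simple "hostname mac-address" per-line layout (the real file's format is an assumption here):

```python
def load_macs(path="/tmp/nics.txt"):
    # Parse the temporary hostname -> MAC mapping file. The format
    # (one "hostname mac-address" pair per line, '#' comments and blank
    # lines ignored) is an assumption; the actual layout isn't in the bug.
    macs = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            host, mac = line.split(None, 1)
            macs[host] = mac.strip().lower()
    return macs
```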
What triggered this?
trees reopened 6:31am
Thanks guys for all your help!

I'll raise a separate bug about a longer-term solution (for the temp adjustment in comment 6).
(In reply to Pete Moore [:pete][:pmoore] from comment #9)
> Thanks guys for all your help!
> 
> I'll raise a separate bug about a longer-term solution (for the temp
> adjustment in comment 6).

Created bug 988306.
Postmortem (all times EDT)

Background:

In order to migrate all of the panda infrastructure to scl3, all of the inventory dns/dhcp records need to change from the old, deprecated format to the new sreg format. This step needs to take place before we can do the prep work to create new address records for them in the new vlans.

We have done this for thousands of records with no issues in the past, and this was considered an extremely safe operation, since any changes made to dns/dhcp would require a manual check and intervention to push after inventory was updated.

Timeline:

At 0723 EDT, I started making the modifications to translate from old to new format. I performed the modification on a select number of machines and checked with nagios just to be sure.

At 0735 EDT (enough time for nagios alerts to fire if something was broken), no errors had been reported, and I proceeded to make the changes to the rest of the panda infrastructure.

At 0800 EDT pmoore|buildduty indicated that there was a problem with the panda infrastructure on #infra. Checking nagios, the mozpool health check was now alerting for each panda-relay.  

At 0822 EDT Tomcat|sheriffduty closed the trees while we tried to determine whether things were actually broken or only the health checks were.

At 0832 Tomcat|sheriffduty, pmoore|buildduty, dustin, RyanVM, and I joined a vidyo call to discuss the current state and diagnose the issue. We left the call with action items to find a hostname->mac data source that we could use as a stopgap measure. Investigation revealed a cron job that syncs mac information directly from inventory, which was now failing because the data had changed formats (it was all still in the database, just not accessible via the same API call as before). Since the mac information was no longer accessible to mozpool, it had deleted all of the pandas from its internal database.

After we discussed a few potential fixes (rolling back the database, which was not possible, or using the allizom database, which risked introducing other issues by relying on dev data), Dustin devised a stopgap measure: pull the mac information from dhcp (which was still functioning properly, as expected) into a flat text file for mozpool to read from. He also disabled the cron job that does the automatic sync from inventory.
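The bug doesn't include the actual extraction script. As an illustration only, host→MAC pairs can be scraped from ISC dhcpd `host` stanzas with something like the following sketch (the regex and function names are mine, not from the incident):

```python
import re

# Matches ISC dhcpd "host" stanzas of the form:
#   host panda-0705 { hardware ethernet AA:BB:CC:00:07:05; ... }
HOST_RE = re.compile(
    r"host\s+(\S+)\s*\{[^}]*hardware\s+ethernet\s+([0-9a-fA-F:]+)\s*;",
    re.S,
)

def hosts_from_dhcp(conf_text):
    # Return a hostname -> lowercase MAC mapping from dhcpd config text.
    return {name: mac.lower() for name, mac in HOST_RE.findall(conf_text)}
```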

At 0841 EDT Tomcat|sheriffduty opened bug 988281 to track this issue while I went to look for a data source and dustin started looking at the mozpool code.

At 0908 EDT, the temporary mapping file was in place and the mozpool database was reconstructed with the 872 records that had been deleted. The pandas all started pxe-booting and self testing before reporting ready state.

At 0925 EDT we deemed that there were enough pandas in the ready state and trees were reopened at 0928 EDT.

At 0933 Tomcat|sheriffduty verified that the first 66 retriggers were done and that pandas were successfully picking up work.

Further work required:

dustin will work with uberj to modify the mozpool software so that it can read sreg records from inventory using the new API, patch the code, and reenable the sync cron job.


What we did wrong:

* Though I have made this type of change to thousands of nodes before, this is the first time I have done so with the pandas. I should have coordinated with the mozpool authors to make sure that there were no unplanned interactions with inventory.
* Because the change was made in the morning, EDT, not all west coast resources were available to help debug/fix.
* There is not a good way to roll back such changes in inventory.
* We didn't have any mechanism in the code to protect against mass inventory database changes which resulted in all of the records being deleted.
* We could not find documentation for the current inventory API

What we did right:
* There are protections for dhcp/dns to prevent data loss due to mass inventory database changes (no data would have been lost due to this, since all changes for dns/dhcp would have been successful).
* We were able to communicate effectively and quickly through #infra and vidyo to determine the cause and come up with several potential stopgap solutions.
* Because the data was still correct and just in a different format, we were quickly able to create a temporary static mapping and get things back online.
* The mozpool infrastructure (booting and verifying pandas) worked very well under load and was able to process all 800+ pandas in just a few minutes.


Future preventative measures:

* arr opened bug 988321 to modify mozpool so that large numbers of changes require manual intervention, much like we do with dns/dhcp
* arr opened bug 988322 to have the mana documentation for inventory updated.
* arr will coordinate future panda changes with the mozpool folks and notify buildduty of changes as well as infra oncall.
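The guard proposed in bug 988321 might look something like this sketch: refuse an automatic sync that would delete more than a small fraction of known devices. The threshold, names, and interface here are hypothetical, not the eventual mozpool implementation.

```python
def guard_sync(current_hosts, incoming_hosts, max_delete_fraction=0.05):
    # Refuse an automatic sync that would delete a large fraction of the
    # known devices; such changes should require manual sign-off, much
    # like the existing dns/dhcp protections. Threshold is illustrative.
    if not current_hosts:
        return True
    deleted = set(current_hosts) - set(incoming_hosts)
    if len(deleted) / len(current_hosts) > max_delete_fraction:
        raise RuntimeError(
            "sync would delete %d of %d devices; refusing without "
            "manual confirmation" % (len(deleted), len(current_hosts)))
    return True
```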
One note I forgot: the sync cron job runs at :15 and :45 after the hour, so that's why no errors were reported until 0800 EDT.
(In reply to Amy Rich [:arich] [:arr] from comment #11)
> Future preventative measures:
...
> * arr will coordinate future panda changes with the mozpool folks and notify
> buildduty of changes as well as infra oncall.

Can you add sheriffs@m.o and/or #releng to notification as well, please?
Remaining work seems to be defined in:
  bug 988306
  bug 988321
  bug 988322

All trees reopened, RFO documented in comment 8 (thanks :arr), let's close.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard