Closed Bug 989492: Opened 11 years ago, Closed 11 years ago

tool to compare different sources of slave and master data

Categories

(Release Engineering :: General, defect)

Severity: normal

Status

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(6 files, 4 obsolete files)

Attached patch tool to compare stuff (obsolete) — Splinter Review
We've got a bunch of different places containing slave and master data, and they often get out of sync. Having something that compares and reports differences would help us to keep things in line. We have data in at least the following places:
* AWS (source of truth for AWS machines)
* Inventory System list (source of truth for hardware machines)
* Inventory DNS/SREG
* Slavealloc
* Buildbot-configs

Here's a first stab at something that compares the AWS instance lists + the inventory system list against inventory DNS/SREG. I'm focused on masters and slaves here (for now, at least), so it has a ton of excludes to ignore other things.

I'm not really sure where the best place for this is, so it's in tools for now. I almost want a new repo for it. Suggestions welcome.
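
For the record, the core of the comparison is just set arithmetic over per-source machine name lists; a minimal sketch (the source layout and exclude patterns here are illustrative assumptions, not what the patch actually uses):

    # Sketch: collect a set of machine names per source, drop excluded
    # names, then report what each source is missing relative to the
    # others. EXCLUDES is an illustrative assumption.
    import re

    EXCLUDES = [re.compile(p) for p in (r"^signing", r"^admin")]

    def clean(names):
        # Drop names matching any exclude pattern.
        return set(n for n in names
                   if not any(p.search(n) for p in EXCLUDES))

    def compare(sources):
        # sources: dict mapping source name -> set of machine names
        cleaned = dict((k, clean(v)) for k, v in sources.items())
        union = set().union(*cleaned.values())
        # Map each source to the names other sources have but it lacks.
        return dict((k, union - v) for k, v in cleaned.items())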

There's more to do still, including:
* compare against buildbot-configs
* ||ize (parallelize it)
Attachment #8398742 - Flags: feedback?(catlee)
Attached file sample report
This report was run before I finalized the exclude list, so it has things about signing and some other machines that are now excluded. It's caught a bunch of interesting things so far including:
* A few machines that exist in a SoT but not slavealloc (e.g., panda-0882, tst-linux64-ec2-390)
* Masters in slavealloc whose FQDN is an IP address (may or may not be valid)
* Old EC2 machine names in slavealloc that should probably be removed
* Some weird DNS inconsistencies (e.g., bld-linux64-spot-389, servo machines)

There's also a ton of complaints about spot instances that are probably false positives. I haven't dug into those much yet.
Attachment #8398751 - Flags: feedback?(catlee)
Some enhancements in this version:
* Use network interfaces to find spot instances. Rail tells me this is the right way to find the list of possible spot instances (see the sketch after this list).
* Generate list of machines in Buildbot. This part of the script sucks because of our configs. I'm happy to change it if someone has a better idea.
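
For reference, here's a minimal sketch of the network-interface approach against the boto 2 API; the "FQDN" tag name is an assumption, not necessarily what the patch uses:

    # Sketch: enumerate *possible* spot instances via their pre-created
    # network interfaces rather than via running instances (boto 2 API).
    import boto.ec2

    def possible_spot_fqdns(region):
        conn = boto.ec2.connect_to_region(region)
        fqdns = set()
        for nic in conn.get_all_network_interfaces():
            # Interfaces exist even when no spot instance is currently
            # running on them, so this covers stopped/pending slaves
            # too. The "FQDN" tag name is an assumption.
            fqdn = nic.tags.get("FQDN")
            if fqdn:
                fqdns.add(fqdn)
        return fqdns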

This script now depends on a million packages too, mostly because cloudtools and buildbot depend on a ton of stuff. It also depends on invtool (by way of cloudtools), which doesn't seem to be installable unless you're root. This might make deploying it a little tough.
Attachment #8398742 - Attachment is obsolete: true
Attachment #8398742 - Flags: feedback?(catlee)
Attachment #8399584 - Flags: feedback?(catlee)
Attachment #8399584 - Flags: feedback?(rail)
Attachment #8399584 - Flags: feedback?(rail) → feedback?(rail)
Whoops, got the wrong Rail...
Comment on attachment 8399584 [details] [diff] [review]
fix spot instances; compare against buildbot

Review of attachment 8399584 [details] [diff] [review]:
-----------------------------------------------------------------

Overall it looks great. I have two concerns unrelated to the code:

1) boto may not return all objects; see https://github.com/boto/boto/pull/2189
2) the part responsible for the spot DNS check may go away once we switch to the puppetless approach.
Attachment #8399584 - Flags: feedback?(rail) → feedback+
Comment on attachment 8398751 [details]
sample report

For the "machines missing from AWS or inventory" section, it would be nice to know where the machines ARE defined. e.g. are they coming from slavealloc, or buildbot-configs, or ???
Attachment #8398751 - Flags: feedback?(catlee) → feedback+
(In reply to Chris AtLee [:catlee] from comment #6)
> Comment on attachment 8398751 [details]
> sample report
> 
> For the "machines missing from AWS or inventory" section, it would be nice
> to know where the machines ARE defined. e.g. are they coming from
> slavealloc, or buildbot-configs, or ???

Ah, I think I addressed this in my updated patch already (but I didn't attach an updated report). The new report has:
        report.write("Machines in AWS/Inventory but not in Slavealloc:\n")
        report.write("Machines in AWS/Inventory but not in Buildbot configs:\n")
        report.write("Machines in Slavealloc but not in AWS or inventory:\n")
        report.write("Machines in Buildbot configs but not in AWS or inventory:\n")

And for each type of dns record:
            report.write("Machines with errors in their %s DNS records:\n" % type_)
Attached file updated sample report
Depends on: 991056
No changes to the actual report here, just replacing the command line interface with something compatible with reportor. I wasn't quite sure how to test it, but I was able to run it from the command line after setting REPORTOR_CREDS. Is there more testing I can/should do?

I also got rid of most of the dependencies by copying in the parts of cloudtools that I need (crappy) and fixing bug 991056. The remaining few I added to setup.py.

I'm not 100% sure if I'm done working on the report yet, but I think the best way forward is to get it running, fix up all of the obvious things it whines about, and see where we stand.
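
For what it's worth, the reportor-facing change boils down to reading credentials from the environment instead of taking command-line flags; a hedged sketch (the structure of the credentials file is an assumption):

    # Sketch: reportor points REPORTOR_CREDS at a credentials file.
    # The JSON structure of that file is an assumption here.
    import json
    import os

    def load_creds():
        with open(os.environ["REPORTOR_CREDS"]) as f:
            return json.load(f)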
Attachment #8398743 - Attachment is obsolete: true
Attachment #8399584 - Attachment is obsolete: true
Attachment #8399584 - Flags: feedback?(catlee)
Attachment #8400724 - Flags: review?(catlee)
Attachment #8400724 - Flags: review?(catlee) → review+
Comment on attachment 8400724 [details] [diff] [review]
run the report in reportor

Checked this in, tested it on cruncher, and now I've deployed it to the "production" reportor spot.
Attachment #8400724 - Flags: checked-in+
Whoops, I forgot to install the newly required deps (boto, etc.). Did that now; we should have a report out tomorrow.
The report is running now, but it's a tad noisy because check_call's output isn't being suppressed. This should shut it up.
Attachment #8403506 - Flags: review?(catlee)
This patch also gets us ignoring a few more things:
* Servo (because its set-up is pretty static, and it's a PITA to get its slavelist)
* All dev machines (because loaners aren't useful to this report, and non-loaners come in and out of existence).
* aws-manager (not a master or slave)
* buildbot-master81 (not in slavealloc)
Attachment #8403506 - Attachment is obsolete: true
Attachment #8403506 - Flags: review?(catlee)
Attachment #8404163 - Flags: review?(catlee)
I went through and fixed up most of the machines that were in slavealloc but not other places. I also fixed most of the invalid dns. bug 994267 is fixing up buildbot-configs to remove dead machines.
Comment on attachment 8404163 [details] [diff] [review]
even more quieter

Review of attachment 8404163 [details] [diff] [review]:
-----------------------------------------------------------------

::: reports/machine_sanity/machine_sanity.py
@@ +140,5 @@
> +        null = open(devnull, 'w')
> +        try:
> +            check_call(["hg", "clone", buildbot_configs, bbdir], stdout=null)
> +        finally:
> +            null.close()

a bit cleaner as

with open(devnull, 'w') as null:
    check_call(["hg", "clone", buildbot_configs, bbdir], stdout=null)

3 fewer lines! SAVE THE NEWLINES!!!!
Attachment #8404163 - Flags: review?(catlee) → review+
Comment on attachment 8404163 [details] [diff] [review]
even more quieter

Landed with the suggested change.
Attachment #8404163 - Flags: checked-in+
Adding the report's location on reportor as the bug URL. This is awesome!
Attached patch more ignores — Splinter Review
Watch for skip patterns in slavealloc/buildbot names, not just inventory/AWS.
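
A minimal sketch of the change's shape (the pattern list is illustrative):

    # Sketch: run every source's names through the same skip patterns,
    # not just the inventory/AWS lists. The patterns are illustrative.
    import re

    SKIP = [re.compile(p) for p in (r"^servo", r"^dev-", r"^aws-manager")]

    def without_skipped(names):
        return set(n for n in names
                   if not any(p.search(n) for p in SKIP))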
Attachment #8406136 - Flags: review?(catlee)
Attachment #8406136 - Flags: review?(catlee) → review+
Attachment #8406136 - Flags: checked-in+
Per IRC, this patch provides a JSON file with all of the valid slaves listed in it. Also, it turns out I broke stuff in my last patch; this fixes that too.
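
Presumably the JSON artifact is just the cleaned-up slave set dumped out; a minimal sketch ("usable_slaves" is an assumed variable name):

    # Sketch: publish the usable-slave set as a sorted JSON list.
    import json

    with open("usable_slaves.json", "w") as f:
        json.dump(sorted(usable_slaves), f, indent=2)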
Attachment #8408412 - Flags: review?(catlee)
Attachment #8408412 - Flags: review?(catlee) → review+
Comment on attachment 8408412 [details] [diff] [review]
provide listing of all valid slaves

Landed and updated cruncher.
Attachment #8408412 - Flags: checked-in+
With the latest patch checked in, we've now got a list of all of the usable slaves:
https://secure.pub.build.mozilla.org/builddata/reports/reportor/daily/machine_sanity/usable_slaves.json
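
A hypothetical consumer of that list (the URL is from above; the checked name is just an example):

    # Hypothetical consumer: fetch the published list, check a name.
    import json
    import urllib2

    URL = ("https://secure.pub.build.mozilla.org/builddata/reports/"
           "reportor/daily/machine_sanity/usable_slaves.json")
    usable = set(json.load(urllib2.urlopen(URL)))
    if "tst-linux64-ec2-390" not in usable:
        raise SystemExit("not a usable slave")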

I think we're done here now?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General