Closed Bug 834812 Opened 11 years ago Closed 11 years ago

Track number of files in on-disk queue on sp-collectorXX nodes

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 841578

People

(Reporter: selenamarie, Assigned: ashish)

We should track the number of files that are queued for crash moving on collector nodes. 

These files are stored in a directory structure like: 

/home/socorro/primaryCrashStore/20120128/date/00/05/

We should graph the total number of *files* in this directory.

We're not sure yet what the alert threshold should be; my best guess for the moment is 10,000.
Blocks: 834403
Specifically, we should monitor the total # of files in all subdirs beneath date/.
Small nit: we'd technically be looking for the # of *symlinks*, not files. There are no regular files under /date/... only directories containing symlinks.


We need to be exceptionally careful about this. If whatever we do descends into primaryCrashStore/*/name/*, it will take *hours*.

In my testing so far, a naive "find ... ! -path '*/name/*'" is not sufficient to avoid this. If you can devise a "find" that is faster than the commands below, that'd be alright. :)
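For what it's worth, the likely reason the naive form is slow is that "! -path" only filters what find prints; find still walks into name/ before the test is applied. Using -prune stops the descent itself. Here is a minimal sketch, not tested against the real collector tree. The function name count_queued_symlinks is made up for illustration, and it counts symlinks (-type l) beneath date/ rather than listing directories:

```shell
# Hypothetical helper: count symlinks under ROOT/*/date/ without ever
# descending into ROOT/*/name/.
count_queued_symlinks() {
    root=${1%/}   # drop a trailing slash so the -path patterns line up
    # -prune stops find from entering name/; the right-hand side of -o
    # prints only symlinks that live somewhere beneath a date/ directory.
    find "$root" -path "$root/*/name" -prune -o \
        -path "$root/*/date/*" -type l -print | wc -l
}
```

It would be called as: count_queued_symlinks /home/socorro/primaryCrashStore. Whether this beats the shell loop on a node with a large backlog would need to be measured.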



Here's something that does work fairly well:

for i in /home/socorro/primaryCrashStore/*/date/*/*/*; do echo -n "$i: "; ls "$i" | wc -l; done

That prints out the number of files in each individual directory, which is great for figuring out which day has a backlog, but not ideal for monitoring. Here's a slightly revised command:

for i in /home/socorro/primaryCrashStore/*/date/*/*/*; do ls "$i"; done | wc -l

This runs very quickly (under 100 ms) when there aren't many files, and still runs acceptably even when there are far too many (~500k takes a few seconds).


At present, here's what this command would return on each of the 6 nodes:

sp-collector01.phx1: 716
sp-collector02.phx1: 737
sp-collector03.phx1: 720
sp-collector04.phx1: 732
sp-collector05.phx1: 255890
sp-collector06.phx1: 743

Obviously, 05 has a problem (we're working on processing a backlog).

This value does grow slightly over time, and will routinely fluctuate by 100-200. If there is a problem, it should spike *rapidly*.

An alert threshold of 10,000 seems like a good starting point.
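To make the threshold concrete, here is a rough sketch of how the count could be wrapped for monitoring. The function names and the Nagios-style convention (exit 0 for OK, 2 for critical) are assumptions, not something decided in this bug. The glob mirrors the loop above, except that only directories are listed:

```shell
# Hypothetical monitoring wrapper; names and exit codes are assumptions.

# count_queue ROOT: total entries in every ROOT/*/date/*/*/* directory.
count_queue() {
    for d in "$1"/*/date/*/*/*; do
        if [ -d "$d" ]; then
            ls "$d"
        fi
    done | wc -l
}

# check_queue ROOT THRESHOLD: print a status line, return 0 (OK) or 2 (critical).
check_queue() {
    count=$(count_queue "$1")
    count=$((count + 0))   # normalise any padding that wc may add
    if [ "$count" -ge "$2" ]; then
        echo "CRITICAL: $count entries queued (threshold $2)"
        return 2
    fi
    echo "OK: $count entries queued"
    return 0
}
```

With the numbers above, check_queue /home/socorro/primaryCrashStore 10000 would report OK on five of the nodes and CRITICAL on sp-collector05.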


Thanks!
Blocks: 836845
Assignee: server-ops → ashish
This has been implemented in Bug 841578
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → mozilla.org Graveyard