Closed Bug 598757 Opened 10 years ago Closed 6 years ago

Running out of space for symbols

Categories

(Release Engineering :: General, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: lsblakk, Assigned: ted)

References

(Blocks 1 open bug)

Details

(Whiteboard: [symbols])

Attachments

(1 file)

From Nagios:

dm-symbolpush01 disk - /mnt/netapp/breakpad DISK WARNING - free space: /mnt/netapp/breakpad 91633 MB (8% inode=94%)

This already happened back at the beginning of the year, in bug 540713
I already e-mailed ted about this (Sat, 18 Sep 2010 14:40:48) and he is working on it, afaik.
Good news - if there's already a bug tracking this feel free to dupe; otherwise let's use this one to track its resolution.
I have not filed another bug. I need to do some more investigation, but I think we may need to increase the amount of storage, since we're storing a lot more than we used to.
Assignee: nobody → ted.mielczarek
Depends on: 598928
Ok, so, the symbols_os dir isn't contributing appreciably:
4.0G	../symbols_os

I filed bug 598928 on getting the OpenSuSE symbols dir cleaned up correctly.

I've been looking into it, and I'll have a proposal today. I think the core issue here is just that we have way more branches than we used to, so we're using a lot more space.
Depends on: 587073
Depends on: 599343
Depends on: 599347
Depends on: 599361
Depends on: 599380
Okay, I reworked my old spreadsheet:
https://spreadsheets.google.com/ccc?key=0An_R0AMMILQEcEZDMUtIbU1UZUVJcndRWHhKZDVWSXc&hl=en

I filed all the low-hanging fruit I could find as bugs blocking this one. There are two sheets in that document: the first one tries to estimate current usage, and the second one shows where we could be at if we fixed the low-hanging fruit.

My "current state" estimate is off by about 350GB, which is like 30%, which makes it not a fabulous estimate, I suppose. I can probably refine that, but it's difficult because every single product has slight differences. If my estimates are anywhere near correct, we can probably save a few hundred GB by fixing the low-hanging fruit, but we still might be cutting it close on disk usage. As you can see from the spreadsheet, we are supporting *a lot* of products+branches.
I gave this volume another 250g, hopefully by the time we get close to using that up, these dependent bugs will be fixed and we will be down to 850g.
Depends on: 599457
Aravind: is there anything else being stored on this volume other than the symbols_* dirs? I ran "du -sh" on each of the symbols_* dirs and it only totaled 659GB. Where's that other ~300GB?
(In reply to comment #7)
> Aravind: is there anything else being stored on this volume other than the
> symbols_* dirs? I ran "du -sh" on each of the symbols_* dirs and it only
> totaled 659GB. Where's that other ~300GB?

Yes, this share also contains some old breakpad dumps. It's not being actively used, but it contains some historical dumps probably totaling about 300GB.  I am running a du on those trees and will report back once I know for sure.

Note: these trees are not being used and are static.
(In reply to comment #8)

> Yes, this share also contains some old breakpad dumps, its not being actively
> used, but it contains some historical dumps probably totaling about 300GB.  I
> am running a du on those trees and will report back once I know for sure.
> 

The du is done, those static directories come up to 255 GB.
Ok, that makes a lot more sense!
Duplicate of this bug: 589795
Ted, Lukas: is this still an issue?
Priority: -- → P3
Whiteboard: [symbols]
The deps on this still ought to be fixed. We mitigated this a bit by fixing some of them, and bumping up the storage space, but we really should fix the other deps here.
Duplicate of this bug: 743495
This started to page again.
I was going to say "the sjc1 mount will go away in a week", but things are getting a little tight in phx1, too:

10.8.74.240:/vol/pio_symbols
                      1.9T  1.7T  264G  87% /mnt/netapp/breakpad
There are still a few deps here that could buy us some time, but odds are we're just going to have to increase the storage space. We keep adding new project branches and platforms, and the switch to rapid release added a lot more builds to keep track of, so we increased our storage requirements by quite a bit.
So let's resize the phx1 share and ignore the warnings in sjc1.  Think of it as enticement to leave there :)

This is 2TB in sjc1 now (and 1.9 in scl3..), with 1.8TB used.  Let's pin the current usage at 75%, which means the share should be 2.4TB.

That's on 10.8.74.240:/vol/pio_symbols

Storage folks, is that doable?
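The 2.4TB figure above follows from dividing current usage by the target utilization. A minimal sketch of that arithmetic, where the function name `required_size_tb` is made up for illustration:

```python
def required_size_tb(used_tb, target_fraction):
    """Share size needed so that used_tb occupies target_fraction of it."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return used_tb / target_fraction


# 1.8TB used, pinned at 75% utilization
print(round(required_size_tb(1.8, 0.75), 2))  # 2.4
```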
Assignee: ted.mielczarek → server-ops
Component: Release Engineering → Server Operations: Storage
QA Contact: release → dparsons
There is no more free space to allocate on that controller. The aggregate is 96% full.
Assignee: server-ops → nobody
Component: Server Operations: Storage → Release Engineering
QA Contact: dparsons → release
Dan, just to verify, the *phx1* aggregate is full?  I only ask because the bug was initially about sjc1, so there's room for confusion.

Should we try to move symbols to scl3?  Or buy a storage blade in phx1?
Going forward we only need symbol storage in PHX. Once we've finished migrating the symbol server (bug 688250) and the consumers of dm-symbolpush01 (bug 688186), we can get rid of the symbol store in SJC1 and not replace it.
Dan's confirmed in email that phx1 is indeed space-constrained.  More is being quoted, so let's muddle along here and check in in a few months to see what the space situation looks like.
Duplicate of this bug: 758778
Things look particularly tight right now.  Is there any additional deletion that can occur?

The new phx1 storage is not yet in place.
The new phx1 storage hasn't even been ordered yet. I hope it will be soon, but even if it was ordered today, we're looking at at least a month before it's online. We need to do something much sooner.
I don't know that there are any easy wins here at the moment. Have we verified that the symbol cleanup scripts are running on all the various symbols_* directories in this mount? (With the exception of symbols_os.)

I know we had some fiddly issues with index.txt files syncing from SJC->PHX. The cleanup scripts rely on those files. Did we accidentally break something in the move?
Duplicate of this bug: 758866
Step 1, don't get confused by the copy of the old sjc1 symbols we now have in scl3, which is mounted on stage.m.o.  Look on symbols1.dmz.phx1.mozilla.com instead.

Step 2, look at how many manifests we have for nightly builds - there should be a maximum of 30 for each platform on a branch. In the attachment I'm trying to get linux32 mozilla-central builds, excluding all the other branches and 64-bit, and come up with 47 manifests. It's a similar story for mac (51). 

It'll be the naming change between May 16 and May 17 causing the problem, which is a regression from http://hg.mozilla.org/mozilla-central/rev/a0cca6997af4 (bug 753132).

We can probably rename our way to victory, but need to do it very carefully because the naming scheme is a bit funky.
Changes today
* got down to 20G free, so I started looking into renaming some manifests on a less important branch like ux
* wrote a script to do the renames, ran it for ux, then ran the cleanup script at /mnt/netapp/breakpad/cleanup-breakpad-symbols.py against symbols_ffx
* did the same for mozilla-inbound, which was expected to free up ~20G
* talked to lerxst on IRC about not seeing the free space increase, and he's deleted 62G of snapshots (6 hourly + 2 daily), and disabled further snapshots
* we are now at 
[177] symbols1.dmz.phx1:disk - /mnt/netapp/breakpad is WARNING: DISK WARNING - free space: /mnt/netapp/breakpad 82501 MB (4% inode=91%)

I'll keep working on other branches to get us some more headroom.
Standard8, jhopkins - can we nuke symbols_tbrd-test ? Were there any accidents that put actual release symbols into this test directory during the infra transition ? There are some 10.0.2 symbols from April 27, on rev 7d395fbcb557, which looks like a staging release because it doesn't have any tags on it. Also some nightlies which I doubt we need to keep. 

We'd free up 50G if this can go.
(In reply to Nick Thomas [:nthomas] from comment #30)
> Standard8, jhopkins - can we nuke symbols_tbrd-test ? Were there any
> accidents that put actual release symbols into this test directory during
> the infra transition ? There are some 10.0.2 symbols from April 27, on rev
> 7d395fbcb557, which looks like a staging release because it doesn't have any
> tags on it. Also some nightlies which I doubt we need to keep. 
> 
> We'd free up 50G if this can go.

No objections to the nightly -test builds - we didn't publicise those (they could also be removed from ftp as well...). You might want to check with jhopkins about anything release like, as I don't know what went on there.
(In reply to Mark Banner (:standard8) from comment #31)
Ok, I'll wait to hear then.

In other news, I've fixed up the manifest naming in symbols_ffx and symbols_tbrd, and after the cleanup script we have 180G free and should maintain that better. That leaves out branches like {mozilla,comm}-aurora which haven't merged the naming change yet, so I'll need to revisit them after merge.

Also haven't looked at Seamonkey or any of the other apps, since I don't have permissions to make changes. Perhaps we can just let them clean up naturally after 90 days.
<jhopkins> nthomas: IIRC we had all the test code removed before doing our beta builds, so i don't see why we'd need the test symbols

Deleted, 200G free now. dustin, could you please remove 
  symbols1.dmz.phx1:/mnt/netapp/breakpad/symbols_tbrd-test
symbols_xr done too, only 10G more space from that.

HOWTO:
For {mozilla,comm}-{central,aurora} nightlies you'd do the likes of this:
 python /home/ffxbld/bug598757/rename-trunk.py /mnt/netapp/breakpad/symbols_ffx
 python /mnt/netapp/breakpad/cleanup-breakpad-symbols.py \
    /mnt/netapp/breakpad/symbols_ffx

For Firefox branches which are peers of mozilla-central, like profiling, fx-team etc, do this:
 python /home/ffxbld/bug598757/rename-peer.py \
    /mnt/netapp/breakpad/symbols_ffx profiling
 python /mnt/netapp/breakpad/cleanup-breakpad-symbols.py \
    /mnt/netapp/breakpad/symbols_ffx
Thanks for the detective work, Nick! I should have known better...

I've fiddled with this so many times that I just get depressed every time I think about it.
(In reply to Nick Thomas [:nthomas] from comment #33)
> Deleted, 200G free now. dustin, could you please remove 
>   symbols1.dmz.phx1:/mnt/netapp/breakpad/symbols_tbrd-test

done
Turns out this is going to be a big problem.

Rapid betas are going to cause a huge spike in the amount of storage needed, and we're already really borderline.  We need to get more space - ideally we need to double what we have from 2TB to 4TB.  Should I file a separate bug for that?  Which component should I file it in?

Rapid betas start on July 17, so time is super short.
Severity: normal → critical
There simply is not enough disk space in phx1 to do this. The most I could give you is another 200GB. We ordered additional capacity, but it will probably be 30 to 90 days before it is online.
I understand from akeybl that the rapid beta date has literally just been pushed back to 8/28 (woohoo!) which gives us some extra breathing room for the new storage to come online.  We'll prune aggressively and try and stay inside the 200GB extra.  ted is going to do some more analysis and see what he can cut.
OK great. I just talked to :dmoore and we might be able to get extra capacity online in the beginning of July. Note that this will be a totally separate system, so we'll have to migrate the existing volume in order to give it more space.
So I think the biggest remaining problem here is that we don't currently clean up beta builds, and with the switch to rapid release we have a *lot* of beta builds.

Rapid betas would just make this already bad problem really really bad.
Severity: critical → normal
Severity: normal → critical
I just gave it +200GB. Just so everyone's clear, after this space is used up, there is literally nothing else I can do to give you more space on that volume.
Just so I don't drop this on the floor, a few numbers I've been poking at:
[tmielczarek@symbols1.dmz.phx1 symbols_ffx]$ du -sh .
796G    .

18G     symbols_fedora

110G        symbols_ubuntu/

I'm looking into lowering the 90 day cleanup time down to 45 days, which will buy us ~90GB of space just from the Firefox+Thunderbird directories alone.
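To get a feel for what a 90 → 45 day change would sweep up, here is a rough sketch that lists manifests older than a given age. It is not the real cleanup-breakpad-symbols.py logic: the `-symbols.txt` suffix, the use of file mtime as the build age, and the function name `stale_manifests` are all assumptions made for illustration.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_manifests(root, max_age_days, now=None, suffix="-symbols.txt"):
    """Return manifest paths directly under root whose mtime is older than max_age_days.

    The suffix and the mtime-based age are assumptions, not the real script's rules.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) < cutoff:
            stale.append(path)
    return stale
```

Running it once with 90 and once with 45 against a symbols_* directory, and diffing the two lists, would show which manifests only the shorter window catches. It only reports; nothing is deleted.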
4.3G      symbols_camino
54G /mnt/netapp/breakpad/symbols_opensuse
There are some orphans too. Looking in the mac directory symbols_ffx/XUL, there are 188 directories that aren't mentioned in the manifests, using 42G of space. There is a big range of ages, some as recent as 2012-04 others back to 2007-09. Also 3642 empty dirs making ls operations slow (but only wasting 14MB).
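A sketch of how such an orphan check could be done, assuming each manifest is a text file listing one relative path per line in the form `module/debug_id/file` (for example `XUL/<id>/XUL.sym`). That layout and the helper names `referenced_dirs` and `find_orphans` are assumptions for illustration, not taken from the cleanup script.

```python
import os
import posixpath


def referenced_dirs(manifest_paths):
    """Collect the 'module/debug_id' directory of every path listed in the manifests."""
    dirs = set()
    for manifest in manifest_paths:
        with open(manifest) as fh:
            for line in fh:
                rel = line.strip()
                if rel:
                    dirs.add(posixpath.dirname(rel))
    return dirs


def find_orphans(store_root, module, manifest_paths):
    """Return (orphaned, empty) lists of subdirectory names under store_root/module.

    orphaned: non-empty directories that no manifest mentions.
    empty: directories with no entries at all.
    """
    keep = referenced_dirs(manifest_paths)
    orphaned, empty = [], []
    module_dir = os.path.join(store_root, module)
    for entry in sorted(os.listdir(module_dir)):
        path = os.path.join(module_dir, entry)
        if not os.path.isdir(path):
            continue
        if not os.listdir(path):
            empty.append(entry)
        elif posixpath.join(module, entry) not in keep:
            orphaned.append(entry)
    return orphaned, empty
```

The function only reports candidates. Given the wide range of ages seen here, anything it flags would still want a manual look before deletion.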
461G    symbols_tbrd
Depends on: 764671
28G /mnt/netapp/breakpad/symbols_os/
Longer term fix: bug 684251
See Also: → 684251
No longer depends on: 599343
Depends on: 599343
No longer depends on: 599343
Depends on: 765236
Assignee: nobody → ted.mielczarek
I think there are only two easy fixes we could do in the short term:
1) Clean up the swath of old beta/RC builds that we've accumulated in the switch to rapid release. (+ the few misnamed ESR nightlies)
2) Per comment 46, clean up orphaned symbol files.

Disk usage seems to be okay for the moment:
Filesystem            Size  Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
                      2.2T  1.8T  351G  84% /mnt/netapp/breakpad

I'd actually expect it to drop somewhat over the next 30+ days as the fix for bug 587073 rolls out.
This is alerting 95% full. 

10.8.74.240:/vol/pio_symbols
                      2.0T  1.8T  117G  95% /mnt/netapp/breakpad
I'm showing 266GB free now... is this still a problem?
This alerted again, so I think the answer is that it is an intermittent problem and it'd be good to have more headroom on this filesystem.  I ran the cleanup script /mnt/netapp/breakpad/cleanup-breakpad-symbols.sh and it got down to 94% full which Nagios was happy with but more disk space or less data would be good here.

Lowering priority to normal to reflect the situation better.
Severity: critical → normal
I gave the volume another 100GB, it's at 89% full now.

What needs to happen in order to mark this bug as R/F?
Duplicate of this bug: 818885
By moving some unrelated volumes off this filer, we've quietly nursed our way to absorbing another 800g of symbols in the last 4 months, but we're at the point of diminishing returns and can't keep up that rate.

So, this is a friendly early warning that we're going to be showing another episode of "Dude, Where's my Diskspace Cleanup?" on the symbols volume in the near future.
Exciting. I will make time to do some analysis next week and see if there's anything stupid going on. I would hazard a guess that it's just growth, with new platforms and B2G and everything, but sometimes there are silly things that slip in.
(We are also working on a design for storing a large portion of this data in Postgres, which should greatly reduce our storage requirements.)
Depends on: 845850
I suspect I need to buckle down and fix bug 599347. I think the rapid release cycle is mostly what's screwing us here, especially since we pile up beta builds that aren't currently cleaned up.
Duplicate of this bug: 849494
This is lighting off alarms for our SREs (bug 849494).
If we're not going to have an automated solution RSN, we need at least some manual cleanup for a limpalong.
Now at 95% and still alerting.

<nagios-phx1:#sysadmins> Mon 05:39:18 PDT [191] 
  symbols1.dmz.phx1.mozilla.com:NFS Mounts - /mnt/netapp/breakpad is WARNING: 
  DISK WARNING - free space: /mnt/netapp/breakpad 173706 MB (5% inode=91%): 
  (http://m.allizom.org/NFS+Mounts+-+/mnt/netapp/breakpad)

10.8.74.240:/vol/pio_symbols
                     3145728000 2969250656 176477344  95% /mnt/netapp/breakpad
Okay, I started working on a change to the cleanup script to let us clean out some old data (as a one-off first, and then I'll figure out how to make it a regular part of cleanup). I think the easiest thing to do first is clean out old XULRunner release symbols. I don't think we're actually using them for anything, and there's almost 500GB of XR symbols total. I'll follow up on IRC or in the bug today when I have it ready.
Following up with pir on IRC. Looks like we should be able to reclaim ~322GB by removing old XULRunner release symbols.
[root@symbols1.dmz.phx1 breakpad]# df -h /mnt/netapp/breakpad
Filesystem            Size  Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
                      3.0T  2.8T  166G  95% /mnt/netapp/breakpad

[root@symbols1.dmz.phx1 pradcliffe]# cd /mnt/netapp/breakpad/symbols_xr
[root@symbols1.dmz.phx1 symbols_xr]# ~tmielczarek/cleanup-breakpad-symbols.py -r /mnt/netapp/breakpad/symbols_xr *-1.9* *-4.0* *-5.0* *-6.0* *-7.0* *-8.0* *-9.0* *-10.0* *-11.0* *-12.0* *-13.0* *-14.0* *-15.0* *-16.0* *-17.0* *-18.0*
[1/4] Reading symbol index files...
[2/4] Looking for symbols to delete...
[3/4] Deleting symbols...
[4/4] Pruning empty directories...
[root@symbols1.dmz.phx1 symbols_xr]# df -h /mnt/netapp/breakpad
Filesystem            Size  Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
                      3.0T  2.6T  388G  88% /mnt/netapp/breakpad
Depends on: 849808
[root@symbols1.dmz.phx1 symbols_sea]# cd /mnt/netapp/breakpad/symbols_sea; ~tmielczarek/cleanup-breakpad-symbols.py -r /mnt/netapp/breakpad/symbols_sea *-2.0* *-2.1-* *-2.1.* *-2.2* *-2.3* *-2.4* *-2.5* *-2.6* *-2.7* *-2.8* *-2.9* *-2.10* *-2.11* *-2.12* *-2.13* *-2.14*
[1/4] Reading symbol index files...
[2/4] Looking for symbols to delete...
[3/4] Deleting symbols...
df -h /mnt/netapp/breakpad
[4/4] Pruning empty directories...
[root@symbols1.dmz.phx1 symbols_sea]# df -h /mnt/netapp/breakpad
Filesystem            Size  Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
                      3.0T  2.5T  511G  83% /mnt/netapp/breakpad
That should be enough breathing room for now. I'm going to leave this open until I fix bug 599347, which should be a longer-term fix. I also filed bug 849808 to figure out how to archive old release symbols.
I pushed the changes I made to the cleanup script that I wrote for pir to use above:
http://hg.mozilla.org/build/tools/diff/4362a609d8d8/buildfarm/breakpad/cleanup-breakpad-symbols.py
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #67)
> That should be enough breathing room for now. I'm going to leave this open
> until I fix bug 599347, which should be a longer-term fix. I also filed bug
> 849808 to figure out how to archive old release symbols.

Found in triage. 

Anything left to do here? If I read the last few comments correctly, this is now a tracking bug, and all remaining work is in depbugs?
Flags: needinfo?(ted)
Yeah. If you think closing this bug or annotating it separately is useful we can do that. We should probably get bugs like this out of RelEng components, but we don't have a useful component for them right now.

Also, bug 889691 moved the storage for this and made this less of a pressing issue.
Flags: needinfo?(ted)
Product: mozilla.org → Release Engineering
Duplicate of this bug: 951597
Might need to run the cleanup script again:

<#sysadmins>Wed 02:42:38 PST [1964] symbols1.dmz.phx1.mozilla.com:NFS Mounts - /mnt/netapp/breakpad is WARNING: DISK WARNING - free space: /mnt/netapp/breakpad 261897 MB (5% inode=89%):
Depends on: 951750
This is going to be irrelevant due to bug 1071724.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE