Closed Bug 402754 Opened 18 years ago Closed 15 years ago

dm-stage02 is periodically kernel panicking about NFS getting unmounted while in use

Categories

(mozilla.org Graveyard :: Server Operations: Projects, task, P2)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: justdave, Assigned: justdave)

References

Details

Attachments

(3 files)

Attached file panic log
This seems to happen once or twice a week at this point. Just frequently enough to be a bother.
We are running kernel 2.6.22.9-91 from Fedora Core 7 with the UnionFS 2.1.6 patches applied. I just grabbed the 2.6.23.1-21 SRPM and the UnionFS 2.1.8 patch; there are a bunch of bugfixes, so we'll see if it helps any.
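For anyone repeating this later, here's a minimal sketch of the rebuild flow (the actual commands aren't in this bug). It assumes the FC7-era /usr/src/redhat rpmbuild tree and uses illustrative SRPM/patch/spec filenames; the unionfs patch still has to be wired into the kernel spec by hand.

```python
#!/usr/bin/env python
"""Sketch of the patched-kernel rebuild flow. Assumptions: the FC7-era
/usr/src/redhat rpmbuild tree, illustrative SRPM/patch/spec filenames, and
that the PatchNNNN:/%patch lines are added to the spec by hand."""
import subprocess

SRPM = "kernel-2.6.23.1-21.fc7.src.rpm"       # illustrative filename
PATCH = "unionfs-2.1.8_for_2.6.23.1.diff"     # illustrative filename

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# Unpack the source RPM into the rpmbuild tree.
run(["rpm", "-ivh", SRPM])

# Drop the unionfs patch where the spec can reference it; the spec itself
# still needs a PatchNNNN: line and a %patch stanza added by hand.
run(["cp", PATCH, "/usr/src/redhat/SOURCES/"])

# Rebuild the binary kernel packages from the hand-edited spec.
run(["rpmbuild", "-bb", "/usr/src/redhat/SPECS/kernel-2.6.spec"])
```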
Both dm-stage01 and dm-stage02 are now running 2.6.23.1-21.unionfs2.1.8.fc7. Interestingly enough, both machines kernel panicked with the same NFS unmount error while shutting down from the previous kernel and had to be power cycled. Here's hoping the new kernel cleans this up. :)
Any update since we postponed the switchover last week? Did the new kernel help?
There have been no crashes since the new kernel was deployed. So far so good. :)
Priority: -- → P2
dm-stage02 died again this morning. No kernel panic this time; it just suddenly became non-responsive. The external symptoms were the same as before, but there was no panic on the console when checking the machine this time, and nothing in the logs indicates what happened. The machine required a power cycle to recover. It lasted from the 8th to the 23rd, though, which is really good compared to how long it had previously been lasting.
Attached file panic log
Just managed to panic dm-stage02 again. Here's the panic log.
We were nearing the completion of a load stress test when this happened this time. I rsynced in about 40 GB of stuff from another box, and watched how it dealt with it. Did great up until it panicked. At the point it panicked, it had just deleted about 25 GB of stuff off the snapshot layer, and was in the process of swapping the layers back to snapshot the previously read-write layer.
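The actual test harness isn't attached to this bug, but the cycle described above looks roughly like this sketch. All paths are hypothetical, and the branch-swap option string in the last step is an assumption; check the unionfs 2.x branch-management docs for the real remount syntax.

```python
#!/usr/bin/env python
"""Sketch of the load-test cycle described above; the real harness isn't in
the bug. All paths are hypothetical, and the branch-swap option string in
the last step is an assumption -- take the real remount syntax from the
unionfs 2.x branch-management documentation."""
import itertools
import subprocess

SOURCE = "otherbox:/data/"               # hypothetical rsync source (~40 GB)
UNION = "/mnt/union"                     # hypothetical unionfs mountpoint
LAYERS = ["/mnt/layerA", "/mnt/layerB"]  # hypothetical branch directories

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

for i in itertools.count():
    rw, ro = LAYERS[i % 2], LAYERS[(i + 1) % 2]

    # Pull a large tree in through the union; writes land on the rw branch.
    run(["rsync", "-a", "--delete", SOURCE, UNION + "/data/"])

    # Delete a big chunk directly off the snapshot (rw) branch, as in the
    # run that panicked.
    run(["rm", "-rf", rw + "/data"])

    # Swap the branches so the previously read-write one becomes the
    # read-only snapshot (hypothetical option syntax).
    run(["mount", "-o", "remount,dirs=%s=rw:%s=ro" % (ro, rw), UNION])
```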
There is a new version of unionfs out (there have been two, in fact, since the one we're using). I've built kernel-2.6.23.1-21.unionfs2.1.10.fc7 and am waiting for the current batch of virus scans to finish before rebooting into it. The changelog is at http://www.fsl.cs.sunysb.edu/pipermail/unionfs-cvs/2007-November/002907.html if anyone can make any sense out of it.
2.1.8 is what we were running previously (see comment 2)
I re-started the stress test at 8pm on the new kernel, we'll see if it passes this time.
Attached file panic log
It panicked again during this morning's load testing. On the bright side, it took a good couple dozen passes to trigger it this time (it triggered on the second pass with the previous version of unionfs). Unfortunately, the first half of the traceback looks almost the same, so the longer survival is probably just coincidence.
Dave, did you hear anything back from the unionfs devs ?
Yeah, I've been working with Erez Zadok on it; he just posted a new version today which he says contains a bunch of fixes that will hopefully help. He hasn't been able to crash it with a setup like ours on the prerelease of the new version, so we're hopeful. :) I just rebooted dm-stage01 and dm-stage02 into 2.6.23.8-34.unionfs2.1.11.fc7 and will be running the load tests on it again tonight.
This bug is now https://bugzilla.fsl.cs.sunysb.edu/show_bug.cgi?id=598 on the unionfs project's Bugzilla.
(In reply to comment #13)
> I just rebooted dm-stage01 and dm-stage02 into 2.6.23.8-34.unionfs2.1.11.fc7
> and will be running the load tests on it again tonight.

How did the tests go?
I've been through 4 different iterations of the kernel with different amounts of patching from the unionfs maintainer, with different levels of success (none 100% yet, but a couple were close). Just got another major update today which I'll be trying shortly. The bug I mentioned in comment 14 is now open to the public (it wasn't before - apparently accidentally).
(In reply to comment #16)
> [...] The bug I mentioned in comment 14 is now open to the public (it
> wasn't before - apparently accidentally).

Well, maybe it is "open to the public", but it uses a self-signed certificate and my trunk SeaMonkey really dislikes that.
(In reply to comment #17)
> Well, maybe it is "open to the public", but it uses a self-signed
> certificate and my trunk SeaMonkey really dislikes that.

Trunk Firefox has a link at the bottom of that dialog to add an exception. Didn't SeaMonkey inherit that?
(In reply to comment #18)
> (In reply to comment #17)
> > Well, maybe it is "open to the public", but it uses a self-signed
> > certificate and my trunk SeaMonkey really dislikes that.
>
> Trunk Firefox has a link at the bottom of that dialog to add an exception.
> Didn't SeaMonkey inherit that?

Yes, it did, and I used it to add the cert; I also unchecked "Add the certificate permanently". But the whole process was scary. Those MozSecurity guys sure did their job right this time.
Current iteration again made it much further than previous attempts, but crashed once again. Details at https://bugzilla.fsl.cs.sunysb.edu/show_bug.cgi?id=598#c11
So Erez managed to reproduce this finally; kinda came out of left field. Someone else posted to the unionfs mailing list that they were getting a kernel OOPS during shutdown when they had a unionfs mount with an NFS mount as one of the layers. The OOPS trace that they posted happened to match the one we've been getting semi-at-random. With some experimenting, turns out we could reliably reproduce the same oops here as well simply by rebooting, and it went away if I unmounted the unionfs before rebooting. It all traced back to insufficient locking on the superblocks of the underlying layer mounts, so that unionfs got massively confused if NFS went away out from under it. That still raises the question of why the NFS mount seems to be periodically unmounting and remounting itself... but anyway, I've got a new patch deployed as of Tuesday night that explicitly addresses this issue, and have had the load testing running since then, and it hasn't crashed yet. :)
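On the open question of the NFS mount flapping, something like this little watcher could timestamp each disappearance and reappearance of the NFS branch in /proc/mounts so it can be correlated with the oopses. A sketch only; the mountpoint path is an assumption.

```python
#!/usr/bin/env python
"""Watcher for the open question above: timestamp every disappearance or
reappearance of the NFS branch in /proc/mounts, so flapping can be
correlated with the oopses. The mountpoint path is an assumption."""
import time

MOUNTPOINT = "/mnt/nfs-layer"   # hypothetical NFS branch mountpoint

def nfs_mounted(path):
    with open("/proc/mounts") as f:
        for line in f:
            fields = line.split()
            # /proc/mounts fields: device, mountpoint, fstype, options, ...
            if fields[1] == path and fields[2].startswith("nfs"):
                return True
    return False

was_mounted = nfs_mounted(MOUNTPOINT)
while True:
    now_mounted = nfs_mounted(MOUNTPOINT)
    if now_mounted != was_mounted:
        stamp = time.strftime("%b %d %H:%M:%S")
        state = "mounted" if now_mounted else "UNMOUNTED"
        print("%s %s is now %s" % (stamp, MOUNTPOINT, state))
        was_mounted = now_mounted
    time.sleep(1)
```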
And about an hour after I posted that, it locked up. :( The tail end of the OOPS dump was visible on the console, but of course the important part had scrolled off. Nothing made it into the log this time. :(
(In reply to comment #22)
> And about an hour after I posted that, it locked up. :( The tail end of the
> OOPS dump was visible on the console, but of course the important part had
> scrolled off. Nothing made it into the log this time. :(

Looks like TTY33 consoles had their advantages, when not talking about deforestation ;-)
(In reply to comment #22)
> And about an hour after I posted that, it locked up. :(

Dave, so this new UnionFS survived load testing for only an hour? I don't know how long it lasted before; is that an improvement?
(In reply to comment #24)
> (In reply to comment #22)
> > And about an hour after I posted that, it locked up. :(
>
> Dave, so this new UnionFS survived load testing for only an hour? I don't
> know how long it lasted before; is that an improvement?

No, it lasted about 2 1/2 days, which is a very big improvement (it was previously dying within a few hours to half a day). The hour was in reference to when I posted the message describing what I'd done so far; it died an hour after that post.
i.e. I jinxed myself by telling the world it was doing so well :)
Ah, ok, I was confused by the comment-post-timestamps. Yes, 2 1/2 days is quite an improvement. Did I hear right that the developer now can reproduce the problem, or should we revisit getting them access to our boxes?
OK, so here's the current situation: I started running a scaled-back load test on Thursday last week. It crashed on Friday night, the same way it did the last time I tried it with the full load test. I rebooted it and left the load test running, just to see what would happen. I got an email from Erez on Saturday telling me he'd just posted a new release with a bunch of fixes in it that he thought would be applicable to our problem, including several fixes related to NFS branches and swapping branches on the fly (both of which are things we use a lot). The new version was deployed late Saturday evening and the load test was left running. As of this writing, it hasn't crashed yet.

Feb 19 20:48:02 dm-stage02 kernel: unionfs: new generation number 6896

I don't think that generation number has made it past 2500 before under load-testing conditions, so this looks darn promising. I think we'll try to let it run for a week before we declare it stable, though.
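For the record, that generation number can be pulled out of syslog mechanically. A rough sketch: the log path is an assumption, and the message format is taken from the line quoted above.

```python
#!/usr/bin/env python
"""Sketch: pull the latest unionfs generation number out of syslog. The log
path is an assumption; the message format comes from the line quoted
above."""
import re

LOG = "/var/log/messages"   # assumed syslog destination for kernel messages
PATTERN = re.compile(r"unionfs: new generation number (\d+)")

latest = None
with open(LOG) as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            latest = int(m.group(1))

if latest is None:
    print("no unionfs generation messages found")
else:
    print("latest generation number: %d" % latest)
```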
To be clear - testing the rest of this week with possible cutover next week. Are there things that need to be done in preparation? If so, we should start on them now...
1) We need to copy the ssh host keys over from surf (sketched below) to avoid disruptions for things that use the stage.mozilla.org domain name.
2) surf will continue to be available via stage-old.mozilla.org for at least a few weeks after we switch, as a fallback in case anyone runs into a critical problem that we can't solve immediately after we switch.
3) DNS for stage.mozilla.org will be changed to point at dm-stage02 instead of surf on the cutover date.

That's the basic plan, assuming all goes well. Changing host keys will involve tinderbox intervention for anyone that's already talking to stage-new.
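Step 1 boils down to copying the host keys over and restarting sshd. A sketch, assuming stock OpenSSH key locations; apart from the hostname, none of this is from the bug.

```python
#!/usr/bin/env python
"""Sketch of step 1 above: copy the ssh host keys from surf so clients of
stage.mozilla.org don't see a host-key change after cutover. The hostname is
from the bug; the key paths assume a stock OpenSSH install."""
import subprocess

SRC_HOST = "surf"   # the machine currently serving stage.mozilla.org

# Pull the host keys (private and public halves) over, then restart sshd
# so it picks them up. The remote glob is expanded by the remote shell.
subprocess.check_call(
    ["scp", "root@%s:/etc/ssh/ssh_host_*" % SRC_HOST, "/etc/ssh/"])
subprocess.check_call(["service", "sshd", "restart"])
```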
I guess this part of the conversation should actually be happening on bug 394069
(In reply to comment #28)
> The new version was deployed late Saturday evening and the load test was
> left running. As of this writing, it hasn't crashed yet.
>
> Feb 19 20:48:02 dm-stage02 kernel: unionfs: new generation number 6896
>
> I don't think that generation number has made it past 2500 before under
> load-testing conditions, so this looks darn promising. I think we'll try to
> let it run for a week before we declare it stable, though.

I'll continue to keep my fingers crossed, but already this is outstandingly good news!!
Feb 21 15:05:02 dm-stage02 kernel: unionfs: new generation number 11240

and still ticking.
Let's plan for Tuesday - Dave is running point for the cutover, so look to him for direction on what needs to be done when/where/how...
I think we need bug 72410 comment #18 first, so we don't block builds for perf-testing behind l10n builds. I would also like to take a bit of time to figure out exactly what we still need to do, and whether a transition is better than a jump. Can I get back to you early next week about the possibility of Thursday?
Sure - if we need to go past Thursday, let's talk, as I want to make sure to get this in soon if possible (given the long wait on unionfs).
Just a reminder, this bug is only about unionfs crashing. For discussion of doing the cutover and whatnot please go to bug 394069.
Feb 24 21:15:02 dm-stage02 kernel: unionfs: new generation number 19321
21:16:02 up 7 days, 19:28, 1 user, load average: 0.00, 0.02, 0.00

I'm calling this fixed. :) I'll reopen if anything else happens, but it looks solid at this point. :)
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Way to go, justdave! I know this has been quite the saga, but nice persistent work tracking this down with the right people!
This is still happening. It's way less frequent under our testing than it used to be, but putting it into production shook everything loose again, and we had to revert back to surf after it died 4 times on the first day in production. We've gotten the unionfs maintainer access to the two machines this is set up on so he can play around, and he says he's hoping to have something for us later this week or early next week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
What's the latest news here? I saw the DNS record was set to the new system a few days ago.
And then it was switched back a few hours later when it crashed again. We're still waiting on a new patch from Erez to address the latest crash. He said it might be a week or two because unionfs is in the middle of a small restructuring for compatibility with something that just landed in the upstream kernels. On the plus side, there is a port of unionfs to the RHEL5 kernel now (it didn't work when we first started, so I used Fedora instead). The Fedora release we used is already EOLed (they don't support releases very long), so I'm thinking I'm going to redo it on RHEL if I can find an easy way to switch. The more stable base kernel might help, too.
Component: Server Operations → Server Operations: Projects
Changing QA Contact.
QA Contact: justin → mrz
Can we close this? dm-stage02 isn't even Fedora anymore and the world has changed since we had these issues.
(In reply to comment #44)
> Can we close this? dm-stage02 isn't even Fedora anymore and the world has
> changed since we had these issues.

Before we close this, can we try load-testing unionfs again? It would be good to confirm all is OK before we unblock bug 394069.
(In reply to comment #45)
> (In reply to comment #44)
> > Can we close this? dm-stage02 isn't even Fedora anymore and the world has
> > changed since we had these issues.
>
> Before we close this, can we try load-testing unionfs again? It would be
> good to confirm all is OK before we unblock bug 394069.

I'm suggesting we close -this- bug and start over. There is no unionfs on dm-stage02; there's nothing - it's a fresh reload with zero configuration and nothing to load test at the moment. We should mark this as INVALID, figure out whether this method is still the direction we want to go, and, if so, make another effort at the whole unionfs deal.
No comments after a year. We'll restart this in a new bug if we ever do this.
Status: REOPENED → RESOLVED
Closed: 18 years ago → 15 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard