[Builder-preview] NFS performance problems?

RESOLVED FIXED

Status

--
major
RESOLVED FIXED
7 years ago
5 years ago

People

(Reporter: clouserw, Assigned: nmaul)

Tracking

Details

(Reporter)

Description

7 years ago
We've been having some performance issues with NFS on the builder boxes.  The logs below show an example, where the gearman box creates a file and the front end box doesn't see the new file for 15 seconds.

> ------------------------------------------------------------

Jun 20 14:26:57 pm-gearman-amo02.mozilla.org:  f.xpi_utils:INFO [xpi:1tfl2y51jbdhc0] Created xpi: /mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi (prep time: 287ms) (build time: 447ms) (copy xpi time: 80ms)
Jun 20 14:26:57 pm-gearman-amo02.mozilla.org: celery:INFO Task xpi.tasks.xpi_build_from_model[0aa48a93-a4fe-497a-977d-bcf03ff96576] succeeded in 0.818953990936s
Jun 20 14:26:58 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:04 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:04 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:06 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:08 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:10 pm-app-amo24.mozilla.org:  f.xpi:DEBUG Add-on not yet created: [Errno 2] No such file or directory: u'/mnt/netapp_amo/addons.mozilla.org-builder.preview/xpi_targetdir/1tfl2y51jbdhc0.xpi'
Jun 20 14:27:12 pm-app-amo24.mozilla.org:  f.xpi:INFO Downloading Add-on: 1tfl2y51jbdhc0

> ------------------------------------------------------------

Can you investigate if there are any load problems or slowness issues with NFS right now?  If NFS is performing as expected, what kind alternatives do we have in order to get the delay time less than 3s?
QA cannot test under the current dev environment conditions.
Severity: normal → critical
Priority: -- → P1
(Reporter)

Comment 2

7 years ago
Marking as blocker for comment.  This can range all the way up to 60+ seconds at times
Severity: critical → blocker
I was getting well over a minute - and now a majority of the time, it is timing out altogether.
Assignee: server-ops → dgherman
(Assignee)

Comment 4

7 years ago
Mounting the NFS share with the 'sync' option everywhere might help this. This would hurt write performance, but it should also delay the actual write from returning until it's completed.
(Reporter)

Comment 5

7 years ago
(In reply to comment #4)
> Mounting the NFS share with the 'sync' option everywhere might help this.
> This would hurt write performance, but it should also delay the actual write
> from returning until it's completed.

Is it true that we'd need to take down production to try this on staging?
(Reporter)

Comment 6

7 years ago
(In reply to comment #4)
> Mounting the NFS share with the 'sync' option everywhere might help this.
> This would hurt write performance, but it should also delay the actual write
> from returning until it's completed.

For the record, the read and the write are on unrelated threads.  Are you thinking that our write() is actually taking several seconds to finish?  Is there any logging that NFS does that could tell us that type of thing?
(Assignee)

Comment 7

7 years ago
Yes, that's what I'm wondering about. There's not really good/common logging options for that, that I'm aware of- only the client can know the timings, and if it's in async mode then it's lying to the userland about it. However, there may be some SystemTap scripts that could show what's going on better. I'm looking. If there's not one ready-to-go, it may be a few days to cook up one that works well enough to rely on.

I believe sync can be added with the 'remount' option, which would not require a full unmount/re-mount cycle, and thus should be zero-downtime. However, I haven't been able to guarantee that... we can try it, though. If it's better than we're done, and if not we can try a full umount/mount and see if that behaves differently.

Also, this would need to be done on the servers trying to write data... the reading server (pm-app-amo24, looks like) would not really matter.
(Reporter)

Comment 8

7 years ago
Ah, didn't realize it was a client side option.  That's easier.

The boxes that would need it are pm-gearman-amo01.mozilla.org and pm-gearman-amo02.mozilla.org.

Neither can afford downtime as they are used in production as well as stage
Assignee: dgherman → nmaul
(Assignee)

Comment 9

7 years ago
I have been able to time the writes from the gearman servers, and even with sync it's only a few dozen milliseconds, which corresponds nicely with the NetApp stats- I've been unable to make it really affect the latency or IOPS graphs. This makes sense, because we're only talking about reading and writing some small XPI files... it should not be a lot of work in that respect.

So, it is not an issue with the gearman (celery) servers writing to the NetApp.

However, I too see the long delays before pm-app-amo24 is able to see the new file and respond to it.

I suspect that we may have a situation here where the client (pm-app-amo24) is caching the negative response from the NetApp, rather than actually re-checking for the existence of the file.


Is this a new feature that isn't working as expected, or an old feature that suddenly broke? I've personally never used this app until today.
(Assignee)

Comment 10

7 years ago
I'm told that this was only recently moved to celery... previous to that, I think it was all running locally on pm-app-amo24.

Prod has always done this via celery though, and does not exhibit these symptoms... at least, not normally, and not before today's builder push (results after today's push are unknown at this time).

I am speculating that this is caused by RHEL5's FS-Cache module, and that removing or disabling it somehow it may fix the problem. Thus far I'm not having much luck with that, though. It may also be that this is a normal NFS behavior, and we just haven't toyed with this particular use case enough to notice the problem.

This module exists on the prod servers too. The reason I suspect it may not be manifesting there is that there are many prod servers... repeated attempts to access the file may not all be handled by the same server, so it's much less likely to hit one that already has the bad cache entry.


I'm downgrading the severity of this, since it's not blocking tomorrow's Firefox 5 launch.
Severity: blocker → critical
Priority: P1 → --
(Assignee)

Comment 11

7 years ago
FS-Cache appears to be a red herring. As near as I can tell, it's getting loaded but not actually used.

I have reverted pm-app-amo24 back to the original kernel (kernel-PAE-2.6.18-164.10.1.el5) instead of the one I'd upgraded to (kernel-PAE-2.6.18-238.12.1.el5). The box appears to have a problem with the newer kernel, as it would occasionally hang and throw errors on the console about non-responsive processes... even things as simple as ls and top. My fear is that this may have been triggered by the reboot in general, and not the new kernel specifically. Time will tell.


I found an option to specify if the lookup cache should cache positive responses, negative responses, or both. A positive response is along the lines of "yes, I have this file, and its attributes are XYZ". Negative would be "file does not exist".

The default timer on the attribute cache is a minimum of 3 seconds and a maximum of 60 seconds. This would explain why it would sometimes take up to a minute to return the proper result. I'm not entirely sure how it determines when exactly to check again, but it could be that more frequent accesses result in a longer cache TTL for that item. A hit every few seconds might be enough to keep it from expiring the entry sometimes.

I've set this to positive responses only, which may help. This is on pm-app-amo24... the gearman/celery servers aren't a concern here.

So far, it seems to have done the trick... I've tested a dozen addons, and they all seem to return in under 10 seconds (generally around 5 or so).


Leaving this open, because this fix will eventually need to make its way into some puppet configs if it continues to work well. Downgrading further.
Severity: critical → major
Status: NEW → ASSIGNED
(Reporter)

Comment 12

7 years ago
Thanks for the investigation!
(Assignee)

Comment 13

7 years ago
Added to puppet. Closing this out.
(Assignee)

Updated

7 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.