Closed Bug 663963 Opened 13 years ago Closed 13 years ago

change LDAP to see if that speeds up mercurial

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nmeyerhans, Unassigned)

Details

The short story: I'd like to schedule a downtime for the hg
infrastructure in order to test an LDAP change.  I don't actually
think there will be a service disruption, but I'd like to make
sure all interested parties are aware.  I'd like to do this as
soon as possible.  When is a good time to do this from the releng
perspective? I can do this work pretty much any time.

The long story:

I've suspected for a while that LDAP is a big source of
performance problems in the hg infrastructure.
http://it.pastebin.mozilla.org/1246662 shows a portion of strace
output captured when running 'ls' on dm-svn02, the main hg server
on which all write operations run.  Line 158 in that capture
shows poll(2) blocking for 0.88 seconds waiting for data from the
ldap server.  That means that a simple 'ls' call took nearly a
second to run.  The effect that this problem would have on
mercurial could be far worse. (LDAP is involved via the NSS LDAP
module, loaded by libc)
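
As an illustration only (not part of the original capture), the cost of a single name lookup can be timed from Python, since `pwd.getpwuid` calls into libc and so goes through whatever NSS modules are configured. The helper name `time_lookup` and the repeat count are made up for this sketch. If a cache such as nscd is running, repeated lookups may be answered from it, so the first timing is usually the interesting one.

```python
import os
import pwd
import time


def time_lookup(uid, repeats=5):
    """Return wall-clock seconds for each of `repeats` uid -> name lookups."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            # Resolved by libc through the NSS modules listed in nsswitch.conf.
            pwd.getpwuid(uid)
        except KeyError:
            # Unknown uid: the lookup still went through NSS before failing.
            pass
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical usage: time lookups for the current user.
    for i, seconds in enumerate(time_lookup(os.getuid()), start=1):
        print(f"lookup {i}: {seconds * 1000:.2f} ms")
```

A lookup that takes hundreds of milliseconds here would be consistent with the poll(2) stall seen in the strace output, though this measures the whole lookup rather than the LDAP round trip alone.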

In order to determine how badly this is affecting mercurial, and
test a possible solution, we'd like to run an experiment.  If we
can ensure that /etc/passwd and /etc/group on dm-svn02 contain
the same data as LDAP, we can disable the NSS LDAP module.  This
will allow hg to run without depending on LDAP for local uid/gid
to name mapping.  LDAP will still be used for authentication via
the "LPK" functionality, but this is only used during the initial
connection to the hg server and will not affect the operation of
locally running processes after authentication has taken place. I
would like to run in this configuration for at least 24 hours.
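
One way to check the "same data" precondition, sketched under assumptions: take a dump of the directory's view in passwd(5) or group(5) format (for example the output of `getent passwd` captured while the LDAP module is still enabled, assuming enumeration is permitted) and compare it with the local file. The function names and the sample accounts below are hypothetical, and the password field is ignored because it may legitimately differ.

```python
def parse_entries(text):
    """Map each name in passwd/group-format text to its non-password fields."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # fields[1] is the password placeholder; skip it in the comparison.
        entries[fields[0]] = tuple(fields[2:])
    return entries


def compare(local_text, directory_text):
    """Return (names missing locally, names whose fields differ)."""
    local = parse_entries(local_text)
    directory = parse_entries(directory_text)
    missing = sorted(set(directory) - set(local))
    differing = sorted(
        name for name in directory
        if name in local and local[name] != directory[name]
    )
    return missing, differing


if __name__ == "__main__":
    # Made-up sample data, not taken from dm-svn02.
    local = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "alice:x:1001:1001::/home/alice:/bin/bash\n"
    )
    directory = (
        "alice:x:1001:1001::/home/alice:/bin/bash\n"
        "bob:x:1002:1002::/home/bob:/bin/bash\n"
    )
    print(compare(local, directory))
```

Both lists being empty would suggest the local files cover everything the directory provides; the same comparison works for /etc/group since only the first field is treated as the key.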

This change should be entirely transparent.  Hg processes that
are running at the time the change is made will have
already loaded the NSS LDAP module and will continue to use it
until they exit.  The only issue to be aware of is that changes
to hg access (group membership, or the creation of a new account)
will not automatically propagate to the hg servers the way they
do now.  If any hg access changes need to be pushed urgently, we
can do that manually.

If it turns out that this change helps performance in a
meaningful way, we can develop a program to keep /etc/passwd and
/etc/group in sync.

After running this experiment for a day or so, we'll likely want
to revert the configuration change. This change should again be
100% transparent. I'll leave it to releng to decide whether we
should have another short treeclosure at this time, or whether we
can make the change "live".  I believe it should be safe to do
without a treeclosure, but even with a treeclosure the outage
would be very short.
Flags: needs-treeclosure?
Group: infra
(In reply to comment #1)
> The short story: I'd like to schedule a downtime for the hg
> infrastructure in order to test an LDAP change.  I don't actually
> think there will be a service disruption, but I'd like to make
> sure all interested parties are aware.  I'd like to do this as
> soon as possible.  When is a good time to do this from the releng
> perspective? I can do this work pretty much any time.
...

> After running this experiment for a day or so, we'll likely want
> to revert the configuration change. This change should again be
> 100% transparent. I'll leave it to releng to decide whether we
> should have another short treeclosure at this time, or whether we
> can make the change "live".  I believe it should be safe to do
> without a treeclosure, but even with a treeclosure the outage
> would be very short.

Let's err on the side of caution, and schedule a treeclosure for this. Bug correctly flagged "needs-treeclosure?" until buildduty has this slotted for an announced downtime.

Depending on how this goes, we can see if we need another treeclosure for the revert.
Summary: need to schedule downtime for mercuial maintenence → change LDAP to see if that speeds up mercurial
Are you running nscd? If so, any idea why this would still be a problem?
Noah, I put this in the list for Thursday morning downtime June 16 from 4am - 8am PDT.  Please let me know if you are able to be around to do what is needed at that time. If not, then we can reset the flag to ? for needs-treeclosure and this can ride along in a future downtime.
Flags: needs-treeclosure? → needs-treeclosure+
Whiteboard: [downtime 6/16 4am-8am PDT]
I can make this change during the Thursday downtime. However, it'd be nice to revert this change after a day or so, and I'm on PTO Friday.  If we can get a short downtime next week, that might be better.  Do you think that will be possible?
(In reply to comment #4)
> I can make this change during the Thursday downtime. However, it'd be nice
> to revert this change after a day or so, and I'm on PTO Friday.  If we can
> get a short downtime next week, that might be better.  Do you think that
> will be possible?

Per email and irc with noahm:

1) Noah prefers to not do this ldap change for hg in tomorrow morning's downtime after all. 
a) it needs him to be there 24hrs later for a rollback downtime 
b) there are other non-treeclosing ideas he can try first
2) if Noah does want to do the ldap change, he is ok to wait until next week's downtime (likely Thurs morning) after FF5.0 ships


I have therefore cleared the needs-treeclosure flag, and whiteboard, because this will not be happening tomorrow. Noah, whenever you are ready to retry this, please set needs-treeclosure?
Flags: needs-treeclosure+ → (cleared)
Whiteboard: [downtime 6/16 4am-8am PDT] → (cleared)
(In reply to comment #5)
> (In reply to comment #4)
> > I can make this change during the Thursday downtime. However, it'd be nice
> > to revert this change after a day or so, and I'm on PTO Friday.  If we can
> > get a short downtime next week, that might be better.  Do you think that
> > will be possible?
> 
> Per email and irc with noahm:
> 
> 1) Noah prefers to not do this ldap change for hg in tomorrow morning's
> downtime after all. 
> a) it needs him to be there 24hrs later for a rollback downtime 
> b) there are other non-treeclosing ideas he can try first
> 2) if Noah does want to do the ldap change, he is ok to wait until next
> week's downtime (likely Thurs morning) after FF5.0 ships
> 
> 
> I have therefore cleared the needs-treeclosure flag, and whiteboard, because
> this will not be happening tomorrow. Noah, whenever you are ready to retry
> this, please set needs-treeclosure?

Found in triage. Fixing the assignment & component.
Assignee: nobody → nmeyerhans
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Is this still necessary?
Assignee: nmeyerhans → server-ops
In relation to comment 2, we are running nscd (at least, we are *now*... I don't know how long we have been). I suspect noahm may have enabled this at some point, but kept this bug around as a more long-term goal. nscd does have the downside that changes may take up to 10 minutes to start working as expected, and of course being a cache it will miss on occasion.

In my own "strace ls -al" on dm-svn02, I'm not seeing any LDAP connectivity... just nscd. After playing with it for a little bit, I believe the hit rate on the nscd cache is very good. The TTL is somewhat short (10min for "positive" responses, 20sec for "negative"), so there will be occasional lags when records have to be re-looked-up, but the alternative is to have credentials hang around in the cache for a long time. Given the current hit rate (seems to be >95%), I'm inclined to leave the TTL alone.
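
To make the TTL trade-off concrete, here is a toy sketch of a lookup cache with separate positive and negative lifetimes. It is not how nscd is implemented; the class name `TTLCache`, the injectable clock, and the convention that `None` means "no such entry" are all assumptions for illustration. The defaults mirror the 10 minute and 20 second values mentioned above.

```python
import time


class TTLCache:
    """Cache lookups, keeping found entries longer than not-found ones."""

    def __init__(self, lookup, positive_ttl=600.0, negative_ttl=20.0,
                 clock=time.monotonic):
        self._lookup = lookup              # backend call, e.g. a directory query
        self._positive_ttl = positive_ttl  # seconds to keep a found entry
        self._negative_ttl = negative_ttl  # seconds to keep a "not found"
        self._clock = clock
        self._entries = {}                 # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[1] > now:
            self.hits += 1
            return cached[0]
        # Expired or never seen: ask the backend and cache the answer.
        self.misses += 1
        value = self._lookup(key)
        ttl = self._positive_ttl if value is not None else self._negative_ttl
        self._entries[key] = (value, now + ttl)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Raising the positive TTL would push the hit rate up further but would also delay how quickly an account or group change becomes visible, which is the trade-off described in this comment.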

Since this was opened by us (Infra) I'm going to close it back out. If anyone has reason to believe this is still a problem, please re-open and explain any concerns. Thanks!
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard