bugzilla.mozilla.org has resumed normal operation. Attachments prior to 2014 will be unavailable for a few days. This is tracked in Bug 1475801.
Please report any other irregularities here.

hgAccessDate LDAP attribute update always failing

RESOLVED FIXED

Status

Developer Services
Mercurial: hg.mozilla.org
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: gps, Assigned: gps)

Tracking

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(4 attachments)

(Assignee)

Description

3 years ago
pash attempts to update the "hgAccessDate" LDAP attribute with the date of the last time a user was active. It is failing 100% of the time due to a comedy of errors.
(Assignee)

Comment 1

3 years ago
Created attachment 8655732 [details]
MozReview Request: pash: use proper API for retrieving LDAP settings (bug 1200847); r?fubar

pash: use proper API for retrieving LDAP settings (bug 1200847); r?fubar

This was failing 100% of the time because of the bad name. Failure was
swallowed by the exception handler. Perhaps we should have tests for
this...
Attachment #8655732 - Flags: review?(klibby)
(Assignee)

Comment 2

3 years ago
Created attachment 8655733 [details]
MozReview Request: pash: use early return (bug 1200847); r?fubar

pash: use early return (bug 1200847); r?fubar

Make the code more readable before modifying it.
Attachment #8655733 - Flags: review?(klibby)
(Assignee)

Comment 3

3 years ago
Created attachment 8655734 [details]
MozReview Request: pash: formatting fixups (bug 1200847); r?fubar

pash: formatting fixups (bug 1200847); r?fubar

Wrap some long lines. Use dedicated variable names. Use "in"
instead of "has_key."
Attachment #8655734 - Flags: review?(klibby)
(Assignee)

Comment 4

3 years ago
Created attachment 8655735 [details]
MozReview Request: pash: make LDAP access date updating actually work (bug 1200847); r?fubar

pash: make LDAP access date updating actually work (bug 1200847); r?fubar

Previously, if the "hgAccessDate" attribute was missing from the LDAP
entry, we would never create it.

The new logic looks to see if the entry has the "hgAccount" objectClass.
If it does, it can hold a "hgAccessDate" attribute and we attempt to
define it.

For good measure, tests verifying the addition of the hgAccessDate
attribute have been added.
Attachment #8655735 - Flags: review?(klibby)
(Assignee)

Comment 5

3 years ago
Comment on attachment 8655735 [details]
MozReview Request: pash: make LDAP access date updating actually work (bug 1200847); r?fubar

pash: make LDAP access date updating actually work (bug 1200847); r?fubar

Previously, if the "hgAccessDate" attribute was missing from the LDAP
entry, we would never create it.

The new logic looks to see if the entry has the "hgAccount" objectClass.
If it does, it can hold a "hgAccessDate" attribute and we attempt to
define it.

For good measure, tests verifying the addition of the hgAccessDate
attribute have been added.
(Assignee)

Comment 6

3 years ago
AFAICT hgAccessDate hasn't worked properly since at least April 2012. There is a flurry of Puppet activity for the hg_new module beginning with r35242. Unfortunately the code until this day is buggy: if the "hgAccessDate" attribute isn't set, there is an attempt to set it. However, there is a reference to an unassigned variable. This raises an exception which is logged to /var/log/pash.log and no LDAP update occurs.

As mentioned in the commit message, existing users with hgAccessDate have that attribute updated on login. At least until August 20, when https://hg.mozilla.org/hgcustom/version-control-tools/rev/2b7276ed597e accidentally broke it.

I'm not sure how hgAccessDate was originally set. Either we had working hgAccessDate creation at one point or the attribute was initialized to a static value. I'm thinking the latter because we have 183 entries in LDAP with hgAccessDate=20100222170313.000000Z.

What's concerning about all this is the number of enabled accounts *without* an hgAccessDate attribute. LDAP says we have 1,050 accounts with hgAccountEnabled=TRUE and 484 with hgAccountEnabled=FALSE. However, only 278 of the enabled accounts have hgAccessDate set. I spot checked the list without hgAccessDate and I know some of these people have pushed to Mercurial and should have hgAccessDate. vcs_accounts_disable.py from puppet/modules/openldap/templates/scripts/vcs_accounts_disable.py appears to only disable accounts that have hgAccessDate set. This seems like a bug.

We can't just lock out users without hgAccessDate right now because it would lock out a lot of people that have been active in the previous several months. I could write a script that finds all users that have pushed in the past 6 months and update hgAccessDate accordingly. Then we could disable all accounts with an old or missing hgAccessDate attribute.
(Assignee)

Comment 7

3 years ago
And, uh, as far as I can tell gitAccessDate isn't set for *any* LDAP records. I'm not sure where the code to set gitAccessDate is, assuming it exists at all.
Given https://bugzilla.mozilla.org/show_bug.cgi?id=524153#c25 I'd bet on it not existing.
Comment on attachment 8655732 [details]
MozReview Request: pash: use proper API for retrieving LDAP settings (bug 1200847); r?fubar

https://reviewboard.mozilla.org/r/18017/#review16169
Attachment #8655732 - Flags: review?(klibby) → review+
Comment on attachment 8655733 [details]
MozReview Request: pash: use early return (bug 1200847); r?fubar

https://reviewboard.mozilla.org/r/18019/#review16171
Attachment #8655733 - Flags: review?(klibby) → review+

Updated

3 years ago
Attachment #8655734 - Flags: review?(klibby) → review+
Comment on attachment 8655734 [details]
MozReview Request: pash: formatting fixups (bug 1200847); r?fubar

https://reviewboard.mozilla.org/r/18021/#review16175

Updated

3 years ago
Attachment #8655735 - Flags: review?(klibby) → review+
Comment on attachment 8655735 [details]
MozReview Request: pash: make LDAP access date updating actually work (bug 1200847); r?fubar

https://reviewboard.mozilla.org/r/18023/#review16177
(Assignee)

Comment 13

3 years ago
Deploying the pash updates now. But I'll keep the bug open to track retroactively populating hgAccessDate on accounts that don't have one. This should result in hundreds of accounts properly getting disabled.
(Assignee)

Comment 14

3 years ago
The deploy still isn't setting the LDAP attribute because the server we're using for writes isn't up, accessible, and/or allowing writes.

jabba: what LDAP server should we be using to update hgAccessDate from hgssh1.dmz.scl3?
Flags: needinfo?(jdow)
(Assignee)

Comment 15

3 years ago
So, uh, writing to ldap://ldapsync1.db.scl3.mozilla.com appears to work. I think there is some kind of replication delay that was causing confusion.
Flags: needinfo?(jdow)

Comment 16

3 years ago
Actually, can you change it to write to ldapmaster1.db.scl3.mozilla.com? I'd like to decom ldapsync1.db.scl3, and the pash thing is the only dependency I have to keep it around. If you write to ldapmaster1.db.scl3.mozilla.com, I think overall you'll get better results, as it's a master and not an intermediate.
(Assignee)

Comment 17

3 years ago
hgssh1 is timing out connecting to ldapmaster1.db.scl3.mozilla.com:389.

Also, I appear to be randomly getting "Server is unavailable" messages from ldapsync1.db.scl3. Not sure what the failure rate is. But it is kinda annoying.
(Assignee)

Comment 18

3 years ago
I wrote a script that set/updated hgAccessDate from pushlog data where hgAccessDate was missing or lagging by more than a day. Several hundred entries were updated. I expect the cron that sets hgAccountEnabled=FALSE for hgAccessDate > 180 days will have a lot of activity the next time it runs.

Comment 19

3 years ago
(In reply to Gregory Szorc [:gps] from comment #17)
> hgssh1 is timing out connecting to ldapmaster1.db.scl3.mozilla.com:389.
> 
> Also, I appear to be randomly getting "Server is unavailable" messages from
> ldapsync1.db.scl3. Not sure what the failure rate is. But it is kinda
> annoying.

ACCEPT     tcp  --  hgssh1.dmz.scl3.mozilla.com  anywhere            multiport dports ldap /* 230 allow hg writes */
ACCEPT     tcp  --  hgssh2.dmz.scl3.mozilla.com  anywhere            multiport dports ldap /* 231 allow hg writes */

The rules to allow this are in place on the master's firewall. I wonder if the netops flow got lost at some point. Needinfo dcurado to check if that rule is still in place. It was when we launched the master and I remember testing it successfully then, but we are now getting timeouts on these flow:

hgssh1.dmz.scl3.mozilla.com -> ldapmaster1.db.scl3.mozilla.com:389
hgssh2.dmz.scl3.mozilla.com -> ldapmaster1.db.scl3.mozilla.com:389
Flags: needinfo?(dcurado)
There are two policies in place related to ldp but they do not include ldapmaster1.db.scl3.mozilla.com

Policy: ldapsync1--ldap, action-type: permit, State: enabled, Index: 390, Scope Policy: 0
  Policy Type: Configured
  Sequence number: 3
  From zone: dmz, To zone: db
  Source addresses:
    hgssh2.dmz.scl3: 10.22.74.29/32
    hgssh1.dmz.scl3: 10.22.74.28/32 <----------- this guy
  Destination addresses:
    ldapsync1.db.scl3: 10.22.70.20/32 <--------- can talk to this guy
  Application: junos-ldap
    IP protocol: tcp, ALG: 0, Inactivity timeout: 1800
      Source port range: [0-0]
      Destination port range: [389-389] <-------- over port 389
  Application: ldaps
    IP protocol: tcp, ALG: 0, Inactivity timeout: 1800
      Source port range: [0-0]
      Destination port range: [636-636]

Are you sure there was a policy for ldapmaster1.db.scl3.mozilla.com?
It would be unusual for a policy to be removed without someone specifically 
asking for it.

Here's the other policy:

Policy: ldapsync-ldaps, action-type: permit, State: enabled, Index: 392, Scope Policy: 0
  Policy Type: Configured
  Sequence number: 5
  From zone: dmz, To zone: db
  Source addresses:
    any-ipv4: 0.0.0.0/0 <----------- anybody from dmz
    any-ipv6: ::/0
  Destination addresses:
    ldapsync1.db.scl3: 10.22.70.20/32 <------ can talk
    ldap.db.scl3: 10.22.70.17/32      <------ to these guys
  Application: ldaps
    IP protocol: tcp, ALG: 0, Inactivity timeout: 1800
      Source port range: [0-0]
      Destination port range: [636-636]
  Application: ldap
    IP protocol: tcp, ALG: 0, Inactivity timeout: 1800
      Source port range: [0-0]
      Destination port range: [389-389] <----- over port 389
Flags: needinfo?(dcurado)
It's a quick change to add ldapmaster1.db.scl3.mozilla.com to that second policy as a destination-address.
Let me know if you'd like me to do that?

Thanks.
Flags: needinfo?(jdow)

Comment 22

3 years ago
dcurade - I'm referring to https://bugzilla.mozilla.org/show_bug.cgi?id=1161724 where I think this all got set up, and I tested it working then.
Flags: needinfo?(jdow) → needinfo?(dcurado)
Do we need to make this bug private now there are internal IP addresses within it?
--------------------------------------------------------------------------
Policy: ldapmaster1--ldap-ssh, action-type: permit, State: enabled, Index: 699, Scope Policy: 0
  Policy Type: Configured
  Sequence number: 22
  From zone: dc, To zone: db
  Source addresses:
    admin1b.private.ber1: 10.249.75.7/32
    admin1a.private.ber1: 10.249.75.6/32
    admin1b.private.mtv2: 10.252.75.7/32
    admin1a.private.mtv2: 10.252.75.6/32
    admin1b.private.lon1: 10.246.75.7/32
    admin1a.private.lon1: 10.246.75.6/32
    admin1b.private.tpe1: 10.247.75.7/32
    admin1a.private.tpe1: 10.247.75.6/32
    admin1b.private.yvr1: 10.244.75.7/32
    admin1a.private.yvr1: 10.244.75.6/32
    admin1b.private.sfo1: 10.251.75.7/32
    admin1a.private.sfo1: 10.251.75.6/32
    admin1b.private.par1: 10.243.75.7/32
    admin1a.private.par1: 10.243.75.6/32
    admin1b.private.tor1: 10.242.75.7/32
    admin1a.private.tor1: 10.242.75.6/32
    admin1b.private.akl1: 10.245.75.7/32
    admin1a.private.akl1: 10.245.75.6/32
    admin1b.private.pdx1: 10.248.75.7/32
    admin1a.private.pdx1: 10.248.75.6/32
  Destination addresses:
    ldapmaster1.db.scl3: 10.22.70.130/32
  Application: ldap
    IP protocol: tcp, ALG: 0, Inactivity timeout: 1800
      Source port range: [0-0]
      Destination port range: [389-389]
  Application: junos-ssh
    IP protocol: tcp, ALG: 0, Inactivity timeout: 86400
      Source port range: [0-0]
      Destination port range: [22-22]
Status: ASSIGNED → UNCONFIRMED
Ever confirmed: false
Flags: needinfo?(dcurado)

Updated

3 years ago
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Is this now working OK for everyone?
Status: ASSIGNED → UNCONFIRMED
Ever confirmed: false
Flags: needinfo?(gps)

Comment 26

3 years ago
The connection is working with my test now. :gps can you try pointing pash at ldapmaster1.db.scl3.mozilla.com again and let me know if it works. I think the ACLs for the bind-hg/bind-vcs/bind-svn users should be set properly, but might need some tweaking, so let me know.
Also, if we know when this stopped working, I really would like to try to track back how this got broken, to avoid having it happen again.  Thanks

Comment 28

3 years ago
I don't know if we'll have data on that. I don't think we ever tried it other than the initial test and I've put the "request relops to fix pash" issue on the back burner until this bug popped up, when I thought "great! now's the time to finally make use of the new ldapmaster flows!"
(Assignee)

Comment 29

3 years ago
I switched hgssh1 to ldapmaster1 and it appears to be working!

I'd also like to have reviewboard-hg1.dmz.scl3 do LDAP updating. It has an ACL to ldapsync1.db.scl3. Could we get connectivity to ldapmaster1 as well?
Flags: needinfo?(gps) → needinfo?(dcurado)
Yes, I'd be glad to add whatever security policy you require.
Can you do me a favor, can you put in terms of:

reviewboard-hg1.dmz.scl3 -> ldapmaster1.db.scl3 tcp/9200

(I think the above may be what you're asking for, I just want to be sure... if it is what you're
 asking for, feel free to say: "yeah, like that")

Also, I think we figured out what happened with the original policy.
Not only do we have *a lot* of security policies on this firewall, but we know that a lot of
it is cruft -- things people asked us to configure, but they no longer use it, and never told
us to delete it.  We used a new feature in the OS to ask, "which of these security policies has
not been hit in the last month" or something like that.  I can't remember if 1 month was the
time frame or not.  Anyway, anything that had not been touched in <some duration of time> we
deleted.  I think that's what happened here.
Flags: needinfo?(dcurado)
(Assignee)

Comment 31

3 years ago
A few days ago we had 1,050 accounts with hgAccountEnabled=TRUE and 484 with hgAccountEnabled=FALSE. After my backfilling of hgAccessDate, it appears that the cron to mark dormant accounts as inactive has done some work: we now have 924 enabled accounts and 608 disabled.

We still have a number of accounts with no hgAccessDate but with hgAccountEnabled=TRUE. 294 to be exact. hgAccessDate wasn't backfilled on these accounts presumably because we could find no pushes attributed to the account. These were likely accounts that were enabled but never used for pushing.

I think whatever process sets hgAccountEnabled should also set hgAccessDate to the current time. This way the account will expire if it is never used. For now, I'll write a script that sets hgAccessDate to like 2010-01-01 00:00:00 for all users that don't have hgAccessDate defined.
(Assignee)

Comment 32

3 years ago
reviewboard-hg1.dmz.scl3 -> ldapmaster1.db.scl3 tcp/389
reviewboard-hg1.dmz.scl3 -> ldapmaster1.db.scl3 tcp/636

Comment 33

3 years ago
(In reply to Gregory Szorc [:gps] from comment #32)
> reviewboard-hg1.dmz.scl3 -> ldapmaster1.db.scl3 tcp/389
> reviewboard-hg1.dmz.scl3 -> ldapmaster1.db.scl3 tcp/636

I added the iptables rule for this to work, so once dcurado gets the firewall updated, and once puppet has ran on the server, this should be good to go.

I believe our account creation tool does in fact set hgAccessDate to current time, although that had not always been the case, and is a feature added in the last few years. I don't track new hg accounts though, so someone that does would need to let me or :gps know next time a new account is added and we can check if it did indeed set hgAccessDate to current time before the user did a push.
(Assignee)

Comment 34

3 years ago
I'm seeing some weird behavior that I think you could explain.

When I bind as uid=bind-hg,ou=logins,dc=mozilla to ldap.db.scl3, I can see the "hgAccountEnabled" attribute. But when I bind as the same user to ldapmaster1.db.scl3, the attribute is never returned. Perhaps ldapmaster1.db.scl3 isn't configured to expose this attribute to that user? Could you please verify that ldapmaster1.db.scl3 will expose all objectClass=hgAccount attributes to this user?

Also, currently we bind to ldap.db.scl3 to get info then make writes to ldapmaster1.db.scl3 (if necessary). However, I've noticed that there appears to be a lag between writing and the change being seen on ldap.db.scl3. This makes me nervous about issuing a query on one host and writing to another since stale data may be involved. I was wondering if you could explain how replication works so I know what considerations we need to have.
Flags: needinfo?(jdow)

Comment 35

3 years ago
There are different ACLs on the two servers. The slave (ldap.db.scl3) allows all bind users to read those attributes. The master on the other hand doesn't have that ACL and only allows the writing of the hgAccessDate attribute.

As far as replication goes, it should be pretty immediate, and definitely faster than when using the ldapsync1 host as was the case previously. When I say immediate, I mean when I have two terminal windows open, one tailing the log on the slave and one writing to the master, I see the replication updates happen pretty instantaneously. I haven't ever done any sort of "write to master, query slave" timing tests though, so perhaps there is some delay there, but I'm not sure the best way to deal with it. You definitely don't want to be reading from the master, it's not designed to handle that type of load, whereas the slaves are (ldap.db.scl3 is a load balancer vip with two higher end servers behind it. the master is a single VM).

I'm not sure of the current work flow, but when a lot of this was originally designed (or rather re-designed during one of the many iterations), the idea was that a code push would be:

1) authenticate to hgssh with ssh key
2) pash.py checks ldap for hgAccountEnabled=TRUE, doesn't care about hgAccessDate
3) pash.py sends the current date to the master to update it with the current date
4) once per day, a cron job on the master looks at all hgAccessDates and if it finds one older than 6 months, it sets hgAccountEnabled=FALSE

So, in that workflow, hgssh servers never actually care what the hgAccessDate was set to during a push, as all they care about is whether the account is disabled or not, and on a successful push, it would send the current time back to the master just to keep the account open another 6 months. I don't think there is really any possibility of a race condition there that would need to be worried about (Unless someone pushes right before the disabler cron runs after having not pushed for exactly 6 months - but, meh, that's just silly to entertain).

If the current design/workflow is significantly different from how I described it above, we can certainly look at redesigning some of this stuff, so let me know.
Flags: needinfo?(jdow)
(Assignee)

Comment 36

3 years ago
Thank you for that detailed reply: it helps a lot!

I think I was getting confused by:

1) Additional replication lag from ldapsync1, which we were using until a few hours ago
2) Me not realizing that searching ldapmaster1/ldapsync1 retrieved no useful attributes
(Assignee)

Comment 37

3 years ago
Comment 33 confirmed that every user with a hgAccount is supposed to have hgAccessDate set. So, I mass populated all mail=* DNs having objectClass=hgAccount with a hgAccessDate value of 20100101000000.000000Z. These accounts should be disabled the next time the user deactivation cron runs. I expect ~295 accounts to be disabled.
(Assignee)

Comment 38

3 years ago
After yesterday's mass setting of hgAccessDate on all accounts that were missing it, we are now down to 633 enabled accounts and 900 disabled ones. At the beginning of this week we had 1,050 enabled and 484 disabled.

I don't see anything left for us to do in this bug, so closing.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Updated

3 years ago
Duplicate of this bug: 833773
You need to log in before you can comment on or make changes to this bug.