Closed Bug 1999425 Opened 23 days ago Closed 12 days ago

Intermittent Phabricator DB issues 2025-11-10--11: “Too many connections” (to DB)

Categories

(Conduit :: Phabricator, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: shtrom, Assigned: shtrom)

References

(Blocks 2 open bugs)

Details

Since 2025-11-10 22:03 UTC (9:03 AEDT), we have been seeing intermittent issues with Phabricator, which seem to stem from database connection problems.

  • From moz-phab:
matthew@ZenTower:~/firefox2$ moz-phab submit 
Submitting 3 commits for review
Phabricator Error: Unable to establish a connection to any database host (while trying "phabricator_conduit"). All masters and replicas are completely unreachable.

AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
  • Errors when refreshing pages:
[Core Exception/PhutilAggregateException] Encountered a processing exception, then another exception when trying to build a response for the first exception.
    - PhabricatorDataNotAttachedException: Attempting to access attached data on PhabricatorUser (via getAlternateCSRFString()), but the data is not actually attached. Before accessing attachable data on an object, you must load and attach it.
      
      Data is normally attached by calling the corresponding needX() method on the Query class when the object is loaded. You can also call the corresponding attachX() method explicitly.
    - PhabricatorClusterStrandedException: Unable to establish a connection to any database host (while trying "phabricator_user"). All masters and replicas are completely unreachable.
      
      AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
  • Warnings when refreshing pages:
Unable to establish a connection to any database host (while trying "phabricator_policy"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
  • We are also seeing alerts from phabricator-emails firing and resolving, more or less in line with user reports. As early as 2025-11-10 15:20 UTC (2:20 AEDT):
[FIRING:1] Phabricator conduit (Phabricator Emails Errors Logged low)
**Firing**Value: A=1, B=0
Labels:
- alertname = Phabricator Emails Errors Logged
- grafana_folder = Phabricator
- severity = low
- team = conduit
Annotations:
- description = Phabricatoremails logging errors
Source: https://yardstick.mozilla.org/alerting/grafana/cegjcz26knwg0d/view?orgId=1
Silence: https://yardstick.mozilla.org/alerting/silence/new?alertmanager=grafana&matcher=__alert_rule_uid__%3Dcegjcz26knwg0d&matcher=severity%3Dlow&matcher=team%3Dconduit&orgId=1

Logs for both Phabricator and Phabricator-emails don't reveal anything obvious. Neither do the metrics in Yardstick or the RDS metrics in the AWS console.

Assignee: nobody → omehani
Severity: -- → S3
Status: NEW → ASSIGNED
Severity: S3 → S2
Summary: Intermitent Phabricator DB issues 2025-11-10--11 → Intermitent Phabricator DB issues 2025-11-10--11: “Too many connections” (to DB)

From dkl: Phabricator has a lot of backed up tasks.

There were a large number of PhabricatorCalendarImportReloadWorker tasks in the Leased Tasks list at https://phabricator.services.mozilla.com/daemon/

As they are not critical, dkl killed them.

It's uncertain whether they were a cause, or a consequence, of the observed issues.

FWIW I can still see problems submitting to phab.

From dkl: the database shows all, or almost all, of the maximum 1280 connections in use in show processlist
(run with /app/phabricator/bin/storage shell from the phd instance); all but a couple are in state Sleep.
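
As a rough illustration of the check described above, the following Python sketch tallies rows shaped like show processlist output by their Command column and reports how close the total is to the connection limit. The helper name summarize_processlist and the sample rows are made up for illustration; only the 1280 limit and the Sleep state come from this bug.

```python
from collections import Counter

MAX_CONNECTIONS = 1280  # connection limit quoted in this bug


def summarize_processlist(rows, max_connections=MAX_CONNECTIONS):
    """Count connections per Command value and compute pool utilisation.

    `rows` is an iterable of dicts shaped like `show processlist` rows;
    only the "Command" key is read here.
    """
    by_command = Counter(row["Command"] for row in rows)
    total = sum(by_command.values())
    return {
        "total": total,
        "by_command": dict(by_command),
        "utilisation": total / max_connections,
    }


# Hypothetical sample rows, NOT real data from the incident.
sample_rows = (
    [{"Id": i, "Command": "Sleep", "Time": 1200} for i in range(6)]
    + [{"Id": 100, "Command": "Query", "Time": 0}]
)

# A small limit is used here only so the toy sample is near "full".
summary = summarize_processlist(sample_rows, max_connections=8)
print(summary)
```

A utilisation close to 1.0 with most entries under Sleep would match what dkl observed: the pool is exhausted by idle connections rather than by running queries.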

Also seeing a lot of PhabricatorSearchWorker tasks (in charge of full-text indexing); killing them now with:

/app/phabricator/bin/worker cancel --active --class PhabricatorSearchWorker

dkl also noted that the wait_timeout on the RDS instance is very high, 28800 seconds (vs the default 300).
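
To get a feel for why a long wait_timeout matters here, the sketch below estimates how many idle connections the server would have closed under each of the two values quoted above, given each connection's idle time in seconds (the Time column of show processlist). It assumes the usual MySQL behaviour of closing a non-interactive connection once it has been idle for longer than wait_timeout. The function name and the sample idle times are hypothetical.

```python
def reapable_connections(idle_seconds, wait_timeout):
    """Return how many connections have been idle longer than wait_timeout."""
    return sum(1 for idle in idle_seconds if idle > wait_timeout)


# Hypothetical idle times in seconds, NOT measured during the incident.
sample_idle = [30, 450, 900, 3600, 20000]

# Compare the two wait_timeout values mentioned in this bug.
for timeout in (300, 28800):
    closed = reapable_connections(sample_idle, timeout)
    print(f"wait_timeout={timeout}: {closed} of {len(sample_idle)} would be closed")
```

With the higher value, none of the sample connections would be closed, which is consistent with Sleep connections piling up until the limit is reached.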

Summary: Intermitent Phabricator DB issues 2025-11-10--11: “Too many connections” (to DB) → Intermittent Phabricator DB issues 2025-11-10--11: “Too many connections” (to DB)

At 4:30 UTC, :jbuck resized the db to m8g.2xlarge

5:06 UTC: it looks like the incident is mitigated.

Status: ASSIGNED → RESOLVED
Closed: 12 days ago
Resolution: --- → FIXED
Blocks: 2002695
Blocks: 2003932