Intermittent Phabricator DB issues 2025-11-10--11: “Too many connections” (to DB)
Categories
(Conduit :: Phabricator, defect)
Tracking
(Not tracked)
People
(Reporter: shtrom, Assigned: shtrom)
References
(Blocks 2 open bugs)
Details
Since 2025-11-10 22:03 UTC (9:03 AEDT), we have been seeing intermittent issues with Phabricator, which seem to stem from database connection issues.
- From moz-phab:
matthew@ZenTower:~/firefox2$ moz-phab submit
Submitting 3 commits for review
Phabricator Error: Unable to establish a connection to any database host (while trying "phabricator_conduit"). All masters and replicas are completely unreachable.
AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
- Errors when refreshing pages:
[Core Exception/PhutilAggregateException] Encountered a processing exception, then another exception when trying to build a response for the first exception.
- PhabricatorDataNotAttachedException: Attempting to access attached data on PhabricatorUser (via getAlternateCSRFString()), but the data is not actually attached. Before accessing attachable data on an object, you must load and attach it.
Data is normally attached by calling the corresponding needX() method on the Query class when the object is loaded. You can also call the corresponding attachX() method explicitly.
- PhabricatorClusterStrandedException: Unable to establish a connection to any database host (while trying "phabricator_user"). All masters and replicas are completely unreachable.
AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
- Warnings when refreshing pages:
Unable to establish a connection to any database host (while trying "phabricator_policy"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to mozphab_db@mozphab-phabhost-rds-devsvcprod-201709240232.c8twujv3cfsf.us-west-2.rds.amazonaws.com failed with error #1040: Too many connections.
- We are also seeing alerts from phabricator-emails firing and resolving, more or less in line with user reports. As early as 2025-11-10 15:20 UTC (2:20 AEDT):
[FIRING:1] Phabricator conduit (Phabricator Emails Errors Logged low)
**Firing**
Value: A=1, B=0
Labels:
- alertname = Phabricator Emails Errors Logged
- grafana_folder = Phabricator
- severity = low
- team = conduit
Annotations:
- description = Phabricatoremails logging errors
Source: https://yardstick.mozilla.org/alerting/grafana/cegjcz26knwg0d/view?orgId=1
Silence: https://yardstick.mozilla.org/alerting/silence/new?alertmanager=grafana&matcher=__alert_rule_uid__%3Dcegjcz26knwg0d&matcher=severity%3Dlow&matcher=team%3Dconduit&orgId=1
Logs for both Phabricator and Phabricator-emails don't reveal anything obvious. Neither do metrics in Yardstick and RDS in the AWS console.
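For triage, the error strings quoted above can be tallied by logical database and by MySQL error code. The following is only a rough sketch: the function name `summarize_errors` and the two regular expressions are assumptions derived from the messages in this report, not from any Phabricator tooling.

```python
import re
from collections import Counter

# Patterns derived from the error text quoted in this report (assumption:
# other Phabricator errors may be worded differently).
DATABASE_RE = re.compile(r'while trying "([^"]+)"')
ERROR_CODE_RE = re.compile(r"failed with error #(\d+)")


def summarize_errors(lines):
    """Count logical databases and MySQL error codes mentioned in the lines."""
    databases = Counter()
    codes = Counter()
    for line in lines:
        for name in DATABASE_RE.findall(line):
            databases[name] += 1
        for code in ERROR_CODE_RE.findall(line):
            codes[int(code)] += 1
    return databases, codes
```

Feeding it user-reported messages would show whether all failures carry error #1040 or whether other codes are mixed in.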
Comment 1 (Assignee) • 23 days ago
Ah! There is a dedicated dashboard for RDS metrics. It does show the database connections maxing out:
https://yardstick.mozilla.org/d/beggx0u1j5ekge/mozphab-prod-rds-cloudwatch?orgId=1&from=2025-11-10T20:01:13.307Z&to=2025-11-11T02:01:13.307Z&timezone=browser&viewPanel=panel-4
Comment 2 (Assignee) • 23 days ago
From dkl: Phabricator has a lot of backed up tasks.
Comment 3 (Assignee) • 23 days ago
There were a large number of PhabricatorCalendarImportReloadWorker tasks under Leased Tasks at https://phabricator.services.mozilla.com/daemon/
As they are not critical, dkl killed them.
It's uncertain whether they were a cause, or a consequence, of the observed issues.
Comment 4 • 23 days ago
FWIW I can still see problems submitting to phab.
Comment 5 (Assignee) • 23 days ago
From dkl: show processlist (run via /app/phabricator/bin/storage shell from the phd instance) shows all, or almost all, of the maximum 1280 connections in use; all but a couple are in state Sleep.
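To put numbers on that observation, the processlist output can be summarized per command and compared against the connection limit. This is a minimal sketch, assuming the rows have already been fetched as dictionaries keyed by the SHOW PROCESSLIST column names; the helper `summarize_processlist` is hypothetical, and idle connections are assumed to appear with `Command` set to `Sleep`.

```python
from collections import Counter


def summarize_processlist(rows, max_connections):
    """Summarize processlist rows by Command and report limit utilization.

    rows: iterable of dicts keyed by SHOW PROCESSLIST column names
          (hypothetical input; only "Command" is used here).
    max_connections: the server's connection limit (1280 in this report).
    """
    rows = list(rows)
    by_command = Counter(row["Command"] for row in rows)
    return {
        "total": len(rows),
        "by_command": dict(by_command),
        "utilization": len(rows) / max_connections,
    }
```

A utilization close to 1.0 with `Sleep` dominating `by_command` would match what dkl saw: the limit is exhausted by idle connections rather than by running queries.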
Comment 6 (Assignee) • 23 days ago
Also seeing a lot of PhabricatorSearchWorker tasks (in charge of full-text indexing); killing them now with:
/app/phabricator/bin/worker cancel --active --class PhabricatorSearchWorker
Comment 7 (Assignee) • 23 days ago
dkl also noted that the wait_timeout on the RDS instance is very high, 28800 seconds (vs the default 300).
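Since wait_timeout controls how long the server keeps an idle non-interactive connection open before closing it, one way to gauge its effect is to count the sleeping connections that have already been idle longer than a candidate value. A rough sketch, assuming processlist rows as dictionaries with `Id`, `Command` and `Time` (seconds in the current state); the helper `idle_beyond` is hypothetical.

```python
def idle_beyond(rows, threshold_seconds):
    """Return sleeping connections idle for longer than threshold_seconds.

    rows: iterable of dicts keyed by SHOW PROCESSLIST column names
          (hypothetical input; "Command" and "Time" are used here).
    """
    return [
        row
        for row in rows
        if row["Command"] == "Sleep"
        and row.get("Time") is not None
        and row["Time"] > threshold_seconds
    ]
```

Comparing `len(idle_beyond(rows, 300))` against the total would estimate how many of the connections in use a 300-second timeout would have reclaimed; with 28800 seconds, a connection can sit idle for up to 8 hours before the server drops it.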
Comment 8 (Assignee) • 23 days ago
At 4:30 UTC, :jbuck resized the db to m8g.2xlarge