Observation queues exceed maximum size targets
Categories
(Location :: General, defect)
Tracking
(Not tracked)
People
(Reporter: jwhitlock, Unassigned)
Details
After bug 1613493, the MLS observation queues were stabilized by setting the sample rate of the largest geolocate API user to 35%. Since then, the queues have occasionally exceeded the target of no more than 10 million observations waiting to be processed. The standard sample rate has since been lowered and is now 10%.
Some of this may be due to increased usage and more data per day. Some may also be due to the growing database causing operations to slow down (bug 1602958 may address this).
One issue is that processing stops for 10-15 minutes during a production deploy, while the observations continue to accumulate. The processing system has difficulty catching up, even during the overnight slow periods.
The manual workaround has been:
- Reduce the sample rate for the large API user to 1%
- Wait for the queues to drain to a low level (10K to 100K observations each), which takes 1 to 6 hours
- Return the sample rate to the original value
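The manual steps above amount to a simple rule with hysteresis. A minimal sketch follows, assuming the queue sizes are available as a dict. The function name, the parameter names, and the queue names in the usage note are hypothetical and not taken from the MLS code. The thresholds come from this report: 10 million total backlog, 100K per queue, 1% reduced rate, 10% standard rate.

```python
def next_sample_rate(
    current_rate,
    queue_sizes,
    normal_rate=0.10,            # standard rate quoted in this report
    reduced_rate=0.01,           # temporary rate used while draining
    max_backlog=10_000_000,      # target ceiling for the total backlog
    drained_per_queue=100_000,   # a queue at or below this counts as drained
):
    """Return the sample rate to apply to the large API user.

    Hypothetical helper: drop to the reduced rate when the total backlog
    exceeds the target, and stay there until every queue has drained.
    """
    total = sum(queue_sizes.values())
    if total > max_backlog:
        return reduced_rate
    still_draining = any(size > drained_per_queue for size in queue_sizes.values())
    if current_rate == reduced_rate and still_draining:
        # Already reduced: hold the low rate until all queues are low.
        return reduced_rate
    return normal_rate
```

For example, with illustrative queue names, `next_sample_rate(0.10, {"cell": 6_000_000, "wifi": 5_000_000})` returns the reduced rate, and the rate returns to normal only once both queues are at or below 100K.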
One way to automate this would be to add a global sample rate and a maximum queue size. As the total observation backlog approaches the maximum, the global rate can be reduced from 100%, slowing the rate of incoming observations. When the backlog is back under control, the global rate can grow back to 100%, if the async workers can handle the incoming data.
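One possible shape for this proposal is sketched below, under stated assumptions. The global rate stays at 100% until the backlog reaches some fraction of the maximum, then falls linearly to a floor at the maximum, and it is multiplied with the per-key sample rate when deciding whether to keep an observation. The function names, the `start_fraction` and `min_rate` parameters, and the linear ramp are illustrative choices, not part of the existing system.

```python
import random


def global_sample_rate(backlog, max_backlog=10_000_000,
                       start_fraction=0.5, min_rate=0.0):
    """Map the total observation backlog to a global rate in [min_rate, 1.0].

    Assumption: throttling starts at start_fraction * max_backlog and
    ramps down linearly until the backlog reaches max_backlog.
    """
    start = max_backlog * start_fraction
    if backlog <= start:
        return 1.0
    if backlog >= max_backlog:
        return min_rate
    # Fraction of the ramp that is still unused (1.0 at start, 0.0 at max).
    remaining = (max_backlog - backlog) / (max_backlog - start)
    return min_rate + (1.0 - min_rate) * remaining


def should_keep(api_key_rate, global_rate, rng=random.random):
    """Keep an observation with probability api_key_rate * global_rate."""
    return rng() < api_key_rate * global_rate
```

Because the rate is recomputed from the current backlog, it rises back toward 100% on its own as the workers catch up. A non-zero `min_rate` would keep some data flowing even at the maximum, at the cost of letting the backlog exceed the target.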
Updated•5 years ago