I have become convinced that we need some kind of dashboard for tracking how long try-pushes are taking, as well as breaking down the *individual* components of particularly long-lived try pushes and jobs.
In the past we've done various experiments with Gantt-chart-style views of jobs. These are useful for understanding what might be going on with an individual push, but they have never given us a holistic picture of the entire system, which is essential for optimization (for example, wait times are heavily affected by what other parts of automation are doing).
Something like New Relic's analytics views might be an interesting inspiration here, as they combine a view of overall load on the system with samples of particularly long queries for analysis.
I'm working on showing the high-level (90th percentile?) “end-to-end” times. The hope is to reveal the longest-running chain of jobs and provide a drill-down into the steps that explain the numbers. This particular statistic may be dubious, but it gives us something to build an interactive UI around, which other statistics can then reuse.
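To make the idea concrete, here is a minimal sketch of how the end-to-end time, the longest chain of jobs, and a wait/run drill-down could be computed. Everything in it is an assumption for illustration: the `Job` record, its field names, and the convention that times are seconds since the push landed are hypothetical and not taken from any real automation API.

```python
from dataclasses import dataclass, field
from statistics import quantiles


@dataclass
class Job:
    # Hypothetical record: times are seconds since the push landed (t = 0).
    name: str
    start: float
    end: float
    depends_on: list = field(default_factory=list)


def end_to_end(jobs):
    """Seconds from the push until the last job finishes."""
    return max(job.end for job in jobs)


def longest_chain(jobs):
    """Walk back from the last-finishing job, always following the
    dependency that finished latest (the one that actually held it up)."""
    by_name = {job.name: job for job in jobs}
    current = max(jobs, key=lambda job: job.end)
    chain = [current]
    while current.depends_on:
        current = max((by_name[n] for n in current.depends_on),
                      key=lambda job: job.end)
        chain.append(current)
    chain.reverse()
    return chain


def breakdown(chain):
    """Split each step of the chain into (name, wait, run) seconds."""
    steps, previous_end = [], 0.0
    for job in chain:
        steps.append((job.name, job.start - previous_end, job.end - job.start))
        previous_end = job.end
    return steps


def percentile_90(values):
    """90th percentile across pushes (needs at least two values)."""
    # n=10 yields 9 cut points; the last one is the 90th percentile.
    return quantiles(values, n=10, method="inclusive")[-1]
```

Given the end-to-end time of each push, `percentile_90` produces the headline number, and `breakdown(longest_chain(jobs))` for one slow push shows how much of that time was spent waiting versus running at each step. The wait and run values along the chain add up to the end-to-end time, which is what makes the drill-down explain the number.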