Some ETL jobs take too long (over 12 hours). When that happens, another machine may pick up the same job in the meantime; the two machines then overwrite each other's results, and both complain about the resulting inconsistency. These big jobs eventually error out and are left on the queue for another machine to work on. Over time, the queue becomes saturated with these long-running jobs, which consume the resources of all machines and prevent further ETL. Find one of these jobs (they are still on the queue) and fix the problem.
If this is still a problem, I have not noticed it for a while.
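
The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual ETL code: `LeaseQueue`, `claim`, `extend`, `suspects`, and the 12-hour visibility timeout are all hypothetical names and values chosen for illustration. It assumes the queue hides a claimed job for a fixed period and hands it out again once that period expires, which is one plausible way a second machine ends up working on a job that is still running.

```python
# Hypothetical in-memory model of a work queue with a visibility timeout.
# None of these names come from the ETL code base; they are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

HOUR = 3600.0


@dataclass
class LeaseQueue:
    visibility_timeout: float  # seconds a claimed job stays hidden from other workers
    _hidden_until: Dict[str, float] = field(default_factory=dict)
    _receive_count: Dict[str, int] = field(default_factory=dict)

    def put(self, job_id: str) -> None:
        """Add a job that is immediately visible to workers."""
        self._hidden_until[job_id] = 0.0
        self._receive_count[job_id] = 0

    def claim(self, now: float) -> Optional[str]:
        """Return a visible job and hide it until its lease expires."""
        for job_id, hidden_until in self._hidden_until.items():
            if hidden_until <= now:
                self._hidden_until[job_id] = now + self.visibility_timeout
                self._receive_count[job_id] += 1
                return job_id
        return None

    def extend(self, job_id: str, now: float) -> None:
        """Heartbeat: push the lease forward while the worker is still busy."""
        self._hidden_until[job_id] = now + self.visibility_timeout

    def suspects(self, max_receives: int = 1) -> List[str]:
        """Jobs still on the queue that were handed out more than max_receives times."""
        return [j for j, n in self._receive_count.items() if n > max_receives]


# Without a heartbeat: the job outlives its lease and is handed out twice.
queue = LeaseQueue(visibility_timeout=12 * HOUR)
queue.put("big-job")
first = queue.claim(now=0.0)         # worker A starts the job
nobody = queue.claim(now=6 * HOUR)   # still hidden from worker B
second = queue.claim(now=13 * HOUR)  # lease expired, worker B gets the same job

# With a heartbeat: worker A renews the lease, so worker B gets nothing.
guarded = LeaseQueue(visibility_timeout=12 * HOUR)
guarded.put("big-job")
guarded.claim(now=0.0)
guarded.extend("big-job", now=10 * HOUR)
blocked = guarded.claim(now=13 * HOUR)
```

In this model, `suspects()` is one way to find the jobs worth investigating (anything received more than once and still queued), and `extend` shows one possible mitigation. Whether the real queue exposes a receive count or a lease extension is not stated here, so both should be treated as assumptions to verify.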