Based on what I'm seeing in the log (https://x0.at/Zuw.txt), we have a valid ESNI record that was generated at 2019-10-26T23:00Z. I'm guessing at that time, but it is exactly the midpoint between the notBefore and notAfter fields. The DNS TTL is 3600 (one hour). I need to confirm, but I expect that these records are generated once per hour, on the hour. If that is the case, then this record might have entered a cache just before 2019-10-27T00:00Z. That means the record would be considered usable right up to 2019-10-27T01:00Z, a minute before the failed attempt. It is likely that the ESNI keys at the server were rotated at 2019-10-27T01:00Z or soon afterwards. As a result, the server wasn't able to produce the correct ESNI record and we failed the connection.
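To make the arithmetic explicit, here is a minimal sketch of that timeline in Python. The generation time and the hourly replacement schedule are the guesses described above, not anything confirmed by Cloudflare, and the name `latest_cache_expiry` is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(seconds=3600)  # DNS TTL seen in the log

# Assumed timeline: generated at 23:00Z, replaced one hour later.
generated = datetime(2019, 10, 26, 23, 0, tzinfo=timezone.utc)
replaced = generated + timedelta(hours=1)


def latest_cache_expiry(replaced_at: datetime, ttl: timedelta) -> datetime:
    """Latest time a resolver may still serve the old record, assuming it
    cached the record just before the replacement was published."""
    return replaced_at + ttl


usable_until = latest_cache_expiry(replaced, TTL)
print(usable_until.isoformat())  # 2019-10-27T01:00:00+00:00
```

Under these assumptions a well-behaved cache can hand out the old record until 01:00Z, so keys retired at that instant leave no margin at all.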
This is clearly a server configuration issue. Because Cloudflare tell us that the record is valid for 3 days, we can't do anything on our side to safeguard against failures here. That's their prerogative, and they make that call because they don't trust client clocks, which is probably wise. Our experience shows that client clocks are sometimes very wrong.
In the end, the goal is for the server to ensure that the client won't attempt an ESNI request after the server has dropped the corresponding keys. The server can do that either by reducing the time that the record is valid (by reducing the TTL on DNS records, for instance) or by extending the time that it retains the corresponding keys.
If, as I'm guessing, the record was issued at 23:00, replaced at 00:00, and the keys were retired at 01:00, then a TTL of 3600 is too tight. Any delay in provisioning replacement keys at 00:00 would leave a period of exposure. If keys are retired too soon relative to that, this error occurs. If a DNS server holds on to records too long, or there are network delays in delivering DNS responses, then the effective lifetime of the record will extend past the 3600 seconds and bad things happen.
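As a rough sketch of why there is no margin, the retention period has to cover the TTL plus whatever slack exists in provisioning and in resolvers. The function names and the 30 and 60 second delays below are hypothetical values chosen for illustration; they are not measurements.

```python
from datetime import timedelta


def minimum_key_retention(ttl, provisioning_delay, resolver_slack):
    """How long after the scheduled replacement time the server must keep
    the old keys so that no cached record outlives them."""
    return ttl + provisioning_delay + resolver_slack


def exposure(retention, ttl, provisioning_delay, resolver_slack):
    """Length of the window in which a client may still use a record
    whose keys are already gone (zero if retention is sufficient)."""
    needed = minimum_key_retention(ttl, provisioning_delay, resolver_slack)
    return max(needed - retention, timedelta(0))


# Guessed schedule: keys retired one hour after replacement, TTL of one hour.
window = exposure(
    retention=timedelta(hours=1),
    ttl=timedelta(seconds=3600),
    provisioning_delay=timedelta(seconds=30),  # hypothetical
    resolver_slack=timedelta(seconds=60),      # hypothetical
)
print(window)  # 0:01:30
```

With retention equal to the TTL, any non-zero delay produces a non-zero exposure window, which is consistent with a failure a minute after 01:00Z.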
Now, I might be wrong and this might be the result of an insane over-extension of the DNS TTL, but I'll check with Cloudflare to confirm this.