Operational troubleshooting notebook
June 5 2024
3 H12 timeouts in last two days
During each, plenty of puma workers, we're not running out of puma workers
Each one has ordinary/fine dyne load (CPU), RAM usage, etc.
Heroku metrics on response time show the 30s response time at each time period — that's the H12 error, heroku timing out after waiting 30 seconds for our app to return a response.
So mystery is what these requests were and why they didn't return a response in under lasted 30 seconds, it's not too many requests/lack of puma workers, but particular long-running requests. Can we figure out what they were?
Ah, we have that info in the logged H12 errors, of course.
Jun 03 20:00:20 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/works/i3uubdg/viewer_images_info" host=digital.sciencehistory.org request_id=bbc866f6-d55a-4915-a85b-9d956d4cb063 fwd="40.77.188.137" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
Currently returns in 140ms, and it’s not a very long response
I don’t think this request even involves Solr
Jun 03 21:36:36 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/works/jd472x205/viewer_images_info" host=digital.sciencehistory.org request_id=07811788-12a6-4a74-a3a3-d7b8641cb0b8 fwd="35.150.180.134" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
Currently returns in 117ms
Same action as above, does not involve solr
Does not look in logs unusual in any way, ua claims to be mobile safari
Jun 04 00:03:43 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/collections/wh246s128?f%5Bdepartment_facet%5D%5B%5D=Archives&f%5Blanguage_facet%5D%5B%5D=English&f%5Bplace_facet%5D%5B%5D=California--Fullerton&f%5Bsubject_facet%5D%5B%5D=Beckman+Instruments%2C+Inc.&sort=newest_date" host=digital.sciencehistory.org request_id=f8a42ab8-fcc2-48d5-914a-957449560365 fwd="43.159.41.42" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
Does involve Solr, and somewhat complex solr query
But currently returns in 390ms
Logs definitely looks like this is a crawler scraping facet results (and doesn’t have a logged user agent), but it didn’t seem to overwhelm our app
Currently still a mystery why these requests timed out at 30 seconds!