Operational troubleshooting notebook

June 5 2024

3 H12 timeouts in last two days

During each, plenty of puma workers, we're not running out of puma workers

Each one has ordinary/fine dyne load (CPU), RAM usage, etc. 

Heroku metrics on response time show the 30s response time at each time period — that's the H12 error, heroku timing out after waiting 30 seconds for our app to return a response. 

So mystery is what these requests were and why they didn't return a response in under lasted 30 seconds, it's not too many requests/lack of puma workers, but particular long-running requests. Can we figure out what they were?

Ah, we have that info in the logged H12 errors, of course.

  • Jun 03 20:00:20 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/works/i3uubdg/viewer_images_info" host=digital.sciencehistory.org request_id=bbc866f6-d55a-4915-a85b-9d956d4cb063 fwd="40.77.188.137" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https

    • Currently returns in 140ms, and it’s not a very long response

    • I don’t think this request even involves Solr

  • Jun 03 21:36:36 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/works/jd472x205/viewer_images_info" host=digital.sciencehistory.org request_id=07811788-12a6-4a74-a3a3-d7b8641cb0b8 fwd="35.150.180.134" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https

    • Currently returns in 117ms

    • Same action as above, does not involve solr

    • Does not look in logs unusual in any way, ua claims to be mobile safari

  • Jun 04 00:03:43 scihist-digicoll-production heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/collections/wh246s128?f%5Bdepartment_facet%5D%5B%5D=Archives&f%5Blanguage_facet%5D%5B%5D=English&f%5Bplace_facet%5D%5B%5D=California--Fullerton&f%5Bsubject_facet%5D%5B%5D=Beckman+Instruments%2C+Inc.&sort=newest_date" host=digital.sciencehistory.org request_id=f8a42ab8-fcc2-48d5-914a-957449560365 fwd="43.159.41.42" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https

    • Does involve Solr, and somewhat complex solr query

    • But currently returns in 390ms

    • Logs definitely looks like this is a crawler scraping facet results (and doesn’t have a logged user agent), but it didn’t seem to overwhelm our app

 

Currently still a mystery why these requests timed out at 30 seconds!