Quick Operational Troubleshooting Cookbook

On call, responding to an error report, and need some ideas for quick first actions? We've got you.

Check on status of heroku dynos

Using the heroku CLI, run:

heroku ps -a scihist-digicoll-production
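
To spot unhealthy dynos without reading the whole listing, the output can be filtered. This is a minimal sketch, assuming heroku ps prints each dyno on a line of the form "web.1: up 2024/01/01 ..." (adjust the pattern if your CLI version prints something different). The function name dynos_not_up is made up for this example.

```shell
# Print every dyno line whose state is not "up" (e.g. crashed, starting, idle).
# ASSUMPTION: dyno lines look like "web.1: up 2024/01/01 12:00:00 ...".
# Prints nothing when all dynos are up.
dynos_not_up() (
  app="${1:-scihist-digicoll-production}"
  heroku ps -a "$app" | awk '/^[^ =]+\.[0-9]+: / && $2 != "up"'
)
```

After loading the function into your shell, run dynos_not_up with no arguments for production, or pass another app name as the first argument.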

Look at our logs

Consolidated app logs are available on the heroku dashboard: on the “resources” tab, click the “papertrail” add-on at the bottom to get a nice web GUI for our logs, which also lets you search.

There are ways to set up command-line access to papertrail logs too:
https://github.com/papertrail/papertrail-heroku-plugin
https://github.com/papertrail/papertrail-cli

Even without papertrail, you can use the heroku CLI command heroku logs (add --tail to keep streaming) to look at a running trail of current heroku logs, but the display isn’t nearly as nice as papertrail, and it includes more noise that papertrail filters out.
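
When papertrail is not at hand, the plain heroku logs output can be narrowed down with grep. This is a sketch, assuming heroku router lines carry a status=NNN field and, for platform errors, a code=HNN field. The function name recent_router_errors and the default of 500 lines are arbitrary choices for illustration.

```shell
# Show router lines, among the last N log lines, that report a 5xx response
# or a heroku platform error code (H10, H12, ...).
# ASSUMPTION: router lines contain "heroku[router]" plus status=/code= fields.
recent_router_errors() (
  app="${1:-scihist-digicoll-production}"
  lines="${2:-500}"
  heroku logs --num "$lines" -a "$app" |
    grep -E 'heroku\[router\].*(code=H[0-9]+|status=5[0-9][0-9])'
)
```

grep exits non-zero when nothing matches, so an empty result with a failing exit status simply means no router errors were found in that window.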

In addition to general logs, errors are specifically monitored by http://honeybadger.io ; each person has their own individual login.

Is heroku itself having problems? Or are other platforms we use?

heroku status: https://status.heroku.com/
hirefire status: https://status.hirefire.io/
searchstax status: https://status.searchstax.com/
AWS status (notoriously underreports problems though): https://status.aws.amazon.com/
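
Heroku's own status can also be read from the terminal with heroku status, which does not need an app name. The sketch below pairs that with a reminder of the other status pages, which still have to be opened in a browser. The function name platform_status is hypothetical.

```shell
# Print heroku's platform status, then list the other status pages to check.
platform_status() {
  heroku status
  echo "Also check:"
  echo "  hirefire:   https://status.hirefire.io/"
  echo "  searchstax: https://status.searchstax.com/"
  echo "  AWS:        https://status.aws.amazon.com/"
}
```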

Restart heroku dynos

From the heroku web GUI, you can restart all dynos: open the “More” menu in the top right navbar and choose “restart all dynos”.

Using the heroku CLI, you can restart only web or only workers, or even a specific dyno.

heroku ps:restart worker -a scihist-digicoll-production
heroku ps:restart web -a scihist-digicoll-production
heroku ps:restart worker.2 -a scihist-digicoll-production
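
After a restart it is worth confirming that the dynos actually come back. This is a small sketch that chains the restart with a status listing. The function name restart_and_check is made up for this example, and the production app name is only a default.

```shell
# Restart one process type (web, worker) or one dyno (e.g. worker.2),
# then list the dynos so you can see whether they come back "up".
restart_and_check() (
  target="$1"
  app="${2:-scihist-digicoll-production}"
  if [ -z "$target" ]; then
    echo "usage: restart_and_check <web|worker|dyno-name> [app]" >&2
    exit 2
  fi
  heroku ps:restart "$target" -a "$app" && heroku ps -a "$app"
)
```

For example, restart_and_check worker.2 restarts just that dyno on production. The listing runs right after the restart request, so dynos may still show as starting; run heroku ps again a little later.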

Note: It’s not clear to me how often restarting heroku dynos will actually fix a problem, and in some cases it could lead to a less stable state, for instance if heroku itself is having problems.

Restart solr on Searchstax

1. Log in to searchstax, using the shared credentials stored in our credential spot.
2. Click on the instance you want to restart (scihist_digicoll (production), or scihist-digicoll-staging).
3. At the bottom of the page there is a single node listed (our plan only has one node). Click “Stop solr”, and then “Start solr”.

Note: if the app is up and accessible while solr restarts, it will have some downtime or generate errors until solr is back.

Disable autoscaling

We use http://hirefire.io for autoscaling our worker dynos (maybe web dynos in the future). Has it gone crazy and you need to just disable it?

No worries, just log in to http://hirefire.io (we each have our own login), and you can click the “enable” toggle on or off next to each autoscaled worker, right on the initial dashboard. (We may only have one worker.)

Put entire app into maintenance mode

Disable our app: it won’t be accessible to anyone, but visitors will get a nice maintenance message.

In the heroku web GUI, go to the “settings” tab, scroll down to the “Maintenance mode” section, and toggle the switch.

Using the heroku CLI, run one of these to turn maintenance mode on or off:

heroku maintenance:on -a scihist-digicoll-production
heroku maintenance:off -a scihist-digicoll-production
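
If maintenance mode is only needed for the duration of one risky command, it is easy to forget to switch it back off. The sketch below wraps a command so that maintenance mode is turned off again even when the command fails. The function name with_maintenance is hypothetical, not something our repository provides.

```shell
# Run a command with the app in maintenance mode, and switch maintenance
# mode off again afterwards even if the command fails.
# Usage: with_maintenance <app> <command> [args...]
with_maintenance() (
  app="$1"
  shift
  heroku maintenance:on -a "$app" || exit 1
  status=0
  "$@" || status=$?          # remember the command's exit status
  heroku maintenance:off -a "$app"
  exit "$status"
)
```

The wrapper exits with the wrapped command's status, so it can be used in a script that stops on failure. Running heroku maintenance -a scihist-digicoll-production afterwards shows whether the mode is currently on or off.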

(Note: Right now, this is just a generic heroku maintenance message. It is possible to customize/brand this page; we may get to that eventually. https://github.com/sciencehistory/scihist_digicoll/issues/1201 )

Disable staff logins

We can effectively make the app “read-only” but still available to the public by disabling staff logins. That way we don’t have a public-facing outage, but if we’re diagnosing some kind of data corruption issue, we can ‘freeze’ staff out.

In the heroku config vars, on the heroku dashboard “settings” tab, just set LOGINS_DISABLED to true.
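
The same config var can be changed from the heroku CLI. This is a sketch with made-up function names. Note the labeled assumption about re-enabling: this page only documents setting the variable to true, so check how the app reads it before relying on the unset.

```shell
# Freeze staff out by setting the flag. Changing a config var makes
# heroku restart the app's dynos.
disable_staff_logins() (
  app="${1:-scihist-digicoll-production}"
  heroku config:set LOGINS_DISABLED=true -a "$app"
)

# Let staff back in.
# ASSUMPTION: removing the variable re-enables logins.
enable_staff_logins() (
  app="${1:-scihist-digicoll-production}"
  heroku config:unset LOGINS_DISABLED -a "$app"
)

# Show the current value of the flag.
staff_logins_flag() (
  app="${1:-scihist-digicoll-production}"
  heroku config:get LOGINS_DISABLED -a "$app"
)
```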

Reindex solr

If search is weird, our Solr index may have gotten out of sync. Fortunately, we can (re-)build a new Solr index in only a couple minutes. Using the heroku CLI to run our rake tasks:

heroku run rake scihist:solr:reindex scihist:solr:delete_orphans -a scihist-digicoll-production

If this results in an error that makes you think the searchstax solr is not properly set up, you could try:

heroku run rake scihist:solr_cloud:create_collection -a scihist-digicoll-production
heroku run rake scihist:solr_cloud:sync_configset -a scihist-digicoll-production

(The create_collection task should not do any harm in any case; it might just complain that the “collection already exists”.)
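
By default heroku run does not pass the remote command's exit status back to your shell, which makes a failed reindex easy to miss. The sketch below uses the --exit-code flag of heroku run to detect failure and then prints the follow-up commands instead of running them automatically. The function name reindex_solr is made up for this example.

```shell
# Reindex solr and report whether the rake tasks succeeded.
# --exit-code makes "heroku run" return the remote command's exit status.
reindex_solr() (
  app="${1:-scihist-digicoll-production}"
  if heroku run --exit-code -a "$app" -- \
       rake scihist:solr:reindex scihist:solr:delete_orphans; then
    echo "reindex finished"
  else
    echo "reindex failed; if solr looks misconfigured, consider running:" >&2
    echo "  heroku run rake scihist:solr_cloud:create_collection -a $app" >&2
    echo "  heroku run rake scihist:solr_cloud:sync_configset -a $app" >&2
    exit 1
  fi
)
```

The setup tasks are only suggested, not run, because deciding whether the error really points at a misconfigured searchstax solr is a human judgment call.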

See also “Restart solr on Searchstax” above.

Restore postgres database from backups

See separate page.