Quick Operational Troubleshooting Cookbook

On call and responding to an error report? Need some ideas for quick first actions? We've got you.

1 Check on status of heroku dynos
2 Look at our logs
3 Disable preboot
4 Is heroku itself having problems? Or are other platforms we use?
5 Check heroku release activity
6 Restart heroku dynos
7 Restart solr on Searchstax
8 Disable autoscaling
9 Put entire app into maintenance mode
10 Disable staff logins
11 Reindex solr
12 Restore postgres database from backups

To execute any of these on staging instead of production, replace -a scihist-digicoll-production with -a scihist-digicoll-staging.
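If you switch between the two apps a lot, a small shell helper can save some typing. This is only a sketch: the function name digicoll_app is made up for illustration and is not something in our repo; only the two app names come from this page.

```shell
# Hypothetical helper (not part of our codebase): map a short environment
# name to the heroku app name used throughout this page.
# Defaults to production when no argument is given.
digicoll_app() {
  case "${1:-production}" in
    production|prod) echo "scihist-digicoll-production" ;;
    staging)         echo "scihist-digicoll-staging" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

With that defined in your shell, something like heroku ps -a "$(digicoll_app staging)" would target staging.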

Check on status of heroku dynos

Using heroku CLI, run:

heroku ps -a scihist-digicoll-production

(Note: if a restart/redeploy just happened and you have heroku preboot on, you may not be seeing the status of the dynos that are actually serving requests. See “disable preboot” below.)

Look at our logs

Consolidated app logs are available on the heroku dashboard: on the “resources” tab, click the “papertrail” add-on at the bottom to get a nice web GUI for our logs that also lets you search.

There are ways to set up command-line access to papertrail logs too: https://github.com/papertrail/papertrail-heroku-plugin and https://github.com/papertrail/papertrail-cli
Even without papertrail, you can run heroku logs -a scihist-digicoll-production from the heroku CLI to look at recent heroku logs (add --tail to keep following them as they arrive), but the display isn’t nearly as nice as papertrail, and it includes more noise that papertrail filters out.
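One way to cut down the noise in the raw heroku log stream is to pipe it through grep. The sketch below assumes you mostly care about heroku router errors (lines containing at=error) and 5xx responses (status=5xx); the function name heroku_error_lines is made up, and the patterns are a starting point rather than a complete list of what can go wrong.

```shell
# Hypothetical filter: read log lines on stdin and keep only those that look
# like heroku router errors (at=error) or 5xx responses (status=5xx).
# --line-buffered makes matches show up immediately when following a live stream.
heroku_error_lines() {
  grep -E --line-buffered 'at=error|status=5[0-9]{2}'
}
```

For example: heroku logs --tail -a scihist-digicoll-production | heroku_error_lines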

In addition to general logs, we have errors specifically monitored by http://honeybadger.io ; each person has their own individual login.

Disable preboot

The heroku preboot feature makes it possible for us to do zero-downtime deploys, but it also really complicates visibility/introspection into heroku dynos, and it slows down quick-response changes like dyno restarts and redeploys.

If you are troubleshooting an already problematic/downtime situation, it might make sense to turn off preboot to make things more straightforward:

heroku features:disable preboot -a scihist-digicoll-production

You can enable it again with enable instead of disable. You can see if it’s enabled with heroku features -a scihist-digicoll-production

Is heroku itself having problems? Or are other platforms we use?

heroku status: https://status.heroku.com/
hirefire status: https://status.hirefire.io/
searchstax status: https://status.searchstax.com/
AWS status (notoriously underreports problems though): https://status.aws.amazon.com/

Check heroku release activity

If heroku tried to do a release but it failed, you may be in a confusing situation where you aren’t using the version of code/config you think you are. Heroku releases (which may fail) can be triggered not only by pushing new versions of code, but also by config variable changes, and in some cases by add-on changes.

Look at release history with heroku CLI:

heroku releases -a scihist-digicoll-production

Failed releases will be in red. With the id from the left-most column, you can look at specific log output (mainly of our custom release phase) for the failed or successful release, e.g.:

heroku releases:output v323 -a scihist-digicoll-production

You can also see some limited release status info in the Web GUI on the Activity tab.

Restart heroku dynos

From the heroku web GUI, you can restart all dynos: open the “More” menu in the top right navbar and choose “restart all dynos”.

Using the heroku CLI, you can restart only web or only workers, or even a specific dyno.

heroku ps:restart worker -a scihist-digicoll-production
heroku ps:restart web -a scihist-digicoll-production
heroku ps:restart worker.2 -a scihist-digicoll-production

Note: It’s not clear to me how often restarting heroku dynos will actually fix a problem, and in some cases it could leave things in a less stable state, for instance if heroku itself is having problems.

Note: If heroku “preboot” is on, it can take 3+ minutes for restart to actually take effect. See “disable preboot” above.

Restart solr on Searchstax

1. Log in to searchstax, using the shared credentials stored in our credential spot on the P:\ drive
2. Click on the instance you want to restart (scihist_digicoll (production), or scihist-digicoll-staging)
3. At the bottom of the page there is a single node listed (our plan only has one node); click “stop solr”, and then “Start solr”

Note: if the app is up and accessible while solr is restarting, it will have some downtime and/or generate errors until solr is back!

Disable autoscaling

We use http://hirefire.io for autoscaling our worker dynos (maybe in future web dynos). Has it gone crazy and you need to just disable it?

No worries, just log in to http://hirefire.io (we each have our own login), and you can click the “enable” toggle on or off next to each autoscaled worker, right on the initial dashboard. (We may only have one worker.)

Note: If you turn off auto-scaling when workers are scaled up, they will probably stay scaled up! Look at the minimum scale value (2, as I write this); you may want to scale down to that manually after turning off auto-scaling:

# how many workers are there?
$ heroku ps worker -a scihist-digicoll-production

# set em back to two
$ heroku ps:scale worker=2 -a scihist-digicoll-production
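The two steps above could be folded into one guarded command. This is a rough sketch, not something we have in the repo: the function names are made up, and it relies on heroku ps:scale printing the current formation (something like web=1:Standard-1X worker=4:Standard-2X) when called with no scale arguments. The default minimum of 2 is just the value mentioned above; check hirefire for the current one.

```shell
# Hypothetical helper: read a formation string such as
# "web=1:Standard-1X worker=4:Standard-2X" on stdin and print the worker count.
worker_count() {
  tr ' ' '\n' | sed -n 's/^worker=\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: scale workers down to a minimum, but only if they are
# currently above it. Usage: scale_workers_to_min APP [MIN]
scale_workers_to_min() {
  local app="$1" min="${2:-2}" current
  # With no scale arguments, ps:scale only reports the current formation.
  current=$(heroku ps:scale -a "$app" | worker_count)
  if [ -n "$current" ] && [ "$current" -gt "$min" ]; then
    heroku ps:scale "worker=$min" -a "$app"
  else
    echo "workers at ${current:-unknown}, not scaling"
  fi
}
```

For example: scale_workers_to_min scihist-digicoll-production 2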

Put entire app into maintenance mode

Disable our app entirely: it won’t be accessible to anyone, but visitors will get a nice maintenance message.

In heroku web GUI, go to “settings” tab, scroll down to “Maintenance mode” section, toggle switch.

In heroku CLI, run heroku maintenance:on -a scihist-digicoll-production to turn it on, and heroku maintenance:off -a scihist-digicoll-production to turn it back off.

(See more on our custom maintenance page configuration at Heroku custom maintenance page.)
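If you want maintenance mode on only for the duration of one command, a wrapper along these lines might help. It is a sketch under assumptions: the function name with_maintenance is made up, and it always turns maintenance mode back off afterwards, which may not be what you want in the middle of a real outage.

```shell
# Hypothetical wrapper: turn maintenance mode on, run a command, then turn
# maintenance mode off again, even if the command failed.
# Usage: with_maintenance APP COMMAND [ARGS...]
with_maintenance() {
  local app="$1" status=0
  shift
  heroku maintenance:on -a "$app" || return 1
  "$@" || status=$?
  heroku maintenance:off -a "$app"
  # Pass the wrapped command's exit status back to the caller.
  return "$status"
}
```

For example: with_maintenance scihist-digicoll-staging heroku ps:restart web -a scihist-digicoll-staging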

Disable staff logins

We can effectively make the app “read-only” but still available to the public by disabling staff logins. So we don’t have a public facing outage, but if we’re dealing with some kind of data corruption issue we’re trying to diagnose, we might want to ‘freeze’ staff out.

In the config vars section of the heroku dashboard settings tab, just set LOGINS_DISABLED to true.

Set to false or remove the config var entirely to restore staff logins.
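The same thing can be done from the heroku CLI with config:set and config:unset. The function names below are made up for convenience; LOGINS_DISABLED is the config var named above. Keep in mind that changing a config var triggers a new heroku release, so the dynos will restart.

```shell
# Hypothetical helpers wrapping the heroku CLI config commands.
# Both take the app name as an optional argument, defaulting to production.

# Freeze staff out: sets LOGINS_DISABLED=true (this triggers a new release).
disable_staff_logins() {
  heroku config:set LOGINS_DISABLED=true -a "${1:-scihist-digicoll-production}"
}

# Restore staff logins by removing the config var entirely.
enable_staff_logins() {
  heroku config:unset LOGINS_DISABLED -a "${1:-scihist-digicoll-production}"
}
```

For example, disable_staff_logins scihist-digicoll-staging to try it on staging first.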

Reindex solr

If search is weird, our Solr index may have gotten out of sync. Fortunately, we can (re-)build a new Solr index in only a couple minutes. Using the heroku CLI to run our rake tasks:

heroku run rake scihist:solr:reindex scihist:solr:delete_orphans -a scihist-digicoll-production

If this results in an error that makes you think the searchstax solr is not properly set up, you could try:

heroku run rake scihist:solr_cloud:create_collection -a scihist-digicoll-production
(That should not do any harm in any case; it might just complain, telling you “collection already exists”.)
heroku run rake scihist:solr_cloud:sync_configset -a scihist-digicoll-production

And see also restarting Searchstax Solr above.
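The rake tasks in this section could be strung together as one recovery pass. This is a sketch only: the function name solr_recover is made up, and the order (create collection, sync configset, then reindex) is an assumption about what makes sense rather than a documented procedure. Watch the output of each step rather than trusting the function to stop on failure.

```shell
# Hypothetical helper: run the solr setup tasks and then a full reindex,
# in an assumed order. Usage: solr_recover [APP]
solr_recover() {
  local app="${1:-scihist-digicoll-production}"

  echo "== create collection (may just report that it already exists)"
  heroku run rake scihist:solr_cloud:create_collection -a "$app"

  echo "== sync configset"
  heroku run rake scihist:solr_cloud:sync_configset -a "$app"

  echo "== reindex and delete orphans"
  heroku run rake scihist:solr:reindex scihist:solr:delete_orphans -a "$app"
}
```

For example, solr_recover scihist-digicoll-staging to try it against staging.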

Restore postgres database from backups

See separate page.