We don’t currently have “infrastructure-as-code” for our Heroku setup: everything is configured on the Heroku system (and on third-party systems) via the GUI and/or CLIs, and there isn’t any kind of script to recreate our Heroku setup from nothing.

...

  • Delete all failed jobs in the Resque admin pages.

  • Make a rake task to enqueue all the jobs to the special_jobs queue.

    • (lightbulb) The task should be smart enough to skip items that have already been processed. That way, you can interrupt the task at any time, fix any problems, and run it again later without having to worry.

    • (lightbulb) Make sure you have an easy way to run the task on individual items manually from the admin pages or the console.

    • (lightbulb) The job that the task calls should print the IDs of any entities it’s working on to the Heroku logs.

    • (lightbulb) It’s very helpful to be able to enqueue a limited number of items and run them first, before embarking on the full run. For instance, you could add an extra boolean argument only_do_10 (defaulting to false) and add a variation on:

      Code Block
      scope = scope[1..10] if only_do_10
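
      Putting the pieces above together, a minimal sketch of such a task might look like the following (Work, processed, and SpecialJob are illustrative names, not our actual code; Resque.enqueue_to pushes a job onto a named queue):

      Code Block
      namespace :scihist do
        desc "Enqueue all unprocessed items onto the special_jobs queue"
        task :enqueue_special_jobs, [:only_do_10] => :environment do |_t, args|
          only_do_10 = args[:only_do_10] == "true"

          # Skip items that have already been processed, so the task is safe
          # to interrupt and re-run later.
          ids = Work.where(processed: false).pluck(:id)
          ids = ids.first(10) if only_do_10

          ids.each do |id|
            # The job itself should print the id to the Heroku logs as it runs.
            Resque.enqueue_to(:special_jobs, SpecialJob, id)
          end
        end
      end

      From the console, a single item can be enqueued the same way: Resque.enqueue_to(:special_jobs, SpecialJob, some_id).
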
  • Test the rake task in staging with only_do_10 set to true.

  • Run the rake task in production, with only_do_10 set to true, as a trial run.

  • Spin up a single special_jobs dyno and watch it process the 10 items (e.g. by tailing the Heroku logs and watching the Resque pages).

  • Run the rake task in production.

  • The jobs are now in the special_jobs queue, but no work will actually start until you spin up dedicated dynos.

  • Our default is 2 workers per special_jobs dyno, which works nicely with standard-2x dynos; if you want more, try setting the SPECIAL_JOB_WORKER_COUNT env variable to 3 (for instance via heroku config:set SPECIAL_JOB_WORKER_COUNT=3).

  • Our Redis setup is capped at 80 connections, so be careful about running more than 10 special_jobs dynos at once. You may want to monitor the Redis statistics while the jobs run.
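
    For example, from a Rails console (a sketch; it assumes Resque.redis is the usual Redis::Namespace wrapper around a redis-rb client):

    Code Block
    # Ask the underlying redis-rb client how many connections are open
    Resque.redis.redis.info["connected_clients"]
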

  • Manually spin up a set of special_worker dynos of whatever type you want at Heroku's "resources" page for the application, or with the heroku ps:scale CLI command. Heroku will alert you to the cost. (10 standard-2x dynos cost roughly $1 per hour, for instance; with the worker count set to two, you’ll see up to 20 items being processed simultaneously.)

  • Monitor the progress of the resulting workers. Work goes much faster than you may be used to, so pay careful attention to the following:

  • (lightbulb) If there are errors in any of the jobs, you can retry the jobs in the Resque pages, or rerun them from the console.
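
    From a Rails console, Resque's failure API can requeue everything in the failed list; a minimal sketch (assuming the standard Resque failure backend):

    Code Block
    # Requeue each failed job by its index, then clear the failed list
    Resque::Failure.count.times { |i| Resque::Failure.requeue(i) }
    Resque::Failure.clear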

  • Monitor the number of jobs still pending in the special_jobs queue. When that number goes to zero, it means the work will complete soon and you should start getting ready to turn off the dynos. It does NOT mean the work is complete, however!
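
    From a Rails console, Resque's built-in introspection methods report the queue depth; for example:

    Code Block
    Resque.size("special_jobs") # jobs still waiting in the queue
    Resque.info                 # hash including :pending, :working, :workers, :failed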

  • When all the workers in the special_jobs queue complete their jobs and are idle:

    • (lightbulb) rake scihist:resque:prune_expired_workers will get rid of any expired workers, if needed

    • Set the number of special_worker dynos back to zero.

    • Remove the special_jobs queue from the Resque pages.

...

We use Scout to monitor the app’s performance and find problem spots in the code. The account is free, as we’re an open-source project, although billing information is maintained on the account.

Papertrail (logging)

Settings are here:
https://papertrailapp.com/account/settings

Notes re: tuning lograge in our app:
https://bibwild.wordpress.com/2021/08/04/logging-uri-query-params-with-lograge/

General docs re: lograge:
https://github.com/roidrage/lograge

Get the API token at
https://papertrailapp.com/account/profile.

Recipe for downloading all of a day's logs:

Code Block (bash)
set -x
THE_DATE=$1    # formatted like '2023-12-21'
TOKEN="abc123" # get this from https://papertrailapp.com/account/profile
URL='https://papertrailapp.com/api/v1/archives'

for HOUR in {00..23}; do
	DATE_AND_HOUR=$THE_DATE-$HOUR
	curl --no-include \
		-o $DATE_AND_HOUR.tsv.gz \
		-L \
		-H "X-Papertrail-Token: $TOKEN" \
		$URL/$DATE_AND_HOUR/download;
done

# Remove files that aren't really compressed logs
# (e.g. XML error responses from failed downloads)
rm `file * | grep XML | grep -o '.*\.gz'`

# uncompress all the logs
gunzip *.gz
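
To use the recipe, save it as (for example) download_papertrail_logs.sh and pass the date as the first argument: bash download_papertrail_logs.sh 2023-12-21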

History:

We started out with the "Forsta" plan (~4.2¢/hour; max of $30 a month; 250MB max).

In late 2023 and early 2024, we noticed an increase in both the rate and the volume of our logging, resulting in both:

  • Heroku L10 ("drain buffer overflow") errors, and

  • running up against the Papertrail plan's log-volume limit.

Notes:

  • These don’t always co-occur: high rates cause the first, large volumes the second.

  • Heroku add-on usage resets daily at midnight UTC, which is early evening EST, so the notion of a “day” can be confusing here.

On Jan 10th we decided to try the "Volmar" plan (~9¢/hour; max of $65 a month; 550MB max) for a couple of months, to see whether this would ameliorate our increasingly frequent problems with running out of room under the Papertrail log limits. It’s important to note that, based on our current understanding, the $65 plan will not fix the L10 errors, but it will likely give us more headroom on days when we get a lot of traffic spread out over the entire day.

Avenues for further research

  • Confirm that the L10 warnings are caused by a surge in bot traffic, rather than a bug in our code or in someone else’s code.

    • If so, this is a good argument for putting Cloudflare or an equivalent in front of our app, which would screen out misbehaving bots.

  • Consider making our app log fewer bytes, either by making some or all log lines more concise, or by asking Papertrail to drop certain lines that we’re not really interested in (see the lograge sketch after this list):

    • some PostgreSQL messages?

    • do we really need to log all status 200 messages? (Probably.)

    • As a last resort, we could also decide not to log heroku/router messages (typically 40-60% of our messages), although those can be really helpful in the event of a catastrophe.
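
One way to make log lines more concise is via lograge's configuration hooks. A minimal sketch (illustrative, not our current configuration; the param names "q" and "page" are placeholders; see the lograge docs and blog post linked above):

Code Block
# config/environments/production.rb (illustrative)
config.lograge.enabled = true

# If we include query params in log lines (see the lograge notes above),
# logging only selected keys keeps the lines short:
config.lograge.custom_options = lambda do |event|
  { params: event.payload[:params].slice("q", "page") }
end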