Journal of heroku investigations. Most recent entries on top? See also https://chemheritage.atlassian.net/wiki/spaces/HDC/pages/1009582081

Wed Oct 14

For RAM comparison, on our current EC2 production, after being up for some time, passenger reports this memory use:

------ Passenger processes ------
PID    VMSize     Private   Name
---------------------------------
18108  299.2 MB   2.0 MB    Passenger watchdog
18114  1082.6 MB  5.3 MB    Passenger core
18139  30.4 MB    0.4 MB    /usr/local/lib/ruby/gems/2.6.0/gems/passenger-5.3.7/buildout/support-binaries/PassengerAgent temp-dir-toucher /tmp/passenger-standalone.11jhb2e --cleanup --daemonize --pid-file /tmp/passenger-standalone.11jhb2e/temp_dir_toucher.pid --log-file /opt/scihist_digicoll/shared/passenger.log --user digcol --nginx-pid 18123
18187  958.4 MB   340.2 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18206  873.4 MB   281.1 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18225  738.5 MB   197.3 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18244  736.6 MB   160.8 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18261  736.7 MB   158.1 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18278  736.8 MB   169.9 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18295  736.9 MB   163.0 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18312  737.0 MB   169.8 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18329  737.1 MB   163.2 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)
18346  737.2 MB   162.4 MB  Passenger AppPreloader: /opt/scihist_digicoll/current (forking...)

So actually it’s true that the Private RSS was getting up to 340MB, although after more use. One difference is that on heroku it seems to balloon up memory quicker. But I may have under-estimated our RAM use – although it still isn’t the 400-500MB+ that we’re seeing in heroku.

An app with the work show page almost entirely disabled is at sample#memory_total=277.77MB sample#memory_rss=269.82MB

We might be able to get under 300MB by making the work/show page avoid loading all children at once with an “infinite scroll” technique. This would also take care of our slowest pages. Pages we are trying that are NOT large-membered-work-show seem, now that we are on ‘standard’ rather than ‘hobby’ resources, to be loading at similar times to current EC2, we think? Fixity report 3s on heroku compared to 3.5s on EC2, so actually faster on heroku?

If we limit to only 50 children on a page, ramelli loads from heroku in about 2.6s (yeah, still slow), and takes RAM: sample#memory_total=447.76MB sample#memory_rss=399.51MB gah why is this still so much!! – I guess the way we did it we still loaded all children into memory but just didn’t display them, let’s change that…. after a few loads, still up to sample#memory_total=466.50MB sample#memory_rss=398.28MB gahhhh.
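The suspicion here is that the 50-child limit was applied only after every child had already been loaded. A minimal plain-Ruby sketch of that difference follows. FakeWork and each_member are hypothetical stand-ins for the real members query, not code from scihist_digicoll; with ActiveRecord the equivalent would be applying limit to the relation before it is loaded.

```ruby
# Hypothetical stand-in for a work's members query: yields members one at a
# time and counts how many were actually instantiated.
class FakeWork
  attr_reader :instantiated

  def initialize(member_count)
    @member_count = member_count
    @instantiated = 0
  end

  def each_member
    return enum_for(:each_member) unless block_given?

    @member_count.times do |i|
      @instantiated += 1
      yield "member-#{i}"
    end
  end
end

# Load everything, then slice: every member gets built even though only
# 50 are displayed.
eager = FakeWork.new(694)
eager_page = eager.each_member.to_a.first(50)

# Stop iterating after the first 50: only those 50 get built.
limited = FakeWork.new(694)
limited_page = limited.each_member.first(50)

puts "eager built #{eager.instantiated}, limited built #{limited.instantiated}"
```

Both pages contain the same 50 members; only the number of objects allocated along the way differs, which is the part that matters for dyno memory.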

If we limit to the 5 child work, we get a more reasonable sample#memory_total=321.20MB sample#memory_rss=252.28MB… to compare, let’s slice ramelli to actually 5 children… it’s still taking more than 2 seconds to return (what’s it doing?), but is sample#memory_total=395.11MB sample#memory_rss=326.82MB , ok i guess?

without actual member display code, and limited to 5…. sample#memory_total=391.79MB sample#memory_rss=323.62MB … about the same… aha, it’s partially our viewer_images_info taking up all the memory, that one still has full list. (but doesn’t explain why the page load time is so slow) Just curl… no, still slow still same memory.

A moment to look at speed again

Yes, even with standard-2x and standard pg, ramelli is taking 4-6s on heroku, compared to 2-2.5s on our current EC2. Smaller 115-item work goes from 0.5-0.6s on EC2 to ~0.9-1.2s on heroku, what.

RAM: how many threads can we get away with?

RAILS_MAX_THREADS: our puma config pays attention to that heroku config env, making it easy to switch.
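For reference, a rough sketch of how a puma config can read its thread count from the environment. This is not our actual config/puma.rb: puma_settings is a hypothetical helper, the defaults of 5 threads and 1 worker are assumptions, and WEB_CONCURRENCY is the variable name used in Heroku's puma docs for worker count.

```ruby
# Sketch: derive puma settings from environment variables.
# Accepts any hash-like object so it can be exercised without touching ENV.
def puma_settings(env = ENV)
  threads = Integer(env.fetch("RAILS_MAX_THREADS", "5"))
  workers = Integer(env.fetch("WEB_CONCURRENCY", "1"))
  { workers: workers, min_threads: threads, max_threads: threads }
end

# In config/puma.rb this would be used roughly as:
#   settings = puma_settings
#   workers settings[:workers]
#   threads settings[:min_threads], settings[:max_threads]

p puma_settings("RAILS_MAX_THREADS" => "2")
```

Switching thread count is then a matter of heroku config:set RAILS_MAX_THREADS=2 -a scihist-digicoll, which restarts the dyno with the new value.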

One worker 5 threads on a standard-1x (512MB) dyno – we exceeded memory capacity repeatedly requesting ramelli.

three threads – yep, still exceeded quota.

two threads? seems to be okay, but pushing it! We wouldn’t want our app to expand its waistline any further. sample#memory_total=497.73MB sample#memory_rss=473.06MB

going back to one thread for a consistent baseline for exploring how changes affect memory.

Monday Oct 12/Tuesday Oct 13

Moving database to standard-0 ($50/month), and web dyno to standard-1x ($25/month), just to make sure we’re using production resources, although I don’t expect it to make a difference (hobby pg and dyno we were using ought to be just as fast), but just to rule it out.
- Ramelli is coming back in 3 to 4 seconds, with no apparent spikes to 6 or 7, so… better? If still double the reliable 2 seconds on our EC2 situation.
- RAM still super problematic, sample#memory_total=511.92MB sample#memory_rss=500.29MB
- fixity report page 3-5 seconds, actually matching expected?
Blank Rails new app RAM usage?
- It is using a reasonable sample#memory_total=128.11MB sample#memory_rss=95.13MB
  - OK, what is making our app twice as big even on home page? need to investigate.
- let’s try same skeleton rails app, but with our scihist-digicoll gemfile, so we’re loading all those gems….
  - Up to sample#memory_total=266.05MB sample#memory_rss=191.89MB
  - yeah, that’s a lot more. Although still like half of what we were seeing before! If we can keep it under 300MB, we can be okay. Hmm. We’re gonna have to do memory profiling of scihist-digicoll.
  - scihist-digicoll deploys as sample#memory_total=273.16MB sample#memory_rss=206.83MB, not TOO much more….
    - but just request of home page takes us to sample#memory_total=278.65MB sample#memory_rss=212.80MB… hmm, not THAT much more, refreshing home page gives us a few more.
    - Five child work takes us to sample#memory_total=277.77MB sample#memory_rss=211.89MB , a few refreshes to sample#memory_total=274.44MB sample#memory_rss=208.56MB
    - We are doing way better memory-wise than last time we looked?? Maybe moving away from hobby dyno really did matter???
    - 115 child work at sample#memory_total=306.34MB sample#memory_rss=240.46MB, it is getting bigger hmm.
    - Several Ramelli loads up to sample#memory_total=398.45MB sample#memory_rss=332.57MB DOH. Although taking up to 6 seconds to come back sometimes.
    - Let’s actually try a branch which allocates very little per member, disabled member view.
  - A testing version of scihist_digicoll which only displays a friendlier_id for each thumb/lockup, how does ramelli do….
    - sample#memory_total=443.03MB sample#memory_rss=379.39MB no better???
    - Let’s try without iterating through the children at all….
      - sample#memory_total=421.28MB sample#memory_rss=353.05MB WHAT REALLY? What is taking this memory, we’ve made ramelli hypothetically not load any more objects than a page with one child.
    - Aha, well, decorator.representative_member is still doing a members load. let’s stop it (this is also a point of optimization, we’re doing TWO member fetches here!)
      - Down to sample#memory_total=393.84MB sample#memory_rss=334.83MB … a little bit better, but this is still REALLY WEIRD that it’s so much. We’re going to have to memory profile somehow.
    - Let’s try eliminating MOST of show page, it’s just a title! ramelli is still sample#memory_total=313.77MB sample#memory_rss=245.55MB still pretty big. WEIRD.

Thurs Oct 8

The RAM and CPU resource issues are concerning.

Why does an instance seem to take even more RAM on heroku than on our EC2?
Why are slow actions even so much slower on heroku than on our EC2?

Things we might investigate:

Use heroku https://devcenter.heroku.com/articles/log-runtime-metrics experimental add-on to get more precise logging of our RAM use over time as we trigger actions.
Try passenger on heroku instead of puma, to compare apples to apples
Try the heroku buildpack for jemalloc and compiling ruby with that, which some people say makes ruby use RAM better. (We didn’t do that in our EC2 though). https://elements.heroku.com/buildpacks/gaffneyc/heroku-buildpack-jemalloc
Try a heroku standard-0 postgres and standard-1x dyno to be using actual resources we will be using, in case the ‘hobby’ ones we are using to test have different performance characteristics
- Dyno standard-1x can easily be temporarily turned on and off, but db will probably stay there at $50/month
Actually analyze and try to optimize our app, RAM usage and performance
- Make fixity report run on cronjob and give you stored results instead of running when you click on it
- Make many-child pages use “infinite scroll” technique to only load first X and load more when you scroll down, instead of trying to load all at once
- More efficient production of each child page element on work pages (hard-code URLs etc)
- Use derailed gem to figure out what parts are using so much RAM and fix them https://github.com/schneems/derailed_benchmarks
While we can probably optimize our app, the fact that we weren’t forced to on manual EC2 but will on heroku worries us that we’re raising the skill level and time needed to maintain a working app on heroku? (actually already HAVE spent time optimizing app now, but apparently not yet good enough for heroku?)

RAM measure investigations

Using heroku log-runtime-metrics, confirm that our 1-worker-with-two-threads puma instance is starting at 316MB.

After just accessing home page, it’s up to 346.74MB
Accessing 115-child work ysnh5if, it’s up to 375MB, a few more times 386MB, then 392MB
Accessing ramelli it’s up to 444MB, a couple more times 493MB, then 511MB!!!

We may have a memory leak or bad memory behavior – but why isn’t it affecting us on passenger on our manual EC2s?

Wait, may be bad on passenger too! And yet it works on our EC2…

To measure on passenger, ssh to ubuntu@ staging web server,

run sudo passenger-memory-stats
run sudo PASSENGER_INSTANCE_REGISTRY_DIR=/opt/scihist_digicoll/shared passenger-status

passenger-memory-stats on web is showing instance VMSize from 536MB to 738MB. Has something happened to raise our memory usage since last time we looked? And why isn’t this machine swapping horribly? But it also says Total private dirty RSS: 463.93 MB, maybe the “Private” value matters more than the “VMSize” value… but not on heroku that measures actual VMSize? (passenger-status shows only 200M and down, they show different things – neither may be what heroku measures, but they are working okay on our raw EC2….)

https://www.phusionpassenger.com/library/indepth/accurately_measuring_memory_usage.html

Heroku claims to be measuring “RSS” too, is that different than “private RSS”? sample#memory_total=509.52MB sample#memory_rss=469.41MB sample#memory_cache=40.11MB sample#memory_swap=0.00MB
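To compare these numbers across runs without eyeballing log lines, something like the following could pull the memory samples out of a log-runtime-metrics line. It is a sketch: parse_memory_samples is a hypothetical helper, and it only handles values reported in MB, as in the line above.

```ruby
# Parse "sample#memory_xxx=123.45MB" pairs out of a log-runtime-metrics line
# into a hash of metric name => megabytes (Float).
def parse_memory_samples(line)
  line.scan(/sample#(memory_\w+)=([\d.]+)MB/).to_h { |name, mb| [name, mb.to_f] }
end

line = "sample#memory_total=509.52MB sample#memory_rss=469.41MB " \
       "sample#memory_cache=40.11MB sample#memory_swap=0.00MB"
samples = parse_memory_samples(line)

parts = samples["memory_rss"] + samples["memory_cache"] + samples["memory_swap"]
puts format("rss + cache + swap = %.2fMB, total = %.2fMB", parts, samples["memory_total"])
```

In this sample, memory_total is just rss + cache + swap, so memory_rss is the number to compare against passenger's figures, with the caveat that heroku's RSS covers the whole dyno rather than one process's private memory.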

Still way more than our passenger numbers! Let’s try with passenger…

Passenger on heroku

Having trouble getting passenger working on heroku for some reason…. hmm, without me doing anything it seems to have settled down and is working.

Just home page query sample#memory_total=376.59MB sample#memory_rss=290.23MB
Accessing 115-child work ysnh5if, sample#memory_total=415.04MB sample#memory_rss=328.95MB, but then recovers to sample#memory_total=351.00MB sample#memory_rss=264.92MB
ramelli 4b29b614k, sample#memory_total=459.54MB sample#memory_rss=373.44MB

So not really that different. Maybe a bit better under passenger.

Try jemalloc with puma

https://elements.heroku.com/buildpacks/gaffneyc/heroku-buildpack-jemalloc
just home page: sample#memory_total=315.88MB sample#memory_rss=241.65MB
115-child ysnh5if: sample#memory_total=348.99MB sample#memory_rss=274.69MB
ramelli 4b29b614k: sample#memory_total=421.22MB sample#memory_rss=346.92MB
- After a couple reloads: sample#memory_total=499.28MB sample#memory_rss=424.98MB

Maybe a bit better, but not so much really, about the same.

Wed Oct 7

We have a semi-functional app deployed to heroku – no Solr (so no searching), no background jobs, lots of edge case issues. But something to look at.

Performance and resource concerns

One puma worker (five threads) for our app takes nearly 500MB of RAM, so we can only fit one in a 512MB dyno, or two in a 1024MB dyno. Somewhat less than we had hoped. Not sure why RAM usage is somewhat bigger than a passenger worker on our present infrastructure, maybe the extra threads? We could try with fewer threads.
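The arithmetic behind that, as a throwaway sketch. workers_that_fit is a hypothetical helper; it assumes every worker costs the same and ignores any memory shared between forked workers, so it is a rough lower bound rather than a measurement.

```ruby
# How many workers of a given size fit in a dyno, ignoring shared memory.
def workers_that_fit(dyno_mb, per_worker_mb)
  (dyno_mb / per_worker_mb.to_f).floor
end

puts workers_that_fit(512, 500)   # one worker in a 512MB dyno
puts workers_that_fit(1024, 500)  # two workers in a 1024MB dyno
```

Per-worker memory would have to drop to 256MB or less before a second worker fits in a 512MB dyno.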

More concerning however is the performance of our slowest/largest requests.

Our small/reasonable requests are somewhat slower on heroku than current infrastructure. But our larger requests are much slower and, worse, show much more variance in response time, sometimes pathologically. Apparently very slow requests--and/or requests which create a lot of objects/use a lot of RAM--cause heroku performance to degrade unpredictably? We could try to diagnose and improve our slowest actions, but this raises concerns about increasing difficulty of app development.

Also accessing some of these most problematic pages actually causes the ONE worker to exceed maximum memory on the 512MB dyno. :(

Action	Current infrastructure	heroku (hobby dyno)
work, 694 children (4b29b614k, Ramelli)	2-3s	3.5-6s, but sometimes as high as 7 or 9!!
viewer_images_info json for 694 children	1.2-3s (not sure why the variance)	1.4-3s (don’t know why this one matches heroku)
work, 115 children (ysnh5if, chemical atlas)	1-1.5s	1-2s but occasionally as high as 3s
work, 5 children (sb3979577, Arcana Naturae Detecta)	1.2s (somehow sometimes as low as .2s?)	1.2s (also sometimes somehow as low as .2s?)
Fixity report page	4.7-5s	8s, sometimes as high as 15s

Weird that some of the smaller pages have similar performance, but the big ones get really bad. Not sure what’s going on. We could try “jemalloc” ruby build on heroku but not sure that’s going to be it. Various other optimizations we could try, but are we accepting increased need to optimize and thus increased difficulty level of development on heroku?

Tuesday Oct 6

For future: Asset delivery

Needs to be investigated, heroku recommends CDN, we hadn’t accounted for that in cost or complexity of setup. https://github.com/sciencehistory/scihist_digicoll/issues/874

For future: Production vs Staging

While heroku has ways of creating production and staging environments, we aren’t going to worry about that for now, just working on getting a demo app up with a limited staging-like environment, following piece-by-piece plan from Monday.

For future: backups

Heroku postgres has its own built-in backups, including ability to easily rollback to previous point in time. Do we still want to do our own backups of postgres? Probably! But we should wrap our heads around the heroku backups and how they relate to ours, and update our documentation. https://github.com/sciencehistory/scihist_digicoll/issues/876

Software/configuration steps done

By “heroku dashboard” I mean the web GUI.

Install heroku CLI on my Mac
Run heroku login to auth heroku CLI on my local machine
Create scihist-digicoll app in heroku dashboard
Provision heroku postgres add-on. For now we’re going to do a hobby-basic at $9/month, although this won’t be enough for production, we plan a standard-0 at $50/month eventually. https://elements.heroku.com/addons/heroku-postgresql
Import database from our staging instance to our heroku db (https://devcenter.heroku.com/articles/heroku-postgres-import-export)
1. Do a new export on staging, since heroku asks for a certain format
  1. Tricky cause pg_dump doesn’t live on staging jobs server! Need to figure out how to ssh to database server maybe… ok can find it in EC2 console, and ssh there as jrochkind. Now need to figure out how to connect to database… can’t find database backup cronjob on database server, what user does it run under? not in ansible… but managed to pg_dump using credentials from local_env.yml on staging.
  2. Per heroku instructions, we need to put it on a private S3 bucket. We’ll use chf-hydra-backup, file digcol-for-heroku.dump. (Pretty slow to upload from my local network, figuring out how to put it in private bucket from the database server itself is beyond me right now though)
    1. Having trouble getting a properly signed URL to that location! hackily reverse engineered from S3 console, not the right way, but getting me there.
  3. Successfully imported! heroku pg:psql -a scihist-digicoll drops me into PSQL console where I can see tables and data to confirm. Deleted extra backup from our S3 bucket.
Try to deploy app to heroku?
1. add heroku remote to my local git, in local git directory: heroku git:remote -a scihist-digicoll, verify what it did with git remote -v
2. git push heroku
  1. Asset compilation failed, “TypeError: No value was provided for app_url_base”. We need that local_env config value for asset compilation apparently? (Based on stack trace, cause in order to boot the app, it tries to look it up to set config.action_mailer.default_url_options. We could make that be okay if the value isn’t present…). Anyway, we can add the config var in heroku dashboard to our current heroku non-custom url, APP_URL_BASE=https://scihist-digicoll.herokuapp.com/, and try again.
  2. Failed again cause it needs a local_env solr_url value. I can see this is going to be a slow process of discovering additional ones, as it takes a couple minutes to fail each time. But we’ll try adding a heroku config SOLR_URL=http://localhost/dummy/nonexisting
  3. “Lockbox master key is missing in production.” – there’s a lot of ENV we need just to get assets to compile! Try LOCKBOX_MASTER_KEY=000000000000000000000000000000000000000000000000000000000000000
  4. OK, now it’s complaining about missing bucket names. We should just go copy ALL config vars from staging local_env.yml over. Anything sensitive we will replace with dummy values. Basic pattern is eg s3_bucket_originals in local_env.yml turns into S3_BUCKET_ORIGINALS in heroku config, to be picked up as ENV by our Env class.
  5. Got it deployed but with an error! have to figure out how to access logs to see what error was… console doesn’t have enough lines!
    1. `heroku logs -n 1000`
    2. Looks like Rails app can’t connect to postgres. Error may look like: /app/vendor/bundle/ruby/2.6.0/gems/activerecord-6.0.3.3/lib/active_record/connection_adapters/postgresql_adapter.rb:49:in `include?': no implicit conversion of nil into String (TypeError)
    3. We may not be supplying config properly to get heroku postgres, need to look into it more. Yep. https://github.com/sciencehistory/scihist_digicoll/pull/880
  6. DEPLOYED!!! No bg jobs, no solr, so much not working, but basic app! https://scihist-digicoll.herokuapp.com/
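The key-name pattern from step 4 (s3_bucket_originals in local_env.yml becoming S3_BUCKET_ORIGINALS in heroku config) could be scripted instead of copied by hand. A rough sketch, with loud assumptions: that local_env.yml is a flat key/value file, that anything matching the SENSITIVE pattern should get a placeholder, and the sample values shown, which are made up. heroku_config_args is a hypothetical helper, not something in our repo.

```ruby
require "yaml"
require "shellwords"

# Assumption for illustration: which key names count as sensitive.
SENSITIVE = /key|secret|password/i

# Turn local_env.yml-style keys into KEY=value arguments for `heroku config:set`,
# upcasing names and replacing sensitive values with a placeholder.
def heroku_config_args(local_env)
  local_env.map do |key, value|
    value = "dummy" if key.match?(SENSITIVE)
    "#{key.upcase}=#{Shellwords.escape(value.to_s)}"
  end
end

# Made-up sample standing in for the staging local_env.yml.
local_env = YAML.safe_load(<<~YAML)
  s3_bucket_originals: example-originals-bucket
  solr_url: http://localhost/dummy/nonexisting
  lockbox_master_key: not-the-real-key
YAML

puts "heroku config:set #{heroku_config_args(local_env).join(' ')} -a scihist-digicoll"
```

Some placeholders need a valid shape to get past boot (the Lockbox key above being one), so a blanket "dummy" would need per-key overrides in practice.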

Configure Puma explicitly

Heroku/puma are ending up doing something default/standard. But let’s work on configuring puma explicitly, especially how many workers there are. This is mostly about what you can fit in available RAM. A “hobby” dyno has the same amount of RAM as a standard-1x, so this is a fine place to start.

Heroku docs: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server

To know if we have too many workers for RAM, we need to pay attention to if we get out of memory alerts from heroku… so as a pre-requisite, let’s configure one of the heroku logging add-ons at a free tier. Somewhat arbitrarily (gives us 7 days of log retention for free?), let’s try logentries. Unclear if it has Slack integration actually, but we’ll start with it.

DOH easily able to trigger heroku out of memory errors under this configuration. Need a bigger heroku dyno, or fewer workers, need to investigate more.

DEALING WITH MEMORY LIMITS is probably the biggest new challenge of deploying on heroku, which can be ongoing as our app grows. This is a challenging task even for an experienced dev.

See also: https://www.speedshop.co/2017/10/12/appserver.html (Nate suggests a Rails app typically uses between 200 and 400 MB. How does heroku suggest you can get away with two of them in a 512MB worker then? maybe we’re not so bad, it’s just not realistic to think you can get away with a 512MB dyno).

Tried asking a question on reddit heroku, not sure how much traffic that gets. https://www.reddit.com/r/Heroku/comments/j6c2so/q_real_rails_ram_experiences/

Mon Oct 5

Heroku has a LOT of docs, usually well-written. It is pretty well googled. Some heroku overview and getting started docs:

Interesting heroku add-on I noticed, rails-autoscale – instead of needing to build out as many dynos as we might need to handle maximum traffic or ingest, we can have the add-on scale up automatically with use. Works for both web dynos (with traffic), and background job dynos (when we do a big ingest, it can scale up more workers!). Does cost money, price based on how high you want it to be able to scale I think.

I think I will try to get our app on heroku piece by piece…

Get app deployed to heroku with postgres small web dyno only – no bg jobs yet, no solr yet. (Solr functions won’t work!)
Add in bg jobs – including heroku buildpacks with all the software they need (vips, imagemagick, ffmpeg, etc).
Add in solr – not sure whether to start by trying to have it connect to existing staging solr (which would require a heroku add-on for a static outgoing IP via SOCKS, so we could let it through our solr firewall, and/or other changes to solr config), OR move right away to a SaaS solr – which would cost money, have to identify which one we need.
App substantially working at this point, but still lots of little pieces to get in place, such as nightly jobs, and various problem cases (out of memory for PDF generation etc).

For future: Infrastructure as code?

Deploying to Heroku involves configuring some things on the platform. For instance what I know about now includes mainly a list of config variables (such as what we have in our local_env.yml), and add-ons selected and their configuration.

You can do this in the heroku console, but I’m nervous about that living only inside heroku’s system. How do we get it in source code, “infrastructure as code”, as we always tried to do with ansible, having our infrastructure re-runnable from files on disk, not just living in live system? This isn’t something that needs to be solved now, but something I want to attend to as part of this process, ask around for what others are doing.

Looks like one solution might be using terraform with heroku, documented by heroku. To look into more later.

https://github.com/sciencehistory/scihist_digicoll/issues/875

jrochkind Heroku Journal