Solr sizing and SearchStax comparison
We want to know whether we can likely fit in a SearchStax “NDN1” (1GB of RAM and 8GB storage; $20 or $40/month), or whether we instead need an “NDN2” (2GB of RAM and 16GB storage; $40 or $80/month).
To do this, we’ll compare against our current self-managed Solr, which runs on an AWS EC2 t3a.small with 2GB of RAM.
We think the storage space is more than sufficient for our small collection on either NDN1 or NDN2: as far as we can tell, our index only needs about 70MB(?) of disk. We don’t think this is an issue.
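If we want to double-check that number, Solr’s CoreAdmin API reports each core’s index size on disk; something like this against the current server (default localhost port shown; the core name in the response will be whatever our setup uses):

curl 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json'

The status response includes a size / sizeInBytes for each core’s index, which is the figure to weigh against the 8GB or 16GB of plan storage.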
CPU is roughly equivalent to what we have now, and our Solr’s CPU needs are minimal anyway; CPU is also probably the same between NDN1 and NDN2. We don’t think this is an issue either.
It’s really about RAM: whether the NDN1 can handle the same usage patterns as our current setup, or at least all the ones we may realistically encounter.
Current EC2 Solr
production, in use
This is from the production Solr dashboard (with an uptime of only 4 days, though).
The machine has 2GB of RAM. The JVM has been given an -Xmx max heap size of 2GB, which is actually a bit excessive: if it ever got there, it would be trying to use all of the machine’s RAM and swapping heavily.
However, it’s only actually using 287MB of RAM. The JVM has reserved 512MB of RAM from the OS (possibly because we gave it that as an -Xms minimum size), but isn’t currently using all of it.
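For reference, those numbers correspond to JVM heap settings along these lines in Solr’s startup config (usually solr.in.sh; the exact file and values on our machine are whatever ansible put there, so this is just a sketch of what’s presumably in place):

# solr.in.sh (or equivalent): 512MB minimum heap, 2GB maximum heap
SOLR_JAVA_MEM="-Xms512m -Xmx2g"

Capping -Xmx at something closer to 1g would arguably be safer on a 2GB machine, since the OS and filesystem cache need room too.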
This production machine has only been up for 4 days, though. Is it possible that over time it will use even more RAM?
staging, after reboot
Compare to our staging Solr, immediately after a reboot.
It’s only using 155MB after reboot (with the index in place).
One day later it’s up to 291MB (the dark bar in the dashboard), still under 512MB.
The SearchStax NDN1
Immediately after a reboot (with a full index), this is what our SearchStax NDN1 reports in the Solr dashboard.
We can see that while it is advertised as a “1GB” machine, the JVM only has a bit less than half of that.
This isn’t tunable by us; it’s SearchStax’s choice, made to leave RAM for the OS and other maintenance tasks, and we can see that the system Physical Memory is pretty healthy too. See more on SearchStax memory choices in the SearchStax docs. (And also some SearchStax docs on tuning Solr memory use.)
The good news is that the 460MB heap is still well over the 287MB we saw our in-use production server using.
It’s only using 109MB on boot, a bit less than our staging Solr for some reason, but in the same ballpark.
Adding some load
We’ll want to do some reindexes, and also a bunch of queries.
I pointed our existing EC2 staging app server at the SearchStax Solr.
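(Switching backends is basically just changing the Solr URL the staging app talks to; hypothetically something like the line below, though the real variable name and where it’s set depend on our app’s configuration and the URL SearchStax gives us.)

# hypothetical; actual variable name and endpoint depend on our app config and SearchStax deployment
export SOLR_URL="https://<our-deployment>.searchstax.com/solr/<collection>"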
Did a reindex, while refreshing the Solr admin dashboard a lot to watch memory: sometimes up to 140MB in use.
Did a “blank” search, which I know requires calculating all the facets, something that can be RAM-intensive. Temporarily up to 150MB in use, then down to 80MB again.
Went to the last page of pagination of the “blank search”; I believe Solr is RAM-hungry when you do deep pagination like this. Temporarily up to 150MB, then back down to 75MB.
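To make the memory concern concrete: a facet-heavy, deeply-paginated page turns into a Solr query roughly like the one below. The collection and facet field names are hypothetical stand-ins for whatever our app actually sends, and we assume 10 results per page, so page 333 means a start offset of 3320.

# Solr has to collect and rank start+rows candidate docs and compute every facet,
# which is where the memory goes on deep pages
curl 'http://localhost:8983/solr/<collection>/select?q=*:*&start=3320&rows=10&facet=true&facet.field=subject_facet'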
Okay…
Let’s do a load test where we ask for that deep-pagination page over and over… while also doing a reindex.
wrk -c 1 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333
=> Refreshing to watch RAM use: it never went over 170MB, and we didn’t get any OOM errors. After the test finished, RAM rested around 150MB for a bit, then returned to 99MB.
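(Instead of eyeballing the admin dashboard, the same heap numbers can be polled from Solr’s Metrics API, available in recent Solr versions; point it at whichever server you’re watching, with whatever auth SearchStax requires.)

# JVM heap metrics, e.g. memory.heap.used and memory.heap.max, in bytes
curl 'http://localhost:8983/solr/admin/metrics?group=jvm&prefix=memory.heap'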
Without restarting, now with some concurrency to stress it more?
wrk -c 4 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333
=> Observed up to 250MB, but no OOM, and that’s still only around 50% of heap capacity: plenty of room.
10 concurrent connections, far more than we’d ever see in reality, just to see what happens?
wrk -c 10 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333
Still didn’t observe more than 285MB. And at one point it dropped to 100MB even though the test was still going on?
Using the app manually, it’s definitely kinda slow when this much load is being put on it (unsurprising!). But no errors; it’s working!
OK, that’s all looking fine.
Load testing with queries taken from log
Let’s try load testing with a “realistic” set of URLs taken from actual app logs?
grep "/catalog?" production.log | shuf | head -50 > 50_random_queries.txt
That gave me 50 random catalog search queries from our actual production app. (OK, there were a few redirects and other things we didn’t want in there too.) Honestly, most of them look like bot traffic, just following facet links kind of randomly, or our “chemistry” search that our Honeybadger uptime checker “pings” every minute.
But let’s try these anyway. First, a bit of Ruby regexp magic to actually extract the URL paths from those raw log lines.
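Something like this hypothetical one-liner does the extraction (the exact Rails log line format is an assumption here, so the regexp may need tweaking; file names are just illustrative):

# pull the request path out of each Rails "Started GET ..." log line
ruby -ne 'puts $1 if $_ =~ /Started \w+ "(\/catalog\?[^"]*)"/' production.log | shuf | head -500 > 500_urls.txt

Then, with the extracted paths in 500_urls.txt (one per line), we run: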
URLS=./500_urls.txt wrk -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/
And do a reindex at the same time…
RAM usage observed up to 275MB. Still well under our 490MB JVM max.
App was accessible the whole time manually, although slow. No Solr OOM or other errors reported.
Just for comparison…
Both runs go through the app on our current ansible-managed infrastructure, using our “realistic” list of URLs; the only difference is which Solr the app points at…
SearchStax NDN1
$ URLS=./500_urls.txt wrk --latency -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/
multiplepaths: Found 480 paths
multiplepaths: Found 480 paths
Running 1m test @ https://staging-digital.sciencehistory.org/
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   335.10ms  182.49ms   1.62s    77.95%
    Req/Sec    31.26     15.92    100.00     65.78%
  Latency Distribution
     50%  297.99ms
     75%  383.49ms
     90%  570.78ms
     99%  958.27ms
  1853 requests in 1.00m, 66.83MB read
Requests/sec:     30.83
Transfer/sec:      1.11MB
Original/Current Solr
$ URLS=./500_urls.txt wrk --latency -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/
multiplepaths: Found 480 paths
multiplepaths: Found 480 paths
Running 1m test @ https://staging-digital.sciencehistory.org/
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.13ms  221.74ms   1.94s    78.84%
    Req/Sec    30.25     15.66     90.00     68.51%
  Latency Distribution
     50%  301.96ms
     75%  412.95ms
     90%  650.28ms
     99%    1.12s
  1784 requests in 1.00m, 64.37MB read
Requests/sec:     29.71
Transfer/sec:      1.07MB
They look pretty similar.
Conclusion?
I think we’re fine with NDN1.
It’s hard to be sure of this, though; Java memory use is hard to understand and predict.
If we are wrong, we can upgrade to NDN2 at any time, even on an annual NDN1 contract. We can’t rule out needing to do that in the future, especially if our traffic were to increase drastically.