...

Only using 155MB after reboot (with index in place).

One day later, the dark bar (memory in use) is up to 291MB, still under 512MB.

The SearchStax NDN1

Immediately after a reboot (with a full index), this is what our SearchStax NDN1 reports in the Solr dashboard.

...

We can see that while it is advertised as a “1GB” machine, the JVM only has a bit less than half of that.
This isn’t tunable by us; it’s SearchStax’s choice. It leaves space in RAM for the OS and maintenance tasks, and we can see that the system Physical Memory is pretty healthy too. See more on SearchStax memory choices in SearchStax docs. (And also some SearchStax docs on tuning Solr memory use.)

The good news is that 460MB is still well over the 287MB we saw our in-use production server use.

Only using 109MB on boot, a bit less than our staging Solr for some reason, but in the ballpark.

Adding some load

We’ll want to do some reindexes, and also a bunch of queries.

I pointed our existing EC2 staging server at the SearchStax Solr.

  • Did a reindex while repeatedly refreshing the Solr admin dashboard. Use sometimes reached 140MB.

  • Did a “blank” search, which I know requires all facets to be calculated, and that can be RAM intensive. Temporarily up to 150MB use, then down to 80MB again.

  • Went to the last page of pagination of the “blank” search. I believe Solr is RAM hungry when you do deep pagination like this. Temporarily up to 150MB, then back down to 75MB.

Okay…

  • Let’s do a load test where we ask for that deep-pagination page over and over while also doing a reindex: `wrk -c 1 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333`

    • => Refreshing to watch RAM use, it never went over 170MB, and we didn’t get any OOM errors. After the test finished, RAM use rested around 150MB for a bit, then returned to 99MB.

  • Without restarting, now with some concurrency to stress it more: `wrk -c 4 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333`

    • => Observed up to 250MB, but no OOM and that’s still around 50% of capacity, plenty of room.

  • Concurrency of 10, far more than we’d ever see in reality, just to see what happens: `wrk -c 10 -t 1 -d 3m https://staging-digital.sciencehistory.org/catalog?page=333`

    • Still didn’t observe more than 285MB. At one point use even dropped to 100MB while the test was still running.

    • Using the app manually, it’s definitely kind of slow with this much load on it, which is unsurprising. But no errors; it’s working!

OK, that’s all looking fine.
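
Refreshing the dashboard by hand works, but a small polling loop would give a record of heap use over the course of a test. The following is a minimal sketch, not something that was run for these tests. It assumes the standard Solr system info endpoint (`/admin/info/system`) is reachable and that its JSON includes a numeric `used` byte count under `jvm.memory.raw`. The base URL argument and the function names are placeholders, and any authentication the SearchStax endpoint requires is not handled.

```shell
#!/usr/bin/env bash
# Sketch: log Solr JVM heap use during a load test.
# Function names are hypothetical; the Solr base URL is supplied by the caller.

# Read a Solr system-info JSON document on stdin and print used heap in MB.
# Assumes the numeric "used" byte count under jvm.memory.raw is the first
# numeric "used" field; the human-readable "used" value is a quoted string,
# so the pattern below skips it.
jvm_used_mb() {
  grep -o '"used":[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | sed 's/.*:[[:space:]]*//' \
    | awk '{ printf "%d\n", $1 / (1024 * 1024) }'
}

# Poll the given Solr base URL (ending in /solr) every INTERVAL seconds,
# printing a timestamp and the used heap. Stop with Ctrl-C.
poll_solr_memory() {
  local solr_url="$1" interval="${2:-5}"
  while true; do
    printf '%s %sMB\n' "$(date +%H:%M:%S)" \
      "$(curl -s "${solr_url}/admin/info/system?wt=json" | jvm_used_mb)"
    sleep "$interval"
  done
}
```

Running something like `poll_solr_memory "$SOLR_URL" 2` in a second terminal during a `wrk` run would print one line every two seconds, which makes short peaks easier to catch than manual refreshing.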

Load testing with queries taken from log

Let’s try load testing with a “realistic” set of URLs taken from actual app logs.

grep "/catalog?" production.log | shuf | head -50 > 50_random_queries.txt

Gave me 50 random catalog search queries from our actual production app. (OK, there were also a few redirects and other things we didn’t want in there.) Most of them look like bot traffic honestly, just following facet links kind of randomly. Or our “chemistry” search, which is a “ping” done every minute by our Honeybadger uptime checker.

But let’s try these anyway. Using some Ruby regexp magic to actually extract the URLs, we then run:

URLS=./500_urls.txt wrk -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/

And do a reindex at the same time…

RAM usage observed up to 275MB. Still well under our 490MB JVM max.

App was accessible the whole time manually, although slow. No Solr OOM or other errors reported.
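
The URL extraction step isn’t shown above. As a rough shell stand-in for the Ruby regexp, here is a sketch that pulls request paths out of log lines. It assumes the default Rails request log format (`Started GET "/path" for IP at TIME`), and the function name is hypothetical; the actual log format and the actual extraction used may well have differed.

```shell
#!/usr/bin/env bash
# Sketch: extract /catalog request paths from Rails-style log lines on stdin.
# Assumes lines containing: Started GET "/catalog?..." for <ip> at <time>
# Lines that do not match (other paths, redirects, other log output) are dropped.
extract_catalog_paths() {
  sed -n 's|.*Started GET "\(/catalog?[^"]*\)".*|\1|p'
}

# Possible usage (not run here): sample 500 paths for the wrk script.
#   extract_catalog_paths < production.log | shuf -n 500 > 500_urls.txt
```

Filtering on the `Started GET` line also drops the redirects and other noise that a plain `grep "/catalog?"` lets through.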

Just for comparison…

Both using current ansible-managed infrastructure, and both using our “realistic” list of URLs.

SearchStax NDN1

$ URLS=./500_urls.txt wrk --latency -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/
multiplepaths: Found 480 paths
multiplepaths: Found 480 paths
Running 1m test @ https://staging-digital.sciencehistory.org/
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   335.10ms  182.49ms   1.62s    77.95%
    Req/Sec    31.26     15.92   100.00     65.78%
  Latency Distribution
     50%  297.99ms
     75%  383.49ms
     90%  570.78ms
     99%  958.27ms
  1853 requests in 1.00m, 66.83MB read
Requests/sec:     30.83
Transfer/sec:      1.11MB

Original/Current Solr

$ URLS=./500_urls.txt wrk --latency -c 10 -t 1 -d 1m -s load_test/multiplepaths.lua.txt https://staging-digital.sciencehistory.org/
multiplepaths: Found 480 paths
multiplepaths: Found 480 paths
Running 1m test @ https://staging-digital.sciencehistory.org/
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.13ms  221.74ms   1.94s    78.84%
    Req/Sec    30.25     15.66    90.00     68.51%
  Latency Distribution
     50%  301.96ms
     75%  412.95ms
     90%  650.28ms
     99%    1.12s
  1784 requests in 1.00m, 64.37MB read
Requests/sec:     29.71
Transfer/sec:      1.07MB

The two look pretty similar.
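
To put a number on “similar”, the figures from the two runs above can be compared as percent changes. This is just arithmetic on the reported values; the helper name is made up for illustration, and a single one-minute run of each is not enough to call the differences significant.

```shell
#!/usr/bin/env bash
# Sketch: percent change of NEW relative to OLD, to one decimal place.
# Here NEW is the SearchStax NDN1 run and OLD is the original/current Solr run.
pct_change() {
  awk -v new="$1" -v old="$2" 'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

pct_change 297.99 301.96   # median latency in ms
pct_change 958.27 1120     # 99th percentile latency in ms (1.12s = 1120ms)
pct_change 30.83 29.71     # requests per second
```

On these numbers SearchStax comes out about 1% faster at the median, about 14% faster at the 99th percentile, and about 4% higher in requests per second.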

Conclusion?

I think we’re fine with NDN1.

It’s hard to be sure of this, though: Java memory use is hard to understand and predict.

If we are wrong, we can upgrade to NDN2 at any time, even if we sign an annual contract for NDN1. We can’t rule out needing to do this in the future, especially if our traffic were to drastically increase.