At present only search result pages are protected in this way, because that is where we were having trouble: search pages (backed by Solr) are more resource-constrained, and bots were traversing every combination of facets, an essentially limitless set of paths.

After ~2 searches, a user may be redirected to a Turnstile challenge page, which in many cases automatically redirects back to the search within a few seconds. Users on a given browser should see the challenge only once per 24 hours. (All of this is configurable and subject to change.)

Configuration and Implementation

Original implementation PR: https://github.com/sciencehistory/scihist_digicoll/pull/2838

Currently implemented via the bot_challenge_page gem written by jrochkind.

Configuration is in ./config/initializers/rack_attack.rb (at the bottom, in a to_prepare block). To see everything that can be configured, see the implementation at /app/controllers/bot_challenge_page.rb

Which paths are protected is configured here. We intend to include all search results. If you add more search results pages (alternate views of search-within collections, featured topics, etc.) at new URLs, you will have to adjust this configuration to protect them!
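As a rough illustration of what "protecting a path" means, the check amounts to asking whether a request path falls under one of a configured set of prefixes. The sketch below is not the gem's API; the constant, the method name, and the prefixes are all hypothetical, chosen only to show why a new search page at a new URL would go unprotected until it is added to the list.

```ruby
# Hypothetical list of protected path prefixes (illustration only;
# the real list lives in the app's configuration).
PROTECTED_PREFIXES = ["/catalog", "/collections", "/focus"].freeze

# True when the path equals a protected prefix or sits underneath it.
# Matching on "prefix/" avoids catching unrelated paths such as "/catalogue".
def protected_path?(path)
  PROTECTED_PREFIXES.any? do |prefix|
    path == prefix || path.start_with?("#{prefix}/")
  end
end
```

With this sketch, protected_path?("/collections/abc123") is true, while a search page added later at an unlisted prefix would return false until the list is updated.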

You can configure the period and count before a challenge is triggered, and how long a 'passed' challenge is good for before another challenge might be issued.
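To make the interaction of these three settings concrete, here is a minimal in-memory sketch. It is not how the gem is implemented; the class name ChallengeGate, its keyword arguments, and the idea of separate "bucket" and "browser" keys are assumptions made for illustration. The rate is counted per bucket, while a pass is remembered per browser.

```ruby
# Hypothetical illustration of count / period / pass-lifetime settings.
class ChallengeGate
  def initialize(count:, period:, pass_lifetime:)
    @count = count                  # requests allowed per period before a challenge
    @period = period                # length of the rate window, in seconds
    @pass_lifetime = pass_lifetime  # how long a passed challenge stays valid, in seconds
    @hits = Hash.new { |hash, key| hash[key] = [] } # bucket => request timestamps
    @passed_at = {}                 # browser => time the challenge was passed
  end

  # Records a request and returns true when the client should be challenged.
  def challenge?(bucket:, browser:, now: Time.now)
    passed = @passed_at[browser]
    return false if passed && (now - passed) < @pass_lifetime

    recent = @hits[bucket].select { |time| (now - time) < @period }
    recent << now
    @hits[bucket] = recent
    recent.size > @count
  end

  # Remembers that this browser passed the challenge.
  def record_pass(browser:, now: Time.now)
    @passed_at[browser] = now
  end
end
```

For example, ChallengeGate.new(count: 3, period: 3600, pass_lifetime: 24 * 3600) would allow three requests per hour from a bucket before challenging, and would leave a browser alone for 24 hours after a pass. The count and period in that call are illustrative values, not the deployed ones.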

With additional configuration from the gem, we could also change the buckets/keys for which rates are calculated. Right now they are subnets; they could instead take into account HTTP headers, or information looked up about the client IP. We want the check to be quick, though, since it happens on every request.
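As a sketch of what a subnet-based bucket key can look like, the following uses Ruby's standard IPAddr class to mask a client address down to its network. The method name rate_limit_bucket and the prefix lengths (/24 for IPv4, /64 for IPv6) are assumptions for illustration, not necessarily what is deployed.

```ruby
require "ipaddr"

# Hypothetical helper: maps a client IP to the subnet used as its rate-limit key.
# The default prefix lengths are illustrative assumptions.
def rate_limit_bucket(ip_string, v4_prefix: 24, v6_prefix: 64)
  ip = IPAddr.new(ip_string)
  prefix = ip.ipv4? ? v4_prefix : v6_prefix
  # mask zeroes the host bits, so every address in the subnet yields the same key
  "#{ip.mask(prefix)}/#{prefix}"
end
```

Here rate_limit_bucket("192.0.2.77") and rate_limit_bucket("192.0.2.200") both return "192.0.2.0/24", so a bot rotating through addresses within one subnet still shares a single counter. The computation is pure and local, which fits the requirement that the check be quick on every request.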

Logging

A challenge is issued with an HTTP 403 status, and is noted in our logs with bot_chlng=true, as implemented here.

An actual successful pass through the challenge is logged with the message "Cloudflare Turnstile validation passed".
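Given those two log markers, a quick tally of challenges issued versus challenges passed can be sketched as below. The method name summarize_bot_challenges is hypothetical, and the sketch assumes only that each marker appears verbatim somewhere in a log line; it makes no assumption about the rest of the log format.

```ruby
# Hypothetical helper: counts challenge and pass markers in an enumerable of log lines.
def summarize_bot_challenges(lines)
  counts = { challenged: 0, passed: 0 }
  lines.each do |line|
    counts[:challenged] += 1 if line.include?("bot_chlng=true")
    counts[:passed] += 1 if line.include?("Cloudflare Turnstile validation passed")
  end
  counts
end

# Possible usage, assuming a local log file at this (hypothetical) path:
#   summarize_bot_challenges(File.foreach("log/production.log"))
```

A large gap between the two counts would be consistent with most challenged agents never completing the challenge.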

Current observation is that most “failures” aren’t logged, because they consist of an agent being challenged and simply never getting past the challenge.

While Cloudflare Turnstile also keeps some statistics, they are missing most of our challenge hits, maybe because the clients are not executing JavaScript at all and so never engage Turnstile.

Community

The #bots channel on the Code4Lib Slack is full of helpful peers. Join it at https://code4lib.org/slack