Over the last five days, we’ve seen problems with multiple PowerPACs in the libraries we work with. A PowerPAC goes down, comes back up, falls over again, comes back up, and so on. The cycle can repeat fairly quickly, but an outage can last as long as a few minutes. This affects not only PowerPAC but ERMS services as well.
After a ticket to Polaris Support, we found out that Googlebot is effectively running a DDoS attack on our libraries because it can’t deal with faceted URLs. According to Googlebot’s documentation:
Faceted navigation is a common feature of websites that allows its visitors to change how items (for example, products, articles, or events) are displayed on a page. It’s a popular and useful feature, however its most common implementation, which is based on URL parameters, can generate infinite URL spaces which harms the website in a couple ways:
Overcrawling: Because the URLs created for the faceted navigation seem to be novel and crawlers can’t determine whether the URLs are going to be useful without crawling first, the crawlers will typically access a very large number of faceted navigation URLs before the crawlers’ processes determine the URLs are in fact useless.
Slower discovery crawls: Stemming from the previous point, if crawling is spent on useless URLs, the crawlers have less time to spend on new, useful URLs.
For those who might not know, the PowerPAC is built on faceted URLs. That’s how it delivers pages, from searching to logins to holds and more. Googlebot doesn’t know how to deal with those faceted URLs, so instead of ignoring them or doing something intelligent about them, it “tries harder,” which is why one of those library sites saw 356,595 hits from Googlebot in a single morning.
Cloudflare can help with this, and @wesochuck provided the link you see here.
According to Googlebot’s docs, you can also disallow it from crawling faceted URLs in your robots.txt. I went through the PowerPAC and the Polaris documentation and compiled a list of facets that you’d likely want to block. There could be more, though, and if so, I’ll update the list below.
user-agent: Googlebot
disallow: /*?*term=
disallow: /*?*by=
disallow: /*?*sort=
disallow: /*?*limit=
disallow: /*?*query=
disallow: /*?*page=
disallow: /*?*searchid=
disallow: /*?*ctx=
disallow: /*?*pos=
disallow: /*?*cn=
disallow: /*?*new=
disallow: /*?*isbn=
disallow: /*?*lccn=
disallow: /*?*keyword=
disallow: /*?*title=
disallow: /*?*author=
disallow: /*?*subject=
disallow: /*?*series=
disallow: /*?*upc=
disallow: /*?*oclc=
disallow: /*?*brs=
disallow: /*?*brsn=
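
Before deploying rules like these, it can help to check which URLs they would actually catch. Below is a minimal Python sketch of the wildcard matching Google documents for robots.txt, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names (`pattern_to_regex`, `is_blocked`) and the sample paths are made up for illustration; they are not taken from Polaris or from Google's own parser, so substitute real URLs from your server logs.

```python
import re

# Parameter names copied from the disallow rules above.
FACET_PARAMS = [
    "term", "by", "sort", "limit", "query", "page", "searchid", "ctx",
    "pos", "cn", "new", "isbn", "lccn", "keyword", "title", "author",
    "subject", "series", "upc", "oclc", "brs", "brsn",
]

# Rebuild the same patterns used in the robots.txt listing.
DISALLOW_PATTERNS = [f"/*?*{name}=" for name in FACET_PARAMS]


def pattern_to_regex(pattern):
    """Translate one robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is treated as a literal character.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


RULES = [pattern_to_regex(p) for p in DISALLOW_PATTERNS]


def is_blocked(path_and_query):
    """Return True if any disallow pattern matches from the start of the path."""
    return any(rule.match(path_and_query) for rule in RULES)


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    samples = [
        "/polaris/search/results.aspx?ctx=1.1033.0.0.1&term=cats",
        "/polaris/default.aspx",
        "/polaris/view.aspx?nearby=1",
    ]
    for sample in samples:
        print(is_blocked(sample), sample)
```

Note that the wildcard does not stop at parameter boundaries: `/*?*by=` matches any parameter name that merely ends in `by`, such as the hypothetical `nearby=` in the third sample. For blocking a crawler that is usually harmless, but it is worth knowing if there is a page you do want indexed.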