[Archivesspace_Users_Group] Help with robots.txt

Blake Carver blake.carver at lyrasis.org
Tue May 21 10:59:19 EDT 2019


I'll be sure to add this one to the docs, so let me know if this works!

I think you'll need to set up an alias, something like this for Apache:

<Location "/robots.txt">
  SetHandler None
  Require all granted
</Location>
Alias /robots.txt /var/www/robots.txt
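
One caveat: since your catch-all ProxyPass sends everything to the backend, mod_proxy may grab /robots.txt before the Alias is consulted. If so, you may also need to exclude that path from proxying, ahead of the catch-all rule, e.g.:

  # The "!" tells mod_proxy not to proxy this path, so the Alias above
  # can serve it; the exclusion must come before the catch-all ProxyPass.
  ProxyPass /robots.txt !
  ProxyPass / http://localhost:8081/ retry=1 acquire=3000 timeout=600 keepalive=On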

For nginx, it would look more like this:
  location /robots.txt {
    alias /var/www/robots.txt;
  }
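
Either way, the file has to actually exist at the aliased path (/var/www/robots.txt here) and be readable by the web server user. Something minimal like this is a starting point (the Disallow line is just a placeholder; adjust it for whatever you want crawled):

  # /var/www/robots.txt
  User-agent: *
  Disallow: /search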


________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Swanson, Bob <bob.swanson at uconn.edu>
Sent: Tuesday, May 21, 2019 9:59 AM
To: archivesspace_users_group at lyralists.lyrasis.org
Subject: [Archivesspace_Users_Group] Help with robots.txt


Please forgive me if this is posted twice; I sent the following yesterday, before I had submitted the “acceptance” email to the ArchivesSpace Users Group, and I don’t see where it was posted on the board (am I doing this correctly?).



So far as I can tell, this is how I’m supposed to ask questions regarding ArchivesSpace.

Please forgive and correct me if I’m going about this incorrectly.



I am new to ArchivesSpace, Ruby, JBOD and web development, so I’m pretty dumb.



The PUI Pre-Launch checklist advises creating and updating robots.txt, so we would like to set up a robots.txt file to control what crawlers can access when they crawl our ArchivesSpace site, https://archivessearch.lib.uconn.edu/.

I understand that robots.txt is supposed to go in the web root directory of the website.

In a normal Apache configuration that’s simple enough.



But we are serving ArchivesSpace via HTTPS:

a) All port 80 traffic is redirected to port 443.

b) Port 443 traffic is proxied to port 8081 (for the public interface) per the ArchivesSpace documentation:

  RequestHeader set X-Forwarded-Proto "https"
  ProxyPreserveHost On
  ProxyPass / http://localhost:8081/ retry=1 acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse / http://localhost:8081/
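
For reference, the port 80 redirect in (a) is just a standard VirtualHost redirect, roughly like this (the exact directives in our config may differ):

  <VirtualHost *:80>
    ServerName archivessearch.lib.uconn.edu
    Redirect permanent / https://archivessearch.lib.uconn.edu/
  </VirtualHost>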

So my web root directory (/var/www/html) is empty (save some garbage left over from when I was testing).



I’ve read the documentation at http://www.robotstxt.org but I can’t find anything that pertains to my situation.

I have to imagine that most ArchivesSpace sites are now HTTPS and use robots.txt, so this should be a somewhat standard implementation.



I do not find much information on the Users Group site pertaining to this. I find reference to plans for this being implemented at the web server level back in 2016, but nothing beyond that:

http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2016-August/003916.html



A search of the ArchivesSpace Technical Documentation for “robots” comes up empty as well.



Can you please direct me to any documentation that may exist on setting up a robots.txt file in a proxied HTTPS instance of ArchivesSpace?

Thank you, and please tolerate my naivety.







Bob Swanson

UConn Libraries

860-486-5260 – Office

860-617-1188 - Mobile

