[Archivesspace_Users_Group] Help with robots.txt

Andrew Morrison andrew.morrison at bodleian.ox.ac.uk
Tue May 21 11:37:40 EDT 2019


Hello,

If you put a robots.txt file in the config folder of your ArchivesSpace installation, it will be served in response to requests for /robots.txt after the next restart. I cannot remember where I read that, and I cannot find it now, but I can confirm that it works, I believe since 2.6.0.
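For example, a minimal file (the rules below are only placeholders; adapt them to your own crawling policy):

  # config/robots.txt -- served verbatim at /robots.txt after the next restart
  User-agent: *
  Disallow: /search
  Crawl-delay: 5

You can confirm it is being picked up after the restart with something like: curl -s https://your-site/robots.txt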

Regards,

Andrew Morrison
Software Engineer
Bodleian Digital Library Systems and Services
https://www.bodleian.ox.ac.uk/bdlss


On Tue, 2019-05-21 at 13:59 +0000, Swanson, Bob wrote:
Please forgive me if this is posted twice; I sent the following yesterday, before I had submitted the "acceptance" email to the ArchivesSpace Users Group, and I don't see it posted on the board (am I doing this correctly?).

So far as I can tell, this is how I'm supposed to ask questions regarding ArchivesSpace; please forgive and correct me if I'm going about it incorrectly.

I am new to ArchivesSpace, Ruby, JBOD and web development, so I’m pretty dumb.

The PUI Pre-Launch checklist advises creating and updating robots.txt, so we would like to set up a robots.txt file to control what crawlers can access when they crawl our ArchivesSpace site, https://archivessearch.lib.uconn.edu/.
I understand that robots.txt is supposed to go in the web root directory of the website; in a normal Apache configuration that's simple enough.

But we are serving ArchivesSpace via HTTPS:

a) All port 80 traffic is redirected to port 443.

b) Port 443 traffic is proxied to port 8081 (the public interface) per the ArchivesSpace documentation:

  # Tell the backend the original request was HTTPS, and keep the original Host header.
  RequestHeader set X-Forwarded-Proto "https"
  ProxyPreserveHost On
  # Reverse-proxy everything to the public interface on localhost:8081.
  ProxyPass / http://localhost:8081/ retry=1 acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse / http://localhost:8081/
So my web root directory (/var/www/html) is empty (save for some garbage left over from testing).

I've read the documentation on www.robotstxt.org, but I can't find anything that pertains to my situation.
I have to imagine that most ArchivesSpace sites are now HTTPS and use robots.txt, so this should be a somewhat standard implementation.

I do not find much information on the Users Group site pertaining to this. I find a reference to plans for implementing this at the web server level back in 2016, but nothing beyond that:
http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2016-August/003916.html
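(If I'm reading that thread right, the web-server-level approach would presumably look something like the following in my virtual host, though I haven't tested it; the exclusion has to come before the catch-all ProxyPass line:)

  # Exempt /robots.txt from the reverse proxy; must appear before "ProxyPass /".
  ProxyPass /robots.txt !
  # Serve a static copy from the otherwise-empty web root instead.
  Alias /robots.txt /var/www/html/robots.txt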

A search of the ArchivesSpace Technical Documentation for “robots” comes up empty as well.

Can you please direct me to any documentation that may exist on setting up a robots.txt file for a proxied HTTPS instance of ArchivesSpace?
Thank you, and please tolerate my naivety.



Bob Swanson
UConn Libraries
860-486-5260 - Office
860-617-1188 - Mobile


