[Archivesspace_Users_Group] Help with robots.txt

Tue May 21 11:54:24 EDT 2019

Look at that!
https://github.com/archivesspace/archivesspace/commit/0bfb91e7f27a18b4cb6e0a27527be1041c877237#diff-f266d24dcc6fcbe9020ee4f31cf538f7
Yep, sure looks like that'll work as well.
So it seems like the easiest way to serve up a robots file is just throw it in your config directory.
________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Andrew Morrison <andrew.morrison at bodleian.ox.ac.uk>
Sent: Tuesday, May 21, 2019 11:37 AM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Help with robots.txt

Hello,

If you put a robots.txt file in the config folder of your ArchivesSpace system, it will be served by a request for /robots.txt, after the next restart. I cannot remember where I read that, and cannot find it now, but can confirm it works, since I believe 2.6.0.

Regards,

Andrew Morrison
Software Engineer
Bodleian Digital Library Systems and Services
https://www.bodleian.ox.ac.uk/bdlss

On Tue, 2019-05-21 at 13:59 +0000, Swanson, Bob wrote:

Please forgive me if this is posted twice, I sent the following yesterday before I submitted the “acceptance Email” to the ArchivesSpace Users Group.  I don’t see where it was posted on the board (am I doing this correctly?).

So far as I can tell, this is how I’m supposed to ask questions regarding ArchivesSpace.

Please forgive and correct me if I’m going about this incorrectly.

I am new to ArchivesSpace, Ruby, JBOD and web development, so I’m pretty dumb.

The PUI Pre-Launch checklist advises creating and updating robots.txt,

So we would like to set up a robots.txt file to control what crawlers can access when they crawl our ArvhivesSpace site https://archivessearch.lib.uconn.edu/.

I understand that robots.txt is supposed to go in the web root directory of the website.

In a normal apache configuration that’s simple enough.

But,

We are serving ArchivesSpace via HTTPS.

a)       All Port 80 traffic is redirected to Port 443.

b)      443 traffic is proxied to 8081 (for the public interface) per the ArchivesSpace documentation.

  RequestHeader set X-Forwarded-Proto "https"

  ProxyPreserveHost On

  ProxyPass / http://localhost:8081/ retry=1 acquire=3000 timeout=600 Keepalive=on

  ProxyPassReverse / http://localhost:8081/

So, my web root directory (var/www/html) is empty (save some garbage left over from when I was testing).

I’ve read the documentation on www.robotstxt.org<http://www.robotstxt.org> but I can’t find anything that pertains to my situation.

I have to imagine that most ArchivesSpace sites are now https and use robots.txt, so this should be a somewhat a somewhat standard implementation.

I don not find much information on the Users Group site pertaining to this,

I find reference to plans for this being implemented at the web server level back in 2016,

But nothing beyond that.

http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2016-August/003916.html

A search of the ArchivesSpace Technical Documentation for “robots” comes up empty as well.

Can you please direct me to any documentation that may exist on setting up a robots.txt file in a proxied HTTPS instance of ArchviceSpace?

Thank you, and please tolerate my naivety.

Bob Swanson

UConn Libraries

860-486-5260 – Office

860-617-1188 - Mobile

_______________________________________________
Archivesspace_Users_Group mailing list
Archivesspace_Users_Group at lyralists.lyrasis.org<mailto:Archivesspace_Users_Group at lyralists.lyrasis.org>
http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/attachments/20190521/5d08ab12/attachment-0001.html>