[Archivesspace_Users_Group] Help with robots.txt

Andrew Morrison andrew.morrison at bodleian.ox.ac.uk
Tue May 21 14:26:58 EDT 2019


Small correction: adding robots.txt to the config folder has worked since 2.2.2, not 2.6.0.


I think documentation of this feature was lost because it was added around the same time that the tech-docs repository was being created.


Andrew.

________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Swanson, Bob <bob.swanson at uconn.edu>
Sent: 21 May 2019 17:08:33
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Help with robots.txt


Thank you so much!

I will comply.



Bob Swanson

UConn Libraries

860-486-5260 – Office

860-617-1188 - Mobile



From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> On Behalf Of Blake Carver
Sent: Tuesday, May 21, 2019 11:54 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Help with robots.txt



Look at that!

https://github.com/archivesspace/archivesspace/commit/0bfb91e7f27a18b4cb6e0a27527be1041c877237#diff-f266d24dcc6fcbe9020ee4f31cf538f7

Yep, sure looks like that'll work as well.

So it seems like the easiest way to serve up a robots file is just throw it in your config directory.

________________________________

From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Andrew Morrison <andrew.morrison at bodleian.ox.ac.uk>
Sent: Tuesday, May 21, 2019 11:37 AM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Help with robots.txt



Hello,



If you put a robots.txt file in the config folder of your ArchivesSpace system, it will be served in response to requests for /robots.txt after the next restart. I cannot remember where I read that and cannot find it now, but I can confirm that it works, since 2.6.0 I believe.
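
Before copying a robots.txt into the config folder, it can help to check that its rules do what you expect. The following is a minimal sketch using Python's standard urllib.robotparser; the sample rules and the paths tested are illustrative assumptions, not recommendations from the ArchivesSpace documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; adjust the rules to your own site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def load_rules(text):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Paths under /search are blocked for every crawler by the sample rules.
print(rules.can_fetch("*", "/search"))                      # False
# Any path not matched by a Disallow line stays crawlable.
print(rules.can_fetch("*", "/repositories/2/resources/1"))  # True
```

Once the rules behave as intended, the same text can be saved as robots.txt in the config folder.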



Regards,



Andrew Morrison

Software Engineer

Bodleian Digital Library Systems and Services

https://www.bodleian.ox.ac.uk/bdlss





On Tue, 2019-05-21 at 13:59 +0000, Swanson, Bob wrote:

Please forgive me if this is posted twice. I sent the following yesterday, before I submitted the “acceptance Email” to the ArchivesSpace Users Group, and I don’t see where it was posted on the board (am I doing this correctly?).



So far as I can tell, this is how I’m supposed to ask questions regarding ArchivesSpace.

Please forgive and correct me if I’m going about this incorrectly.



I am new to ArchivesSpace, Ruby, JBOD and web development, so I’m pretty dumb.



The PUI Pre-Launch checklist advises creating and updating robots.txt,

so we would like to set up a robots.txt file to control what crawlers can access when they crawl our ArchivesSpace site https://archivessearch.lib.uconn.edu/.

I understand that robots.txt is supposed to go in the web root directory of the website.

In a normal apache configuration that’s simple enough.



But,

We are serving ArchivesSpace via HTTPS.

a)       All Port 80 traffic is redirected to Port 443.

b)      443 traffic is proxied to 8081 (for the public interface) per the ArchivesSpace documentation.

  RequestHeader set X-Forwarded-Proto "https"

  ProxyPreserveHost On

  ProxyPass / http://localhost:8081/ retry=1 acquire=3000 timeout=600 Keepalive=on

  ProxyPassReverse / http://localhost:8081/

So, my web root directory (var/www/html) is empty (save some garbage left over from when I was testing).
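
Because the ProxyPass / rule above forwards every path, /robots.txt included, to the public interface on port 8081, a file in Apache's web root is not consulted for that request. One way to see what crawlers actually receive is to fetch the file through the proxy. This is a rough sketch using only the Python standard library; the function names are made up for illustration.

```python
import urllib.request
from urllib.parse import urljoin


def robots_url(base_url):
    """Return the robots.txt URL at the root of the given site."""
    return urljoin(base_url, "/robots.txt")


def fetch_robots(base_url, timeout=10.0):
    """Fetch robots.txt and return (HTTP status, body text).

    urlopen raises urllib.error.HTTPError if the server answers
    with an error status such as 404.
    """
    with urllib.request.urlopen(robots_url(base_url), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Example usage (performs a real network request):
#   status, body = fetch_robots("https://archivessearch.lib.uconn.edu/")
#   print(status)
#   print(body)
```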



I’ve read the documentation on www.robotstxt.org but I can’t find anything that pertains to my situation.

I have to imagine that most ArchivesSpace sites are now https and use robots.txt, so this should be a somewhat standard implementation.



I do not find much information on the Users Group site pertaining to this,

I find reference to plans for this being implemented at the web server level back in 2016,

But nothing beyond that.

http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2016-August/003916.html



A search of the ArchivesSpace Technical Documentation for “robots” comes up empty as well.



Can you please direct me to any documentation that may exist on setting up a robots.txt file in a proxied HTTPS instance of ArchivesSpace?

Thank you, and please tolerate my naivety.







Bob Swanson

UConn Libraries

860-486-5260 – Office

860-617-1188 - Mobile



_______________________________________________

Archivesspace_Users_Group mailing list

Archivesspace_Users_Group at lyralists.lyrasis.org

http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group