[Archivesspace_Users_Group] how to find if a certain page was accessed in the PUI

Andrew Morrison andrew.morrison at bodleian.ox.ac.uk
Wed Mar 25 12:13:04 EDT 2020


Note that access logs may not be viewable without superuser access, or could have been automatically transferred to another server for secure storage and analysis, if your system admins felt it necessary to comply with local privacy or data protection regulations.

Andrew.

________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Steele, Henry <Henry.Steele at tufts.edu>
Sent: 25 March 2020 13:06
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] how to find if a certain page was accessed in the PUI


Thank you, Blake.  This is all really helpful.  I will see if I can use some of these strategies to look through our logs.



It may not be there.  Our PUI log level is “fatal” and I’m not sure the Apache logs (/var/log/httpd) contain this kind of information.  But it’s certainly good to know how to look.



Henry Steele

Systems Librarian

Tufts University Library Technology Services

(617)627-5239



From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> On Behalf Of Blake Carver
Sent: Wednesday, March 25, 2020 9:01 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] how to find if a certain page was accessed in the PUI



I know quite a few people use Google Analytics, which is not something I find useful at all, but it's used quite often. Check out Matomo for an open source analytics product. There are many others. I know Matomo gives you the ability to customize things, and I bet it could be quite useful, though I've not touched it in many years.



I think your best bet is to get to know your Apache logs. You should be able to get something useful out of there, but you'll need to learn what you're logging there, and maybe change it up. Read up on Apache's "LogFormat"; it's pretty flexible and you can customize it on your server. You can also customize where log files end up for each domain name, so that might help as well. If you're running the PUI and STAFF sides on different URLs, or prefixes, that will help set them apart for logging. This is all one of those "It Depends" kinds of things. Using grep/awk/sed etc. will let you pull out different things from the logs. Try tailing the log as you look at different things on the site and see how those get logged, then work up some simple greps to pull out just what you need every day. This is a simple one I use to see the busiest sites on a server:



cat /var/log/apache2/other_vhosts_access.log.1 | awk '{print $1}' | sort | uniq -c | sort -nr | head -20

(If you're looking at that and thinking "You don't need cat in there, dummy" I know I know, old habits die hard)
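
For the original question (was one particular page requested?), a similar pipeline can count matching request lines. This is a minimal sketch, not a tested recipe for any specific server: the function name count_page_hits, the example page path and the log file name are all hypothetical, and it assumes the request is logged as "GET /path HTTP/x.y", which is what the common Apache and nginx formats produce.

```shell
# Count GET requests for one exact path in an access log.
# Assumption: the request line is logged as "GET /path HTTP/x.y".
# Requests with a query string (/path?foo=bar) are NOT counted by this sketch.
count_page_hits() {
    page="$1"   # hypothetical example: /repositories/2/resources/134
    log="$2"    # hypothetical example: /var/log/httpd/access_log
    # -F treats the pattern as a fixed string; -c prints the number of matching lines.
    # The trailing " HTTP/" stops /resources/134 from also matching /resources/1345.
    grep -F -c -- "GET $page HTTP/" "$log"
}
```

Called as `count_page_hits /repositories/2/resources/134 /var/log/httpd/access_log`, it prints a single number. grep exits with status 1 when the count is 0, which matters only if the function is used inside a script that stops on errors.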



You could do the same kind of grep work on the archivesspace.out log file and get something out of it. You might need to experiment with loglevel on that to see what you can get. DEBUG is probably way too much.



Here are some nginx log lines, based on real logs with some details changed to protect the innocent.



Here's one you might see quite often, if someone is logged into the staff side you'll see this POST to check their session:

4.4.4.4 example.edu - [25/Mar/2020:12:32:22 +0000] "POST /update_monitor/poll HTTP/1.1" 200 4751 "https://example.edu/resources/134/edit" "lock_version=12&uri=%2Frepositories%2F5%2Fresources%2F134" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:71.0) Gecko/20100101 Firefox/71.0" "-"



Here's another one, someone looking at a resource on the staff side:

4.4.4.4 example.edu - [25/Mar/2020:12:31:03 +0000] "GET /resources/2774?inline=true&undefined_id=%2Frepositories%2F3%2Fresources%2F2774 HTTP/1.1" 200 9839 "https://example.edu/resources/2774" "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36" "-"



And here's a bot crawling the public side.

216.244.66.240 example.edu - [25/Mar/2020:12:29:58 +0000] "GET /repositories/2/archival_objects/97930 HTTP/1.1" 200 21473 "-" "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)" "-"





Depending on how you configure your Apache/nginx/whatever logs, those log lines will look different and you can log a bunch of different things.
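
As a rough illustration of pulling PUI usage out of lines shaped like the three samples above, the sketch below lists the most requested public paths and skips obvious crawlers. It rests on several assumptions: the field positions match that sample layout exactly (method in field 6, path in field 7), public pages start with /repositories/ as in the bot example, and any line containing "bot", "crawl" or "spider" is a crawler. The function name top_pui_paths is made up for this example.

```shell
# Read access-log lines on stdin and print the 20 most requested PUI paths.
# Field numbers assume the space-separated sample layout:
#   $1 client IP, $2 host, $6 method (with a leading quote), $7 request path.
top_pui_paths() {
    # Drop lines that look like crawlers (a crude, case-insensitive match on the whole line).
    grep -v -i -E 'bot|crawl|spider' \
      | awk '$6 == "\"GET" && $7 ~ /^\/repositories\// {
                 sub(/\?.*$/, "", $7)   # ignore any query string
                 print $7
             }' \
      | sort | uniq -c | sort -nr | head -20
}
```

It would be fed from a log file, for example `top_pui_paths < access.log`, where the file name depends on your server setup. If your log format differs, the field numbers in the awk condition are the part to change.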



On the ArchivesSpace side (archivesspace/logs/archivesspace.out) the logs can look different depending on your log level. Here's one set to debug showing the indexer doing some work:



INFO: [collection1] webapp= path=/update params={} {add=[/repositories/2/archival_objects/33921#pui, /repositories/2/archival_objects/33922#pui, /repositories/2/archival_objects/33923#pui, /repositories/2/archival_objects/33924#pui, /repositories/2/archival_objects/33925#pui, /repositories/2/archival_objects/33926#pui, /repositories/2/archival_objects/33927#pui, /repositories/2/archival_objects/33928#pui, /repositories/2/archival_objects/33929#pui, /repositories/2/archival_objects/33930#pui, ... (25 adds)]} 0 6



Here's one line from me viewing a resource on the staff side, as you can see it'll be a bit more challenging to get useful stuff out of this log, but it's in there:



[2020-03-25T08:45:29-04:00] INFO: [collection1] webapp= path=/select params={facet.field=assessment_record_types&facet.field=assessment_surveyors&facet.field=assessment_review_required&facet.field=assessment_reviewers&facet.field=assessment_completed&facet.field=assessment_inactive&facet.field=assessment_survey_year&facet.field=assessment_sensitive_material&csv.escape=\&start=0&q.op=AND&fq=repository:"/repositories/3"+OR+repository:global&fq=types:("assessment")&fq=(-types:("pui_only")+AND+(assessment_record_uris:("\/repositories\/3\/resources\/406")))&fq=-exclude_by_default:true&sort=&rows=30&bq=primary_type:resource^100&q=*:*&facet.limit=20&defType=edismax&qf=four_part_id^3+title^2+finding_aid_filing_title^2+fullrecord&pf=four_part_id^4&csv.header=true&csv.encapsulator="&facet.mincount=0&wt=json&facet=true} hits=0 status=0 QTime=61
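
One way to start digging a specific record out of this log is to search for its URI. The sketch below is only a starting point under stated assumptions: the function name find_record_lines is hypothetical, and the only thing it relies on is what the two samples above show, namely that record URIs appear either plain or with backslash-escaped slashes. It cannot tell staff, PUI and indexer activity apart by itself.

```shell
# Print archivesspace.out lines that mention one record URI.
find_record_lines() {
    uri="$1"   # e.g. /repositories/3/resources/406
    log="$2"   # e.g. archivesspace/logs/archivesspace.out
    # Some lines escape slashes as \/ (see the Solr query above), so strip all
    # backslashes first; the printed lines therefore have them removed too.
    # -F matches a fixed string, so /resources/406 also matches /resources/4061.
    tr -d '\\' < "$log" | grep -F -- "$uri"
}
```

For example, `find_record_lines /repositories/3/resources/406 archivesspace/logs/archivesspace.out` would print the Solr query line shown above. What else shows up depends on your log level.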



________________________________

From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Steele, Henry <Henry.Steele at tufts.edu>
Sent: Wednesday, March 25, 2020 7:54 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: [Archivesspace_Users_Group] how to find if a certain page was accessed in the PUI



Good morning,



We recently made our PUI open to the public and we are trying to find out about usage, particularly of a certain page within our repository.  I’m trying to figure out if there’s any way to see this in the logs.



I’ve looked in the application log archivesspace.out, but I’m not sure what I’m seeing here.  I see records being accessed, with a response of 200, but I don’t know if this is the staff interface, the PUI, or if it’s some indexing activity.  Is there a way in the application log to see if a certain page has been accessed in the PUI?  We have our log level set to “fatal” for the PUI, and the “pui_log” is default.  I know this should mean the log only reports on problematic events, but since I see a lot of activity in the log, I’m wondering if this setting doesn’t actually have any effect.



Alternately, does anyone know if there might be other server logs that would be of use?   I’m looking in the Apache logs at /var/log/httpd but I’m not sure which of these logs would contain such information, if any.



Any information you had would be of great help.  Thanks



If this isn’t



Henry Steele

Systems Librarian

Tufts University Library Technology Services

(617)627-5239

