[Archivesspace_Users_Group] Checking for Broken URLs in Resources

Thu Feb 11 05:12:10 EST 2021

Another approach would be to modify the Solr schema and the
ArchivesSpace indexer to set up and populate an index field just for
URLs. A script could then query Solr for the list of URLs and send
HTTP requests to test them, as any other link checker would. That
would have the added bonus of giving staff a dedicated option in the
advanced search for finding records by the URLs they contain (or * to
get all records with external links). This is something I have been
working on, but it is currently on the back burner.
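
As a rough illustration of the query-then-check idea, here is a minimal
Python sketch using only the standard library. The field name
"external_urls" and the Solr base URL are hypothetical, since no such
field exists until the schema and indexer are modified. Only the
standard Solr /select parameters (q, fl, wt, rows, start) are assumed.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical field name: it only exists once the schema/indexer are changed.
URL_FIELD = "external_urls"


def build_select_url(solr_base, rows=1000, start=0):
    """Build a /select URL asking for every record that has the URL field."""
    params = {
        "q": f"{URL_FIELD}:*",       # any record with at least one indexed URL
        "fl": f"id,{URL_FIELD}",     # return only the record id and its URLs
        "wt": "json",
        "rows": rows,
        "start": start,
    }
    return f"{solr_base.rstrip('/')}/select?{urllib.parse.urlencode(params)}"


def urls_from_response(payload):
    """Flatten a parsed Solr JSON response into (record id, url) pairs."""
    pairs = []
    for doc in payload["response"]["docs"]:
        values = doc.get(URL_FIELD, [])
        if isinstance(values, str):  # single-valued field
            values = [values]
        pairs.extend((doc["id"], value) for value in values)
    return pairs


def fetch_url_pairs(solr_base):
    """Query Solr once and return the (record id, url) pairs to be checked."""
    with urllib.request.urlopen(build_select_url(solr_base), timeout=30) as resp:
        return urls_from_response(json.load(resp))
```

Each pair returned by fetch_url_pairs() could then be handed to whatever
link checker is already in use. A large index would need paging through
"start" and "rows", which this sketch leaves out.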

Andrew.

On 10/02/2021 15:44, Corey Schmidt wrote:
> Kevin,
>
> I'd be very interested to see your code, especially how it handles
> 301 and 302 redirects. In my initial runs, redirects have stalled the
> status code response and kept my program from moving forward.
> Getting the status code of the link being redirected to would be a big
> help.
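
One way to keep redirects from stalling a checker is to follow them one
hop at a time, with a timeout and a hop limit. The sketch below is not
Kevin's code; it is a standard-library illustration in which the
function names, the 10-second timeout, and the 5-hop limit are all
assumptions.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_once(url, timeout=10.0):
    """Send one HEAD request; return (status, Location header).

    Status is None when no HTTP response arrives at all.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")
    except OSError:  # DNS failure, refused connection, timeout
        return None, None


def follow_redirects(url, fetch=fetch_once, max_hops=5):
    """Return [(url, status), ...] for the original URL and each redirect hop."""
    chain = []
    current = url
    for _ in range(max_hops + 1):
        status, location = fetch(current)
        chain.append((current, status))
        if status in REDIRECT_CODES and location:
            current = urljoin(current, location)  # Location may be relative
            continue
        break
    return chain
```

The last entry in the returned chain carries the status code of the
final target, and the hop limit stops redirect loops. Some servers
reject HEAD requests, so a fallback to GET may be needed in practice.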
>
> I should also mention that any solution I make needs to be replicable 
> to other faculty/staff in our library - so access to the database may 
> not be consistent in the future. I'm thinking of a scenario where 5 
> years from now, some people may want to run this report again but I 
> may not be working for the library anymore. If that's the case, 
> exporting from ArchivesSpace might be the better long-term,
> no-fuss solution.
>
> Corey
> ------------------------------------------------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org 
> <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of 
> Kevin W. Schlottmann <kws2126 at columbia.edu>
> *Sent:* Wednesday, February 10, 2021 10:27 AM
> *To:* Archivesspace Users Group 
> <archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] Checking for Broken URLs in 
> Resources
>
> Hi Corey,
>
> Earlier this year we did something similar. We started by extracting
> every xlink:href from the EAD corpus using XSLT. (EAD is our go-to data
> source because we always have an up-to-date set, which we use to
> publish our finding aids automatically.) We did not do an additional
> regex search for non-encoded URLs, but that's not a bad idea. The
> result was over 15,000 links. Each was recorded with its bibid,
> container id (if any), and link text and title (if any). To check
> them, my colleague ran them through a Python script that got the
> response code and, if the response was 301/302, retrieved the redirect
> location and the secondary response. This produced some interesting
> results and led to a fair amount of remediation work.
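
The extraction step described above used XSLT. As an illustration
only, the same idea can be sketched in Python with the standard
library's ElementTree. The function name and the output fields are
assumptions, and the fallback to a plain "href" attribute is a guess at
what DTD-based EAD files would need.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def extract_links(ead_xml):
    """Return one dict per element that carries an xlink:href (or plain href)."""
    root = ET.fromstring(ead_xml)
    links = []
    for elem in root.iter():
        # Schema-based EAD uses xlink:href; the plain attribute is a fallback.
        href = elem.get(f"{{{XLINK_NS}}}href") or elem.get("href")
        if not href:
            continue
        links.append({
            "tag": elem.tag.rsplit("}", 1)[-1],  # drop the namespace prefix
            "href": href,
            "title": elem.get(f"{{{XLINK_NS}}}title") or elem.get("title"),
            "text": "".join(elem.itertext()).strip(),
        })
    return links
```

Recording the bibid and container id alongside each link, as described
above, would need collection-specific logic that is not shown here.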
>
> If this sounds on point, I can try to find and share the code we used.
>
> Kevin
>
> On Wed, Feb 10, 2021 at 9:34 AM Corey Schmidt <Corey.Schmidt at uga.edu 
> <mailto:Corey.Schmidt at uga.edu>> wrote:
>
>     Nancy,
>
>     I have access to our staging database, but not production. I'm not
>     sure our sysadmins will allow me to play around in the prod
>     database, unless they can give me read-only access. Pulling the
>     file_uri values from file_version would be much more efficient.
>     However, I'm not just looking to check digital object links, but
>     also any links found within collection- and archival-object-level
>     notes, whether pasted straight into the text of the notes or linked
>     using the <extref> tag. I could probably query the database for
>     that info too.
>
>     Corey
>     ------------------------------------------------------------------------
>     *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org>
>     <archivesspace_users_group-bounces at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org>>
>     on behalf of Kennedy, Nancy <KennedyN at si.edu <mailto:KennedyN at si.edu>>
>     *Sent:* Wednesday, February 10, 2021 9:18 AM
>     *To:* Archivesspace Users Group
>     <archivesspace_users_group at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group at lyralists.lyrasis.org>>
>     *Subject:* Re: [Archivesspace_Users_Group] Checking for Broken
>     URLs in Resources
>
>     Hi Corey –
>
>     Do you have access to query the database, as a starting point,
>     instead of EAD?  We were able to pull the file_uri values from the
>     file_version table in the database.  Our sysadmin then checked the
>     response codes for that list of URIs, and we referred issues out to
>     staff working on those collections.  Some corrections can be made
>     directly by staff, or for long lists, you could include the
>     digital_object id and post updates that way.
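
A minimal sketch of the database route, for illustration: it pulls
file_uri values from file_version together with the digital object id so
that problems can be traced back to a record. The digital_object_id
column name is an assumption about the schema, and an in-memory SQLite
table stands in for the real ArchivesSpace database, which would need
its own driver and read-only credentials.

```python
import sqlite3  # only used for the stand-in demo database below


def fetch_file_uris(conn):
    """Return (digital_object_id, file_uri) pairs for web-style URIs.

    `conn` is any DB-API connection; the column names are assumptions.
    """
    cursor = conn.cursor()
    cursor.execute(
        "SELECT digital_object_id, file_uri FROM file_version "
        "WHERE file_uri LIKE 'http%' "
        "ORDER BY digital_object_id, file_uri"
    )
    return [(row[0], row[1]) for row in cursor.fetchall()]


def demo_connection():
    """Build a tiny stand-in table so the query can be tried without a server."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_version ("
        "id INTEGER PRIMARY KEY, digital_object_id INTEGER, file_uri TEXT)"
    )
    conn.executemany(
        "INSERT INTO file_version (digital_object_id, file_uri) VALUES (?, ?)",
        [
            (7, "https://example.org/obj/7.jpg"),
            (3, "http://example.org/obj/3.pdf"),
            (9, "file:///local/only.tif"),  # skipped: not an http(s) URI
        ],
    )
    return conn
```

Calling fetch_file_uris(demo_connection()) returns the two http(s) rows
ordered by digital object id, ready to be passed to a status checker.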
>
>     Nancy
>
>     *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org>
>     <archivesspace_users_group-bounces at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org>>
>     *On Behalf Of *Corey Schmidt
>     *Sent:* Wednesday, February 10, 2021 8:45 AM
>     *To:* archivesspace_users_group at lyralists.lyrasis.org
>     <mailto:archivesspace_users_group at lyralists.lyrasis.org>
>     *Subject:* [Archivesspace_Users_Group] Checking for Broken URLs in
>     Resources
>
>
>     Dear all,
>
>
>     Hello, this is Corey Schmidt, ArchivesSpace PM at the University
>     of Georgia. I hope everyone is doing well and staying safe and
>     healthy.
>
>     Would anyone know of any script, plugin, or tool to check for
>     invalid URLs within resources? We are investigating how to grab
>     URLs from exported EAD.xml files and check them to determine if
>     they throw back any sort of error (404s mostly, but also any
>     others). My thinking is to build a small app that will export
>     EAD.xml files from ArchivesSpace, then parse the raw XML with
>     Python's lxml package and catch any URLs using regex. After
>     capturing a URL, it would use the requests library to check
>     its status code and, if that is an error, log it to a CSV
>     output file that acts as a "report" of all the broken
>     links within that resource.
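
The regex and CSV parts of the plan above could look roughly like the
sketch below. It uses only the standard library instead of lxml and
requests, and the URL pattern, function names, and report columns are
assumptions. Namespace declarations are stripped first because a regex
over raw XML would otherwise report them as links.

```python
import csv
import html
import io
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")
XMLNS_RE = re.compile(r"\sxmlns(?::\w+)?=\"[^\"]*\"")


def find_urls(raw_xml):
    """Return the sorted, de-duplicated URLs found anywhere in the raw XML."""
    without_ns = XMLNS_RE.sub("", raw_xml)  # namespace URIs are not links
    found = {
        html.unescape(match).rstrip(".,;)")  # undo &amp; and trim punctuation
        for match in URL_RE.findall(without_ns)
    }
    return sorted(found)


def write_report(results, out):
    """Write only the failing (resource, url, status) rows as CSV to `out`."""
    writer = csv.writer(out)
    writer.writerow(["resource", "url", "status"])
    for resource, url, status in results:
        if status is None or status >= 400:
            label = "no response" if status is None else status
            writer.writerow([resource, url, label])


def report_as_text(results):
    """Convenience wrapper that returns the CSV report as a string."""
    buffer = io.StringIO()
    write_report(results, buffer)
    return buffer.getvalue()
```

A status of None here stands for a URL that gave no HTTP response at
all. Other URL-valued attributes such as xsi:schemaLocation would still
be picked up by the pattern and may need filtering.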
>
>     The problems with this method are: 1. Exporting 1000s of resources
>     takes a lot of time and some processing power, as well as a
>     moderate amount of local storage space. 2. Even checking the raw
>     xml file takes a considerable amount of time. The app I'm working
>     on takes overnight to export and check all the xml files. I was
>     considering pinging the API for different parts of a resource, but
>     I figured that would take as much time as just exporting an
>     EAD.xml and would be even more complex to write. I've checked
>     Awesome ArchivesSpace, this listserv, and a few script libraries
>     from institutions, but haven't found exactly what I am looking for.
>
>     Any info or advice would be greatly appreciated! Thanks!
>
>     Sincerely,
>
>     Corey
>
>     Corey Schmidt
>
>     ArchivesSpace Project Manager
>
>     University of Georgia Special Collections Libraries
>
>     /Email:/Corey.Schmidt at uga.edu <mailto:Corey.Schmidt at uga.edu>
>
>     _______________________________________________
>     Archivesspace_Users_Group mailing list
>     Archivesspace_Users_Group at lyralists.lyrasis.org
>     <mailto:Archivesspace_Users_Group at lyralists.lyrasis.org>
>     http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>     <http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group>
>
>
>
> -- 
> Kevin Schlottmann
> Interim Director and Head of Archives Processing
> Rare Book & Manuscript Library
> Butler Library, Room 801
> Columbia University
> 535 W. 114th St., New York, NY  10027
> (212) 854-8483
>