[Archivesspace_Users_Group] Checking for Broken URLs in Resources

Kevin W. Schlottmann kws2126 at columbia.edu
Wed Feb 10 10:27:09 EST 2021

Hi Corey,

We did something similar earlier this year. We started by extracting every
xlink:href attribute from our EAD corpus using XSLT. (EAD is our go-to data
source because we always have a current set, which we use to publish our
finding aids automatically.) We did not do an additional regex search for
non-encoded URLs, but that's not a bad idea. The result was over 15,000
links. Each was recorded with its bibid, container id (if any), and link
text and title (if any). To check them, my colleague ran them through a
Python script that got the response code and, if the response was 301/302,
retrieved the redirect location and the response from that location. This
produced some interesting results and left us with a fair amount of
remediation work.
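
For illustration only, the extraction step could be sketched in Python
instead of XSLT. This is not the stylesheet described above. It assumes
schema-based EAD 2002 in which links carry an xlink:href attribute in the
standard XLink namespace, and it treats the id of the nearest enclosing
<c> or <c01>..<c12> element as the "container id". The function name, the
record keys, the sample document, and the bibid value are all hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"  # standard XLink namespace


def _local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def _is_component(tag):
    """True for EAD component elements: <c> or numbered <c01>..<c12>."""
    name = _local_name(tag)
    return name == "c" or (len(name) == 3 and name[0] == "c" and name[1:].isdigit())


def extract_links(ead_xml, bibid):
    """Return one dict per element that carries an xlink:href attribute."""
    records = []

    def walk(elem, container_id):
        # Remember the id of the nearest enclosing component, if any.
        if _is_component(elem.tag):
            container_id = elem.get("id", container_id)
        href = elem.get(XLINK + "href")
        if href:
            records.append({
                "bibid": bibid,
                "container_id": container_id,
                "href": href,
                # Collapse whitespace in the link text.
                "text": " ".join("".join(elem.itertext()).split()),
                "title": elem.get(XLINK + "title", ""),
            })
        for child in elem:
            walk(child, container_id)

    walk(ET.fromstring(ead_xml), None)
    return records


# Hypothetical sample document, for demonstration only.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <archdesc><dsc>
    <c01 id="ref12">
      <did><unittitle>Letters</unittitle>
        <dao xlink:href="https://example.org/object/1" xlink:title="Scan"/>
      </did>
      <scopecontent><p>See <extref
        xlink:href="http://example.org/guide">the guide</extref>.</p>
      </scopecontent>
    </c01>
  </dsc></archdesc>
</ead>"""

if __name__ == "__main__":
    for record in extract_links(SAMPLE, bibid="0000001"):
        print(record)
```

A sketch like this only finds encoded links. URLs pasted as plain text into
notes would still need a separate regex pass over the text.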

If this sounds on point, I can try to find and share the code we used.
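
In the meantime, the checking step could look roughly like the sketch below.
It is not the script referred to here, only a minimal standard-library
illustration under a few assumptions: a HEAD request is enough (some
servers reject HEAD, so a GET fallback may be needed), only one redirect
hop is followed, and the names fetch_status and check_link are made up for
this example.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx code stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def fetch_status(url, timeout=10):
    """Return (status_code, location_header); status is None on failure."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        # Covers 3xx (not followed), 4xx and 5xx responses.
        return err.code, err.headers.get("Location")
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None, None


def check_link(url, fetch=fetch_status):
    """Check one URL; for a 301/302, also check the redirect target."""
    status, location = fetch(url)
    result = {
        "url": url,
        "status": status,
        "redirect_to": None,
        "redirect_status": None,
    }
    if status in (301, 302) and location:
        target = urljoin(url, location)  # Location may be relative
        result["redirect_to"] = target
        result["redirect_status"], _ = fetch(target)
    return result


if __name__ == "__main__":
    # A stand-in for the network so the logic can be tried offline.
    pages = {
        "http://a.example/old": (301, "/new"),
        "http://a.example/new": (200, None),
    }
    print(check_link("http://a.example/old", fetch=lambda u: pages[u]))
```

Passing the fetch function in as a parameter keeps the redirect logic
testable without hitting real servers. For 15,000 links a real run would
also want rate limiting and a CSV writer for the results.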


On Wed, Feb 10, 2021 at 9:34 AM Corey Schmidt <Corey.Schmidt at uga.edu> wrote:

> Nancy,
> I have access to our staging database, but not production. I'm not sure
> our sysadmins will allow me to play around in the prod database, unless
> they can give me read-only access, maybe? Pulling the file_uri values for
> file_version would be much more efficient. However, I'm not just looking to
> check digital object links, but also any links found within collection and
> archival object level notes, either copied straight into the text of the
> notes or linked using the <extref> tag. I could probably query the database
> for that info too.
> Corey
> ------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org <
> archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of
> Kennedy, Nancy <KennedyN at si.edu>
> *Sent:* Wednesday, February 10, 2021 9:18 AM
> *To:* Archivesspace Users Group <
> archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] Checking for Broken URLs in
> Resources
> Hi Corey –
> Do you have access to query the database, as a starting point, instead of
> EAD?  We were able to pull the file_uri values from the file_version table
> in the database.  Our sysadmin then checked the response codes for that
> list of URIs, and we referred issues out to staff working on those
> collections.  Some corrections can be made directly by staff, or for long
> lists, you could include the digital_object id and post updates that way.
> Nancy
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org <
> archivesspace_users_group-bounces at lyralists.lyrasis.org> *On Behalf Of *Corey
> Schmidt
> *Sent:* Wednesday, February 10, 2021 8:45 AM
> *To:* archivesspace_users_group at lyralists.lyrasis.org
> *Subject:* [Archivesspace_Users_Group] Checking for Broken URLs in
> Resources
> Dear all,
> Hello, this is Corey Schmidt, ArchivesSpace PM at the University of
> Georgia. I hope everyone is doing well and staying safe and healthy.
> Would anyone know of any script, plugin, or tool to check for invalid URLs
> within resources? We are investigating how to grab URLs from exported
> EAD.xml files and check them to determine if they throw back any sort of
> error (404s mostly, but also any others). My thinking is to build a small
> app that will export EAD.xml files from ArchivesSpace, then sift through
> the raw XML using Python's lxml package to catch any URLs using regex.
> After capturing the URL, it would then use the requests library to check
> the status code of the URL and if it returns an error, log that error in a
> .CSV output file to act as a "report" of all the broken links within that
> resource.
> The problems with this method are: 1. Exporting 1000s of resources takes a
> lot of time and some processing power, as well as a moderate amount of
> local storage space. 2. Even checking the raw XML files takes a considerable
> amount of time. The app I'm working on runs overnight to export and check
> all the XML files. I was considering pinging the API for different parts of
> a resource, but I figured that would take as much time as just exporting an
> EAD.xml and would be even more complex to write. I've checked Awesome
> ArchivesSpace, this listserv, and a few script libraries from institutions,
> but haven't found exactly what I am looking for.
> Any info or advice would be greatly appreciated! Thanks!
> Sincerely,
> Corey
> Corey Schmidt
> ArchivesSpace Project Manager
> University of Georgia Special Collections Libraries
> *Email:* Corey.Schmidt at uga.edu

Kevin Schlottmann
Interim Director and Head of Archives Processing
Rare Book & Manuscript Library
Butler Library, Room 801
Columbia University
535 W. 114th St., New York, NY  10027
(212) 854-8483