[Archivesspace_Users_Group] Checking for Broken URLs in Resources
Andrew Morrison
andrew.morrison at bodleian.ox.ac.uk
Thu Feb 11 05:12:10 EST 2021
Another approach would be to modify the Solr schema and the
ArchivesSpace indexer to add and populate an index field just for
URLs. A script could then query Solr for the list of URLs and send
HTTP requests to test them, as any other link checker would. That
would have the added bonus of giving staff a dedicated option in
the advanced search for finding records by the URLs they contain (or
using * to find all records with external links). This is something I
have been working on, but it is currently on the back burner.
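
A minimal sketch of the link-checking step, for illustration only; it is not
code from anyone in this thread. It assumes the URLs have already been
collected (from Solr, the database, or EAD) and uses only the Python standard
library. The names fetch_once and check_url are hypothetical. Redirects are
followed by hand so that the status of each hop, including the target of a
301/302, is recorded, and a timeout keeps a slow server from stalling the run.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_once(url, timeout=10):
    """Send one HEAD request; return (status code, Location header or None).

    Returns (None, None) when no HTTP response was received at all.
    Caveat: some servers reject HEAD; a GET fallback is left out here.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx, 4xx and 5xx responses arrive here.
        return err.code, err.headers.get("Location")
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None, None


def check_url(url, fetch=fetch_once, max_hops=5):
    """Follow redirects manually; return a list of (url, status) per hop."""
    hops = []
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        hops.append((url, status))
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        break
    return hops
```

The last entry in the returned list is the final response, so a report could
flag any URL whose last status is None, 400 or above, or still a redirect
after max_hops.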
Andrew.
On 10/02/2021 15:44, Corey Schmidt wrote:
> Kevin,
>
> I'd be very interested to get your code, especially with 301 and 302
> redirects. My initial runs have resulted in redirects stalling the
> status code response, preventing my program from moving forward.
> Getting the status code of the link being redirected to would be a big
> help.
>
> I should also mention that any solution I make needs to be replicable
> to other faculty/staff in our library - so access to the database may
> not be consistent in the future. I'm thinking of a scenario where 5
> years from now, some people may want to run this report again but I
> may not be working for the library anymore. If that's the case,
> working to export from ArchivesSpace might be the better long-term,
> no-fuss solution.
>
> Corey
> ------------------------------------------------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org on behalf of
> Kevin W. Schlottmann <kws2126 at columbia.edu>
> *Sent:* Wednesday, February 10, 2021 10:27 AM
> *To:* Archivesspace Users Group
> <archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] Checking for Broken URLs in
> Resources
>
> Hi Corey,
>
> Earlier this year we did something similar. We started by extracting
> all xlink:href values from the EAD corpus using XSLT. (EAD is our go-to data
> source since we always have a constantly updated set, as we use them
> to automatically publish our finding aids.) We did not do an
> additional regex search for non-encoded URLs, but that's not a bad
> idea. The result was over 15,000 links. Each was recorded with
> bibid, container id (if any), and link text and title (if any). To
> check them, my colleague ran them through a Python script to get the
> response code, and if the response was 301/302, retrieve the redirect
> location and secondary response. This produced some interesting
> results, and resulted in a fair amount of remediation work to do.
>
> If this sounds on point, I can try to find and share the code we used.
>
> Kevin
>
> On Wed, Feb 10, 2021 at 9:34 AM Corey Schmidt <Corey.Schmidt at uga.edu> wrote:
>
> Nancy,
>
> I have access to our staging database, but not production. I'm not
> sure our sysadmins will allow me to play around in the prod
> database, unless perhaps they can give me read-only access. Pulling the
> file_uri values for file_version would be much more efficient.
> However, I'm not just looking to check digital object links, but
> also any links found within collection and archival object level
> notes, either copied straight into the text of the notes or linked
> using the <extref> tag. I could probably query the database for
> that info too.
>
> Corey
> ------------------------------------------------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org
> on behalf of Kennedy, Nancy <KennedyN at si.edu>
> *Sent:* Wednesday, February 10, 2021 9:18 AM
> *To:* Archivesspace Users Group
> <archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] Checking for Broken
> URLs in Resources
>
> Hi Corey –
>
> Do you have access to query the database, as a starting point,
> instead of EAD? We were able to pull the file_uri values from the
> file_version table in the database. Our sysadmin then checked the
> response codes for that list of URIs, and we referred issues out to
> staff working on those collections. Some corrections can be made
> directly by staff, or for long lists, you could include the
> digital_object id and post updates that way.
>
> Nancy
>
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org
> *On Behalf Of* Corey Schmidt
> *Sent:* Wednesday, February 10, 2021 8:45 AM
> *To:* archivesspace_users_group at lyralists.lyrasis.org
> *Subject:* [Archivesspace_Users_Group] Checking for Broken URLs in
> Resources
>
> Dear all,
>
>
> Hello, this is Corey Schmidt, ArchivesSpace PM at the University
> of Georgia. I hope everyone is doing well and staying safe and
> healthy.
>
> Would anyone know of any script, plugin, or tool to check for
> invalid URLs within resources? We are investigating how to grab
> URLs from exported EAD.xml files and check them to determine if
> they throw back any sort of error (404s mostly, but also any
> others). My thinking is to build a small app that will export
> EAD.xml files from ArchivesSpace, then sift through the raw xml
> using Python's lxml package and a regex to catch any URLs. After
> capturing the URL, it would then use the requests library to check
> the status code of the URL and if it returns an error, log that
> error in a .CSV output file to act as a "report" of all the broken
> links within that resource.
>
> The problems with this method are: 1. Exporting 1000s of resources
> takes a lot of time and some processing power, as well as a
> moderate amount of local storage space. 2. Even checking the raw
> xml file takes a considerable amount of time. The app I'm working
> on takes overnight to export and check all the xml files. I was
> considering pinging the API for different parts of a resource, but
> I figured that would take as much time as just exporting an
> EAD.xml and would be even more complex to write. I've checked
> Awesome ArchivesSpace, this listserv, and a few script libraries
> from institutions, but haven't found exactly what I am looking for.
>
> Any info or advice would be greatly appreciated! Thanks!
>
> Sincerely,
>
> Corey
>
> Corey Schmidt
>
> ArchivesSpace Project Manager
>
> University of Georgia Special Collections Libraries
>
> /Email:/ Corey.Schmidt at uga.edu
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> <mailto:Archivesspace_Users_Group at lyralists.lyrasis.org>
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
> <http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group>
>
>
>
> --
> Kevin Schlottmann
> Interim Director and Head of Archives Processing
> Rare Book & Manuscript Library
> Butler Library, Room 801
> Columbia University
> 535 W. 114th St., New York, NY 10027
> (212) 854-8483
>