<div dir="ltr"><div dir="ltr">Corey,<div><br></div><div>The process Kevin mentioned is in our repo here:</div><div><br></div><div><a href="https://github.com/cul/rbml-archivesspace/tree/master/ead_link_checker">https://github.com/cul/rbml-archivesspace/tree/master/ead_link_checker</a><br></div><div><br></div><div>I think this is a few steps short of what you have in mind, but maybe it will give you some code snippets to adapt. As Kevin mentioned, we have our entire corpus exported to EAD daily to back our publishing platform, so it makes an easy source to mine with XSLT. Given a folder of EAD, the XSLT pulls out the xlink info, and the Python script does status checks and reports the results to a spreadsheet. As we have many links to DOIs and resolvers, the redirect locations and status checks were useful. We did this as a one-time audit rather than a continuous monitor, but I could see how one might automate a link checker to report problems as they come up.</div><div><br></div><div>I'd be happy to discuss further if it is helpful. Best of luck!</div><div><br></div><div>David</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Feb 10, 2021 at 10:44 AM Corey Schmidt <<a href="mailto:Corey.Schmidt@uga.edu">Corey.Schmidt@uga.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
Kevin,<br>
<br>
I'd be very interested to get your code, especially the handling of 301 and 302 redirects. In my initial runs, redirects stalled the status check and prevented my program from moving forward. Getting the status code of the link being redirected to
would be a big help.<br>
<br>
I should also mention that any solution I build needs to be reproducible by other faculty/staff in our library, so access to the database may not be consistent in the future. I'm thinking of a scenario where, 5 years from now, some people may want to run this
report again but I may not be working for the library anymore. If that's the case, working from ArchivesSpace exports might be the better long-term, no-fuss solution.<br>
<br>
Corey<br>
</div>
<div id="gmail-m_-3600102994615128713appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-3600102994615128713divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> <a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">archivesspace_users_group-bounces@lyralists.lyrasis.org</a> <<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>> on behalf of Kevin W. Schlottmann
<<a href="mailto:kws2126@columbia.edu" target="_blank">kws2126@columbia.edu</a>><br>
<b>Sent:</b> Wednesday, February 10, 2021 10:27 AM<br>
<b>To:</b> Archivesspace Users Group <<a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank">archivesspace_users_group@lyralists.lyrasis.org</a>><br>
<b>Subject:</b> Re: [Archivesspace_Users_Group] Checking for Broken URLs in Resources</font>
<div> </div>
</div>
<div>
<div>
<div dir="ltr">
<div>Hi Corey, <br>
</div>
<div><br>
</div>
<div>Earlier this year we did something similar. We started by extracting every xlink:href value from the EAD corpus using XSLT. (EAD is our go-to data source since we always have a constantly updated set, as we use the files to automatically publish our finding aids.)
We did not do an additional regex search for non-encoded URLs, but that's not a bad idea. The result was over 15,000 links. Each was recorded with bibid, container id (if any), and link text and title (if any). To check them, my colleague ran them through
a Python script to get the response code and, if the response was 301/302, retrieve the redirect location and secondary response. This produced some interesting results and a fair amount of remediation work.
<br>
</div>
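<div><br>
</div>
<div>A minimal sketch of that kind of check, not the script in our repo: it uses only the Python standard library, never follows redirects automatically, and sets a timeout so a slow or looping redirect cannot stall the run. The names head_status and check_link, the result keys, and the 10-second timeout are assumptions made for illustration.</div>

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_status(url, timeout=10):
    """Send one HEAD request and return (status, Location header).

    Redirects are NOT followed. Returns (None, None) if the request fails.
    The timeout value is an illustrative assumption.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Some servers reject HEAD; a GET fallback could be added here.
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    except (OSError, http.client.HTTPException):
        return None, None
    finally:
        conn.close()


def check_link(url, fetch=head_status):
    """Record the first response and, for 301/302, the redirect target's response."""
    status, location = fetch(url)
    result = {
        "url": url,
        "status": status,
        "redirect_to": None,
        "redirect_status": None,
    }
    if status in (301, 302) and location:
        # Location may be relative, so resolve it against the original URL.
        target = urljoin(url, location)
        result["redirect_to"] = target
        result["redirect_status"], _ = fetch(target)
    return result
```

<div>Only one redirect hop is checked here, matching the "redirect location and secondary response" described above. The fetch parameter is there so the logic can be exercised with a stand-in function instead of live requests.</div>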
<div><br>
</div>
<div>If this sounds on point, I can try to find and share the code we used.<br>
</div>
<div><br>
</div>
<div>Kevin<br>
</div>
</div>
<br>
<div>
<div dir="ltr">On Wed, Feb 10, 2021 at 9:34 AM Corey Schmidt <<a href="mailto:Corey.Schmidt@uga.edu" target="_blank">Corey.Schmidt@uga.edu</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
Nancy,<br>
<br>
I have access to our staging database, but not production. I'm not sure our sysadmins will allow me to play around in the prod database, unless they can give me read-only access. Pulling the file_uri values for file_version would be much more efficient. However,
I'm not just looking to check digital object links, but also any links found within collection and archival object level notes, either copied straight into the text of the notes or linked using the &lt;extref&gt; tag. I could probably query the database for that
info too.<br>
<br>
Corey<br>
</div>
<div id="gmail-m_-3600102994615128713x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738appendonsend">
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-3600102994615128713x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738divRplyFwdMsg" dir="ltr">
<font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b>
<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">
archivesspace_users_group-bounces@lyralists.lyrasis.org</a> <<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>> on behalf of Kennedy, Nancy <<a href="mailto:KennedyN@si.edu" target="_blank">KennedyN@si.edu</a>><br>
<b>Sent:</b> Wednesday, February 10, 2021 9:18 AM<br>
<b>To:</b> Archivesspace Users Group <<a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank">archivesspace_users_group@lyralists.lyrasis.org</a>><br>
<b>Subject:</b> Re: [Archivesspace_Users_Group] Checking for Broken URLs in Resources</font>
<div> </div>
</div>
<div lang="EN-US">
<div>
<div>
<p>Hi Corey – </p>
<p>Do you have access to query the database, as a starting point, instead of EAD? We were able to pull the file_uri values from the file_version table in the database. Our sysadmin then checked the response codes for that list of URIs, and we referred issues
out to staff working on those collections. Some corrections can be made directly by staff, or for long lists, you could include the digital_object id and post updates that way.</p>
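<p>A rough sketch of that kind of query, not the one we actually ran: it assumes the file_version table has digital_object_id and file_uri columns, and it uses an in-memory SQLite table as a stand-in so the example runs anywhere. Against a real ArchivesSpace database you would pass in a read-only MySQL connection instead. The function names are made up for illustration.</p>

```python
import sqlite3

# Column names are an assumption about the file_version table; verify them
# against your own ArchivesSpace schema before relying on this query.
QUERY = (
    "SELECT digital_object_id, file_uri "
    "FROM file_version "
    "WHERE file_uri LIKE 'http%' "
    "ORDER BY digital_object_id"
)


def list_file_uris(conn):
    """Return (digital_object_id, file_uri) rows for web-addressable file versions."""
    cursor = conn.cursor()
    cursor.execute(QUERY)
    return cursor.fetchall()


def demo_connection():
    """Build a tiny in-memory stand-in for the real table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_version ("
        "id INTEGER PRIMARY KEY, digital_object_id INTEGER, file_uri TEXT)"
    )
    conn.executemany(
        "INSERT INTO file_version (digital_object_id, file_uri) VALUES (?, ?)",
        [
            (2, "https://example.org/b"),
            (1, "http://example.org/a"),
            (3, "local/path.tif"),
        ],
    )
    return conn


if __name__ == "__main__":
    for object_id, uri in list_file_uris(demo_connection()):
        print(object_id, uri)
```

<p>Keeping the digital_object id next to each URI makes it easier to route problems back to the right record.</p>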
<p> </p>
<p>Nancy</p>
<p> </p>
<p> </p>
<div>
<div style="border-color:rgb(225,225,225) currentcolor currentcolor;border-style:solid none none;border-width:1pt medium medium;padding:3pt 0in 0in">
<p><b>From:</b> <a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">
archivesspace_users_group-bounces@lyralists.lyrasis.org</a> <<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>>
<b>On Behalf Of </b>Corey Schmidt<br>
<b>Sent:</b> Wednesday, February 10, 2021 8:45 AM<br>
<b>To:</b> <a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank">
archivesspace_users_group@lyralists.lyrasis.org</a><br>
<b>Subject:</b> [Archivesspace_Users_Group] Checking for Broken URLs in Resources</p>
</div>
</div>
<p> </p>
<div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;color:black">Dear all,</span></p>
</div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;color:black"><br>
Hello, this is Corey Schmidt, ArchivesSpace PM at the University of Georgia. I hope everyone is doing well and staying safe and healthy.<br>
<br>
Would anyone know of any script, plugin, or tool to check for invalid URLs within resources? We are investigating how to grab URLs from exported EAD XML files and check whether they return any sort of error (404s mostly, but also any others).
My thinking is to build a small app that exports EAD XML files from ArchivesSpace, then sifts through the raw XML using Python's lxml package to catch any URLs using regex. After capturing a URL, it would use the requests library to check the status
code and, if the URL returns an error, log that error in a CSV output file to act as a "report" of all the broken links within that resource.<br>
<br>
The problems with this method are: 1. Exporting 1000s of resources takes a lot of time and some processing power, as well as a moderate amount of local storage space. 2. Even checking the raw XML files takes a considerable amount of time. The app I'm working
on runs overnight to export and check all the XML files. I was considering querying the API for different parts of a resource, but I figured that would take as much time as just exporting the EAD XML and would be even more complex to write. I've checked Awesome
ArchivesSpace, this listserv, and a few script libraries from institutions, but haven't found exactly what I am looking for.<br>
<br>
Any info or advice would be greatly appreciated! Thanks!<br>
<br>
Sincerely,<br>
<br>
Corey</span></p>
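<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;color:black">To make the extraction and reporting steps concrete, here is a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, it assumes linked URLs sit in xlink:href attributes in the usual xlink namespace, and the names extract_urls and write_report and the CSV columns are hypothetical. The status check itself is left out.</span></p>

```python
import csv
import re
import xml.etree.ElementTree as ET

# Attribute name in ElementTree's namespace notation; assumes the export
# declares the standard xlink namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Deliberately loose pattern for URLs pasted straight into note text.
URL_PATTERN = re.compile(r"https?://[^\s\"']+")


def extract_urls(ead_xml):
    """Return a sorted list of unique URLs found in one EAD document."""
    root = ET.fromstring(ead_xml)
    found = set()
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href:
            found.add(href.strip())
        # Plain-text URLs can sit in an element's text or in its tail.
        for text in (element.text, element.tail):
            if text:
                for url in URL_PATTERN.findall(text):
                    # Trim punctuation that usually belongs to the sentence.
                    found.add(url.rstrip(".,;)"))
    return sorted(found)


def write_report(rows, out_file):
    """Write (resource, url, status) rows to an open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["resource", "url", "status"])
    writer.writerows(rows)
```

<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;color:black">Collecting URLs into a set before checking them means each distinct link in a resource is requested only once, which may help with the run time.</span></p>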
<div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;color:black"> </span></p>
</div>
<div id="gmail-m_-3600102994615128713x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738x_Signature">
<div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-size:12pt;font-family:Verdana,sans-serif;color:black">Corey Schmidt</span><span style="font-size:12pt;color:black"></span></p>
</div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-family:Verdana,sans-serif;color:rgb(51,51,51)">ArchivesSpace Project Manager</span><span style="font-size:12pt;color:black"></span></p>
</div>
<div>
<p style="background:none 0% 0% repeat scroll white"><span style="font-family:Verdana,sans-serif;color:rgb(51,51,51)">University of Georgia Special Collections Libraries</span><span style="font-size:12pt;color:black"></span></p>
</div>
<div>
<p style="background:none 0% 0% repeat scroll white"><i><span style="font-size:10pt;font-family:Verdana,sans-serif;color:rgb(102,102,102)">Email:</span></i><span style="font-size:10pt;font-family:Verdana,sans-serif;color:rgb(102,102,102)">
<a href="mailto:Corey.Schmidt@uga.edu" target="_blank">Corey.Schmidt@uga.edu</a></span><span style="font-size:12pt;color:black"></span></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
_______________________________________________<br>
Archivesspace_Users_Group mailing list<br>
<a href="mailto:Archivesspace_Users_Group@lyralists.lyrasis.org" target="_blank">Archivesspace_Users_Group@lyralists.lyrasis.org</a><br>
<a href="http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group" rel="noreferrer" target="_blank">http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group</a><br>
</blockquote>
</div>
<br clear="all">
<br>
-- <br>
<div dir="ltr">
<div dir="ltr">Kevin Schlottmann<br>
Interim Director and Head of Archives Processing<br>
Rare Book & Manuscript Library<br>
Butler Library, Room 801<br>
Columbia University<br>
535 W. 114th St., New York, NY 10027<br>
(212) 854-8483</div>
</div>
</div>
</div>
</div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><div style="font-size:13px"><font color="#999999">David W. Hodges</font></div><div style="font-size:13px"><font color="#999999">Special Collections Analyst</font></div><div style="font-size:13px"><font color="#999999">Columbia University Libraries</font></div><div style="font-size:13px"><span style="color:rgb(153,153,153)">Butler Library</span><br></div><div><font color="#999999"><span style="font-size:13px">535 West 114th St.</span></font><br></div><div style="font-size:13px"><font color="#999999">New York, NY 10027</font></div><div style="font-size:13px"><font color="#999999">212 854-8758</font></div><div style="font-size:13px;color:rgb(0,0,0)"><br></div></div></div></div></div></div></div></div></div></div></div></div></div>