<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<p>Another approach would be to modify the Solr schema and the
ArchivesSpace indexer to set up and populate an index field just
for URLs. A script could then query Solr for the list of URLs and
send an HTTP request to each one, as any other link checker would.
This has the added bonus of giving staff a dedicated option in the
advanced search for finding records by the URLs they contain (or
using * to find every record with external links). I have been
working on this, but it is currently on the back burner.<br>
</p>
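<p>For illustration, the querying half could look roughly like the
sketch below. It rests on assumptions: the field name
<code>external_urls</code> is hypothetical (whatever name the schema
change uses would go there), the Solr address is a guess at a typical
setup, and the function names are made up. It relies only on Solr's
standard JSON select handler. Each URL it returns could then be handed
to an ordinary link checker.</p>
<pre>
```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Solr core location; adjust to the local installation.
SOLR_SELECT = "http://localhost:8983/solr/archivesspace/select"
# Hypothetical name for the URL-only index field described above.
URL_FIELD = "external_urls"


def build_query(start, rows=500):
    """Build the select URL for one page of records that have a URL."""
    params = {
        "q": URL_FIELD + ":*",       # any record with a value in the field
        "fl": "id," + URL_FIELD,     # only return the id and the URLs
        "wt": "json",
        "start": start,
        "rows": rows,
    }
    return SOLR_SELECT + "?" + urlencode(params)


def urls_from_response(payload):
    """Return (record id, url) pairs from a parsed Solr JSON response."""
    pairs = []
    for doc in payload["response"]["docs"]:
        values = doc.get(URL_FIELD, [])
        if isinstance(values, str):  # field defined as single-valued
            values = [values]
        pairs.extend((doc["id"], url) for url in values)
    return pairs


def all_urls(rows=500):
    """Page through Solr until every matching record has been read."""
    start, found = 0, []
    while True:
        with urlopen(build_query(start, rows), timeout=30) as resp:
            payload = json.load(resp)
        found.extend(urls_from_response(payload))
        start += rows
        if start >= payload["response"]["numFound"]:
            return found
```
</pre>
<p>Only <code>all_urls()</code> touches the network; the other two
functions can be tried against a saved Solr response.</p>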
<p><br>
</p>
<p>Andrew.</p>
<p><br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 10/02/2021 15:44, Corey Schmidt
wrote:<br>
</div>
<blockquote type="cite" cite="mid:BN6PR02MB315642922CE145E4EE219C59F38D9@BN6PR02MB3156.namprd02.prod.outlook.com">
<style type="text/css" style="display:none;">P {margin-top:0;margin-bottom:0;}</style>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255,
255, 255);">
Kevin,<br>
<br>
I'd be very interested to get your code, especially the handling of
301 and 302 redirects. In my initial runs, redirects have
stalled the status code response, preventing my program from
moving forward. Getting the status code of the link being
redirected to would be a big help.<br>
<br>
I should also mention that any solution I build needs to be
reusable by other faculty/staff in our library, and access to
the database may not be consistent in the future. I'm thinking
of a scenario where, 5 years from now, some people may want to
run this report again but I may no longer be working for the
library. If that's the case, exporting from
ArchivesSpace might be the better long-term, no-fuss solution.<br>
<br>
Corey<br>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>
<a class="moz-txt-link-abbreviated" href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>
<a class="moz-txt-link-rfc2396E" href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org">&lt;archivesspace_users_group-bounces@lyralists.lyrasis.org&gt;</a>
on behalf of Kevin W. Schlottmann <a class="moz-txt-link-rfc2396E" href="mailto:kws2126@columbia.edu">&lt;kws2126@columbia.edu&gt;</a><br>
<b>Sent:</b> Wednesday, February 10, 2021 10:27 AM<br>
<b>To:</b> Archivesspace Users Group
<a class="moz-txt-link-rfc2396E" href="mailto:archivesspace_users_group@lyralists.lyrasis.org">&lt;archivesspace_users_group@lyralists.lyrasis.org&gt;</a><br>
<b>Subject:</b> Re: [Archivesspace_Users_Group] Checking for
Broken URLs in Resources</font>
<div> </div>
</div>
<div>
<div>
<div dir="ltr">
<div>Hi Corey, <br>
</div>
<div><br>
</div>
<div>Earlier this year we did something similar. We started
by extracting all xlink:href attributes from the EAD corpus using
XSLT. (EAD is our go-to data source since we always have a
constantly updated set, as we use them to automatically
publish our finding aids.) We did not do an additional
regex search for non-encoded URLs, but that's not a bad
idea. The result was over 15,000 links. Each was
recorded with bibid, container id (if any), and link text
and title (if any). To check them, my colleague ran them
through a Python script to get the response code and, if
the response was 301/302, to retrieve the redirect location
and the secondary response. This produced some interesting
results and left us with a fair amount of remediation work
to do.
<br>
</div>
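<p>As an illustration only (this is not the script referred to
above), the 301/302 handling could be sketched in Python as follows. It
uses the standard library's <code>http.client</code>, which never
follows redirects on its own, so the first status and the redirect
target can be recorded separately. The function names and the layout of
the result are made up for the example. The timeout keeps a slow server
from stalling the whole run; with the requests library, the equivalent
controls are the <code>timeout</code> and <code>allow_redirects</code>
arguments.</p>
<pre>
```python
import http.client
from urllib.parse import urljoin, urlsplit


def head(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, location); location is None unless the server sent one.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def check_link(url, fetch=head):
    """Check a link; on 301/302 also check the redirect target once."""
    result = {"url": url, "status": None, "redirect_to": None,
              "redirect_status": None, "error": None}
    try:
        status, location = fetch(url)
        result["status"] = status
        if status in (301, 302) and location:
            target = urljoin(url, location)  # Location may be relative
            result["redirect_to"] = target
            result["redirect_status"], _ = fetch(target)
    except (OSError, http.client.HTTPException) as exc:
        # Timeouts, DNS failures and refused connections end up here
        # instead of stopping the run.
        result["error"] = repr(exc)
    return result
```
</pre>
<p>Only one hop is followed, matching the "redirect location and
secondary response" described above; a chain of several redirects would
need a loop with a hop limit. Some servers answer HEAD differently from
GET, so a GET fallback may be worth adding.</p>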
<div><br>
</div>
<div>If this sounds on point, I can try to find and share
the code we used.<br>
</div>
<div><br>
</div>
<div>Kevin<br>
</div>
</div>
<br>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Wed, Feb 10, 2021 at
9:34 AM Corey Schmidt &lt;<a href="mailto:Corey.Schmidt@uga.edu" target="_blank" moz-do-not-send="true">Corey.Schmidt@uga.edu</a>&gt;
wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px
0.8ex; border-left:1px solid rgb(204,204,204);
padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;
font-size:12pt; color:rgb(0,0,0);
background-color:rgb(255,255,255)">
Nancy,<br>
<br>
I have access to our staging database, but not
production. I'm not sure our sysadmins will allow me
to play around in the prod database, unless perhaps they
can give me read-only access. Pulling the file_uri values
for file_version would be much more efficient.
However, I'm not just looking to check digital object
links, but also any links found within collection-level and
archival-object-level notes, either copied straight
into the text of the notes or linked using the
&lt;extref&gt; tag. I could probably query the
database for that info too.<br>
<br>
Corey<br>
</div>
<div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738appendonsend"></div>
<hr style="display:inline-block; width:98%">
<div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738divRplyFwdMsg" dir="ltr">
<font style="font-size:11pt" face="Calibri,
sans-serif" color="#000000"><b>From:</b>
<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">
archivesspace_users_group-bounces@lyralists.lyrasis.org</a> &lt;<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>&gt;
on behalf of Kennedy, Nancy &lt;<a href="mailto:KennedyN@si.edu" target="_blank" moz-do-not-send="true">KennedyN@si.edu</a>&gt;<br>
<b>Sent:</b> Wednesday, February 10, 2021 9:18 AM<br>
<b>To:</b> Archivesspace Users Group &lt;<a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group@lyralists.lyrasis.org</a>&gt;<br>
<b>Subject:</b> Re: [Archivesspace_Users_Group]
Checking for Broken URLs in Resources</font>
<div> </div>
</div>
<div lang="EN-US">
<div>
<div>
<p>Hi Corey – </p>
<p>Do you have access to query the database, as a
starting point, instead of EAD? We were able to
pull the file_uri values from the file_version
table in the database. Our sysadmin then
checked the response codes for that list of URIs,
and we referred issues out to staff working on
those collections. Some corrections can be made
directly by staff, or for long lists, you could
include the digital_object id and post updates
that way.</p>
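<p>A minimal sketch of that kind of query in Python, for illustration
only. The <code>file_version</code> table and <code>file_uri</code>
column are named above; the <code>digital_object_id</code> column is an
assumption about the schema and should be checked against your
ArchivesSpace version. The demo at the bottom uses an in-memory SQLite
table as a stand-in; a real run would pass in a read-only connection to
the ArchivesSpace database from whichever DB-API driver you use.</p>
<pre>
```python
import sqlite3

# Column names other than file_uri are assumptions about the schema.
QUERY = """
    SELECT digital_object_id, file_uri
    FROM file_version
    WHERE file_uri LIKE 'http%'
    ORDER BY digital_object_id
"""


def file_uris(conn):
    """Return (digital_object_id, file_uri) rows from a DB-API connection."""
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()


if __name__ == "__main__":
    # Stand-in data: a SQLite table holding only the two columns used here.
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE file_version (digital_object_id INTEGER, file_uri TEXT)"
    )
    demo.executemany(
        "INSERT INTO file_version VALUES (?, ?)",
        [(1, "https://example.org/a.jpg"), (2, "box1/folder2.tif")],
    )
    # Only the row whose file_uri starts with http is returned.
    print(file_uris(demo))
```
</pre>
<p>Keeping the digital object id next to each URI makes it easier to
route broken links back to the right record, or to post corrections in
bulk.</p>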
<p> </p>
<p>Nancy</p>
<p> </p>
<p> </p>
<div>
<div style="border-color:rgb(225,225,225)
currentcolor currentcolor; border-style:solid
none none; border-width:1pt medium medium;
padding:3pt 0in 0in">
<p><b>From:</b> <a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">
archivesspace_users_group-bounces@lyralists.lyrasis.org</a> &lt;<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>&gt;
<b>On Behalf Of </b>Corey Schmidt<br>
<b>Sent:</b> Wednesday, February 10, 2021
8:45 AM<br>
<b>To:</b> <a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">
archivesspace_users_group@lyralists.lyrasis.org</a><br>
<b>Subject:</b> [Archivesspace_Users_Group]
Checking for Broken URLs in Resources</p>
</div>
</div>
<p> </p>
<div>
<div>
<p style="background:white none repeat scroll
0% 0%"><span style="font-size:12pt;
color:black">Dear all,</span></p>
</div>
<div>
<p style="background:white none repeat scroll
0% 0%"><span style="font-size:12pt;
color:black"><br>
Hello, this is Corey Schmidt,
ArchivesSpace PM at the University of
Georgia. I hope everyone is doing well and
staying safe and healthy.<br>
<br>
Would anyone know of any script, plugin,
or tool to check for invalid URLs within
resources? We are investigating how to
grab URLs from exported EAD.xml files and
check them to determine if they throw back
any sort of error (404s mostly, but also
any others). My thinking is to build a
small app that will export EAD.xml files
from ArchivesSpace, then sift through the
raw XML using Python's lxml package and a
regex to catch any URLs. After
capturing each URL, it would use the
requests library to check the status code
and, if the URL returns an error, log
that error in a .CSV output file to act as
a "report" of all the broken links within
that resource.<br>
<br>
The problems with this method are: 1.
Exporting 1000s of resources takes a lot
of time and some processing power, as well
as a moderate amount of local storage
space. 2. Even checking the raw xml file
takes a considerable amount of time. The
app I'm working on takes overnight to
export and check all the xml files. I was
considering pinging the API for different
parts of a resource, but I figured that
would take as much time as just exporting
an EAD.xml and would be even more complex
to write. I've checked Awesome
ArchivesSpace, this listserv, and a few
script libraries from institutions, but
haven't found exactly what I am looking
for.<br>
<br>
Any info or advice would be greatly
appreciated! Thanks!<br>
<br>
Sincerely,<br>
<br>
Corey</span></p>
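<p>A rough sketch of the extract-and-report steps described above,
under stated assumptions. It uses <code>xml.etree.ElementTree</code>
from the standard library rather than lxml (whose etree module offers a
largely compatible API), and it assumes the export declares the usual
xlink namespace. The function names, the regex, and the CSV columns are
illustrative. The status check itself is left out.</p>
<pre>
```python
import csv
import re
import xml.etree.ElementTree as ET

# xlink:href as ElementTree spells it, assuming the standard xlink namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Deliberately loose pattern for URLs pasted straight into note text.
URL_PATTERN = re.compile(r"https?://[^\s\"')]+")


def extract_urls(ead_xml):
    """Return the sorted, de-duplicated URLs found in one EAD document.

    Collects xlink:href attribute values plus anything URL-shaped in the
    element text. Accepts the XML as str or bytes.
    """
    root = ET.fromstring(ead_xml)
    urls = set()
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href and href.startswith("http"):
            urls.add(href)
        for text in (element.text, element.tail):
            if text:
                # Drop sentence punctuation that trails a pasted URL.
                urls.update(m.rstrip(".,;") for m in URL_PATTERN.findall(text))
    return sorted(urls)


def write_report(rows, handle):
    """Write (resource, url, status) rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["resource", "url", "status"])
    writer.writerows(rows)
```
</pre>
<p>De-duplicating before checking means each URL is requested once per
resource even if it appears in several notes, which may help with the
run time.</p>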
<div>
<div>
<p style="background:white none repeat
scroll 0% 0%"><span style="font-size:12pt; color:black"> </span></p>
</div>
<div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738x_Signature">
<div>
<div>
<p style="background:white none repeat
scroll 0% 0%"><span style="font-size:12pt;
font-family:'Verdana',sans-serif;
color:black">Corey Schmidt</span><span style="font-size:12pt;
color:black"></span></p>
</div>
<div>
<p style="background:white none repeat
scroll 0% 0%"><span style="font-family:'Verdana',sans-serif;
color:rgb(51,51,51)">ArchivesSpace
Project Manager</span><span style="font-size:12pt;
color:black"></span></p>
</div>
<div>
<p style="background:white none repeat
scroll 0% 0%"><span style="font-family:'Verdana',sans-serif;
color:rgb(51,51,51)">University of
Georgia Special Collections
Libraries</span><span style="font-size:12pt;
color:black"></span></p>
</div>
<div>
<p style="background:white none repeat
scroll 0% 0%"><i><span style="font-size:10pt;
font-family:'Verdana',sans-serif;
color:rgb(102,102,102)">Email:</span></i><span style="font-size:10pt;
font-family:'Verdana',sans-serif;
color:rgb(102,102,102)">
<a href="mailto:Corey.Schmidt@uga.edu" target="_blank" moz-do-not-send="true">Corey.Schmidt@uga.edu</a></span><span style="font-size:12pt;
color:black"></span></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
_______________________________________________<br>
Archivesspace_Users_Group mailing list<br>
<a href="mailto:Archivesspace_Users_Group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">Archivesspace_Users_Group@lyralists.lyrasis.org</a><br>
<a href="http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group" rel="noreferrer" target="_blank" moz-do-not-send="true">http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group</a><br>
</blockquote>
</div>
<br clear="all">
<br>
-- <br>
<div dir="ltr">
<div dir="ltr">Kevin Schlottmann<br>
Interim Director and Head of Archives Processing<br>
Rare Book &amp; Manuscript Library<br>
Butler Library, Room 801<br>
Columbia University<br>
535 W. 114th St., New York, NY 10027<br>
(212) 854-8483</div>
</div>
</div>
</div>
<br>
</blockquote>
</body>
</html>