<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

  </head>

  <body>

    <p>Another approach would be to modify the Solr schema and the
      ArchivesSpace indexer to set up and populate an index field just
      for URLs. A script could then query Solr for the list of URLs and
      send an HTTP request to test each one, as any other link checker
      would. That would have the added bonus of giving staff a
      dedicated option in the advanced search for finding records by the
      URLs they contain (or * to get all records with external links).
      This is something I have been working on, but it is currently on
      the back burner.<br>

    </p>
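    <p>A minimal sketch of the query side of that idea, assuming the
      indexer has already been changed to populate a multi-valued URL
      field. The core address in <code>SOLR_SELECT</code> and the field
      name <code>external_urls</code> are hypothetical placeholders, not
      names from a stock ArchivesSpace install; the JSON layout is the
      usual Solr select response.</p>
<pre>
```python
import json
from urllib.parse import urlencode

# Hypothetical names: neither the core URL nor the field exists in a stock
# ArchivesSpace install; they stand in for whatever the modified schema uses.
SOLR_SELECT = "http://localhost:8983/solr/archivesspace/select"
URL_FIELD = "external_urls"


def build_query_url(start=0, rows=500):
    """Return a Solr select URL asking for records that have any URL indexed."""
    params = {
        "q": f"{URL_FIELD}:*",
        "fl": f"id,{URL_FIELD}",
        "wt": "json",
        "start": start,
        "rows": rows,
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


def urls_from_response(body):
    """Yield (record id, url) pairs from a Solr JSON response body."""
    for doc in json.loads(body)["response"]["docs"]:
        # Records without the field are skipped rather than raising.
        for url in doc.get(URL_FIELD, []):
            yield doc["id"], url
```
</pre>
    <p>Each (record id, URL) pair could then be handed to whatever link
      checker sends the HTTP requests.</p>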

    <p><br>

    </p>

    <p>Andrew.</p>

    <p><br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 10/02/2021 15:44, Corey Schmidt

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:BN6PR02MB315642922CE145E4EE219C59F38D9@BN6PR02MB3156.namprd02.prod.outlook.com">

      <style type="text/css" style="display:none;">P {margin-top:0;margin-bottom:0;}</style>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255,

        255, 255);">

        Kevin,<br>

        <br>

        I'd be very interested to get your code, especially with 301 and

        302 redirects. My initial runs have resulted in redirects

        stalling the status code response, preventing my program from

        moving forward. Getting the status code of the link being

        redirected to would be a big help.<br>

        <br>

        I should also mention that any solution I make needs to be

        replicable to other faculty/staff in our library - so access to

        the database may not be consistent in the future. I'm thinking

        of a scenario where 5 years from now, some people may want to

        run this report again but I may not be working for the library

        anymore. If that's the case, working to export from

        ArchivesSpace might be the better long-term, no-fuss solution.<br>

        <br>

        Corey<br>

      </div>

      <hr style="display:inline-block;width:98%" tabindex="-1">

      <div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>

          <a class="moz-txt-link-abbreviated" href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>

          <a class="moz-txt-link-rfc2396E" href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org">&lt;archivesspace_users_group-bounces@lyralists.lyrasis.org&gt;</a>
          on behalf of Kevin W. Schlottmann <a class="moz-txt-link-rfc2396E" href="mailto:kws2126@columbia.edu">&lt;kws2126@columbia.edu&gt;</a><br>
          <b>Sent:</b> Wednesday, February 10, 2021 10:27 AM<br>
          <b>To:</b> Archivesspace Users Group
          <a class="moz-txt-link-rfc2396E" href="mailto:archivesspace_users_group@lyralists.lyrasis.org">&lt;archivesspace_users_group@lyralists.lyrasis.org&gt;</a><br>

          <b>Subject:</b> Re: [Archivesspace_Users_Group] Checking for

          Broken URLs in Resources</font>

        <div> </div>

      </div>

      <div>

        <div>

          <div dir="ltr">

            <div>Hi Corey, <br>

            </div>

            <div><br>

            </div>

            <div>Earlier this year we did something similar. We started
              by extracting every xlink:href attribute from the EAD corpus
              using XSLT. (EAD is our go-to data source since we always
              have an up-to-date set, as we use them to automatically
              publish our finding aids.) We did not do an additional
              regex search for non-encoded URLs, but that's not a bad
              idea. The result was over 15,000 links. Each was
              recorded with bibid, container id (if any), and link text
              and title (if any). To check them, my colleague ran them
              through a Python script to get the response code and, if
              the response was 301/302, retrieve the redirect location
              and secondary response. This produced some interesting
              results and left us with a fair amount of remediation
              work to do.

              <br>

            </div>
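            <div>A rough sketch of that kind of check, using only the
              Python standard library. This is not the script referred
              to above: the names <code>fetch_status</code> and
              <code>check_link</code> are illustrative, and 303, 307 and
              308 are treated as redirects alongside 301/302 as an
              assumption. A timeout and a hop limit keep a slow or
              looping redirect from stalling the run.</div>
<pre>
```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


_opener = urllib.request.build_opener(_NoRedirect)


def fetch_status(url, timeout=10):
    """Return (status code, Location header or None) for a single request."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with _opener.open(req, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses all arrive here with their status code.
        return err.code, err.headers.get("Location")
    except OSError:
        # No HTTP response at all (DNS failure, timeout, refused connection).
        return None, None


def check_link(url, fetch=fetch_status, max_hops=5):
    """Follow up to max_hops redirects, returning every (url, status) visited."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops
```
</pre>
            <div>The last pair in the returned list is the final
              destination and its status. Some servers refuse HEAD
              requests, so a fallback to GET may be needed in
              practice.</div>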

            <div><br>

            </div>

            <div>If this sounds on point, I can try to find and share

              the code we used.<br>

            </div>

            <div><br>

            </div>

            <div>Kevin<br>

            </div>

          </div>

          <br>

          <div class="x_gmail_quote">

            <div dir="ltr" class="x_gmail_attr">On Wed, Feb 10, 2021 at

              9:34 AM Corey Schmidt &lt;<a href="mailto:Corey.Schmidt@uga.edu" target="_blank" moz-do-not-send="true">Corey.Schmidt@uga.edu</a>&gt;

              wrote:<br>

            </div>

            <blockquote class="x_gmail_quote" style="margin:0px 0px 0px

              0.8ex; border-left:1px solid rgb(204,204,204);

              padding-left:1ex">

              <div dir="ltr">

                <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

                  font-size:12pt; color:rgb(0,0,0);

                  background-color:rgb(255,255,255)">

                  Nancy,<br>

                  <br>

                  I have access to our staging database, but not
                  production. I'm not sure our sysadmins will allow me
                  to play around in the prod database, unless perhaps
                  they can give me read-only access. Pulling the
                  file_uri values from the file_version table would be
                  much more efficient.
                  However, I'm not just looking to check digital object
                  links, but also any links found within collection and
                  archival object level notes, either copied straight
                  into the text of the notes or linked using the
                  &lt;extref&gt; tag. I could probably query the
                  database for that info too.<br>

                  <br>

                  Corey<br>

                </div>

                <div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738appendonsend"></div>

                <hr style="display:inline-block; width:98%">

                <div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738divRplyFwdMsg" dir="ltr">

                  <font style="font-size:11pt" face="Calibri,

                    sans-serif" color="#000000"><b>From:</b>

                    <a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">

archivesspace_users_group-bounces@lyralists.lyrasis.org</a> &lt;<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>&gt;
                    on behalf of Kennedy, Nancy &lt;<a href="mailto:KennedyN@si.edu" target="_blank" moz-do-not-send="true">KennedyN@si.edu</a>&gt;<br>
                    <b>Sent:</b> Wednesday, February 10, 2021 9:18 AM<br>
                    <b>To:</b> Archivesspace Users Group &lt;<a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group@lyralists.lyrasis.org</a>&gt;<br>

                    <b>Subject:</b> Re: [Archivesspace_Users_Group]

                    Checking for Broken URLs in Resources</font>

                  <div> </div>

                </div>

                <div lang="EN-US">

                  <div>

                    <div>

                      <p>Hi Corey,</p>

                      <p>Do you have access to query the database, as a

                        starting point, instead of EAD?  We were able to

                        pull the file_uri values from the file_version

                        table in the database.  Our sysadmin then

                        checked the response codes for that list of URIs,

                        and we referred issues out to staff working on

                        those collections.  Some corrections can be made

                        directly by staff, or for long lists, you could

                        include the digital_object id and post updates

                        that way.</p>
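                      <p>A minimal sketch of that kind of pull. The
                        column names <code>digital_object_id</code> and
                        <code>file_uri</code> are assumptions to check
                        against your own schema, and the in-memory
                        SQLite table is only a stand-in so the example
                        runs without access to a real ArchivesSpace
                        database.</p>
<pre>
```python
import sqlite3

# Assumed column names; check them against your own ArchivesSpace schema.
QUERY = """
    SELECT digital_object_id, file_uri
    FROM file_version
    WHERE file_uri LIKE 'http%'
    ORDER BY digital_object_id
"""


def list_file_uris(conn):
    """Return (digital_object_id, file_uri) rows for every web-style URI."""
    cursor = conn.cursor()
    cursor.execute(QUERY)
    return cursor.fetchall()


# Stand-in database so the sketch runs anywhere; a real run would pass a
# read-only connection to the ArchivesSpace database instead.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE file_version (digital_object_id INTEGER, file_uri TEXT)")
demo.executemany(
    "INSERT INTO file_version VALUES (?, ?)",
    [
        (1, "https://example.org/a.jpg"),
        (2, "/local/path/b.tif"),
        (3, "http://example.org/c.pdf"),
    ],
)
rows = list_file_uris(demo)
```
</pre>
                      <p>Keeping the digital object id next to each URI
                        makes it easier to route a broken link back to
                        the record that needs fixing.</p>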

                      <p> </p>

                      <p>Nancy</p>

                      <p> </p>

                      <p> </p>

                      <div>

                        <div style="border-color:rgb(225,225,225)

                          currentcolor currentcolor; border-style:solid

                          none none; border-width:1pt medium medium;

                          padding:3pt 0in 0in">

                          <p><b>From:</b> <a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">

archivesspace_users_group-bounces@lyralists.lyrasis.org</a> &lt;<a href="mailto:archivesspace_users_group-bounces@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">archivesspace_users_group-bounces@lyralists.lyrasis.org</a>&gt;

                            <b>On Behalf Of </b>Corey Schmidt<br>

                            <b>Sent:</b> Wednesday, February 10, 2021

                            8:45 AM<br>

                            <b>To:</b> <a href="mailto:archivesspace_users_group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">

archivesspace_users_group@lyralists.lyrasis.org</a><br>

                            <b>Subject:</b> [Archivesspace_Users_Group]

                            Checking for Broken URLs in Resources</p>

                        </div>

                      </div>

                      <p> </p>


                      <div>

                        <div>

                          <p style="background:white none repeat scroll

                            0% 0%"><span style="font-size:12pt;

                              color:black">Dear all,</span></p>

                        </div>

                        <div>

                          <p style="background:white none repeat scroll

                            0% 0%"><span style="font-size:12pt;

                              color:black"><br>

                              Hello, this is Corey Schmidt,

                              ArchivesSpace PM at the University of

                              Georgia. I hope everyone is doing well and

                              staying safe and healthy.<br>

                              <br>

                              Would anyone know of any script, plugin,

                              or tool to check for invalid URLs within

                              resources? We are investigating how to

                              grab URLs from exported EAD.xml files and

                              check them to determine if they throw back

                              any sort of error (404s mostly, but also

                              any others). My thinking is to build a

                              small app that will export EAD.xml files
                              from ArchivesSpace, then sift through the
                              raw XML using Python's lxml package and
                              catch any URLs with a regex. After
                              capturing a URL, it would use the
                              requests library to check the status code
                              and, if it returns an error, log
                              that error in a CSV output file to act as
                              a "report" of all the broken links within
                              that resource.<br>

                              <br>

                              The problems with this method are: 1.

                              Exporting 1000s of resources takes a lot

                              of time and some processing power, as well

                              as a moderate amount of local storage

                              space. 2. Even checking the raw xml file

                              takes a considerable amount of time. The

                              app I'm working on takes overnight to

                              export and check all the xml files. I was

                              considering pinging the API for different

                              parts of a resource, but I figured that

                              would take as much time as just exporting

                              an EAD.xml and would be even more complex

                              to write. I've checked Awesome

                              ArchivesSpace, this listserv, and a few

                              script libraries from institutions, but

                              haven't found exactly what I am looking

                              for.<br>

                              <br>

                              Any info or advice would be greatly

                              appreciated! Thanks!<br>

                              <br>

                              Sincerely,<br>

                              <br>

                              Corey</span></p>
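                          <p>A small sketch of the extraction and
                            reporting steps described above, under a few
                            assumptions: it uses the standard library's
                            <code>xml.etree.ElementTree</code> in place
                            of lxml so it runs without extra installs,
                            the URL regex is deliberately loose, and the
                            function names and CSV columns are
                            illustrative only. The status check itself
                            is left out.</p>
<pre>
```python
import csv
import re
import xml.etree.ElementTree as ET

# Namespace-qualified attribute name for xlink:href.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Loose pattern for URLs pasted straight into note text.
URL_PATTERN = re.compile(r"https?://[^\s\"'()]+")


def extract_urls(ead_xml):
    """Return sorted unique URLs from xlink:href attributes and note text."""
    root = ET.fromstring(ead_xml)
    found = set()
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href and href.startswith("http"):
            found.add(href)
    for text in root.itertext():
        for match in URL_PATTERN.findall(text):
            # Drop sentence punctuation that the loose pattern picks up.
            found.add(match.rstrip(".,;"))
    return sorted(found)


def write_report(rows, handle):
    """Write (resource, url, status) rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["resource", "url", "status"])
    writer.writerows(rows)
```
</pre>
                          <p>Parsing the file once and collecting unique
                            URLs means each link is only requested a
                            single time per resource, which may help
                            with the overnight run time.</p>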

                          <div>

                            <div>

                              <p style="background:white none repeat

                                scroll 0% 0%"><span style="font-size:12pt; color:black"> </span></p>

                            </div>

                            <div id="x_gmail-m_-6818534061483353590gmail-m_-1985334488657977738x_Signature">

                              <div>

                                <div>

                                  <p style="background:white none repeat

                                    scroll 0% 0%"><span style="font-size:12pt;

                                      font-family:"Verdana",sans-serif;

                                      color:black">Corey Schmidt</span><span style="font-size:12pt;

                                      color:black"></span></p>

                                </div>

                                <div>

                                  <p style="background:white none repeat

                                    scroll 0% 0%"><span style="font-family:"Verdana",sans-serif;

                                      color:rgb(51,51,51)">ArchivesSpace

                                      Project Manager</span><span style="font-size:12pt;

                                      color:black"></span></p>

                                </div>

                                <div>

                                  <p style="background:white none repeat

                                    scroll 0% 0%"><span style="font-family:"Verdana",sans-serif;

                                      color:rgb(51,51,51)">University of

                                      Georgia Special Collections

                                      Libraries</span><span style="font-size:12pt;

                                      color:black"></span></p>

                                </div>

                                <div>

                                  <p style="background:white none repeat

                                    scroll 0% 0%"><i><span style="font-size:10pt;

                                        font-family:"Verdana",sans-serif;

                                        color:rgb(102,102,102)">Email:</span></i><span style="font-size:10pt;

                                      font-family:"Verdana",sans-serif;

                                      color:rgb(102,102,102)">

                                      <a href="mailto:Corey.Schmidt@uga.edu" target="_blank" moz-do-not-send="true">Corey.Schmidt@uga.edu</a></span><span style="font-size:12pt;

                                      color:black"></span></p>

                                </div>

                              </div>

                            </div>

                          </div>

                        </div>

                      </div>

                    </div>

                  </div>

                </div>

              </div>

              _______________________________________________<br>

              Archivesspace_Users_Group mailing list<br>

              <a href="mailto:Archivesspace_Users_Group@lyralists.lyrasis.org" target="_blank" moz-do-not-send="true">Archivesspace_Users_Group@lyralists.lyrasis.org</a><br>

              <a href="http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group" rel="noreferrer" target="_blank" moz-do-not-send="true">http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group</a><br>

            </blockquote>

          </div>

          <br clear="all">

          <br>

          -- <br>

          <div dir="ltr">

            <div dir="ltr">Kevin Schlottmann<br>

              Interim Director and Head of Archives Processing<br>

              Rare Book & Manuscript Library<br>

              Butler Library, Room 801<br>

              Columbia University<br>

              535 W. 114th St., New York, NY  10027<br>

              (212) 854-8483</div>

          </div>

        </div>

      </div>

      <br>


    </blockquote>

  </body>

</html>