<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">I have a fork of the Python oai-harvest program that is more forgiving of XML parser errors, which typically come from unescaped ampersands in ArchivesSpace note text or in query strings embedded in URLs. It also recovers from other XML errors I’ve seen in the wild in ArchivesSpace exports, such as undeclared namespaces or undefined character entities. </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">It typically recovers from an error by dropping the node (text, attribute, or element) that doesn’t parse correctly. For example, “Jones &amp; Smith” will become “Jones Smith”, and “AT&amp;T” may be output as “AT”. The data will therefore be incorrect, but you will get well-formed EAD output and, more importantly, the harvester will continue harvesting to completion instead of raising an XMLParser exception and quitting. </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">The program logs recoverable errors in case you want to notify the upstream feed so they can fix their resources.</span></div><div class=""><span style="font-size: 15px;" class="">(Although I believe a fix for the unescaped ampersand issue may be in the next ArchivesSpace release. 
) </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">I have submitted a pull request upstream, but in the meantime you can install the fork with pip or pip3 using this command:</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><span class="Apple-tab-span" style="white-space:pre">        </span>pip3 install <a href="git+https://github.com/sdm7g/oai-harvest.git@fix-pyoai" class="">git+https://github.com/sdm7g/oai-harvest.git@fix-pyoai</a></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">(The fix-pyoai branch is required.) </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">This version of the program adds another optional argument, --recover, which enables the parser's error recovery described above. 
</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">For example (where $URL is the URL of a public OAI endpoint):</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><span class="Apple-tab-span" style="white-space:pre"> </span>oai-harvest -p oai_ead -s collection --subdirs-on ':' --recover $URL </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">Better still, use the oai-reg program to create a registry database of feeds, so you can use “all” instead of a specific URL and also do incremental harvesting. (The registry database stores the date and time of the last harvest.) </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">The --subdirs-on ':' option is not required, but it splits the OAI identifiers on both ':' and '/', and results in a directory tree starting at $PWD/oai/ instead of a flat collection of files. 
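</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">To illustrate the kind of mapping that splitting on ':' and '/' produces, here is a minimal Python sketch. It is not the code oai-harvest uses: the function name identifier_to_path and the sample identifier are made up for illustration, and dropping the empty pieces left by doubled delimiters is my assumption. </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><pre class="">
```python
import re
from pathlib import PurePosixPath


def identifier_to_path(identifier):
    """Map an OAI identifier to a nested relative path (illustrative only)."""
    # Split on both ':' and '/', mirroring the behaviour described above.
    parts = re.split(r"[:/]", identifier)
    # Assumption: drop empty pieces left by doubled delimiters such as '//'.
    parts = [p for p in parts if p]
    return PurePosixPath(*parts)


# Hypothetical identifier, not taken from any real feed.
example = "oai:example.edu//repositories/2/resources/10"
print(identifier_to_path(example))
# prints: oai/example.edu/repositories/2/resources/10
```
</pre><div class=""><span style="font-size: 15px;" class="">Because OAI identifiers begin with the “oai” scheme, a split like this naturally yields a tree rooted at an oai/ directory. The actual file naming in the harvester may differ from this sketch. </span></div><div class=""><span style="font-size: 15px;" class=""><br class="">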
</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">My current collection of feeds is: </span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">export VT=<a href="https://aspace.lib.vt.edu:8082/" class="">https://aspace.lib.vt.edu:8082/</a><br class="">export VMFA=<a href="https://archives.vmfa.museum/oai" class="">https://archives.vmfa.museum/oai</a><br class="">export JMU=<a href="https://aspace.lib.jmu.edu/oai" class="">https://aspace.lib.jmu.edu/oai</a><br class="">export VCU=<a href="https://archives.library.vcu.edu/oai" class="">https://archives.library.vcu.edu/oai</a><br class="">export UVA=<a href="https://archives-oai.lib.virginia.edu" class="">https://archives-oai.lib.virginia.edu</a></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class="">If you submit your public ArchivesSpace OAI endpoint URLs as a reply to this thread, I will collect them and put them on the Wiki somewhere. 
</span></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><span style="font-size: 15px;" class="">— Steve Majewski</span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><span style="font-size: 15px;" class=""><br class=""></span></div><div class=""><br class=""></div></body></html>