[Archivesspace_Users_Group] A more forgiving version of oai-harvest & public OAI feeds

Fri Apr 19 19:17:24 EDT 2019

I have a fork of the Python oai-harvest program that is more forgiving of XML parser errors. These typically come from unescaped ampersands in ArchivesSpace notes text or in query strings embedded in URLs. It will also recover from other XML errors I’ve seen in the wild in ArchivesSpace exports, such as undeclared namespaces or undefined character entities.

Typically, it recovers from an error by dropping the node (text, attribute, or element) that fails to parse. For example, “Jones & Smith” will become “Jones Smith”, and “AT&T” may be output as “AT”. The data will therefore be incorrect, but you will get well-formed EAD output and, more importantly, the harvester will continue to completion instead of raising an XMLParser exception and quitting.
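
To make the failure mode concrete, here is a minimal sketch using only the Python standard library. It shows a strict parser rejecting an unescaped ampersand, plus one possible pre-processing workaround that escapes bare ampersands before parsing. This is not how the fork works (it drops the offending node instead); the helper names and the sample <unittitle> element are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT begin an entity or character reference
# such as &amp; &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def parses_strictly(xml_text):
    """Return True if the strict standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def escape_bare_ampersands(xml_text):
    """Hypothetical helper: rewrite bare '&' as '&amp;' before parsing."""
    return BARE_AMP.sub("&amp;", xml_text)


# Hypothetical sample record fragment with an unescaped ampersand.
bad = "<unittitle>Jones & Smith papers</unittitle>"
fixed = escape_bare_ampersands(bad)

print(parses_strictly(bad))     # the strict parser rejects the original
print(parses_strictly(fixed))   # the escaped version is well formed
print(ET.fromstring(fixed).text)
```

Escaping keeps the ampersand in the data, but it only addresses this one class of error. Undefined character entities and undeclared namespaces would still fail under a strict parser, which is where a recovering parser is more useful.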

The program logs recoverable errors, in case you want to notify the maintainers of the upstream feed so they can fix their resources.
(I believe a fix for the unescaped ampersand issue may be in the next ArchivesSpace release.)

I have submitted a pull request upstream, but in the meantime you can install it with pip or pip3 using this command:

	pip3 install git+https://github.com/sdm7g/oai-harvest.git@fix-pyoai

(The fix-pyoai branch is required.)

This version of the program adds another optional argument, --recover, which enables the parser's error recovery.

For example (where $URL is the URL of a public OAI endpoint):

	oai-harvest -p oai_ead -s collection --subdirs-on ':' --recover $URL

Better still, use the oai-reg program to create a registry database of feeds, so that you can use “all” instead of a specific URL and also do incremental harvesting. (The registry database stores the date and time of the last harvest.)

The --subdirs-on ':' option is not required, but it will split the OAI identifiers on both ':' and '/', and result in a directory tree starting at $PWD/oai/ instead of a flat collection of files.
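
As a rough illustration of that splitting, the sketch below maps an identifier to a relative path. It is not taken from the oai-harvest source: the function name, the sample identifier, and the choice to drop empty components are all assumptions, and the real program's output file naming is not modeled here.

```python
import re
from pathlib import PurePosixPath


def identifier_to_path(identifier, delimiter=":"):
    """Hypothetical helper: split an OAI identifier on the delimiter and '/'.

    Empty components (for example from '//') are dropped, which is an
    assumption made for this sketch.
    """
    pattern = "[" + re.escape(delimiter) + "/]"
    parts = [p for p in re.split(pattern, identifier) if p]
    return PurePosixPath(*parts)


# Hypothetical identifier; real ones depend on the repository.
example = "oai:example.org//repositories/2/resources/10"
print(identifier_to_path(example))
```

Because OAI identifiers begin with the "oai:" scheme, the first path component is "oai", which is consistent with the tree starting at $PWD/oai/.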

My current collection of feeds is: 

export VT=https://aspace.lib.vt.edu:8082/
export VMFA=https://archives.vmfa.museum/oai
export JMU=https://aspace.lib.jmu.edu/oai
export VCU=https://archives.library.vcu.edu/oai
export UVA=https://archives-oai.lib.virginia.edu

If you submit your public ArchivesSpace OAI endpoint URLs as a reply to this thread, I will collect them and put them on the Wiki somewhere.

— Steve Majewski
