[Archivesspace_Users_Group] OAI harvesting issue

Majewski, Steven Dennis (sdm7g) sdm7g at virginia.edu
Wed Apr 22 16:35:16 EDT 2020


I believe that the ArchivesSpace OAI feed uses system_mtime for the OAI timestamps, but the dates shown in the staff interface are usually user_mtime values. (system_mtime is used because it propagates through the record hierarchy: if you change a child archival_object but not its parent, user_mtime is updated on the archival_object only, but system_mtime is updated on both.)
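
The propagation described above can be illustrated with a toy model. This is only a sketch of the behaviour as described in this message, not ArchivesSpace code: the Record class, the touch function, and the integer timestamps are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a resource or archival_object."""
    name: str
    parent: Optional["Record"] = None
    user_mtime: int = 0
    system_mtime: int = 0


def touch(record: Record, now: int) -> None:
    """Simulate a user edit of `record` at time `now`."""
    # user_mtime changes only on the record that was actually edited.
    record.user_mtime = now
    # system_mtime is bumped on the record and on every ancestor.
    node: Optional[Record] = record
    while node is not None:
        node.system_mtime = now
        node = node.parent


resource = Record("resource")
child = Record("archival_object", parent=resource)
touch(child, now=5)

print(child.user_mtime, child.system_mtime)        # edited record
print(resource.user_mtime, resource.system_mtime)  # untouched parent
```

In this model the parent keeps its old user_mtime while its system_mtime moves forward, which is why a date in the staff interface and an OAI datestamp for the same resource can legitimately differ.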


But if I understand what you’re saying, the timestamps differ when using the harvester compared to, for example, what you see using the OAI sample form. (Is that correct?)

If that is the case, then I do think it’s an issue with the harvester. 
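
One way to check this is to pull the header datestamps out of a bulk response and out of a single-record response and compare them per identifier. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the two sample responses, including their datestamps, are invented for illustration; in practice you would feed in the XML actually returned by the endpoint.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_NS = {"oai": OAI}


def header_datestamps(response_xml: str) -> dict:
    """Map each record identifier in an OAI-PMH response to its datestamp."""
    root = ET.fromstring(response_xml)
    stamps = {}
    for header in root.iter(f"{{{OAI}}}header"):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        stamps[identifier] = datestamp
    return stamps


def differing(bulk: dict, single: dict) -> dict:
    """Return {identifier: (bulk_stamp, single_stamp)} where the two disagree."""
    return {
        ident: (bulk[ident], single[ident])
        for ident in single
        if ident in bulk and bulk[ident] != single[ident]
    }


# Invented sample responses; only the identifier comes from this thread.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <{verb}><record><header>
    <identifier>oai:columbia//repositories/2/resources/6381</identifier>
    <datestamp>{stamp}</datestamp>
  </header></record></{verb}>
</OAI-PMH>"""

bulk = header_datestamps(SAMPLE.format(verb="ListRecords", stamp="2020-03-01T00:00:00Z"))
single = header_datestamps(SAMPLE.format(verb="GetRecord", stamp="2020-04-20T00:00:00Z"))

for ident, (bulk_stamp, single_stamp) in differing(bulk, single).items():
    print(ident, bulk_stamp, single_stamp)
```

If the raw ListRecords and GetRecord responses from the server agree and only the harvester's output file is stale, that points at the harvester rather than at ArchivesSpace.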

We’re using this branch:
https://github.com/sdm7g/oai-harvest/tree/fix-pyoai

Which is a patched version of oaiharvest: https://github.com/bloomonkey/oai-harvest

For the oai_marc payload, you may be able to use the upstream version, which you can install using ‘pip install oaiharvest’. (My fixes are needed for the EAD payload: EAD export in ArchivesSpace has more bugs in it, so I’m using the recover option of the XML parser to recover from and log parse errors that would otherwise halt the harvest before completion. I don’t think MARC payloads, being simpler, have the same issues, and I know oai_dc has no similar glitches. If you do run into parse errors with oai_marc, you can use my version with the added ‘--recover’ command line option.)
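
For readers unfamiliar with the recover option, here is a rough sketch of the general idea. It is not the patch from the fix-pyoai branch. It assumes the third-party lxml package, whose XMLParser accepts recover=True and collects problems in parser.error_log, and it falls back to the strict standard-library parser when lxml is not installed. The function name parse_payload and the sample payloads are hypothetical.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree  # third-party; provides a recovering parser
except ImportError:
    etree = None


def parse_payload(xml_bytes: bytes):
    """Return (root, problems); root is None if nothing could be parsed."""
    if etree is None:
        # Strict fallback: a single parse error rejects the whole payload.
        try:
            return ET.fromstring(xml_bytes), []
        except ET.ParseError as err:
            return None, [str(err)]

    # Recovering parser: keep going past errors and log them instead.
    parser = etree.XMLParser(recover=True)
    try:
        root = etree.fromstring(xml_bytes, parser)
    except etree.XMLSyntaxError as err:
        return None, [str(err)]
    problems = [f"line {entry.line}: {entry.message}" for entry in parser.error_log]
    return root, problems


# Made-up payloads: one well-formed, one with a mismatched closing tag.
root, problems = parse_payload(b"<ead><unittitle>Papers</unittitle></ead>")
broken_root, broken_problems = parse_payload(b"<ead><unittitle>Papers</ead>")

print(root.tag, problems)
print(broken_root is not None, broken_problems)
```

The point of the design is that a harvest loop can log the problems for a bad record and move on to the next one, rather than stopping at the first malformed payload.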


If your OAI endpoint is public and you send me the URL, I can try my harvester and look at the results.


— Steve Majewski



$ oai-harvest -h
usage: oai-harvest [-h] [--db DATABASEPATH] [-p METADATAPREFIX] [-r TOKEN]
                   [-f YYYY-MM-DD] [-u YYYY-MM-DD] [-s SET] [-b HH:MM HH:MM]
                   [-d DIR] [--delete | --no-delete] [-l LIMIT]
                   [--create-subdirs | --subdirs-on SUBDIRS]
                   [--recover | --no-recover]
                   provider [provider ...]

Harvest records from an OAI-PMH provider.

positional arguments:
  provider              OAI-PMH Provider from which to harvest. This may be
                        the base URL of an OAI-PMH server, or the short name
                        of a registered provider. You may also specify "all"
                        for all registered providers.

optional arguments:
  -h, --help            show this help message and exit
  --db DATABASEPATH, --database DATABASEPATH
                        Path to provider registry database. Currently supports
                        sqlite3 only.
  -p METADATAPREFIX, --metadataPrefix METADATAPREFIX
                        the metadataPrefix of the format (XML Schema) in which
                        records should be harvested.
  -r TOKEN, --resume-from TOKEN
                        start at the given resumption TOKEN
  -f YYYY-MM-DD, --from YYYY-MM-DD
                        harvest only records added/modified after this date.
  -u YYYY-MM-DD, --until YYYY-MM-DD
                        harvest only records added/modified up to this date.
  -s SET, --set SET     harvest only records within this set
  -b HH:MM HH:MM, --between HH:MM HH:MM
                        harvest only between the first and the second wall
                        clock time (enables incremental harvesting)
  -d DIR, --dir DIR     where to output files for harvested records. default:
                        current working path
  --delete              respect the server's instructions regarding deletions,
                        i.e. delete the files locally (default)
  --no-delete           ignore the server's instructions regarding deletions,
                        i.e. DO NOT delete the files locally
  -l LIMIT, --limit LIMIT
                        limit the number of records to harvest from each
                        provider
  --create-subdirs      create target subdirs (based on / characters in
                        identifiers) if they don't exist. To use something
                        other than /, use the newer --subdirs-on option
  --subdirs-on SUBDIRS  create target subdirs based on occurrences of the
                        given character in identifiers
  --recover             create XMLParser with (recover=True) option: parser
                        will try to continue to parse broken XML payloads
  --no-recover          default is --no-recover

Copyright (c) 2013, the University of Liverpool <http://www.liv.ac.uk>. All
rights reserved. Distributed under the terms of the BSD 3-clause License
<http://opensource.org/licenses/BSD-3-Clause>.




> On Apr 22, 2020, at 3:06 PM, Kevin W. Schlottmann <kws2126 at columbia.edu> wrote:
> 
> Dear AS List,
> 
> We rely on the OAI feed to pipe updated records to various places on a nightly basis.  We recently came across some odd behavior, and we are hoping list members might have some suggestions.
> 
> We have a few resource records that have been recently updated, show the correct updated time in the staff GUI, and have the correct updated time when downloaded directly using the OAI getRecord command[1].
> 
> However, in our bulk OAI download of all records, using pyoaiharvester[2], the record's datestamp is somehow stuck on an earlier date.  
> 
> Even stranger, if we add the 'from' parameter to [2] manually with the correct date value, we *get* the records, with the correct datestamp.  
> 
> We are digging into this with help from Lyrasis, but we don't have an answer yet.  My guess is an issue with the harvester, but it's not immediately obvious what it would be.  Other avenues we're looking at are issues with the resumption token or with the indexer (the latter often being the cause of AS issues, anecdotally). Questions for the list:
> 
> 1) Is there anything known in the OAI implementation that might cause this off datestamp behavior? 
> 
> 2) Since this may be an issue with the harvester, does anyone have a preferred OAI harvester that handles marcxml?  
> 
> Best,
> 
> Kevin
> 
> [1] getRecord command; getting it as a single record has the right datestamp:
> https://{oaiendpoint}?verb=GetRecord&identifier=oai:columbia//repositories/2/resources/6381&metadataPrefix=oai_marc
> 
> [2] Using the pyoaiharvester library (https://github.com/vphill/pyoaiharvester).
> python /.../as_reports/pyoaiharvester/pyoaiharvest.py -l  {oaiendpoint} -m oai_marc -s collection -o /.../archivesspace/oai/20200419.asRaw.xml
> 
> -- 
> Kevin Schlottmann
> Head of Archives Processing
> Rare Book & Manuscript Library
> Butler Library, Room 801
> Columbia University
> 535 W. 114th St., New York, NY  10027
> (212) 854-8483
