[Archivesspace_Users_Group] Batch imports [was: EAD Import - cryptic error messages]

Tue Feb 18 15:59:51 EST 2014

On Feb 18, 2014, at 2:40 PM, Noah Huffman <noah.huffman at duke.edu> wrote:

> Hello,
>  
> As others have mentioned, some of the EAD import error log messages are rather cryptic.  Could anyone help decipher the two messages below?  After running a few test imports, I’m getting the first “wrong number of arguments” error quite a bit on schema valid EAD files.
>  
> Error: wrong number of arguments (6 for 4)
>  
> Error: Unexpected Object Type in Queue: Expected archival_object got file_version
> 
>  
> Also, is there any method for batch importing EAD that will allow the entire batch to process even if one particular file fails?
>  
> Thanks,
> -Noah

Here’s what we’ve hacked together:

[1] created a top-level env.sh file with PATH settings and env variables cribbed from scripts/jirb:

# cd to archivesspace top-level directory and source this file

export BUNDLE_GEMFILE="$PWD/backend/Gemfile"
JAVA_OPTS="$JAVA_OPTS -Daspace.config.search_user_secret=devserver -Daspace.config.public_user_secret=devserver -Daspace.config.staff_user_secret=devserver"
export JAVA_OPTS
export RUBYLIB=$PWD/common/
PATH=$PWD/build/gems/bin:$PATH:$PWD/scripts:$PWD/backend/scripts

[2] created a Ruby script in backend/scripts/ead_parse.rb
(figured out the gist of this from looking at the spec test: backend/spec/lib_ead_converter_spec.rb)

#!/usr/bin/env jruby
#
# script attempts to parse files with EADConverter
#
# You need to source env.sh to get settings for 
# JAVA_OPTS, RUBYLIB, etc. before running the script. 
#

require_relative '../spec/spec_helper'
require_relative '../app/converters/ead_converter'

ARGV.each do |eadxml|
	begin
		converter = EADConverter.new( eadxml )
		converter.run
		puts eadxml + " : ok."
		# if parse is successful, then write out json for later import
		outname = eadxml.slice(0..eadxml.rindex('.')) + 'json'
		puts "writing json to: " + outname
		# block form closes the file even if the write raises
		File.open( outname, 'w' ) do |out|
			out.write(IO.read(converter.get_output_path))
		end
	rescue Exception => e
		puts eadxml + " : failed: " + e.message
		$stderr.puts e.backtrace
	end
end
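
One caveat with the output-name line in the script: it assumes the input filename contains a dot (rindex returns nil otherwise). A minimal sketch of a more defensive version follows; json_name_for is a made-up helper name, not anything from ArchivesSpace.

```ruby
# Hypothetical helper (not part of ArchivesSpace): derive the .json
# output name from an EAD path, with or without a file extension.
def json_name_for(eadxml)
  ext  = File.extname(eadxml)                       # ".xml", or "" if none
  stem = ext.empty? ? eadxml : eadxml[0...-ext.length]
  stem + '.json'
end
```

In the script this would replace the slice/rindex expression, e.g. outname = json_name_for(eadxml).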

There will be a lot of stack traces on stderr, but stdout will just be a listing of successes and failures along with the JSON output filenames.
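
If you save stdout to a file, something like the following could tally the results afterwards. This is only a sketch: summarize_parse_log is a hypothetical name, and it assumes each line looks exactly like what the script above prints and that error messages fit on one line.

```ruby
# Hypothetical helper: split the script's stdout listing into
# successes and failures. Assumes the " : ok." and " : failed: "
# line formats printed by ead_parse.rb above.
def summarize_parse_log(lines)
  ok = []
  failed = {}
  lines.each do |raw|
    line = raw.chomp
    if line.end_with?(' : ok.')
      ok << line.sub(/ : ok\.\z/, '')
    elsif (m = line.match(/\A(.*?) : failed: (.*)\z/))
      failed[m[1]] = m[2]          # filename => error message
    end
    # "writing json to: ..." lines are ignored
  end
  { ok: ok, failed: failed }
end
```

For example, summarize_parse_log(File.readlines('parse.log')) where parse.log is whatever file you redirected stdout to.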

Running that script on a directory full of EAD XML files will leave behind .json files for the ones that successfully parse.

You can then import the JSON files with something like ( where $N = repository number ) :

for JSON in *.json
do
  curl_as_osx admin admin -d @$JSON http://localhost:8089/repositories/$N/batch_imports
done

An advantage of the separate parse & import is that you can do all of the parsing on your development laptop, 
but post to another server’s backend port. 
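
The same import loop could also be done from Ruby, which makes it easy to keep going when one file fails and to collect the errors. A rough sketch, untested against a live backend: the function names are made up, and it assumes the backend's POST /users/:user/login endpoint (returning a JSON "session" field), the X-ArchivesSpace-Session header, and that batch_imports accepts the file contents with a JSON content type.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Log in and return a session token (assumed login endpoint, see above).
def aspace_login(backend, user, password)
  res = Net::HTTP.post_form(URI("#{backend}/users/#{user}/login"),
                            'password' => password)
  raise "login failed: HTTP #{res.code}" unless res.is_a?(Net::HTTPSuccess)
  JSON.parse(res.body).fetch('session')
end

# POST one JSON file to the repository's batch_imports endpoint.
def post_batch(backend, repo_id, session, path)
  uri = URI("#{backend}/repositories/#{repo_id}/batch_imports")
  req = Net::HTTP::Post.new(uri)
  req['X-ArchivesSpace-Session'] = session
  req['Content-Type'] = 'application/json'   # assumption, curl -d sends a form type
  req.body = File.read(path)
  res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
  raise "HTTP #{res.code}: #{res.body}" unless res.is_a?(Net::HTTPSuccess)
  res.body
end

# Run the block once per file; record failures instead of aborting.
def import_each(paths)
  results = {}
  paths.each do |path|
    begin
      yield path
      results[path] = :ok
    rescue StandardError => e
      results[path] = e.message
    end
  end
  results
end

# Usage (not run here; repository number 2 is just an example):
#   backend = 'http://localhost:8089'
#   session = aspace_login(backend, 'admin', 'admin')
#   results = import_each(Dir['*.json']) { |f| post_batch(backend, 2, session, f) }
#   results.each { |f, r| puts "#{f} : #{r}" }
```

Large imports may run longer than Net::HTTP's default read timeout, so that probably needs raising for real use.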

— Steve Majewski
