[Archivesspace_Users_Group] Problems & Strategies for importing very large EAD files

Mon Dec 8 17:20:14 EST 2014

The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB. 

I have tried importing these guides through the frontend webapp import, but I have always run out of time or 
patience after a couple of hours, before they have managed to import successfully.  

I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a Ruby command line program,
and writing out the resulting JSON file to import later using the backend API.  The first attempt ran out of memory after a couple
of hours.  I increased '-Xmx300m' to '-Xmx500m' in the JAVA_OPTS. The second attempt took just over 24 hours ( including an hour or
more of sleep time while my laptop was in transit ).  This failed with the error: 

failed: #<:ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>

which I eventually traced to this:
	<unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>

But before fixing this and trying again I wanted to ask if others had experience with importing large files. 

Should I try increasing -Xmx memory radically more, or perhaps MaxPermSize as well ? 

24 hours for 14MB is far worse than a linear extrapolation from the 3 or 4MB files I have managed to import. 
Is this kind of performance expected?  Is there anything else that should be reconfigured? 

[ Waiting 24 hours for an error message, and then repeating the whole process, seems like another good reason to want a schema that 
  describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and 
  I've never seen an XSLT stylesheet take that long to validate. ] 
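
In the absence of such a schema, a partial pre-flight check for this one class of error can be sketched in a few lines of Ruby. 
This is only an illustration, not anything ArchivesSpace provides: the method name reversed_unitdates is made up, it assumes 
the normal attribute holds ISO 8601-style values ( YYYY, YYYY-MM or YYYY-MM-DD ) separated by a slash, and it compares 
the two values only to the precision they share, so it may miss cases the importer itself would reject. 

```ruby
# Matches <unitdate ... normal="BEGIN/END" ...> and captures BEGIN and END.
NORMAL_RANGE = /<unitdate\b[^>]*?\bnormal\s*=\s*"([^"\/]+)\/([^"\/]+)"/

# Hypothetical helper: returns [line_number, "BEGIN/END"] pairs for every
# unitdate whose normalized end date sorts before its begin date.
# Assumption: values are ISO 8601-style, so string comparison of the
# shared-precision prefix orders them chronologically.
def reversed_unitdates(xml)
  problems = []
  xml.each_line.with_index(1) do |line, lineno|
    line.scan(NORMAL_RANGE) do |first, last|
      width = [first.length, last.length].min
      problems << [lineno, "#{first}/#{last}"] if last[0, width] < first[0, width]
    end
  end
  problems
end
```

Calling reversed_unitdates( File.read( eadfile ) ) on a guide and printing the pairs would list the line number and the 
offending normal value for each reversed range, such as the 1916-06/1916-05 case above. 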

Also: a lot of those smaller EAD files are collections that are split up into separate guides, sometimes due to separate accessions:
xxxx-a, xxxx-b, xxxx-c… 
The more recent tendency has been to unify a collection into a single guide, and we were expecting to perhaps use ArchivesSpace
to unify some of these separate guides after they had been imported separately. Should we be reconsidering that plan? 

If increasing memory further doesn’t help, and if we don’t find any other way to improve import performance, I’m considering 
disassembling the file:  strip out much of the <dsc> and import a skeleton guide, and then process the <dsc>, adding resources to 
the guide a few at a time via the backend API. 
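
The disassembly step might look roughly like the following Ruby sketch, which uses REXML from the Ruby distribution. 
The method name split_dsc is hypothetical, and the sketch only covers separating the skeleton from the top-level 
components of the first <dsc>; it makes no assumptions about how the pieces would then be sent to the backend API. 

```ruby
require "rexml/document"

# Hypothetical helper: splits an EAD document into a skeleton (the full
# document with the top-level components removed from its first <dsc>)
# and an array holding the XML text of each removed component.
def split_dsc(ead_xml)
  doc = REXML::Document.new(ead_xml)
  dsc = doc.root.find_first_recursive { |node| node.name == "dsc" }
  return [ead_xml, []] if dsc.nil?

  # Top-level components are <c> or numbered <c01>..<c12> children of <dsc>;
  # other children such as <head> stay in the skeleton.
  components = dsc.elements.to_a.select { |el| el.name.match?(/\Ac(\d\d)?\z/) }
  serialized = components.map do |el|
    dsc.delete_element(el)
    el.to_s
  end
  [doc.to_s, serialized]
end
```

One caveat: namespace declarations made on the root element are not copied onto the detached components, and REXML 
may itself be slow on a 14MB file, so this is meant to show the shape of the approach rather than a tuned tool. 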

Anyone have any other ideas or experience to report ? 

— Steve Majewski / UVA Alderman Library
