[Archivesspace_Users_Group] Problems & Strategies for importing very large EAD files
Steven Majewski
sdm7g at virginia.edu
Mon Dec 8 17:20:14 EST 2014
The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB.
I have tried importing these guides through the frontend webapp import; I have always hit the limit of my time or patience
after a couple of hours, before they managed to import successfully.
I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a Ruby command-line program
and writing out the resulting JSON file, to be imported later using the backend API. The first attempt ran out of memory after a couple
of hours. I increased ‘-Xmx300m’ to ‘-Xmx500m’ in the JAVA_OPTS. The second attempt took just over 24 hours (with an hour or
more of sleep time while my laptop was in transit). It failed with the error:
failed: #<:ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>
which I eventually traced to this element, whose normal attribute puts the later month first (begin 1916-06, end 1916-05), so the importer sees a date range that ends before it begins:
<unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>
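For reference, the conversion step was essentially the following (a rough sketch: it assumes the script is run inside the backend environment so that EADConverter is loadable, and the run / get_output_path calls are what I gleaned from the Converter base class, so check them against your version):

    # Convert an EAD file into an import JSON batch outside the webapp.
    # Usage: convert_ead.rb <eadfile> <jsonfile>
    eadfile, jsonfile = ARGV

    converter = EADConverter.new(eadfile)
    converter.run                               # parse the EAD and build the batch
    # copy the generated batch JSON somewhere permanent for a later backend import
    File.open(jsonfile, 'w') { |out| out.write(File.read(converter.get_output_path)) }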
But before fixing this and trying again I wanted to ask if others had experience with importing large files.
Should I try increasing -Xmx much more aggressively, or perhaps MaxPermSize as well?
24 hours for a 14 MB file is far worse than a linear projection from the 3 or 4 MB files I have managed to import.
Does this kind of performance seem expected? Is there anything else that should be reconfigured?
[ Waiting 24 hours for an error message, then repeating the whole process, seems like another good reason to want a schema that
describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and
I’ve never seen an XSLT stylesheet take anywhere near that long to validate a file. ]
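In the meantime, a crude pre-flight check for this particular class of error is easy enough to script. This sketch uses Nokogiri and only looks for reversed ranges in unitdate normal attributes; the string comparison is naive, so mixed-granularity ranges like 1916-05/1916 may be flagged falsely:

    # Flag unitdate normal attributes whose end value sorts before the begin value.
    require 'nokogiri'

    doc = Nokogiri::XML(File.read(ARGV[0]))
    doc.xpath('//*[local-name()="unitdate"][@normal]').each do |el|
      from, to = el['normal'].split('/', 2)
      next unless to                       # single dates can't be reversed
      # ISO-style dates of equal granularity compare correctly as plain strings
      warn "reversed range #{el['normal']}: #{el.to_xml}" if to < from
    end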
Also: a lot of those smaller EAD files are collections that have been split into separate guides, sometimes due to separate accessions:
xxxx-a, xxxx-b, xxxx-c…
The more recent tendency has been to unify a collection into a single guide, and we were expecting to use ArchivesSpace
to merge some of these separate guides after importing them individually. Should we be reconsidering that plan?
If increasing memory further doesn’t help, and if we don’t find any other way to improve import performance, I’m considering
disassembling the file: stripping out most of the <dsc> and importing a skeleton guide, then processing the <dsc> separately,
adding its components to the resource a few at a time via the backend API.
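That chunked approach would look roughly like this (a sketch only, not something I have tried yet: the host, credentials, repository and resource ids, and the example component are all placeholders, and the endpoints are the stock login and archival_objects routes as I understand them):

    # Push components onto an already-imported skeleton resource, one (or a few)
    # at a time, via the backend API.
    require 'net/http'
    require 'json'
    require 'uri'

    backend = URI('http://localhost:8089')           # placeholder backend address
    http = Net::HTTP.new(backend.host, backend.port)

    # log in and grab a session token
    resp = http.post('/users/admin/login', 'password=admin')
    headers = { 'X-ArchivesSpace-Session' => JSON.parse(resp.body)['session'] }

    # one archival object per <dsc> component, linked to the skeleton resource
    component = {
      'title'    => 'Correspondence, May-June 1916',              # placeholder
      'level'    => 'file',
      'resource' => { 'ref' => '/repositories/2/resources/1' }    # placeholder ids
    }
    resp = http.post('/repositories/2/archival_objects',
                     JSON.generate(component), headers)
    puts resp.body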
Anyone have any other ideas or experience to report?
— Steve Majewski / UVA Alderman Library