[Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

Chris Fitzpatrick Chris.Fitzpatrick at lyrasis.org
Tue Dec 9 03:28:56 EST 2014



Hi Steven,

Yeah, you're probably going to want to have the -Xmx at around the 1G default. I've dropped down to 512M ( which is what the sandbox.archivesspace.org runs on ), but it'll be rather sluggish.

If you can't go that high, you can start the application with the public and indexer apps disabled, which will reduce some of the overhead ( especially the indexer ).

Try that and see if it helps the issue.
b,chris. 

Chris Fitzpatrick | Developer, ArchivesSpace
Skype: chrisfitzpat  | Phone: 918.236.6048
http://archivesspace.org/

________________________________________
From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on behalf of Steven Majewski <sdm7g at virginia.edu>
Sent: Monday, December 8, 2014 11:20 PM
To: archivesspace_users_group at lyralists.lyrasis.org
Cc: archivesspace at googlegroups.com
Subject: [archivesspace] Problems & Strategies for importing very large EAD files

The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB.

I have tried importing these guides through the frontend webapp import, but my time or patience has always
run out after a couple of hours, before they managed to import successfully.

I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a ruby command line program,
and writing out the resulting json file to later import using the backend API.  The first attempt ran out of memory after a couple
of hours.  I increased ‘-Xmx300m’ to ‘-Xmx500m’ in the JAVA_OPTS. The 2nd attempt took just over 24 hours ( with an hour or
more of sleep time while my laptop was in-transit ).  This failed with the error:

failed: #<:ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>

which I eventually traced to this:
        <unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>


But before fixing this and trying again I wanted to ask if others had experience with importing large files.

Should I try increasing -Xmx memory radically more, or perhaps MaxPermSize as well ?

24 hours for 14MB does not seem to be a linear projection from the 3 or 4MB files I have managed to import.
Does this type of performance seem expected ?   Is there anything else that should be reconfigured ?

[ Waiting 24 hours for an error message, then repeat again, seems like another good reason to want a schema that
  describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and
  I’ve never seen an XSLT stylesheet take that long to validate. ]
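
A narrow pre-flight check for the one error reported above (a `normal` range whose end precedes its begin) can be sketched in Python with only the standard library. This is not the validation ArchivesSpace itself performs: the function name `reversed_unitdates` and the comparison rule (strip hyphens, then compare both dates at the precision of the shorter one) are assumptions made for illustration, and it only looks at `<unitdate normal="begin/end">` values.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace such as '{urn:isbn:1-931666-22-9}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def reversed_unitdates(ead_xml):
    """Return (normal, text) pairs for unitdates whose end is before begin.

    Assumption: dates are ISO-like (YYYY, YYYY-MM, YYYY-MM-DD or YYYYMMDD),
    so they can be compared as digit strings once hyphens are removed.
    """
    root = ET.fromstring(ead_xml)
    problems = []
    for el in root.iter():
        if not isinstance(el.tag, str) or local_name(el.tag) != "unitdate":
            continue
        normal = el.get("normal", "")
        if "/" not in normal:
            continue  # a single date, not a range
        begin, end = (p.strip().replace("-", "") for p in normal.split("/", 1))
        # Compare at the precision both sides share, so that
        # "1916-05/1916" (May 1916 to end of 1916) is not flagged.
        n = min(len(begin), len(end))
        if n and end[:n] < begin[:n]:
            problems.append((normal, (el.text or "").strip()))
    return problems
```

Run against the text of an EAD file, for example `reversed_unitdates(open(path, encoding="utf-8").read())`, it lists each offending `normal` value next to its display text, so the fragment quoted earlier would be reported in seconds rather than after a full import attempt.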


Also: a lot of those smaller EAD files are collections that are split up into separated guides — sometimes due to separate accessions:
xxxx-a, xxxx-b, xxxx-c…
The more recent tendency has been to unify a collection into a single guide, and we were expecting to perhaps use  ArchivesSpace
to unify some of these separate guides after they had been separately imported. Should we be reconsidering that plan ?

If increasing memory further doesn’t help, and if we don’t find any other way to improve import performance, I’m considering
disassembling the file:  strip out much of the <dsc> and import a skeleton guide, and then process the <dsc>, adding resources to
the guide a few at a time via the backend API.
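
The disassembly step described above could be sketched roughly as follows, again in Python with the standard library. It only covers the splitting: the function name `split_dsc` and the `batch_size` parameter are hypothetical, it treats the direct `<c>` / `<c01>` children of the first `<dsc>` as the units to defer, and it says nothing about how the pieces would then be posted to the backend API or whether the importer accepts a guide with an empty `<dsc>`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace such as '{urn:isbn:1-931666-22-9}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def split_dsc(ead_xml, batch_size=50):
    """Split an EAD document into a skeleton and batches of components.

    Returns (skeleton_xml, batches). The skeleton keeps the <dsc> element
    but without its top-level components; each batch is a list of
    serialized <c>/<c01> elements, at most batch_size per batch.
    """
    root = ET.fromstring(ead_xml)
    dsc = next((el for el in root.iter() if local_name(el.tag) == "dsc"), None)
    if dsc is None:
        return ET.tostring(root, encoding="unicode"), []

    serialized = []
    for child in list(dsc):  # copy the child list: we remove while looping
        if local_name(child.tag) in ("c", "c01"):
            dsc.remove(child)
            serialized.append(ET.tostring(child, encoding="unicode").strip())

    batches = [
        serialized[i:i + batch_size]
        for i in range(0, len(serialized), batch_size)
    ]
    return ET.tostring(root, encoding="unicode"), batches
```

One caveat of this sketch: for namespaced EAD, ElementTree writes generated prefixes such as `ns0:` on output. That is still well-formed XML, but the result should be checked before it is fed to any importer.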

Anyone have any other ideas or experience to report ?


— Steve Majewski / UVA Alderman Library




