[Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

Thu Dec 11 14:08:09 EST 2014

You should try importing those EADs into the AT (www.archiviststoolkit.org)
to establish some kind of baseline, since those import times seem way too
long for relatively small datasets.  For example, migration of a 600MB AT
database into ASpace takes under 5 hours.  You can then use the AT
migration plugin to transfer those records over to ASpace and see how long
that takes.

You can send a couple of those EADs to me directly (ns96 at nyu.edu) and I can
try importing them into AT and transferring them to ASpace on our end.

On Thu, Dec 11, 2014 at 12:43 PM, Steven Majewski <sdm7g at virginia.edu>
wrote:
>
>
> I increased memory sizes in JAVA_OPTS:
>
> JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=512m -Xmx1080m -Xss8m"
>
> This is for a command-line program running: EADConverter.new( eadxml )
> The backend server was also running on the same machine, because I called
> JSONModel::init in client_mode,
> and it wants a server to talk to.
>
> I fixed the date problem in my xml that caused it to be incomplete on the
> previous run.
> I added some debug trace statements in the JSONModel validate calls to
> track progress.
> ( Comparing the amount of debug output vs. the size of JSON output, I
> don’t think this
>   had a significant effect on performance, but I could be wrong. )
>
> Running on a MacBook Pro w/ 2 GHz Intel Core i7 and 16 GB 1600 MHz DDR3
> memory and SSD.
>
> I ran Activity Monitor during the run to check on CPU and memory usage;
> memory seemed to peak at around 700MB.
> The process seemed to run at 100% +/- 3% CPU.   I ran another parse
> operation in parallel on another smaller file
> for a time. This didn’t seem to have a big impact: both cores were running
> close to 100% rather than just one.
> ( I assume there isn’t much parallelism to be wrung out of the parse
> operation. )  That 2nd file was just over 6MB and
> took over 6 hours of CPU time. ( Not sure of the exact time, as I wasn’t
> nearby when it finished. Real time was much
> greater, but I’m not sure that the laptop didn’t sleep for part of that
> time. )  Time for the 14MB file was over 24 hours
> of CPU time ( greater than that in actual wall-clock time ).  The previous
> attempt ( with less memory allocated ) was
> 24 hours or more in real time — I did not measure CPU time on that attempt
> — but it got a validation error and the
> resulting JSON file was incomplete : there were internal URI references
> that were never defined in the JSON file.
>
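Taking the two figures quoted above at face value (just over 6MB in over 6 CPU-hours, 14MB in over 24 CPU-hours), a rough power-law fit shows how far from linear the growth is. This is only a back-of-the-envelope sketch: both times are lower bounds, two points are not much of a fit, and the helper name is made up for illustration.

```ruby
# Estimate the exponent k in "time ~ size**k" from two measurements.
# Hypothetical helper for illustration; not part of ArchivesSpace.
def scaling_exponent(size_a, time_a, size_b, time_b)
  Math.log(time_b.to_f / time_a) / Math.log(size_b.to_f / size_a)
end

# Sizes in MB, times in CPU-hours, as reported in the message above.
k = scaling_exponent(6, 6, 14, 24)
puts format("estimated exponent: %.2f", k)
# A linear process would give k close to 1.0.
```

With these inputs the exponent comes out at roughly 1.6, so the cost is growing noticeably faster than file size.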
>
> After completion, I attempted to POST it to the backend server at
>  /repositories/$REPO/batch_imports.  First to the backend
> development instance running on my laptop, then to our actual test server.
> That attempt to post to the test server took
> too long to finish while at the office, and my VPN at home usually times
> out after an hour or two, so I copied the JSON
> file up to the server and did the ‘nohup curl_as’ to the backend port on
> the same server, and left it running overnight.
>
> So success, but there was an incredible amount of overhead in the parsing
> and validation.
>
> Any ideas where the bottleneck is in this operation ?
>
> IF the overhead is due to the validation (and here, I’m just guessing),
> that would be another argument for validating
> the EAD against a schema before JSON parsing,  and perhaps providing a
>  ‘trusted path’ parser that skips that validation.
>
>
> I’m learning my way around the code base, but not quite sure I’m ready to
> tackle that myself yet!
> I did run the code with 'raise_errors=false', but that still runs the
> validations.
>
> We may attempt this again in the future with more memory and better
> instrumentation, but for now, I just needed to
> determine whether we could actually successfully ingest our largest
> guides.
>
>
> — Steve Majewski / UVA Alderman Library
>
>
> On Dec 9, 2014, at 3:28 AM, Chris Fitzpatrick <
> Chris.Fitzpatrick at lyrasis.org> wrote:
>
>
>
> Hi Steven,
>
> Yeah, you're probably going to want to have the -Xmx at around the 1G
> default. I've dropped down to 512M ( which is what the
> sandbox.archivesspace.org runs on ), but it'll be rather sluggish.
>
> If you can't go that high, you can start the application with the public and
> index apps disabled, which will reduce some of the overhead ( especially
> the indexer ).
>
> Try that and see if it helps the issue.
> b,chris.
>
> Chris Fitzpatrick | Developer, ArchivesSpace
> Skype: chrisfitzpat  | Phone: 918.236.6048
> http://archivesspace.org/
>
> ________________________________________
> From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on
> behalf of Steven Majewski <sdm7g at virginia.edu>
> Sent: Monday, December 8, 2014 11:20 PM
> To: archivesspace_users_group at lyralists.lyrasis.org
> Cc: archivesspace at googlegroups.com
> Subject: [archivesspace] Problems & Strategies for importing very large
> EAD files
>
> The majority of our EAD guides are less than 1MB in size. We have a few
> larger ones, the two largest being around 13-14 MB.
>
> I have tried importing these guides from the frontend webapp import: I have
> always run out of time or patience
> after a couple of hours, before they have managed to import
> successfully.
>
> I recently attempted importing the largest file again by calling
> EADConverter.new( eadfile ) from a ruby command line program,
> and writing out the resulting json file to later import using the backend
> API.  The first attempt ran out of memory after a couple
> of hours.  I increased ‘-Xmx300m’ to ‘-Xmx500m’ in the JAVA_OPTS. The 2nd
> attempt took just over 24 hours ( with an hour or
> more of sleep time while my laptop was in-transit ).  This failed with the
> error:
>
> failed: #<:ValidationException: {:errors=>{"dates/0/end"=>["must not be
> before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>
>
> which I eventually traced to this:
>        <unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>
>
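A reversed range like the one above can be caught in seconds before starting a long import. The following is a minimal sketch, assuming the dates are zero-padded ISO 8601 strings in a normal="begin/end" attribute that sits on one line with its <unitdate> tag. It is a plain regex scan, not a real XML parse and not the check ArchivesSpace itself performs; the constant and method names are invented for illustration.

```ruby
# Rough pre-flight scan: report <unitdate normal="begin/end"> values whose
# end sorts before the begin. Hypothetical helper, not part of ArchivesSpace.
UNITDATE_NORMAL = /<unitdate\b[^>]*\bnormal\s*=\s*"([^"\/]+)\/([^"\/]+)"/

def reversed_date_ranges(ead_xml)
  problems = []
  ead_xml.each_line.with_index(1) do |line, lineno|
    line.scan(UNITDATE_NORMAL) do |from, to|
      # Compare at the precision both values share (e.g. YYYY-MM), so that
      # "1916-05/1916" is not flagged. Zero-padded ISO 8601 strings of equal
      # length sort chronologically as plain strings.
      width = [from.length, to.length].min
      problems << [lineno, "#{from}/#{to}"] if to[0, width] < from[0, width]
    end
  end
  problems
end

# Usage: ruby check_unitdates.rb guide1.xml guide2.xml
if __FILE__ == $PROGRAM_NAME
  ARGV.each do |path|
    reversed_date_ranges(File.read(path)).each do |lineno, range|
      puts "#{path}:#{lineno}: end before begin in normal=\"#{range}\""
    end
  end
end
```

This only covers the one error reported here; other validation failures would still need the importer (or a schema) to find them.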
>
> But before fixing this and trying again I wanted to ask if others had
> experience with importing large files.
>
> Should I try increasing -Xmx memory radically more, or perhaps MaxPermSize
> as well ?
>
> 24 hours for 14MB does not seem to be a linear projection from the 3 or
> 4MB files I have managed to import.
> Does this type of performance seem expected ?   Is there anything else
> that should be reconfigured ?
>
> [ Waiting 24 hours for an error message, then repeat again, seems like
> another good reason to want a schema that
>  describes the flavor of EAD that ArchivesSpace accepts: it would be nice
> to be able to do a pre-flight check, and
>  I’ve never seen an XSLT stylesheet take that long to validate. ]
>
>
> Also: a lot of those smaller EAD files are collections that are split up
> into separated guides — sometimes due to separate accessions:
> xxxx-a, xxxx-b, xxxx-c…
> The more recent tendency has been to unify a collection into a single
> guide, and we were expecting to perhaps use  ArchivesSpace
> to unify some of these separate guides after they had been separately
> imported. Should we be reconsidering that plan ?
>
> If increasing memory further doesn’t help, and if we don’t find any other
> way to improve import performance, I’m considering
> disassembling the file:  strip out much of the <dsc> and import a skeleton
> guide, and then process the <dsc>, adding resources to
> the guide a few at a time via the backend API.
>
> Anyone have any other ideas or experience to report ?
>
>
> — Steve Majewski / UVA Alderman Library
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>

-- 
Nathan Stevens
Programmer/Analyst
Digital Library Technology Services
New York University

1212-998-2653
ns96 at nyu.edu