[Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

Chris Fitzpatrick Chris.Fitzpatrick at lyrasis.org
Fri Dec 12 03:29:24 EST 2014


Hi Steve,



Ok, so you're running the process in the jirb console?

That will not be very efficient, as the irb console is not really built with performance in mind...


Validating it against a schema is not a bad idea, but we don't have an XML schema to validate it against. The EAD schema is very loose in terms of what it allows, so having something that's EAD-valid won't really be helpful. Someone would need to make an ArchivesSpace XML schema that we can validate against...

b,chris.




Chris Fitzpatrick | Developer, ArchivesSpace
Skype: chrisfitzpat  | Phone: 918.236.6048
http://archivesspace.org/
________________________________
From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on behalf of Nathan Stevens <ns96 at nyu.edu>
Sent: Thursday, December 11, 2014 8:08 PM
To: Archivesspace Users Group
Cc: archivesspace at googlegroups.com
Subject: Re: [Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

You should try importing those EADs into the AT (www.archiviststoolkit.org) to establish some kind of baseline, since those import times seem way too long for relatively small datasets.  For example, migration of a 600MB AT database into ASpace takes under 5 hours.  You can then use the AT migration plugin to transfer those records over to ASpace and see how long that takes.

You can send a couple of those EADs to me directly (ns96 at nyu.edu) and I can try importing them into the AT and transferring them to ASpace on our end.

On Thu, Dec 11, 2014 at 12:43 PM, Steven Majewski <sdm7g at virginia.edu> wrote:

I increased memory sizes in JAVA_OPTS:

JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=512m -Xmx1080m -Xss8m"

This was for a command-line program running EADConverter.new( eadxml ).
The backend server was also running on the same machine, because I called JSONModel::init in client_mode,
and it wants a server to talk to.
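
( For anyone trying the same thing, a minimal sketch of what such a driver looks like; the require paths, file names, and backend URL below are placeholders, and it assumes the converter API of that era, i.e. EADConverter#run writing its JSON to a temp file reachable via get_output_path: )

    # Sketch only: require paths and URLs are placeholders, not the exact script used.
    # Assumes the ArchivesSpace source tree is on the load path and a backend is
    # already listening on localhost:8089.
    require 'jsonmodel'        # placeholder: load JSONModel from the ASpace tree
    require 'ead_converter'    # placeholder: load the EAD converter

    JSONModel::init( :client_mode => true, :url => "http://localhost:8089" )

    converter = EADConverter.new( ARGV[0] )  # path to the EAD XML file
    converter.run                            # parse + validate; this is the slow part
    File.rename( converter.get_output_path, "converted.json" )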

I fixed the date problem in my XML that caused the output to be incomplete on the previous run.
I added some debug trace statements in the JSONModel validate calls to track progress.
( Comparing the amount of debug output vs. the size of JSON output, I don’t think this
  had a significant effect on performance, but I could be wrong. )

Running on a MacBook Pro w/ 2 GHz Intel Core i7 and 16 GB 1600 MHz DDR3 memory and SSD.

I ran Activity Monitor during the run to check on CPU and memory usage; memory usage seemed to peak at around 700MB.
The process seemed to run at 100% +/- 3% CPU.   I ran another parse operation in parallel on another smaller file
for a time. This didn't seem to have a big impact: both cores were running close to 100% rather than just one.
( I assume there isn't much parallelism to be wrung out of the parse operation. )  That 2nd file was just over 6MB and
took over 6 hours of CPU time. ( Not sure of the exact time, as I wasn't nearby when it finished; real time was much
greater, but I'm not sure that the laptop didn't sleep for part of that time. )  Time for the 14MB file was over 24 hours
of CPU time (greater than that in actual wall-clock time).  The previous attempt ( with less memory allocated ) took
24 hours or more in real time — I did not measure CPU time on that attempt — but it got a validation error and the
resulting JSON file was incomplete: there were internal URI references that were never defined in the JSON file.


After completion, I attempted to POST it to the backend server endpoint /repositories/$REPO/batch_imports.  First to the backend
development instance running on my laptop, then to our actual test server. The attempt to post to the test server took
too long to finish while I was at the office, and my VPN at home usually times out after an hour or two, so I copied the JSON
file up to the server, did the 'nohup curl_as' POST to the backend port on the same server, and left it running overnight.
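
( For reference, the same POST can be done without curl; a minimal sketch in Ruby, assuming the standard backend login flow and the batch_imports endpoint mentioned above, with host, repo id, and credentials as placeholders: )

    require 'net/http'
    require 'json'
    require 'uri'

    backend = "http://localhost:8089"   # placeholder backend URL
    repo_id = 2                         # placeholder repository id

    # log in to get a session token
    login   = Net::HTTP.post_form( URI("#{backend}/users/admin/login"), "password" => "admin" )
    session = JSON.parse(login.body)["session"]

    # POST the converted JSON; allow a very long read timeout for big batches
    uri = URI("#{backend}/repositories/#{repo_id}/batch_imports")
    req = Net::HTTP::Post.new(uri.request_uri)
    req["X-ArchivesSpace-Session"] = session
    req["Content-Type"]            = "application/json"
    req.body = File.read("converted.json")

    res = Net::HTTP.start( uri.host, uri.port, :read_timeout => 12 * 60 * 60 ) do |http|
      http.request(req)
    end
    puts res.code, res.body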

So success, but there was an incredible amount of overhead in the parsing and validation.

Any ideas where the bottleneck is in this operation ?

If the overhead is due to the validation (and here, I'm just guessing), that would be another argument for validating
the EAD against a schema before JSON parsing, and perhaps for providing a 'trusted path' parser that skips that validation.


I’m learning my way around the code base, but not quite sure I’m ready to tackle that myself yet!
I did run the code with 'raise_errors=false’ , but that still runs the validations.

We may attempt this again in the future with more memory and better instrumentation, but for now, I just needed to
determine whether we could actually successfully ingest our largest guides.


— Steve Majewski / UVA Alderman Library


On Dec 9, 2014, at 3:28 AM, Chris Fitzpatrick <Chris.Fitzpatrick at lyrasis.org> wrote:



Hi Steven,

Yeah, you're probably going to want to keep the -Xmx at around the 1G default. I've dropped it down to 512M ( which is what sandbox.archivesspace.org runs on ), but it'll be rather sluggish.

If you can go that high, you can start the application with the public and indexer apps disabled, which will reduce some of the overhead ( especially the indexer ).
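
( Roughly, in config/config.rb; these option names are a best guess, so double-check config-defaults.rb for your version: )

    # Disable the UI apps and the indexer while doing a big import run.
    # Option names are a guess from config-defaults.rb; verify for your release.
    AppConfig[:enable_public]   = false   # public UI
    AppConfig[:enable_frontend] = false   # staff UI (only if you're importing via the backend API)
    AppConfig[:enable_indexer]  = false   # the indexer is usually the biggest overhead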

Try that and see if it helps the issue.
b,chris.

Chris Fitzpatrick | Developer, ArchivesSpace
Skype: chrisfitzpat  | Phone: 918.236.6048
http://archivesspace.org/

________________________________________
From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on behalf of Steven Majewski <sdm7g at virginia.edu>
Sent: Monday, December 8, 2014 11:20 PM
To: archivesspace_users_group at lyralists.lyrasis.org
Cc: archivesspace at googlegroups.com
Subject: [archivesspace] Problems & Strategies for importing very large EAD files

The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB.

I have tried importing these guides from the frontend webapp import: I have always run out of time or patience
after a couple of hours, before they managed to import successfully.

I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a Ruby command-line program,
and writing out the resulting JSON file to later import using the backend API.  The first attempt ran out of memory after a couple
of hours.  I increased '-Xmx300m' to '-Xmx500m' in the JAVA_OPTS. The 2nd attempt took just over 24 hours ( with an hour or
more of sleep time while my laptop was in transit ).  This failed with the error:

failed: #<:ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>\"}>

which I eventually traced to this:
       <unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>

( The normal attribute gives 1916-06 as the begin of the range and 1916-05 as the end, so the end date falls before the begin date. )


But before fixing this and trying again I wanted to ask if others had experience with importing large files.

Should I try increasing -Xmx memory radically more, or perhaps MaxPermSize as well ?

24 hours for 14MB does not seem to be a linear projection from the 3 or 4MB files I have managed to import.
Does this type of performance seem expected ?   Is there anything else that should be reconfigured ?

[ Waiting 24 hours for an error message, then repeating the whole run, seems like another good reason to want a schema that
 describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and
 I've never seen an XSLT stylesheet take that long to validate. ]


Also: a lot of those smaller EAD files are collections that are split up into separate guides — sometimes due to separate accessions:
xxxx-a, xxxx-b, xxxx-c…
The more recent tendency has been to unify a collection into a single guide, and we were expecting to perhaps use ArchivesSpace
to unify some of these separate guides after they had been separately imported. Should we be reconsidering that plan ?

If increasing memory further doesn’t help, and if we don’t find any other way to improve import performance, I’m considering
disassembling the file:  strip out much of the <dsc> and import a skeleton guide, and then process the <dsc>, adding resources to
the guide a few at a time via the backend API.
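
( A rough sketch of that disassembly step, using Nokogiri; the file names and XPaths are illustrative, and re-adding the components via the backend API would be the separate batching step described above: )

    require 'nokogiri'

    ead = Nokogiri::XML( File.read("guide.xml") )
    ead.remove_namespaces!                            # simplifies the XPaths for EAD 2002

    dsc        = ead.at_xpath("//dsc")
    components = dsc ? dsc.xpath("./c01 | ./c") : []  # top-level components to re-add later
    components.each(&:unlink)                         # strip them out of the skeleton guide

    File.write("guide-skeleton.xml", ead.to_xml)
    # Import guide-skeleton.xml first, then push the saved components back to the
    # resource a few at a time through the backend API.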

Anyone have any other ideas or experience to report ?


— Steve Majewski / UVA Alderman Library










--
Nathan Stevens
Programmer/Analyst
Digital Library Technology Services
New York University

1212-998-2653
ns96 at nyu.edu


