[Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

Chris Fitzpatrick Chris.Fitzpatrick at lyrasis.org
Mon Dec 15 18:31:39 EST 2014


Hi,

Yes, I actually would expect this. We did some work to improve performance on the migration endpoints, but I don't think we've done any on the EAD importer. For example, I'd bet a donut that one bottleneck is the component reordering issue we ran into for migration ( when a component is added, the tree is reordered in the db, since the normal use case, done via the UI, assumes this is necessary. )
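
As a back-of-the-envelope illustration of the cost of that pattern (a toy sketch, not ArchivesSpace code): if every added component triggers a resequencing of all of its siblings, importing n siblings performs on the order of n^2 position writes, where a bulk path could assign each position exactly once.

n = 5_000   # components under one parent

# one-at-a-time adds: after each add, resequence the siblings present so far
per_add_writes = (1..n).reduce(:+)   # 1 + 2 + ... + n = 12_502_500

# bulk import: assign every position exactly once
bulk_writes = n                      # 5_000

puts "per-add resequencing: #{per_add_writes} position writes"
puts "bulk assignment:      #{bulk_writes} position writes"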

Also, the AT and Archon EAD importers are certainly much more mature at this point, so for large EAD files I think it's not a bad idea to do as Nathan suggests, if you're an AT or Archon institution.

b,chris.

Sent from my HTC

----- Reply message -----
From: "Nathan Stevens" <ns96 at nyu.edu>
To: "Archivesspace Users Group" <archivesspace_users_group at lyralists.lyrasis.org>
Cc: "archivesspace at googlegroups.com" <archivesspace at googlegroups.com>
Subject: [Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files
Date: Sat, Dec 13, 2014 21:42

All testing was conducted on a Windows 7 32-bit / Core i5 3.2 GHz / 4 GB machine using Oracle's JRE 1.7.0_67 and ASpace v1.1.0.

First, the provided EAD file needed some slight modifications to validate against the EAD schema. Once validated, the import fails after about 2.5 hours with the error message below.

For comparison, the same EAD imported into AT in about 50 seconds (not a typo) and can be migrated to the same ASpace instance in 12 minutes using the migration plugin.

Not sure where the bottleneck is, but clearly the ASpace EAD importer needs some work to match the AT importer.

viuh00010.xml
==================================================
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
IMPORT ERROR
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The following errors were found:

  extents : At least 1 item(s) is required
  title : Property is required but was missing
  id_0 : Property is required but was missing
  level : Property is required but was missing

For JSONModel(:resource):
#<JSONModel(:resource) {"jsonmodel_type"=>"resource", "external_ids"=>[], "subjects"=>[], "linked_events"=>[], "extents"=>[], "dates"=>[], "external_documents"=>[], "rights_statements"=>[], "linked_agents"=>[], "restrictions"=>false, "instances"=>[], "deaccessions"=>[], "related_accessions"=>[], "notes"=>[], "uri"=>"/repositories/import/resources/import_082e59ec-ae98-4c49-984a-91bdfb90a9e5", "finding_aid_title"=>"A Guide to the Philip S. Hench Walter Reed Yellow Fever Collection circa 1800-circa 1998", "finding_aid_author"=>"William B. Bean, Donna L. Purvis, Mark Mones, Henry K. Sharp, Janet Pearson, and Dan Cavanaugh.", "finding_aid_edition_statement"=>"3rd edition\nThis edition reflects substantial changes made to the 2nd edition finding aid. The repository staff made these changes to prepare the digital collection for inclusion in the University of Virginia Library's digital repository.", "finding_aid_language"=>"Description is in English", "language"=>"eng"}>

In : <ead xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="http://www.w3.org/1999/xlink" xmlns="urn:isbn:1-931666-22-9"> ... </ead>

On Fri, Dec 12, 2014 at 3:56 PM, Steven Majewski <sdm7g at virginia.edu> wrote:

On Dec 12, 2014, at 3:29 AM, Chris Fitzpatrick <Chris.Fitzpatrick at lyrasis.org> wrote:


Hi Steve,


Ok, so you're running the process in the jirb console?
That will not be very efficient, as the irb mode is not really built with performance in mind...


No, not running in the jirb console — just a script running from jruby. Here is the code:
( A slight variation on a previously posted version: trying to figure out the correct environment in which to run jobs like this.
  Using frontend config and client_mode because I was intending to add code to do the batch_imports POST directly. )

require 'frontend/config/boot'
require 'frontend/config/application'
require 'jsonmodel'
require 'backend/app/converters/ead_converter'
require 'fileutils'
require 'backend/app/lib/logging'

BACKEND_URL = 'http://localhost:4567/'

# foo.xml -> foo.json
def json_path(xmlpath)
  xmlpath.slice(0..xmlpath.rindex('.xml')) + 'json' if xmlpath.end_with? '.xml'
end

JSONModel::init( :strict => false, :client_mode => true, :url => BACKEND_URL )

ARGV.each do |eadxml|
  converter = nil
  begin
    puts eadxml
    converter = EADConverter.new( eadxml )
    converter.run
    puts eadxml + " : ok."
    $stderr.puts '#@@++ ' + eadxml + " : ok."
  rescue Exception => e
    puts eadxml + " : failed: " + e.message
    $stderr.puts e.backtrace
    $stderr.puts '#@@-- ' + eadxml + " : failed: " + e.message
  ensure
    # converter will be nil if EADConverter.new itself raised
    if converter && converter.get_output_path
      FileUtils::move( converter.get_output_path, json_path(eadxml), :verbose => true )
    end
  end
end




Validating it against a schema is not a bad idea, but we don't have an XML schema to validate it against. The EAD schema is very loose in terms of what it allows, so having something that's EAD-valid won't really be helpful. Someone would need to make an ArchivesSpace XML schema that we can validate against...
b,chris.




Extending or restricting the EAD schema is not very difficult if you know what rules you need to add.
The problem is that those rules are undocumented except in the code.
So far, we've been discovering them by trial and error.
I'm only just now getting familiar enough with the code base to attempt to reverse-engineer them from the code.

XML Schemas are just one possible way of documenting those rules.
You could argue that the JSONModel schemas are another form of documentation.
However, it seems to take quite a lot of cross-referencing across several schemas ( due to inheritance and inclusions )
and the mappings elsewhere in the code to get a full picture.
Sometimes XML Schemas or DTDs present similar cross-referencing problems, but they are older and more
standardized formats, and there are tools for dealing with them. Are there any tools, for example, to expand
those JSONModel schemas inline into something more readable?
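
As a rough sketch of such a tool ( assuming the common/schemas/*.rb layout, where each file evaluates to
a hash of the form { :schema => { ... } } and inheritance is declared with a "parent" key; both assumptions
should be checked against an actual checkout ), something like this could inline the inherited properties:

require 'json'

SCHEMA_DIR = 'common/schemas'   # assumed path within an ArchivesSpace checkout

# Each schema file evaluates to { :schema => { ... } } (assumption).
def load_schema(name)
  eval(File.read(File.join(SCHEMA_DIR, "#{name}.rb")))[:schema]
end

# Recursively merge a schema's properties over those of its ancestors.
def flatten(name)
  schema = load_schema(name)
  return schema unless schema['parent']
  parent = flatten(schema['parent'])
  schema.merge('properties' => parent['properties'].merge(schema['properties']))
end

puts JSON.pretty_generate(flatten('resource'))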


— Steve Majewski



Chris Fitzpatrick | Developer, ArchivesSpace
Skype: chrisfitzpat  | Phone: 918.236.6048
http://archivesspace.org/
________________________________
From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on behalf of Nathan Stevens <ns96 at nyu.edu>
Sent: Thursday, December 11, 2014 8:08 PM
To: Archivesspace Users Group
Cc: archivesspace at googlegroups.com
Subject: Re: [Archivesspace_Users_Group] [archivesspace] Problems & Strategies for importing very large EAD files

You should try importing those EADs into the AT (www.archiviststoolkit.org) to establish some kind of baseline, since those import times seem way too long for relatively small datasets. For example, migration of a 600MB AT database into ASpace takes under 5 hours. You can then use the AT migration plugin to transfer those records over to ASpace and see how long that takes.

You can send a couple of those EADs to me directly (ns96 at nyu.edu) and I can try importing them into AT and transferring them to ASpace on our end.

On Thu, Dec 11, 2014 at 12:43 PM, Steven Majewski <sdm7g at virginia.edu> wrote:

I increased memory sizes in JAVA_OPTS:

JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=512m -Xmx1080m -Xss8m"

This is for a command-line program running EADConverter.new( eadxml ).
The backend server is also running on the same machine, because I called JSONModel::init in client_mode
and it wants a server to talk to.

I fixed the date problem in my xml that caused it to be incomplete on the previous run.
I added some debug trace statements in the JSONModel validate calls to track progress.
( Comparing the amount of debug output vs. the size of JSON output, I don’t think this
  had a significant effect on performance, but I could be wrong. )

Running on a MacBook Pro w/ 2 GHz Intel Core i7 and 16 GB 1600 MHz DDR3 memory and SSD.

I ran Activity Monitor during the run to check on CPU and memory usage, which seemed to peak at around 700MB.
The process seemed to run at 100% +/- 3% CPU. I ran another parse operation in parallel on another, smaller file
for a time. This didn't seem to have a big impact: both cores were running close to 100% rather than just one.
( I assume there isn't much parallelism to be wrung out of the parse operation. ) That 2nd file was just over 6MB and
took over 6 hours of CPU time. ( Not sure of the exact time, as I wasn't nearby when it finished. Real time was much
greater, but I'm not sure that the laptop didn't sleep for part of that time. ) Time for the 14MB file was over 24 hours of
CPU time (greater than that in actual wall-clock time). The previous attempt ( with less memory allocated ) took
24 hours or more in real time — I did not measure CPU time on that attempt — but it got a validation error and the
resulting JSON file was incomplete: there were internal URI references that were never defined in the JSON file.


After completion, I attempted to POST it to the backend server at /repositories/$REPO/batch_imports. First to the backend
development instance running on my laptop, then to our actual test server. That attempt to post to the test server took
too long to finish while at the office, and my VPN at home usually times out after an hour or two, so I copied the JSON
file up to the server, did the 'nohup curl_as' to the backend port on the same server, and left it running overnight.
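
For reference, the shape of that POST done from ruby rather than curl ( backend URL, repository id, and
credentials below are placeholders; the X-ArchivesSpace-Session header is the backend's standard
authentication mechanism ):

require 'net/http'
require 'json'
require 'uri'

BACKEND = 'http://localhost:4567'   # placeholder backend URL
REPO_ID = 2                          # placeholder repository id

# log in to get a session token
res = Net::HTTP.post_form(URI("#{BACKEND}/users/admin/login"),
                          'password' => 'admin')
session = JSON.parse(res.body)['session']

uri = URI("#{BACKEND}/repositories/#{REPO_ID}/batch_imports")
req = Net::HTTP::Post.new(uri)
req['X-ArchivesSpace-Session'] = session
req.body = File.read('viuh00010.json')   # output of the converter step

# long read timeout: large batches can take hours to commit
Net::HTTP.start(uri.host, uri.port, read_timeout: 24 * 3600) do |http|
  puts http.request(req).body
end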

So success, but there was an incredible amount of overhead in the parsing and validation.

Any ideas where the bottleneck is in this operation?

If the overhead is due to the validation (and here, I'm just guessing), that would be another argument for validating
the EAD against a schema before JSON parsing, and perhaps for providing a 'trusted path' parser that skips that validation.
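
A minimal version of that pre-flight check ( assuming a local copy of ead.xsd and the nokogiri gem; as Chris
notes, EAD-validity alone won't catch ArchivesSpace-specific rules, but it is cheap to run first ):

require 'nokogiri'

schema = Nokogiri::XML::Schema(File.read('ead.xsd'))   # local copy assumed
doc    = Nokogiri::XML(File.read(ARGV[0]))

errors = schema.validate(doc)
if errors.empty?
  puts "#{ARGV[0]} : valid against ead.xsd"
else
  errors.each { |e| puts "#{ARGV[0]} : line #{e.line}: #{e.message}" }
  exit 1
end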


I’m learning my way around the code base, but not quite sure I’m ready to tackle that myself yet!
I did run the code with 'raise_errors=false', but that still runs the validations.

We may attempt this again in the future with more memory and better instrumentation, but for now, I just needed to
determine whether we could actually successfully ingest our largest guides.


— Steve Majewski / UVA Alderman Library


On Dec 9, 2014, at 3:28 AM, Chris Fitzpatrick <Chris.Fitzpatrick at lyrasis.org> wrote:



Hi Steven,

Yeah, you're probably going to want to have the -Xmx at around the 1G default. I've dropped down to 512M ( which is what sandbox.archivesspace.org runs on ), but it'll be rather sluggish.

If you can't go that high, you can start the application with the public and indexer apps disabled, which will reduce some of the overhead ( especially the indexer ).
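
A sketch of what that could look like in config/config.rb ( these AppConfig flag names exist in recent
ArchivesSpace releases; confirm against config-defaults.rb for the version in use ):

# config/config.rb -- verify flag names against config-defaults.rb
AppConfig[:enable_public]  = false   # don't start the public app
AppConfig[:enable_indexer] = false   # don't start the indexer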

Try that and see if it helps the issue.
b,chris.

Chris Fitzpatrick | Developer, ArchivesSpace
Skype: chrisfitzpat  | Phone: 918.236.6048
http://archivesspace.org/

________________________________________
From: archivesspace at googlegroups.com <archivesspace at googlegroups.com> on behalf of Steven Majewski <sdm7g at virginia.edu>
Sent: Monday, December 8, 2014 11:20 PM
To: archivesspace_users_group at lyralists.lyrasis.org
Cc: archivesspace at googlegroups.com
Subject: [archivesspace] Problems & Strategies for importing very large EAD files

The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB.

I have tried importing these guides from the frontend webapp import: my time or patience has always run out
after a couple of hours, before they have managed to import successfully.

I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a ruby command-line program,
and writing out the resulting JSON file to later import using the backend API. The first attempt ran out of memory after a couple
of hours. I increased '-Xmx300m' to '-Xmx500m' in the JAVA_OPTS. The 2nd attempt took just over 24 hours ( with an hour or
more of sleep time while my laptop was in transit ). This failed with the error:

failed: #<JSONModel::ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>

which I eventually traced to this:
       <unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>

The normal attribute has the months reversed: 1916-06 (June) is given as the begin and 1916-05 (May) as the end,
so the end date precedes the begin date. It should read normal="1916-05/1916-06".
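
A pre-flight scan for this particular problem is easy to sketch ( nokogiri assumed; the lexical comparison
is only valid when both halves of the range carry the same ISO 8601 precision ):

require 'nokogiri'

EAD_NS = { 'ead' => 'urn:isbn:1-931666-22-9' }
doc = Nokogiri::XML(File.read(ARGV[0]))

doc.xpath('//ead:unitdate[@normal]', EAD_NS).each do |ud|
  from, to = ud['normal'].split('/')
  next unless to                 # single dates can't be reversed
  if to < from                   # same-precision ISO 8601 compares lexically
    puts "line #{ud.line}: normal=\"#{ud['normal']}\" (#{ud.text})"
  end
end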


But before fixing this and trying again I wanted to ask if others had experience with importing large files.

Should I try increasing -Xmx memory radically more, or perhaps MaxPermSize as well?

24 hours for 14MB does not seem to be a linear projection from the 3 or 4MB files I have managed to import.
Does this type of performance seem expected? Is there anything else that should be reconfigured?

[ Waiting 24 hours for an error message, then repeating, seems like another good reason to want a schema that
 describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and
 I've never seen an XSLT stylesheet take that long to validate. ]


Also: a lot of those smaller EAD files are collections that are split up into separate guides — sometimes due to separate accessions:
xxxx-a, xxxx-b, xxxx-c…
The more recent tendency has been to unify a collection into a single guide, and we were expecting to perhaps use ArchivesSpace
to unify some of these separate guides after they had been separately imported. Should we be reconsidering that plan?

If increasing memory further doesn’t help, and if we don’t find any other way to improve import performance, I’m considering
disassembling the file:  strip out much of the <dsc> and import a skeleton guide, and then process the <dsc>, adding resources to
the guide a few at a time via the backend API.
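
One possible shape for that disassembly step, sketched with nokogiri ( file names and the per-file
granularity are placeholders; the skeleton would be imported first, then the saved components fed in
through the backend API ):

require 'nokogiri'

EAD_NS = { 'ead' => 'urn:isbn:1-931666-22-9' }
doc = Nokogiri::XML(File.read('guide.xml'))

dsc = doc.at_xpath('//ead:dsc', EAD_NS) or abort 'no <dsc> found'
components = dsc.xpath('./ead:c01', EAD_NS).map(&:remove)

File.write('guide-skeleton.xml', doc.to_xml)      # import this first
components.each_with_index do |c, i|              # then add these in batches
  File.write(format('guide-c01-%03d.xml', i), c.to_xml)
end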

Anyone have any other ideas or experience to report?


— Steve Majewski / UVA Alderman Library





--
Nathan Stevens
Programmer/Analyst
Digital Library Technology Services
New York University

1212-998-2653
ns96 at nyu.edu
