I increased the memory sizes in JAVA_OPTS:

JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=512m -Xmx1080m -Xss8m"

This was for a command-line program running EADConverter.new( eadxml ). The backend server was also running on the same machine, because I called JSONModel::init in client_mode and it wants a server to talk to.

I fixed the date problem in my XML that caused the output to be incomplete on the previous run, and I added some debug trace statements in the JSONModel validate calls to track progress. ( Comparing the amount of debug output to the size of the JSON output, I don't think this had a significant effect on performance, but I could be wrong. )

This was run on a MacBook Pro with a 2 GHz Intel Core i7, 16 GB of 1600 MHz DDR3 memory, and an SSD.

I ran Activity Monitor during the run to check on CPU and memory usage; memory seemed to peak at around 700MB. The process ran at 100% +/- 3% CPU. For a time I ran another parse operation in parallel on a smaller file. This didn't seem to have a big impact: both cores were running close to 100%, rather than just one. ( I assume there isn't much parallelism to be wrung out of the parse operation. ) That second file was just over 6MB and took over 6 hours of CPU time. ( I'm not sure of the exact time, as I wasn't nearby when it finished. Real time was much greater, but I can't rule out the laptop having slept for part of that time. ) Time for the 14MB file was over 24 hours of CPU time, and more than that in actual clock time. The previous attempt, with less memory allocated, took 24 hours or more in real time ( I did not measure CPU time on that attempt ), but it hit a validation error and the resulting JSON file was incomplete: there were internal URI references that were never defined in the JSON file.

After completion, I attempted to POST the result to the backend server at /repositories/$REPO/batch_imports: first to the development instance running on my laptop, then to our actual test server. The POST to the test server took too long to finish while I was at the office, and my VPN at home usually times out after an hour or two, so I copied the JSON file up to the server, did the 'nohup curl_as' to the backend port on the same server, and left it running overnight.

So: success, but with an incredible amount of overhead in the parsing and validation.

Any ideas where the bottleneck is in this operation?

If the overhead is due to the validation ( and here I'm just guessing ), that would be another argument for validating the EAD against a schema before the JSON conversion, and perhaps for providing a 'trusted path' parser that skips that validation.

I'm learning my way around the code base, but I'm not quite sure I'm ready to tackle that myself yet! ( I did run the code with 'raise_errors=false', but that still runs the validations. )
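
In case it helps with diagnosing the bottleneck, the conversion driver was essentially the following. ( A from-memory sketch: the require lines, file paths, and backend URL are stand-ins for however you get the backend code onto your load path, and run / get_output_path are my reading of the converter base class, so adjust to your checkout. )

  # Rough sketch of the command-line converter driver.
  # Assumes the ArchivesSpace backend code is on the load path and a
  # backend instance is running for JSONModel to talk to in client mode.
  require 'jsonmodel'       # stand-in: the real load setup is more involved
  require 'ead_converter'   # stand-in for backend/app/converters/ead_converter.rb

  JSONModel::init( :client_mode => true, :url => 'http://localhost:8089' )

  eadxml = ARGV[0]
  converter = EADConverter.new( eadxml )
  converter.run

  # The converter leaves its JSON in a temp file; copy it somewhere stable.
  File.open( "#{eadxml}.json", 'w' ) do |out|
    out.write( File.read( converter.get_output_path ) )
  end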
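
And the overnight import step was along these lines. ( curl_as is just our local wrapper that adds the session header; shown here with plain curl, a placeholder host and file name, and the standard login and batch_imports endpoints. )

  # Log in to the backend to get a session token, then POST the converted
  # JSON to the batch import endpoint and let it run unattended.
  SESSION=$( curl -s -F password="$PASSWORD" \
       "http://localhost:8089/users/admin/login" \
       | grep -o '"session":"[^"]*"' | cut -d'"' -f4 )

  nohup curl -s -H "X-ArchivesSpace-Session: $SESSION" \
       --data-binary @guide.json \
       "http://localhost:8089/repositories/$REPO/batch_imports" \
       > import.log 2>&1 &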

We may attempt this again in the future with more memory and better instrumentation, but for now I just needed to determine whether we could actually ingest our largest guides successfully.

— Steve Majewski / UVA Alderman Library


On Dec 9, 2014, at 3:28 AM, Chris Fitzpatrick <Chris.Fitzpatrick@lyrasis.org> wrote:

> Hi Steven,
>
> Yeah, you're probably going to want to have the -Xmx at around the 1G default. I've dropped it down to 512M ( which is what sandbox.archivesspace.org runs on ), but it'll be rather sluggish.
>
> If you can't go that high, you can start the application with the public and index apps disabled, which will reduce some of the overhead ( especially the indexer ).
>
> Try that and see if it helps the issue.
> b,chris.
>
> Chris Fitzpatrick | Developer, ArchivesSpace
> Skype: chrisfitzpat | Phone: 918.236.6048
> http://archivesspace.org/
>
> ________________________________________
> From: archivesspace@googlegroups.com <archivesspace@googlegroups.com> on behalf of Steven Majewski <sdm7g@virginia.edu>
> Sent: Monday, December 8, 2014 11:20 PM
> To: archivesspace_users_group@lyralists.lyrasis.org
> Cc: archivesspace@googlegroups.com
> Subject: [archivesspace] Problems & Strategies for importing very large EAD files
>
> The majority of our EAD guides are less than 1MB in size. We have a few larger ones, the two largest being around 13-14 MB.
>
> I have tried importing these guides from the frontend webapp import: I have always hit the limit of my time or patience after a couple of hours, before they managed to import successfully.
>
> I recently attempted importing the largest file again by calling EADConverter.new( eadfile ) from a ruby command-line program and writing out the resulting JSON file to import later using the backend API. The first attempt ran out of memory after a couple of hours. I increased '-Xmx300m' to '-Xmx500m' in the JAVA_OPTS. The second attempt took just over 24 hours ( with an hour or more of sleep time while my laptop was in transit ). This failed with the error:
>
> failed: #<JSONModel::ValidationException: {:errors=>{"dates/0/end"=>["must not be before begin"]}, :import_context=>"<c02 level=\"file\"> ... </c02>"}>
>
> which I eventually traced to this:
>
>  <unitdate normal="1916-06/1916-05">May 1916-June 1916</unitdate>
>
> But before fixing this and trying again, I wanted to ask whether others had experience with importing large files.
>
> Should I try increasing -Xmx radically more, or perhaps MaxPermSize as well?
>
> 24 hours for 14MB does not seem to be a linear projection from the 3 or 4MB files I have managed to import. Does this kind of performance seem expected? Is there anything else that should be reconfigured?
>
> [ Waiting 24 hours for an error message, then repeating, seems like another good reason to want a schema that describes the flavor of EAD that ArchivesSpace accepts: it would be nice to be able to do a pre-flight check, and I've never seen an XSLT stylesheet take that long to validate.
> ]
>
> Also: a lot of those smaller EAD files are collections that are split up into separate guides, sometimes due to separate accessions: xxxx-a, xxxx-b, xxxx-c…
> The more recent tendency has been to unify a collection into a single guide, and we were expecting to perhaps use ArchivesSpace to unify some of these separate guides after they had been imported individually. Should we be reconsidering that plan?
>
> If increasing memory further doesn't help, and if we don't find any other way to improve import performance, I'm considering disassembling the file: stripping out much of the <dsc> and importing a skeleton guide, then processing the <dsc> and adding resources to the guide a few at a time via the backend API.
>
> Anyone have any other ideas or experience to report?
>
>
> — Steve Majewski / UVA Alderman Library