[Archivesspace_Users_Group] PUI indexing issues
Tom Hanstra
hanstra at nd.edu
Mon Mar 22 12:33:22 EDT 2021
Mark,
Thanks. Some new information and ideas in there. I have our
record_inheritance settings directly out of the box, which I assume are:
#   {
#     :property => 'notes',
#     :inherit_if => proc {|json| json.select {|j| j['type'] == 'accessrestrict'} },
#     :inherit_directly => true
#   },
#   {
#     :property => 'notes',
#     :inherit_if => proc {|json| json.select {|j| j['type'] == 'scopecontent'} },
#     :inherit_directly => false
#   },
#   {
#     :property => 'notes',
#     :inherit_if => proc {|json| json.select {|j| j['type'] == 'langmaterial'} },
#     :inherit_directly => false
#   },
Setting :inherit_directly to *false* for the accessrestrict notes might help.
Is that a normal setting for others? I don't know what is appropriate or
desired in terms of regular usage.
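For anyone following this thread later, a config.rb override would look
roughly like the sketch below. It is only a sketch: the surrounding
:archival_object / :inherited_fields structure mirrors the stock
config-defaults.rb, and any inherited fields not shown (title, dates, and so
on) would need to be carried over from the defaults as well, since setting
AppConfig[:record_inheritance] in config.rb replaces the whole default hash.

# Rough sketch of a record_inheritance override in config.rb (not tested here).
# Only the notes entries from the snippet above are shown; copy the other
# inherited fields (title, dates, extents, etc.) from config-defaults.rb.
AppConfig[:record_inheritance] = {
  :archival_object => {
    :inherited_fields => [
      {
        :property => 'notes',
        :inherit_if => proc {|json| json.select {|j| j['type'] == 'accessrestrict'} },
        :inherit_directly => false  # flipped from the default of true, per the question above
      }
      # scopecontent and langmaterial entries dropped here to keep the index smaller
    ]
  }
}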
As far as watching what is happening, I'd be rather happy to see any of the
indexing complete to the point where I have to look at the database for
further info. So far, the only thing that has *ever* completed was the
staff index, once, and then everything died when the database hit the
'savings time' error. That instance *thought* staff indexing was complete,
ran only the PUI indexer after that, and never finished even the PUI index,
so whether running one before the other would work for our data is a
question mark. But that is worth a try as well.
I've been keeping an eye on disk space and have not hit any limits there.
I do see how backups could take up a lot of space over time, so I will keep
watching that going forward (assuming I ever get a completed Solr index).
Tom
On Mon, Mar 22, 2021 at 12:05 PM Custer, Mark <mark.custer at yale.edu> wrote:
> Tom,
>
> Not sure if it will help (and not sure if you shared your config.rb file
> in a previous message), but have you tried:
>
> - turning off the Solr backups during the re-indexing (
> https://github.com/archivesspace/archivesspace/blob/d207e8a7bb01c2b7b6f42ee5c0025d95f35ee7ae/common/config/config-defaults.rb#L76).
> Just going back to Dave's suggestion about keeping an eye on disk
> space.
> - updating the record inheritance settings and removing the bit about
> inheriting scope and contents notes, which really bloats the index since
> most finding aids won't have lower-level descriptive notes (
> https://github.com/archivesspace/archivesspace/blob/d207e8a7bb01c2b7b6f42ee5c0025d95f35ee7ae/common/config/config-defaults.rb#L411-L413).
> For our record inheritance settings (
> https://github.com/YaleArchivesSpace/aspace-deployment/blob/master/prod/config.rb#L204-L244),
> we currently inherit only two notes: access notes and preferred citation
> notes.
> - turning off the PUI indexer until the staff indexing is done, and
> then turning the PUI indexer back on? (A rough config sketch for these
> toggles follows below.)
>
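> In case it helps, the first and third of those would look roughly like the
> lines below in config.rb. This is just a sketch from memory, so please
> double-check the option names against config-defaults.rb before relying on
> them.
>
> # Sketch only: option names from memory, verify against config-defaults.rb.
> # Take Solr snapshots far less often (and keep fewer) during the big re-index:
> AppConfig[:solr_backup_schedule] = "0 4 * * *"
> AppConfig[:solr_backup_number_to_keep] = 1
> # Pause the PUI indexer until staff indexing finishes, then set back to true:
> AppConfig[:pui_indexer_enabled] = false
>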
> When testing re-indexing locally, I usually bump up the two values that
> Blake listed below, at least until I start getting Java heap space errors
> and don't have any more RAM to allot to ASpace 🙂. But even just
> waiting for the archival objects can take a while, depending on the
> settings in your config.rb file. Once, while waiting for a full re-index
> on a server that I didn't have access to, I'd periodically look at the
> last archival object ID that was indexed, and then run a database query
> to see how many more archival objects were left in that repo, since a
> full re-index seems to go in order of the primary keys.
> e.g.
>
> select count(*) from archival_object
> where id > {archival_object id}
> and repo_id = {repo id};
>
>
> That would at least give me a sense of how much longer it might take to get
> through all the archival objects left in one of our repos.
>
> As for the slowdown, after all the archival objects are indexed in a
> repository, the next thing that happens (although it can take quite a
> while) will be for all the tree indexes to be created and finally committed
> to Solr. See
> https://github.com/archivesspace/archivesspace/blob/82c4603fe22bf0fd06043974478d4caf26e1c646/indexer/app/lib/pui_indexer.rb#L136.
> If I recall correctly, there won't be any specific mention in the logs
> about that, but after you get a message about all of the archival objects
> being indexed in a specific repository, you'll get another message about the
> archival objects being indexed again sometime later, at which point the
> full trees have been reindexed, and then the indexer will move on to
> the next repo (or record type, like classification records, etc.).
>
>
> Mark
>
>
> ------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org <
> archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Tom
> Hanstra <hanstra at nd.edu>
> *Sent:* Monday, March 22, 2021 11:21 AM
> *To:* Archivesspace Users Group <
> archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] PUI indexing issues
>
> Thanks, Blake.
>
> In your testing, how big was the repository that you were testing against?
> Mine has "763368 archival_object records" and I consistently get into the
> 670K range for staff and the 575K range for PUI before things really slow
> down. I'm now trying to increase the Java settings significantly to see if
> that will help. So far, the problem is similar: real slowdowns after zipping
> through the first records. I'll also try some of the settings you have there
> to see if fewer but larger threads work better than multiple smaller threads.
>
> Tom
>
> On Mon, Mar 22, 2021 at 10:52 AM Blake Carver <blake.carver at lyrasis.org>
> wrote:
>
> I did some experimenting this weekend, messing around with indexer speeds,
> and found I could get it to succeed with the right indexer settings. I
> think the answer is going to be "it depends" and you'll need to experiment
> with what works on your setup with your data. I started with the defaults,
> then dropped it to reallllly slow (1 thread, 1 record per thread), then just
> tried to dial it up and down. The last one I tried worked fine: it was fast
> enough to finish in a reasonable amount of time and didn't slow down or crash. Your
> settings may not look like this, but here's something to try.
>
> AppConfig[:pui_indexer_records_per_thread] = 50
> AppConfig[:pui_indexer_thread_count] = 1
>
>
> So some extra detail for the mailing list archives... if your site keeps
> crashing before the indexers finish and you're not seeing any particular
> errors in the logs that make you think you have a problem with your data,
> try turning the knobs on your indexer speed and see if that helps.
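>
> For reference, these are the knobs in question in config.rb. The two PUI
> values are the ones above; the staff-indexer names are what I believe the
> analogous settings are called, so double-check them against
> config-defaults.rb.
>
> # PUI indexer settings (the two values above):
> AppConfig[:pui_indexer_records_per_thread] = 50
> AppConfig[:pui_indexer_thread_count] = 1
> # The staff indexer has matching settings that can be dialed down the same
> # way (names assumed from config-defaults.rb):
> AppConfig[:indexer_records_per_thread] = 25
> AppConfig[:indexer_thread_count] = 2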
>
> It looks like maybe the indexer just eats up too much memory on BIG
> records and having too many (too many being 15ish) threads running causes
> it to crash. I know BIG is pretty subjective: if you have a bunch of
> resources (maybe a few thousand) AND those resources all have ALLOTA (maybe
> a few thousand) children with ALLOTA subjects/agents/notes/stuff, then you
> might hit this problem. It seems like it's not the total number of resources
> so much as those resources being big/complex/deep.
>
> ------------------------------
> *From:* archivesspace_users_group-bounces at lyralists.lyrasis.org <
> archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Tom
> Hanstra <hanstra at nd.edu>
> *Sent:* Thursday, March 18, 2021 11:24 AM
> *To:* Archivesspace Users Group <
> archivesspace_users_group at lyralists.lyrasis.org>
> *Subject:* Re: [Archivesspace_Users_Group] PUI indexing issues
>
> Dave,
>
> Thanks for the suggestion, but unless there is some direct limitation
> within Solr, that should not be an issue. My disk is at only about 50% of
> capacity and Solr should be able to expand as needed. In my case, I don't
> think there has been much addition to Solr because I'm reindexing records
> which have been indexed already. So the deleted records are growing, but
> not the overall number of records. My index is currently at about 6GB.
>
> Any other thoughts out there?
>
> Thanks,
> Tom
>
> On Thu, Mar 18, 2021 at 10:51 AM Mayo, Dave <dave_mayo at harvard.edu> wrote:
>
> This is a little bit of a shot in the dark, but have you looked at disk
> space on whatever host Solr is resident on (the ASpace server, if you’re
> not running an external one)?
>
> A thing we’ve hit a couple times is that Solr, at least in some
> configurations, needs substantial headroom on disk to perform well – I
> think it’s related to how it builds and maintains the index? So it might
> be worth looking to see if Solr is filling up the disk enough that it can’t
> efficiently handle itself.
>
>
>
> --
>
> Dave Mayo (he/him)
>
> Senior Digital Library Software Engineer
> Harvard University > HUIT > LTS
>
>
>
> *From: *<archivesspace_users_group-bounces at lyralists.lyrasis.org> on
> behalf of Tom Hanstra <hanstra at nd.edu>
> *Reply-To: *Archivesspace Users Group <
> archivesspace_users_group at lyralists.lyrasis.org>
> *Date: *Wednesday, March 17, 2021 at 11:43 AM
> *To: *Archivesspace Users Group <
> archivesspace_users_group at lyralists.lyrasis.org>
> *Subject: *Re: [Archivesspace_Users_Group] PUI indexing issues
>
> - What really bothers me is the slowdown. That indicates to me that some
> resource is being lost along the way. Anyone have thoughts on what that
> might be?
>
> Just to follow up on my earlier post, I did get even lower numbers from
> Blake to try based upon what he used for our hosted account. But I'm seeing
> the same pattern in terms of slowdowns regarding the number of records that
> get processed/hour. Is this typical? Is it just hitting records that have
> more work to be done? Or do I still have a resource issue?
>
>
>
> I note that the number of docs in Solr has not changed at all throughout
> the last couple of attempts, which again leads me to believe it has already
> handled these records (at least once) before, and thus there is no real
> indexing left to do for the records it is running through the PUI indexer
> again. Which leads back to the "why does PUI indexing restart each time
> from 0" question. How does one add an enhancement request to have this
> reviewed and (perhaps) changed?
>
>
>
> Thanks,
>
> Tom
>
>
>
> --
>
> *Tom Hanstra*
>
> *Sr. Systems Administrator*
>
> hanstra at nd.edu
>
>
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
>
>
> --
> *Tom Hanstra*
> *Sr. Systems Administrator*
> hanstra at nd.edu
>
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
>
>
> --
> *Tom Hanstra*
> *Sr. Systems Administrator*
> hanstra at nd.edu
>
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
--
*Tom Hanstra*
*Sr. Systems Administrator*
hanstra at nd.edu