[Archivesspace_Users_Group] Indexing and search issues
Tom Hanstra
hanstra at nd.edu
Thu Mar 11 15:53:35 EST 2021
On Thu, Mar 11, 2021 at 1:25 PM Andrew Morrison <
andrew.morrison at bodleian.ox.ac.uk> wrote:
> *> I also notice that indexing overall slows down as it gets farther into
> our records.*
>
> I haven't observed a slow-down in indexing. At least not noticeably enough
> to cause me to want to measure it. As I mentioned, there are some later
> stages of the indexing process (building trees, committing changes) that
> can run for a long time without logging anything. But for the main indexing
> of archival objects, the last 1000 doesn't seem to take longer than the
> first 1000.
>
*I really only have one set of data to work from, but I've been tracking
times for PUI indexing especially. During the first hour, it was able to
complete 193300 records. I'm assuming it was able to do that because it was
covering ground it had passed before the Java Heap error caused the
previous attempt to stop. But looking later in the processing, in hour 3 it
indexed 41500 records, by hour 12 it was down to 22250 records, and now in
the last hour (hour 23) it has only completed 950 records. That is why I was
asking about the slow-down.*
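
To put the slow-down in relative terms, here is a minimal sketch in Python that uses only the hourly counts quoted above. The names `records_per_hour`, `baseline`, and `relative_rate` are illustrative, and hour 3 is chosen as the baseline on the assumption that hour 1 was inflated by records already covered in the earlier attempt.

```python
# Records indexed per hour by the PUI indexer, taken from the figures above.
records_per_hour = {1: 193300, 3: 41500, 12: 22250, 23: 950}

# Assumption: hour 1 mostly re-covered earlier work, so hour 3 is the baseline.
baseline = records_per_hour[3]


def relative_rate(hour):
    """Return the hour's throughput as a fraction of the hour-3 rate."""
    return records_per_hour[hour] / baseline


for hour in sorted(records_per_hour):
    count = records_per_hour[hour]
    print(f"hour {hour:>2}: {count:>6} records, {relative_rate(hour):.1%} of hour 3")
```

On these numbers, hour 12 runs at a little over half the hour-3 rate and hour 23 at only a few percent of it, so the drop is not gradual.
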
> *> Is an external Solr a good idea for a site like ours?*
>
> Hard to say. There was a discussion on the performance benefits of running
> external Solr on here last month:
>
>
> http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2021-February/thread.html#8168
>
> And documentation here:
>
> https://archivesspace.github.io/tech-docs/provisioning/solr.html
> Whether it affects indexing speed for large numbers of records I cannot
> say. We made the decision to use it before all our data was migrated into
> ArchivesSpace.
>
*Looks as if separating Solr out is a good idea, based upon feedback. I'll
work on that (assuming indexing finally finishes at some point).*
> Andrew.
>
>
> On 11/03/2021 15:32, Tom Hanstra wrote:
>
> Thanks, Andrew. Some responses intertwined below, italicized:
>
> On Thu, Mar 11, 2021 at 7:31 AM Andrew Morrison <
> andrew.morrison at bodleian.ox.ac.uk> wrote:
>
>> You can allocate more memory to ArchivesSpace by setting the
>> ASPACE_JAVA_XMX environment variable it runs under. Setting that to
>> "-Xmx4g" should be sufficient.
>>
> *I did bump that and the ASPACE_JAVA_XSS up a bit for this round, which
> looks like it will finally complete. Just a few more PUI records need to be
> added. *
>
>> Those FATAL lines in the log snippet are caused by a bot probing for
>> known vulnerabilities in common web platforms and applications, hoping to
>> find a web site running an out-of-date copy (of Drupal in this case) which
>> it can exploit. It has nothing to do with ArchivesSpace, which has no PHP
>> code. It is merely logging that it doesn't know what to do with that
>> request.
>>
> *Thanks. I was hoping this was just extraneous. *
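
For anyone wanting to confirm that such FATAL lines are only probe noise, here is a minimal sketch in Python that pulls the failed request paths out of a log and flags the ones ending in `.php`. The function names and the `.php` heuristic are assumptions made for illustration; the pattern is based only on the `ActionController::RoutingError` line shown in the log excerpt, and it expects each log entry on a single line rather than wrapped as in this email.

```python
import re

# Matches the Rails routing failure from the log excerpt, e.g.
#   ActionController::RoutingError (No route matches [GET] "/system_api.php"):
ROUTING_ERROR = re.compile(
    r'ActionController::RoutingError \(No route matches '
    r'\[(?P<method>[A-Z]+)\] "(?P<path>[^"]+)"\)'
)


def probed_paths(log_lines):
    """Collect the request paths that failed routing, with a count for each."""
    counts = {}
    for line in log_lines:
        match = ROUTING_ERROR.search(line)
        if match:
            path = match.group("path")
            counts[path] = counts.get(path, 0) + 1
    return counts


def looks_like_php_probe(path):
    """Heuristic (an assumption): ArchivesSpace has no PHP code, so a .php path is a probe."""
    return path.split("?", 1)[0].lower().endswith(".php")


# One entry from the excerpt, rejoined onto a single line.
sample = [
    'F, [2021-03-10T15:32:03.881658 #2919] FATAL -- : '
    '[1b34df32-d3b7-49c3-b205-01a59daf03e5] ActionController::RoutingError '
    '(No route matches [GET] "/system_api.php"):',
]

for path, count in probed_paths(sample).items():
    print(path, count, looks_like_php_probe(path))
```

Routing errors for paths that do not look like probes would be the ones worth a closer look.
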
>
>> How did you know your one successful re-indexing completed? There are two
>> indexers, Staff and PUI, with the latter usually taking much longer to
>> finish. So if the PUI indexer fails after the staff indexer finishes, you
>> will see more records in the staff interface than the public interface,
>> even if they're all set to be public. Also both indexers log messages that
>> could be interpreted as meaning they've finished, but they then run
>> additional indexing to build trees, to enable navigation within collections
>> to work. And finally they instruct Solr to commit changes, which can be slow
>> depending on the performance of your storage. You could try doubling
>> AppConfig[:indexer_solr_timeout_seconds] to allow more time for each
>> operation.
>>
>
>
> *At least one set of logs, when I earlier gave the server more resources,
> showed that the indexing had completed. But, because the second repository
> was showing nothing, I decided to index again. This time around, we do
> have the second repository found so that, too, indicates that things have
> gone better. I guess I have to wait for things to complete but there are
> still some questions outstanding. For instance, one search I did for
> "football" (something dear to the Notre Dame experience) within the
> repository which is supposed to be pretty much indexed, showed over 32K
> results on our hosted site but only 17K locally. That seems wildly off with
> only a few PUI records to be completed (log shows 736500 of 763368). Could
> the incomplete index really be that far off?*
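
The figures in the paragraph above can be put side by side with a few lines of Python. This is only arithmetic on the quoted numbers: the result counts are rounded ("over 32K" and "17K"), and the variable names are illustrative.

```python
# PUI indexer progress as reported in the log.
indexed, total = 736500, 763368
# Approximate "football" result counts: hosted site vs. local install.
hosted_hits, local_hits = 32000, 17000

remaining = total - indexed            # records not yet indexed
progress = indexed / total             # fraction of PUI records indexed
hit_ratio = local_hits / hosted_hits   # local results as a share of hosted
missing_hits = hosted_hits - local_hits

print(f"indexed: {progress:.1%}, remaining records: {remaining}")
print(f"local results as a share of hosted: {hit_ratio:.1%}")
print(f"missing results: {missing_hits}")
```

The index is about 96% complete while the local search returns only a little over half of the hosted results. The roughly 15000 missing results are fewer than the 26868 records still to be indexed, so the counts alone cannot rule out the incomplete index as the cause.
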
>
> *I also notice that indexing overall slows down as it gets farther into
> our records. Is that probably because there is just more to be done with
> the records that might not have gotten done in earlier attempts while the
> first records buzz by rapidly because of earlier indexing attempts? Or
> could it be that resources are taken up early in the processing and no
> longer available for processing the later records? Is resource tuning just
> a trial/error prospect? I don't see a lot of information in the
> documentation.*
>
> Or it could've re-indexed one repository but failed on the next. And it is
> possible for entire repositories to be set as non-public, which could be
> another explanation for fewer records.
>
> Are you running an external Solr? If so, is the AppConfig[:solr_url] in
> config.rb pointing to the correct server?
> *I'm running a local Solr as part of the application. Is an external Solr
> a good idea for a site like ours? I will also do some tweaking with the
> Solr settings to see if that might help...after I get through at least one
> complete index. *
>
>> There are many possible reasons for search slowness, including not enough
>> memory. Are there any differences in the speed of doing the same search in
>> the staff and public interfaces? Or between two ways of getting the same
>> results in the PUI? For example, does the link in the header to list all
>> collections (/repositories/resources) return results faster than searching
>> everything then filtering to just collections
>> (/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource)?
>> There's a fix coming in 3.0.0 for the latter.
>>
> *I had not tried comparing staff to public. I will do that (though I first
> have to get some access to the staff side!). And I'll really not try to do
> much comparison until we get indexing complete, in case the indexing itself
> is slowing things down. *
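
Once indexing has finished, the comparison between the two PUI routes mentioned above could be timed with something like the following Python sketch. The base URL `http://localhost:8081` is an assumption and should be replaced with the address of the local PUI; the helper names are hypothetical. Only the two paths come from the thread.

```python
import time
import urllib.parse
import urllib.request

# Assumption: the PUI is reachable at this base URL; adjust for your site.
PUI_BASE = "http://localhost:8081"


def collections_url(base=PUI_BASE):
    """URL for the header link that lists all collections."""
    return base + "/repositories/resources"


def filtered_search_url(base=PUI_BASE):
    """URL for searching everything, then filtering to just collections."""
    params = [
        ("q[]", "*"),
        ("op[]", ""),
        ("field[]", "keyword"),
        ("filter_fields[]", "primary_type"),
        ("filter_values[]", "resource"),
    ]
    # safe="[]*" leaves the brackets and asterisk unescaped, as in the quoted URL.
    return base + "/search?" + urllib.parse.urlencode(params, safe="[]*")


def time_request(url, timeout=300):
    """Fetch the URL once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


# Print the two URLs; call time_request(url) on a host that can reach the PUI.
for url in (collections_url(), filtered_search_url()):
    print(url)
```

A single request is a noisy measurement, so repeating each call a few times and comparing the typical values would give a fairer picture.
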
>
> *More questions to come, I'm sure. But thanks for your input and ideas of
> places to look further. Much appreciated.*
>
>> Andrew.
>>
>>
>> On 11/03/2021 02:07, Tom Hanstra wrote:
>>
>> I'm very new to ArchivesSpace and so my issues may be early configuration
>> problems. But I'm hoping some out there can assist. We are moving from
>> hosted to local, so I have a large database full of data that I'm working
>> with.
>>
>> Indexing
>> Right now, I'm running into two primary problems:
>>
>> - Twice now, I've hit issues where the indexing fails due to the Java
>> heap space being exhausted. Do others run into this? What do others use for
>> Java settings?
>> - I've broken out my PUI indexing log into a separate log and see FATAL
>> errors in the log:
>> ------
>> I, [2021-03-10T15:32:03.747156 #2919] INFO -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] Started GET "/system_api.php" for 206.189.134.38 at 2021-03-10 15:32:03 -0500
>> F, [2021-03-10T15:32:03.881297 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.881658 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] ActionController::RoutingError (No route matches [GET] "/system_api.php"):
>> F, [2021-03-10T15:32:03.881866 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.882085 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/debug_exceptions.rb:65:in `call'
>> [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
>> ------
>> Is this something to be concerned about? Why is it showing up in the PUI
>> log?
>>
>> Search issues
>> - Supposedly, I did get one round of indexing completed without a heap
>> error. But the resulting searches yielded numbers which were incorrect
>> compared to our hosted version. This is why I've been trying reindexing. Is
>> it usual to have indexing *look* like it is complete but really be
>> incomplete?
>> - When I do a search, the response is really slow. I've got nginx set up
>> as a proxy in front of ArchivesSpace and it is showing that the slowness is
>> in ArchivesSpace itself somewhere. I don't see anything in the logs to show
>> what is taking so long. Where should I be checking for issues?
>>
>> Thanks,
>> Tom
>>
>> --
>> *Tom Hanstra*
>> *Sr. Systems Administrator*
>> hanstra at nd.edu
>>
>>
>>
>> _______________________________________________
>> Archivesspace_Users_Group mailing list
>> Archivesspace_Users_Group at lyralists.lyrasis.org
>> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>>
>
>
> --
> *Tom Hanstra*
> *Sr. Systems Administrator*
> hanstra at nd.edu
>
>
>
>
--
*Tom Hanstra*
*Sr. Systems Administrator*
hanstra at nd.edu