[Archivesspace_Users_Group] Indexing and search issues
Andrew Morrison
andrew.morrison at bodleian.ox.ac.uk
Thu Mar 11 13:25:33 EST 2021
> I also notice that indexing overall slows down as it gets farther
> into our records.
I haven't observed a slow-down in indexing, or at least not one
noticeable enough to make me want to measure it. As I mentioned, some
latter stages of the indexing process (building trees, committing
changes) that can run for a long time without logging anything. But for
the main indexing of archival objects, the last 1000 doesn't seem to
take longer than the first 1000.
> Is an external Solr a good idea for a site like ours?
Hard to say. There was a discussion on the performance benefits of
running external Solr on here last month:
http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2021-February/thread.html#8168
And documentation here:
https://archivesspace.github.io/tech-docs/provisioning/solr.html
Whether it affects indexing speed for large numbers of records I cannot
say. We made the decision to use it before all our data was migrated
into ArchivesSpace.
Andrew.
On 11/03/2021 15:32, Tom Hanstra wrote:
> Thanks, Andrew. Some responses intertwined below, italicized:
>
> On Thu, Mar 11, 2021 at 7:31 AM Andrew Morrison
> <andrew.morrison at bodleian.ox.ac.uk> wrote:
>
> You can allocate more memory to ArchivesSpace by setting the
> ASPACE_JAVA_XMX environment variable it runs under. Setting that
> to "-Xmx4g" should be sufficient.
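[A minimal sketch of the memory settings discussed above, assuming ArchivesSpace is started from a shell; the install path and the XSS value are assumptions, not from the thread:]

```shell
# Raise the ArchivesSpace Java heap (and, as Tom did, the thread stack)
# before starting the application. The variables are read by the
# ArchivesSpace startup script.
export ASPACE_JAVA_XMX="-Xmx4g"   # maximum heap: 4 GB
export ASPACE_JAVA_XSS="-Xss8m"   # thread stack size (illustrative value)
/opt/archivesspace/archivesspace.sh start
```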
>
> /I did bump that and the ASPACE_JAVA_XSS up a bit for this round,
> which looks like it will finally complete. Just a few more PUI records
> need to be added. /
>
> Those FATAL lines in the log snippet are caused by a bot probing
> for known vulnerabilities in common web platforms and
> applications, hoping to find a web site running an out-of-date
> copy (of Drupal in this case) which it can exploit. It has nothing
> to do with ArchivesSpace, which has no PHP code. It is merely
> logging that it doesn't know what to do with that request.
>
> /Thanks. I was hoping this was just extraneous. /
>
> How did you know your one successful re-indexing completed? There
> are two indexers, Staff and PUI, with the latter usually taking
> much longer to finish. So if the PUI indexer fails after the staff
> indexer finishes, you will see more records in the staff interface
> than the public interface, even if they're all set to be public.
> Also, both indexers log messages that could be interpreted as
> meaning they've finished, but they then run additional indexing to
> build trees, which enables navigation within collections. And
> finally they instruct Solr to commit changes, which can be slow
> depending on the performance of your storage. You could try
> doubling AppConfig[:indexer_solr_timeout_seconds] to allow more
> time for each operation.
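[A sketch of that timeout change in config/config.rb; the value assumes the common default of 300 seconds, so check your own defaults before copying:]

```ruby
# config/config.rb -- give the indexer more time per Solr operation.
# 600 = double the assumed 300-second default.
AppConfig[:indexer_solr_timeout_seconds] = 600
```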
>
> /At least one set of logs, when I earlier gave the server more
> resources, showed that the indexing had completed. But, because the
> second repository was showing nothing, I decided to index again.
>
> This time around, the second repository does show up, so that, too,
> indicates that things have gone better. I guess I have to wait for
> things to complete but there are still some questions outstanding. For
> instance, one search I did for "football" (something dear to the Notre
> Dame experience) within the repository which is supposed to be pretty
> much indexed, showed over 32K results on our hosted site but only 17K
> locally. That seems wildly off with only a few PUI records to be
> completed (log shows 736500 of 763368). Could the incomplete index
> really be that far off?/
> /I also notice that indexing overall slows down as it gets farther
> into our records. Is that because the later records still need work
> that earlier attempts never reached, while the first records buzz by
> rapidly thanks to those earlier indexing attempts? Or could it be
> that resources are taken up early in the processing and are no longer
> available for processing the later records? Is resource tuning just a
> trial-and-error prospect? I don't see a lot of information in the
> documentation./
>
> Or it could've re-indexed one repository but failed on the next. And
> it is possible for entire repositories to be set as non-public, which
> could be another explanation for fewer records.
>
> Are you running an external Solr? If so, is the AppConfig[:solr_url]
> in config.rb pointing to the correct server?
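[For reference, a sketch of that setting in config/config.rb; the host, port, and core name below are placeholders, not values from the thread:]

```ruby
# config/config.rb -- only relevant when running an external Solr.
# Replace the placeholder URL with your real Solr core.
AppConfig[:solr_url] = "http://solr.example.edu:8983/solr/archivesspace"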
>
> /I'm running a local Solr as part of the application. Is an external
> Solr a good idea for a site like ours? I will also do some tweaking
> with the Solr settings to see if that might help...after I get through
> at least one complete index. /
>
> There are many possible reasons for search slowness, including not
> enough memory. Are there any differences in the speed of doing the
> same search in the staff and public interfaces? Or between two
> ways of getting the same results in the PUI? For example, does the
> link in the header to list all collections
> (/repositories/resources) return results faster than searching
> everything then filtering to just collections
> (/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource)?
> There's a fix coming in 3.0.0 for the latter.
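[One way to time those two equivalent PUI requests from the paragraph above; the hostname is a placeholder, and the paths are taken from the message:]

```shell
# Compare total response time for the two queries that should return
# the same results. Replace the placeholder host with your PUI.
curl -s -o /dev/null -w "%{time_total}s\n" \
  "http://aspace.example.edu/repositories/resources"
curl -s -o /dev/null -w "%{time_total}s\n" \
  "http://aspace.example.edu/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource"
```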
>
> /I had not tried comparing staff to public. I will do that (though I
> first have to get some access to the staff side!). And I won't
> try to do much comparison until indexing is complete, in case the
> indexing itself is slowing things down. /
> /More questions to come, I'm sure. But thanks for your input and ideas
> of places to look further. Much appreciated./
>
> Andrew.
>
>
> On 11/03/2021 02:07, Tom Hanstra wrote:
>> I'm very new to ArchivesSpace and so my issues may be early
>> configuration problems. But I'm hoping some out there can assist.
>> We are moving from hosted to local, so I have a large database
>> full of data that I'm working with.
>>
>> Indexing
>> Right now, I'm running into two primary problems:
>>
>> - Twice now, I've hit issues where the indexing fails due to the
>> Java heap space being exhausted. Do others run into this? What do
>> others use for Java settings?
>> - I've broken out my PUI indexing log into a separate file and see
>> FATAL errors in it:
>> ------
>> I, [2021-03-10T15:32:03.747156 #2919]  INFO -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] Started GET "/system_api.php" for 206.189.134.38 at 2021-03-10 15:32:03 -0500
>> F, [2021-03-10T15:32:03.881297 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.881658 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] ActionController::RoutingError (No route matches [GET] "/system_api.php"):
>> F, [2021-03-10T15:32:03.881866 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.882085 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/debug_exceptions.rb:65:in `call'
>> [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
>> ------
>> Is this something to be concerned about? Why is it showing up in
>> the PUI log?
>>
>> Search issues
>> - Supposedly, I did get one round of indexing completed without a
>> heap error. But the resulting searches yielded numbers which were
>> incorrect compared to our hosted version. This is why I've been
>> trying reindexing. Is it usual to have indexing *look* like it is
>> complete but really be incomplete?
>> - When I do a search, the response is really slow. I've got nginx
>> set up as a proxy in front of ArchivesSpace and it is showing
>> that the slowness is in ArchivesSpace itself somewhere. I don't
>> see anything in the logs to show what is taking so long. Where
>> should I be checking for issues?
>>
>> Thanks,
>> Tom
>>
>> --
>> *Tom Hanstra*
>> /Sr. Systems Administrator/
>> hanstra at nd.edu
>>
>>
>>
>> _______________________________________________
>> Archivesspace_Users_Group mailing list
>> Archivesspace_Users_Group at lyralists.lyrasis.org
>> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
>
>
> --
> *Tom Hanstra*
> /Sr. Systems Administrator/
> hanstra at nd.edu
>
>
>