[Archivesspace_Users_Group] Indexing and search issues
Andrew Morrison
andrew.morrison at bodleian.ox.ac.uk
Thu Mar 11 13:25:33 EST 2021
> I also notice that indexing overall slows down as it gets farther
> into our records.
I haven't observed a slow-down in indexing, or at least not one
noticeable enough to make me want to measure it. As I mentioned, some
latter stages of the indexing process (building trees, committing
changes) that can run for a long time without logging anything. But for
the main indexing of archival objects, the last 1000 doesn't seem to
take longer than the first 1000.
> Is an external Solr a good idea for a site like ours?
Hard to say. There was a discussion on the performance benefits of
running external Solr on here last month:
http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/2021-February/thread.html#8168
And documentation here:
https://archivesspace.github.io/tech-docs/provisioning/solr.html
Whether it affects indexing speed for large numbers of records I cannot
say. We made the decision to use it before all our data was migrated
into ArchivesSpace.
Andrew.
On 11/03/2021 15:32, Tom Hanstra wrote:
> Thanks, Andrew. Some responses intertwined below, italicized:
>
> On Thu, Mar 11, 2021 at 7:31 AM Andrew Morrison
> <andrew.morrison at bodleian.ox.ac.uk> wrote:
>
> You can allocate more memory to ArchivesSpace by setting the
> ASPACE_JAVA_XMX environment variable it runs under. Setting that
> to "-Xmx4g" should be sufficient.
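[A minimal sketch of the memory settings discussed above, assuming ArchivesSpace is started from a shell; the install path and the XSS value are assumptions, not from the thread:]

```shell
# Raise the ArchivesSpace Java heap (and, as Tom did, the thread stack)
# before starting the application. The variables are read by the
# ArchivesSpace startup script.
export ASPACE_JAVA_XMX="-Xmx4g"   # maximum heap: 4 GB
export ASPACE_JAVA_XSS="-Xss8m"   # thread stack size (illustrative value)
/opt/archivesspace/archivesspace.sh start
```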
>
> /I did bump that and the ASPACE_JAVA_XSS up a bit for this round,
> which looks like it will finally complete. Just a few more PUI records
> need to be added. /
>
> Those FATAL lines in the log snippet are caused by a bot probing
> for known vulnerabilities in common web platforms and
> applications, hoping to find a web site running an out-of-date
> copy (of Drupal in this case) which it can exploit. It has nothing
> to do with ArchivesSpace, which has no PHP code. It is merely
> logging that it doesn't know what to do with that request.
>
> /Thanks. I was hoping this was just extraneous. /
>
> How did you know your one successful re-indexing completed? There
> are two indexers, Staff and PUI, with the latter usually taking
> much longer to finish. So if the PUI indexer fails after the staff
> indexer finishes, you will see more records in the staff interface
> than the public interface, even if they're all set to be public.
> Also, both indexers log messages that could be interpreted as
> meaning they've finished, but they then run additional indexing to
> build trees, which enables navigation within collections. And
> finally they instruct Solr to commit changes, which can be slow
> depending on the performance of your storage. You could try
> doubling AppConfig[:indexer_solr_timeout_seconds] to allow more
> time for each operation.
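[A sketch of that timeout change in config/config.rb; the value assumes the common default of 300 seconds, so check your own defaults before copying:]

```ruby
# config/config.rb -- give the indexer more time per Solr operation.
# 600 = double the assumed 300-second default.
AppConfig[:indexer_solr_timeout_seconds] = 600
```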
>
> /At least one set of logs, when I earlier gave the server more
> resources, showed that the indexing had completed. But, because the
> second repository was showing nothing, I decided to index again.
>
> This time around, the second repository does show up, so that, too,
> indicates that things have gone better. I guess I have to wait for
> things to complete but there are still some questions outstanding. For
> instance, one search I did for "football" (something dear to the Notre
> Dame experience) within the repository which is supposed to be pretty
> much indexed, showed over 32K results on our hosted site but only 17K
> locally. That seems wildly off with only a few PUI records to be
> completed (log shows 736500 of 763368). Could the incomplete index
> really be that far off?/
> /I also notice that indexing overall slows down as it gets farther
> into our records. Is that because the later records still need work
> that earlier attempts never reached, while the first records buzz by
> rapidly thanks to those earlier indexing attempts? Or could it be
> that resources are taken up early in the processing and are no longer
> available for processing the later records? Is resource tuning just a
> trial-and-error prospect? I don't see a lot of information in the
> documentation./
>
> Or it could've re-indexed one repository but failed on the next. And
> it is possible for entire repositories to be set as non-public, which
> could be another explanation for fewer records.
>
> Are you running an external Solr? If so, is the AppConfig[:solr_url]
> in config.rb pointing to the correct server?
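[For reference, a sketch of that setting in config/config.rb; the host, port, and core name below are placeholders, not values from the thread:]

```ruby
# config/config.rb -- only relevant when running an external Solr.
# Replace the placeholder URL with your real Solr core.
AppConfig[:solr_url] = "http://solr.example.edu:8983/solr/archivesspace"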
>
> /I'm running a local Solr as part of the application. Is an external
> Solr a good idea for a site like ours? I will also do some tweaking
> with the Solr settings to see if that might help...after I get through
> at least one complete index. /
>
> There are many possible reasons for search slowness, including not
> enough memory. Are there any differences in the speed of doing the
> same search in the staff and public interfaces? Or between two
> ways of getting the same results in the PUI? For example, does the
> link in the header to list all collections
> (/repositories/resources) return results faster than searching
> everything then filtering to just collections
> (/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource)?
> There's a fix coming in 3.0.0 for the latter.
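[One way to time those two equivalent PUI requests from the paragraph above; the hostname is a placeholder, and the paths are taken from the message:]

```shell
# Compare total response time for the two queries that should return
# the same results. Replace the placeholder host with your PUI.
curl -s -o /dev/null -w "%{time_total}s\n" \
  "http://aspace.example.edu/repositories/resources"
curl -s -o /dev/null -w "%{time_total}s\n" \
  "http://aspace.example.edu/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource"
```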
>
> /I had not tried comparing staff to public. I will do that (though I
> first have to get some access to the staff side!). And I won't
> try to do much comparison until indexing is complete, in case the
> indexing itself is slowing things down. /
> /More questions to come, I'm sure. But thanks for your input and ideas
> of places to look further. Much appreciated./
>
> Andrew.
>
>
> On 11/03/2021 02:07, Tom Hanstra wrote:
>> I'm very new to ArchivesSpace and so my issues may be early
>> configuration problems. But I'm hoping some out there can assist.
>> We are moving from hosted to local, so I have a large database
>> full of data that I'm working with.
>>
>> Indexing
>> Right now, I'm running into two primary problems:
>>
>> - Twice now, I've hit issues where the indexing fails due to the
>> Java heap space being exhausted. Do others run into this? What do
>> others use for Java settings?
>> - I've broken out my PUI indexing log into a separate file and see
>> FATAL errors in it:
>> ------
>> I, [2021-03-10T15:32:03.747156 #2919]  INFO -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] Started GET "/system_api.php" for 206.189.134.38 at 2021-03-10 15:32:03 -0500
>> F, [2021-03-10T15:32:03.881297 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.881658 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] ActionController::RoutingError (No route matches [GET] "/system_api.php"):
>> F, [2021-03-10T15:32:03.881866 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
>> F, [2021-03-10T15:32:03.882085 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/debug_exceptions.rb:65:in `call'
>> [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
>> ------
>> Is this something to be concerned about? Why is it showing up in
>> the PUI log?
>>
>> Search issues
>> - Supposedly, I did get one round of indexing completed without a
>> heap error. But the resulting searches yielded numbers which were
>> incorrect compared to our hosted version. This is why I've been
>> trying reindexing. Is it usual to have indexing *look* like it is
>> complete but really be incomplete?
>> - When I do a search, the response is really slow. I've got nginx
>> set up as a proxy in front of ArchivesSpace and it is showing
>> that the slowness is in ArchivesSpace itself somewhere. I don't
>> see anything in the logs to show what is taking so long. Where
>> should I be checking for issues?
>>
>> Thanks,
>> Tom
>>
>> --
>> *Tom Hanstra*
>> /Sr. Systems Administrator/
>> hanstra at nd.edu
>>
>>
>>
>> _______________________________________________
>> Archivesspace_Users_Group mailing list
>> Archivesspace_Users_Group at lyralists.lyrasis.org
>> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
>
>
> --
> *Tom Hanstra*
> /Sr. Systems Administrator/
> hanstra at nd.edu
>
>
>