[Archivesspace_Users_Group] Indexing and search issues

Tom Hanstra hanstra at nd.edu
Thu Mar 11 10:32:20 EST 2021


Thanks, Andrew. Some responses intertwined below, italicized:

On Thu, Mar 11, 2021 at 7:31 AM Andrew Morrison <
andrew.morrison at bodleian.ox.ac.uk> wrote:

> You can allocate more memory to ArchivesSpace by setting the
> ASPACE_JAVA_XMX environment variable in the environment it runs under.
> Setting that to "-Xmx4g" should be sufficient.
>
*I did bump that and the ASPACE_JAVA_XSS up a bit for this round, which
looks like it will finally complete. Just a few more PUI records need to be
added. *
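
*For the archive, a sketch of the other knobs I am eyeing for heap
pressure besides the JVM settings above. I have not verified these names
or defaults against our release, so treat them as assumptions taken from
the stock config.rb:*

------
# config/config.rb -- sketch only; adjust or omit if your release differs.
# Smaller indexer batches should keep fewer records in memory at once.
AppConfig[:indexer_records_per_thread] = 10   # assumed stock default: 25
AppConfig[:indexer_thread_count]       = 2    # assumed stock default: 4
------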

> Those FATAL lines in the log snippet are caused by a bot probing for known
> vulnerabilities in common web platforms and applications, hoping to find a
> web site running an out-of-date copy (of Drupal in this case) which it can
> exploit. It has nothing to do with ArchivesSpace, which has no PHP code. It
> is merely logging that it doesn't know what to do with that request.
>
*Thanks. I was hoping this was just extraneous. *

> How did you know your one successful re-indexing completed? There are two
> indexers, Staff and PUI, with the latter usually taking much longer to
> finish. So if the PUI indexer fails after the staff indexer finishes, you
> will see more records in the staff interface than the public interface,
> even if they're all set to be public. Also both indexers log messages that
> could be interpreted as meaning they've finished, but they then run
> additional indexing to build trees, to enable navigation within collections
> to work. And finally they instruct Solr to commit changes, which can be slow
> depending on the performance of your storage. You could try doubling
> AppConfig[:indexer_solr_timeout_seconds] to allow more time for each
> operation.
>


*At least one set of logs, from an earlier run after I gave the server more
resources, showed that the indexing had completed. But because the second
repository was showing nothing, I decided to reindex. This time around the
second repository does show up, which also suggests things have gone better.
I suppose I have to wait for everything to complete, but there are still
some outstanding questions. For instance, a search for "football" (something
dear to the Notre Dame experience) within the repository that should be
mostly indexed returned over 32K results on our hosted site but only 17K
locally. That seems wildly off with only a few PUI records left to index
(the log shows 736500 of 763368). Could an almost-complete index really be
that far off?*

*I also notice that indexing slows down overall as it gets farther into our
records. Is that likely because the later records still need work that
earlier attempts never reached, while the first records buzz by quickly
since they were already handled in those earlier attempts? Or could it be
that resources consumed early in the run are no longer available for the
later records? Is resource tuning mostly a trial-and-error prospect? I don't
see a lot of guidance in the documentation.*
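
*Noting here, for when this run finishes, roughly what Andrew's timeout
suggestion would look like in config/config.rb. The 300-second default is
my assumption from the stock config, and 600 is simply "double it":*

------
# config/config.rb -- give Solr more time per indexer operation (commits,
# tree builds) before the indexer gives up. Assumes the default is 300s.
AppConfig[:indexer_solr_timeout_seconds] = 600
------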

> Or it could've re-indexed one repository but failed on the next. And it is
> possible for entire repositories to be set as non-public, which could be
> another explanation for fewer records.
>
> Are you running an external Solr? If so, is the AppConfig[:solr_url] in
> config.rb pointing to the correct server?
*I'm running the local Solr that comes as part of the application. Is an
external Solr a good idea for a site like ours? I will also do some tweaking
of the Solr settings to see if that might help, after I get through at least
one complete index.*
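
*If we do split Solr out later, my understanding is the change would look
something like the lines below. The hostname, core name, and the enable_solr
toggle are assumptions on my part, not something I have tested:*

------
# config/config.rb -- sketch for pointing at an external Solr (untested).
AppConfig[:solr_url]    = "http://your-solr-host:8983/solr/archivesspace"
AppConfig[:enable_solr] = false   # assumed switch to stop the embedded Solr
------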

> There are many possible reasons for search slowness, including not enough
> memory. Are there any differences in the speed of doing the same search in
> the staff and public interfaces? Or between two ways of getting the same
> results in the PUI. For example, does the link in the header to list all
> collections (/repositories/resources) return results faster than searching
> everything then filtering to just collections
> (/search?q[]=*&op[]=&field[]=keyword&filter_fields[]=primary_type&filter_values[]=resource)?
> There's a fix coming in 3.0.0 for the latter.
>
*I had not tried comparing staff to public. I will do that (though I first
have to get access to the staff side!). I won't try to do much comparison
until indexing is complete, though, in case the indexing itself is slowing
things down.*

*More questions to come, I'm sure. But thanks for your input and ideas of
places to look further. Much appreciated.*

> Andrew.
>
>
> On 11/03/2021 02:07, Tom Hanstra wrote:
>
> I'm very new to ArchivesSpace and so my issues may be early configuration
> problems. But I'm hoping some out there can assist. We are moving from
> hosted to local, so I have a large database full of data that I'm working
> with.
>
> Indexing
> Right now, I'm running into two primary problems:
>
> - Twice now, I've hit issues where the indexing fails due to the Java heap
> space being exhausted. Do others run into this? What do others use for Java
> settings?
> - I've broken the PUI indexing output out into a separate log and see FATAL
> errors in it:
> ------
> I, [2021-03-10T15:32:03.747156 #2919]  INFO -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] Started GET "/system_api.php" for 206.189.134.38 at 2021-03-10 15:32:03 -0500
> F, [2021-03-10T15:32:03.881297 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
> F, [2021-03-10T15:32:03.881658 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] ActionController::RoutingError (No route matches [GET] "/system_api.php"):
> F, [2021-03-10T15:32:03.881866 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5]
> F, [2021-03-10T15:32:03.882085 #2919] FATAL -- : [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/debug_exceptions.rb:65:in `call'
> [1b34df32-d3b7-49c3-b205-01a59daf03e5] actionpack (5.2.4.4) lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
> ------
> Is this something to be concerned about? Why is it showing up in the PUI
> log?
>
> Search issues
> - Supposedly, I did get one round of indexing completed without a heap
> error. But the resulting searches yielded counts that did not match our
> hosted version, which is why I have been reindexing. Is it usual for
> indexing to *look* complete but really be incomplete?
> - When I do a search, the response is really slow. I've got nginx set up
> as a proxy in front of ArchivesSpace and it is showing that the slowness is
> in ArchivesSpace itself somewhere. I don't see anything in the logs to show
> what is taking so long. Where should I be checking for issues?
>
> Thanks,
> Tom
>
> --
> *Tom Hanstra*
> *Sr. Systems Administrator*
> hanstra at nd.edu
>
>
>
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>


-- 
*Tom Hanstra*
*Sr. Systems Administrator*
hanstra at nd.edu