[Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

Wed May 24 11:07:55 EDT 2023

Hi Michael

These aren't answers, but I think it would help the group to know a bit more about how your instance is structured, both from a tech perspective (memory allocated to the app and to Solr) and in terms of scale: how many repos and how many objects (resources, AOs, etc.) are in the DB. The structure of your resources may also be useful, i.e. are they wide, deep, or both? Wide means a lot of siblings at each level but not many levels in the hierarchy; deep means many levels in the hierarchy but fewer siblings at each level.
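To make the wide/deep distinction concrete, here is a minimal Python sketch that takes (id, parent_id) pairs for the AOs of one resource and reports the maximum depth and the largest sibling group. The function name and the input shape are made up for illustration; I'm assuming the pairs could be pulled from something like the id and parent_id columns of the archival_object table, which you should verify against your own schema.

```python
from collections import defaultdict


def hierarchy_shape(rows):
    """Return (max_depth, max_siblings) for one resource's component tree.

    rows: iterable of (id, parent_id) pairs; parent_id is None for
    top-level components. This input shape is an assumption for illustration.
    """
    children = defaultdict(list)
    for node_id, parent_id in rows:
        children[parent_id].append(node_id)

    max_depth = 0
    max_siblings = 0
    # Iterative walk from the top level, so very deep trees cannot hit
    # Python's recursion limit.
    stack = [(None, 0)]
    while stack:
        parent, depth = stack.pop()
        kids = children.get(parent, [])
        max_siblings = max(max_siblings, len(kids))
        for kid in kids:
            max_depth = max(max_depth, depth + 1)
            stack.append((kid, depth + 1))
    return max_depth, max_siblings
```

A resource with a large max_siblings and a small max_depth would be "wide" in the sense above; the reverse would be "deep".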

The plugins that you are using probably aren't the culprit, but they can add/override index functionality, so listing those out may help as well.

It might also be good to know how many edits are made concurrently on average.

A couple of things sprang to mind to check (if you haven't already). Have you noticed this same behavior in an instance that is not in use? I.e., have you set up a clone of your production instance, let it do its initial full index, and then just let it sit? Do you see errors in the app log that have any bearing on the problem, or that pop up around or just before the app goes unresponsive or runs out of memory (OOM)?
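One rough way to look for errors that precede an OOM is to scan the app log and collect the error lines in a window before each OutOfMemoryError. The sketch below assumes the log is plain text, that error lines contain "ERROR" or "Exception", and that an OOM shows up as a line containing "OutOfMemoryError"; the function name and the window size are arbitrary, so adjust all of this to whatever your log actually looks like.

```python
from collections import deque


def errors_before_oom(lines, window=50, marker="OutOfMemoryError"):
    """Return a list of (oom_line_no, error_lines) pairs.

    error_lines are the lines containing "ERROR" or "Exception" among the
    `window` lines immediately before each line that contains `marker`.
    The log format assumed here is hypothetical.
    """
    recent = deque(maxlen=window)  # rolling buffer of the last `window` lines
    results = []
    for line_no, line in enumerate(lines, start=1):
        if marker in line:
            errors = [l for l in recent if "ERROR" in l or "Exception" in l]
            results.append((line_no, errors))
        recent.append(line.rstrip("\n"))
    return results
```

It can be fed an open file directly, e.g. `with open(path) as f: errors_before_oom(f)`, where `path` is wherever your instance writes its log.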

In case it helps for comparison, Dartmouth is running 3.3.1 (we skipped 3.2.0) and allocating 4GB each to the app and Solr, with everything running in containers. We have 5 repos, though only one is utilized much. That repo has about 15k resources and 670k AOs with 30k top containers and 15k agents. We have relatively few events or subjects. The resources tend to be wide, with a maximum of 4 levels of hierarchy. Our largest resource has tens of thousands of AOs in the hierarchy. We also run a huge number of plugins. We have relatively few editors: fewer than 5 at any one time.

Full index typically takes about 24 hours. We have not seen memory issues in any of our instances, though I have occasionally seen indexer timeouts during a full index. We have stock settings for the indexer (4, 1, 25), though I had to raise the Solr timeout a huge amount, to 7200, for 3.3.1 to avoid Solr timeouts. We do run the PUI, so much of the full index time is the PUI index churning away. Staff side indexing takes about 6-8 hours.

Best,
Joshua

________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Michael Smith <mismith at nla.gov.au>
Sent: Tuesday, May 23, 2023 7:52 PM
To: archivesspace_users_group at lyralists.lyrasis.org <archivesspace_users_group at lyralists.lyrasis.org>
Subject: [Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

Hello,

Our team has been facing recurring issues with our ArchivesSpace setup since October last year, which we've been unable to fully resolve despite concerted efforts.

We’re currently running v3.2 on Red Hat Enterprise Linux Server 7.9 (Maipo) and we do have a few custom plugins developed by Hudmol. These don’t appear to be causing the issues that we’re seeing but we haven’t ruled that out yet.

The primary problem involves intermittent system slowdowns and shutdowns, requiring frequent reboots to regain functionality. This occurs on average 3-4 times weekly but can sometimes be more frequent. This issue is affecting multiple teams across our organization.

The most common symptom we are seeing now looks to be a connection pool leak, where what look like indexer threads are holding connections in a closed wait state and preventing them from being used for other requests. This leads to the main page timing out and staff seeing 504 errors. When the application is unresponsive in this manner we usually restart it. If the application hits an OOM, it will restart itself.
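If "closed wait" here means the TCP CLOSE_WAIT state (an assumption), one way to track the leak over time is to count connections per state at intervals and watch whether the CLOSE_WAIT count only ever grows. The following is a minimal Python sketch that tallies states from `netstat -ant` style output; the function name is made up, and the column layout is assumed to be the usual Linux one (proto, recv-q, send-q, local address, foreign address, state).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connection states in `netstat -ant` style text output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: proto recv-q send-q local foreign state.
        # Banner and header lines do not start with "tcp" and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states
```

The input could come from `subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout`, logged every few minutes alongside a timestamp so the counts can be lined up with the slowdowns.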

Some of the things we’ve attempted so far:

  *   changed default config settings for indexer records per thread, thread count and Solr timeout to 10, 2 and 300 respectively
  *   modified archivesspace.sh to increase memory available (ASPACE_JAVA_XMX="-Xmx35g")
  *   disabled both PUI and PUI indexer
  *   switched application logging to a circular log
  *   changed the garbage collection policies (ASPACE_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:NewRatio=1 -XX:+ExitOnOutOfMemoryError -XX:+UseGCOverheadLimit")
  *   checked top_containers with empty relationships (0 results)
  *   checked for duplicate event relationships (0 results)
  *   checked for empty indexer state files per record type (0 empty state files)
  *   nightly restarts of the system

Any advice with further diagnosis / troubleshooting would be appreciated. If you need additional information about our setup or the issues we're encountering, please let us know.

Regards,

Michael Smith  |  Software Developer
02 6262 1029  |  mismith at nla.gov.au  |  National Library of Australia

The National Library of Australia acknowledges Australia’s First Nations Peoples – the First Australians – as the Traditional Owners and Custodians of this land and gives respect to the Elders – past and present – and through them to all Australian Aboriginal and Torres Strait Islander people.