[Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

Wed May 24 14:38:53 EDT 2023

Michael Smith wrote on 2023-05-23 23:52:45:
> Our team has been facing recurring issues with our ArchivesSpace setup since
> October last year, which we've been unable to fully resolve despite
> concerted efforts.

Could pretty much be a description of our team up until very recently, when a
team member was finally able to hook the application up with Datadog's tracing
facilities.

We're currently running 3.1.1 with a standalone Solr 7.7.3 and MariaDB 10.4.24
on Ubuntu 20.04/22.04 servers.

> The primary problem involves intermittent system slowdowns and shutdowns,
> requiring frequent reboots to regain functionality. This occurs on average
> 3-4 times weekly but can sometimes be more frequent. This issue is affecting
> multiple teams across our organization.

Using the tracing facilities mentioned above, we found that the object
resolver in ArchivesSpace does not deduplicate the object tree properly. As a
result, a resource we had with over 1100 event links produced a 130MB+ JSON
object, which was then parsed into 1.3GB of Ruby data. Due to a quirk of
rendering, all of this was done twice. We reported this on GitHub
(https://github.com/archivesspace/archivesspace/issues/2993) 3 weeks ago.
The events were not very important to our archivists, so we ended up deleting
them.
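
To illustrate the kind of blow-up we mean, here is a toy Python sketch. It is
not ArchivesSpace's resolver code: the record shapes, the `_resolved` key and
the function names are all made up for the example. It only shows how inlining
the same referenced record under every link inflates the output, compared with
expanding each URI once.

```python
import json


def make_store(n_events):
    """Build a hypothetical store: one resource, n events, one shared agent."""
    store = {"/agents/1": {"uri": "/agents/1", "notes": "x" * 1000}}
    for i in range(n_events):
        store[f"/events/{i}"] = {
            "uri": f"/events/{i}",
            "linked_agent": {"ref": "/agents/1"},
        }
    store["/resources/1"] = {
        "uri": "/resources/1",
        "linked_events": [{"ref": f"/events/{i}"} for i in range(n_events)],
    }
    return store


def resolve(node, store, seen=None):
    """Inline referenced records under a '_resolved' key.

    With seen=None every ref is expanded every time it occurs (naive).
    With a set, each URI is expanded once; repeats stay as bare refs.
    """
    if isinstance(node, list):
        return [resolve(item, store, seen) for item in node]
    if not isinstance(node, dict):
        return node
    out = {key: resolve(value, store, seen) for key, value in node.items()}
    ref = node.get("ref")
    if ref in store:
        if seen is None:
            out["_resolved"] = resolve(store[ref], store, seen)
        elif ref not in seen:
            seen.add(ref)
            out["_resolved"] = resolve(store[ref], store, seen)
    return out


store = make_store(100)
naive = json.dumps(resolve(store["/resources/1"], store))
dedup = json.dumps(resolve(store["/resources/1"], store, seen=set()))
print(len(naive), len(dedup))
```

In the toy data the shared agent record is serialised once per event in the
naive case and only once in the deduplicated case, which is the general shape
of what we saw, just at a much smaller scale.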

We've also found that search is suboptimal for us. Searches take
exponentially longer with every added term, and for every search thousands of
requests are made to populate the 'Found in' column of the results. We're on
an old version of Solr and are using a fairly old schema, so we want to
upgrade both before we report this issue.

We've also noticed that database queries trying to update the archivesspace
software agent's system_mtime are failing and we've found that the row has not
been updated since we switched from 2.8.1 to 3.1.1. Possibly linked to this...

> The most common symptom of our problem that we are seeing now looks to be a
> connection pool leak where what looks like indexer threads are holding
> connections in a closed wait state and preventing them from being used for
> other requests.  This leads to the main page timing out and staff seeing 504
> errors, when unresponsive in this manner we usually restart the application.

...our main problem: users are unable to save records because the updates
time out waiting for locks. Looking at the database processlist, we've
observed 2-3 instances of identical update queries in different sessions, and
at the tracing level the queries retry several times before failing on their
LIMIT 1 clause, as there are no rows to update. We don't fully understand this
problem yet but, having seen your message, that might be because we don't see
indexer threads in the traces, as they're on a different host.
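
If you want to check for the same pattern, a rough Python sketch of how one
might spot identical statements running in several sessions is below. It
assumes the rows of SHOW FULL PROCESSLIST have already been fetched as
dictionaries keyed by column name (Id, Info, ...); the helper name and the
sample rows are invented for illustration and are not our real queries.

```python
from collections import defaultdict


def duplicate_queries(processlist, min_sessions=2):
    """Return statements that are running in min_sessions or more sessions.

    processlist: iterable of dicts with at least 'Id' and 'Info' keys,
    as produced by SHOW FULL PROCESSLIST read through a dict-style cursor.
    """
    sessions = defaultdict(list)
    for row in processlist:
        info = row.get("Info")
        if not info:  # idle connections carry no statement text
            continue
        # Collapse whitespace so formatting differences don't hide duplicates.
        sessions[" ".join(info.split())].append(row["Id"])
    return {q: ids for q, ids in sessions.items() if len(ids) >= min_sessions}


# Made-up sample rows, only to show the expected input shape.
sample = [
    {"Id": 11, "Info": "UPDATE some_table SET system_mtime = NOW() WHERE id = 1 LIMIT 1"},
    {"Id": 12, "Info": "UPDATE some_table  SET system_mtime = NOW() WHERE id = 1 LIMIT 1"},
    {"Id": 13, "Info": None},
    {"Id": 14, "Info": "SELECT 1"},
]
for query, ids in duplicate_queries(sample).items():
    print(ids, query)
```

Running something like this every few seconds while a save is hanging should
show whether the same update is queued up in more than one session.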

> Some of the things we’ve attempted so far,
> 
>   *   changed default config settings for indexer records per thread, thread
>   count and solr timeout to 10, 2 & 300

>   *   modified archivesspace.sh to increase memory available
>   (ASPACE_JAVA_XMX="-Xmx35g")

We're on 56GB of heap now. We have ~3.2 million objects in the database across
~30 repositories; I believe this to be one of the larger installations of AS
out there.

>   *   disabled both PUI and PUI indexer

We've actually been thinking of doing this; we currently have the indexer on a
separate host. Does disabling the indexers impact visibility of changes in any
way for you?

> Any advice with further diagnosis / troubleshooting would be appreciated. If
> you need additional information about our setup or the issues we're
> encountering, please let us know.

Our colleague has written a trivial plugin that enables Datadog tracing and
telemetry and it has been, excuse the phrasing, instrumental. He also made it
public, the brilliant bloke (use the log-scope branch for now):
https://gitlab.developers.cam.ac.uk/lib/dev/ams/aspace-datadog

Hope that helps,
p