[Archivesspace_Users_Group] Indexer failing on restart of service - anyone else seen this?

Tue Oct 13 13:54:23 EDT 2015

Hello Joshua,

I wanted to share the results of the indexing issues Yale Library was experiencing this summer. As you noted, it may be the same root cause to the problem, or otherwise just a good lesson about administration of ASpace.

To recap the problems at Yale Library related to indexing:

·         Staff using ASpace would report they created new content and were unable to retrieve it, either through type ahead or direct searches for the content. This created service disruption.

·         In an attempt to analyze the log files for the date/times of the reported problems, we frequently saw EOF/end of file errors. (I don’t see that error in your snippet below, but…)

·         What we saw was empty *.dat files, namely (repositories_repositories.dat and deletes_deletes.dat, and sometimes top_container, but that last one is a Yale customization)

·         We also would notice the the application would stop logging at the same time these files would turn to an empty state/0 state

·         We would initiate a partial reindex, then restart the application. This would temporarily fix the problem (app would resume logging, files would rebuild, staff would report “hey I found my stuff”)

·         However, with 3-4 hours the pattern would repeat, logging would stop, no new content could be retrieved, etc.

We had HM work with us to analyze the issue. The root cause was lack of disk space. Even with 25GB of allocated space, our production instance routinely ran as low as 12MB (yes an M) of free space. These fluctuations on the server basically lined up to every 4 hours when a solr backup was set to run. Our index is currently about 15GB. ASpace keeps the index and tries to create a full 2nd copy. 15 + 15 =30, so we ran out of space. We believe that is why logging stopped and needed a kick to get going again.

To test this theory, we temporarily disabled solr backups, and had a script from HM In place to monitor disk usage both pre and post solr backups. When the backups were temporarily suspended, the disk capacity did still go as high as 93%, but left us with 12GB (with a G) to spare. Logging has continued, indexing has remained up to date, and staff are able to use the system as expected.

Several recommendations were also made to write the backups to a different filesystem than the one ASpace was using (outsde of /usr/local), and monitor for disk capacity with alerts.

Again, not sure that this is the problem you are having, but you mention a stopstart, so maybe check for empty files, or if the app is logging, or if disk capacity is on the high end.

Melissa Wisner
Yale Library IT

From: archivesspace_users_group-bounces at lyralists.lyrasis.org [mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org] On Behalf Of Joshua D. Shaw
Sent: Tuesday, October 13, 2015 12:31 PM
To: Archivesspace Users Group
Subject: [Archivesspace_Users_Group] Indexer failing on restart of service - anyone else seen this?

Hey All-

This may be related to some of the indexer problems that Maureen and others have noticed from time to time, but I'm throwing this out there in case others want to chime in with a solution (best!) or some insight.

Anyway, here's the scenario. Our setup for production is configured as shown below the fold. Basically 4 frontend servers behind an F5. We've noticed that when we do a stop/start of the frontends (for plugin updates, etc) the indexer for one of the frontends sometimes fails. The logs for that server show a lot of errors of the following type:

D, [2015-10-07T17:53:28.369000 #16298] DEBUG -- : Thread-9382: POST /users/staff_system/login [session: nil]
D, [2015-10-07T17:53:28.377000 #16298] DEBUG -- : Thread-9382: Post-processed params: {:username=>"staff_system", :password=>"[FILTERED]", :expiring=>false}
D, [2015-10-07T17:53:28.529000 #16298] DEBUG -- : Thread-9382: Responded with [403, {"Content-Type"=>"application/json", "Cache-Control"=>"private, must-revalidate, max-age=0", "Content-Length"=>"25"}, ["{\"error\":\"Login failed\"}\n"]]... in 161.0ms
D, [2015-10-07T17:53:28.542000 #16298] DEBUG -- : Thread-9384: POST /update_monitor [session: nil]
D, [2015-10-07T17:53:28.548000 #16298] DEBUG -- : Thread-9384: Post-processed params: {:active_edits=>#<JSONModel(:active_edits) {"jsonmodel_type"=>"active_edits", "active_edits"=>[]}>}
D, [2015-10-07T17:53:28.550000 #16298] DEBUG -- : Thread-9384: Responded with [403, {"Content-Type"=>"application/json", "Cache-Control"=>"private, must-revalidate, max-age=0", "Content-Length"=>"26"}, ["{\"error\":\"Access denied\"}\n"]]... in 8.0ms
D, [2015-10-07T17:53:28.558000 #16298] DEBUG -- : Thread-5526: POST /users/staff_system/login [session: nil]
D, [2015-10-07T17:53:28.564000 #16298] DEBUG -- : Thread-5526: Post-processed params: {:username=>"staff_system", :password=>"[FILTERED]", :expiring=>false}
D, [2015-10-07T17:53:28.723000 #16298] DEBUG -- : Thread-5526: Responded with [403, {"Content-Type"=>"application/json", "Cache-Control"=>"private, must-revalidate, max-age=0", "Content-Length"=>"25"}, ["{\"error\":\"Login failed\"}\n"]]... in 165.0ms
D, [2015-10-07T17:53:32.879000 #16298] DEBUG -- : Thread-5518: POST /users/search_indexer/login [session: nil]
D, [2015-10-07T17:53:32.885000 #16298] DEBUG -- : Thread-5518: Post-processed params: {:username=>"search_indexer", :password=>"[FILTERED]", :expiring=>false}
D, [2015-10-07T17:53:33.050000 #16298] DEBUG -- : Thread-5518: Responded with [403, {"Content-Type"=>"application/json", "Cache-Control"=>"private, must-revalidate, max-age=0", "Content-Length"=>"25"}, ["{\"error\":\"Login failed\"}\n"]]... in 171.0ms
#<RuntimeError: Authentication to backend failed: {"error":"Login failed"}

Sometimes a simple start/stop of that particular server solves the issue, but we have had to replicate a known good server to the bad to clear things up.

Has anyone else seen something similar? Any thoughts on what a potential cause (and cure) might be?

Thanks!
Joshua

Setup details with some obfuscation for security.

-          ArchivesSpace version: 1.3
-          Java version: 1.7.0_85-b15
-          CPU cores: 4 cores (each server)
-          Memory: 36gb (each server)

At the top level, clients connection for our hardware load balancer, Big-IP’s F5. The F5 terminates SSL and sends HTTP traffic to one of four servers running Oracle HTTP Server (OHS), which is essentially Apache. Two of these servers also run the ArchivesSpace application. OHS is configured to send the HTTP traffic to the two application servers.

[cid:image001.png at 01D105BE.A9BFF5F0]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/attachments/20151013/e889f74f/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 51836 bytes
Desc: image001.png
URL: <http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/attachments/20151013/e889f74f/attachment.png>