[Archivesspace_Users_Group] Indexing repository details in all records skews results set
Joshua D. Shaw
Joshua.D.Shaw at dartmouth.edu
Wed Jun 27 11:02:43 EDT 2018
Hi Christine-
I found it's in both the staff and public interfaces, though the public side does seem to be more problematic - since it seems to be even greedier about how much it's adding to the index fields. I think we got "lucky" because of the whole "webster" thing.
I'd be happy to chat with Laney and anyone else about what I've been poking at!
Best,
Joshua
PS. For the record - and in case anyone else is doing something similar - one of the issues I ran into initially had to do with adding data from ancestors. I was initially adding the resource as a fully resolved attribute to an AO using the "add_attribute_to_resolve" method and then fiddling with that data to get what I needed for faceting and display. But I had forgotten about the fullrecord field and all of the resource data was being pushed into the AO fullrecord as well. I've switched to fetching each resource, extracting the data from that, and pushing just what I need into the AO record. It adds a bunch of overhead to the indexer process, so I'd be happy to hear if anyone has a better idea!
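One possible way to limit the overhead described above is to fetch each resource only once per indexing run and keep just the fields the AO documents need. The following is a minimal sketch under stated assumptions, not the approach actually used at Dartmouth: the class name, the fetcher block, the fake backend hash, and the 'identifier' and 'title' keys are all hypothetical and are not part of ArchivesSpace.

```ruby
# Hypothetical helper: none of these names come from ArchivesSpace.
class ResourceSummaryCache
  attr_reader :fetch_count

  # The block receives a resource URI and returns the full resource hash.
  def initialize(&fetcher)
    @fetcher = fetcher
    @summaries = {}
    @fetch_count = 0
  end

  # Returns only the fields wanted on the archival object document,
  # fetching the full resource at most once per URI.
  def summary_for(resource_uri)
    @summaries[resource_uri] ||= begin
      @fetch_count += 1
      resource = @fetcher.call(resource_uri)
      { 'identifier' => resource['identifier'], 'title' => resource['title'] }
    end
  end
end

# Stand-in for a real backend request (assumption for illustration).
fake_backend = {
  '/repositories/2/resources/1' => {
    'identifier' => 'MS-1371b',
    'title' => 'Mario Puzo papers',
    'notes' => ['long text that should not reach the AO document']
  }
}

cache = ResourceSummaryCache.new { |uri| fake_backend.fetch(uri) }

# Three archival objects share one resource, so one fetch should suffice.
ao_docs = [3, 4, 5].map do |id|
  doc = { 'id' => "/repositories/2/archival_objects/#{id}" }
  summary = cache.summary_for('/repositories/2/resources/1')
  doc['resource_identifier_u_sstr'] = summary['identifier']
  doc['resource_title_u_sstr'] = summary['title']
  doc
end

puts cache.fetch_count
```

Because only the summary hash is copied onto each document, the rest of the resource never reaches the AO record, so it cannot leak into a catch-all text field.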
________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Christine Di Bella <christine.dibella at lyrasis.org>
Sent: Wednesday, June 27, 2018 10:36 AM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Indexing repository details in all records skews results set
Hi Joshua,
Thanks for posting about this. Just to clarify, are you referring to search behavior in the staff interface or the public interface (or both)?
If it’s the public interface, you’ve hit the nail on the head – the full record field is a big issue there. We’ve been working on specific improvements to search behavior that we believe will address situations like this. We’d love to have you test what we’ve been doing and can point you to it, if you’d like.
(If it’s the staff interface, full record exhibits the same behavior, but it seems to be an issue for fewer people.)
We’d also love your help with this indexer work if you’d like to talk with Laney and some others who have been looking into this. Sounds like your perspective and investigations could be really helpful to everyone, especially while we’re all knee deep in it!
Christine
Christine Di Bella
ArchivesSpace Program Manager
christine.dibella at lyrasis.org
800.999.8558 x2905
678-235-2905
cdibella13 (Skype)
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> On Behalf Of Joshua D. Shaw
Sent: Wednesday, June 27, 2018 10:31 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Indexing repository details in all records skews results set
I did a little more digging and, to answer my own question, the "fullrecord" field holds everything (well, almost) in the Solr doc. I think that the steps to build this field, specifically the "extract_string_values" method in IndexerCommon, are probably a bit greedy and should skip the repository in addition to the update times, etc. I'm testing that locally.
My own issue was also complicated by some custom indexer stuff I'm doing that was initially adding the resource as a fully resolved attribute to the AO docs (I'm doing it differently now), which doubled the fullrecord issue and added its own headaches for search relevancy.
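To illustrate the idea of a greedy "collect every string" step that skips selected keys, here is a minimal standalone sketch. It is not the actual "extract_string_values" code from IndexerCommon: the method name collect_strings, the SKIPPED_KEYS list, and the sample record are assumptions made for illustration only.

```ruby
# Hypothetical stand-in for a "collect every string" routine; this is NOT
# the ArchivesSpace implementation.
SKIPPED_KEYS = %w[created_by last_modified_by system_mtime user_mtime
                  create_time repository].freeze

# Walks nested hashes and arrays and returns every string value found,
# ignoring any hash entry whose key is listed in `skipped`.
def collect_strings(node, skipped = SKIPPED_KEYS)
  case node
  when Hash
    node.flat_map do |key, value|
      skipped.include?(key.to_s) ? [] : collect_strings(value, skipped)
    end
  when Array
    node.flat_map { |item| collect_strings(item, skipped) }
  when String
    [node]
  else
    []
  end
end

# Sample record loosely modelled on the top container in this thread.
record = {
  'title' => 'Box 53',
  'created_by' => 'admin',
  'repository' => {
    'ref' => '/repositories/2',
    '_resolved' => { 'name' => 'Rauner Special Collections Library' }
  },
  'collection' => [{ 'display_string' => 'Mario Puzo papers' }]
}

greedy = collect_strings(record, []).join(' ')  # nothing skipped
trimmed = collect_strings(record).join(' ')     # repository and audit fields skipped

puts greedy
puts trimmed
```

With nothing skipped, the repository name lands in the combined text of every record; with "repository" in the skip list it does not, which is the behavior change being tested locally.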
Joshua
________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Joshua D. Shaw <Joshua.D.Shaw at dartmouth.edu>
Sent: Tuesday, June 26, 2018 5:25 PM
To: Archivesspace Users Group
Subject: [Archivesspace_Users_Group] Indexing repository details in all records skews results set
Hi All-
I think this has been the behavior of AS from the beginning, but during some recent testing, I finally realized that AS is indexing the repository details with every record in the repository. Since part of our address is "6065 Webster Hall" and we have a *lot* of Daniel Webster related material (he's a Dartmouth alum), searching for "webster" is a bad thing since every record in the repo is listed. In a vanilla install, you can see the repository details in the json package in the results (result['json']), so that sort of made sense....
I've done some cooking of the indexer to remove the resolved repository details (result['json']['repository']['_resolved']) and fiddle with some other things, but even though the json representation of the search results contains no instance of the search string, I *still* get results based on the repository details.
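A rough sketch of why blanking the resolved repository in the stored json may not be enough on its own: any separately built text field is untouched by that change. The document below is hypothetical; its 'fullrecord' value is filled in by hand to stand in for whatever the indexer builds (the "fullrecord" field is discussed in the later replies in this thread), and blank_resolved_repository is an illustrative name, not an ArchivesSpace method.

```ruby
require 'json'

# Hypothetical Solr-style document; 'fullrecord' is written by hand here.
doc = {
  'id' => '/repositories/2/top_containers/53',
  'json' => JSON.generate({
    'indicator' => '53',
    'repository' => {
      'ref' => '/repositories/2',
      '_resolved' => { 'name' => 'Rauner Special Collections Library' }
    }
  }),
  'fullrecord' => '53 box Rauner Special Collections Library'
}

# Blanks the resolved repository inside the stored JSON only and returns
# a new document; every other field is copied over unchanged.
def blank_resolved_repository(doc)
  record = JSON.parse(doc['json'])
  repository = record['repository']
  repository['_resolved'] = '' if repository.is_a?(Hash)
  doc.merge('json' => JSON.generate(record))
end

cleaned = blank_resolved_repository(doc)

puts cleaned['json'].include?('Rauner')
puts cleaned['fullrecord'].include?('Rauner')
```

In this sketch the stored json no longer mentions the repository name, but the separate text field still does, so a search against that field would still match.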
Example:
Repository Name is "rauner" and the long name is "Rauner Special Collections Library"
Search: "rauner"
Example results in json for a top container and an archival object below. Note that these *do not* contain the string "rauner"
I must be missing something in how the indexer is actually storing and searching data. I'd love to know if someone has a method to remove the repository details (and anything else global) from the results to prevent this sort of thing and to cut down on erroneous results.
Thanks!
Joshua
TC:
{
"id": "/repositories/2/top_containers/53",
"uri": "/repositories/2/top_containers/53",
"title": "MS-1371b, Box 53",
"primary_type": "top_container",
"types": [
"top_container"
],
"json": "{\"lock_version\":38,\"indicator\":\"53\",\"created_by\":\"admin\",\"last_modified_by\":\"admin\",\"create_time\":\"2018-06-26T20:28:33Z\",\"system_mtime\":\"2018-06-26T21:11:11Z\",\"user_mtime\":\"2018-06-26T20:28:33Z\",\"type\":\"box\",\"jsonmodel_type\":\"top_container\",\"active_restrictions\":[],\"container_locations\":[],\"series\":[],\"collection\":[{\"ref\":\"/repositories/2/resources/1\",\"identifier\":\"MS-1371b\",\"display_string\":\"Mario Puzo papers\"}],\"uri\":\"/repositories/2/top_containers/53\",\"repository\":{\"ref\":\"/repositories/2\",\"_resolved\":\"\"},\"restricted\":false,\"is_linked_to_published_record\":false,\"display_string\":\"Box 53\",\"long_display_string\":\"MS-1371b, Box 53\"}",
"suppressed": false,
"publish": false,
"system_generated": false,
"repository": "/repositories/2",
"type_enum_s": [
"box"
],
"created_by": "admin",
"last_modified_by": "admin",
"user_mtime": "2018-06-26T20:28:33Z",
"system_mtime": "2018-06-26T21:11:11Z",
"create_time": "2018-06-26T20:28:33Z",
"display_string": "Box 53",
"collection_uri_u_sstr": [
"/repositories/2/resources/1"
],
"collection_display_string_u_sstr": [
"Mario Puzo papers"
],
"collection_identifier_stored_u_sstr": [
"MS-1371b"
],
"collection_identifier_u_stext": [
"MS-1371b",
"MS 1371b",
"MS1371b",
"MS- 1371 b"
],
"exported_u_sbool": [
false
],
"empty_u_sbool": [
false
],
"indicator_u_stext": [
"53"
],
"jsonmodel_type": "top_container"
}
AO:
{
"id": "/repositories/2/archival_objects/3",
"uri": "/repositories/2/archival_objects/3",
"title": "<emph render=\"italic\">The Fortunate Pilgrim</emph>",
"primary_type": "archival_object",
"types": [
"archival_object"
],
"json": "{\"lock_version\":0,\"position\":2,\"publish\":true,\"ref_id\":\"a97bf46cbc2cd85e9789c76098a3ee1b\",\"title\":\"<emph render=\\\"italic\\\">The Fortunate Pilgrim</emph>\",\"display_string\":\"<emph render=\\\"italic\\\">The Fortunate Pilgrim</emph>\",\"restrictions_apply\":false,\"created_by\":\"admin\",\"last_modified_by\":\"admin\",\"create_time\":\"2018-06-26T20:28:33Z\",\"system_mtime\":\"2018-06-26T21:11:11Z\",\"user_mtime\":\"2018-06-26T20:28:33Z\",\"suppressed\":false,\"level\":\"series\",\"jsonmodel_type\":\"archival_object\",\"external_ids\":[],\"subjects\":[],\"linked_events\":[],\"extents\":[],\"dates\":[],\"external_documents\":[],\"rights_statements\":[],\"linked_agents\":[],\"onbase_documents\":[],\"ancestors\":[{\"ref\":\"/repositories/2/resources/1\",\"level\":\"collection\"}],\"instances\":[],\"notes\":[],\"uri\":\"/repositories/2/archival_objects/3\",\"repository\":{\"ref\":\"/repositories/2\",\"_resolved\":\"\"},\"resource\":{\"ref\":\"/repositories/2/resources/1\"},\"has_unpublished_ancestor\":false,\"resource_identifier_u_sstr\":\"MS-1371b\",\"resource_type_u_sstr\":null,\"resource_title\":\"Mario Puzo papers\"}",
"suppressed": false,
"publish": false,
"system_generated": false,
"repository": "/repositories/2",
"level_enum_s": [
"series",
"collection"
],
"resource": "/repositories/2/resources/1",
"ref_id": "a97bf46cbc2cd85e9789c76098a3ee1b",
"created_by": "admin",
"last_modified_by": "admin",
"user_mtime": "2018-06-26T20:28:33Z",
"system_mtime": "2018-06-26T21:11:11Z",
"create_time": "2018-06-26T20:28:33Z",
"notes": "",
"level": "series",
"ancestors": [
"/repositories/2/resources/1"
],
"total_restrictions_u_sstr": [
"false"
],
"resource_identifier_u_sstr": [
"MS-1371b"
],
"resource_title_u_sstr": [
"Mario Puzo papers"
],
"resource_identifier_w_title_u_sstr": [
"MS-1371b: Mario Puzo papers"
],
"jsonmodel_type": "archival_object"
}