[Archivesspace_Users_Group] Indexing repository details in all records skews results set

Joshua D. Shaw Joshua.D.Shaw at dartmouth.edu
Wed Jun 27 10:31:12 EDT 2018


I did a little more digging, and to answer my own question: the "fullrecord" field holds (almost) everything in the Solr doc. I think the step that builds this field, specifically the "extract_string_values" method in IndexerCommon, is a bit greedy and should probably skip the repository in addition to the update times, etc. I'm testing that locally.
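
To illustrate the idea, here is a minimal sketch of a recursive string extractor that skips a list of keys, including "repository". It is not the actual IndexerCommon code: the names collect_strings and SKIP_KEYS, and the contents of the skip list, are assumptions made up for this example.

```ruby
# Hypothetical sketch, NOT the actual ArchivesSpace IndexerCommon code.
# Keys whose values should stay out of the catch-all text field.
# The contents of this list are an assumption for illustration only.
SKIP_KEYS = %w[repository created_by last_modified_by create_time
               system_mtime user_mtime json].freeze

# Recursively collect every String value in a Hash/Array structure,
# ignoring any Hash key listed in skip_keys (and everything beneath it).
def collect_strings(value, skip_keys = SKIP_KEYS)
  case value
  when String
    [value]
  when Array
    value.flat_map { |v| collect_strings(v, skip_keys) }
  when Hash
    value.reject { |k, _| skip_keys.include?(k.to_s) }
         .flat_map { |_, v| collect_strings(v, skip_keys) }
  else
    [] # numbers, booleans and nil contribute no text
  end
end

# Made-up document shaped loosely like the top container example.
doc = {
  "title" => "MS-1371b, Box 53",
  "repository" => {
    "ref" => "/repositories/2",
    "_resolved" => { "name" => "Rauner Special Collections Library" }
  },
  "collection" => [{ "display_string" => "Mario Puzo papers" }],
  "system_mtime" => "2018-06-26T21:11:11Z"
}

fullrecord = collect_strings(doc).join(" ")
puts fullrecord
```

With the repository key in the skip list, the repository name never reaches the combined text, so a search for "rauner" would no longer match on it.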


My own issue was also complicated by some custom indexer work that initially added the resource as a fully resolved attribute to the AO docs (I'm doing it differently now), which doubled the fullrecord issue and added its own headaches for search relevancy.


Joshua


________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of Joshua D. Shaw <Joshua.D.Shaw at dartmouth.edu>
Sent: Tuesday, June 26, 2018 5:25 PM
To: Archivesspace Users Group
Subject: [Archivesspace_Users_Group] Indexing repository details in all records skews results set


Hi All-


I think this has been the behavior of AS from the beginning, but during some recent testing, I finally realized that AS is indexing the repository details with every record in the repository. Since part of our address is "6065 Webster Hall" and we have a *lot* of Daniel Webster related material (he's a Dartmouth alum), searching for "webster" is a bad thing, since it returns every record in the repo. In a vanilla install, you can see the repository details in the json package in the results (result['json']), so that sort of made sense....


I've done some cooking of the indexer to remove the resolved repository details (result['json']['repository']['_resolved']) and to fiddle with some other things, but even though the json representation of the search results contains no instance of the search string, I *still* get results based on the repository details.
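
For reference, the kind of change described above can be sketched roughly as follows. This is a standalone illustration, not the actual indexer customization: strip_resolved_repository is a hypothetical helper, and the record is made-up sample data. It only changes the stored JSON, so on its own it does not affect any other field that was built from the record.

```ruby
require "json"

# Hypothetical helper: return a copy of a record hash in which the
# resolved repository details are replaced by an empty string.
def strip_resolved_repository(record)
  copy = JSON.parse(JSON.generate(record)) # cheap deep copy via JSON round trip
  repo = copy["repository"]
  repo["_resolved"] = "" if repo.is_a?(Hash) && repo.key?("_resolved")
  copy
end

# Made-up record with a resolved repository, for illustration only.
record = {
  "uri" => "/repositories/2/top_containers/53",
  "repository" => {
    "ref" => "/repositories/2",
    "_resolved" => {
      "repo_code" => "rauner",
      "name" => "Rauner Special Collections Library"
    }
  }
}

cleaned = strip_resolved_repository(record)
puts JSON.generate(cleaned)
```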


Example:


Repository Name is "rauner" and the long name is "Rauner Special Collections Library"

Search: "rauner"

Example results in json for a top container and an archival object below. Note that these *do not* contain the string "rauner"


I must be missing something in how the indexer is actually storing and searching data. I'd love to know if someone has a method to remove the repository details (and anything else global) from the results to prevent this sort of thing and to cut down on erroneous results.


Thanks!

Joshua



TC:

{
        "id": "/repositories/2/top_containers/53",
        "uri": "/repositories/2/top_containers/53",
        "title": "MS-1371b, Box 53",
        "primary_type": "top_container",
        "types": [
          "top_container"
        ],
        "json": "{\"lock_version\":38,\"indicator\":\"53\",\"created_by\":\"admin\",\"last_modified_by\":\"admin\",\"create_time\":\"2018-06-26T20:28:33Z\",\"system_mtime\":\"2018-06-26T21:11:11Z\",\"user_mtime\":\"2018-06-26T20:28:33Z\",\"type\":\"box\",\"jsonmodel_type\":\"top_container\",\"active_restrictions\":[],\"container_locations\":[],\"series\":[],\"collection\":[{\"ref\":\"/repositories/2/resources/1\",\"identifier\":\"MS-1371b\",\"display_string\":\"Mario Puzo papers\"}],\"uri\":\"/repositories/2/top_containers/53\",\"repository\":{\"ref\":\"/repositories/2\",\"_resolved\":\"\"},\"restricted\":false,\"is_linked_to_published_record\":false,\"display_string\":\"Box 53\",\"long_display_string\":\"MS-1371b, Box 53\"}",
        "suppressed": false,
        "publish": false,
        "system_generated": false,
        "repository": "/repositories/2",
        "type_enum_s": [
          "box"
        ],
        "created_by": "admin",
        "last_modified_by": "admin",
        "user_mtime": "2018-06-26T20:28:33Z",
        "system_mtime": "2018-06-26T21:11:11Z",
        "create_time": "2018-06-26T20:28:33Z",
        "display_string": "Box 53",
        "collection_uri_u_sstr": [
          "/repositories/2/resources/1"
        ],
        "collection_display_string_u_sstr": [
          "Mario Puzo papers"
        ],
        "collection_identifier_stored_u_sstr": [
          "MS-1371b"
        ],
        "collection_identifier_u_stext": [
          "MS-1371b",
          "MS 1371b",
          "MS1371b",
          "MS- 1371 b"
        ],
        "exported_u_sbool": [
          false
        ],
        "empty_u_sbool": [
          false
        ],
        "indicator_u_stext": [
          "53"
        ],
        "jsonmodel_type": "top_container"
      }

AO:


{
        "id": "/repositories/2/archival_objects/3",
        "uri": "/repositories/2/archival_objects/3",
        "title": "<emph render=\"italic\">The Fortunate Pilgrim</emph>",
        "primary_type": "archival_object",
        "types": [
          "archival_object"
        ],
        "json": "{\"lock_version\":0,\"position\":2,\"publish\":true,\"ref_id\":\"a97bf46cbc2cd85e9789c76098a3ee1b\",\"title\":\"<emph render=\\\"italic\\\">The Fortunate Pilgrim</emph>\",\"display_string\":\"<emph render=\\\"italic\\\">The Fortunate Pilgrim</emph>\",\"restrictions_apply\":false,\"created_by\":\"admin\",\"last_modified_by\":\"admin\",\"create_time\":\"2018-06-26T20:28:33Z\",\"system_mtime\":\"2018-06-26T21:11:11Z\",\"user_mtime\":\"2018-06-26T20:28:33Z\",\"suppressed\":false,\"level\":\"series\",\"jsonmodel_type\":\"archival_object\",\"external_ids\":[],\"subjects\":[],\"linked_events\":[],\"extents\":[],\"dates\":[],\"external_documents\":[],\"rights_statements\":[],\"linked_agents\":[],\"onbase_documents\":[],\"ancestors\":[{\"ref\":\"/repositories/2/resources/1\",\"level\":\"collection\"}],\"instances\":[],\"notes\":[],\"uri\":\"/repositories/2/archival_objects/3\",\"repository\":{\"ref\":\"/repositories/2\",\"_resolved\":\"\"},\"resource\":{\"ref\":\"/repositories/2/resources/1\"},\"has_unpublished_ancestor\":false,\"resource_identifier_u_sstr\":\"MS-1371b\",\"resource_type_u_sstr\":null,\"resource_title\":\"Mario Puzo papers\"}",
        "suppressed": false,
        "publish": false,
        "system_generated": false,
        "repository": "/repositories/2",
        "level_enum_s": [
          "series",
          "collection"
        ],
        "resource": "/repositories/2/resources/1",
        "ref_id": "a97bf46cbc2cd85e9789c76098a3ee1b",
        "created_by": "admin",
        "last_modified_by": "admin",
        "user_mtime": "2018-06-26T20:28:33Z",
        "system_mtime": "2018-06-26T21:11:11Z",
        "create_time": "2018-06-26T20:28:33Z",
        "notes": "",
        "level": "series",
        "ancestors": [
          "/repositories/2/resources/1"
        ],
        "total_restrictions_u_sstr": [
          "false"
        ],
        "resource_identifier_u_sstr": [
          "MS-1371b"
        ],
        "resource_title_u_sstr": [
          "Mario Puzo papers"
        ],
        "resource_identifier_w_title_u_sstr": [
          "MS-1371b: Mario Puzo papers"
        ],
        "jsonmodel_type": "archival_object"
      }

