[Archivesspace_Users_Group] PUI question: external indexing of container list tree link text

Fri Jan 4 09:59:23 EST 2019

Hi all,

I administer a finding aids aggregation service that relies in part on scraping HTML source as a data input, and I am looking for some advice and hoping to start a conversation.

Several of our contributing repositories with this data type moved to ArchivesSpace in 2018, and we are not able to crawl ASpace's collection_organization#tree source, which seems to be the only organized view of container list data. As many of you probably know, the tree entries are coded as URIs in the HTML source and are only rendered as visible text at runtime via JavaScript and CSS in the browser.

Our crawler cannot translate these HTML-source URIs into text that it can index. The only workaround we've been able to find is indexing the PDF view, but not everyone implements this feature. Additionally, our crawler sometimes times out on large PDFs as it can take ASpace a while to generate them at runtime.
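
To illustrate what the crawler runs into, here is a minimal sketch using only the Python standard library. The sample markup, the URIs in it, and the names LinkTextCollector and links_without_text are all made up for illustration; this is not the actual PUI source and not how our crawler is implemented. It only shows the general problem: a link whose label is filled in later by JavaScript gives a text indexer nothing to index.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) tuples
        self._current = None  # [href, text] for the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = [href, ""]

    def handle_data(self, data):
        # Accumulate text only while inside an <a> element.
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href, text = self._current
            self.links.append((href, text.strip()))
            self._current = None


def links_without_text(html):
    """Return hrefs of links that carry no text in the raw HTML source."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if not text]


# Hypothetical markup: the first link has a URI but no label in the source.
SAMPLE = """
<ul>
  <li><a href="/repositories/2/archival_objects/101"></a></li>
  <li><a href="/repositories/2/archival_objects/102">Series 2: Correspondence</a></li>
</ul>
"""

if __name__ == "__main__":
    for href in links_without_text(SAMPLE):
        print("no indexable text for", href)
```

A check like this could at least flag which pages a source-only crawler will miss, though it does nothing to recover the missing text.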

I'm also wondering whether PUI implementers have noticed other search engines having difficulty indexing their PUI content at a full-document level.

I searched the Jira backlog and PUI Enhancements wikispace and did not find anything specifically addressing this use case.

Thanks,
John

John P. Rees
Archivist and Digital Resources Manager
History of Medicine Division
National Library of Medicine
301-827-4510
