[Archivesspace_Users_Group] PUI question: external indexing of container list tree link text

Custer, Mark mark.custer at yale.edu
Fri Jan 4 15:58:41 EST 2019


Exactly what Steve said!

As an aside, one thing I thought about (quite a while back now, I guess) was adding those mappings to the JSON-LD metadata that is currently only applied to Resource, Repository, and Agent pages within the PUI (if you view the source of a Resource landing page, for example, you’ll see a bit of JSON-LD at the very top). In other words, the Resource JSON-LD could have https://schema.org/hasPart statements that include the URLs of the immediate child archival objects. But that could result in a lot of extra data, since sometimes folks have very, very flat hierarchies (e.g. 10,000 child archival objects all attached to one resource record... yikes!). Because of that very real possibility, it might be a better mapping strategy not to use hasPart in ASpace, and instead to add a single https://schema.org/isPartOf to each archival object page, but I don’t think that would help your use case.
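
To make the two mapping options concrete, here is a minimal Python sketch of what the JSON-LD could look like in each case. It is not what the PUI emits today: the "ArchiveComponent" type, the example.edu URLs, and both function names are assumptions for illustration only.

```python
import json


def resource_jsonld(resource_url, name, child_urls):
    """Resource-level JSON-LD that lists immediate children via hasPart.

    The list grows with the number of children, which is the concern
    for very flat hierarchies.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "ArchiveComponent",  # assumed type, for illustration
        "@id": resource_url,
        "name": name,
        "hasPart": [{"@id": url} for url in child_urls],
    }


def archival_object_jsonld(object_url, name, parent_url):
    """Archival-object JSON-LD carrying a single isPartOf statement."""
    return {
        "@context": "https://schema.org/",
        "@type": "ArchiveComponent",  # assumed type, for illustration
        "@id": object_url,
        "name": name,
        "isPartOf": {"@id": parent_url},
    }


# Hypothetical PUI URLs.
base = "https://archives.example.edu"
resource = f"{base}/repositories/2/resources/1"
children = [f"{base}/repositories/2/archival_objects/{n}" for n in range(1, 4)]

print(json.dumps(resource_jsonld(resource, "Example papers", children), indent=2))
print(json.dumps(archival_object_jsonld(children[0], "Series I", resource), indent=2))
```

With hasPart, the Resource page carries one entry per child; with isPartOf, each archival object page carries exactly one statement regardless of how flat the hierarchy is.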

We haven’t done this just yet, but I was planning to add a readme file in GitHub that lists all of the EAD files for this sort of aggregation purpose. E.g. https://github.com/YaleArchivesSpace/Archives-at-Yale-EAD3/blob/master/med-ead/README.md   We’re not using the OAI endpoint right now, though, since we post-process and validate our EAD files after export. For folks who are using it, that would be a very good way to get the data harvested.

As for search engines indexing at a deep level in the PUI, I suspect that could be an issue. It would probably be best if the PUI had sitemaps out of the box. That said, we have a LOT of archival objects being indexed by Google in our instance of the PUI, but I don’t think the crawler is getting entire finding aids (as long as it gets all of the Resources, I’m happy for now). I’d expect that a sitemap would be best for ensuring that, but I’ve honestly no clue in this day and age of the Web!
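
For anyone who wants to experiment before the PUI offers this, a sitemap is just a small XML file of URLs. The Python sketch below builds one with the standard library; the example.edu URLs and the build_sitemap name are hypothetical, and how you obtain the list of published record URLs from your own instance is left open.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical PUI URLs for one resource and two of its archival objects.
pui_urls = [
    "https://archives.example.edu/repositories/2/resources/1",
    "https://archives.example.edu/repositories/2/archival_objects/10",
    "https://archives.example.edu/repositories/2/archival_objects/11",
]

print(build_sitemap(pui_urls))
```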


From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Majewski, Steven Dennis (sdm7g)
Sent: Friday, 04 January, 2019 11:04 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] PUI question: external indexing of container list tree link text

I would suggest crawling the OAI endpoint for indexing, but linking to the PUI record.
oai_ead metadata is just EAD in an OAI wrapper, and it has the complete resource tree.
The problem with that is that not everyone may have configured OAI or made it public.
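
A rough Python sketch of that approach follows: build standard OAI-PMH ListRecords requests with the oai_ead prefix, pull record identifiers and the resumption token out of a response, and derive a PUI link for each record. The base URLs, the sample response, the function names, and the identifier layout assumed in pui_url (everything after "//" being the record path) are all assumptions to check against a real endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request; follow-up pages use only the token."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_ead"}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


def pui_url(identifier, pui_base):
    """Map an OAI identifier to a PUI link (assumed identifier layout)."""
    return f"{pui_base}/{identifier.split('//', 1)[1]}"


# Made-up response fragment standing in for a real harvest.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example//repositories/2/resources/1</identifier></header>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE)
print(list_records_url("https://archives.example.edu/oai"))
print(list_records_url("https://archives.example.edu/oai", token))
print([pui_url(i, "https://archives.example.edu") for i in ids])
```

The aggregator would index the EAD inside each record's metadata element and store the derived PUI link as the display URL.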

But yes: that’s a problem with progressive web apps: all of the data you want indexed isn’t in the page.
I wonder if there is a way through the Google webmaster console or sitemaps to configure this sort of action, i.e. “use this other URL to index this resource.”

— Steve Majewski

On Jan 4, 2019, at 9:59 AM, Rees, John (NIH/NLM) [E] <reesj at mail.nlm.nih.gov> wrote:

Hi all,

I administer a finding aids aggregation service that in part scrapes HTML source code as a data input, and I am looking for some advice and hoping to start a conversation.

Several of our contributing repositories with this data type moved to ArchivesSpace in 2018, and we are not able to crawl ASpace’s collection_organization#tree source, which seems to be the only organized view of container list data. As many of you probably know, these are coded as URIs in the HTML source and are only rendered as visible text at runtime via JavaScript and CSS in the browser.

Our crawler cannot translate these HTML-source URIs into text that it can index. The only workaround we’ve been able to find is indexing the PDF view, but not everyone implements this feature. Additionally, our crawler sometimes times out on large PDFs as it can take ASpace a while to generate them at runtime.

I’m also wondering if PUI implementers have noticed any issues with other search engines having difficulty indexing their PUI content at a full-document level?

I searched the Jira backlog and PUI Enhancements wikispace and did not find anything specifically addressing this use case.


John P. Rees
Archivist and Digital Resources Manager
History of Medicine Division
National Library of Medicine

