[Archivesspace_Users_Group] PUI question: external indexing of container list tree link text

Rees, John (NIH/NLM) [E] reesj at mail.nlm.nih.gov
Mon Jan 7 12:20:45 EST 2019


We have several ASpace users (and others) who provide EADs for harvesting outside their user interfaces, but that takes effort/capacity, a desire to collaborate, and supervision/management that not everyone can muster. I am hoping for an unsupervised option that would be a native ASpace default configuration. I had looked at the JSON-LD and tree-builder JavaScript hoping for some sort of hook, but I reckon the base functionality just isn’t in the stack.

Like you say, I’ve found lots of atomic resources in various PUI instances via Google, but not a whole document – Google-magic must be stronger than that of the commercial engine we use as a crawler. I doubt I could muster support for a workaround on our end.

Any idea how difficult it might be to add something like a title attribute to those URIs?

John


From: Custer, Mark <mark.custer at yale.edu>
Sent: Friday, January 04, 2019 3:59 PM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] PUI question: external indexing of container list tree link text

John,

Exactly what Steve said!

As an aside, one thing that I thought about (quite a while back now, I guess it was) was adding those mappings to the JSON-LD metadata that is currently only applied on Resource, Repository, and Agent pages within the PUI (if you view the source of a Resource landing page, for example, you’ll see a bit of JSON-LD at the very top). In other words, the Resource JSON-LD could have https://schema.org/hasPart statements which would include the URLs of the immediate archival object children. But that could result in a lot of extra data, since sometimes folks have very, very flat hierarchies (e.g. 10,000 child archival objects all attached to 1 resource record... yikes!). Because of that very real possibility, it might be a better mapping strategy not to use hasPart in ASpace, but instead to just add a single https://schema.org/isPartOf to each archival object page, but then I don’t think that would help your use case.
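To make the two mapping strategies concrete, here is a minimal sketch of what each JSON-LD shape might look like. It is not what the PUI emits today: the function names, the example URLs, and the choice of `CreativeWork` as the type are all assumptions made for illustration. Only `hasPart` and `isPartOf` come from the discussion above.

```python
import json


def resource_jsonld(resource_url, name, child_urls):
    """Resource-level record that lists every immediate child via hasPart.

    Hypothetical shape: the size of this record grows with the number of children.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "CreativeWork",  # assumed type, chosen only for illustration
        "@id": resource_url,
        "name": name,
        "hasPart": [{"@id": url} for url in child_urls],
    }


def archival_object_jsonld(object_url, name, resource_url):
    """Archival-object-level record carrying a single isPartOf statement."""
    return {
        "@context": "https://schema.org/",
        "@type": "CreativeWork",  # assumed type, chosen only for illustration
        "@id": object_url,
        "name": name,
        "isPartOf": {"@id": resource_url},
    }


# Hypothetical URLs, not taken from any real PUI instance.
base = "https://example.org/repositories/2"
children = [f"{base}/archival_objects/{n}" for n in (10, 11)]

parent_doc = resource_jsonld(f"{base}/resources/1", "Example papers", children)
child_doc = archival_object_jsonld(children[0], "Correspondence", f"{base}/resources/1")

print(json.dumps(parent_doc, indent=2))
print(json.dumps(child_doc, indent=2))
```

With a flat hierarchy of 10,000 children, the first record would carry 10,000 `hasPart` entries, while each archival object page in the second approach carries exactly one `isPartOf`.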

We haven’t done this just yet, but I was planning to add a readme file in GitHub that just lists out all of the EAD files for this sort of aggregation purpose. E.g. https://github.com/YaleArchivesSpace/Archives-at-Yale-EAD3/blob/master/med-ead/README.md We’re not using the OAI endpoint right now, though, since we post-process and validate our EAD files after export. For folks who are using it, that would be a very good way to get the data harvested.

As for search engines indexing at a deep level in the PUI, I suspect that could be an issue. It would probably be best if the PUI had sitemaps out of the box.  That said, we have a LOT of archival objects being indexed by Google in our instance of the PUI, but I don’t think the crawler is getting the entire finding aids (as long as they get all of the Resources, I’m happy for now).  I’d expect that a sitemap would be best for ensuring that, but I’ve honestly no clue in this day and age of the Web!
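For anyone wondering what a sitemap would involve, here is a minimal sketch that writes one in the sitemaps.org XML format from a list of page URLs. How the URL list would be gathered from ASpace is left out entirely; `build_sitemap` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical PUI URLs for one resource and one archival object.
pages = [
    "https://example.org/repositories/2/resources/1",
    "https://example.org/repositories/2/archival_objects/10",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The protocol caps a single sitemap file at 50,000 URLs, so an instance with many archival objects would need several files plus a sitemap index.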

Mark



From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Majewski, Steven Dennis (sdm7g)
Sent: Friday, 04 January, 2019 11:04 AM
To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] PUI question: external indexing of container list tree link text


I would suggest crawling the OAI endpoint for indexing, but linking to the PUI record.
oai_ead metadata is just EAD in an OAI wrapper, and has the complete resource tree.
The problem with that is that not everyone may have configured OAI or made it public.
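A rough sketch of that approach, assuming the repository exposes a public OAI-PMH endpoint: build a ListRecords request for the oai_ead prefix, then pull the unittitle text out of the response for indexing. The endpoint URL and the sample response below are made up, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for the oai_ead prefix.

    Per OAI-PMH, a resumptionToken is an exclusive argument, so
    metadataPrefix is dropped on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_ead"}
    return base_url + "?" + urlencode(params)


def unit_titles(oai_xml):
    """Collect the text of every unittitle element, ignoring XML namespaces."""
    root = ET.fromstring(oai_xml)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "unittitle"
    ]


# Made-up, heavily trimmed response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <ead xmlns="urn:isbn:1-931666-22-9"><archdesc>
      <did><unittitle>Example papers</unittitle></did>
      <dsc><c><did><unittitle>Correspondence</unittitle></did></c></dsc>
    </archdesc></ead>
  </metadata></record></ListRecords>
</OAI-PMH>"""

request_url = list_records_url("https://example.org/oai")  # hypothetical endpoint
titles = unit_titles(SAMPLE)
print(request_url)
print(titles)
```

The indexed text would then be stored against the PUI URL of the resource rather than the OAI URL, which is the "crawl one, link to the other" idea.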

But yes: that’s a problem with progressive web apps: all of the data you want indexed isn’t in the page.
I wonder if there is a way through the Google webmaster console or sitemaps to configure this sort of action, i.e. use this other URL to index this resource.

— Steve Majewski



On Jan 4, 2019, at 9:59 AM, Rees, John (NIH/NLM) [E] <reesj at mail.nlm.nih.gov> wrote:

Hi all,

I administer a finding aids aggregation service that in part scrapes HTML source code as a data input, and I am looking for some advice and hoping to start a conversation.

Several of our contributing repositories with this data type moved to ArchivesSpace in 2018, and we are not able to crawl ASpace’s collection_organization#tree source, which seems to be the only organized view of container list data. As many of you probably know, these are coded as URIs in the HTML source and are only rendered as visible text at runtime via JavaScript and CSS in the browser.

Our crawler cannot translate these HTML-source URIs into text that it can index. The only workaround we’ve been able to find is indexing the PDF view, but not everyone implements this feature. Additionally, our crawler sometimes times out on large PDFs as it can take ASpace a while to generate them at runtime.
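To illustrate the symptom, here is a small sketch of what a non-rendering crawler sees: anchors whose href is present in the source but whose link text is empty until a script fills it in. The markup below is invented for the example and is not copied from the PUI; `LinkCollector` and `links_without_text` are hypothetical helpers.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record each <a href> together with whatever text sits inside it."""

    def __init__(self):
        super().__init__()
        self.links = []    # list of [href, text] pairs
        self._open = None  # the pair for the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, ""]
                self.links.append(self._open)

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None


def links_without_text(html):
    """Return hrefs whose anchor carries no visible text in the raw source."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for href, text in parser.links if not text.strip()]


# Invented markup: one empty tree link and one ordinary link.
source = (
    '<a href="/repositories/2/archival_objects/10"></a>'
    '<a href="/about">About</a>'
)
empty_links = links_without_text(source)
print(empty_links)
```

A crawler in this position could still follow each such href and index the target page on its own, but it has no link text from the tree to attach to it.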

I’m also wondering if PUI implementers have noticed any issues with other search engines having difficulty indexing their PUI content at a full-document level?

I searched the Jira backlog and PUI Enhancements wikispace and did not find anything specifically addressing this use case.

Thanks,
John


John P. Rees
Archivist and Digital Resources Manager
History of Medicine Division
National Library of Medicine
301-827-4510


_______________________________________________
Archivesspace_Users_Group mailing list
Archivesspace_Users_Group at lyralists.lyrasis.org
http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
