[Archivesspace_Users_Group] ArchivesSpace/OAI/EAD exports in the wild.
Majewski, Steven Dennis (sdm7g)
sdm7g at virginia.edu
Mon Apr 15 13:03:33 EDT 2019
Forwarding the results of this test to the list, as they may be useful in deciding where to focus attention on fixing EAD export errors. The test harvested 2500+ resources from 5 sites using their ArchivesSpace OAI feeds (as part of setting up automatic harvesting of records for Virginia Heritage).
I’ve been using a modified version of the Python oai-harvest package. The standard version you get from ‘pip install’ stops harvesting when it encounters an error in the feed. I am in the process of trying to push patches upstream to both the oai-harvest and pyoai Python modules to fix this. (Ask me for more info if you don’t want to wait and would rather patch these yourself locally.)
—
After making some modifications to the OAI harvester to recover from XML parsing errors, I was able to harvest 2500+ EAD files from the UVA, VT, JMU, VCU & VMFA ArchivesSpace OAI feeds. Below are the sorts of errors encountered, each preceded by the number of occurrences of that error. I don’t have a count of how many resources produced errors, but there are frequently multiple errors in a file. (All of the TAG_NAME_MISMATCH errors are from the same file, and are likely produced by the same malformed element tag. I suspect all of the UNDECLARED_ENTITY errors are from the same file as well.)
The good news is that the vast majority of errors are produced by unescaped ampersands, and I have seen commits that appear to address this issue in the next release. (Currently, using the recover=True option on the XML parser in the OAI harvester has the effect of just dropping the “&” in the output EAD. Typically, in the other cases, the offending or undefined element or entity is dropped on output in order to produce valid XML.)
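To illustrate the ampersand problem, here is a minimal Python sketch using only the standard library. It is not what oai-harvest, pyoai, or ArchivesSpace does; the function name and the sample abstract text are made up for illustration. It escapes bare ampersands before parsing, so the character is preserved rather than dropped:

```python
import re
import xml.etree.ElementTree as ET

# Match an "&" that does not start something shaped like an entity or
# character reference (e.g. &amp; or &#233; or &#xE9;).
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp;, leaving existing references alone."""
    return _BARE_AMP.sub("&amp;", text)

# Hypothetical sample, not taken from any harvested record.
raw = "<abstract>Papers of Smith & Jones</abstract>"

try:
    ET.fromstring(raw)
    raw_parses = True
except ET.ParseError:
    raw_parses = False  # the bare "&" makes the XML malformed

fixed = escape_bare_ampersands(raw)
parsed_text = ET.fromstring(fixed).text  # the "&" survives in the parsed text
```

Note that this only handles bare ampersands. A reference to an undeclared entity such as &uuml; still looks like a valid reference to the regular expression, so it is left unchanged and would still fail to parse.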
And it would appear that all of the errors originate from mixed content note fields.
Perhaps this can be fixed by having ArchivesSpace attempt to parse mixed content fields as XML fragments before accepting them. Bad mixed content fields could then be fixed manually before the resource can be saved or updated.
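As a rough illustration of that idea (in Python, not ArchivesSpace's own code; the function name and the wrapper element are assumptions of this sketch), a fragment can be wrapped in a dummy root element and handed to a parser to see whether it is well formed:

```python
import xml.etree.ElementTree as ET

def mixed_content_error(fragment):
    """Return None if the fragment parses as XML, else the parser's message.

    The fragment is wrapped in a dummy root element so that plain text
    mixed with inline elements can be parsed as a single document.
    """
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError as exc:
        return str(exc)
    return None

# Hypothetical note values for illustration only.
ok = mixed_content_error("See <emph>letters</emph> &amp; diaries")
bad_amp = mixed_content_error("Papers of Smith & Jones")
bad_tag = mixed_content_error("An <span>unclosed element")
```

A check like this would also reject a fragment that uses a namespace prefix declared only on the enclosing document, so a real implementation would need to declare the expected prefixes on the wrapper element.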
This does not address EAD validation errors — just XML parser errors from malformed XML.
I haven’t done a massive validation test on this sample yet, but in previous tests, the majority of validation errors I’ve seen in ASpace EAD exports (when well formed) have been attribute name errors. Typically, container attributes are being translated through the locales and end up with embedded spaces. It would probably be better to export attributes using the internal names rather than the locale translations.
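A small Python sketch of the kind of check that would catch such names before export. The pattern is a simplified, ASCII-only approximation of an XML name, and the function name and sample strings are made up, not actual ArchivesSpace values:

```python
import re

# Simplified ASCII-only approximation of an XML name; the full XML grammar
# also allows many non-ASCII characters, and colons for namespace prefixes.
_SIMPLE_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_usable_attribute_name(name):
    """True if the string could serve as an XML name (no embedded spaces)."""
    return _SIMPLE_XML_NAME.fullmatch(name) is not None

# Made-up examples: an internal-style name and a translated label.
print(is_usable_attribute_name("box_number"))  # True
print(is_usable_attribute_name("Box Number"))  # False
```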
2 NAMESPACE:NS_ERR_UNDEFINED_NAMESPACE: Namespace prefix ns2 for actuate on extref is not defined
3 NAMESPACE:NS_ERR_UNDEFINED_NAMESPACE: Namespace prefix ns2 for href on extref is not defined
2 NAMESPACE:NS_ERR_UNDEFINED_NAMESPACE: Namespace prefix ns2 for show on extref is not defined
1 PARSER:ERR_DOCUMENT_END: Extra content at the end of the document
58 PARSER:ERR_ENTITYREF_SEMICOL_MISSING: EntityRef: expecting ';'
1 PARSER:ERR_GT_REQUIRED: Couldn't find end of Start Tag span line 20
1 PARSER:ERR_NAME_REQUIRED: error parsing attribute name
33 PARSER:ERR_NAME_REQUIRED: xmlParseEntityRef: no name
1 PARSER:ERR_SPACE_REQUIRED: attributes construct error
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: ListRecords line 1 and record
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: OAI-PMH line 1 and ListRecords
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: abstract line 20 and span
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: archdesc line 2 and did
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: did line 3 and abstract
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: ead line 2 and archdesc
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: metadata line 1 and ead
1 PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: record line 1 and metadata
1 PARSER:ERR_UNDECLARED_ENTITY: Entity 'aacute' not defined
1 PARSER:ERR_UNDECLARED_ENTITY: Entity 'ecircne' not defined
2 PARSER:ERR_UNDECLARED_ENTITY: Entity 'uuml' not defined