[Archivesspace_Users_Group] EAD Import Issue...
Steven Majewski
sdm7g at virginia.edu
Fri Aug 7 17:39:02 EDT 2015
Yes, but my first point was that ASpace is NOT choking on those smart quotes (at least, not on my test server).
There are a number of U+2019 RIGHT SINGLE QUOTATION MARK characters in the file, and for me, the original file is importing without any problems.
The original file is also UTF-8 encoded, so the error message was sending us down the wrong track initially.
Error: #<Encoding::UndefinedConversionError: "\x9D" from Windows-1252 to UTF-8>
There’s no x9D in U+2019: its UTF-8 encoding is the three bytes E2 80 99, and its Windows-1252 encoding is the single byte x92.
Maybe the error message indicates that something else is misconfigured, so that the importer is reading a valid UTF-8
file but treating it as Windows-1252, hitting a valid UTF-8 sequence that contains an x9D
(which is undefined in Windows-1252), and giving a misleading error message.
There is an x9D in the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK (E2 80 9D; see http://www.fileformat.info/info/unicode/char/201d/index.htm),
which does also occur in that file.
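
A minimal Python sketch of the byte-level argument above (not part of the original thread). It assumes Python's "cp1252" codec as a stand-in for the Windows-1252 table the importer uses; the helper name misread_as_cp1252 is made up for illustration.

```python
# Byte-level view of the two smart quotes discussed above.
RIGHT_SINGLE = "\u2019"  # RIGHT SINGLE QUOTATION MARK
RIGHT_DOUBLE = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

single_utf8 = RIGHT_SINGLE.encode("utf-8")     # three bytes: E2 80 99
double_utf8 = RIGHT_DOUBLE.encode("utf-8")     # three bytes: E2 80 9D
single_cp1252 = RIGHT_SINGLE.encode("cp1252")  # one byte: 92


def misread_as_cp1252(data: bytes) -> str:
    """Decode bytes as Windows-1252; return the text or an error summary."""
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError as err:
        return "error at byte 0x%02X" % data[err.start]


print(single_utf8.hex(), double_utf8.hex(), single_cp1252.hex())
print(repr(misread_as_cp1252(single_utf8)))  # garbled text, but no error
print(misread_as_cp1252(double_utf8))        # fails on the 0x9D byte
```

If this matches what the importer is doing, the single quotes would come through garbled but without an exception, while the right double quote would raise one.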
— Steve.
> On Aug 6, 2015, at 1:32 PM, Custer, Mark <mark.custer at yale.edu> wrote:
>
> Right, there are no XML encoding errors in the file, so xmllint is indeed correct. After I downloaded the original file, though, I got the same error that Dominic received when trying to import the file. After replacing the smart quotes as Christy suggested, the file imported without issue.
>
> Ideally, I don’t think that the ASpace importer should choke on smart quotes, but you can certainly test for any potentially offending characters via Schematron (or other ways) prior to importing. After learning of this issue, I added the following test to a Schematron file to do just that:
>
>
> <pattern>
>   <rule context="text()">
>     <report test="matches(., '’|“|”')">
>       Smart quote detected. These characters need to be replaced before importing your files
>       into ArchivesSpace.
>     </report>
>   </rule>
> </pattern>
>
> Not an ideal way to do things, but at least it warns me now if any of my EAD files have one of those three characters prior to importing the file.
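
For anyone without a Schematron setup, roughly the same check can be sketched in Python. This is an illustration only, not part of the original message: the function name find_smart_quotes and the sample string are made up, and unlike the rule above (which only looks at text nodes) it scans the whole string, markup included.

```python
import re

# The same three characters as the Schematron test above:
# U+2019, U+201C and U+201D.
SMART_QUOTES = re.compile("[\u2019\u201c\u201d]")


def find_smart_quotes(text: str):
    """Return (line_number, column, character) for each smart quote found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SMART_QUOTES.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


sample = "<abstract>St. Vincent\u2019s steady growth</abstract>\n<p>plain 'quotes'</p>"
for line_no, col, char in find_smart_quotes(sample):
    print("line %d, column %d: U+%04X" % (line_no, col, ord(char)))
```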
>
>
>
> From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Steven Majewski
> Sent: Friday, July 31, 2015 2:04 PM
> To: Archivesspace Users Group
> Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...
>
>
> When I download that original email enclosure and run xmllint on it, it doesn’t show any encoding errors.
> I was also able to import it into ArchivesSpace without any errors.
>
> I wonder if the translation to and from Base64 encoding of the email enclosure somehow
> transforms the character encoding and fixes the problem ?
>
>
> Re: testing with Schematron:
>
> In my experience (with doing validation in Java and hitting those kinds of encoding errors)
> encoding errors come from early in the processing pipeline before Schematron or XSLT processing.
> I think you would need to scan the file for invalid encoding before passing it to the XML parser.
> ( In fact, I’m not even sure if you can express an invalid encoding in Schematron if it’s XML in
> a particular encoding. )
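
A pre-parse scan of the kind suggested above could look something like this in Python. It is a sketch under the assumption that the file is supposed to be UTF-8; the function name first_invalid_utf8 is made up for illustration.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


good = "St. Vincent\u2019s".encode("utf-8")
bad = "St. Vincent\u2019s".encode("cp1252")  # a lone 0x92 is not valid UTF-8

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

For a real file, the bytes would come from reading it in binary mode, for example open(path, "rb").read(), where path is whatever EAD file is about to be imported.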
>
>
> — Steve.
>
>
>
> On Jul 31, 2015, at 1:12 PM, Custer, Mark <mark.custer at yale.edu> wrote:
>
> Interesting. I just tried to change the encoding value, but that doesn’t work. If you do a find and replace in the file to replace the single quotes, though, the record will import fine. I’ve attached a copy of the record that I was able to import.
>
> For the record, using that type of single quote doesn’t invalidate the EAD file. It’s still perfectly valid, but I don’t know if it’s fully UTF-8 compliant.
>
> Is there any way to come up with a list of invalid characters? If so, then that could be added to a Schematron file to test and make sure those values aren’t present before attempting to do the batch upload.
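
One way to approach such a list, sketched in Python and resting on an assumption that is only raised later in this thread: that the importer is misreading valid UTF-8 as Windows-1252. Under that assumption the risky characters are the ones whose UTF-8 bytes include a value Windows-1252 leaves undefined. The sketch uses Python's "cp1252" codec, which may not match the importer's table exactly, and both function names are made up for illustration.

```python
def cp1252_undefined_bytes():
    """Byte values that Python's cp1252 codec cannot decode."""
    undefined = set()
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.add(value)
    return undefined


def risky_characters(text: str):
    """Characters whose UTF-8 encoding contains a cp1252-undefined byte."""
    undefined = cp1252_undefined_bytes()
    return sorted({ch for ch in text if undefined & set(ch.encode("utf-8"))})


print(sorted("0x%02X" % b for b in cp1252_undefined_bytes()))
print(risky_characters("\u2018single\u2019 \u201cdouble\u201d"))
```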
>
> Mark
>
>
>
> From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Steven Majewski
> Sent: Friday, July 31, 2015 12:31 PM
> To: Archivesspace Users Group
> Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...
>
>
> You might also try changing the encoding of the EAD file in the XML header.
> If it’s not declared, by default it’s UTF-8.
> Change the first line to:
>
> <?xml version="1.0" encoding="windows-1252"?>
>
>
> ( I don’t know for a fact if this will work for ArchivesSpace, but it works with most parsers and validators. )
>
>
> Alternatively, if you have ‘iconv’ you can run the file through that program to change the encoding (the file names here are placeholders):
>
> iconv -f WINDOWS-1252 -t UTF-8 input.xml > output.xml
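
Where iconv is not available, the same conversion can be sketched in Python. This is an illustration, not part of the original message; the function name cp1252_to_utf8 is made up, and the conversion only makes sense if the input really is Windows-1252.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8 (raises if a byte is undefined)."""
    return data.decode("cp1252").encode("utf-8")


converted = cp1252_to_utf8(b"St. Vincent\x92s")
print(converted)
```

If the XML declaration names windows-1252 as the encoding, it would need to be updated to match after converting.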
>
>
>
>
> — Steve Majewski
>
>
>
> On Jul 31, 2015, at 12:17 PM, Tomecek, Christy <christy.tomecek at yale.edu> wrote:
>
> Hello,
>
> I think the issue is that there are Word “Smart Quotes” in your text fields (not the markup itself). The EAD won’t validate if they are present.
>
> Example (Smart quote highlighted):
>
> <abstract label="Abstract">Dating from 1918 to 2000, the History and Background Information series consists of written histories, newspaper clippings, and anniversary publications documenting St. Vincent’s steady growth in the Lincoln Park neighborhood.
>
> There is a way to turn off Smart Quotes in Word, so that you don’t have to fix them line by line when you copy and paste from a Word document into ASpace.
>
> · Open Word. Go to File (or if you are in Windows 8/8.1, go to the Windows logo button).
> · Scroll to the bottom of the sidebar where things like "New," "Save," etc. are and click on "Options" at the bottom.
> · Go to "Proofing," located on the sidebar.
> · Go to "AutoCorrect Options" in the main panel.
> · Go to the "AutoFormat As You Type" tab and uncheck the "'Straight quotes' with 'smart quotes'" options under "Replace When You Type."
>
> Best,
> Christy
>
> --
>
> Christy Tomecek
> Archives Assistant
> Yale University Library
> Manuscripts and Archives
> 203-432-7382
> christy.tomecek at yale.edu
>
>
> From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Rossetti, Dominic
> Sent: Friday, July 31, 2015 11:58 AM
> To: archivesspace_users_group at lyralists.lyrasis.org
> Subject: [Archivesspace_Users_Group] EAD Import Issue...
>
> Hey all,
>
> When trying to import EAD I get the following error message in the log file:
>
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> IMPORT ERROR
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>
>
> Error: #<Encoding::UndefinedConversionError: "\x9D" from Windows-1252 to UTF-8>
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>
> I’ve attached a file as an example. The EAD is valid and correct. Not sure what is causing the issue.
> _______________________________________________
> Archivesspace_Users_Group mailing list
> Archivesspace_Users_Group at lyralists.lyrasis.org
> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>
> <dpu_ead_cm0001_stvincentchurchchi-edited.xml>
>