[Archivesspace_Users_Group] EAD Import Issue...

Custer, Mark mark.custer at yale.edu
Sat Aug 8 19:19:53 EDT 2015


I got the same error message when I tried to upload the original file to http://test.archivesspace.org/, 1.3.2-dev11 (and it failed in the first case, which is why it was posted to the listserv).  It's good to know that you were able to import it on your server, but that just makes the whole thing curiouser!  I should try to import an EAD or MARC file with right single quotation marks into our test ASpace server, but I haven't done so yet.  It would be nice to get to the bottom of this soon, but I'm afraid that doing so is beyond me.  Hopefully others can weigh in on the issue, and hopefully Dominic has since been able to import the EAD file into ASpace.


________________________________
From: archivesspace_users_group-bounces at lyralists.lyrasis.org [archivesspace_users_group-bounces at lyralists.lyrasis.org] on behalf of Steven Majewski [sdm7g at virginia.edu]
Sent: Friday, August 07, 2015 5:39 PM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...


Yes, but my first point was that ASpace is NOT choking on those smart quotes (at least, not on my test server).

There are a number of U+2019 RIGHT SINGLE QUOTATION MARK characters in the file, and for me, the original file is importing without any problems.

The original file is also UTF-8 encoded, so the error message was sending us down the wrong track initially.


  Error: #<Encoding::UndefinedConversionError: "\x9D" from Windows-1252 to UTF-8>

There’s no x9D in U+2019: its UTF-8 encoding is E2 80 99, and its Windows-1252 encoding would actually be x92.


Maybe the error message indicates that something else is misconfigured, so that it’s reading a valid UTF-8
file but thinking it’s in Windows-1252 encoding, hitting a valid UTF-8 sequence that contains x9D
(which is undefined in Windows-1252), and giving a misleading error message.

There is a x9D in RIGHT DOUBLE QUOTATION MARK http://www.fileformat.info/info/unicode/char/201d/index.htm
which does also occur in that file.
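
As an illustration of the byte values discussed above, here is a minimal Python sketch. Python is used only for illustration; nothing here is taken from the ArchivesSpace importer, and the assumption is that Python's cp1252 codec, which treats 0x9D as undefined, behaves like the converter that produced the error. It prints the encodings of both characters and then decodes their valid UTF-8 bytes as if they were Windows-1252.

```python
# Byte sequences for the two quote characters discussed above.
RIGHT_SINGLE = "\u2019"  # RIGHT SINGLE QUOTATION MARK
RIGHT_DOUBLE = "\u201d"  # RIGHT DOUBLE QUOTATION MARK


def hex_bytes(text, encoding):
    """Return the encoded bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def decodes_as_cp1252(data):
    """True if every byte is defined in Python's Windows-1252 codec."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


print(hex_bytes(RIGHT_SINGLE, "utf-8"))   # E2 80 99
print(hex_bytes(RIGHT_SINGLE, "cp1252"))  # 92
print(hex_bytes(RIGHT_DOUBLE, "utf-8"))   # E2 80 9D
print(hex_bytes(RIGHT_DOUBLE, "cp1252"))  # 94

# Valid UTF-8 misread as Windows-1252: only the double quote trips on 0x9D.
print(decodes_as_cp1252(RIGHT_SINGLE.encode("utf-8")))  # True
print(decodes_as_cp1252(RIGHT_DOUBLE.encode("utf-8")))  # False
```

If that assumption holds, the right double quotation mark, not the right single one, would be the character that triggers the message.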


— Steve.




On Aug 6, 2015, at 1:32 PM, Custer, Mark <mark.custer at yale.edu> wrote:

Right, there are no XML encoding errors in the file, so xmllint is indeed correct.  After I downloaded the original file, though, I got the same error that Dominic received when trying to import the file.  After replacing the smart quotes as Christy suggested, the file imported without issue.

Ideally, I don’t think that the ASpace importer should choke on smart quotes, but you can certainly test for any potentially offending characters via Schematron (or other ways) prior to importing.  After learning of this issue, I added the following test to a Schematron file to do just that:



                <pattern>
                        <rule context="text()">
                            <report test="matches(., '’|“|”')">
                                Smart quote detected. These characters need to be replaced before importing your files
                                into ArchivesSpace.
                            </report>
                        </rule>
                </pattern>

Not an ideal way to do things, but at least it warns me now if any of my EAD files contain one of those three characters prior to importing the file.
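
For anyone without a Schematron setup, roughly the same check can be sketched in Python. This is only an illustration, not part of the Schematron file above: the function name and sample string are made up, and unlike the rule on text() nodes it scans the whole document, so it would also flag smart quotes in attribute values or comments.

```python
import re

# The same three characters as the Schematron test: ’ “ ”
SMART_QUOTES = re.compile("[\u2019\u201c\u201d]")


def find_smart_quotes(text):
    """Return (line_number, column, character) for each smart quote found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SMART_QUOTES.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


# Hypothetical sample; a real check would read the EAD file's text instead.
sample = "<abstract>St. Vincent\u2019s steady growth</abstract>"
for line_no, col, char in find_smart_quotes(sample):
    print(f"line {line_no}, column {col}: U+{ord(char):04X}")
```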



From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Steven Majewski
Sent: Friday, July 31, 2015 2:04 PM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...


When I download that original email enclosure and run xmllint on it, it doesn’t show any encoding errors.
I was also able to import it into ArchivesSpace without any errors.

I wonder if the translation to and from Base64 encoding of the email enclosure somehow
transforms the character encoding and fixes the problem?


Re: testing with Schematron:

In my experience (with doing validation in Java and hitting those kinds of encoding errors),
encoding errors come from early in the processing pipeline, before Schematron or XSLT processing.
I think you would need to scan the file for invalid encoding before passing it to the XML parser.
(In fact, I’m not even sure if you can express an invalid encoding in Schematron if it’s XML in
a particular encoding.)
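
A pre-parse scan of that kind could look something like the following Python sketch. It is an assumption-laden illustration rather than anything ArchivesSpace does: it only checks that the raw bytes are valid UTF-8, and the function names are invented for the example.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start


def check_file(path):
    """Read a file as raw bytes and report whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        offset = first_invalid_utf8_offset(handle.read())
    if offset is None:
        print(f"{path}: valid UTF-8")
    else:
        print(f"{path}: invalid UTF-8 at byte offset {offset}")
    return offset
```

A file that passes this check but still fails to import would point away from the file itself and toward how the importer is reading it.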


— Steve.



On Jul 31, 2015, at 1:12 PM, Custer, Mark <mark.custer at yale.edu> wrote:

Interesting.  I just tried to change the encoding value, but that doesn’t work.  If you do a find and replace in the file to replace the single quotes, though, the record will import fine.  I’ve attached a copy of the record that I was able to import.
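
For anyone who would rather script the find and replace, a minimal Python sketch follows. It is only an illustration of the workaround: the mapping and function name are made up here, and it assumes the smart quotes occur in text content only, since turning a curly double quote into a straight one inside a double-quoted attribute value would break the XML.

```python
# Map curly quotes to their ASCII equivalents.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def straighten_quotes(text):
    """Return the text with curly quotes replaced by straight ones."""
    return text.translate(REPLACEMENTS)


print(straighten_quotes("St. Vincent\u2019s \u201csteady growth\u201d"))
```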

For the record, using that type of single quote doesn’t invalidate the EAD file.  It’s still perfectly valid, but I don’t know if it’s fully UTF-8 compliant.

Is there any way to come up with a list of invalid characters?  If so, then that could be added to a Schematron file to test and make sure those values aren’t present before attempting to do the batch upload.

Mark



From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Steven Majewski
Sent: Friday, July 31, 2015 12:31 PM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...


You might also try changing the encoding of the EAD file in the XML header.
If it’s not declared, by default it’s UTF-8.
Change the first line to:

            <?xml version="1.0" encoding="windows-1252"?>


( I don’t know for a fact if this will work for ArchivesSpace, but it works with most parsers and validators. )


Alternatively, if you have ‘iconv’ you can run a conversion thru that program to change the encoding:

iconv -f WINDOWS-1252 -t UTF-8
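
Given no file argument, iconv reads standard input and writes standard output. For anyone without iconv, the same conversion can be sketched in Python. The function names are invented for the example, and it assumes the input really is Windows-1252: run on a file that is already UTF-8, it would either double-encode the text or fail on an undefined byte such as 0x9D.

```python
def cp1252_to_utf8(data):
    """Re-encode Windows-1252 bytes as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


def convert_file(src_path, dst_path):
    """Convert a Windows-1252 file to UTF-8; both paths are caller-supplied."""
    with open(src_path, "rb") as src:
        converted = cp1252_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# 0x92 is the Windows-1252 byte for RIGHT SINGLE QUOTATION MARK.
print(cp1252_to_utf8(b"St. Vincent\x92s"))
```

If the XML declaration names windows-1252, it would need to be updated to match after converting.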




— Steve Majewski



On Jul 31, 2015, at 12:17 PM, Tomecek, Christy <christy.tomecek at yale.edu> wrote:

Hello,

I think the issue is that there are Word “Smart Quotes” in your text fields (not the markup itself). The EAD won’t validate if they are present.

Example (Smart quote highlighted):

<abstract label="Abstract">Dating from 1918 to 2000, the History and Background Information series consists of written histories, newspaper clippings, and anniversary publications documenting St. Vincent’s steady growth in the Lincoln Park neighborhood.

There is a way to turn off Smart Quotes in Word, so you don’t have to fix them line by line when you copy and paste from a Word document into ASpace.

• Open Word. Go to File (or if you are in Windows 8/8.1, go to the Windows logo button).
• Scroll to the bottom of the sidebar where things like "New," "Save," etc. are and click on "Options" at the bottom.
• Go to "Proofing," located on the sidebar.
• Go to "AutoCorrect Options" in the main panel.
• Go to the "AutoFormat As You Type" tab and uncheck the "'Straight quotes' with 'smart quotes'" options under "Replace When You Type."

Best,
Christy

--

Christy Tomecek
Archives Assistant
Yale University Library
Manuscripts and Archives
203-432-7382
christy.tomecek at yale.edu


From: archivesspace_users_group-bounces at lyralists.lyrasis.org On Behalf Of Rossetti, Dominic
Sent: Friday, July 31, 2015 11:58 AM
To: archivesspace_users_group at lyralists.lyrasis.org
Subject: [Archivesspace_Users_Group] EAD Import Issue...

Hey all,

When trying to import EAD I get the following error message in the log file:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
IMPORT ERROR
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Error: #<Encoding::UndefinedConversionError: "\x9D" from Windows-1252 to UTF-8>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I’ve attached a file as an example. The EAD is valid and correct. Not sure what is causing the issue.
_______________________________________________
Archivesspace_Users_Group mailing list
Archivesspace_Users_Group at lyralists.lyrasis.org
http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group

<dpu_ead_cm0001_stvincentchurchchi-edited.xml>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lyralists.lyrasis.org/pipermail/archivesspace_users_group/attachments/20150808/fe0d9cdf/attachment.html>

