[Archivesspace_Users_Group] EAD Import Issue...

Steven Majewski sdm7g at virginia.edu
Sat Aug 8 20:50:58 EDT 2015


If it’s working for me but not for you and Dominic, that makes me think it’s an issue with a difference in default encodings on the JVM for our respective servers. (And there are several default settings. Google “java default encodings” and you’ll find several posts like this: http://stackoverflow.com/questions/1749064/how-to-find-the-default-charset-encoding-in-java )

I have experienced the case in Java where the first line below loses the encoding declared in the input file and falls back to a Java default, while the second line works properly:
-		InputSource indoc = new InputSource( new InputStreamReader( xmlfile.getInputStream() ));
+		InputSource indoc = new InputSource(  xmlfile.getInputStream() );
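
A minimal sketch of that difference, for anyone who wants to reproduce it outside ArchivesSpace. This is not the ArchivesSpace importer code: the class name InputSourceEncodingDemo, the sample XML string, and the explicit windows-1252 charset (standing in for a JVM whose default encoding happens to be Windows-1252) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class InputSourceEncodingDemo {
    // Hypothetical sample: a document declared as UTF-8 that contains
    // U+201C and U+201D (left and right double quotation marks).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note>\u201Cquoted\u201D</note>";

    // Parse the given source and return the text of the root element.
    static String rootText(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Byte stream: the parser itself reads the encoding from the XML declaration.
    static String parseFromBytes(byte[] data) {
        return rootText(new InputSource(new ByteArrayInputStream(data)));
    }

    // Character stream: the bytes are decoded before the parser sees them,
    // so the encoding declared in the file no longer has any effect.
    static String parseFromReader(byte[] data, Charset charset) {
        return rootText(new InputSource(
            new InputStreamReader(new ByteArrayInputStream(data), charset)));
    }

    public static void main(String[] args) {
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("byte stream:            " + parseFromBytes(utf8));
        System.out.println("reader as windows-1252: "
            + parseFromReader(utf8, Charset.forName("windows-1252")));
    }
}
```

With the byte stream the quotation marks should come back intact; with the reader decoded as windows-1252 they should not. An InputStreamReader created without a charset argument uses the JVM default, which is why the same code could behave differently on different servers.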


I wonder if something like that is happening and the UTF-8 encoded file is being read as Windows-1252. 
( That DOES seem to be what that error message is saying to me. ) 
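
The bytes involved can be checked with a few lines of Java. This is only a sketch: the class name SmartQuoteBytes and the hex helper are made up for illustration, and it assumes the JVM ships the windows-1252 charset (common JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteBytes {
    // Encode the text in the given charset and return the bytes as spaced hex.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+2019 RIGHT SINGLE QUOTATION MARK
        System.out.println("U+2019 UTF-8:        " + hex("\u2019", StandardCharsets.UTF_8)); // e2 80 99
        System.out.println("U+2019 windows-1252: " + hex("\u2019", cp1252));                 // 92
        // U+201D RIGHT DOUBLE QUOTATION MARK
        System.out.println("U+201D UTF-8:        " + hex("\u201D", StandardCharsets.UTF_8)); // e2 80 9d
        System.out.println("U+201D windows-1252: " + hex("\u201D", cp1252));                 // 94
    }
}
```

Of these four byte sequences, only the UTF-8 form of U+201D contains a 9d byte, which is the byte named in the error message.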

— Steve. 
 



> On Aug 8, 2015, at 7:19 PM, Custer, Mark <mark.custer at yale.edu> wrote:
> 
> I got the same error message when I tried to upload the original file to http://test.archivesspace.org/, 1.3.2-dev11 (and it failed in the first case, which is why it was posted to the listserv).  It's good to know that you were able to import it on your server, but that just makes the whole thing curiouser!  I should try to import an EAD or MARC file with right single quotation marks into our test ASpace server, but I haven't done so yet.  It would be nice to get to the bottom of this soon, but I'm afraid that doing so is beyond me.  Hopefully others can weigh in on the issue, and hopefully Dominic has since been able to import the EAD file into ASpace.
> 
> 
> From: archivesspace_users_group-bounces at lyralists.lyrasis.org [archivesspace_users_group-bounces at lyralists.lyrasis.org] on behalf of Steven Majewski [sdm7g at virginia.edu]
> Sent: Friday, August 07, 2015 5:39 PM
> To: Archivesspace Users Group
> Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...
> 
> 
> Yes, but my first point was that ASpace is NOT choking on those smart quotes (at least, not on my test server).
> 
> There are a number of U+2019 RIGHT SINGLE QUOTATION MARK characters in the file, and for me, the original file is importing without any problems. 
> 
> The original file is also UTF-8 encoded, so the error message was sending us down the wrong track initially. 
> 
> 
>   Error: #<Encoding::UndefinedConversionError: ""\x9D"" from Windows-1252 to UTF-8>
> 
> There’s no x9D in the UTF-8 encoding of U+2019 (E2 80 99), and its Windows-1252 encoding would actually be x92. 
> 
> 
> Maybe the error message indicates something else misconfigured so that it’s reading a valid UTF-8
> file but thinking it’s in Windows-1252 encoding, and hitting a valid UTF-8 sequence with a x9D 
> ( which is empty/invalid in W-1252 ) and giving a misleading error message. 
> 
> There is a x9D in the UTF-8 encoding (E2 80 9D) of RIGHT DOUBLE QUOTATION MARK http://www.fileformat.info/info/unicode/char/201d/index.htm
> which does also occur in that file. 
> 
> 
> — Steve.
> 
> 
> 
> 
>> On Aug 6, 2015, at 1:32 PM, Custer, Mark <mark.custer at yale.edu> wrote:
>> 
>> Right, there are no XML encoding errors in the file, so xmllint is indeed correct.  After I downloaded the original file, though, I got the same error that Dominic received when trying to import the file.  After replacing the smart quotes as Christy suggested, the file imported without issue.
>>  
>> Ideally, I don’t think that the ASpace importer should choke on smart quotes, but you can certainly test for any potentially offending characters via Schematron (or other ways) prior to importing.  After learning of this issue, I added the following test to a Schematron file to do just that:
>>  
>>  
>>                 <pattern>
>>                         <rule context="text()">
>>                             <report test="matches(., '’|“|”')">
>>                                 Smart quote detected. These characters need to be replaced before importing your files
>>                                 into ArchivesSpace.
>>                             </report>
>>                         </rule>
>>                 </pattern>
>>  
>> Not an ideal way to do things, but at least it warns me now if any of my EAD files contain one of those three characters prior to importing the file.
>>  
>>  
>>  
>> From: archivesspace_users_group-bounces at lyralists.lyrasis.org [mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org] On Behalf Of Steven Majewski
>> Sent: Friday, July 31, 2015 2:04 PM
>> To: Archivesspace Users Group
>> Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...
>>  
>>  
>> When I download that original email enclosure and run xmllint on it, it doesn’t show any encoding errors.
>> I was also able to import it into ArchivesSpace without any errors.
>>  
>> I wonder if the translation to and from Base64 encoding of the email enclosure somehow 
>> transforms the character encoding and fixes the problem ? 
>>  
>>  
>> Re: testing with Schematron: 
>>  
>> In my experience ( with doing validation in Java and hitting those kinds of encoding errors )
>> encoding errors come from early in the processing pipeline before Schematron or XSLT processing. 
>> I think you would need to scan the file for invalid encoding before passing it to the XML parser. 
>> ( In fact, I’m not even sure if you can express an invalid encoding in Schematron if it’s XML in 
>>   a particular encoding. ) 
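
A rough sketch of such a pre-parse scan in Java, assuming the file is expected to be UTF-8. The class name Utf8PreCheck and the command-line usage are hypothetical; only standard java.nio calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8PreCheck {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path of the file to check (hypothetical usage)
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(data) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}
```

A file saved as Windows-1252 with smart quotes in it (bytes 0x91 to 0x94) would typically fail this check, because those bytes are not valid on their own in UTF-8. A file that passes is well-formed UTF-8, which points the investigation toward how the importing side decodes it.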
>>  
>>  
>> — Steve. 
>>  
>>  
>>  
>> On Jul 31, 2015, at 1:12 PM, Custer, Mark <mark.custer at yale.edu> wrote:
>>  
>> Interesting.  I just tried to change the encoding value, but that doesn’t work.  If you do a find and replace in the file to replace the single quotes, though, the record will import fine.  I’ve attached a copy of the record that I was able to import.
>>  
>> For the record, using that type of single quote doesn’t invalidate the EAD file.  It’s still perfectly valid, but I don’t know if it’s fully UTF-8 compliant.
>>  
>> Is there any way to come up with a list of invalid characters?  If so, then that could be added to a Schematron file to test and make sure those values aren’t present before attempting to do the batch upload.
>>  
>> Mark
>>  
>>  
>>  
>> From: archivesspace_users_group-bounces at lyralists.lyrasis.org [mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org] On Behalf Of Steven Majewski
>> Sent: Friday, July 31, 2015 12:31 PM
>> To: Archivesspace Users Group
>> Subject: Re: [Archivesspace_Users_Group] EAD Import Issue...
>>  
>>  
>> You might also try changing the encoding of the EAD file in the XML header.
>> If it’s not declared, by default it’s UTF-8. 
>> Change the first line to:
>>  
>>             <?xml version="1.0" encoding="windows-1252"?>
>>  
>>  
>> ( I don’t know for a fact if this will work for ArchivesSpace, but it works with most parsers and validators. )
>>  
>>  
>> Alternatively, if you have ‘iconv’ you can run a conversion thru that program to change the encoding:
>>  
>> iconv -f WINDOWS-1252 -t UTF-8 
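
For anyone without iconv, roughly the same conversion can be sketched in Java. The class name Cp1252ToUtf8 and the two command-line arguments are hypothetical, and the sketch assumes the input really is Windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Cp1252ToUtf8 {
    // Interpret the bytes as Windows-1252 text, then re-encode that text as UTF-8.
    static byte[] transcode(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input file, args[1] = output file (hypothetical usage)
        byte[] in = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), transcode(in));
    }
}
```

This only helps if the file is not already UTF-8: converting a file that is already UTF-8 would double-encode its non-ASCII characters. Any encoding named in the XML declaration would also need to be updated to match the converted file.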
>>  
>>  
>>  
>>  
>> — Steve Majewski
>>  
>>  
>>  
>> On Jul 31, 2015, at 12:17 PM, Tomecek, Christy <christy.tomecek at yale.edu> wrote:
>>  
>> Hello,
>>  
>> I think the issue is that there are Word “Smart Quotes” in your text fields (not the markup itself). The EAD won’t validate if they are present.
>>  
>> Example (Smart quote highlighted):
>>  
>> <abstract label="Abstract">Dating from 1918 to 2000, the History and Background Information series consists of written histories, newspaper clippings, and anniversary publications documenting St. Vincent’s steady growth in the Lincoln Park neighborhood.
>>  
>> There is a way to turn off Smart Quotes in Word, so that you don’t have to go line by line fixing them when you copy and paste from a Word document into ASpace.
>>  
>> - Open Word. Go to File (or if you are in Windows 8/8.1, go to the Windows logo button).
>> - Scroll to the bottom of the sidebar where things like "New," "Save," etc. are and click on "Options" at the bottom.
>> - Go to "Proofing," located on the sidebar.
>> - Go to "AutoCorrect Options" in the main panel.
>> - Go to the "AutoFormat As You Type" tab and uncheck the "'Straight quotes' with 'smart quotes'" option under "Replace When You Type."
>>  
>> Best,
>> Christy
>>  
>> --
>>  
>> Christy Tomecek
>> Archives Assistant
>> Yale University Library
>> Manuscripts and Archives
>> 203-432-7382
>> christy.tomecek at yale.edu
>>  
>>  
>> From: archivesspace_users_group-bounces at lyralists.lyrasis.org [mailto:archivesspace_users_group-bounces at lyralists.lyrasis.org] On Behalf Of Rossetti, Dominic
>> Sent: Friday, July 31, 2015 11:58 AM
>> To: 'archivesspace_users_group at lyralists.lyrasis.org'
>> Subject: [Archivesspace_Users_Group] EAD Import Issue...
>>  
>> Hey all,
>>  
>> When trying to import EAD I get the following error message in the log file:
>>  
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> IMPORT ERROR
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>  
>>  
>> Error: #<Encoding::UndefinedConversionError: ""\x9D"" from Windows-1252 to UTF-8>
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>  
>> I’ve attached a file as an example. The EAD is valid and correct. Not sure what is causing the issue.
>> _______________________________________________
>> Archivesspace_Users_Group mailing list
>> Archivesspace_Users_Group at lyralists.lyrasis.org
>> http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group
>>  
>> <dpu_ead_cm0001_stvincentchurchchi-edited.xml>