[Archivesspace_Users_Group] Missing Japanese charactires in a PUI generated PDF

Mayo, Dave dave_mayo at harvard.edu
Wed Sep 14 23:01:18 EDT 2022


Hi!

This is something we’ve recently had to deal with – I’m not 100% sure from what you’ve posted that it’s the same issue we had, but there are a few issues with the PUI’s current PDF generation support that make font handling challenging.

First of all: if you’re setting up a fallback hierarchy and the font with Japanese characters isn’t in the first position, the PDF generation library isn’t seeing it at all.  The Flying Saucer PDF library doesn’t support font fallback, which is a real problem if you need to support multiple languages.

So the first thing I’d try is making sure that the text in question is styled _solely_ with the font that supports Japanese.  If the Japanese characters then render, that at least confirms that the missing fallback is the cause.

Our solution, which I’m hoping to work up and submit as a pull request, was to replace the existing library with https://github.com/danfickle/openhtmltopdf - a project based on flying saucer but with several enhancements.  Implementing it is somewhat complex:

1. Openhtmltopdf and _all dependencies thereof_ need to be provided by putting them in the archivesspace/lib directory (the directory the MySQL connector goes in during install)

Currently we’re doing this in our Dockerfile via:

wget -P /archivesspace/lib https://repo1.maven.org/maven2/com/google/zxing/core/3.5.0/core-3.5.0.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/junit/junit/4.13.2/junit-4.13.2.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/com/openhtmltopdf/openhtmltopdf-core/1.0.10/openhtmltopdf-core-1.0.10.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/com/openhtmltopdf/openhtmltopdf-pdfbox/1.0.10/openhtmltopdf-pdfbox-1.0.10.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/de/rototor/pdfbox/graphics2d/0.34/graphics2d-0.34.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/2.0.26/pdfbox-2.0.26.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/pdfbox/xmpbox/2.0.26/xmpbox-2.0.26.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/pdfbox/fontbox/2.0.26/fontbox-2.0.26.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/jfree/jfreechart/1.5.3/jfreechart-1.5.3.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/freemarker/freemarker/2.3.27-incubating/freemarker-2.3.27-incubating.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/servicemix/bundles/org.apache.servicemix.bundles.rhino/1.7.10_1/org.apache.servicemix.bundles.rhino-1.7.10_1-sources.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/openjdk/jmh/jmh-core/1.29/jmh-core-1.29.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/codelibs/jhighlight/1.1.0/jhighlight-1.1.0.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/thymeleaf/extras/thymeleaf-extras-java8time/3.0.4.RELEASE/thymeleaf-extras-java8time-3.0.4.RELEASE.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/thymeleaf/thymeleaf/3.1.0.M2/thymeleaf-3.1.0.M2.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/yaml/snakeyaml/1.26/snakeyaml-1.26.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/com/ibm/icu/icu4j/59.1/icu4j-59.1.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/xmlgraphics/batik-codec/1.14/batik-codec-1.14.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/xmlgraphics/batik-ext/1.14/batik-ext-1.14.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/xmlgraphics/batik-transcoder/1.14/batik-transcoder-1.14.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/apache/xmlgraphics/xmlgraphics-commons/2.7/xmlgraphics-commons-2.7.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/org/verapdf/validation-model/1.18.8/validation-model-1.18.8.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/de/rototor/snuggletex/snuggletex-core/1.3.0/snuggletex-core-1.3.0.jar && \
wget -P /archivesspace/lib https://repo1.maven.org/maven2/net/sourceforge/jeuclid/jeuclid-core/3.1.9/jeuclid-core-3.1.9.jar && \

2. Then, the code that generates the PDFs needs to be overridden with code based on the new library. We do this in our PUI customization plugin here:

https://github.com/harvard-library/aspace-hvd-pui/blob/bd4b1c3cf728674cc3445dee39a16282848c2cca/public/models/hvd_pdf.rb#L152

We were already overriding PDF generation; the model in core ArchivesSpace is located here:

https://github.com/archivesspace/archivesspace/blob/ceeb72d1796a8b67104814065ffea23215403f78/public/app/models/finding_aid_pdf.rb#L94

I believe my co-worker Doug never did get a web font to work.  We ended up using the Kurinto fonts (and some others) that ship with ArchivesSpace and are used by the XSLT PDF processing in the backend:

https://github.com/harvard-library/aspace-hvd-pui/blob/bd4b1c3cf728674cc3445dee39a16282848c2cca/public/models/hvd_pdf.rb#L165
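
The download step in item 1 could also be driven from a list of Maven coordinates instead of one wget line per jar. What follows is only a sketch and is not taken from the Dockerfile above: `maven_url` and `fetch_jars` are hypothetical helper names, the usage list is abbreviated, and it assumes each jar sits on repo1.maven.org under the standard group/artifact/version layout. Jars with a classifier (such as the `-sources` one above) would need separate handling.

```shell
#!/bin/sh
# Sketch only: maven_url and fetch_jars are hypothetical helpers.

# Build a Maven Central jar URL from group, artifact and version.
maven_url() {
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

# Read "group:artifact:version" lines from stdin and download each jar
# into the directory given as the first argument.
fetch_jars() {
  lib_dir=$1
  while IFS=: read -r group artifact version; do
    [ -n "$group" ] || continue   # skip blank lines
    wget -P "$lib_dir" "$(maven_url "$group" "$artifact" "$version")" || return 1
  done
}

# Usage (abbreviated list, commented out so nothing is downloaded on load):
# fetch_jars /archivesspace/lib <<'EOF'
# com.openhtmltopdf:openhtmltopdf-core:1.0.10
# com.openhtmltopdf:openhtmltopdf-pdfbox:1.0.10
# org.apache.pdfbox:pdfbox:2.0.26
# EOF
```

Keeping the coordinates in one list makes version bumps a one-line change, at the cost of an extra helper to maintain.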

I hope this is somewhat helpful! I very much want to try and package this up in a less terrible way, either by getting this incorporated into core or through creating a plugin – a plugin would need to either copy the libraries into the right place on install or have a manual step of downloading and installing the libraries, so it’d be a bit inelegant.
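
For the manual-install variant, a plugin could at least report which jars are still missing from the lib directory. This is a minimal sketch under assumptions: `missing_jars` is a hypothetical helper, not something ArchivesSpace provides, and the jar names passed to it would be whatever list the plugin ships with.

```shell
#!/bin/sh
# Sketch only: missing_jars is a hypothetical helper, not part of ArchivesSpace.

# Print every expected jar that is absent from the directory given as the
# first argument; return non-zero if at least one is missing.
missing_jars() {
  lib_dir=$1
  shift
  status=0
  for jar in "$@"; do
    if [ ! -f "$lib_dir/$jar" ]; then
      printf 'missing: %s\n' "$jar"
      status=1
    fi
  done
  return "$status"
}

# Usage:
# missing_jars /archivesspace/lib openhtmltopdf-core-1.0.10.jar pdfbox-2.0.26.jar
```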

If you have any questions, I’d be happy to try and answer them!

--
Dave Mayo (he/him)
Senior Digital Library Software Engineer
Harvard University > HUIT > LTS

From: <archivesspace_users_group-bounces at lyralists.lyrasis.org> on behalf of 松山 ひとみ <matsuyama-h at nakka-art.jp>
Reply-To: Archivesspace Users Group <archivesspace_users_group at lyralists.lyrasis.org>
Date: Tuesday, September 13, 2022 at 9:41 PM
To: "'archivesspace_users_group at lyralists.lyrasis.org'" <archivesspace_users_group at lyralists.lyrasis.org>
Subject: [Archivesspace_Users_Group] Missing Japanese charactires in a PUI generated PDF

Hi all.

We’ve been struggling with an issue in PUI-generated PDFs: no Japanese characters are displayed.
Could anyone tell us what we should try next, or whether anything is wrong in our procedure?

We tried the following:

1. Created "./plugins/local/public/views/pdf/_header.html.erb" and edited it.
We confirmed that the CSS was applied.

2. In the stylesheet from step 1, we specified these three fonts: "serif", "sans-serif", and the font that iText uses for Japanese text:

body {
  font-family: KozMinPro-Regular;
}

3. In addition, we tried loading a font from Google Fonts and generating the PDF again. It didn’t work.

@import url('https://fonts.googleapis.com/css2?family=Sawarabi+Gothic&display=swap');

body {
  font-family: 'Sawarabi Gothic', sans-serif;
}

We’ve looked through these previous Q&As:
http://lyralists.lyrasis.org/mailman/htdig/archivesspace_users_group/2017-August/005046.html
http://lyralists.lyrasis.org/mailman/htdig/archivesspace_users_group/2017-August/005047.html

We would greatly appreciate your assistance!

Hitomi Matsuyama, Audiovisual Archivist

Nakanoshima Museum of Art, Osaka
4-3-1 Nakanoshima, Kita-ku
Osaka 530-0005 JAPAN
tel. +81 (0)6 64 79 05 58
email. matsuyama-h at nakka-art.jp
