Writing a Russian PDF with the Java PDFBox library

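I am using a Java library called PDFBox to write text to a PDF. It works fine for English text, but when I try to write Russian text into the PDF, the letters come out strange. The problem seems to be the font being used, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code: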

    PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) ); // Windows Russian font imported to write the Russian text.
    font.setEncoding( new WinAnsiEncoding() ); // Define the encoding used in writing.
    // Some code here to open the PDF & define a new page.
    contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.

The WinAnsiEncoding source code is: click here

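------- Edited on November 18, 2009

After some investigation, I am now sure that this is an encoding problem, which could be solved by defining my own encoding with the helpful PDFBox class called DictionaryEncoding.

I am not sure how to use it, but here is what I have tried so far: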

    COSDictionary cosDic = new COSDictionary();
    cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
    font.setEncoding( new DictionaryEncoding( cosDic ) );

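This does not work. I seem to be filling the dictionary the wrong way, and when I write a PDF page with it, the page comes out blank.

The DictionaryEncoding source code is: click here

Try using this construct: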

    PDFont font = PDType0Font.load( pdfFile, new File( "fonts/VREMACCI.TTF" ) ); // Windows Russian font imported to write the Russian text.
    // Some code here to open the PDF & define a new page.
    contentStream.beginText();
    contentStream.setFont(font, 12);
    contentStream.showText( "отделом компьютерной" ); // Write the Russian text.
    contentStream.endText();

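The long story is this: in order to do Unicode output in PDF from a TrueType font, the output must include a pile of detailed and seemingly redundant information. What it boils down to is that, inside a TrueType font, the glyphs are stored as glyph IDs. These glyph IDs are associated with particular Unicode characters (and, IIRC, a Unicode glyph may internally refer to several code points, like é referring to e and an acute accent; my memory is hazy here). PDF does not really have Unicode support. It only says that there is a mapping from the UTF16BE values in a string to the glyph IDs in a TrueType font, as well as a mapping from the UTF16BE values to Unicode, even if that mapping is the identity.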
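To make the "UTF16BE value in a string" part concrete, here is a minimal sketch in plain Java (no PDFBox) that prints the 2-byte codes a string would carry if each code simply equals its UTF-16 value, which is the scheme this answer describes. The class and method names are made up for illustration, and other PDF producers may choose different codes.

```java
import java.nio.charset.StandardCharsets;

final class Utf16BeHex {
    /** Returns the text as a PDF-style hex string of UTF-16BE code units, e.g. <0420>. */
    static String toPdfHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE); // big-endian, no BOM
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // Two Cyrillic letters (U+0420, U+0443), written as escapes so the
        // example does not depend on the encoding of the source file.
        System.out.println(toPdfHexString("\u0420\u0443")); // prints <04200443>
    }
}
```

Every character outside ASCII still takes exactly two bytes here, which is why a single-byte encoding such as WinAnsiEncoding cannot describe these strings.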

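A font dictionary of Subtype Type0 with:
- a DescendantFonts array containing an entry as described below
- a ToUnicode entry that maps UTF16BE values to Unicode
- an Encoding set to Identity-H

The output from one of the unit tests on my own tools looks like this: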

    13 0 obj
    <<
      /BaseFont /DejaVuSansCondensed
      /DescendantFonts [ 4 0 R ]
      /ToUnicode 14 0 R
      /Type /Font
      /Subtype /Type0
      /Encoding /Identity-H
    >> endobj

    14 0 obj
    << /Length 346 >> stream
    /CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
    /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
    /CMapName /Adobe-Identity-UCS def /CMapType 2 def
    1 begincodespacerange
    <0000> <FFFF>
    endcodespacerange
    1 beginbfrange
    <0000> <FFFF> <0000>
    endbfrange
    endcmap CMapName currentdict /CMap defineresource pop end end
    endstream % note that the formatting is wrong for the stream
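As a rough illustration of what the ToUnicode stream has to say, the sketch below builds only the `beginbfchar ... endbfchar` section for the characters that actually occur in a string. It assumes the same identity scheme as above (code equals UTF-16 value), handles only BMP characters, and assumes fewer than 100 distinct characters, since a single bfchar block is limited to 100 entries. The class name is hypothetical and this is not how PDFBox or my own tool does it.

```java
import java.util.TreeSet;

final class ToUnicodeSketch {
    /**
     * Builds one bfchar block mapping each distinct 2-byte code in the text
     * to the same Unicode value (identity mapping, BMP characters only).
     */
    static String bfcharSection(String text) {
        TreeSet<Character> chars = new TreeSet<>();
        for (char c : text.toCharArray()) {
            chars.add(c); // duplicates collapse, order is by code value
        }
        StringBuilder sb = new StringBuilder();
        sb.append(chars.size()).append(" beginbfchar\n");
        for (char c : chars) {
            String hex = String.format("%04X", (int) c);
            sb.append('<').append(hex).append("> <").append(hex).append(">\n");
        }
        return sb.append("endbfchar\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(bfcharSection("\u0420\u0443\u0420")); // U+0420 appears once in the output
    }
}
```

The rest of the CMap (the `CIDInit` preamble, `CIDSystemInfo`, and the codespace range) would wrap around this section in the same way as in the stream shown above.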

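A font dictionary of Subtype CIDFontType2 with:
- a CIDSystemInfo
- a FontDescriptor
- DW and W
- a CIDToGIDMap that maps character IDs to glyph IDs

Here is the one from the same test. This is the object in the DescendantFonts array: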

    4 0 obj
    <<
      /Subtype /CIDFontType2
      /Type /Font
      /BaseFont /DejaVuSansCondensed
      /CIDSystemInfo 8 0 R
      /FontDescriptor 9 0 R
      /DW 1000
      /W 10 0 R
      /CIDToGIDMap 11 0 R
    >>

    8 0 obj
    <<
      /Registry (Adobe)
      /Ordering (UCS)
      /Supplement 0
    >> endobj
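The CIDToGIDMap stream referenced above is just a flat table: for CID c, the glyph ID is stored as a 2-byte big-endian value at bytes 2c and 2c+1. A minimal sketch of building such a table follows; the class name and the glyph ID used in `main` are invented for illustration, and real glyph IDs would come from the font's own cmap table.

```java
import java.util.Map;

final class CidToGidMapSketch {
    /**
     * Builds the raw bytes of a CIDToGIDMap stream covering CIDs 0..maxCid.
     * CIDs missing from the map keep glyph ID 0, which is the .notdef glyph.
     */
    static byte[] build(Map<Integer, Integer> cidToGid, int maxCid) {
        byte[] table = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid > maxCid || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("CID or glyph ID out of range: " + cid + " -> " + gid);
            }
            table[2 * cid] = (byte) (gid >> 8);  // high-order byte first
            table[2 * cid + 1] = (byte) gid;     // low-order byte
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical mapping: CID 0x0420 -> glyph 0x0123.
        byte[] table = build(Map.of(0x0420, 0x0123), 0x0420);
        System.out.println(table.length + " bytes");
    }
}
```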

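Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain. Acrobat was developed before there was Unicode, and supporting CJK encodings without Unicode was painful from the start (I know, I was working on Acrobat at the time). Unicode support was added later, but it really feels like it was kludged in. You would hope that you could just say /Encoding /Unicode, have strings that start with the thorn and y-dieresis characters, and be off and running. No such luck. If you do not put in every single detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.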

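At this point I write PDF generation tools for a separate company (.NET right now, so it will not help you), and I made it a design goal to hide all of that nonsense. All text is Unicode. If you only use character codes that are the same as in WinAnsi, that is what you get under the hood. Use anything else, and you get all of this other stuff with it. I would be surprised if PDFBox does this work for you, because it is a significant hassle.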

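The solution is very simple.

1) Find a font that is compatible with the characters you want to display.
2) Download the .ttf file of that font locally.
3) Load the font from your application.

For example, if you want to use Greek characters, you have to do this: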

    content = new PDPageContentStream(document, page);
    pdfFont = PDType0Font.load( document, new File( "arialuni.ttf" ) );
    content.setFont(pdfFont, fontSize);

Perhaps a Russian encoding class needs to be written. It should look like WinAnsiEncoding, I suppose.
Now, I have no idea what to put in it!

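Or, if that is not what you are already doing, perhaps you should encode your source file in UTF-8 and use the default encoding.
I have seen some messages about problems extracting Russian text from existing PDF files (with PDFBox, of course), but I do not know whether that is relevant for output.
You could also write to the PDFBox mailing list.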
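One way to check the source-file encoding suggestion is to print the code points that a string literal actually contains after compilation. The helper below is only a sketch with a made-up class name. If you pass it your Russian literal and do not see values in the U+0400 to U+04FF range, the text was already damaged before it reached PDFBox.

```java
final class CodePointDump {
    /** Formats every code point of the text as U+XXXX, separated by spaces. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escaped Cyrillic letters, independent of the source file encoding.
        System.out.println(describe("\u043E\u0442")); // prints U+043E U+0442
    }
}
```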

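It should be easy to test whether this is an encoding problem (just switch to UTF16 encoding).

I assume you have tried the VREMACCI font in an editor or something similar and confirmed that it displays the way you expect?

You may want to try doing the same thing in iText, just to find out whether the problem lies in the PdfBox library itself. If your main goal is generating PDF files, iText may be the better solution.

EDIT - long response to the comments: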

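OK, sorry for the back and forth on the encoding issue. Your core problem (which you probably already know) is that the encoding of the bytes written to the content stream is different from the encoding used to look up the glyphs. Now I will try to actually be helpful: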
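A quick way to see that mismatch from the Java side is to ask whether a single-byte Western encoding can represent the text at all. The sketch below uses Java's windows-1252 charset as a stand-in for WinAnsiEncoding. That substitution is an assumption: the two are closely related but I would not rely on them being identical. The class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiCheck {
    /** Returns the index of the first char that windows-1252 cannot encode, or -1 if all fit. */
    static int firstUnencodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnencodable("PDF"));         // prints -1
        System.out.println(firstUnencodable("ab\u043E"));    // prints 2 (Cyrillic letter)
    }
}
```

Any Cyrillic letter fails this check, so a font set up with a WinAnsi-style encoding has no code available for it, whatever glyphs the font file contains.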

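I took a look at the dictionary encoding class in PdfBox, and it looks quite unintuitive. The 'dictionary' in question is a PDF dictionary. So what you basically need to do is create a Pdf dictionary object (I think PdfBox calls this a kind of COSObject) and then add entries to it.

The encoding for a font is defined in PDF as a dictionary (see page 266 of the spec mentioned above). The dictionary contains a base encoding name and an optional differences array. Technically, the differences array is not supposed to be used with TrueType fonts (I have seen it used in some cases, but do not use it).

You then specify an entry for the cmap of the encoding. This cmap will be the encoding of your font.

My suggestion is to take an existing PDF that does what you want and get a dump of the dictionary structure for its font, so you can see what it looks like.

This is definitely not for the faint of heart. I can provide some help: if you need a dictionary dump, give me a link to a sample PDF and I will run it through some of the algorithms I use in iText development (I am the maintainer of the iText text extraction sub-system).

EDIT - 11/17/09

OK, here is the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, in the order in which they appear in the containing dictionary):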

 (/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0) Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R]) Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary) Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState) Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02) Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R]) Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font) Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream) Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream) Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream) Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream) Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32) Subdictionary /FontDescriptor = 
(/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0) Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32) Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0) Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32) Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, 
/FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15) Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream) Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32) Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0) Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD) Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG) Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter) Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary) Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word) Subdictionary /PageElement = (/SubType=/HF)

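There are a lot of moving parts here. You may want to put together a test document with only 3 or 4 characters. A lot of Type0 fonts are used here (in addition to the TT fonts), so it is hard to tell which ones are involved in your particular problem.

(Are you sure you do not want to at least try iText? ;-) I am not saying it will work, just that it may be worth a shot.)

For reference, the dictionary dump above was obtained with the com.lowagie.text.pdf.parser.PdfContentReaderTool class.

Try this: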

    Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This works at least with the latest (5.0.1) iText.
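The reason a "Cp1251" font encoding can work for Russian is that Windows-1251 is a single-byte code page with a slot for every Cyrillic letter. The plain-Java sketch below (no iText involved, class name invented) shows the bytes that Java's windows-1251 charset produces for two Cyrillic capitals. It says nothing about what iText does internally.

```java
import java.nio.charset.Charset;

final class Cp1251Demo {
    /** Encodes the text as windows-1251 and returns the bytes as space-separated hex. */
    static String hexBytes(String text) {
        byte[] bytes = text.getBytes(Charset.forName("windows-1251"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0421 and U+0410 are the first two letters of the phrase in the snippet above.
        System.out.println(hexBytes("\u0421\u0410")); // prints D1 C0
    }
}
```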
