Tess4j on 64-bit Windows: exception when multithreading

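I'm using Tesseract 3 and Java 8 on 64-bit Windows to OCR scanned PDFs. I followed the instructions on the Tess4j page, used the 64-bit versions of the required DLLs, and installed 64-bit Ghostscript.

When I run my unit test with a plain @Test (no parameters), the code works fine, so I assume everything is installed correctly.

When I run it with 2 parallel threads (see below), I get an exception.

I have read the related thread here, but it suggests using Tesseract1, which I am already using (I have tried both).

Any ideas?

Here is the code: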

    // @Test // works
    @Test(invocationCount = 2, threadPoolSize = 2)
    public void testOcr() throws OcrException, TesseractException {
        File scannedPdf = new File(this.getClass().getClassLoader().getResource("scanned.pdf").getFile());
        // Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
        String str = instance.doOCR(scannedPdf);
        System.out.println("OCR Result: " + str);
    }

Here is the exception:

    log4j:WARN No appenders could be found for logger (org.ghost4j.Ghostscript).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Ιουλ 16, 2014 6:22:23 ΜΜ net.sourceforge.vietocr.PdfUtilities convertPdf2Png
    SEVERE: Cannot initialize Ghostscript interpreter. Error code is -21
    org.ghost4j.GhostscriptException: Cannot initialize Ghostscript interpreter. Error code is -21
        at org.ghost4j.Ghostscript.initialize(Ghostscript.java:365)
        at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
        at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
        at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
        at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
        at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
        at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
        at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
        at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    java.lang.Error: Invalid memory access
        at com.sun.jna.Native.invokeInt(Native Method)
        at com.sun.jna.Function.invoke(Function.java:383)
        at com.sun.jna.Function.invoke(Function.java:315)
        at com.sun.jna.Library$Handler.invoke(Library.java:212)
        at com.sun.proxy.$Proxy3.gsapi_init_with_args(Unknown Source)
        at org.ghost4j.Ghostscript.initialize(Ghostscript.java:350)
        at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
        at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
        at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
        at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
        at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
        at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
        at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
        at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    net.sourceforge.tess4j.TesseractException: javax.imageio.IIOException: I/O error reading header!
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
        at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
        at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
        at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
        at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
        at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
        at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: javax.imageio.IIOException: I/O error reading header!
        at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:224)
        at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.locateImage(TIFFImageReader.java:231)
        at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.getNumImages(TIFFImageReader.java:279)
        at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
        ... 18 more
    Caused by: java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readShort(ImageInputStreamImpl.java:229)
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedShort(ImageInputStreamImpl.java:242)
        at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:199)
        ... 21 more

Update: it seems to be related to this.

Tesseract itself can only convert images to text. It cannot read PDFs, not even scanned ones.

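Under the hood, Tess4j uses Ghostscript (via ghost4j) to convert each page into a separate image file, which it then feeds to Tesseract for OCR. It concatenates the per-page results into the single string it returns.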
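The flow described above can be sketched roughly as follows. This is only an illustration of the shape of the pipeline, not Tess4j's actual code: the class `PdfOcrPipeline` and its two function parameters (`rasterizer` for the Ghostscript step, `recognizer` for the Tesseract step) are hypothetical names introduced here.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of a "PDF -> page images -> text" pipeline. */
final class PdfOcrPipeline {
    // Assumption: stands in for the Ghostscript step (PDF -> one image file per page).
    private final Function<Path, List<Path>> rasterizer;
    // Assumption: stands in for the Tesseract step (one image -> recognized text).
    private final Function<Path, String> recognizer;

    PdfOcrPipeline(Function<Path, List<Path>> rasterizer, Function<Path, String> recognizer) {
        this.rasterizer = rasterizer;
        this.recognizer = recognizer;
    }

    /** Converts the PDF to page images, OCRs each page, and joins the results in page order. */
    String ocr(Path pdf) {
        StringBuilder text = new StringBuilder();
        for (Path pageImage : rasterizer.apply(pdf)) {
            text.append(recognizer.apply(pageImage));
        }
        return text.toString();
    }
}
```

Seen this way, the stack trace makes sense: the failure occurs in the rasterizing step (`convertPdf2Png`), before Tesseract is ever invoked.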

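The exception occurs because Tess4j uses ghost4j in a way that does not support multithreading. As described here, ghost4j does offer multithreading support through its high-level API (in practice it runs separate Ghostscript instances, each invoked from a different JVM). Tess4j, however, uses the low-level API, which can work with only a single Ghostscript instance.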
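One possible workaround, if the single Ghostscript instance is indeed the bottleneck, is to make sure only one thread at a time enters the code path that reaches it. The following is a minimal sketch under that assumption; `ExclusiveSection` is a hypothetical helper, not part of Tess4j or ghost4j, and the `maxObservedConcurrency` method exists only to demonstrate that guarded tasks do not overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper: serializes access to a resource that is not thread-safe. */
final class ExclusiveSection {
    // One JVM-wide lock, assuming the guarded native interpreter is a single shared instance.
    private static final ReentrantLock LOCK = new ReentrantLock();

    private ExclusiveSection() {}

    /** Runs the task while holding the lock, so guarded tasks never run concurrently. */
    static <T> T call(Callable<T> task) throws Exception {
        LOCK.lock();
        try {
            return task.call();
        } finally {
            LOCK.unlock();
        }
    }

    /**
     * Demo only: submits `threads` guarded tasks to a pool and returns the highest
     * number of tasks that were inside the guarded section at the same time.
     */
    static int maxObservedConcurrency(int threads) {
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        Callable<Integer> work = () -> {
            int now = inside.incrementAndGet();
            max.accumulateAndGet(now, Math::max);
            Thread.sleep(20); // stand-in for the slow native conversion
            inside.decrementAndGet();
            return now;
        };
        Callable<Integer> guarded = () -> call(work);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                futures.add(pool.submit(guarded));
            }
            for (Future<Integer> f : futures) {
                f.get(); // wait for completion and surface any worker failure
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return max.get();
    }
}
```

In the test from the question, the call would become something like `ExclusiveSection.call(() -> instance.doOCR(scannedPdf))`. Note that this serializes the OCR step as well as the PDF conversion, so it trades throughput for stability; guarding only the PDF-to-image conversion would keep the OCR itself parallel, provided the library lets you invoke that step separately.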