GZIPInputStream closes prematurely when decompressing an HTTPInputStream

See the updated question in the Edit section below.

I am trying to decompress a large (~300M) GZIPed file from Amazon S3 on the fly using GZIPInputStream, but it only outputs a portion of the file; however, if I download it to the filesystem before decompressing, GZIPInputStream decompresses the entire file.

How can I get GZIPInputStream to decompress the entire HTTPInputStream instead of just the first part of it?

What I've Tried

See the update in the Edit section below.

I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, since everything is being treated as a binary stream and no parsing of the WET records in the file is happening at all).

The closest question I could find is GZIPInputStream is prematurely closed when reading from s3. The answer to that question was that some GZIP files are actually multiple appended GZIP files and that GZIPInputStream does not handle them well. However, if that is the case, why does GZIPInputStream work fine on a local copy of the file?
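For reference, here is a minimal sketch (my own illustration, using only the standard java.util.zip API; the class name, file name, and strings are made up) that builds such a concatenated GZIP file locally and reads it back with a single GZIPInputStream. Both members come out, which matches what I see with the locally downloaded copy of the WET archive:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ConcatenatedGzipFileTest {

        // Compress a single string into one complete GZIP member
        static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            // Two independent GZIP members appended into one file,
            // the same layout the WET archives reportedly use
            Path f = Files.createTempFile("concat", ".gz");
            Files.write(f, gzip("first member\n"));
            Files.write(f, gzip("second member\n"), StandardOpenOption.APPEND);

            // Read the whole file back through a single GZIPInputStream
            int total = 0;
            try (GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream(f.toFile()))) {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                    total += bytesRead;
                    System.out.print(new String(buffer, 0, bytesRead, StandardCharsets.UTF_8));
                }
            }
            // Prints both lines: reading from a local file, GZIPInputStream
            // keeps going past the end of the first member
            System.out.println("Decompressed " + total + " bytes");
        }
    }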

Demonstration Code and Output

Below is a piece of sample code that demonstrates the problem I am seeing. I have tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux machines on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.

Output

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 87894 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile0.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 1772936 bytes from HTTP->GZIP
    Read 451171329 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile40.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 89217 bytes from HTTP->GZIP
    Read 453183600 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile500.wet

Sample Code

    import java.net.*;
    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.nio.channels.*;

    public class GZIPTest {

        public static void main(String[] args) throws Exception {
            // Our three test files from CommonCrawl
            URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

            /*
             * Test the URLs and display the results
             */
            test(url0, "testfile0.wet");
            System.out.println("------");
            test(url40, "testfile40.wet");
            System.out.println("------");
            test(url500, "testfile500.wet");
        }

        public static void test(URL url, String testGZFileName) throws Exception {
            System.out.println("Testing URL " + url.toString());

            // First directly wrap the HTTPInputStream with GZIPInputStream
            // and count the number of bytes we read.
            // Go ahead and save the extracted stream to a file for further inspection.
            System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
            int bytesFromGZIPDirect = 0;
            URLConnection urlConnection = url.openConnection();
            FileOutputStream directGZIPOutStream = new FileOutputStream("./" + testGZFileName);

            // FIRST TEST - Decompress from HTTPInputStream
            GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

            byte[] buffer = new byte[1024];
            int bytesRead = -1;
            while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPDirect += bytesRead;
                directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
            }
            gzipishttp.close();
            directGZIPOutStream.close();

            // Now save the GZIPed file locally
            System.out.println("Testing saving to file before decompression");
            int bytesFromGZIPFile = 0;
            ReadableByteChannel rbc = Channels.newChannel(url.openStream());
            FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
            outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            outputStream.close();

            // SECOND TEST - decompress from FileInputStream
            GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
            buffer = new byte[1024];
            bytesRead = -1;
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPFile += bytesRead;
            }
            gzipis.close();

            // The Results - these numbers should match but they don't
            System.out.println("Read " + bytesFromGZIPDirect + " bytes from HTTP->GZIP");
            System.out.println("Read " + bytesFromGZIPFile + " bytes from HTTP->file->GZIP");
            System.out.println("Output from HTTP->GZIP saved to file " + testGZFileName);
        }
    }

Edit

Per @VGR's comment, the streams and the associated channel in the demonstration code are now closed.

Update

The problem does indeed seem to be specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:

    // Original file from CommonCrawl hosted on S3
    URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
    // Recompressed file hosted on S3
    URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

    test(originals3, "originalhost.txt");
    test(rezippeds3, "rezippedhost.txt");

URL rezippeds3 points to the WET archive file that I downloaded, uncompressed, recompressed, and then re-uploaded to S3. You will see the following output:

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 7212400 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file originals3.txt
    -----
    Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 448974935 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file rezippeds3.txt

As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and re-uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both the original and the recompressed files onto a traditional Apache web server and was able to replicate the results, so S3 does not seem to have anything to do with the problem.

So. I have a new question.

New Question

Why does a FileInputStream behave differently than an HTTPInputStream when reading the same content? If it is the exact same file, why does:

new GZIPInputStream(urlConnection.getInputStream());

behave any differently than

new GZIPInputStream(new FileInputStream("./test.wet.gz"));

?? Isn't an input stream an input stream?

Root Cause Discussion

It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the InputStream. However, HTTPInputStream.available() returns the number of bytes that can be read before a blocking IO request would be needed to refill the buffer. (See the Java Docs for details.)
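To see the difference concretely, here is a tiny sketch (my own illustration; the URL is just a placeholder endpoint, not one of the test files) comparing the two:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class AvailableDemo {
        public static void main(String[] args) throws Exception {
            // ByteArrayInputStream: available() is the total number of unread bytes
            InputStream bais = new ByteArrayInputStream(new byte[100000]);
            System.out.println("ByteArrayInputStream available(): " + bais.available()); // 100000

            // HTTP-backed stream: available() only reports what is already buffered
            // locally, so it can be 0 even though the server has plenty more to send
            InputStream his = new URL("https://example.com/").openStream();
            System.out.println("HTTP stream available(): " + his.available()); // often 0 or a small number
            his.close();
        }
    }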

The problem is that GZIPInputStream uses the output of .available() to determine whether an additional GZIP file might be available in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from the OpenJDK source file GZIPInputStream.java, method readTrailer():

  if (this.in.available() > 0 || n > 26) { 

If the HTTPInputStream's read buffer happens to empty right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds 0 because it would need to go out to the network to refill its buffer, and GZIPInputStream therefore treats the file as complete and closes prematurely.

The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer will empty right at the end of one of the concatenated GZIP files and GZIPInputStream will close prematurely. This explains the problem demonstrated in the question.
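The behavior can be reproduced without S3 or Common Crawl at all. The following sketch (my own simulation, not code from the question or the JDK) concatenates two small GZIP members and feeds them through a wrapper whose read() never crosses the first member boundary and whose available() returns 0, which is roughly what the HTTPInputStream looks like to GZIPInputStream when its buffer empties at a member boundary. Only the first member is decompressed, and no exception is thrown:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class PrematureEndDemo {

        // Simulates an HTTP stream whose buffer happens to empty exactly at the
        // end of the first GZIP member: a single read() never crosses the member
        // boundary, and available() reports 0 (nothing buffered locally).
        static class BoundaryStream extends FilterInputStream {
            private final long boundary;
            private long pos = 0;

            BoundaryStream(InputStream in, long boundary) {
                super(in);
                this.boundary = boundary;
            }

            @Override
            public int read() throws IOException {
                int b = in.read();
                if (b != -1) pos++;
                return b;
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                if (pos < boundary) {
                    len = (int) Math.min(len, boundary - pos); // stop at the member boundary
                }
                int n = in.read(b, off, len);
                if (n > 0) pos += n;
                return n;
            }

            @Override
            public int available() {
                return 0; // like an HTTP stream with an empty buffer
            }
        }

        // Compress a single string into one complete GZIP member
        static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            byte[] member1 = gzip("first member\n");
            byte[] member2 = gzip("second member\n");

            // Two concatenated GZIP members in one stream
            ByteArrayOutputStream both = new ByteArrayOutputStream();
            both.write(member1);
            both.write(member2);

            GZIPInputStream gzipis = new GZIPInputStream(
                    new BoundaryStream(new ByteArrayInputStream(both.toByteArray()), member1.length));

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            gzipis.close();

            // Prints only "first member" -- the second member is silently dropped
            System.out.print(out.toString(StandardCharsets.UTF_8.name()));
        }
    }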

Solution and Workaround

This GIST contains a patch against jdk8u152-b00 revision 12039, plus two jtreg tests, which removes the (in my humble opinion) incorrect reliance on .available().

If you cannot patch the JDK, a workaround is to make sure that available() always returns > 0, which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately HTTPInputStream is private, so you cannot subclass it directly; instead, extend InputStream and wrap the HTTPInputStream. The code below demonstrates this workaround.

Demonstration Code and Output

Here is the output showing that, when the HTTPInputStream is wrapped as discussed, GZIPInputStream produces identical results when reading the concatenated GZIP from a file and when reading it directly from HTTP.

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 448974935 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile0.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 451171329 bytes from HTTP->GZIP
    Read 451171329 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile40.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 453183600 bytes from HTTP->GZIP
    Read 453183600 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile500.wet

Here is the demonstration code from the question modified to use the InputStream wrapper.

    import java.net.*;
    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.nio.channels.*;

    public class GZIPTest {

        // Here is a wrapper class that wraps an InputStream
        // but always returns > 0 when .available() is called.
        // This will cause GZIPInputStream to always make another
        // call to the InputStream to check for an additional
        // concatenated GZIP file in the stream.
        public static class AvailableInputStream extends InputStream {
            private InputStream is;

            AvailableInputStream(InputStream inputstream) {
                is = inputstream;
            }

            public int read() throws IOException {
                return(is.read());
            }

            public int read(byte[] b) throws IOException {
                return(is.read(b));
            }

            public int read(byte[] b, int off, int len) throws IOException {
                return(is.read(b, off, len));
            }

            public void close() throws IOException {
                is.close();
            }

            public int available() throws IOException {
                // Always say that we have 1 more byte in the
                // buffer, even when we don't
                int a = is.available();
                if (a == 0) {
                    return(1);
                } else {
                    return(a);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Our three test files from CommonCrawl
            URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

            /*
             * Test the URLs and display the results
             */
            test(url0, "testfile0.wet");
            System.out.println("------");
            test(url40, "testfile40.wet");
            System.out.println("------");
            test(url500, "testfile500.wet");
        }

        public static void test(URL url, String testGZFileName) throws Exception {
            System.out.println("Testing URL " + url.toString());

            // First directly wrap the HTTP InputStream with GZIPInputStream
            // and count the number of bytes we read.
            // Go ahead and save the extracted stream to a file for further inspection.
            System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
            int bytesFromGZIPDirect = 0;
            URLConnection urlConnection = url.openConnection();
            // Wrap the HTTPInputStream in our AvailableInputStream
            AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
            GZIPInputStream gzipishttp = new GZIPInputStream(ais);
            FileOutputStream directGZIPOutStream = new FileOutputStream("./" + testGZFileName);

            int buffersize = 1024;
            byte[] buffer = new byte[buffersize];
            int bytesRead = -1;
            while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
                bytesFromGZIPDirect += bytesRead;
                directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
            }
            gzipishttp.close();
            directGZIPOutStream.close();

            // Save the GZIPed file locally
            System.out.println("Testing saving to file before decompression");
            ReadableByteChannel rbc = Channels.newChannel(url.openStream());
            FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
            outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            outputStream.close();

            // Now decompress the local file and count the number of bytes
            int bytesFromGZIPFile = 0;
            GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
            buffer = new byte[1024];
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPFile += bytesRead;
            }
            gzipis.close();

            // The Results
            System.out.println("Read " + bytesFromGZIPDirect + " bytes from HTTP->GZIP");
            System.out.println("Read " + bytesFromGZIPFile + " bytes from HTTP->file->GZIP");
            System.out.println("Output from HTTP->GZIP saved to file " + testGZFileName);
        }
    }