GZIPInputStream closes prematurely when decompressing an HTTPInputStream

See the updated question in the Edit section below.

I am trying to decompress a large (~300M) GZIPed file from Amazon S3 on the fly using GZIPInputStream, but it only outputs a portion of the file; however, if I download it to the filesystem before decompressing, GZIPInputStream decompresses the entire file.

How can I get GZIPInputStream to decompress the entire HTTPInputStream instead of just the first part of it?

What I've Tried

See the update in the Edit section below.

I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, since everything is being treated as a binary stream and no parsing of the WET records in the file is happening at all).

The closest question I could find is GZIPInputStream is prematurely closed when reading from s3. The answer to that question was that some GZIP files are actually multiple appended GZIP files and that GZIPInputStream does not handle them well. However, if that is the case, why does GZIPInputStream work fine on a local copy of the file?
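For reference, here is a minimal sketch (my own illustration, using only the standard java.util.zip API; the class name, file name, and strings are made up) that builds such a concatenated GZIP file locally and reads it back with a single GZIPInputStream. Both members come out, which matches what I see with the locally downloaded copy of the WET archive:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ConcatenatedGzipFileTest {

        // Compress a single string into one complete GZIP member
        static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            // Two independent GZIP members appended into one file,
            // the same layout the WET archives reportedly use
            Path f = Files.createTempFile("concat", ".gz");
            Files.write(f, gzip("first member\n"));
            Files.write(f, gzip("second member\n"), StandardOpenOption.APPEND);

            // Read the whole file back through a single GZIPInputStream
            int total = 0;
            try (GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream(f.toFile()))) {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                    total += bytesRead;
                    System.out.print(new String(buffer, 0, bytesRead, StandardCharsets.UTF_8));
                }
            }
            // Prints both lines: reading from a local file, GZIPInputStream
            // keeps going past the end of the first member
            System.out.println("Decompressed " + total + " bytes");
        }
    }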

Demonstration Code and Output

Below is a piece of sample code that demonstrates the problem I am seeing. I have tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux machines on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.

Output

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 87894 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile0.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 1772936 bytes from HTTP->GZIP
    Read 451171329 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile40.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 89217 bytes from HTTP->GZIP
    Read 453183600 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile500.wet

Sample Code

    import java.net.*;
    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.nio.channels.*;

    public class GZIPTest {

        public static void main(String[] args) throws Exception {
            // Our three test files from CommonCrawl
            URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

            /*
             * Test the URLs and display the results
             */
            test(url0, "testfile0.wet");
            System.out.println("------");
            test(url40, "testfile40.wet");
            System.out.println("------");
            test(url500, "testfile500.wet");
        }

        public static void test(URL url, String testGZFileName) throws Exception {
            System.out.println("Testing URL " + url.toString());

            // First directly wrap the HTTPInputStream with GZIPInputStream
            // and count the number of bytes we read.
            // Go ahead and save the extracted stream to a file for further inspection.
            System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
            int bytesFromGZIPDirect = 0;
            URLConnection urlConnection = url.openConnection();
            FileOutputStream directGZIPOutStream = new FileOutputStream("./" + testGZFileName);

            // FIRST TEST - Decompress from HTTPInputStream
            GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

            byte[] buffer = new byte[1024];
            int bytesRead = -1;
            while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPDirect += bytesRead;
                directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
            }
            gzipishttp.close();
            directGZIPOutStream.close();

            // Now save the GZIPed file locally
            System.out.println("Testing saving to file before decompression");
            int bytesFromGZIPFile = 0;
            ReadableByteChannel rbc = Channels.newChannel(url.openStream());
            FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
            outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            outputStream.close();

            // SECOND TEST - decompress from FileInputStream
            GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
            buffer = new byte[1024];
            bytesRead = -1;
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPFile += bytesRead;
            }
            gzipis.close();

            // The Results - these numbers should match but they don't
            System.out.println("Read " + bytesFromGZIPDirect + " bytes from HTTP->GZIP");
            System.out.println("Read " + bytesFromGZIPFile + " bytes from HTTP->file->GZIP");
            System.out.println("Output from HTTP->GZIP saved to file " + testGZFileName);
        }
    }

Edit

Per @VGR's comment, the streams and the associated channel in the demonstration code are now closed.

Update

The problem does indeed seem to be specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:

    // Original file from CommonCrawl hosted on S3
    URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
    // Recompressed file hosted on S3
    URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

    test(originals3, "originalhost.txt");
    test(rezippeds3, "rezippedhost.txt");

URL rezippeds3 points to the WET archive file that I downloaded, uncompressed, recompressed, and then re-uploaded to S3. You will see the following output:

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 7212400 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file originals3.txt
    -----
    Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 448974935 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file rezippeds3.txt

As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and re-uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both the original and the recompressed files onto a traditional Apache web server and was able to replicate the results, so S3 does not seem to have anything to do with the problem.

So. I have a new question.

New Question

Why does a FileInputStream behave differently than an HTTPInputStream when reading the same content? If it is the exact same file, why does:

new GZIPInputStream(urlConnection.getInputStream());

behave any differently than

new GZIPInputStream(new FileInputStream("./test.wet.gz"));

?? Isn't an input stream an input stream?

Root Cause Discussion

It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the InputStream. However, HTTPInputStream.available() returns the number of bytes that can be read before a blocking IO request would be needed to refill the buffer. (See the Java Docs for details.)
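To see the difference concretely, here is a tiny sketch (my own illustration; the URL is just a placeholder endpoint, not one of the test files) comparing the two:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class AvailableDemo {
        public static void main(String[] args) throws Exception {
            // ByteArrayInputStream: available() is the total number of unread bytes
            InputStream bais = new ByteArrayInputStream(new byte[100000]);
            System.out.println("ByteArrayInputStream available(): " + bais.available()); // 100000

            // HTTP-backed stream: available() only reports what is already buffered
            // locally, so it can be 0 even though the server has plenty more to send
            InputStream his = new URL("https://example.com/").openStream();
            System.out.println("HTTP stream available(): " + his.available()); // often 0 or a small number
            his.close();
        }
    }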

The problem is that GZIPInputStream uses the output of .available() to determine whether an additional GZIP file might be available in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from the OpenJDK source file GZIPInputStream.java, method readTrailer():

  if (this.in.available() > 0 || n > 26) { 

If the HTTPInputStream's read buffer happens to empty right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds 0 because it would need to go out to the network to refill its buffer, and GZIPInputStream therefore treats the file as complete and closes prematurely.

The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer will empty right at the end of one of the concatenated GZIP files and GZIPInputStream will close prematurely. This explains the problem demonstrated in the question.
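The behavior can be reproduced without S3 or Common Crawl at all. The following sketch (my own simulation, not code from the question or the JDK) concatenates two small GZIP members and feeds them through a wrapper whose read() never crosses the first member boundary and whose available() returns 0, which is roughly what the HTTPInputStream looks like to GZIPInputStream when its buffer empties at a member boundary. Only the first member is decompressed, and no exception is thrown:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class PrematureEndDemo {

        // Simulates an HTTP stream whose buffer happens to empty exactly at the
        // end of the first GZIP member: a single read() never crosses the member
        // boundary, and available() reports 0 (nothing buffered locally).
        static class BoundaryStream extends FilterInputStream {
            private final long boundary;
            private long pos = 0;

            BoundaryStream(InputStream in, long boundary) {
                super(in);
                this.boundary = boundary;
            }

            @Override
            public int read() throws IOException {
                int b = in.read();
                if (b != -1) pos++;
                return b;
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                if (pos < boundary) {
                    len = (int) Math.min(len, boundary - pos); // stop at the member boundary
                }
                int n = in.read(b, off, len);
                if (n > 0) pos += n;
                return n;
            }

            @Override
            public int available() {
                return 0; // like an HTTP stream with an empty buffer
            }
        }

        // Compress a single string into one complete GZIP member
        static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            byte[] member1 = gzip("first member\n");
            byte[] member2 = gzip("second member\n");

            // Two concatenated GZIP members in one stream
            ByteArrayOutputStream both = new ByteArrayOutputStream();
            both.write(member1);
            both.write(member2);

            GZIPInputStream gzipis = new GZIPInputStream(
                    new BoundaryStream(new ByteArrayInputStream(both.toByteArray()), member1.length));

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            gzipis.close();

            // Prints only "first member" -- the second member is silently dropped
            System.out.print(out.toString(StandardCharsets.UTF_8.name()));
        }
    }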

Solution and Workaround

This GIST contains a patch against jdk8u152-b00 revision 12039, plus two jtreg tests, which removes the (in my humble opinion) incorrect reliance on .available().

If you cannot patch the JDK, a workaround is to make sure that available() always returns > 0, which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately HTTPInputStream is private, so you cannot subclass it directly; instead, extend InputStream and wrap the HTTPInputStream. The code below demonstrates this workaround.

Demonstration Code and Output

Here is the output showing that, when the HTTPInputStream is wrapped as discussed, GZIPInputStream produces identical results when reading the concatenated GZIP from a file and when reading it directly from HTTP.

    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 448974935 bytes from HTTP->GZIP
    Read 448974935 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile0.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 451171329 bytes from HTTP->GZIP
    Read 451171329 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile40.wet
    ------
    Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
    Testing HTTP Input Stream direct to GZIPInputStream
    Testing saving to file before decompression
    Read 453183600 bytes from HTTP->GZIP
    Read 453183600 bytes from HTTP->file->GZIP
    Output from HTTP->GZIP saved to file testfile500.wet

Here is the demonstration code from the question modified to use the InputStream wrapper.

    import java.net.*;
    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.nio.channels.*;

    public class GZIPTest {

        // Here is a wrapper class that wraps an InputStream
        // but always returns > 0 when .available() is called.
        // This will cause GZIPInputStream to always make another
        // call to the InputStream to check for an additional
        // concatenated GZIP file in the stream.
        public static class AvailableInputStream extends InputStream {
            private InputStream is;

            AvailableInputStream(InputStream inputstream) {
                is = inputstream;
            }

            public int read() throws IOException {
                return(is.read());
            }

            public int read(byte[] b) throws IOException {
                return(is.read(b));
            }

            public int read(byte[] b, int off, int len) throws IOException {
                return(is.read(b, off, len));
            }

            public void close() throws IOException {
                is.close();
            }

            public int available() throws IOException {
                // Always say that we have 1 more byte in the
                // buffer, even when we don't
                int a = is.available();
                if (a == 0) {
                    return(1);
                } else {
                    return(a);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Our three test files from CommonCrawl
            URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
            URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

            /*
             * Test the URLs and display the results
             */
            test(url0, "testfile0.wet");
            System.out.println("------");
            test(url40, "testfile40.wet");
            System.out.println("------");
            test(url500, "testfile500.wet");
        }

        public static void test(URL url, String testGZFileName) throws Exception {
            System.out.println("Testing URL " + url.toString());

            // First directly wrap the HTTP InputStream with GZIPInputStream
            // and count the number of bytes we read.
            // Go ahead and save the extracted stream to a file for further inspection.
            System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
            int bytesFromGZIPDirect = 0;
            URLConnection urlConnection = url.openConnection();
            // Wrap the HTTPInputStream in our AvailableInputStream
            AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
            GZIPInputStream gzipishttp = new GZIPInputStream(ais);
            FileOutputStream directGZIPOutStream = new FileOutputStream("./" + testGZFileName);

            int buffersize = 1024;
            byte[] buffer = new byte[buffersize];
            int bytesRead = -1;
            while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
                bytesFromGZIPDirect += bytesRead;
                directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
            }
            gzipishttp.close();
            directGZIPOutStream.close();

            // Save the GZIPed file locally
            System.out.println("Testing saving to file before decompression");
            ReadableByteChannel rbc = Channels.newChannel(url.openStream());
            FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
            outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            outputStream.close();

            // Now decompress the local file and count the number of bytes
            int bytesFromGZIPFile = 0;
            GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
            buffer = new byte[1024];
            while ((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
                bytesFromGZIPFile += bytesRead;
            }
            gzipis.close();

            // The Results
            System.out.println("Read " + bytesFromGZIPDirect + " bytes from HTTP->GZIP");
            System.out.println("Read " + bytesFromGZIPFile + " bytes from HTTP->file->GZIP");
            System.out.println("Output from HTTP->GZIP saved to file " + testGZFileName);
        }
    }