FileChannel, ByteBuffer and Hashing Files

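I built a file hashing method in Java that takes the string representation of a filepath + filename as input and then calculates the hash of that file. The hash can be any of the hashing algorithms Java supports natively, such as MD2 through SHA-512.

I am trying to eke out every last drop of performance, since this method is an integral part of a project I'm working on. I was advised to try using FileChannel instead of a regular FileInputStream.

My original method: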

    /**
     * Gets Hash of file.
     *
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use.
     *     Supported algorithms are:
     *     MD2, MD5
     *     SHA-1
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
        StringBuffer hexString = null;
        try {
            MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
            FileInputStream fis = new FileInputStream(file);

            byte[] dataBytes = new byte[1024];

            int nread = 0;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            fis.close();
            byte[] mdbytes = md.digest();

            hexString = new StringBuffer();
            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();
        } catch (NoSuchAlgorithmException | HashTypeException e) {
            throw new HashTypeException("Unsupported Hash Algorithm.", e);
        }
    }

Refactored method:

    /**
     * Gets Hash of file.
     *
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use.
     *     Supported algorithms are:
     *     MD2, MD5
     *     SHA-1
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {
        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocate(1024); // allocation in bytes

            int bytes;

            while ((bytes = fc.read(bbf)) != -1) {
                md.update(bbf.array(), 0, bytes);
            }

            fc.close();
            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

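Both return a correct hash, but the refactored method only seems to cooperate on small files. When I pass in a large file, it completely chokes and I can't figure out why. I'm new to NIO, so please advise.

EDIT: Forgot to mention that I'm throwing SHA-512 through it for testing.

UPDATE: Updating with my current method.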

    /**
     * Gets Hash of file.
     *
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use.
     *     Supported algorithms are:
     *     MD2, MD5
     *     SHA-1
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {
        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocateDirect(8192); // allocation in bytes - 1024, 2048, 4096, 8192

            int b;

            b = fc.read(bbf);

            while ((b != -1) && (b != 0)) {
                bbf.flip();

                byte[] bytes = new byte[b];
                bbf.get(bytes);

                md.update(bytes, 0, b);

                bbf.clear();
                b = fc.read(bbf);
            }

            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

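So I attempted to benchmark the MD5 of a 2.92GB file using my original example and my latest updated example. Of course any benchmark is relative, since OS and disk caching and other "magic" will skew repeated reads of the same file... but here's a shot at some benchmarks. I loaded each method up and fired it off 5 times after compiling it fresh. The benchmark was taken from the last (5th) run, as this would be the "hottest" run for that algorithm, with any "magic" already in effect (in my theory, anyway).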

    Here's the benchmarks so far:
    Original Method - 14.987909 (s)
    Latest Method   - 11.236802 (s)

That is a 25.03% decrease in the time taken to hash the same 2.92GB file. Pretty good.
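For anyone who wants to repeat the measurement, here is a rough sketch of the "run it five times and keep the last timing" approach, plus the arithmetic behind the percentage. The names `BenchSketch`, `lastRunSeconds` and `percentDecrease` are made up for illustration, and this is not a substitute for a proper benchmarking harness.

```java
import java.util.concurrent.Callable;

final class BenchSketch {
    /** Runs the task `runs` times and returns the elapsed seconds of the final run only. */
    static double lastRunSeconds(Callable<?> task, int runs) throws Exception {
        double seconds = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            // Overwritten on every pass, so only the last (warmest) run survives
            seconds = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    /** Percentage by which `after` is smaller than `before`. */
    static double percentDecrease(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // The two timings reported above, in seconds
        System.out.println(percentDecrease(14.987909, 11.236802));
    }
}
```

A call such as `lastRunSeconds(() -> getHash(path, "MD5"), 5)` would time the hashing method itself, where `getHash` and `path` stand in for whichever method and file are being compared.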

3 suggestions:

1) Clear the buffer after each read

    while ((bytes = fc.read(bbf)) != -1) {
        md.update(bbf.array(), 0, bytes);
        bbf.clear();
    }
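As a self-contained sketch of the same idea, assuming a heap buffer like the one in the question: once the buffer is full and never cleared, `FileChannel.read` has no room to write into and returns 0 rather than -1, so the loop spins forever on any file larger than the buffer. The class name `ClearDemo` is only illustrative.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ClearDemo {
    /** Hashes a file through a FileChannel, resetting the buffer after every read. */
    static byte[] hash(Path path, String algo) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algo);
        ByteBuffer buf = ByteBuffer.allocate(1024); // heap buffer, so array() is available
        try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)) {
            int n;
            while ((n = fc.read(buf)) != -1) {
                // The buffer was empty before the read, so the data starts at index 0
                md.update(buf.array(), 0, n);
                // Without this, position stays at the limit and the next read() returns 0
                buf.clear();
            }
        }
        return md.digest();
    }
}
```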

2) Do not close both fc and fis; it's redundant, and closing fis is enough. The FileInputStream.close API says:

 If this stream has an associated channel then the channel is closed as well.
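A quick way to check that behaviour for yourself is sketched below; `CloseDemo` and `channelClosedWithStream` are names invented for this example.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

final class CloseDemo {
    /** Returns true if closing the stream also closed the channel obtained from it. */
    static boolean channelClosedWithStream(String file) throws IOException {
        FileInputStream fis = new FileInputStream(file);
        FileChannel fc = fis.getChannel();
        fis.close();          // only the stream is closed explicitly
        return !fc.isOpen();  // the associated channel should now report closed
    }
}
```

In real code, opening the stream in a try-with-resources statement would additionally guarantee the close when an exception is thrown part-way through the read loop.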

3) If you want a performance improvement with FileChannel, use

 ByteBuffer.allocateDirect(1024);
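One caveat when combining this with a loop that calls `bbf.array()`: a direct buffer typically has no accessible backing array, so `array()` throws `UnsupportedOperationException`. The sketch below, which assumes you are free to change the loop, passes the buffer itself to `MessageDigest.update(ByteBuffer)` instead. The class name `DirectDemo` is illustrative.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DirectDemo {
    /** Hashes a file using a direct buffer, without copying into a byte[] in user code. */
    static byte[] hash(Path path, String algo) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algo);
        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)) {
            while (fc.read(buf) != -1) {
                buf.flip();     // switch from filling the buffer to draining it
                md.update(buf); // consumes the bytes between position and limit
                buf.clear();    // make the whole buffer writable again
            }
        }
        return md.digest();
    }
}
```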

Another possible improvement might come if the code only allocated the temp buffer once.

e.g.

    int bufsize = 8192;
    ByteBuffer buffer = ByteBuffer.allocateDirect(bufsize);
    byte[] temp = new byte[bufsize];
    int b = channel.read(buffer);

    while (b > 0) {
        buffer.flip();

        buffer.get(temp, 0, b);
        md.update(temp, 0, b);
        buffer.clear();

        b = channel.read(buffer);
    }

Addendum

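Note: There is a bug in the string building code. Integer.toHexString does not pad, so any byte below 0x10 (zero included) is printed as a single digit instead of two. This can be fixed easily, e.g.

    hexString.append(String.format("%02x", 0xFF & mdbytes[i]));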
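If you would rather avoid formatting calls inside the loop, here is a small sketch of a conversion that always emits two lower-case hex digits per byte. `HexDemo` and `toHex` are names chosen for this example only.

```java
final class HexDemo {
    /** Converts bytes to lower-case hex, always two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }
}
```

With this, the bytes 0x00, 0x0A and 0xFF come out as `000aff`, whereas the unpadded version would produce `0aff`.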

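Also, as an experiment, I rewrote the code to use mapped byte buffers. It runs about 30% faster (6-7 millis vs. 9-11 millis, FWIW). I expect you could get more out of it if you wrote hashing code that operated directly on the byte buffer.

I attempted to account for JVM initialization and file system caching by hashing a different file with each algorithm before starting the timer. The first run through the code is about 25 times slower than a normal run. This appears to be due to JVM initialization, because all runs in the timing loop are roughly the same length. They do not appear to benefit from caching. I tested with the MD5 algorithm. Also, during the timing portion, only one algorithm is run for the duration of the test program.

The code in the loop is shorter, so potentially more understandable. I'm not 100% certain what kind of pressure memory mapping many files under high volume would exert on the JVM, so that is something you would need to research and consider if you want to run this sort of solution under load.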

    public static byte[] hash(File file, String hashAlgo) throws IOException {

        FileInputStream inputStream = null;

        try {
            MessageDigest md = MessageDigest.getInstance(hashAlgo);
            inputStream = new FileInputStream(file);
            FileChannel channel = inputStream.getChannel();

            long length = file.length();
            if (length > Integer.MAX_VALUE) {
                // you could make this work with some care,
                // but this code does not bother.
                throw new IOException("File " + file.getAbsolutePath() + " is too large.");
            }

            ByteBuffer buffer = channel.map(MapMode.READ_ONLY, 0, length);

            int bufsize = 1024 * 8;
            byte[] temp = new byte[bufsize];
            int bytesRead = 0;

            while (bytesRead < length) {
                int numBytes = (int) length - bytesRead >= bufsize ?
                        bufsize :
                        (int) length - bytesRead;
                buffer.get(temp, 0, numBytes);
                md.update(temp, 0, numBytes);
                bytesRead += numBytes;
            }

            byte[] mdbytes = md.digest();
            return mdbytes;

        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported Hash Algorithm.", e);
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
        }
    }

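Here is an example of file hashing using NIO:

Path
FileChannel
MappedByteBuffer

It also avoids using byte[], so I think this should be an improved version of the above. There is also a second NIO example, in which the hash value is stored in the file's user attributes. This can be used for HTTP ETag generation, or for other cases where the file does not change.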

    public static final byte[] getFileHash(final File src, final String hashAlgo)
            throws IOException, NoSuchAlgorithmException {
        final int BUFFER = 32 * 1024;
        final Path file = src.toPath();
        try (final FileChannel fc = FileChannel.open(file)) {
            final long size = fc.size();
            final MessageDigest hash = MessageDigest.getInstance(hashAlgo);
            long position = 0;
            while (position < size) {
                // Map the next window starting at the current position (not at
                // offset 0), so that each pass hashes new data.
                final MappedByteBuffer data = fc.map(FileChannel.MapMode.READ_ONLY,
                        position, Math.min(size - position, BUFFER));
                if (!data.isLoaded()) data.load();
                System.out.println("POS:" + position);
                hash.update(data);
                position += data.limit();
                if (position >= size) break;
            }
            return hash.digest();
        }
    }

    public static final byte[] getCachedFileHash(final File src, final String hashAlgo)
            throws NoSuchAlgorithmException, FileNotFoundException, IOException {
        final Path path = src.toPath();
        if (!Files.isReadable(path)) return null;
        final UserDefinedFileAttributeView view =
                Files.getFileAttributeView(path, UserDefinedFileAttributeView.class);
        final String name = "user.hash." + hashAlgo;
        final ByteBuffer bb = ByteBuffer.allocate(64);
        try {
            view.read(name, bb);
            bb.flip();
            // Copy only the bytes that were read; array() would return all 64
            // bytes of the backing array, even for shorter digests.
            final byte[] cached = new byte[bb.remaining()];
            bb.get(cached);
            return cached;
        } catch (final NoSuchFileException t) {
            // Not yet calculated
        } catch (final Throwable t) {
            t.printStackTrace();
        }
        System.out.println("Hash not found calculation");
        final byte[] hash = getFileHash(src, hashAlgo);
        view.write(name, ByteBuffer.wrap(hash));
        return hash;
    }
