Need high-performance text file reading and parsing (split()-like)

Currently I have:

  • 1 file containing 9 million lines
  • BufferedReader.readLine() to read each line
  • String.split() to parse each line (pipe-delimited columns)
  • A lot of RAM used (because of String interning?)

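The problem: as you may have guessed, I would like a better way to read and parse this file…

Questions:

  • How can I read this relatively large file using the fewest resources (knowing that each line needs some kind of "split" on the pipe)?
  • Can I replace String.split with something else (say StringBuilder, CharBuffer, …)?
  • What is the best way to avoid using Strings while reading the file, until I have split each line into its final character sequences?
  • I don't mind using something other than String in my POJOs, if you have something better?
  • The file will be reloaded every few hours, if that helps you suggest a solution?

Thanks :)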

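A 9 million line file should take less than a few seconds. Most of that time is spent reading the data into memory. How you split the data is unlikely to make much difference.

BufferedReader and String.split sound fine to me. I wouldn't use interning unless you are sure it will help. (It won't intern() for you.)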
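
To illustrate the remark about interning, here is a minimal sketch (the class and method names are made up for this example). On typical JVMs, the fields returned by String.split are separate String objects even when their contents are equal; they only end up shared if you call intern() yourself.

```java
public class InternDemo {
    // True if the first two fields of a pipe-delimited line are the same object.
    // For equal fields produced by split() this is normally false.
    static boolean sameObjectAfterSplit(String line) {
        String[] parts = line.split("\\|");
        return parts[0] == parts[1];
    }

    // True if the first two fields are the same object after explicit interning.
    // intern() returns one canonical instance for equal contents.
    static boolean sameObjectAfterIntern(String line) {
        String[] parts = line.split("\\|");
        return parts[0].intern() == parts[1].intern();
    }

    public static void main(String[] args) {
        System.out.println(sameObjectAfterSplit("foo|foo"));
        System.out.println(sameObjectAfterIntern("foo|foo"));
    }
}
```

Whether interning actually saves memory depends on how many repeated values the columns contain, so it is worth measuring before adopting it.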

The latest versions of Java 6 have some performance improvements in String handling. I would try Java 6 update 25 and see if it is faster.


EDIT: After running some tests, I found that split is very slow, and you can improve on it.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class SplitBenchmark {
        public static void main(String... args) throws IOException {
            // Write a test file: 1 million lines of 40 space-terminated numbers.
            long start1 = System.nanoTime();
            PrintWriter pw = new PrintWriter("deleteme.txt");
            StringBuilder sb = new StringBuilder();
            for (int j = 1000; j < 1040; j++)
                sb.append(j).append(' ');
            String outLine = sb.toString();
            for (int i = 0; i < 1000 * 1000; i++)
                pw.println(outLine);
            pw.close();
            long time1 = System.nanoTime() - start1;
            System.out.printf("Took %f seconds to write%n", time1 / 1e9);

            {
                // Baseline: read the raw characters without any parsing.
                long start = System.nanoTime();
                FileReader fr = new FileReader("deleteme.txt");
                char[] buffer = new char[1024 * 1024];
                while (fr.read(buffer) > 0) ;
                fr.close();
                long time = System.nanoTime() - start;
                System.out.printf("Took %f seconds to read text as fast as possible%n", time / 1e9);
            }
            {
                // readLine() plus String.split().
                long start = System.nanoTime();
                BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
                String line;
                while ((line = br.readLine()) != null) {
                    String[] words = line.split(" ");
                }
                br.close();
                long time = System.nanoTime() - start;
                System.out.printf("Took %f seconds to read lines and split%n", time / 1e9);
            }
            {
                // readLine() plus a precompiled Pattern.
                long start = System.nanoTime();
                BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
                String line;
                Pattern splitSpace = Pattern.compile(" ");
                while ((line = br.readLine()) != null) {
                    String[] words = splitSpace.split(line, 0);
                }
                br.close();
                long time = System.nanoTime() - start;
                System.out.printf("Took %f seconds to read lines and split (precompiled)%n", time / 1e9);
            }
            {
                // readLine() plus manual splitting with indexOf/substring.
                long start = System.nanoTime();
                BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
                String line;
                List<String> words = new ArrayList<String>();
                while ((line = br.readLine()) != null) {
                    words.clear();
                    int pos = 0, end;
                    // Every token in the test file is followed by a space,
                    // so no text remains after the last delimiter.
                    while ((end = line.indexOf(' ', pos)) >= 0) {
                        words.add(line.substring(pos, end));
                        pos = end + 1;
                    }
                    // System.out.println(words);
                }
                br.close();
                long time = System.nanoTime() - start;
                System.out.printf("Took %f seconds to read lines and break using indexOf%n", time / 1e9);
            }
        }
    }

prints

    Took 1.757984 seconds to write
    Took 1.158652 seconds to read text as fast as possible
    Took 6.671587 seconds to read lines and split
    Took 4.210100 seconds to read lines and split (precompiled)
    Took 1.642296 seconds to read lines and break using indexOf

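So it appears that splitting the string yourself is an improvement, and gets you about as close as you can to reading the text as fast as possible. The only way to read it faster is to treat the file as binary / ASCII-7. ;)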
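
For the pipe-delimited file in the question, the indexOf approach could be adapted roughly as follows. This is only a sketch: the class and method names are hypothetical, and unlike the benchmark loop it also collects the text after the last delimiter, on the assumption that real lines do not end with a pipe.

```java
import java.util.ArrayList;
import java.util.List;

public class PipeSplitter {
    // Splits a line on '|' using indexOf/substring instead of a regex.
    static List<String> splitOnPipe(String line) {
        List<String> fields = new ArrayList<String>();
        int pos = 0, end;
        while ((end = line.indexOf('|', pos)) >= 0) {
            fields.add(line.substring(pos, end));
            pos = end + 1;
        }
        // Whatever follows the last pipe (possibly empty) is the final field.
        fields.add(line.substring(pos));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitOnPipe("id|name|price"));
    }
}
```

Note that this keeps empty trailing fields, whereas String.split with its default limit drops them, so the two are not exact drop-in replacements for each other.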