Reading a large amount of data from a file in Java

My text file contains 1 000 002 numbers in the following form:

 123 456 1 2 3 4 5 6 .... 999999 100000

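Now I need to read that data and assign the first two numbers to int variables, and all the rest (1 000 000 numbers) to an int[] array.

It's not a hard task, but it is painfully slow.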

My first attempt was `java.util.Scanner`:

    Scanner stdin = new Scanner(new File("./path"));
    int n = stdin.nextInt();
    int t = stdin.nextInt();
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = stdin.nextInt();
    }

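It works as expected, but it takes about 7500 ms to execute. I need to fetch that data within a few hundred milliseconds.

Then I tried `java.io.BufferedReader`:

Using BufferedReader.readLine() and String.split() I got the same results in about 1700 ms, but that is still too much.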
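The question does not show the code for this attempt. A minimal sketch of what a readLine() plus split() version might look like is given below. The class name `SplitReader` and the `Data` holder are hypothetical, and the sketch assumes (as the Scanner attempt does) that the whole file is a single line and that the first number is the element count.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.FileReader;
import java.io.IOException;

class SplitReader {
    /** Hypothetical holder for the two leading values and the remaining numbers. */
    static final class Data {
        final int n;
        final int t;
        final int[] array;

        Data(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    static Data read(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            // Assumption: all numbers sit on one line, separated by whitespace.
            String line = in.readLine();
            if (line == null) {
                throw new EOFException("file is empty");
            }
            // split() materializes one String per number, which is where much of the cost goes.
            String[] tokens = line.trim().split("\\s+");
            int n = Integer.parseInt(tokens[0]);
            int t = Integer.parseInt(tokens[1]);
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i + 2]);
            }
            return new Data(n, t, array);
        }
    }
}
```

Calling `SplitReader.read("./path")` would return the header values and the array together. The intermediate String[] with a million entries is the likely reason this approach is slower than a character-level parser.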

How can I read that amount of data in less than 1 second? The final result should be equal to:

    int n = 123;
    int t = 456;
    int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

According to trashgod's answer:

The StreamTokenizer solution is fast (it takes about 1400 ms), but it is still too slow:

    StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
    st.nextToken();
    int n = (int) st.nval;
    st.nextToken();
    int t = (int) st.nval;
    int array[] = new int[n];
    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

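P.S. There is no need for validation. I'm 100% sure that the data in the ./test_grz file is correct.

Thanks for all the answers, but I have already found a method that meets my criteria: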

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
    int n = readInt(bis);
    int t = readInt(bis);
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = readInt(bis);
    }

    // Reads the next run of ASCII digits as a non-negative int, skipping anything before it.
    private static int readInt(InputStream in) throws IOException {
        int ret = 0;
        boolean dig = false;
        for (int c = 0; (c = in.read()) != -1; ) {
            if (c >= '0' && c <= '9') {
                dig = true;
                ret = ret * 10 + c - '0';
            } else if (dig) break; // first non-digit after a number ends it
        }
        return ret;
    }

Reading 1 million integers takes only about 300 ms!

StreamTokenizer may be faster.

You can reduce the time for the StreamTokenizer result by wrapping the file in a BufferedReader:

    Reader r = null;
    try {
        r = new BufferedReader(new FileReader(file));
        final StreamTokenizer st = new StreamTokenizer(r);
        ...
    } finally {
        if (r != null)
            r.close();
    }

Also, don't forget to close your files, as I have shown here.
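On Java 7 or later, the same close-on-exit guarantee can be had with try-with-resources instead of an explicit finally block. The following is only a sketch under the question's assumptions (the first number is the element count, the data is valid); the class name `TokenizerReader` is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

class TokenizerReader {
    static int[] read(String path) throws IOException {
        // The reader is closed automatically when the try block exits, even on an exception.
        try (Reader r = new BufferedReader(new FileReader(path))) {
            StreamTokenizer st = new StreamTokenizer(r);
            st.nextToken();
            int n = (int) st.nval; // assumed to be the element count
            st.nextToken();
            int t = (int) st.nval; // second header value, unused in this sketch
            int[] array = new int[n];
            for (int i = 0; i < n && st.nextToken() == StreamTokenizer.TT_NUMBER; i++) {
                array[i] = (int) st.nval;
            }
            return array;
        }
    }
}
```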

You can also shave off some more time by using a custom tokenizer written just for your purposes:

    public class CustomTokenizer {

        private final Reader r;

        public CustomTokenizer(final Reader r) {
            this.r = r;
        }

        public int nextInt() throws IOException {
            int i = r.read();
            if (i == -1)
                throw new EOFException();

            char c = (char) i;

            // Skip any whitespace
            while (c == ' ' || c == '\n' || c == '\r') {
                i = r.read();
                if (i == -1)
                    throw new EOFException();
                c = (char) i;
            }

            // Accumulate digits until the next whitespace or end of input
            int result = (c - '0');
            while ((i = r.read()) >= 0) {
                c = (char) i;
                if (c == ' ' || c == '\n' || c == '\r')
                    break;
                result = result * 10 + (c - '0');
            }

            return result;
        }
    }

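Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, new lines, and digits.

If you read these results often and they don't change much, you should probably save the array and keep track of the file's last modified time. Then, if the file hasn't changed, just use the cached copy of the array, which will speed things up significantly. For example: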

    public class ArrayRetriever {

        private File inputFile;
        private long lastModified;
        private int[] lastResult;

        public ArrayRetriever(File file) {
            this.inputFile = file;
        }

        public int[] getResult() {
            // Reuse the cached array if the file has not changed since the last read
            if (lastResult != null && inputFile.lastModified() == lastModified)
                return lastResult;

            lastModified = inputFile.lastModified();

            // do logic to actually read the file here

            lastResult = array; // the array variable from your examples
            return lastResult;
        }
    }

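How much memory do you have in your computer? You could be running into GC issues.

The best thing to do is to process the data one line at a time if possible. Don't load it into an array. Load what you need, process it, write it out, and continue.

This will reduce your memory footprint while still using the same amount of file IO.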
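As a rough illustration of that idea, the sketch below consumes the numbers one at a time and never stores them. Because the question's file is a single long line, it streams token by token rather than line by line. Summing is only a stand-in for whatever processing is actually needed, and the class name `StreamingSum` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

class StreamingSum {
    /** Adds up every number in the source without keeping the numbers in memory. */
    static long sum(Reader source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new BufferedReader(source));
        long total = 0;
        // Each token is processed and then discarded, so memory use stays constant.
        while (st.nextToken() == StreamTokenizer.TT_NUMBER) {
            total += (long) st.nval;
        }
        return total;
    }
}
```

A call such as `StreamingSum.sum(new FileReader("./path"))` would work on the file from the question. The caller remains responsible for closing the reader it passes in.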

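If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with one million integers), you should see much better performance using Integer.parseInt(BufferedReader.readLine()), thanks to smarter buffering by line and not having to split the long string into a separate array of Strings.

Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.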
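The tested code is not shown in the answer. A minimal sketch of the one-integer-per-line approach could look like this, where the class name `OnePerLineReader` and the explicit `count` parameter are assumptions made for illustration:

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.FileReader;
import java.io.IOException;

class OnePerLineReader {
    /** Reads exactly count integers, one per line, from the given file. */
    static int[] read(String path, int count) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            int[] array = new int[count];
            for (int i = 0; i < count; i++) {
                String line = in.readLine();
                if (line == null) {
                    throw new EOFException("expected " + count + " lines, got " + i);
                }
                // Each line holds a single number, so no split() is needed.
                array[i] = Integer.parseInt(line.trim());
            }
            return array;
        }
    }
}
```

For a file generated with seq 1 1000000, the call would be `OnePerLineReader.read(path, 1000000)`.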

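I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.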
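One possible reading of this suggestion is sketched below. It is not the answerer's code: the class name `NumberReader` is made up, and for simplicity the parsing lives in getNextNumber(), which pulls characters through the inherited read(), rather than in an overridden read(). It assumes non-negative integers separated by any non-digit characters.

```java
import java.io.EOFException;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    /** Returns the next non-negative integer, skipping any non-digit characters before it. */
    int getNextNumber() throws IOException {
        int c = read();
        // Skip separators (spaces, newlines, anything that is not a digit).
        while (c != -1 && (c < '0' || c > '9')) {
            c = read();
        }
        if (c == -1) {
            throw new EOFException("no more numbers");
        }
        int value = 0;
        // Accumulate digits until a non-digit or end of input.
        while (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            c = read();
        }
        return value;
    }
}
```

As with the other character-level parsers in this thread, it should be wrapped around a BufferedReader, for example `new NumberReader(new BufferedReader(new FileReader("./path")))`, so that each read() does not hit the file directly.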

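Using a StreamTokenizer on a BufferedReader will give you quite good performance. You shouldn't need to write your own readInt() function.

Here is the code I used to do some local performance testing: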

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StreamTokenizer;
    import java.util.Scanner;

    /**
     * Created by zhenhua.xu on 11/27/16.
     */
    public class MyReader {

        private static final String FILE_NAME = "./1m_numbers.txt";
        private static final int n = 1000000;

        public static void main(String[] args) {
            try {
                readByScanner();
                readByStreamTokenizer();
                readByStreamTokenizerOnBufferedReader();
                readByBufferedInputStream();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void readByScanner() throws Exception {
            long startTime = System.currentTimeMillis();
            Scanner stdin = new Scanner(new File(FILE_NAME));
            int array[] = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = stdin.nextInt();
            }
            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
        }

        public static void readByStreamTokenizer() throws Exception {
            long startTime = System.currentTimeMillis();
            StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
            int array[] = new int[n];
            for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                array[i] = (int) st.nval;
            }
            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
        }

        public static void readByStreamTokenizerOnBufferedReader() throws Exception {
            long startTime = System.currentTimeMillis();
            StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
            int array[] = new int[n];
            for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                array[i] = (int) st.nval;
            }
            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
        }

        public static void readByBufferedInputStream() throws Exception {
            long startTime = System.currentTimeMillis();
            BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
            int array[] = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = readInt(bis);
            }
            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
        }

        // Reads the next run of ASCII digits as a non-negative int.
        private static int readInt(InputStream in) throws IOException {
            int ret = 0;
            boolean dig = false;
            for (int c = 0; (c = in.read()) != -1; ) {
                if (c >= '0' && c <= '9') {
                    dig = true;
                    ret = ret * 10 + c - '0';
                } else if (dig) break;
            }
            return ret;
        }
    }

Results I got:

    Total time by Scanner: 789 ms
    Total time by StreamTokenizer: 226 ms
    Total time by StreamTokenizer with BufferedReader: 80 ms
    Total time with BufferedInputStream: 95 ms
