Multithreading: reading a large file

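I'm still wrapping my head around how concurrency works in Java. I understand that (if you subscribe to the OO Java 5 concurrency model) you implement a Task or Callable with a run() or call() method (respectively), and that you should parallelize as much of that implemented method as possible.

But I still don't understand something inherent to concurrent programming in Java:

How is the run() method of a Task assigned the right amount of concurrent work to do?

As a concrete example, say I have an I/O-bound readMobyDick() method that reads the entire contents of Herman Melville's Moby Dick into memory from a file on the local system. And let's say I want this readMobyDick() method to be concurrent and handled by 3 threads, where:

Thread #1 reads the first 1/3 of the book into memory
Thread #2 reads the second 1/3 of the book into memory
Thread #3 reads the last 1/3 of the book into memory

Do I need to split Moby Dick into three files and pass each one to its own task, or do I just call readMobyDick() from inside the implemented run() method and (somehow) the Executor knows how to break the work up among the threads?

I am a very visual learner, so any code examples of the right way to approach this are greatly appreciated! Thanks!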

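You have probably chosen, by accident, the absolute worst example of a parallel activity!

Reading in parallel from a single mechanical disk is actually slower than reading with a single thread, because you are in fact bouncing the mechanical head to different sections of the disk as each thread gets its turn to run. This is best left as a single-threaded activity.

Let's take another example, which is similar to yours but can actually offer some benefit: assume I want to search for the occurrences of a certain word in a huge list of words (this list could even have come from a disk file but, as I said, read by a single thread). Assume I can use 3 threads as in your example, each searching one third of the huge word list and keeping a local counter of how many times the searched word appeared.

In this case you'll want to partition the list into 3 parts, pass each part to a different object whose type implements Runnable, and implement the search in the run method.

The runtime itself has no idea how to do the partitioning or anything like that; you have to specify it yourself. There are many other partitioning strategies, each with its own strengths and weaknesses, but we can stick to static partitioning for now.

Let's see some code: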

    import java.util.ArrayList;
    import java.util.List;

    class SearchTask implements Runnable {
        private int localCounter = 0;
        private final int start; // start index of search (inclusive)
        private final int end;   // end index of search (exclusive)
        private final List<String> words;
        private final String token;

        public SearchTask(int start, int end, List<String> words, String token) {
            this.start = start;
            this.end = end;
            this.words = words;
            this.token = token;
        }

        @Override
        public void run() {
            for (int i = start; i < end; i++) {
                if (words.get(i).equals(token)) localCounter++;
            }
        }

        public int getCounter() {
            return localCounter;
        }
    }

    // meanwhile in main :) (main must declare "throws InterruptedException" because of join())
    List<String> words = new ArrayList<>();
    // populate words
    // let's assume you have 30000 words

    // create tasks
    SearchTask task1 = new SearchTask(0, 10000, words, "John");
    SearchTask task2 = new SearchTask(10000, 20000, words, "John");
    SearchTask task3 = new SearchTask(20000, 30000, words, "John");

    // create threads for each task
    Thread t1 = new Thread(task1);
    Thread t2 = new Thread(task2);
    Thread t3 = new Thread(task3);

    // start threads
    t1.start();
    t2.start();
    t3.start();

    // wait for threads to finish
    t1.join();
    t2.join();
    t3.join();

    // collect results (safe to read after join())
    int counter = 0;
    counter += task1.getCounter();
    counter += task2.getCounter();
    counter += task3.getCounter();

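This should work nicely. Note that in real-life cases you would build a more generic partitioning scheme. If you want to return a result, you could alternatively use an ExecutorService and implement Callable instead of Runnable.

So, an alternative example using more advanced constructs: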

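    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class SearchTask implements Callable<Integer> {
        private int localCounter = 0;
        private final int start; // start index of search (inclusive)
        private final int end;   // end index of search (exclusive)
        private final List<String> words;
        private final String token;

        public SearchTask(int start, int end, List<String> words, String token) {
            this.start = start;
            this.end = end;
            this.words = words;
            this.token = token;
        }

        @Override
        public Integer call() {
            for (int i = start; i < end; i++) {
                if (words.get(i).equals(token)) localCounter++;
            }
            return localCounter;
        }
    }

    // meanwhile in main :) (main must declare "throws InterruptedException, ExecutionException")
    List<String> words = new ArrayList<>();
    // populate words
    // let's assume you have 30000 words

    // create tasks
    List<Callable<Integer>> tasks = new ArrayList<>();
    tasks.add(new SearchTask(0, 10000, words, "John"));
    tasks.add(new SearchTask(10000, 20000, words, "John"));
    tasks.add(new SearchTask(20000, 30000, words, "John"));

    // create thread pool and start tasks; invokeAll blocks until all tasks are done
    ExecutorService exec = Executors.newFixedThreadPool(3);
    List<Future<Integer>> results = exec.invokeAll(tasks);

    // collect results
    int counter = 0;
    for (Future<Integer> f : results) {
        counter += f.get();
    }

    // release the pool's threads
    exec.shutdown();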
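The "more generic partitioning scheme" mentioned above could look roughly like the following. This is only a sketch: the class name PartitionedSearch and the method count are made up for illustration, and it assumes the list is not modified while the search runs and that parts is at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not from the answer above): counts occurrences of a
// token by splitting the list into `parts` contiguous index ranges.
class PartitionedSearch {

    static int count(List<String> words, String token, int parts) {
        ExecutorService exec = Executors.newFixedThreadPool(parts);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            int size = words.size();
            for (int p = 0; p < parts; p++) {
                // Range p covers [size*p/parts, size*(p+1)/parts); the integer
                // division spreads any remainder across the ranges.
                final int start = (int) ((long) size * p / parts);
                final int end = (int) ((long) size * (p + 1) / parts);
                tasks.add(() -> {
                    int local = 0; // per-task counter, no sharing between threads
                    for (int i = start; i < end; i++) {
                        if (words.get(i).equals(token)) local++;
                    }
                    return local;
                });
            }
            int total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Integer> f : exec.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search task failed", e);
        } finally {
            exec.shutdown();
        }
    }
}
```

A call such as PartitionedSearch.count(words, "John", 3) would then replace the three hand-written ranges; the number of parts becomes a parameter you can tune.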

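You picked a bad example, as Tudor was so kind to point out. Spinning-disk hardware is subject to the physical constraints of moving platters and heads, and the most efficient read implementation is to read each block in order, which reduces the need to move the head or wait for the disk to align.

That said, some operating systems don't always store things contiguously on disk, and for those who remember, defragmentation could provide a disk performance boost if your OS / filesystem didn't do the job for you.

As you mentioned wanting a program that would actually benefit, let me suggest a simple one: matrix addition.

Assuming you create one thread per core, you can trivially divide any two matrices to be added into N (one for each thread) sets of rows. Matrix addition (if you recall) works like this: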

    A + B = C

or

    [ a11, a12, a13 ]   [ b11, b12, b13 ]   [ (a11+b11), (a12+b12), (a13+b13) ]
    [ a21, a22, a23 ] + [ b21, b22, b23 ] = [ (a21+b21), (a22+b22), (a23+b23) ]
    [ a31, a32, a33 ]   [ b31, b32, b33 ]   [ (a31+b31), (a32+b32), (a33+b33) ]

So, to distribute this across N threads, we simply take the row number modulo the number of threads to get the "thread id" that will add that row.

    matrix with 20 rows across 3 threads
    row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
    row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
    row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
    // row 20 doesn't exist, because we number rows from 0

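Now each thread "knows" which rows it should handle, and the per-row results can be computed trivially because they never cross into another thread's domain of computation.

All that is needed now is a "result" data structure which tracks when the values have been computed; when the last value is set, the computation is complete. In this "fake" example of a matrix addition result with two threads, computing the answer with two threads takes about half as long.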

    // The following assumes, for illustrative purposes only, that threads don't get
    // rescheduled to different cores. Real threads are scheduled across cores based on
    // availability, with attempts to prevent unnecessary core migration of a running thread.
    [ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
    [ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
    [ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
    [ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
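To make the row % N idea concrete, here is a minimal sketch of the matrix addition described above. The class name ParallelMatrixAdd is hypothetical, both matrices are assumed to have identical dimensions, and instead of a separate "result" tracking structure it simply joins every worker thread to find out when the last value has been set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of interleaved row assignment (row % threads == id).
class ParallelMatrixAdd {

    // Adds a and b (same dimensions assumed) using `threads` worker threads.
    static int[][] add(int[][] a, int[][] b, int threads) {
        int rows = a.length;
        int[][] c = new int[rows][];
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                // This thread owns every row where row % threads == id, so no
                // two threads ever write to the same row of c.
                for (int row = id; row < rows; row += threads) {
                    int cols = a[row].length;
                    c[row] = new int[cols];
                    for (int col = 0; col < cols; col++) {
                        c[row][col] = a[row][col] + b[row][col];
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                // join() waits for the worker and makes its writes visible here.
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while adding", e);
            }
        }
        return c;
    }
}
```

For small matrices the cost of starting threads will outweigh any gain; the interleaving only pays off when each thread has a substantial number of rows to add.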

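More complex problems can be solved with multithreading, and different problems are solved with different techniques. I purposefully picked one of the simplest examples.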

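"You implement a Task or Callable with a run() or call() method (respectively), and you should parallelize as much of that implemented method as possible."

A Task represents a discrete unit of work.
Loading a file into memory is a discrete unit of work, so this activity can be delegated to a background thread, i.e. a background thread runs the task of loading the file.
It is a discrete unit of work because it has no other dependencies needed to do its job (loading the file) and has discrete boundaries.
What you are asking is to divide this further into tasks, i.e. one thread loads 1/3 of the file while another thread loads the 2/3, etc.
If you were able to divide the task into further subtasks, then by definition it would not be a task in the first place. So loading a file is a single task by itself.

To give you an example:
Let's say you have a GUI and you need to present to the user data from 5 different files. To present them, you also need to prepare some data structures to process the actual data.
All of these are separate tasks.
For example, loading the files is 5 different tasks, so it can be done by 5 different threads.
The preparation of the data structures can be done in a different thread.
The GUI, of course, runs in yet another thread.
All of these can happen concurrently.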
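As a rough sketch of the "one task per file" idea above: the class name FileLoader and the method loadAll are hypothetical, each file is assumed to be small enough to fit in memory, and the GUI and data-structure preparation are left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: every file is loaded by its own task.
class FileLoader {

    // Returns the contents of each file, in the same order as `files`.
    static List<byte[]> loadAll(List<Path> files) {
        // One thread per file, as in the 5-files example (at least one thread).
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, files.size()));
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (Path file : files) {
                // Loading one whole file is a single, independent unit of work.
                Callable<byte[]> task = () -> Files.readAllBytes(file);
                pending.add(pool.submit(task));
            }
            List<byte[]> contents = new ArrayList<>();
            for (Future<byte[]> f : pending) {
                contents.add(f.get()); // blocks until that file is loaded
            }
            return contents;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("loading failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real GUI you would not block the UI thread on f.get(); the results would be handed back through whatever mechanism the toolkit provides.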

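If your system supports high-throughput I/O, here is how you can do it:

How to read a file using multiple threads in Java when a high-throughput (3 GB/s) file system is available

Here is a solution for reading a single file with multiple threads.

Divide the file into N chunks, read each chunk in its own thread, then merge the chunks in order. Beware of lines that cross chunk boundaries. This is the basic idea, as suggested by user slaks.

Benchmark of the multithreaded implementation below on a single 20 GB file:

1 thread: 50 seconds: 400 MB/s

2 threads: 30 seconds: 666 MB/s

4 threads: 20 seconds: 1 GB/s

8 threads: 60 seconds: 333 MB/s

Equivalent Java 7 readAllLines(): 400 seconds: 50 MB/s

Note: this may only work on systems that are designed to support high-throughput I/O, and not on ordinary personal computers.

Here are the basic nuts and bolts of the code; for complete details, follow the link.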

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class FileRead implements Runnable {
        private final FileChannel _channel;
        private final long _startLocation;
        private final int _size;
        final int _sequence_number;

        public FileRead(long loc, int size, FileChannel chnl, int sequence) {
            _startLocation = loc;
            _size = size;
            _channel = chnl;
            _sequence_number = sequence;
        }

        @Override
        public void run() {
            try {
                System.out.println("Reading the channel: " + _startLocation + ":" + _size);

                // allocate memory
                ByteBuffer buff = ByteBuffer.allocate(_size);

                // read file chunk to RAM; one read() may return fewer bytes than requested
                while (buff.hasRemaining()
                        && _channel.read(buff, _startLocation + buff.position()) > 0) {
                    // keep reading until the chunk is complete
                }

                // chunk to String
                String string_chunk = new String(buff.array(), StandardCharsets.UTF_8);

                System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // args[0] is the path of the file to read
        // args[1] is the size of the thread pool; try different values to find the sweet spot
        public static void main(String[] args) throws Exception {
            int threads = Integer.parseInt(args[1]);
            FileInputStream fileInputStream = new FileInputStream(args[0]);
            FileChannel channel = fileInputStream.getChannel();
            long remaining_size = channel.size(); // total number of bytes in the file
            // file_size / threads, but at least 1 so the loop below always terminates
            long chunk_size = Math.max(1, remaining_size / threads);

            // thread pool
            ExecutorService executor = Executors.newFixedThreadPool(threads);

            long start_loc = 0; // file pointer
            int i = 0;          // loop counter
            while (remaining_size >= chunk_size) {
                // submits a new chunk to the pool
                executor.execute(new FileRead(start_loc, Math.toIntExact(chunk_size), channel, i));
                remaining_size = remaining_size - chunk_size;
                start_loc = start_loc + chunk_size;
                i++;
            }

            // load the last remaining piece
            executor.execute(new FileRead(start_loc, Math.toIntExact(remaining_size), channel, i));

            // tear down: wait for all chunks to finish, then close the file
            executor.shutdown();
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
            fileInputStream.close();
        }
    }
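The warning about lines crossing chunk boundaries, and the "merge them in order" step, can be handled along the following lines. This is a sketch under stated assumptions, not the linked implementation: the names ChunkedReader, alignToLineStart, readChunk and readChunks are invented for illustration, the file is assumed to be UTF-8 with '\n' line endings, every chunk must be smaller than 2 GB, and the whole file ends up in memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: chunk boundaries are moved to line starts, and
// the chunks are returned in file order.
class ChunkedReader {

    // Moves pos forward to just past the next '\n', or to the end of the file.
    // Reads one byte at a time for simplicity, which is slow but short.
    static long alignToLineStart(FileChannel ch, long pos, long size) throws IOException {
        ByteBuffer one = ByteBuffer.allocate(1);
        while (pos < size) {
            one.clear();
            if (ch.read(one, pos) < 1) break;
            pos++;
            if (one.get(0) == '\n') break;
        }
        return pos;
    }

    // Reads the bytes in [start, end) and decodes them as UTF-8.
    static String readChunk(FileChannel ch, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Math.toIntExact(end - start));
        long pos = start;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos); // positional read, safe to call from several threads
            if (n < 0) break;
            pos += n;
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
    }

    // Splits the file into `chunks` line-aligned pieces and reads them in parallel.
    static List<String> readChunks(Path file, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long[] bounds = new long[chunks + 1]; // bounds[0] == 0
            for (int i = 1; i < chunks; i++) {
                long raw = Math.max(size * i / chunks, bounds[i - 1]);
                bounds[i] = alignToLineStart(ch, raw, size);
            }
            bounds[chunks] = size;

            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < chunks; i++) {
                final long start = bounds[i];
                final long end = bounds[i + 1];
                Callable<String> task = () -> readChunk(ch, start, end);
                pending.add(pool.submit(task));
            }
            // Collecting the futures in submission order is the "merge in order" step.
            List<String> parts = new ArrayList<>();
            for (Future<String> f : pending) {
                parts.add(f.get());
            }
            return parts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk read failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every interior boundary sits right after a '\n' byte, no line and no multi-byte UTF-8 character is split between two chunks, so String.join("", parts) should reproduce the file contents. Whether this is any faster than a single thread depends entirely on the storage, as the benchmark note above says.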
