What happens if I cache the same RDD twice in Spark?

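I'm building a generic function that receives an RDD and performs some calculations on it. Since I run more than one computation on the input RDD, I would like to cache it. For example: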

    public JavaRDD foo(JavaRDD r) {
        r.cache();
        JavaRDD t1 = r... // Some calculations
        JavaRDD t2 = r... // Other calculations
        return t1.union(t2);
    }

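My question is this: since r is handed to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of caching, meaning that while t1 and t2 are being computed I will have two instances of r in the cache? Or does Spark know that r is already cached and ignore the call?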

Nothing. If you call cache on an RDD that is already cached, nothing happens; the RDD will be cached (once). Caching, like many other transformations, is lazy:

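When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
When you call cache again, it is set to the same value (no change).
Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it.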
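The three steps above can be sketched with a toy model in plain Python. This is only an illustration of the described behaviour, not Spark's actual implementation: the class `ToyRDD`, its attributes, and the `compute_calls` counter are all hypothetical names made up for this example.

```python
class ToyRDD:
    """A toy stand-in for an RDD (hypothetical; not Spark's real code)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self.storage_level = None    # None means "not marked for caching"
        self._cached_data = None     # filled in on the first evaluation
        self.compute_calls = 0       # how many times the data was computed

    def cache(self):
        # Lazy: only records the desired storage level, computes nothing.
        self.storage_level = "MEMORY_ONLY"
        return self

    def collect(self):
        # An "action": the storage level is consulted only here.
        if self.storage_level is not None and self._cached_data is not None:
            return self._cached_data
        self.compute_calls += 1
        data = self._compute()
        if self.storage_level is not None:
            self._cached_data = data  # keep a single cached copy
        return data


rdd = ToyRDD(lambda: [1, 2, 3])
rdd.cache()
rdd.cache()              # second call sets the same value again
first = rdd.collect()    # computes the data and caches it
second = rdd.collect()   # served from the single cached copy
```

In this model the second `cache()` call merely overwrites a flag with the value it already had, which is why it cannot produce a second cached copy.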

So you are safe.

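I just tested this on my cluster, and Zohar is right: nothing happens, and the RDD is cached only once. I think the reason is that every RDD has an internal id, and Spark uses that id to mark whether the RDD has already been cached. So caching the same RDD multiple times does nothing.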

Below are my code and a screenshot:


Update [code added as requested]

    ### sc is an existing SparkContext (for example, the one provided by the pyspark shell).
    ### Cache and count; the storage info then shows up in the web UI.
    raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
        .setName("raw_file")\
        .cache()
    raw_file.count()

    ### Cache and count again, then look at the web UI: nothing changes.
    raw_file.cache()
    raw_file.count()

    ### Change the RDD's name, then cache and count again to see whether a new RDD
    ### gets cached under the new name. Still nothing changes, so I think Spark may
    ### be using the RDD id as the marker; to be sure we would need to read the
    ### documentation or even the source code.
    raw_file.setName("raw_file_2")
    raw_file.cache().count()
