PySpark: java.lang.OutOfMemoryError: Java heap space

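I recently started using PySpark with IPython on my server, which has 24 CPUs and 32 GB of RAM. It runs on a single machine only. In my process, I want to collect a large amount of data, as shown in the code below: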

    train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                          .filter(lambda x: x[-1] != [])
                          # rec is (x, text, tags); indexed because Python 3 lambdas cannot unpack tuples
                          .flatMap(lambda rec: [(tag, (rec[0], rec[1])) for tag in rec[2]])
                          .groupByKey()
                          .mapValues(list))

When I run

    training_data = train_dataRDD.collectAsMap()

it gives me an OutOfMemoryError: Java heap space. After this error I also cannot perform any further operations on Spark, because it loses its connection to Java. It gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is too small. How can I set it to a larger limit?

Edit:

Things I tried before running:

    sc._conf.set('spark.executor.memory', '32g').set('spark.driver.memory', '32g').set('spark.driver.maxResultSize', '0')

I changed the Spark options as described in the documentation here (press Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to work.
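One way to check whether a memory setting actually reached the JVM is to ask the driver for its maximum heap size. The sketch below is only an illustration, assuming sc is a live SparkContext. The helper name driver_max_heap_gib is hypothetical, and sc._jvm is a private PySpark attribute (the Py4J view of the driver JVM), so it may behave differently between versions.

```python
def driver_max_heap_gib(sc):
    """Return the driver JVM's maximum heap size in GiB.

    Assumption: `sc` is a live SparkContext. `sc._jvm` is PySpark's
    private Py4J handle to the driver JVM, not a public API.
    """
    # java.lang.Runtime.maxMemory() reports the heap limit in bytes.
    runtime = sc._jvm.java.lang.Runtime.getRuntime()
    return runtime.maxMemory() / float(1024 ** 3)


# Usage in a running PySpark session (not executed here):
# print("driver max heap: %.1f GiB" % driver_max_heap_gib(sc))
```

If the reported value stays the same after calling sc._conf.set(...) on a context that is already running, the setting did not reach the JVM. In local mode the executors run inside the driver JVM, so spark.executor.memory would not be expected to change this number.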

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.

    sudo vim $SPARK_HOME/conf/spark-defaults.conf
    # uncomment the spark.driver.memory line and change it to suit your use; I changed it to:
    spark.driver.memory 15g
    # press : and then type wq! to save and exit vim

Close your existing Spark application and rerun it. You will not encounter this error again. 🙂
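If editing spark-defaults.conf is not convenient (for example in an IPython session), the same driver memory can usually be requested through the PYSPARK_SUBMIT_ARGS environment variable, provided it is set before the first SparkContext is created. This is a rough sketch rather than the method used above: the helper names set_driver_memory and start_context and the app name are hypothetical, and it assumes a Spark version (1.4 or later, as far as I know) that expects the trailing pyspark-shell token.

```python
import os


def set_driver_memory(memory="15g"):
    """Request `memory` of driver heap for the next PySpark JVM launch.

    Must run before the first SparkContext is created in this process,
    because the driver heap is fixed once the JVM has started.
    """
    # "pyspark-shell" goes last so PySpark starts its gateway as usual.
    args = "--driver-memory {} pyspark-shell".format(memory)
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


def start_context(memory="15g", app_name="oom-example"):
    """Create a local SparkContext with the requested driver memory."""
    set_driver_memory(memory)
    # Imported late so this file loads even where pyspark is not installed.
    from pyspark import SparkContext
    return SparkContext(master="local[*]", appName=app_name)


# Usage (needs a Spark installation; not executed here):
# sc = start_context("15g")
```

Calling sc._conf.set('spark.driver.memory', ...) on a context that already exists cannot have this effect, since the driver JVM is running by then.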
