Join DataFrames in Spark with Java

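First of all, thank you for taking the time to read my question.

My question is as follows: in Spark with Java, I have loaded the data from two CSV files into two DataFrames.

These DataFrames contain the following information.

Dataframe airport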

Id | Name    | City
-------------------
1  | Barajas | Madrid

Dataframe airport_city_state

City   | state
---------------
Madrid | España

I want to join these two DataFrames so that the result looks like this:

Dataframe result

Id | Name    | City   | state
-----------------------------
1  | Barajas | Madrid | España

where dfairport.city = dfairport_city_state.city

But I cannot work out the syntax to do the join correctly. Here is some of the code showing how I create the variables:

    // Load the CSV files; you have to specify that there is a header and which delimiter is used
    Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
    Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

    // Rename the columns of the CSV DataFrame to match the columns in the database.
    // Once the names match we can insert them.
    // Note: withColumnRenamed returns a new Dataset, so without an assignment
    // dfairport and dfairport_city_state keep their original column names.
    dfairport
        .withColumnRenamed("leg_key", "id")
        .withColumnRenamed("leg_name", "name")
        .withColumnRenamed("leg_city", "city");

    dfairport_city_state
        .withColumnRenamed("city", "ciudad")
        .withColumnRenamed("state", "estado");

You can use the join method with a column name to join the two DataFrames, for example:

    Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
    Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);
    Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"));

There is also an overloaded version that lets you specify the join type as a third argument, for example:

    Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"), "left_outer");

More information about joins.
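To make the effect of the join type concrete, here is a minimal plain-Java sketch of what an equi-join on the city column produces. It deliberately does not use Spark: `JoinSemanticsSketch`, `joinOnCity`, and the second airport row (Heathrow/London) are hypothetical names and data made up for illustration, and it assumes each city appears at most once in the city/state table. When no join type is given, Spark's `join` performs an inner join; `"left_outer"` additionally keeps left-hand rows that have no match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSemanticsSketch {

    // Joins airport rows [id, name, city] with city rows [city, state] on the city value.
    // "inner" drops airports without a matching city; "left_outer" keeps them with a
    // null state. This only illustrates the semantics; it is not Spark code.
    static List<List<String>> joinOnCity(List<List<String>> airports,
                                         List<List<String>> cityStates,
                                         String joinType) {
        // Index the right-hand side by city (assumption: each city appears once).
        Map<String, String> stateByCity = new HashMap<>();
        for (List<String> row : cityStates) {
            stateByCity.put(row.get(0), row.get(1));
        }

        List<List<String>> result = new ArrayList<>();
        for (List<String> airport : airports) {
            String city = airport.get(2);
            if (stateByCity.containsKey(city)) {
                // Matching row: append the state to the airport columns.
                result.add(Arrays.asList(airport.get(0), airport.get(1), city, stateByCity.get(city)));
            } else if ("left_outer".equals(joinType)) {
                // No match: a left outer join keeps the airport and fills state with null.
                result.add(Arrays.asList(airport.get(0), airport.get(1), city, null));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> airports = Arrays.asList(
            Arrays.asList("1", "Barajas", "Madrid"),
            Arrays.asList("2", "Heathrow", "London")); // hypothetical row with no match
        List<List<String>> cityStates = Arrays.asList(
            Arrays.asList("Madrid", "España"));

        System.out.println(joinOnCity(airports, cityStates, "inner"));
        System.out.println(joinOnCity(airports, cityStates, "left_outer"));
    }
}
```

With this sample data, the inner variant yields only the Barajas row, while the left outer variant also keeps the unmatched airport with a null state.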

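First of all, thank you very much for your reply.

I have tried both of your solutions, but neither works; I get the following error: The method dfairport_city_state(String) is undefined for the type ETL_Airport

I cannot access a specific column of the DataFrame to do the join.

Edit: I have got the join working. I am posting the solution here in case it helps someone else ;)

Thanks for everything, and best regards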

    // Join the tables on the city they share
    Dataset<Row> joined = dfairport.join(
        dfairport_city_state,
        dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));
