Apache Spark SQL problem on a multi-node Hadoop cluster
Hi, I am using the Spark Java APIs to fetch data from Hive. The code works on a single-node Hadoop cluster, but when I try it on a multi-node Hadoop cluster it throws this error:
org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
Note: I set the master to local for the single-node cluster and to yarn-cluster for the multi-node cluster.
Here is my Java code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

SparkConf sparkConf = new SparkConf().setAppName("Hive").setMaster("yarn-cluster");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());
Row[] result = sqlContext.sql("Select * from Tablename").collect();
Also, when I change the master back to local, it now throws an UnknownHostException.
Can anyone help me?
Update
Error log:
15/08/05 11:30:25 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/08/05 11:30:25 INFO ObjectStore: Initialized ObjectStore
15/08/05 11:30:25 INFO HiveMetaStore: Added admin role in metastore
15/08/05 11:30:25 INFO HiveMetaStore: Added public role in metastore
15/08/05 11:30:25 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/08/05 11:30:25 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/08/05 11:30:25 INFO HiveMetaStore: 0: get_table : db=default tbl=activity
15/08/05 11:30:25 INFO audit: ugi=labuser ip=unknown-ip-addr cmd=get_table : db=default tbl=activity
15/08/05 11:30:25 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
15/08/05 11:30:25 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/08/05 11:30:26 INFO MemoryStore: ensureFreeSpace(399000) called with curMem=0, maxMem=1030823608
15/08/05 11:30:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 389.6 KB, free 982.7 MB)
15/08/05 11:30:26 INFO MemoryStore: ensureFreeSpace(34309) called with curMem=399000, maxMem=1030823608
15/08/05 11:30:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.5 KB, free 982.7 MB)
15/08/05 11:30:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.100.7:61775 (size: 33.5 KB, free: 983.0 MB)
15/08/05 11:30:26 INFO SparkContext: Created broadcast 0 from collect at Hive.java:29
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoopcluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1783)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:105)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1255)
    at com.Hive.main(Hive.java:29)
Caused by: java.net.UnknownHostException: hadoopcluster
    ... 44 more
As the exception says, yarn-cluster mode cannot be used directly from a SparkContext: deployment to YARN is only supported through spark-submit.
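For illustration, a minimal submit command, assuming your application is packaged as spark-hive-app.jar (a hypothetical jar name; the main class com.Hive is taken from your stack trace):

spark-submit --class com.Hive --master yarn-cluster spark-hive-app.jar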
Alternatively, you can keep driving the job from a SparkContext by running it on a standalone multi-node Spark cluster. First start your standalone Spark cluster, then set sparkConf.setMaster("spark://HOST:PORT"), where HOST:PORT is the URL of your standalone master. I hope this solves your problem.
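A minimal sketch of the same job against a standalone master, assuming the master runs on a host named master-host with the default standalone port 7077 (both are placeholders to replace with your own values):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public class Hive {
    public static void main(String[] args) {
        // Point the driver at the standalone master instead of YARN.
        // "master-host:7077" is a placeholder; 7077 is the default standalone port.
        SparkConf sparkConf = new SparkConf()
                .setAppName("Hive")
                .setMaster("spark://master-host:7077");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        HiveContext sqlContext = new HiveContext(ctx.sc());
        Row[] result = sqlContext.sql("Select * from Tablename").collect();
        ctx.stop();
    }
}

Note that the driver machine still needs your cluster's client configuration (hive-site.xml, core-site.xml, hdfs-site.xml) on its classpath; a missing hdfs-site.xml is a common reason why a logical HA nameservice such as hadoopcluster cannot be resolved and surfaces as the UnknownHostException in your log.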