Tag: apache spark ml

Spark ML Pipeline API save does not work

In version 1.6 the Pipeline API got a new set of functions to save and load pipeline stages. After training a classifier, I am trying to save a stage to disk so I can load it again later, reuse it, and save the effort of rebuilding the model. For some reason, when I save the model, the output directory contains only a metadata directory. When I try to load it again, I get the following exception:

    Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
        at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1330)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1327)
        at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:284)
        at org.apache.spark.ml.tuning.CrossValidator$SharedReadWrite$.load(CrossValidator.scala:287)
        at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelReader.load(CrossValidator.scala:393)
        at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelReader.load(CrossValidator.scala:384)
        at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:176)
        at org.apache.spark.ml.tuning.CrossValidatorModel$.load(CrossValidator.scala:368)
        at org.apache.spark.ml.tuning.CrossValidatorModel.load(CrossValidator.scala)
        at org.test.categoryminer.spark.SparkTextClassifierModelCache.get(SparkTextClassifierModelCache.java:34)

To save the model I use:

    crossValidatorModel.save("/tmp/my.model")

and to load it I use:

    CrossValidatorModel.load("/tmp/my.model")

I call save on the CrossValidatorModel object I get from calling fit(dataframe) on the CrossValidator object. Any pointers as to why it only saves the metadata directory?
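For reference, a minimal save/load round trip with the 1.6 persistence API looks like the sketch below (written for a spark-shell-style session where sqlContext is in scope; the toy data, parameter grid, and paths are placeholder assumptions, not the asker's code). A successful save writes stage data alongside the metadata directory:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
    import org.apache.spark.mllib.linalg.Vectors

    // Toy training data; any DataFrame with "label" and "features" columns works.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    val lr = new LogisticRegression()
    val cv = new CrossValidator()
      .setEstimator(new Pipeline().setStages(Array(lr)))
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(
        new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build())
      .setNumFolds(2)

    val cvModel: CrossValidatorModel = cv.fit(training)

    // Persist and reload; overwrite() replaces any existing output at the path.
    cvModel.write.overwrite().save("/tmp/my.model")
    val restored = CrossValidatorModel.load("/tmp/my.model")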

Apache Spark MLlib with the DataFrame API gives java.net.URISyntaxException on createDataFrame() or read().csv(…)

In a standalone application (running on Java 8 and Windows 10, with spark-xxx_2.11:2.0.0 as a jar dependency), the following code gives an error:

    /* this: */
    Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
        new LabeledPoint(1.0, Vectors.dense(4.9, 3, 1.4, 0.2)),
        new LabeledPoint(1.0, Vectors.dense(4.7, 3.2, 1.3, 0.2))
    ), LabeledPoint.class);

    /* or this: */
    /* logFile: "C:\files\project\file.csv",
       "C:\\files\\project\\file.csv",
       "C:/files/project/file.csv",
       "file:/C:/files/project/file.csv",
       "file:///C:/files/project/file.csv",
       "/file.csv" */
    Dataset<Row> logData = spark_session.read().csv(logFile);

The exception:

    java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/files/project/spark-warehouse
        at org.apache.hadoop.fs.Path.initialize(Path.java:206)
        at org.apache.hadoop.fs.Path.<init>(Path.java:172)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
        at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
        [...]
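The trace points at the default spark-warehouse path that SessionCatalog builds on Windows, not at the CSV path itself. A minimal sketch of the commonly reported workaround for this Spark 2.0.0 issue, assuming a local standalone setup, is to set spark.sql.warehouse.dir to a well-formed file: URI when building the session (the directory below is an arbitrary example):

    import org.apache.spark.sql.SparkSession

    // Give Spark an explicit, well-formed warehouse URI so it does not derive
    // the malformed "file:C:/..." default seen in the exception.
    val spark = SparkSession.builder()
      .appName("csv-read-example")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
      .getOrCreate()

    // With the warehouse URI fixed, reads like this get past catalog setup.
    val logData = spark.read.csv("C:/files/project/file.csv")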

How can I save models from an ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by an ML Pipeline. As indicated in this answer, the models can be saved as follows:

    import java.io._

    def saveModel(name: String, model: PipelineModel) = {
      val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
      oos.writeObject(model)
      oos.close
    }

    schools.zip(bySchoolArrayModels).foreach {
      case (name, model) => saveModel(name, model)
    }

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, since I eventually want the models saved to Amazon S3, but both fail with an error saying the path cannot be found. How do I save models to Amazon S3?
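One likely reason both paths fail: java.io.FileOutputStream only writes to the local filesystem, so s3:// and HDFS URIs are never resolved. Below is a sketch of the same serialization routed through the Hadoop FileSystem API, which dispatches on the URI scheme; the bucket name is hypothetical and S3 credentials are assumed to be configured already:

    import java.io.ObjectOutputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.ml.PipelineModel

    // Resolve the FileSystem from the path's scheme (s3://, hdfs://, file://)
    // instead of assuming the local filesystem.
    def saveModel(name: String, model: PipelineModel): Unit = {
      val path = new Path(s"s3://some-bucket/models/$name")  // hypothetical bucket
      val fs = path.getFileSystem(new Configuration())
      val oos = new ObjectOutputStream(fs.create(path, true))
      try oos.writeObject(model) finally oos.close()
    }

On a cluster it is usually better to pass sc.hadoopConfiguration rather than a fresh Configuration, so the job's S3 credentials and filesystem settings are picked up.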