1. Import the YARN and HDFS configuration files

Spark on YARN depends on YARN and HDFS, so getting hold of the YARN and HDFS configuration files is the first prerequisite. Copy the three files core-site.xml, hdfs-site.xml and yarn-site.xml into the resources directory of your IDEA project, as shown below:
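For reference, in a standard Maven-layout project the result looks roughly like this (directory names are the usual Maven defaults; adjust to your own project):

src/main/resources/
    core-site.xml
    hdfs-site.xml
    yarn-site.xml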



2. Add the project dependencies

Besides what you need to add in your pom:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <!--<scope>test</scope>-->
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <!--<scope>test</scope>-->
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <!--<scope>test</scope>-->
    <version>2.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <!--<scope>test</scope>-->
    <version>2.2.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.34</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
you also need to add the spark-yarn dependency jar to your module dependencies in IDEA, as shown below:







This is just to point you in the right direction: in my case the missing jar was spark-yarn_2.11-2.2.1.jar; add whatever your own project is missing. By the way, all the jars Spark depends on live under ${SPARK_HOME}/jars, so look there for anything you lack. If all else fails, add the whole directory with *. Brute force, but it works.
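If you would rather pull it in through Maven than attach the jar by hand in IDEA, the equivalent pom entry should be the following (version matching the other Spark dependencies above):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>2.2.1</version>
</dependency>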

If you don't add it, you will get an error like this:
Caused by: org.apache.spark.SparkException: Unable to load YARN support
    at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:413)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:408)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:408)
    at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:433)
    at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2381)
    at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:156)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:351)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
    at com.timanetworks.spark.faw.CommonStaticConst$.loadHdfsConfig(CommonStaticConst.scala:37)
    at com.timanetworks.spark.faw.CommonStaticConst$.<init>(CommonStaticConst.scala:23)
    at com.timanetworks.spark.faw.CommonStaticConst$.<clinit>(CommonStaticConst.scala)
    ... 3 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
    at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:409)
3. Modify the following setting in core-site.xml

Comment out the following configuration in core-site.xml:


In plain terms, comment out this block:
<property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology_script.py</value>
</property>
If you're tempted to copy that py script down from the Linux machine and point the value at a Windows path, I can tell you from experience: I tried it... it does not work!
<property>
    <name>net.topology.script.file.name</name>
    <value>D:\spark\spark-2.2.1-bin-hadoop2.7\topology_script.py</value>
</property>
That won't work either. You really must comment it out!
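In other words, the relevant section of core-site.xml ends up wrapped in an XML comment, roughly like this:

<!--
<property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology_script.py</value>
</property>
-->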

Otherwise you will run into this error:
java.io.IOException: Cannot run program "/etc/hadoop/conf/topology_script.py" (in directory "D:\workspace\fawmc-new44\operation-report-calc"): CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:520)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
    at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
    at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
    at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
    at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
    at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
    at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:225)
    at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:206)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:206)
    at org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:178)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:177)
    at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:229)
    at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:193)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1055)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1695)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
    at java.lang.ProcessImpl.start(ProcessImpl.java:137)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 26 more
Or this one:
"D:\spark\spark-2.2.1-bin-hadoop2.7\topology_script.py" (in directory "D:
\workspace\fawmc-new44\operation-report-calc"): CreateProcess error=193, %1
不是有效的 Win32 应用程序。
With the steps above done, you can debug Spark on YARN from IDEA. Nice. And by the way, yarn-client is the mode you generally use for debugging.
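For reference, here is a minimal driver sketch for running in yarn-client mode from IDEA. The object name and the tiny test job are made up for illustration; the config files you copied into resources are picked up from the classpath automatically.

import org.apache.spark.sql.SparkSession

object YarnClientDebug {
  def main(args: Array[String]): Unit = {
    // core-site.xml / hdfs-site.xml / yarn-site.xml from resources/ are on the
    // classpath, so Spark and Hadoop know which cluster to talk to.
    val spark = SparkSession.builder()
      .appName("yarn-client-debug")
      .master("yarn")                               // run against the YARN cluster
      .config("spark.submit.deployMode", "client")  // driver stays in IDEA, executors run on YARN
      .getOrCreate()

    // Quick smoke test: a tiny job that actually schedules tasks on the cluster.
    val sum = spark.sparkContext.parallelize(1 to 100).sum()
    println(s"sum = $sum")

    spark.stop()
  }
}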

The result!



That's it. If you also run into HDFS permission problems (cannot create, read or write files, and so on), there is plenty of material online; it basically boils down to the two commands hadoop fs -chmod and hadoop fs -chown, which are used much like their Linux counterparts.
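For example (the path and user name below are placeholders, substitute your own):

hadoop fs -chmod -R 755 /user/yourname/someDir
hadoop fs -chown -R yourname:yourname /user/yourname/someDir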
If you hit certain Windows-specific permission problems, have a look at my other post:

Common problems when setting up a Hadoop/Spark environment on Windows
https://blog.csdn.net/qq_31806205/article/details/79819724
