Installing Hadoop 3.0.3 and JDK 1.8 in Pseudo-Distributed Mode on CentOS 7

Add a regular user named hadoop

useradd hadoop
passwd hadoop

Give the hadoop user sudo privileges

chmod u+w /etc/sudoers
vi /etc/sudoers
Add the line:
hadoop ALL=(ALL) ALL
or, to allow sudo without a password prompt:
hadoop ALL=(root) NOPASSWD:ALL
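When you finish editing, restore sudoers to read-only, since the file is not meant to stay writable:
chmod u-w /etc/sudoers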

Switch to the hadoop user

su - hadoop

Install Hadoop into the /home/hadoop/hadoop3.03 directory

tar -zxvf hadoop-3.0.3.tar.gz
mv hadoop-3.0.3 /home/hadoop/hadoop3.03

Install the JDK into /home/hadoop/java/jdk1.8

mkdir -p /home/hadoop/java
tar -zxvf jdk-8u172-linux-x64.gz
mv jdk1.8.0_172 /home/hadoop/java/jdk1.8

Configure environment variables

vi /etc/profile

## java
export JAVA_HOME=/home/hadoop/java/jdk1.8
export PATH=$PATH:$JAVA_HOME/bin

## hadoop
export HADOOP_HOME=/home/hadoop/hadoop3.03
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Reload the profile so the changes take effect in the current shell, then verify:
source /etc/profile
echo $JAVA_HOME
echo $HADOOP_HOME
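A fuller check is to run the binaries themselves, which also confirms the new PATH entries resolve:
java -version
hadoop version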
Set the JAVA_HOME parameter in hadoop-env.sh, mapred-env.sh, and yarn-env.sh (all under $HADOOP_HOME/etc/hadoop):
export JAVA_HOME=/home/hadoop/java/jdk1.8
Configure core-site.xml

Here hadoop-localhost is the machine's hostname, and the /opt/data/tmp directory must be created beforehand (the mkdir command follows below).
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-localhost:8020</value>
        <description>HDFS URI: filesystem://namenode-host:port</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
        <description>Local Hadoop temp directory on the namenode</description>
    </property>
</configuration>

hadoop.tmp.dir sets Hadoop's temporary directory; the NameNode's HDFS data, for example, is stored under it by default. If you browse the *-default.xml default configuration files, you will find many settings that depend on ${hadoop.tmp.dir}.


The default hadoop.tmp.dir is /tmp/hadoop-${user.name}, which means the NameNode keeps HDFS metadata under /tmp. When the operating system reboots, it clears /tmp and the NameNode metadata is lost, which is a very serious problem, so we should change this path.

sudo mkdir -p /opt/data/tmp

Change the owner of the temp directory to hadoop:
sudo chown -R hadoop:hadoop /opt/data/tmp

Configure hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <!-- Hadoop 3 name for the deprecated dfs.name.dir -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/data/tmp/dfs/name</value>
        <description>Where the namenode stores the HDFS namespace metadata</description>
    </property>
    <!-- Hadoop 3 name for the deprecated dfs.data.dir -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/data/tmp/dfs/data</value>
        <description>Physical location of data blocks on the datanode</description>
    </property>
    <!-- Set the HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Format HDFS

sudo chown -R hadoop:hadoop /opt/data
hdfs namenode -format

Note that formatting assigns a new clusterID; if you ever reformat an existing installation, clear the data directory first, or the DataNode will fail to start because its clusterID no longer matches the NameNode's.

Inspect the directory the NameNode format created:
$ ll /opt/data/tmp/dfs/name/current

Start the NameNode
sbin/hadoop-daemon.sh start namenode

Start the DataNode
sbin/hadoop-daemon.sh start datanode

Start the SecondaryNameNode
sbin/hadoop-daemon.sh start secondarynamenode
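These hadoop-daemon.sh invocations still work in Hadoop 3.x but print a deprecation warning; the current equivalents use the --daemon flag:
bin/hdfs --daemon start namenode
bin/hdfs --daemon start datanode
bin/hdfs --daemon start secondarynamenode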

Check with the jps command that everything started; if the daemons appear in the output, they are running.
$ jps
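The output should look roughly like this (the PIDs are illustrative and will differ on your machine):
3680 NameNode
3763 DataNode
3847 SecondaryNameNode
3912 Jps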

Test HDFS: create a directory, upload and download files

[hadoop@hadoop-localhost hadoop3.03]$
Create a directory:
bin/hdfs dfs -mkdir /demo1

Upload a file:
bin/hdfs dfs -put etc/hadoop/core-site.xml /demo1
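To confirm the upload, list the directory:
bin/hdfs dfs -ls /demo1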

Read the file's contents on HDFS:
bin/hdfs dfs -cat /demo1/core-site.xml

Download a file from HDFS to the local filesystem:
bin/hdfs dfs -get /demo1/core-site.xml

View the HDFS web UI

In HDFS 2.x the web UI port is 50070:
http://192.168.145.129:50070

In HDFS 3.x the web UI port is 9870:
http://192.168.145.129:9870/dfshealth.html#tab-overview

Configure and start YARN

Configure mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<!-- Run MapReduce on YARN; HADOOP_MAPRED_HOME is the full path of your hadoop distribution directory -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
</configuration>
Configure yarn-site.xml

yarn.nodemanager.aux-services configures YARN's shuffle service; here we select MapReduce's default shuffle implementation.

yarn.resourcemanager.hostname specifies which node the ResourceManager runs on.
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- How reducers fetch map output -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- The address of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-localhost</value>
    </property>
</configuration>
Start the ResourceManager

sbin/yarn-daemon.sh start resourcemanager

Start the NodeManager

sbin/yarn-daemon.sh start nodemanager
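As with the HDFS daemons, yarn-daemon.sh is deprecated in Hadoop 3.x; the current equivalents are:
bin/yarn --daemon start resourcemanager
bin/yarn --daemon start nodemanager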

Alternatively, start the services with the batch scripts.
Start HDFS and YARN:
sbin/start-dfs.sh
sbin/start-yarn.sh

sbin/start-all.sh
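Note that these scripts start the daemons over SSH, so the hadoop user needs passwordless SSH to localhost; if that is not set up yet, the usual recipe from the Hadoop single-node guide is:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys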

The YARN web UI

The YARN web UI listens on port 8088 and can be viewed at http://192.168.145.129:8088/

Run a MapReduce job

Create a test input directory on HDFS:
bin/hdfs dfs -mkdir -p /wordcountdemo/input

The wc.input file contains:
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
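One way to create the file locally, assuming it is kept under /opt/data to match the -put command below, is a heredoc:
cat > /opt/data/wc.input <<'EOF'
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
EOF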
Upload wc.input into the /wordcountdemo/input directory on HDFS:
bin/hdfs dfs -put /opt/data/wc.input /wordcountdemo/input

Run the WordCount MapReduce job

[hadoop@hadoop-localhost hadoop3.03]$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount /wordcountdemo/input /wordcountdemo/output
2018-07-03 19:38:23,956 INFO client.RMProxy: Connecting to ResourceManager at hadoop-localhost/192.168.145.129:8032
2018-07-03 19:38:24,565 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1530615244194_0002
2018-07-03 19:38:24,879 INFO input.FileInputFormat: Total input files to process : 1
2018-07-03 19:38:25,784 INFO mapreduce.JobSubmitter: number of splits:1
2018-07-03 19:38:25,841 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-07-03 19:38:26,314 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530615244194_0002
2018-07-03 19:38:26,315 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-07-03 19:38:26,466 INFO conf.Configuration: resource-types.xml not found
2018-07-03 19:38:26,466 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-07-03 19:38:26,547 INFO impl.YarnClientImpl: Submitted application application_1530615244194_0002
2018-07-03 19:38:26,590 INFO mapreduce.Job: The url to track the job: http://hadoop-localhost:8088/proxy/application_1530615244194_0002/
2018-07-03 19:38:26,590 INFO mapreduce.Job: Running job: job_1530615244194_0002
2018-07-03 19:38:35,985 INFO mapreduce.Job: Job job_1530615244194_0002 running in uber mode : false
2018-07-03 19:38:35,988 INFO mapreduce.Job:  map 0% reduce 0%
2018-07-03 19:38:42,310 INFO mapreduce.Job:  map 100% reduce 0%
2018-07-03 19:38:47,402 INFO mapreduce.Job:  map 100% reduce 100%
2018-07-03 19:38:49,469 INFO mapreduce.Job: Job job_1530615244194_0002 completed successfully
2018-07-03 19:38:49,579 INFO mapreduce.Job: Counters: 53
    File System Counters
        FILE: Number of bytes read=94
        FILE: Number of bytes written=403931
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=195
        HDFS: Number of bytes written=60
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4573
        Total time spent by all reduces in occupied slots (ms)=2981
        Total time spent by all map tasks (ms)=4573
        Total time spent by all reduce tasks (ms)=2981
        Total vcore-milliseconds taken by all map tasks=4573
        Total vcore-milliseconds taken by all reduce tasks=2981
        Total megabyte-milliseconds taken by all map tasks=4682752
        Total megabyte-milliseconds taken by all reduce tasks=3052544
    Map-Reduce Framework
        Map input records=4
        Map output records=11
        Map output bytes=115
        Map output materialized bytes=94
        Input split bytes=122
        Combine input records=11
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=94
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=171
        CPU time spent (ms)=1630
        Physical memory (bytes) snapshot=332750848
        Virtual memory (bytes) snapshot=5473169408
        Total committed heap usage (bytes)=165810176
        Peak Map Physical memory (bytes)=214093824
        Peak Map Virtual memory (bytes)=2733207552
        Peak Reduce Physical memory (bytes)=118657024
        Peak Reduce Virtual memory (bytes)=2739961856
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=73
    File Output Format Counters
        Bytes Written=60
[hadoop@hadoop-localhost hadoop3.03]$
The word-count results:
[hadoop@hadoop-localhost hadoop3.03]$ bin/hdfs dfs -cat /wordcountdemo/output/part-r-00000
hadoop 3
hbase 1
hive 2
mapreduce 1
spark 2
sqoop 1
storm 1
[hadoop@hadoop-localhost hadoop3.03]$
The results are sorted by key.

Stop Hadoop

sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemon.sh stop nodemanager

Batch scripts to stop everything:
sbin/stop-yarn.sh
sbin/stop-dfs.sh

sbin/stop-all.sh

A brief introduction to the HDFS module


HDFS handles big-data storage. By splitting large files into blocks and storing them in a distributed fashion, it escapes the disk-size limit of a single server and solves the problem of one machine being unable to hold a large file. HDFS is a relatively independent module: it can serve YARN, and it can also serve other modules such as HBase.

A brief introduction to the YARN module

YARN is a general-purpose resource coordination and task scheduling framework, created to solve, among other problems, the overload of the JobTracker in Hadoop 1.x MapReduce.

YARN is a general framework: besides MapReduce, it can also run Spark, Storm, and other computing frameworks.

A brief introduction to the MapReduce module


MapReduce is a computing framework. It defines a way of processing data: distributed, streaming-style processing through a Map phase and a Reduce phase. It is suited only to offline processing of big data, not to applications with strict real-time requirements.
