One. Platform environment
Virtual machine: VMware Workstation Pro (64-bit)
Operating system: Ubuntu 16.04 (64-bit)

Two. Software packages
jdk-8u171-linux-x64.tar.gz (Java version 1.8.0_171)
hadoop-2.9.1.tar.gz
scala-2.11.6
spark-2.3.1-bin-hadoop2.7


Three. Building the Spark distributed cluster environment
0. Preparation
First, we set up a single virtual machine. Once its common configuration is in place, we clone it twice more, which saves a lot of repeated work.
The system is Ubuntu 16.04. The virtual machine's network adapter is set to NAT mode, a static IP address is configured manually, and the firewall is turned off.

Next, download the software needed for the Spark cluster from the official websites. All downloaded packages are stored in the ~/spark working directory on Ubuntu 16.04; the absolute path is /home/hadoop/spark.
1.     Create the hadoop user
Ubuntu 16.04 is already installed on the virtual machine, but for convenience we create a dedicated system user; here a new user named "hadoop" is added. The process is as follows:
Create the hadoop user, set its password, grant the hadoop user administrator rights, log out (click the gear icon in the upper-right corner of the screen and choose Log Out), and log back in as the newly created hadoop user.
Press Ctrl+Alt+T to open a terminal window and enter the following commands to complete the creation:
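A minimal sketch of the standard Ubuntu commands for this step (user name hadoop as in the text):

    sudo useradd -m hadoop -s /bin/bash   # create the hadoop user with a home directory and bash as its shell
    sudo passwd hadoop                    # set the hadoop user's password
    sudo adduser hadoop sudo              # grant the hadoop user administrator (sudo) rights
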
2.     Modify the machine configuration
1)     Cluster layout: 1 master and 2 slaves, named slave1 and slave2.
2)     Modify the hostname: change it to master and check the new value. The machine must be rebooted afterwards!! Otherwise, errors will be reported when executing other commands.
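As a sketch, on Ubuntu 16.04 the hostname can be changed by editing /etc/hostname (hostnamectl set-hostname master would also work):

    sudo vi /etc/hostname    # replace the contents with: master
    sudo reboot              # the new name only takes effect reliably after a reboot
    hostname                 # after the reboot, this should print: master
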
3)     Modify the machine's IP address; the result can be checked with ifconfig.
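On Ubuntu 16.04 a static IP is normally set in /etc/network/interfaces; the interface name and addresses below are placeholders for illustration and must match your own NAT network:

    sudo vi /etc/network/interfaces
    # example content (adjust the interface name and addresses):
    #   auto ens33
    #   iface ens33 inet static
    #       address 192.168.100.10
    #       netmask 255.255.255.0
    #       gateway 192.168.100.2
    sudo /etc/init.d/networking restart   # or simply reboot
    ifconfig                              # verify the new IP address
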
4)     Modify the hosts file
Note: the hosts file normally already contains a localhost entry with its IP address. Never delete it; just append the three lines for master, slave1, and slave2 at the bottom. If localhost is deleted, Spark will have no entry point!
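The IP addresses below are placeholders and must match the static addresses configured in the previous step:

    sudo vi /etc/hosts
    # keep the existing 127.0.0.1 localhost line and append, for example:
    #   192.168.100.10  master
    #   192.168.100.11  slave1
    #   192.168.100.12  slave2
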
3.     Update apt (optional)
Run sudo apt-get update to refresh the package lists, so that dependent software installs without problems. If the Ubuntu packages on all hosts are already up to date, this step is unnecessary.
4.     Install vim
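The usual command is:

    sudo apt-get install vim
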
5.     Install the JDK
1) In the ~/spark directory, unpack the archive with tar -zvxf jdk-8u171-linux-x64.tar.gz.
2) Modify the environment variables: edit the configuration file with sudo vi /etc/profile (the added lines are sketched below), then run source /etc/profile so the changes take effect.
3) Check whether the JDK was installed successfully with java -version.
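The lines added to /etc/profile would typically look like the following (the directory name jdk1.8.0_171 is assumed from the package version):

    export JAVA_HOME=/home/hadoop/spark/jdk1.8.0_171
    export JRE_HOME=$JAVA_HOME/jre
    export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
    export PATH=$JAVA_HOME/bin:$PATH

After source /etc/profile, java -version should report version 1.8.0_171.
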
6.     Install Scala
1)     Install Scala online with sudo apt install scala.
2)     Check whether the installation succeeded with scala -version.
3)     Find the installation path with which scala.
4)     As with the JDK, edit the environment-variable configuration file with sudo vi /etc/profile, then run source /etc/profile so the changes take effect.
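Assuming apt places Scala under /usr/share/scala (the path printed by which scala may differ), the added lines would look like:

    export SCALA_HOME=/usr/share/scala
    export PATH=$SCALA_HOME/bin:$PATH
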
7.     Install the SSH service
Install it with sudo apt-get install openssh-server. The detailed SSH configuration is done after the two slaves have been cloned.
8.     Clone the host
1)     With the configuration above in place, the common setup of the cluster nodes is basically complete. Next, clone two slaves from this machine to form a cluster of 1 master and 2 slaves (slave1, slave2).
2)     On each clone, modify the hostname file and the machine's own IP address (since the clones come from master, the hosts file needs no further changes). The hostname and IP must match the entries in the hosts file; the IP can be checked with ifconfig (see step 2).
3)     Reboot the machines and use the ping command to check that they can reach each other.
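A typical connectivity check, run from master (and mirrored on each slave):

    ping -c 3 slave1    # should receive replies from slave1
    ping -c 3 slave2
    ping -c 3 master    # run this one on each slave
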
9.     Configure master and the slaves
1)     Set up a password-free SSH login environment.
2)     Generate a public/private key pair on each host.
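The usual command, run as the hadoop user on master, slave1, and slave2 (press Enter at every prompt):

    ssh-keygen -t rsa    # writes id_rsa and id_rsa.pub to ~/.ssh/
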
3)     Send the id_rsa.pub files on slave1 and slave2 to master.
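One common way is to copy each slave's public key to master under a distinct name; the target file names here are only illustrative:

    # on slave1
    scp ~/.ssh/id_rsa.pub hadoop@master:~/.ssh/id_rsa.pub.slave1
    # on slave2
    scp ~/.ssh/id_rsa.pub hadoop@master:~/.ssh/id_rsa.pub.slave2
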
4)     On master, append all public keys to the authentication file authorized_keys, and check the generated file.
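A sketch of the commands on master, using the file names from the previous step:

    cd ~/.ssh
    cat id_rsa.pub id_rsa.pub.slave1 id_rsa.pub.slave2 >> authorized_keys
    cat authorized_keys    # view the combined key file
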
5)     Distribute the authorized_keys file on master to slave1 and slave2.
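For example, run on master:

    scp ~/.ssh/authorized_keys hadoop@slave1:~/.ssh/
    scp ~/.ssh/authorized_keys hadoop@slave2:~/.ssh/
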
6)     Finally, use the ssh command to verify the password-free login.
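For example, from master:

    ssh slave1    # should log in without asking for a password
    exit
    ssh slave2
    exit
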
10.     Install Hadoop
1)     In the ~/spark directory, unpack the downloaded Hadoop archive.
2)     Enter the Hadoop configuration directory with cd spark/hadoop/hadoop-2.9.1/etc/hadoop. In this directory the following configuration files need to be modified: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Edit each of them in turn as follows.
Modify hadoop-env.sh: add JAVA_HOME.
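A sketch of the line to add, using the JDK path assumed earlier:

    export JAVA_HOME=/home/hadoop/spark/jdk1.8.0_171
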
Modify yarn-env.sh: add JAVA_HOME.
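The same line is added to yarn-env.sh:

    export JAVA_HOME=/home/hadoop/spark/jdk1.8.0_171
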
Modify slaves: list the IP address or hostname of each slave node.
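With the hostnames used in this guide, the slaves file simply contains:

    slave1
    slave2
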
Modify core-site.xml.
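A minimal sketch: it points the default file system at the namenode on master and sets a temporary directory (the tmp path is an assumption):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/spark/hadoop/hadoop-2.9.1/tmp</value>
      </property>
    </configuration>
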
Modify hdfs-site.xml.
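A minimal sketch; with two datanodes a replication factor of 2 is common, and the directory paths are assumptions:

    <configuration>
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/spark/hadoop/hadoop-2.9.1/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/spark/hadoop/hadoop-2.9.1/dfs/data</value>
      </property>
    </configuration>
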
Modify mapred-site.xml (first copy the file's template, rename the copy to mapred-site.xml, and finally edit it).
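A sketch of the copy plus the usual minimal content:

    cp mapred-site.xml.template mapred-site.xml   # create the file from its template

    # then, inside <configuration>, set MapReduce to run on YARN:
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
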
Modify yarn-site.xml.
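A minimal sketch that points the node managers at master and enables the MapReduce shuffle service:

    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
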
3)     Distribute the configured hadoop-2.9.1 directory to all slaves.
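For example, from master (the target path mirrors the layout on master):

    scp -r ~/spark/hadoop/hadoop-2.9.1 hadoop@slave1:~/spark/hadoop/
    scp -r ~/spark/hadoop/hadoop-2.9.1 hadoop@slave2:~/spark/hadoop/
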
4)     Finally, format the namenode.
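Run on master from the Hadoop installation directory:

    cd ~/spark/hadoop/hadoop-2.9.1
    bin/hdfs namenode -format
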
5)     Start the Hadoop cluster.
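From the same directory on master:

    sbin/start-dfs.sh     # starts the NameNode, SecondaryNameNode, and the DataNodes
    sbin/start-yarn.sh    # starts the ResourceManager and the NodeManagers
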
6)     Verify that Hadoop was installed successfully.
i)     Use the jps command to view the Hadoop processes.
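Typical jps output looks roughly like this (process IDs omitted):

    # on master: NameNode, SecondaryNameNode, ResourceManager, Jps
    # on each slave: DataNode, NodeManager, Jps
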
ii)     Enter http://master:8088 in a browser to open the Hadoop (YARN) management interface.

11.     Install Spark
1)     In the ~/spark directory, unpack the downloaded Spark archive.
2)     Enter the configuration directory with cd spark/spark-2.3.1/conf. Many files there end in .template, because Spark ships template configuration files: copy a template, strip the .template suffix to turn it into a real configuration file, and then edit it.
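For example:

    cp spark-env.sh.template spark-env.sh
    cp slaves.template slaves
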
3)     Configure spark-env.sh, which holds Spark's runtime environment settings.
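A sketch of typical contents for this setup; the paths and the worker memory size are assumptions to adapt:

    export JAVA_HOME=/home/hadoop/spark/jdk1.8.0_171
    export SCALA_HOME=/usr/share/scala
    export HADOOP_HOME=/home/hadoop/spark/hadoop/hadoop-2.9.1
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export SPARK_MASTER_HOST=master
    export SPARK_WORKER_MEMORY=1g
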
4)     Configure the slaves file.
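As with Hadoop, it simply lists the worker hostnames:

    slave1
    slave2
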
5)     Distribute the configured spark-2.3.1 directory to all slaves.
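For example, from master:

    scp -r ~/spark/spark-2.3.1 hadoop@slave1:~/spark/
    scp -r ~/spark/spark-2.3.1 hadoop@slave2:~/spark/
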
6)     Start the Spark cluster.
i)     Enter the Hadoop directory with cd spark/hadoop/hadoop-2.9.1 and, from there, start the Hadoop distributed file system HDFS and the Hadoop resource manager YARN.
ii)     Start Spark.
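A sketch of the start-up commands:

    # in the Hadoop directory
    sbin/start-dfs.sh
    sbin/start-yarn.sh

    # then in the Spark directory
    cd ~/spark/spark-2.3.1
    sbin/start-all.sh    # starts the Spark Master on master and a Worker on each slave
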
7)     View the Spark cluster information.
i)     Use the jps command to view the Spark processes.
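In addition to the Hadoop processes, jps should now also show roughly:

    # on master: Master
    # on each slave: Worker
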
ii)     View the Spark management interface by entering http://master:8080 in a browser.

8)     Run spark-shell to enter Spark's interactive shell console.
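For example, from the Spark directory (attaching to the standalone master with --master is optional; without it the shell runs locally):

    bin/spark-shell --master spark://master:7077
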
9)     Stopping the cluster
To stop the cluster, run sbin/stop-all.sh in the Spark directory to stop the Spark cluster, run sbin/stop-dfs.sh to shut down the Hadoop distributed file system HDFS, and finally run sbin/stop-yarn.sh to shut down the Hadoop resource manager YARN.

 

Four. Running a Spark example

Interactive data analysis with the Spark shell

1. Load a local file into HDFS

1) Run bin/hdfs dfs -mkdir -p /data/input to create a test directory /data/input on the distributed file system.

2) Run hdfs dfs -put README.txt /data/input to copy the README.txt file from the hadoop-2.9.1 directory into the distributed file system.

3) Run bin/hdfs dfs -ls /data/input to check whether the copied file is present in the HDFS file system.


2. In the spark-shell window, write Scala statements to load the README.txt file from HDFS and briefly analyze it

1) First, open the README.txt file to see what it contains.

2) In the spark-shell window, analyze README.txt with operations such as count(), first(), and collect().

    i) Load the README.txt file from HDFS.
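In spark-shell this is typically a single line; the HDFS URI assumes the fs.defaultFS configured earlier, and the value name textFile is only illustrative:

    scala> val textFile = sc.textFile("hdfs://master:9000/data/input/README.txt")
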
    ii) Meaning of count(): the number of items in the RDD; for a text file, that is the total number of lines. Meaning of first(): the first item in the RDD; for a text file, that is the first line, which matches the first line seen in step 1), so the file was loaded without error.
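For example:

    scala> textFile.count()    // total number of lines in README.txt
    scala> textFile.first()    // the first line of the file
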
    iii) Use collect() to count the words in the file.
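A minimal word-count sketch in spark-shell (splitting on single spaces is an assumption about the intended tokenization):

    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    scala> wordCounts.collect()    // bring the (word, count) pairs back to the driver and print them
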
 

The test is complete and the cluster was built successfully!