Spark is already deployed on a Linux virtual machine, but every Python Spark script I write has to be copied to the virtual machine for testing, which is tedious. So I set up a local test environment on Windows 7, using PyCharm as the development tool.

Running a Python Spark script locally naturally requires the Spark runtime, so the prerequisite is to set up a Spark environment on the local Windows 7 machine.

【 Steps 】
1. Build the local Spark test environment.
2. Configure the Spark environment location in PyCharm, so that a script can load the Spark libraries when it runs.
3. Write the program and run it.
【1 Build the local Spark test environment 】
To avoid problems caused by version mismatches, the Spark version installed locally on Windows should match the Spark version on the virtual machine.
Spark depends on Scala, and Scala depends on the JDK, so both the JDK and Scala also have to be installed.
On the virtual machine:
jdk 1.7, scala 2.11.8, spark-2.0.1


It is best to install the same versions on the local Windows machine.



Download the corresponding Windows versions (the JDK 1.7 packages for Linux and Windows are different, while the Scala and Spark packages are the same on Linux and Windows: the JDK is what implements platform independence, so it is tied to the operating system, whereas Scala and Spark run on top of the JDK and are therefore platform independent). Then configure the related environment variables for the JDK, Scala and Spark.
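A rough sketch of what those Windows environment variables might look like (the JDK and Scala install paths here are placeholders; only the Spark path matches the one used later in this post):

// your JDK install directory (placeholder path)
JAVA_HOME=C:\Java\jdk1.7.0
// your Scala install directory (placeholder path)
SCALA_HOME=C:\Java\scala-2.11.8
// the Spark directory (the same one used later in this post)
SPARK_HOME=C:\Java\spark-2.0.1-bin-hadoop2.7
// add the bin directories to PATH
PATH=%PATH%;%JAVA_HOME%\bin;%SCALA_HOME%\bin;%SPARK_HOME%\bin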

【2 Install the Python environment 】


Different Spark releases support different Python versions; the version requirement can be seen in the bin/pyspark startup script under the Spark directory:
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
The latest version, Python 3.6, does not work well either; under a Python 3.6 environment the following error is reported:

TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'.

So install Python 2.7 or above (but not 3.6); I chose Python 3.5.
After configuring the Python 3.5 environment variables, check that the interpreter runs normally.
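As a quick sanity check (a small sketch of my own, not part of the original steps), the interpreter can report its own version:

import sys

# print the interpreter version; for this setup it should report 3.5.x
print(sys.version)
print(sys.version_info[:2])   # expected: (3, 5)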


【3 Install and configure PyCharm 】
Download and install PyCharm (students with a school email address can register for a free license).
Create a new Python project called testspark and write a test.py script as follows:
import os
import sys
import re

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)

Right-click test.py and select 【run 'test.py'】; the run will output:

Can not import Spark Modules No module named 'pyspark'

This means the try block did not succeed: loading the Spark library files failed at

from pyspark import SparkContext
from pyspark import SparkConf

The run configuration has to be told where the Spark library files are. Configure it as follows: Run menu ——> Edit Configurations ——> select the script test.py ——> Environment variables ——> add the Spark environment variables (values below) ——> Apply.
Then save and run again.






// The python directory and the py4j package under the local spark directory
PYTHONPATH=C:\Java\spark-2.0.1-bin-hadoop2.7\python;C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip
// The spark directory location
SPARK_HOME=C:\Java\spark-2.0.1-bin-hadoop2.7
Run test.py again.



The Spark modules now load successfully, and Spark code can be tested locally.
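As a minimal local smoke test (my own sketch; the app name and data here are just illustrative, not from the original setup):

from pyspark import SparkConf, SparkContext

# run Spark locally with two worker threads
conf = SparkConf().setMaster("local[2]").setAppName("testspark")
sc = SparkContext(conf=conf)

# a tiny RDD job to confirm the local environment works
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())   # expected: [1, 4, 9, 16, 25]

sc.stop()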

Note: because the environment variables are configured per script (per run configuration), the PYTHONPATH and SPARK_HOME variables have to be added again for every new .py file.
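One possible way around this (my own sketch, not part of the original steps) is to set the same locations at the top of the script itself, before importing pyspark, so no per-script run configuration is needed; the paths must match the local Spark install used above.

import os
import sys

# equivalent of the SPARK_HOME environment variable
os.environ["SPARK_HOME"] = r"C:\Java\spark-2.0.1-bin-hadoop2.7"

# equivalent of PYTHONPATH: make the pyspark package and py4j importable
sys.path.append(r"C:\Java\spark-2.0.1-bin-hadoop2.7\python")
sys.path.append(r"C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip")

from pyspark import SparkContext, SparkConf
print("Successfully imported Spark Modules")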