Spark is already deployed on a Linux virtual machine, but every time I write a Spark Python script I have to test it on the virtual machine, which is a hassle. So I set up a local test/run environment under the Windows 7 system, together with the PyCharm development tool.

Running a Spark Python script locally of course requires the Spark-related environment, so the prerequisite is to set up the Spark environment locally under Windows 7.

【The steps are as follows】
1. Set up the local Spark test environment.
2. Configure the Spark environment in the PyCharm development tool so that the script can successfully call the Spark resource packages when it runs.
3. Write the program and run it.
【1 Set up the local Spark test environment】
To avoid problems caused by version inconsistency, the Spark version installed locally on Windows should match the Spark version on the virtual machine.
Spark depends on Scala, and Scala depends on the JDK, so the JDK and Scala must also be installed.
On the virtual machine:
jdk1.7, scala 2.11.8, spark-2.0.1


It is best to install the same versions locally on Windows, as shown in the chart below.



Download the corresponding Windows versions (the jdk1.7 packages for Linux and Windows are different, while Scala and Spark are the same on Linux and Windows. The JDK is what achieves platform independence, so it is tied to the operating system; Scala and Spark run on top of the JDK, so they are not tied to the platform). Then configure the related environment variables for the JDK, Scala, and Spark.
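
As a quick sanity check after setting the variables, a short Python snippet like the one below can print them out. The variable names JAVA_HOME, SCALA_HOME, and SPARK_HOME are the usual conventions and are my assumption here, not something this article's setup dictates.

# check_env.py - minimal sketch to confirm the environment variables are visible
# (JAVA_HOME / SCALA_HOME / SPARK_HOME are assumed names; adjust to your own setup)
import os

for name in ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))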

【2 Install the Python environment】


Different Spark versions support different Python versions; the version requirement can be checked in the bin/pyspark script under the Spark directory:

# Attempt to use Python 2.7, if installed:
if hash python2.7 2>/dev/null; then
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
The latest version, 3.6, does not seem to work either; in a Python 3.6 environment the following error is reported:

TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

So install 2.7 or above (but not 3.6); I chose Python 3.5.
After configuring the Python 3.5 environment variables, check whether it runs normally.
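
One simple way to confirm that the interpreter picked up from the PATH really is the 3.5 installation (rather than some other Python on the machine) is to print its version and location from Python itself; this is just a convenience check I am adding, not a step from the original setup.

# print which interpreter is actually being used
import sys

print(sys.version)     # should report 3.5.x
print(sys.executable)  # path of the python.exe found on the PATH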


【3 Install and configure PyCharm】
Download and install PyCharm (search Baidu for cracking methods; students with a school e-mail address can register for free).
Create a new testspark Python project and write a test.py script as follows:
import os
import sys
import re

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)

Select test.py, right-click, and choose 【run 'test.py'】 to execute it. The output is:

Can not import Spark Modules No module named 'pyspark'

This means the try block in the code did not succeed: loading the Spark library files failed for

from pyspark import SparkContext
from pyspark import SparkConf
To make the run succeed, the location of the Spark library files needs to be configured. The configuration is as follows: Run menu bar ——> Edit Configurations ——> select the script file test.py ——> Environment variables ——> add the Spark environment variables ——> Apply.
Then save and run again.






// location of the python directory and the py4j package under the local Spark directory
PYTHONPATH=C:\Java\spark-2.0.1-bin-hadoop2.7\python;C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip
// location of the Spark directory
SPARK_HOME=C:\Java\spark-2.0.1-bin-hadoop2.7
Re-execute test.py.



The Spark modules are loaded successfully. You can now test Spark code locally.
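
As an illustration of what testing locally can look like, here is a minimal sketch of a local Spark job. The app name and the local[2] master setting are arbitrary choices for this example, not something prescribed by the setup above.

# test_local_spark.py - minimal local Spark job (a sketch; adjust master/app name as needed)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("local-test")
sc = SparkContext(conf=conf)

# parallelize a small dataset and run trivial actions to prove the local setup works
rdd = sc.parallelize(range(100))
print("count =", rdd.count())   # expect 100
print("sum   =", rdd.sum())     # expect 4950

sc.stop()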

Note: Because the environment variables are configured for the current py script's run configuration, after creating a new py file you need to add the PYTHONPATH and SPARK_HOME environment variables again.
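
If re-adding the variables for every new file becomes tedious, one possible workaround (my own suggestion, not part of the original setup) is to set SPARK_HOME and extend sys.path at the top of each script before importing pyspark. The paths below are the same ones used above and would need to match your own installation.

import os
import sys

# point at the local Spark installation before pyspark is imported
# (paths are the ones from this article; change them to match your machine)
os.environ.setdefault("SPARK_HOME", r"C:\Java\spark-2.0.1-bin-hadoop2.7")
sys.path.append(r"C:\Java\spark-2.0.1-bin-hadoop2.7\python")
sys.path.append(r"C:\Java\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip")

from pyspark import SparkContext, SparkConf
print("Successfully imported Spark Modules")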