Today we will learn about the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. In a world where data is generated at an alarming rate, it is very valuable to analyze that data at exactly the right time. Apache Spark is one of the most powerful frameworks for real-time big data processing and analysis, and among the programming languages used for data analysis today, Python sits near the top. So if you understand the error below well, you will find it much easier to work with PySpark.
When does the problem occur?
Environment: PySpark installed, with Python 3 fully set up.
Below is a snippet of the program the coder used in his project, which went wrong:
>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1, 1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
And here is the error displayed after the coder runs it:
[Stage 0:> (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
The coder had set the following environment variables:
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3
Solution for the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
Because neither variable is set, the driver runs under Python 3.5 while the workers fall back to the system default python, which is 2.7; hence the version mismatch. Every option below applies the same idea: point both the driver and the workers at the same Python 3 interpreter.
Option 1
To handle the above situation, you can consider using PyCharm. Open your run configuration (Run > Edit Configurations) and add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the Environment variables field, both pointing to your Python 3 interpreter.
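For example, you would add two entries like these (a sketch; adjust /usr/bin/python3 to whatever which python3 prints on your machine):
PYSPARK_PYTHON=/usr/bin/python3
PYSPARK_DRIVER_PYTHON=/usr/bin/python3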
Option 2
Set the following environment variables in $SPARK_HOME/conf/spark-env.sh, pointing both at the same Python 3 interpreter:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
If spark-env.sh does not exist, rename (or copy) spark-env.sh.template to spark-env.sh first.
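A minimal sketch of the whole step in a terminal, assuming SPARK_HOME is already set to /opt/spark as above:
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> spark-env.sh
echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3' >> spark-env.sh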
Option 3
If you are using Jupyter Notebook to learn PySpark, first find where python3 is installed by running this in a terminal:
which python3
Suppose it points to /usr/bin/python3. Then, in the notebook, run the following before creating the SparkContext:
import os

# Set the Spark environment variables before any SparkContext is created,
# so that the driver and the workers use the same interpreter
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'
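With both variables pointing at the same interpreter, the original snippet should now run without the version mismatch. A quick check, reusing the code from the beginning of the article and printing only the first few elements:
from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize(range(1, 1000))
print(rdd.collect()[:5])  # expected: [1, 2, 3, 4, 5]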
Conclusion
Through the above article, we have given several ways to handle the above error. We hope you now better understand the problem of “environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON”. If you have any difficulties or questions, please leave us a message.
In addition, we have other articles related to the above issue. If you want to learn more, scroll down and look in the tags. Thank you for reading, and hope to see you soon!
One more tip: you can also set both variables in .bash_profile so that they apply to every new shell session. That was how I solved it once I realized the cause of the issue: my default Python version was 2.7, which I discovered by entering the command python --version.
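A minimal sketch of that approach, assuming python3 lives at /usr/bin/python3:
# Confirm the default interpreter (here it reported Python 2.7)
python --version
# Persist the variables for every new shell session
echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> ~/.bash_profile
echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3' >> ~/.bash_profile
source ~/.bash_profile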