You may see the "Python: No module named 'pyspark'" error message when running a PySpark script for the first time. Read on to learn why it occurs and how to fix it.
When Does The Error "Python: No module named 'pyspark'" Happen?
This error occurs when Python can't find the pyspark module. The most likely scenario is that you haven't installed Spark on your system. Spark ships with high-level APIs in several programming languages, including Python.
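For example, even a one-line script that does nothing but import the module will fail with a traceback similar to the one below (the script name here is just an example):
import pyspark
Traceback (most recent call last):
  File "main.py", line 1, in <module>
    import pyspark
ModuleNotFoundError: No module named 'pyspark'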
Apache Spark was originally written in Scala – a programming language that compiles to Java bytecode – so it requires a Java Virtual Machine (JVM) to run. PySpark provides a friendlier Python interface to the same functionality under the hood.
How to fix the issue
To fix the issue, you will need to:
- Install Java Development Kit (JDK) – we recommend Oracle’s implementation or OpenJDK.
- Install Spark.
- Configure your system’s environment variables.
Install JDK
Windows
- Go to Oracle’s official Java download website.
- Pick one of the three installer formats: ZIP, EXE, or MSI. The EXE or MSI file is recommended if you want the easiest installation process.
- Wait for the download to finish. You can verify the integrity of this file by comparing its size and SHA256 hash to information on the download website.
- Make sure you are running a 64-bit version of Windows and that your account has administrative privileges.
- Double-click the installer in your download folder and follow the on-screen instructions – there is no need to change any of the default settings.
Your JDK installation should be located at “C:\Program Files\Common Files\Oracle\Java” by default. Open File Explorer to verify this.
Windows users also have the option of installing Oracle JDK silently by using the command:
<installer_file>.exe /s
Replace <installer_file> with the actual name of the installer you have just downloaded.
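After the installation finishes, you can confirm that Java is available by opening a new Command Prompt and running:
java -version
The command should print the version of the JDK you just installed. If Windows reports that the command is not recognized, the installation did not complete or your PATH was not updated.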
macOS
As with Windows, you will need administrative privileges to install Oracle JDK on your Mac machine.
Note that Oracle JDK currently supports both Intel-based and Arm-based Macs. However, there is no per-user JDK installation option on this operating system – you can only perform a system-wide installation.
- Go to Oracle’s official Java download website.
- Choose the correct installer for your system (DMG is the recommended format). Pick x64 for an Intel-based Mac, or Arm64 for newer Apple silicon machines.
- Wait for the download to complete and double-click the DMG file to begin the installation. Follow instructions on the screen. Enter your password when required.
- The installation should complete shortly, after which you can delete the DMG installer file to save space.
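To confirm the installation, open Terminal and ask macOS's built-in helper to list the JDKs it can find (you can also simply run java -version):
/usr/libexec/java_home -V
It should print the version and path of the Oracle JDK you just installed.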
Linux
Most Linux distributions ship OpenJDK – the reference implementation of Java – in their official software repositories. This is the recommended method, as it ensures the JDK stays up to date whenever you update your Linux system.
Use these commands to install OpenJDK, depending on your distribution.
Ubuntu, Debian, etc.
sudo apt-get install openjdk-8-jdk
Fedora
List available OpenJDK versions:
dnf search openjdk
Install OpenJDK (use the package name of the version you want to install):
sudo dnf install <openjdk-package-name>
Arch Linux
sudo pacman -S jdk-openjdk
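Whichever distribution you use, you can verify that the full JDK (and not just the runtime) is installed by checking the Java compiler:
javac -version
If the command prints a version number, the JDK is ready for Spark.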
Install Spark
- Go to the Apache Spark download website.
- Pick a different Spark version or package type if you need to, then click the download link and wait for the download to complete.
- Extract the TGZ file to a folder of your choice. On Windows, you can use a free tool such as 7-Zip; on macOS and Linux, you can use tar, as shown below.
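On macOS and Linux you can extract the archive from the terminal. The file name below is only an example – use the name of the archive you actually downloaded:
tar -xzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 ~/spark
The second command simply moves the extracted folder to a convenient location; remember this path, as it becomes SPARK_HOME in the next step.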
Configure Environment Variable
Windows
- Right-click the Windows icon on the taskbar > System > Advanced system settings.
- Switch to the Advanced tab > Environment Variables.
- Under System variables, click New.
- Enter SPARK_HOME as the Variable name and the path to your Spark installation (the folder where you extracted the TGZ file) as the Variable value.
- Click OK > OK.
- Log out and log in again.
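If you prefer the command line, you can set the variable from a Command Prompt with the built-in setx command instead (the path below is only an example – use your actual Spark folder):
setx SPARK_HOME "C:\spark\spark-3.5.0-bin-hadoop3"
Note that setx only affects new Command Prompt windows, so close and reopen the terminal before testing.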
macOS & Linux
- Open the file ~/.bash_profile with a text editor (on newer macOS versions, where zsh is the default shell, use ~/.zprofile instead).
- Insert this line into the file (<path> is the absolute path to the Spark folder) and save it:
export SPARK_HOME=<path>
- Log out and log in again so the change takes effect.
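As a concrete sketch, assuming you extracted Spark to ~/spark, the relevant lines in the file could look like this. Adding Spark's bin directory to PATH is optional, but it lets you launch pyspark and spark-submit from anywhere:
export SPARK_HOME=~/spark
export PATH="$SPARK_HOME/bin:$PATH"
You can also run source ~/.bash_profile to apply the change without logging out.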
Using findspark
The findspark module locates your Spark installation and adds its Python libraries to Python's module search path so that pyspark can be imported.
Install the module with pip:
pip install findspark
Insert these lines at the top of your script, or run them before importing pyspark ("path" is the full path to your Spark folder):
import findspark
findspark.init("path")
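If SPARK_HOME is already set as described above, findspark.init() can also be called with no arguments and will pick the location up from the environment. A minimal end-to-end check could look like this (the application name is arbitrary):
import findspark
findspark.init()  # falls back to SPARK_HOME when no path is given

from pyspark.sql import SparkSession

# Create a local SparkSession and run a trivial job to confirm the setup works
spark = SparkSession.builder.appName("pyspark-check").getOrCreate()
print(spark.range(5).count())  # should print 5
spark.stop()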
Conclusion
You can fix the Python: No module named 'pyspark' error by installing the necessary software on your system. You will need the JDK and Spark, and you will have to set the correct environment variables. After installing PySpark, read this guide to get the hang of it.