You may already be familiar with the SELECT statement in SQL. This tutorial shows you how to select columns from a DataFrame in PySpark when you switch to this framework.
PySpark Select Columns From DataFrame
You will need to use the DataFrame.select() method from the pyspark.sql module to make column selections in PySpark. Its syntax is quite simple:
select(*cols)
The method accepts one or more expressions or column names that indicate your intended selection. select() interprets the expression(s) and returns a new DataFrame containing only the selected columns.
There are many tasks you can accomplish with this method. To demonstrate them one by one, we are going to create a DataFrame first:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Site", StringType(), True),
    StructField("Ranking", IntegerType(), True),
    StructField("Site Type", StringType(), True),
    StructField("Foundation", IntegerType(), True),
    StructField("Top Language", StringType(), True),
])

data = [
    ("ITTutoria", 1, "Tutorials", 2022, "Python"),
    ("Stack Overflow", 2, "Q&A", 2008, "JavaScript"),
    ("Quora", 3, "Q&A", 2009, None),
    ("Reddit", 4, "Forum", 2005, None),
]

df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
This simple example DataFrame lists several knowledge-sharing websites. It has five columns, which you can select in many ways with the method select(). You can learn more about creating DataFrames in PySpark with this guide.
Select One Column
This can be done by passing the name of the column you want to pick from the DataFrame.
df.select("Site").show()
As you can see, select() resolves the name and returns a DataFrame containing only that column, which show() then prints. Instead of a string, you can also refer to the column as an attribute of the DataFrame. This expression gives you the same result:
df.select(df.Site).show()
In a similar manner, you can select the column by using square brackets:
df.select(["Site"]).show()
df.select(df["Site"]).show()
You can also use the method col() from the pyspark.sql.functions module to select a column:
from pyspark.sql.functions import col
df.select(col("Site")).show()
Thanks to the columns property, which holds the column names as a plain Python list, you can select a column by its position as well. "Site" is the first column, so it sits at index 0, and the slice [:1] keeps just that name:
df.select(df.columns[:1]).show()
You can even use a regular expression to get the job done (even though this approach is better suited to multiple selections). Make sure that only one column matches the expression you give the function colRegex():
df.select(df.colRegex("`^.*Site*`")).show()
In this case, only the first column matches the ^.*Site* expression.
Select Multiple Columns
All the solutions above can be tweaked slightly to pick up several columns at the same time. All these statements give you the same result: the first and the third columns.
df.select("Site", "Site Type").show()
df.select(["Site", "Site Type"]).show()
df.select(df["Site"], df["Site Type"]).show()
from pyspark.sql.functions import col
df.select(col("Site"), col("Site Type")).show()
df.select(df.colRegex("`^.*Site.*`")).show()
The slicing operation can be used to select a range of consecutive columns. This statement selects the second through fourth columns (Ranking, Site Type, and Foundation):
df.select(df.columns[1:4]).show()
Select All Columns
To select all columns of a DataFrame in PySpark, pass the string "*" to the method select():
df.select("*").show()
Conclusion
In PySpark, selecting columns from a DataFrame is done with the method select(). It makes use of the API's labeling and indexing system — names, attributes, indexes, and regular expressions — to make the selection happen.