This tutorial explains how to sort a Pandas DataFrame by multiple columns. It is a basic task that you will need often when analyzing data.
Sort Multiple Columns In Pandas DataFrame
For this task, you need the method DataFrame.sort_values(), which has this syntax:
DataFrame.sort_values(by, axis, ascending, inplace, kind, na_position, ignore_index, key)
- axis: the axis to sort along. Its default value is 0 (or 'index'), which reorders rows based on the values in the given columns. This is the setting you want when sorting a DataFrame by multiple columns.
- by: a string (or a list of strings) naming the label(s) of the column(s) to sort by.
- ascending: a boolean value (or a list of boolean values) specifying whether each column is sorted in ascending order. The default is True. A list must have one entry for each label in by.
- inplace: this parameter controls whether the method changes the current DataFrame itself or leaves it intact and returns a sorted copy. The default value is False, meaning the original is left untouched.
- kind: the sorting algorithm you want to use. Pandas uses quicksort by default, but you can change it to stable, heapsort, or mergesort. Only stable and mergesort are stable sorting algorithms. When sorting DataFrames, this option only applies when sorting on a single label or column.
- na_position: this parameter determines where to put NaNs: at the beginning when it is set to first, and at the end when it is set to last (the default).
- ignore_index: available since Pandas version 1.0.0, this boolean parameter controls whether the returned DataFrame is labeled from 0 to n-1.
- key: a function applied to the values of the DataFrame before sorting. This key function has to be vectorized, meaning it receives a Series and should return a Series of the same shape. It is applied to each column in by independently.
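To make the key and na_position parameters more concrete, here is a minimal sketch. The DataFrame and its column names (name, score) are invented for illustration and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
people = pd.DataFrame({
    "name": ["bob", "Alice", "carol", "Dave"],
    "score": [3.0, np.nan, 1.0, 2.0],
})

# key receives the sort column as a Series and must return a Series
# of the same shape; here it lowercases names so the sort ignores case.
by_name = people.sort_values("name", key=lambda s: s.str.lower())

# na_position="first" moves missing values to the top instead of the bottom.
nan_first = people.sort_values("score", na_position="first")

print(by_name)
print(nan_first)
```

Without the key function, uppercase names would sort ahead of all lowercase ones, because the comparison is case-sensitive.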
To show you how DataFrame.sort_values() works, we will create a DataFrame in Pandas by importing data from a CSV file. It has information about some home sales, such as their selling price, listing price, number of rooms, and taxes.
import pandas as pd
df = pd.read_csv('homes.csv')
df
We can sort by two columns by passing their labels to sort_values() in a list.
For instance, this statement sorts houses in the DataFrame by the number of rooms first, then by their listing price.
df.sort_values(['Rooms', 'List'])
You can refine the order further with another column, such as the selling price. Rows that tie on both the number of rooms and the listing price may change places.
df.sort_values(['Rooms', 'List', 'Sell'])
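If you do not have homes.csv at hand, the effect of each extra sort column can be reproduced with a small stand-in. This is only a sketch: the column labels match the ones used in this tutorial, but the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for homes.csv; the values are invented.
homes = pd.DataFrame({
    "Sell":  [142, 175, 129, 138, 150],
    "List":  [160, 180, 132, 140, 160],
    "Rooms": [10, 8, 6, 6, 10],
})

# Sort by Rooms first; rows with the same Rooms are ordered by List.
two_keys = homes.sort_values(["Rooms", "List"])

# The third column only matters for rows that tie on both Rooms and List
# (here, the two 10-room houses listed at 160).
three_keys = homes.sort_values(["Rooms", "List", "Sell"])

print(two_keys)
print(three_keys)
```

Each additional label in the list acts as a tie-breaker for the labels before it, so the order of the list matters.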
By default, the method sort_values() keeps the index labels of rows, meaning each label moves together with its row. You can add the option ignore_index = True to discard them. The entries in columns are the same as before, but the index of the result will run from 0 to n-1.
df.sort_values(['Rooms', 'List'], ignore_index = True)
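The difference is easiest to see by comparing the index of both results. The sketch below uses a small invented DataFrame in place of homes.csv.

```python
import pandas as pd

# Hypothetical stand-in for homes.csv; the values are invented.
homes = pd.DataFrame({
    "List":  [160, 180, 132, 140],
    "Rooms": [10, 8, 6, 6],
})

kept = homes.sort_values(["Rooms", "List"])
reset = homes.sort_values(["Rooms", "List"], ignore_index=True)

print(kept.index.tolist())   # original labels, in their new order
print(reset.index.tolist())  # fresh labels running from 0 to n-1
```

The rows come out in the same order in both cases; only the index labels differ.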
The default options of sort_values() sort entries in ascending order. You can change this sorting direction for every column by passing a list of booleans.
This statement, for instance, still displays houses with a lower number of rooms first. However, among houses with the same number of rooms, those with a higher listing price are shown first, and a higher selling price breaks any remaining ties.
df.sort_values(['Rooms', 'List', 'Sell'], ascending = [True, False, False])
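As a quick check of how the mixed directions interact, here is a sketch on a small invented DataFrame that stands in for homes.csv.

```python
import pandas as pd

# Hypothetical stand-in for homes.csv; the values are invented.
homes = pd.DataFrame({
    "Sell":  [142, 175, 129, 138, 150],
    "List":  [160, 180, 132, 140, 160],
    "Rooms": [10, 8, 6, 6, 10],
})

# One boolean per label in the first argument:
# Rooms ascending, then List descending, then Sell descending.
mixed = homes.sort_values(
    ["Rooms", "List", "Sell"],
    ascending=[True, False, False],
)

print(mixed)
```

The boolean list needs one entry for each column label, in the same order as the labels.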
All the examples above return a sorted copy and leave the original DataFrame intact. If you want to sort the DataFrame itself, add the argument inplace = True. The method then returns None:
df.sort_values(['Rooms', 'List'], inplace = True)
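A common slip is to assign the result of an in-place sort back to a variable. The sketch below, again on invented data, shows what the call returns and what happens to the original.

```python
import pandas as pd

# Hypothetical stand-in for homes.csv; the values are invented.
homes = pd.DataFrame({
    "List":  [160, 180, 132, 140],
    "Rooms": [10, 8, 6, 6],
})

# With inplace=True the DataFrame is modified and nothing is returned.
result = homes.sort_values(["Rooms", "List"], inplace=True)

print(result)                # None
print(homes.index.tolist())  # homes itself is now sorted
```

Writing df = df.sort_values(..., inplace = True) would therefore replace the DataFrame with None.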
DataFrame.sort_values() can sort a Pandas DataFrame by multiple columns. You can choose a sorting algorithm, specify a different sorting direction for each column, and even apply a function to the values before sorting.