DS - Pandas - DataFrame Indexing

This is part of a series of articles dedicated to study Data Science using Python. This article aims to review some indexing DataFrame theory going through an example.

The Data

We have a list of students attending different universities. At the beginning we are indexing all students by the university they enrolled in.

initial data
import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])
people
Overview

That allows us to query the DataFrame easily to answer questions like the names of the students at UAM:

getting UAM students DataFrame
uam_students_df = people.iloc["UAM", ["name"]]
uam_students_df

But what if we’d liked to know which UAM students are from NYC:

UAM students from NYC
uam_students_df = people.loc["UAM"]
uam_older_than_22 = uam_students_df[uam_students_df["city"] == "NYC"]
uam_from_nyc

We got it, but it would be nice to be able to filter by both conditions at once. As far as I’m aware there’re at least two ways to do it: Boolean masking and Multilevel Indexing. Both approaches require to dig a little bit more about DataFrame indexes.

Boolean Masking

To create a boolean mask involving the university and the city, I need to move the index values to a column and then concatenate the boolean mask with the university and the boolean mask regarding the city:

Overview
use boolean mask to query DataFrame
import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])

# creating a new column
people["UNI"] = people.index

# removing current index and leaving the default numeric index
people = people.reset_index(drop=True)

# concatenating two boolean masks
uam_from_nyc = people[(people["UNI"] == "UAM") & (people["city"] == "NYC")]
uam_from_nyc
Overview

Multilevel Indexing

This approach can be seen as the opposite as the boolean masking strategy. Here instead of moving the index to a column, a column will become part of the index so that we can use the loc function to get the information I want.

Overview
use multilevel index to query DataFrame
import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])

# creating a multilevel index
people_by_uni_and_city = people.set_index([people.index, "city"])
people_by_uni_and_city.index.names = ["UNI", "CITY"]
people_by_uni_and_city

# using loc function to get the information just passing index values
uam_from_nyc = people_by_uni_and_city.loc["UAM", "NYC"]
Overview