2020-09-03

DS - Pandas - DataFrame Indexing

This is part of a series of articles dedicated to study Data Science using Python. This article aims to review some indexing DataFrame theory going through an example.

The Data

We have a list of students attending different universities. At the beginning we are indexing all students by the university they enrolled in.

initial data

import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])
people

That allows us to query the DataFrame easily to answer questions like the names of the students at UAM:

getting UAM students DataFrame

uam_students_df = people.iloc["UAM", ["name"]]
uam_students_df

But what if we’d liked to know which UAM students are from NYC:

UAM students from NYC

uam_students_df = people.loc["UAM"]
uam_older_than_22 = uam_students_df[uam_students_df["city"] == "NYC"]
uam_from_nyc

We got it, but it would be nice to be able to filter by both conditions at once. As far as I’m aware there’re at least two ways to do it: Boolean masking and Multilevel Indexing. Both approaches require to dig a little bit more about DataFrame indexes.

Boolean Masking

To create a boolean mask involving the university and the city, I need to move the index values to a column and then concatenate the boolean mask with the university and the boolean mask regarding the city:

use boolean mask to query DataFrame

import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])

# creating a new column
people["UNI"] = people.index

# removing current index and leaving the default numeric index
people = people.reset_index(drop=True)

# concatenating two boolean masks
uam_from_nyc = people[(people["UNI"] == "UAM") & (people["city"] == "NYC")]
uam_from_nyc

Multilevel Indexing

This approach can be seen as the opposite as the boolean masking strategy. Here instead of moving the index to a column, a column will become part of the index so that we can use the loc function to get the information I want.

use multilevel index to query DataFrame

import pandas as pd

peterj = pd.Series({"name": "Peter Johnson", "city": "NYC", "age": 22})
marcus = pd.Series({"name": "Marcus Ron", "city": "NYC", "age": 23})
annagu = pd.Series({"name": "Anna Guta", "city": "PAR", "age": 24})
johnro = pd.Series({"name": "John Roberts", "city": "NYC", "age": 23})
valeri = pd.Series({"name": "Valerie Son", "city": "PAR", "age": 22})

people = pd.DataFrame([peterj, marcus, annagu, johnro, valeri], index=["UAM", "UAM", "UCM", "UCM", "UAM"])

# creating a multilevel index
people_by_uni_and_city = people.set_index([people.index, "city"])
people_by_uni_and_city.index.names = ["UNI", "CITY"]
people_by_uni_and_city

# using loc function to get the information just passing index values
uam_from_nyc = people_by_uni_and_city.loc["UAM", "NYC"]

The Data

Boolean Masking

Multilevel Indexing

References