Dealing with Messy Dates in Python
Data Transformation and Analysis

Dealing with Messy Dates in Python

Overview

Dates can be one of the more frustrating data types to deal with when you have a large dataset. There are just so many ways to write the same date, depending on where you are and what different people prefer. Consider the ways you might write July 1, 2020:

  • 7/1/2020
  • 07/01/2020
  • 2020/07/01
  • 2020-07-01
  • Jul-1-2020
  • and the list could keep going!

In order to align your date data so that each one can be aggregated, summarized, and used in calculations, you need it to be in the same format. When you have dates in string format (like the examples above), it can be difficult to, for example, programmatically add a certain number of days to each one.

Python solves this issue with the datetime module. Pandas uses datetime to parse strings of dates and convert them to special datetime objects, which breaks units of time down into smaller units. For instance, our July 1, 2020 date would look like this in datetime: datetime.datetime(2020, 7, 1). This would allow us to easily manipulate the date.

Prerequisites

This guide assumes that you have already:

Getting Started

For the purposes of this walkthrough, I’m going to use a small CSV file called data_cleaning_registrations.csv, which can be downloaded here.

As always, the first thing we want to do is create a Python text file, save it into a folder, and if we’re working with a CSV or Excel file, save that in the same folder. In our script, let’s import Pandas and alias it:

import pandas as pd

Next, we’ll import the CSV file:

data = pd.read_csv('data_cleaning_registrations.csv')

Let’s take a look a look at the summary of the data:

import pandas as pd

data = pd.read_csv('data_cleaning_registrations.csv')

print(data.info())

This returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   first name                  17 non-null     object
 1   last name                   17 non-null     object
 2   age                         17 non-null     int64 
 3   date of registration        17 non-null     object
 4   number of family members    17 non-null     int64 
 5   family has disabled member  17 non-null     object
dtypes: int64(2), object(4)

Converting the Date Column to DateTime

Hm, the date of registration was interpreted as an object (or a string of text). That isn’t quite right. Fortunately, Pandas helps us specify that this is actually date data and convert it to datetime with the aptly-named to_datetime() method. Let’s add the following line, then walkthrough what it does:

data['date of registration'] = pd.to_datetime(data['date of registration'])
  1. data['date of registration']: In Python, you specify a subset of data with brackets. This applies to data types outside of the scope of what we’re covering here, but as a simple example, if you had a list of integers like my_list = [1, 2, 3], calling my_list[1] returns the second item in the list, 2 in this case (remember, Python starts at zero, so the second element is actually 1). In the case of this CSV, when we put the name of the column inside quotes like this, we’re telling Python that we want that particular column. Since we’re starting the line like this, we’re actually manipulating the existing column and saving over it. We’ll see what that means after running this.
  2. pd.: We’re referencing Pandas here using the alias we established at the top of the script when we imported it.
  3. .to_datetime(): This is a method included in the DataFrame class. We’re letting Pandas do the messy work of looping over each item in this column to perform a function that looks at the string and converts it to a datetime object.
  4. data['date of registration']: Isn’t this the same thing that is already written to the left of the equals sign? Yup! We’re essentially saying convert this column to datetime, and save it in its place.

Now let’s add a print statement to the bottom of the script to check that we successfully changed the data type:

import pandas as pd

data = pd.read_csv('data_cleaning_registrations.csv')

data['date of registration'] = pd.to_datetime(data['date of registration'])

print(data.info())

And see what that does:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   first name                  17 non-null     object        
 1   last name                   17 non-null     object        
 2   age                         17 non-null     int64         
 3   date of registration        17 non-null     datetime64[ns]
 4   number of family members    17 non-null     int64         
 5   family has disabled member  17 non-null     object        
dtypes: datetime64[ns](1), int64(2), object(3)

Success! We’ve now changed the data type to something a bit more useful. Let’s see what we can do with this new format.

Checking for Out-of-Range Dates

If we have a series of dates, sometimes there are records that fall outside of a given range. For instance, volunteers may have registered people after a deadline, making those potential beneficiaries ineligible for our distribution. Typos might also mean we have dates set in the future, which for the date of registration would not be possible. Let’s check on this type of error—dates incorrectly set in the future—and walk through how to fix it.

First, let’s take a look at what the max date is by printing out the max value from the date column:

import pandas as pd
import datetime as dt

data = pd.read_csv('data_cleaning_registrations.csv')

data['date of registration'] = pd.to_datetime(data['date of registration']).dt.date

print(data['date of registration'].max())

today = dt.date.today()

data.loc[data['date of registration'] > today, 'date of registration'] = today

print(data['date of registration'].max())

Simple Arithmetic on Dates

Let’s say that the relief lead wants to do a follow up with folks that had been registered 90 days after they were registered. When the dates were in string format, we couldn’t do that, but now we can.

In the previous step, we overwrote one column with a new one, but now we want to actually add a column that we’ll call ‘registration follow up date’. Remember how you specify a subset of a dataset with brackets and the column name? Well if you declare a new variable that has a column name that doesn’t exist, Python will interpret that as you wanting to create a new one. (This works with other data types too, like dictionaries, which we’ll cover in another article).

Below what we’ve already written, let’s add a line:

data['90 day follow up'] = data['date of registration'] + pd.DateOffset(days=90)
  1. data['90 day follow up']: This is declaring a new column name where we’ll store the data.
  2. data['date of registration']: References the column whose data type we converted in the previous step, which is now datetime formatted.
  3. + pd.DateOffset(days=90): This is a new method for us! The fact that it has pd prepended means that we’re accessing a Pandas method called DateOffset(). As with all things related to libraries like Pandas, it’s always a good idea to check the documentation available to help us understand what methods we can access, and what arguments those methods accept. Let’s take a look at the documentation for this method here. The parameters section helps us understand what we can pass it as keyword arguments (named parameters that we pass the method). In our case, we want to add 90 days so we specify days=90. If we wanted to add three years, we would instead use years=3.

Let’s run this again with a print statement to see if this new column was added:

import pandas as pd

data = pd.read_csv('data_cleaning_registrations.csv')

data['date of registration'] = pd.to_datetime(data['date of registration'])

data['90 day follow up'] = data['date of registration'] + pd.DateOffset(days=90)

print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   first name                  17 non-null     object        
 1   last name                   17 non-null     object        
 2   age                         17 non-null     int64         
 3   date of registration        17 non-null     datetime64[ns]
 4   number of family members    17 non-null     int64         
 5   family has disabled member  17 non-null     object        
 6   90 day follow up            17 non-null     datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(3)

We see the new column, and the data type was correctly listed as datetime! Let’s also just spot check the data there to make sure it did what we wanted by swapping out print(data.info()) with print(data[['date of registration', '90 day follow up']])

import pandas as pd

data = pd.read_csv('data_cleaning_registrations.csv')

data['date of registration'] = pd.to_datetime(data['date of registration'])

data['90 day follow up'] = data['date of registration'] + pd.DateOffset(days=90)

print(data[['date of registration', '90 day follow up']])
   date of registration 90 day follow up
0            2022-01-04       2022-04-04
1            2022-01-05       2022-04-05
2            2022-01-04       2022-04-04
3            2022-01-05       2022-04-05
4            2022-01-06       2022-04-06
5            2022-01-02       2022-04-02
6            2022-01-06       2022-04-06
7            2022-01-13       2022-04-13
8            2022-01-11       2022-04-11
9            2022-01-14       2022-04-14
10           2022-01-04       2022-04-04
11           2022-01-04       2022-04-04
12           2022-01-04       2022-04-04
13           2022-01-05       2022-04-05
14           2022-01-13       2022-04-13
15           2022-01-06       2022-04-06
16           2022-01-02       2022-04-02

Looks good! If you’re interested, you can pass a list of multiple column names with double brackets as I’ve done here. The outer brackets are interpreted as specifying a subset of the data, and the inner brackets are interpreted as a comma separated list, which in this case, is just two elements.

Leave a Reply

Your email address will not be published. Required fields are marked *