Introduction to Data Cleaning with Pandas

27 Sep 2019

Through this workshop, you will learn how to use Pandas to explore and “wrangle” datasets. Topics will include an introduction to Jupyter Notebooks/Colab, data cleaning with pandas, feature engineering with pandas, basic visualization and more. This workshop will focus on actual coding.

This article provides a summary of the main workshop, which you can watch here. Here is a colab link to run all the code.

import pandas as pd
import numpy as np

%matplotlib inline

Jupyter Tips

Before starting with pandas, let’s look at some useful features Jupyter has that will help us along the way.

Typing a function then pressing tab gives you a list of arguments you can enter. Pressing shift-tab gives you the function signature. Also:

?pd.Series # using one question mark gives you the function/class signature with the description
??pd.Series # two question marks gives you the actual code for that function

Timing your pandas code is a very helpful learning tool, so you can figure out the most efficient way to do things. You can time code as follows:

%timeit [i for i in range(500)] # in line mode

100000 loops, best of 3: 14 µs per loop

%%timeit # time an entire cell
for i in range(10):
    None;

The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 300 ns per loop

Commands prefaced by “%” or “%%” are called magic commands. You can read more about them here.
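
Magic commands only work inside IPython/Jupyter. In a plain Python script, the standard-library timeit module gives similar measurements. The sketch below is one possible way to time the same list comprehension; the run counts are arbitrary choices for illustration.

```python
import timeit

# Time the list comprehension from above: 1000 runs in one measurement.
elapsed = timeit.timeit("[i for i in range(500)]", number=1000)
per_loop_us = elapsed / 1000 * 1e6
print(f"{per_loop_us:.2f} µs per loop")

# repeat() returns one total per round; the minimum is usually the most stable figure.
rounds = timeit.repeat("[i for i in range(500)]", number=1000, repeat=3)
best_us = min(rounds) / 1000 * 1e6
print(f"best of 3: {best_us:.2f} µs per loop")
```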

What is Pandas?

Pandas is a Python library for manipulating data and performing analysis. It has too many features to cover in one introductory workshop, but you will find the documentation complete and clear: https://pandas.pydata.org/pandas-docs/stable/. For many tasks, there is likely a Pandas function to make your life easier, so Google away!

The most basic unit in Pandas is called a Series:

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
s

0    a
1    b
2    c
3    d
4    e
5    f
6    g
dtype: object

A series is simply a 1D numpy array with some more functionality built on top. Above, on the left you see an index and on the right are the actual values. The “dtype” is the datatype, which can be objects (usually strings), integers, floats, categorical variables, datetimes, etc. Vectorized operations on a series are much faster than looping over a built-in Python list because the numpy backend is written in C.

You can index into a series exactly the same as you would a numpy array:

s[1] # returns the 2nd element (0 indexed)

'b'

s[1:3] # returns a series from indices 1 to 3 (exclusive)

1    b
2    c
dtype: object

s[1::2] # returns series from indices 1 to the end, counting by 2s (i.e. 1, 3, 5)

1    b
3    d
5    f
dtype: object

Series also support the same broadcasting that numpy arrays do. For example:

s2 = pd.Series([i for i in range(50)])
s2 = s2/50 + 1
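
To make the broadcasting concrete, here is a small sketch with made-up values. A scalar operation applies to every element, and arithmetic between two series lines values up by index label rather than by position.

```python
import pandas as pd

s = pd.Series([0, 10, 20, 30])

# A scalar is broadcast across every element.
scaled = s / 10 + 1          # values 1.0, 2.0, 3.0, 4.0

# Series-with-series arithmetic aligns on the index, not on position.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])
total = a + b                # x: 1 + 30, y: 2 + 20, z: 3 + 10
```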

You can also sample a random element from a series:

s2.sample()

2    1.04
dtype: float64

Next, let’s import some data and jump into Dataframes. Dataframes are tables of data, where each column has a name and is a series of some type. Each column can have a different type.

df = pd.read_csv('https://raw.githubusercontent.com/n2cholas/pokemon-analysis/master/pokemon-data.csv', delimiter=';')
mdf = pd.read_csv('https://raw.githubusercontent.com/n2cholas/pokemon-analysis/master/move-data.csv', delimiter=';')

print('Number of pokemon: ', len(df))
df.sample()

Number of pokemon:  918

	Name	Types	Abilities	Tier	HP	Attack	Defense	Special Attack	Special Defense	Speed	Next Evolution(s)	Moves
552	Octillery	['Water']	['Moody', 'Sniper', 'Suction Cups']	PU	75	105	75	105	75	45	[]	['Gunk Shot', 'Rock Blast', 'Water Gun', 'Cons...

We can also take samples of different sizes, or look at the top of the dataset, or the bottom:

mdf.head(3)

	Index	Name	Type	Category	Contest	PP	Power	Accuracy	Generation
0	1	Pound	Normal	Physical	Tough	35	40	100	1
1	2	Karate Chop	Fighting	Physical	Tough	25	50	100	1
2	3	Double Slap	Normal	Physical	Cute	10	15	85	1

mdf.sample(2)

	Index	Name	Type	Category	Contest	PP	Power	Accuracy	Generation
551	552	Fiery Dance	Fire	Special	Beautiful	10	80	100	5
84	85	Thunderbolt	Electric	Special	Cool	15	90	100	1

mdf.tail()

	Index	Name	Type	Category	Contest	PP	Power	Accuracy	Generation
723	724	Searing Sunraze Smash	Steel	Special	???	1	200	None	7
724	725	Menacing Moonraze Maelstrom	Ghost	Special	???	1	200	None	7
725	726	Let's Snuggle Forever	Fairy	Physical	???	1	190	None	7
726	727	Splintered Stormshards	Rock	Physical	???	1	190	None	7
727	728	Clangorous Soulblaze	Dragon	Special	???	1	185	None	7

Initial Processing

We don’t need the index column because Pandas gives us a default index, so let’s drop that column.

mdf.drop('Index', inplace=True, axis=1)
# mdf = mdf.drop(columns='Index') # alternative

Many pandas functions return a changed version of the dataframe instead of modifying the dataframe itself. We can use inplace=True to modify the dataframe in place instead (though this is not guaranteed to be more efficient, since many operations still build a copy internally). Sometimes, when using multiple commands consecutively, it’s easier to chain the commands instead of doing it inplace (as you’ll see).
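
As a rough illustration of chaining, the sketch below uses a tiny hand-made table (the first three rows of the moves data, with only a few columns). Each method returns a new dataframe, so the calls can be strung together without touching the original.

```python
import pandas as pd

moves = pd.DataFrame({
    'Index': [1, 2, 3],
    'Name': ['Pound', 'Karate Chop', 'Double Slap'],
    'Power': [40, 50, 15],
})

# drop() returns a new dataframe; `moves` itself is unchanged.
trimmed = moves.drop(columns='Index')

# Because each method returns a dataframe, calls can be chained.
cleaned = (
    moves
    .drop(columns='Index')
    .rename(columns=str.lower)
    .sort_values('power', ascending=False)
    .reset_index(drop=True)
)
```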

mdf.columns = ['name', 'type', 'category', 'contest', 'pp', 'power', 'accuracy', 'generation'] #set column names

mdf.dtypes

name          object
type          object
category      object
contest       object
pp             int64
power         object
accuracy      object
generation     int64
dtype: object

Pandas usually does a good job of detecting the datatypes of various columns. We know that power and accuracy should be numbers, but pandas is making them objects (strings). This usually indicates null values. Let’s check.

mdf['accuracy'].value_counts()

100     320
None    280
90       46
95       29
85       26
75       10
80        7
70        4
55        3
50        3
Name: accuracy, dtype: int64

Just as we suspected, there is the string “None” for non-numeric values. Let’s fix this.

mdf['accuracy'] = mdf['accuracy'].replace('None', 0)
# assigning the result back is more reliable than calling replace(..., inplace=True) on a selected column
mdf['accuracy'] = pd.to_numeric(mdf['accuracy'])
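
An alternative sketch, not what the workshop code does: pd.to_numeric can convert unparseable entries to NaN directly with errors='coerce'. Whether 0 or NaN is the better placeholder depends on the analysis; the sample values below are made up.

```python
import pandas as pd

accuracy = pd.Series(['100', 'None', '90', '85', 'None'])

# errors='coerce' turns anything unparseable (here the string 'None') into NaN.
as_float = pd.to_numeric(accuracy, errors='coerce')

# Fill the missing entries with 0, as in the text, and return to integers.
as_int = as_float.fillna(0).astype(int)
```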

Below, we get a boolean series indicating whether each value in the power column is the string ‘None’ or not. We can use this boolean series to index into the dataframe.

mdf.power == 'None'

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11      True
12     False
13      True
14     False
15     False
16     False
17      True
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27      True
28     False
29     False
       ...  
698    False
699    False
700    False
701     True
702    False
703    False
704    False
705    False
706    False
707    False
708    False
709    False
710    False
711    False
712    False
713    False
714     True
715    False
716     True
717    False
718    False
719    False
720    False
721    False
722    False
723    False
724    False
725    False
726    False
727    False
Name: power, Length: 728, dtype: bool

mdf[mdf.power == 'None'].head()

	name	type	category	contest	pp	power	accuracy	generation
11	Guillotine	Normal	Physical	Cool	5	None	0	1
13	Swords Dance	Normal	Status	Beautiful	20	None	0	1
17	Whirlwind	Normal	Status	Clever	20	None	0	1
27	Sand Attack	Ground	Status	Cute	15	None	100	1
31	Horn Drill	Normal	Physical	Cool	5	None	0	1

mdf.loc[mdf.power == 'None', 'power'].head()

11    None
13    None
17    None
27    None
31    None
Name: power, dtype: object

.loc is a common way to index into a Dataframe. The first argument selects rows: an index label, a list of labels, or a boolean array that acts as a mask. The second argument selects columns. iloc can be used similarly, except it takes integer positions instead of labels (notice that a Dataframe index can be non-numeric).
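
The difference between the two is easiest to see with a non-numeric index. The following is a minimal sketch on a hand-made table; the stat values are only illustrative.

```python
import pandas as pd

stats = pd.DataFrame(
    {'hp': [35, 60, 45], 'atk': [55, 90, 49]},
    index=['Pikachu', 'Raichu', 'Bulbasaur'],
)

by_label = stats.loc['Raichu', 'atk']        # label-based: row 'Raichu', column 'atk'
by_position = stats.iloc[1, 1]               # position-based: second row, second column

# A boolean mask works as the row selector in .loc.
strong = stats.loc[stats['atk'] > 50, 'hp']

# .loc slices include both endpoints; .iloc slices exclude the end, like Python lists.
label_slice = stats.loc['Pikachu':'Raichu']  # 2 rows
pos_slice = stats.iloc[0:1]                  # 1 row
```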

mdf.loc[mdf.power == 'None', 'power'] = 0
mdf['power'] = pd.to_numeric(mdf['power'])

mdf.dtypes

name          object
type          object
category      object
contest       object
pp             int64
power          int64
accuracy       int64
generation     int64
dtype: object

We were able to convert them with no issues. Notice the two ways to access columns: dictionary-style (mdf['power']) and attribute-style (mdf.power). Both return the same column. The difference is that dictionary-style access works for any column name and lets you create new columns, while the .column notation only works for existing columns whose names are valid Python identifiers and do not clash with a Dataframe method or attribute. Assigning to a new name with the .column notation does not create a column. Also, notice you can’t access columns with spaces in their names with the .column notation.

Although the dictionary-style access is more consistent, I like to use the .column access whenever I can because it is faster to type.
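
A small sketch of the practical difference between the two access styles, using a toy dataframe with illustrative values. The attribute name bulk is hypothetical and only there to show what happens when a new name is assigned through attribute access.

```python
import warnings
import pandas as pd

stats = pd.DataFrame({'hp': [35, 60], 'atk': [55, 90]})

# Both styles read the same existing column.
same = stats.hp.equals(stats['hp'])

# Dictionary-style assignment creates a new column.
stats['total'] = stats['hp'] + stats['atk']

# Attribute-style assignment to a new name only sets a Python attribute
# (pandas warns about this), so no column appears.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    stats.bulk = stats['hp'] * 2

column_names = list(stats.columns)
```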

df.columns = ['name', 'types', 'abilities', 'tier', 'hp', 'atk', 'def', 'spa', 'spd', 'spe', 'next_evos','moves']
df.dtypes

name         object
types        object
abilities    object
tier         object
hp            int64
atk           int64
def           int64
spa           int64
spd           int64
spe           int64
next_evos    object
moves        object
dtype: object

We saw above that the next_evos, moves, abilities, and types columns should be lists but are stored as strings, so let’s convert them.

temp_df = df.copy()

%%timeit
for ind, row in temp_df.iterrows():
    df.at[ind, 'next_evos'] = eval(row['next_evos'])

10 loops, best of 3: 108 ms per loop

A few notes. This seems like the most obvious way to achieve what we want: loop through the rows using iterrows, use python’s “eval” to turn a string-list into an actual list, then assign it to the dataframe at that index. Notice that we use “at”, which is the same as “loc” except it can only access one value at a time.

This turns out to be the worst way to do this. In pandas, we can almost always avoid explicitly looping through our data.

%%timeit
df['types'] = temp_df.apply(lambda x: eval(x.types), axis=1)

10 loops, best of 3: 22.4 ms per loop

This is much better. The apply function applies a function you give it to all the rows or columns in the dataframe. The axis argument specifies which: axis=1 passes each row to the function, while axis=0 (the default) passes each column. We can make this a bit cleaner.

%%timeit
df['abilities'] = temp_df.abilities.map(eval)

100 loops, best of 3: 6.12 ms per loop

This is very clean. While apply works on a dataframe, map works on a single series. Also, since the value is always just applied to the one column, we can just pass the eval function instead of using a lambda. Our next improvement won’t be faster, but it’ll be nicer.
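
One caveat worth noting: eval executes whatever expression the string contains, which is risky for scraped data. A safer swap, shown here as a sketch on made-up strings rather than the workshop's own code, is ast.literal_eval from the standard library, which only parses Python literals.

```python
import ast
import pandas as pd

raw = pd.Series(["['Water']", "['Fighting', 'Ice']", "[]"])

# literal_eval parses Python literals (lists, strings, numbers, ...) only,
# so arbitrary expressions in the data raise an error instead of being executed.
parsed = raw.map(ast.literal_eval)
```

It drops into map the same way eval does, so the rest of the code stays unchanged.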

from tqdm import tqdm
tqdm.pandas()

df['moves'] = temp_df.moves.progress_map(eval)

100%|██████████| 918/918 [00:00<00:00, 8454.77it/s]

tqdm is a library that provides progress bars for loops, but it can be easily used with pandas to provide a progress bar for your maps and applies. Very useful for doing complex processing on large datasets.

Next, notice that our dataframe has one row per pokemon. It would be nice to index into it by the pokemon name rather than a number. If we are going to access rows by pokemon name often, this will give us a speed advantage, since the items in the index are supported in the backend by a hashtable.

df.set_index('name', inplace=True)

df.loc['Pikachu']

types                                               [Electric]
abilities                              [Lightning Rod, Static]
tier                                                       NaN
hp                                                          35
atk                                                         55
def                                                         40
spa                                                         50
spd                                                         50
spe                                                         90
next_evos                               [Raichu, Raichu-Alola]
moves        [Tail Whip, Thunder Shock, Growl, Play Nice, T...
Name: Pikachu, dtype: object

We can also reset_index, which can be useful sometimes. Now that we’ve done some processing, we can produce a summary of the numeric columns:

df.describe()

	hp	atk	def	spa	spd	spe
count	918.000000	918.000000	918.000000	918.000000	918.000000	918.000000
mean	69.558824	80.143791	74.535948	73.297386	72.384532	68.544662
std	26.066527	32.697233	31.225467	33.298652	27.889548	29.472307
min	1.000000	5.000000	5.000000	10.000000	20.000000	5.000000
25%	50.000000	55.000000	50.000000	50.000000	50.000000	45.000000
50%	66.500000	75.000000	70.000000	65.000000	70.000000	65.000000
75%	80.000000	100.000000	90.000000	95.000000	90.000000	90.000000
max	255.000000	190.000000	230.000000	194.000000	230.000000	180.000000

Data Correction

Typically, you will find oddities in your data during analysis. Perhaps you visualize a column, and the numbers look off, so you look into the actual data and notice some issues. For the purpose of this workshop, we’ll skip the visualization and just correct the data.

First, some pokemon have moves duplicated. Let’s fix this by making the move-lists into movesets.

df['moves'] = df.moves.progress_map(set)

100%|██████████| 918/918 [00:00<00:00, 68711.23it/s]

Next, I noticed a weird quirk with the strings for the moves. This will cause some trouble if we want to relate the mdf and df tables, so let’s fix it.

moves = {move for move_set in df.moves for move in move_set}

weird_moves = {m for m in moves if "'" in m}
weird_moves

{"Baby'Doll Eyes",
 "Double'Edge",
 "Forest's Curse",
 "Freeze'Dry",
 "King's Shield",
 "Land's Wrath",
 "Lock'On",
 "Mud'Slap",
 "Multi'Attack",
 "Nature's Madness",
 "Power'Up Punch",
 "Self'Destruct",
 "Soft'Boiled",
 "Topsy'Turvy",
 "Trick'or'Treat",
 "U'turn",
 "Wake'Up Slap",
 "Will'O'Wisp",
 "X'Scissor"}

Many of these moves, such as U-turn, should have a dash instead of an apostrophe (according to the moves dataset). Upon closer inspection, it’s clear that the only moves that should have an apostrophe are those whose words end with an apostrophe-s. Let’s make this correction.

weird_moves.remove("King's Shield")
weird_moves.remove("Forest's Curse")
weird_moves.remove("Land's Wrath")
weird_moves.remove("Nature's Madness")

def clean_moves(x):
  return  {move if move not in weird_moves else 
           move.replace("'", "-")
           for move in x}

df['moves'] = df.moves.progress_map(clean_moves)

100%|██████████| 918/918 [00:00<00:00, 43018.50it/s]
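
The same rule (keep the apostrophe only in a possessive 's) can also be written as a regular expression instead of a hand-curated set. This is a heuristic sketch, and fix_apostrophes is a hypothetical helper name; it assumes no hyphenated move has a standalone "s" word right after the apostrophe.

```python
import re

def fix_apostrophes(move):
    # Replace an apostrophe with a hyphen unless it starts a possessive "'s"
    # that ends a word (as in "King's Shield").
    return re.sub(r"'(?!s\b)", "-", move)

examples = ["U'turn", "Trick'or'Treat", "King's Shield", "Nature's Madness"]
fixed = [fix_apostrophes(m) for m in examples]
```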

removal_check = {move for move_set in df.moves 
                      for move in move_set
                      if "'" in move}
removal_check

{"Forest's Curse", "King's Shield", "Land's Wrath",
 "Nature's Madness"}

The moves dataframe contains moves that are unlearnable by pokemon. These include moves like Struggle (which is a move pokemon use when they have no more pp in their normal moveset) and Z-moves (moves that are activated by a Z-crystal). These moves are characterized by having only 1 PP (which denotes the number of times a pokemon can use the move). Let’s remove these.

# drop every 1-PP move and Struggle, matching the description above
mdf = mdf[(mdf.pp != 1) & (mdf.name != 'Struggle')]

Due to the nature of the site we scraped, some pokemon are missing moves :(. Let’s fix part of the problem by adding back some important special moves:

df.loc['Victini', 'moves'].add('V-create')
df.loc['Rayquaza', 'moves'].add('V-create')
df.loc['Celebi', 'moves'].add('Hold Back')

for pok in ['Zygarde', 'Zygarde-10%', 'Zygarde-Complete']:
    df.loc[pok, 'moves'].add('Thousand Arrows')
    df.loc[pok, 'moves'].add('Thousand Waves')
    df.loc[pok, 'moves'].add('Core Enforcer')

Let’s say for our analysis, we only care about certain tiers. Furthermore, we want to consolidate tiers. Let’s do it:

df.loc[df.tier == 'OUBL','tier'] = 'Uber'
df.loc[df.tier == 'UUBL','tier'] = 'OU'
df.loc[df.tier == 'RUBL','tier'] = 'UU'
df.loc[df.tier == 'NUBL','tier'] = 'RU'
df.loc[df.tier == 'PUBL','tier'] = 'NU'
df = df[df['tier'].isin(['Uber', 'OU', 'UU', 'NU', 'RU', 'PU'])].copy()  # copy so later column assignments act on an independent dataframe

The last line eliminates all pokemon that do not belong to one of those tiers (e.g. LC).
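
The five assignments above can also be expressed as a single lookup table. The sketch below runs on a small made-up series; merge_up and keep are names chosen for this example.

```python
import pandas as pd

tiers = pd.Series(['OU', 'UUBL', 'LC', 'PUBL', 'Uber', None])

# Same consolidation as above, written as a mapping from old tier to new tier.
merge_up = {'OUBL': 'Uber', 'UUBL': 'OU', 'RUBL': 'UU', 'NUBL': 'RU', 'PUBL': 'NU'}
consolidated = tiers.map(lambda t: merge_up.get(t, t))

# Keep only the tiers of interest; missing values and 'LC' are dropped.
keep = ['Uber', 'OU', 'UU', 'RU', 'NU', 'PU']
filtered = consolidated[consolidated.isin(keep)]
```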

Since the tiers are a categorical variable, let’s convert it to the categorical dtype in pandas. This will come in handy if we decide to use this dataset in a machine learning model, as categorical variables will have a string label but have a corresponding integer code.

df['tier'] = df['tier'].astype('category')
df['tier'].dtype

But wait, our tiers do have an order! Let’s actually turn them into an ordered categorical variable. This will ensure the codes are in order.

from pandas.api.types import CategoricalDtype

order = ['Uber', 'OU', 'UU', 'RU', 'NU', 'PU']  # strongest to weakest; RU sits above NU
df['tier'] = df['tier'].astype(CategoricalDtype(categories=order, 
                                                ordered=True))
df['tier'].dtype

CategoricalDtype(categories=['Uber', 'OU', 'UU', 'RU', 'NU', 'PU'], ordered=True)

We can take a look at the actual codes for the categories:

df['tier'].cat.codes.head(10)

name
Abomasnow          5
Abomasnow-Mega     3
Absol              5
Absol-Mega         2
Accelgor           4
Aegislash          0
Aegislash-Blade    0
Aerodactyl         3
Aerodactyl-Mega    2
Aggron             5
dtype: int8

list(zip(df['tier'].head(10), df['tier'].cat.codes.head(10)))

[('PU', 5),
 ('RU', 3),
 ('PU', 5),
 ('UU', 2),
 ('NU', 4),
 ('Uber', 0),
 ('Uber', 0),
 ('RU', 3),
 ('UU', 2),
 ('PU', 5)]
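
Beyond the integer codes, an ordered categorical also supports comparisons and sorting in the declared order. Here is a minimal sketch with a hypothetical size column instead of the tiers.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['large', 'small', 'medium', 'small']).astype(size_type)

codes = sizes.cat.codes.tolist()          # position of each value in `categories`
at_least_medium = sizes >= 'medium'       # comparisons respect the declared order
ordered_values = sizes.sort_values().tolist()
largest = sizes.max()
```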

(very light) Feature Engineering

Let’s make a feature counting the number of moves a pokemon can learn.

df['num_moves'] = df.moves.map(len)

The base stat total is a common metric players use to assess a Pokemon’s overall strength, so let’s create a column for this.

df['bst'] = (df['hp'] + df['atk'] + df['def'] + df['spa'] + df['spd']
             + df['spe'])
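
The same total can be computed without spelling out each addition by summing across a list of columns. The sketch below uses a two-row toy table with illustrative stat values; stat_cols is a name chosen for this example.

```python
import pandas as pd

stats = pd.DataFrame(
    {'hp': [35, 60], 'atk': [55, 90], 'def': [40, 55],
     'spa': [50, 90], 'spd': [50, 80], 'spe': [90, 110]},
    index=['Pikachu', 'Raichu'],
)

stat_cols = ['hp', 'atk', 'def', 'spa', 'spd', 'spe']
# axis=1 sums across the columns, giving one total per row.
stats['bst'] = stats[stat_cols].sum(axis=1)
```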

Anomaly Analysis

This workshop is about data cleaning, but a useful way to look for data issues, gain ideas for feature engineering, and understand your data is to look at anomalies. Plus, we can look at some new pandas techniques.

Let’s look at information about the BST by tier:

bstdf = df[['tier', 'bst']].groupby('tier').agg(['mean', 'std'])
bstdf

	bst
	mean	std
tier
NU	495.132353	36.655681
OU	565.896104	68.916155
PU	464.184685	59.964976
RU	524.486111	48.101124
UU	538.181818	50.624685
Uber	657.042553	67.435946

First, we get a dataframe containing each pokemon’s tier and base stat total. We want the mean and standard deviation of the BSTs by tier, so we group by the tier (pandas can also group by multiple columns). Then, we apply the aggregate functions mean and std, which are calculated within each tier.

You’ll notice that we now have a multiindex for the columns. We will not cover this in this workshop, so we will just simplify the multiindex.

bstdf.columns = ['bst_mean', 'bst_std']
bstdf

	bst_mean	bst_std
tier
NU	495.132353	36.655681
OU	565.896104	68.916155
PU	464.184685	59.964976
RU	524.486111	48.101124
UU	538.181818	50.624685
Uber	657.042553	67.435946
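
Another way to end up with flat column names is named aggregation, where each output column is declared as keyword=(column, function). This is a sketch on a four-row toy table with made-up values, not the workshop's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    'tier': ['OU', 'OU', 'PU', 'PU'],
    'bst': [500, 600, 400, 500],
})

# Named aggregation: keyword = (column, function) yields flat column names directly.
summary = toy.groupby('tier').agg(
    bst_mean=('bst', 'mean'),
    bst_std=('bst', 'std'),
)
```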

The main ways to join tables in pandas are join and merge. Join is typically used to join on an index. For example, if you had two tables with the pokemon name as the index, you can do df1.join(df2), and this will horizontally concatenate the tables based on index.

I will show you how to use merge, which is the most general and easiest to understand joining method (though not always the fastest).

df2 = df.reset_index().merge(bstdf, left_on='tier', right_on='tier', 
                             how='left')
# equivalent to bstdf.merge(df, ..., how='right')
df2.sample()

	name	types	abilities	tier	hp	atk	def	spa	spd	spe	next_evos	moves	num_moves	bst	bst_mean	bst_std
91	Crabominable	[Fighting, Ice]	[Anger Point, Hyper Cutter, Iron Fist]	PU	97	132	77	62	67	43	[]	{Fling, Bubble Beam, Iron Defense, Hidden Powe...	54	478	464.184685	59.964976

Basically, pandas looked for where the tier in df equaled tier in bstdf and concatenated those rows. left_on is the column for df, right_on is the column for bstdf (in this case they’re the same). You can learn more about how joins work in this article: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/. The concepts carry over to pandas.
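
To see what the how argument changes, here is a minimal sketch with two toy tables; the names and numbers are made up for illustration.

```python
import pandas as pd

pokemon = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'tier': ['OU', 'PU', 'LC'],
})
tier_stats = pd.DataFrame({
    'tier': ['OU', 'PU'],
    'bst_mean': [550.0, 450.0],
})

# how='left' keeps every row of `pokemon`; tiers missing from `tier_stats` get NaN.
left = pokemon.merge(tier_stats, on='tier', how='left')

# how='inner' keeps only rows whose tier appears in both tables.
inner = pokemon.merge(tier_stats, on='tier', how='inner')
```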

We want to look at anomalous pokemon whose stats seem too low for their tiers. Let’s accomplish this:

under = df2[(df2['bst'] < df2['bst_mean'] - 2*df2['bst_std']) 
            & (df2['tier'] != 'PU')]
under

	name	types	abilities	tier	hp	atk	def	spa	spd	spe	next_evos	moves	num_moves	bst	bst_mean	bst_std
5	Aegislash	[Steel, Ghost]	[Stance Change]	Uber	60	50	150	50	150	60	[]	{Hidden Power, Iron Defense, Hyper Beam, Pursu...	55	520	657.042553	67.435946
6	Aegislash-Blade	[Steel, Ghost]	[Stance Change]	Uber	60	150	50	150	50	60	[]	{Hidden Power, Iron Defense, Hyper Beam, Pursu...	55	520	657.042553	67.435946
34	Azumarill	[Water, Fairy]	[Huge Power, Sap Sipper, Thick Fat]	OU	100	50	80	60	80	50	[]	{Muddy Water, Swagger, Water Pulse, Ice Beam, ...	96	420	565.896104	68.916155
115	Diggersby	[Normal, Ground]	[Cheek Pouch, Huge Power, Pickup]	OU	85	56	77	50	77	78	[]	{Rollout, Sandstorm, Fling, Earthquake, Hidden...	81	423	565.896104	68.916155
267	Linoone	[Normal]	[Gluttony, Pickup, Quick Feet]	RU	78	70	61	50	61	100	[]	{Thunder Wave, Super Fang, Swagger, Water Puls...	89	420	524.486111	48.101124
298	Marowak-Alola	[Fire, Ghost]	[Cursed Body, Lightning Rod, Rock Head]	UU	60	80	110	50	80	45	[]	{Tail Whip, Sandstorm, Fling, Hidden Power, Hy...	74	425	538.181818	50.624685
303	Medicham	[Fighting, Psychic]	[Pure Power, Telepathy]	NU	60	60	75	60	75	80	[]	{Rock Slide, Swagger, Meditate, Confusion, Gra...	96	410	495.132353	36.655681
521	Vivillon	[Bug, Flying]	[Compound Eyes, Friend Guard, Shield Dust]	NU	80	52	50	90	50	89	[]	{Hidden Power, Iron Defense, Hyper Beam, Rest,...	59	411	495.132353	36.655681
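
A similar filter can be written without a merge by using groupby with transform, which returns per-group statistics aligned with the original rows. The sketch below uses a toy table with made-up values, and a looser threshold than the two standard deviations used above so that the tiny example flags something.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'tier': ['OU', 'OU', 'OU', 'PU', 'PU', 'PU'],
    'bst': [500, 600, 550, 400, 500, 450],
})

# transform returns a result aligned with the original rows, so no merge is needed.
grouped = toy.groupby('tier')['bst']
toy['z'] = (toy['bst'] - grouped.transform('mean')) / grouped.transform('std')

# Flag rows well below their tier mean (threshold of 0.5 std chosen for the toy data).
low = toy[toy['z'] < -0.5]
```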

Misc.

Pandas also has built-in plotting methods, which are built on top of matplotlib. For example:

df.bst.hist()

[Figure: histogram of the bst column]

df.plot.scatter('bst', 'atk')

[Figure: scatter plot with bst on the x-axis and atk on the y-axis]

Finally, we can “pivot” tables as you would in Excel. This provides a summary of the data.

df['type_1'] = df['types'].map(lambda x: x[0])

pd.pivot_table(df, index='tier', columns='type_1', values='bst', 
               aggfunc='mean')

type_1	Bug	Dark	Dragon	Electric	Fairy	Fighting	Fire	Flying	Ghost	Grass	Ground	Ice	Normal	Poison	Psychic	Rock	Steel	Water
tier
NU	476.500000	494.000000	487.500000	460.500000	473.500000	469.625000	534.400000	479.000000	483.750000	506.250000	486.250	525.000000	495.400000	457.0	520.000000	519.50	520.000000	520.750000
OU	567.500000	520.000000	644.444444	562.142857	483.000000	524.250000	607.600000	518.333333	476.000000	542.166667	519.000	505.000000	497.000000	495.0	598.250000	700.00	550.000000	576.428571
PU	426.521739	448.300000	NaN	473.800000	392.666667	461.333333	485.454545	447.090909	479.400000	478.476190	457.875	511.727273	457.342857	472.6	465.266667	494.00	380.000000	459.000000
RU	490.166667	510.000000	536.500000	543.750000	516.000000	527.000000	573.333333	495.000000	518.333333	546.500000	480.000	552.500000	523.571429	487.0	545.428571	505.75	546.000000	538.500000
UU	485.800000	531.714286	598.000000	540.000000	525.000000	531.500000	517.000000	536.250000	500.000000	586.000000	512.250	NaN	559.500000	507.5	547.500000	585.00	543.333333	544.166667
Uber	585.000000	640.000000	686.800000	NaN	680.000000	612.500000	613.333333	626.666667	600.000000	NaN	720.000	NaN	655.000000	540.0	682.153846	NaN	580.000000	720.000000
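
One way to read a pivot table is as a groupby on two keys followed by moving one key into the columns. The sketch below shows both forms on a toy table with made-up values; combinations with no rows come out as NaN, as in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'tier': ['OU', 'OU', 'PU', 'PU'],
    'type_1': ['Fire', 'Water', 'Fire', 'Fire'],
    'bst': [600, 500, 400, 500],
})

pivoted = pd.pivot_table(toy, index='tier', columns='type_1',
                         values='bst', aggfunc='mean')

# The same table built step by step: group on both keys, then move
# the second key from the row index into the columns.
unstacked = toy.groupby(['tier', 'type_1'])['bst'].mean().unstack()
```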

Conclusion

Through this workshop, we’ve seen an overview of pandas and how it can be useful for data preprocessing. Next, we can use these skills to analyze and model our data using random forests in scikit-learn.

Nicholas Vadivelu
