Work With Duplicate Values

Data Processing for beginners: How to deal with duplicates

Let’s create a dataframe that has multiple columns and rows has repeated values:

import pandas as pd
df = pd.DataFrame({'a': [1,2,2,3,4,4,5],
                   'b': ['a','b','b','v','c','c','c']})
print(df)

   a  b
1  a
2  b
2  b
3  v
4  c
4  c
5  c We can see that both column a and b has rows that have duplicate values. This DF is very small so we can see the values but what if we have millions of rows. It is hard to know if any rows have duplicate values, so what we can do.

We can inspect each column using ‘duplicated’ method of pandas. duplicated method will return a series object where each row is marked with either False or True ( I will explain in a minute what are our options of marking a row duplicate)

Let’s run duplicated on our dataframe’s ‘a’ column with keep assigne to False

df.duplicated('a',keep=False)

  False
   True
   True
  False
   True
   True
  False
dtype: bool

Sometime we need to know how many duplicates we are dealing with.

The easiest way to deal with it to add a ‘is_duplicate’ column to the df and then sum the value over ‘is_duplicated’ column. This works because True = 1 and False = 0, so when we sum, it will print 4

df['is_duplicate'] = df.duplicated('a', keep=False)
print(df)

   a  b  is_duplicate
1  a         False
2  b          True
2  b          True
3  v         False
4  c          True
4  c          True
5  c         False

df.is_duplicate.sum()

Now ‘keep’ can take False, ‘first’ or ‘last’ argument. These argument tells pandas how to mark duplicates when they were encountered.

‘first’ will mark the first encounter of duplicates with False but second onwards true(Mark duplicates as True except for the first occurrence.) . Like this:

df['is_duplicate'] = df.duplicated('a', keep='first')
print(df)

   a  b  is_duplicate
1  a         False
2  b         False
2  b          True
3  v         False
4  c         False
4  c          True
5  c         False

‘last’ will make all duplicates true except the last one (Mark duplicates as True except for the last occurrence.). Like this:

#List all the duplicate items
df[df['is_duplicate']]

	a	b	is_duplicate
2	2	b	True
5	4	c	True

df['is_duplicate'] = df.duplicated('a', keep='last')
print(df)

   a  b  is_duplicate
1  a         False
2  b          True
2  b         False
3  v         False
4  c          True
4  c         False
5  c         False

‘False’ will make all duplicates as true(Mark all duplicates as True.). We used False above so we can know how many elements are duplicates.

Now how to deal with duplicates. Most of the time we just want to drop the duplicate rows to clean the data

This can be done by using drop_duplicates method.

Here you need to be careful about choosing the ‘keep’ option. If you want to keep at least one row of the duplicate item, you should use keep=’first’:

df.drop_duplicates('a', keep='first')

	a	b	is_duplicate
0	1	a	False
1	2	b	True
3	3	v	False
4	4	c	True
6	5	c	False

If you want to drop the duplicates all togheter, than use ‘keep=false’:

df.drop_duplicates('a', keep=False)

	a	b	is_duplicate
0	1	a	False
3	3	v	False
6	5	c	False

df.drop_duplicates('a', keep='last')

	a	b	is_duplicate
0	1	a	False
2	2	b	True
3	3	v	False
5	4	c	True
6	5	c	False

That’s all for dealing with duplicates in pandas.

Data Processing for beginners: How to deal with duplicates

Comments

Related Posts

This is my first post 27 Sep 2018

Second post 26 Sep 2018