
How do I drop duplicate rows in Pandas?

One of the most common tasks in data cleaning is deciding how to deal with duplicate rows in a data frame. If an entire row is an exact duplicate, the solution is simple: we can drop the duplicate row from further analysis. Sometimes you have to make a decision when only part of a row is duplicated.

In this article we will see how we can use Pandas to drop duplicate rows in Python.

Let’s load Pandas first.
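A one-line import, using the conventional pd alias:

import pandas as pd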

Let’s use the Gapminder dataset. It has 1704 rows and 6 columns.
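As a minimal sketch of this step (the original data source is not specified here, so the file name gapminder.csv is an assumption), we can read the data with read_csv and check its shape:

# Load the Gapminder data; "gapminder.csv" is an assumed local copy
gapminder = pd.read_csv("gapminder.csv")
gapminder.shape

(1704, 6)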

How do I drop rows that are complete duplicates?

This Gapminder dataset is well curated, so it has no row that is completely duplicated. To illustrate how to drop rows that are complete duplicates, let’s concatenate the Gapminder data frame with itself. After concatenating, each row is duplicated exactly twice.

We can combine two data frames with the Pandas concat function. Here we specify axis=0 so that concat stacks the two data frames row-wise.

gapminder_duplicated = pd.concat([gapminder, gapminder], axis=0)
gapminder_duplicated.shape

(3408, 6)

We see that our new data frame with duplicated rows has twice as many rows as the original Gapminder data frame. Essentially, every row of the original data frame is now duplicated.

Pandas’ drop_duplicates() function is used to remove duplicate rows. By default, drop_duplicates() removes only rows that are complete duplicates, i.e. rows where every column value is identical to that of another row.

gapminder_duplicated.drop_duplicates()

We can verify that the duplicate rows were dropped by checking the shape of the resulting data frame.

gapminder_duplicated.drop_duplicates().shape

(1704, 6)

How do I drop duplicate rows based only on selected columns?

By default, the drop_duplicates function uses all columns to determine whether a row is a duplicate or not. Often, however, you need to drop rows based on duplicate values in one or more specific columns. Pandas’ drop_duplicates function has an argument to specify which columns should be used to identify duplicates.

For example, to drop duplicate rows based on the continent column, we can use the subset argument and give it the name of the column we want to use to identify duplicates.

Let’s drop duplicate rows from the original Gapminder data frame, using the subset argument with the continent column.

gapminder.drop_duplicates(subset="continent")

We expect only one row per continent value; by default, drop_duplicates() keeps the first row with each continent value and drops all the other rows as duplicates.

We can easily see that in the results. Note that all the country values start with the letter “A”.

        country  year         pop continent  lifeExp     gdpPercap
0   Afghanistan  1952   8425333.0      Asia   28.801    779.445314
12      Albania  1952   1282697.0    Europe   55.230   1601.056136
24      Algeria  1952   9279525.0    Africa   43.077   2449.008185
48    Argentina  1952  17876956.0  Americas   62.485   5911.315053
60    Australia  1952   8691212.0   Oceania   69.120  10039.595640

We can also keep the last occurrence of each column value instead, with the keep="last" argument.

gapminder.drop_duplicates(subset="continent", keep="last")

Here again we see one row for each continent, but this time we kept the last occurrence and dropped all the others.

             country  year         pop continent  lifeExp     gdpPercap
1103     New Zealand  2007   4115771.0   Oceania   80.204  25185.009110
1607  United Kingdom  2007  60776238.0    Europe   79.425  33203.261280
1643       Venezuela  2007  26084662.0  Americas   73.747  11415.805690
1679     Yemen, Rep.  2007  22211743.0      Asia   62.698   2280.769906
1703        Zimbabwe  2007  12311143.0    Africa   43.487    469.709298

Note that now all the country values start with letters from the end of the alphabet.

We can also give the subset argument the names of more than one column. In that case, drop_duplicates treats a row as a duplicate only if the values of all the specified columns are identical.

For example, to drop rows with the same continent and year values, we can pass the subset argument a list of the two column names.

gapminder.drop_duplicates(subset=["continent", "year"])

Here we dropped the rows with the same continent and year values.

       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

How do I find duplicate values in one or more columns in Pandas?

Another common task in data cleaning is finding out whether a certain column value is a duplicate or not. In this case, the goal is not to drop duplicate rows, but to find which rows have duplicate values for a particular column in the data frame.

Pandas has another useful function, duplicated, that tells us whether values are duplicated or not. We can apply the duplicated function to an Index, a Series, or a DataFrame.

For example, to find out whether the values in the continent column are duplicated or not, we can do the following.

gapminder.continent.duplicated()

This returns a boolean Series.

0       False
1        True
2        True
3        True
4        True
        ...
1699     True
1700     True
1701     True
1702     True
1703     True
Name: continent, Length: 1704, dtype: bool

We can also apply the duplicated function directly to the data frame and, as before, use the subset argument to specify the columns we want to check for duplicates. For example, to find which rows have the same continent and year values, we can do the following.

gapminder.duplicated(subset=["continent", "year"])

0       False
1       False
2       False
3       False
4       False
        ...
1699     True
1700     True
1701     True
1702     True
1703     True
Length: 1704, dtype: bool
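Because duplicated() returns a boolean Series, it combines naturally with other Pandas operations. As a small illustrative sketch (not part of the walkthrough above), we could count the duplicate rows or pull them out for inspection:

# Count rows that duplicate an earlier (continent, year) combination
gapminder.duplicated(subset=["continent", "year"]).sum()

# Use the boolean Series as a mask to look at only the duplicated rows
gapminder[gapminder.duplicated(subset=["continent", "year"])]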

 

 
