Finding duplicated data across one or more columns in a database table
A few months ago I posted a little query about [finding duplicate rows in a database table]({{<relref "/blog/2008-10-17-finding-duplicate-rows-in-the-database">}} "Finding duplicate rows in a database table").
I'm revisiting it because last night I helped Doogie out with a similar query, but with some complications.
Let's start with the original simple scenario of checking duplicates in a single column.
Some example data, a Users table:
Id | Email |
---|---|
1 | joe@bloggs.com |
2 | joe@bloggs.com |
3 | joe@bloggs.com |
4 | jane@doe.com |
5 | jane@doe.com |
6 | john@doe.com |
You can see that joe@bloggs.com and jane@doe.com have been duplicated. This could have been prevented by putting a unique index on the Email column.
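If you want to follow along at home, here's a rough sketch of the example table (the column types are my assumptions, not gospel):

```sql
-- Hypothetical setup for the example data; the types are assumptions
CREATE TABLE `Users` (
  `Id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `Email` VARCHAR(255) NOT NULL
);

INSERT INTO `Users` (`Email`) VALUES
  ('joe@bloggs.com'), ('joe@bloggs.com'), ('joe@bloggs.com'),
  ('jane@doe.com'), ('jane@doe.com'), ('john@doe.com');
```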
So to find what emails have duplicates in our table:
```sql
SELECT Email, COUNT(Email) AS Duplicates
FROM `Users`
GROUP BY Email
HAVING ( Duplicates > 1 )
```
Results:
Email | Duplicates |
---|---|
jane@doe.com | 2 |
joe@bloggs.com | 3 |
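(A side note: referring to the Duplicates alias inside HAVING is a MySQL extension; stricter databases will want the aggregate repeated, something like this:)

```sql
-- Portable version: repeat the aggregate instead of the alias in HAVING
SELECT Email, COUNT(Email) AS Duplicates
FROM `Users`
GROUP BY Email
HAVING COUNT(Email) > 1
```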
So, to help us manually correct our data, what are the Ids of the duplicates? In MySQL (4.1+), we can use GROUP_CONCAT (after casting the numerical Id to a character string):
```sql
SELECT Email, COUNT(Email) AS Duplicates,
       GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits
FROM `Users`
GROUP BY Email
HAVING ( Duplicates > 1 )
```
Our results:
Email | Duplicates | Culprits |
---|---|---|
jane@doe.com | 2 | 4,5 |
joe@bloggs.com | 3 | 1,2,3 |
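One caveat worth knowing: GROUP_CONCAT silently truncates its result at the group_concat_max_len setting, which defaults to a measly 1024 bytes, so if a value has heaps of duplicates you may want to raise it for your session first:

```sql
-- GROUP_CONCAT truncates at group_concat_max_len (default 1024 bytes)
SET SESSION group_concat_max_len = 1000000;
```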
That's quite handy, but what about a plain list of the duplicate rows we can work through, instead of these comma-separated Ids? This fugly query will do that for us (I'm sure I could do this a better way, but I'm tired and this works!):
```sql
SELECT Id, Email
FROM `Users`
WHERE Email IN
      (SELECT Email FROM `Users` GROUP BY Email HAVING ( COUNT(Email) > 1 ))
ORDER BY Email
```
Id | Email |
---|---|
4 | jane@doe.com |
5 | jane@doe.com |
1 | joe@bloggs.com |
2 | joe@bloggs.com |
3 | joe@bloggs.com |
Now, if you ran the query in something like phpMyAdmin, you can edit or delete the rows you want to get rid of.
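If you'd rather let MySQL do the dirty work, a multi-table DELETE with a self-join can clear them out in one go. This is just a sketch, and it assumes the row with the lowest Id is the one you want to keep:

```sql
-- Deletes any row that has a same-Email twin with a lower Id,
-- keeping only the lowest Id per email (assumption: that's the keeper)
DELETE u1 FROM `Users` u1
JOIN `Users` u2
  ON u1.Email = u2.Email
 AND u1.Id > u2.Id;
```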
And don't forget: after the clean-up job, add that index to prevent duplicates from re-appearing:
```sql
ALTER TABLE `Users` ADD UNIQUE (`Email`)
```
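(MySQL of this vintage also has an IGNORE modifier for ALTER TABLE that adds the index and silently drops the duplicate rows for you, though you get no say in which rows survive, and later versions (5.7.4+) removed it:)

```sql
-- MySQL-only, removed in 5.7.4+: drops duplicate rows itself while adding the index
ALTER IGNORE TABLE `Users` ADD UNIQUE (`Email`)
```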
Now, the new scenario. What about duplicates across multiple columns? For example, our Locations table:
Id | CountryCode | AreaCode | Prefix |
---|---|---|---|
1 | 64 | 9 | 489 |
2 | 64 | 9 | 489 |
3 | 64 | 9 | 489 |
4 | 64 | 3 | 942 |
5 | 64 | 3 | 942 |
6 | 64 | 9 | 536 |
Here, we want to find rows that share the same values across all three columns. For example, you can see that 64-9-489 appears three times and 64-3-942 twice.
We can do this without much alteration to our original queries:
```sql
SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates
FROM `Locations`
GROUP BY CountryCode, AreaCode, Prefix
HAVING ( Duplicates > 1 )
```
CountryCode | AreaCode | Prefix | Duplicates |
---|---|---|---|
64 | 3 | 942 | 2 |
64 | 9 | 489 | 3 |
Then to get the Ids:
```sql
SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates,
       GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits
FROM `Locations`
GROUP BY CountryCode, AreaCode, Prefix
HAVING ( Duplicates > 1 )
```
CountryCode | AreaCode | Prefix | Duplicates | Culprits |
---|---|---|---|---|
64 | 3 | 942 | 2 | 4,5 |
64 | 9 | 489 | 3 | 1,2,3 |
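(A small aside: GROUP_CONCAT also accepts ORDER BY and SEPARATOR clauses, which make the Culprits list a little nicer to read:)

```sql
-- Sort the Ids and use a friendlier separator inside GROUP_CONCAT
SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates,
       GROUP_CONCAT( CAST(Id AS CHAR) ORDER BY Id SEPARATOR ', ' ) AS Culprits
FROM `Locations`
GROUP BY CountryCode, AreaCode, Prefix
HAVING ( Duplicates > 1 )
```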
I think you're getting the point.
And here's how to get the full rows for the culprits:
```sql
SELECT Id, CountryCode, AreaCode, Prefix
FROM `Locations`
WHERE Id NOT IN
      (SELECT Id FROM `Locations`
       GROUP BY CountryCode, AreaCode, Prefix
       HAVING ( COUNT(CountryCode) = 1 ))
ORDER BY CountryCode, AreaCode, Prefix
```
Id | CountryCode | AreaCode | Prefix |
---|---|---|---|
4 | 64 | 3 | 942 |
5 | 64 | 3 | 942 |
1 | 64 | 9 | 489 |
2 | 64 | 9 | 489 |
3 | 64 | 9 | 489 |
Again, I'm sure there'd be an easier way to do that, but hey, it works, and for a one-off clean-up job that's good enough.
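In fact, here's one tidier equivalent: MySQL (4.1+) supports row constructors in IN subqueries, which lets us mirror the single-column query directly, and it avoids the quirk above of selecting Id while grouping by other columns (something stricter SQL modes reject):

```sql
-- Same result: match on the whole (CountryCode, AreaCode, Prefix) tuple
SELECT Id, CountryCode, AreaCode, Prefix
FROM `Locations`
WHERE (CountryCode, AreaCode, Prefix) IN
      (SELECT CountryCode, AreaCode, Prefix
       FROM `Locations`
       GROUP BY CountryCode, AreaCode, Prefix
       HAVING COUNT(*) > 1)
ORDER BY CountryCode, AreaCode, Prefix
```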
So how do we prevent duplicated data in our second scenario? Add a composite unique key across those columns:
```sql
ALTER TABLE `Locations` ADD UNIQUE (`CountryCode`, `AreaCode`, `Prefix`)
```
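And if that ALTER complains because the duplicates are still sitting there, the keep-the-lowest-Id delete sketch from earlier adapts to the composite case (again assuming the lowest Id is the one you want to keep):

```sql
-- Keeps only the lowest Id for each (CountryCode, AreaCode, Prefix) combination
DELETE l1 FROM `Locations` l1
JOIN `Locations` l2
  ON l1.CountryCode = l2.CountryCode
 AND l1.AreaCode = l2.AreaCode
 AND l1.Prefix = l2.Prefix
 AND l1.Id > l2.Id;
```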