
Swoop de Dupe

May 29, 2019 / in Elein’s PlanetPostgreSQL / by Elein

The problem: duplicate rows

Once upon a time, there was a database person who knew about referential integrity and its importance when creating database tables. The interesting part is that referential integrity, including primary keys, helps keep to the standard of not having duplicate rows.

However, they just wanted a private little table on their private little database to hold their contacts. No primary keys were needed because none of the names were going to be the same. The only thing needed was to avoid recreation of the table. No formatting was needed with the phone or email data. It was just a simple lookup table.

create table if not exists contacts (
    fname text,
    lname text,
    phone numeric(11),
    email text
);

In the initialization script, the \copy command was added in order to copy the tsv file with the contacts into the new table. It failed a number of times due to typos, bad filenames, wrong version, new keyboard, drinking while typing, being interrupted, bad data format, the dog ate my homework, issues with copy vs. \copy…  And while wrangling these errors, some of the copies failed, and some succeeded. There are many ways to get unwanted duplicate rows in your table.
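For reference, the load step in that script looked roughly like the sketch below; the file name and explicit column list are stand-ins, since the original script isn't shown.

-- psql: \copy runs client-side, so the file only needs to be readable by the psql user.
-- 'contacts.tsv' is a hypothetical name for the tab-separated contacts file.
\copy contacts (fname, lname, phone, email) from 'contacts.tsv' with (format text)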


Knowing they had 26 friends, one for each line in the file, they checked out the data.

=# select count(*) from contacts;
count
-------
104
(1 row)

OMG. What a mess. The easy solution to this problem is to drop the table and successfully copy in the data once and only once. But sometimes duplicates happen inadvertently in live and important tables where drop/recreate/reload is not an option. The root cause is not creating a proper primary key. (See the latest rant on primary keys and serials in your favorite mailing list.)
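When drop-and-reload is an option, a sketch of that route looks like the following, assuming (fname, lname) really is unique among the 26 friends and reusing the hypothetical file name from above. With the primary key in place, a repeated \copy of the same file simply errors out and loads nothing, rather than silently doubling the rows.

drop table if exists contacts;
create table contacts (
    fname text,
    lname text,
    phone numeric(11),
    email text,
    primary key (fname, lname)  -- rejects any later load of the same names
);
\copy contacts (fname, lname, phone, email) from 'contacts.tsv' with (format text)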

The Solution

Let us see how to de-dupe this little table, so you can see how to apply the same approach to more crucial tables.

  1. Add a unique identifying column to your table. Make it a SERIAL PRIMARY KEY.
  2. Update the identifying column to its default. This will add a unique value in the column for each record.
alter table contacts add column uid serial primary key;
update contacts set uid = default;
select uid, fname, lname, phone, email from contacts limit 5;

 uid | fname  | lname    |    phone    |      email
-----+--------+----------+-------------+-------------------
  79 | abba   | anderson | 15105551234 | [email protected]
  80 | bobby  | bronson  | 15105551234 | [email protected]
  81 | carla  | corbani  | 15105551234 | [email protected]
  82 | daryll | dunkirk  | 15105551234 | [email protected]
  83 | emilia | every    | 15105551234 | [email protected]
(5 rows)

  3. Here are some things to notice.

In this little query, the data in one set is distinct by fname, lname. We can partition by those columns to find duplicates. If no small set of columns is distinct within each duplicate set, we could partition by the entire row instead.

The row_number() function returns the number of the current row within its partition, counting from 1. It requires an over (partition by col[, col]) clause.

  4. Number the sets of duplicates. Using the row_number() and over (partition by) construct, we can identify every row uniquely and number the rows within each set of duplicates.
select uid, fname, lname, 
       row_number() over 
       (partition by fname, lname order by fname, lname) 
from contacts limit 9;
 uid | fname | lname    | row_number
-----+-------+----------+------------
 105 | abba  | anderson |          1
  79 | abba  | anderson |          2
 131 | abba  | anderson |          3
  80 | bobby | bronson  |          1
 132 | bobby | bronson  |          2
 106 | bobby | bronson  |          3
 133 | carla | corbani  |          1
  81 | carla | corbani  |          2
 107 | carla | corbani  |          3
(9 rows)

  5. Delete duplicate rows
     With the information in the above query, we can delete the rows by uid where the row number within a set is greater than one. This leaves only row number 1 for each set.
delete from contacts
where uid in (select uid
              from (select uid, fname, lname,
                           row_number() over
                           (partition by fname, lname order by fname, lname) as rnum
                    from contacts) t
              where t.rnum > 1);
  6. Check your data… just in case
=# select count(*) from contacts;
 count
 -------
 26
 (1 row)
  7. Clean up

Always clean up any extra columns, tables, temporary functions, etc., when completing your tasks. Here, cleanup only requires dropping the newly added column.

alter table contacts drop column uid;

Script template: de Dupe

This is a template of the entire script.  Just substitute the bracketed items with your table and column names.

alter table <table> add column uid serial primary key;
update <table> set uid = default;

delete from <table>
where uid in (select uid
              from (select uid, <cola>, <colb>,
                           row_number() over
                           (partition by <cola>, <colb>
                            order by <cola>, <colb>) as rnum
                    from <table>) t
              where t.rnum > 1);

alter table <table> drop column uid;

Share and enjoy good referential integrity.

[email protected] — GeneralBits

Tags: data deduplication, GeneralBits, PostgreSQL
11 replies
  1. Patrick TJ McPhee says:
     May 29, 2019 at 11:40 pm

    An alternative to adding a serial key column just for deleting rows is to use ctid. You have to be certain that nobody will update the table while you’re operating, but apart from that, you can just replace “uid” with “ctid” throughout your delete statement and get rid of the duplicates with no other changes required.
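    A minimal sketch of that substitution, applied to the article's delete (assuming nothing else writes to the table while it runs):

    delete from contacts
    where ctid in (select ctid
                   from (select ctid,
                                row_number() over
                                (partition by fname, lname order by fname, lname) as rnum
                         from contacts) t
                   where t.rnum > 1);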

    • Elein says:
      May 30, 2019 at 3:42 am

      Generally, I try to stay away from system columns so that the problem/solution can work on different major versions and so that I have complete control over the data I am using. No worries about vacuum full or data shuffling. Don’t forget that we also added set numbering in addition to the uid. Thanks for your feedback. ~e

  2. Steven Winfield says:
     May 30, 2019 at 3:16 pm

    Shouldn’t we all be using “integer GENERATED ALWAYS AS IDENTITY” (or similar) columns now instead of “serial”, due to the lower burden of manually managing the sequence’s permissions?
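    A sketch of that variant of step 1, available in PostgreSQL 10 and later; the identity column is filled for every existing row as it is added, so the separate update step is not needed:

    alter table contacts
        add column uid int generated always as identity primary key;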

    • Elein says:
      May 31, 2019 at 9:41 pm

      Mea culpa as I responded to Evan. You are right. I will keep that in mind for future items. Thanks for your good point. ~e

  3. Evan Carroll says:
     May 30, 2019 at 8:06 pm

    A couple of points. Don’t use serial like that. It’s the old way to do it. You should be using identity columns. With that, you also don’t have to update to set the default.

    ALTER TABLE contacts ADD contact_id int PRIMARY KEY GENERATED BY DEFAULT AS identity;

    You also seem to be using a UID as a means to an end of deduplication. Strictly speaking, if that’s the premise it’s simply not needed, you can either use ctid (a system column) or you can just rewrite the table.

    DELETE FROM contacts WHERE ctid NOT IN ( SELECT DISTINCT ON (fname,lname,phone,email) ctid FROM contacts )

    • Elein says:
      May 31, 2019 at 9:40 pm

      Well, I admit to being old fashioned. Identity would have been the coolest way to go. It is in my queue for “next time”. As for using uid instead of ctid or any other system column, I prefer to have a fully controlled data column and not have to worry about activity, vacuuming, etc. This was suggested previously. Thank you very much for your comments. I have always depended on the QA of strangers. -e

  4. Myles F says:
     May 31, 2019 at 11:57 am

    I take it that setting uid as the primary key will unset the existing primary key. If it does that, is it likely to affect any foreign key relationships already existing on that table, and would you also need to re-set the previous primary key field after dropping uid?

    • Elein says:
      May 31, 2019 at 9:35 pm

      If you had a primary key in the table in the first place, you would not have had duplicates. No other relationships should be affected. Thanks for chiming in. -e

  5. Roman says:
     June 5, 2019 at 1:28 pm

    The closing bracket is missing in the delete query.

    • Elein says:
      June 6, 2019 at 12:43 am

      Good catch. That was me arguing with the formatting functions. Thank you. ~e

  6. Olivia Ava says:
     January 16, 2021 at 11:02 pm

    Setting the uid as the primary key could unset an existing primary key. The post is well written in story format, well done!

