Re: Merge rows based on Levenshtein distance - Mailing list pgsql-general

From David G Johnston
Subject Re: Merge rows based on Levenshtein distance
Date
Msg-id CAKFQuwbhyimtQR9NY28RtfLvpOhgsSRe0H_mT=MqTZtz3k9bnA@mail.gmail.com
Whole thread Raw
In response to Re: Merge rows based on Levenshtein distance  (mongoose <akarargyris@gmail.com>)
Responses Re: Merge rows based on Levenshtein distance
List pgsql-general
On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <[hidden email]> wrote:
David,

Thank you for your prompt reply. I believe your answer helped a lot but it seems I was not clear enough on my description. Basically I want a counter (id) to show if two or more names are similar (i.e. levenshtein distance less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only when there is a different name. Please notice that the table has its names in a completely random order.


Write and combine a few subqueries that use window functions (namely lag and row_number) to identify groups, label them, and assign rows to each group (using a between condition on a join)

Pondering some (not tested) if you identify the boundary records in a subquery you can assign them a value of 1 while all others take on null.  In the outer query you should be able to assign groups by simply applying the sum function over the entire result such that at each boundary value the presence of the 1 will increment the sum while the null rows will use the sum value from the prior row.

David J.



View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

pgsql-general by date:

Previous
From: David G Johnston
Date:
Subject: Re: Is "WITH () UPDATE" Thread Safe ?
Next
From: Tim Schäfer
Date:
Subject: Re: Auto vacuum not running -- Could not bind socket for statistics collector