OraFAQ Forum: SQL & PL/SQL » Grouping similar strings together using UTL

Home » SQL & PL/SQL » SQL & PL/SQL » Grouping similar strings together using UTL_MATCH Jaro-Winkler (12.1.0.1 )

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

Wed, 27 September 2023 23:39

OraFerro
Messages: 433
Registered: July 2011

Senior Member

Hi,

I have a table of strings (basically partner names) with a lot of duplication due to double entries of the same name with little variation. I need to clean the data by providing the user with groups of strings (partner names) that are similar.


create table test_jaro
  (
    id number(4) primary key,
    string varchar2(30)
  );

insert all 
  into test_jaro (id, string) values (1, 'My name is Fe"rro')
  into test_jaro (id, string) values (2, 'My nameis Ferro')
  into test_jaro (id, string) values (3, 'My name is Ferro')
  into test_jaro (id, string) values (4, 'My n ame is Ferro')
  into test_jaro (id, string) values (5, 'My n/ame is Ferro')
  into test_jaro (id, string) values (6, 'My name is Ferro')

  into test_jaro (id, string) values (7, 'Or1a FAQ')
  into test_jaro (id, string) values (8, 'Ora FAQ1')
  into test_jaro (id, string) values (9, 'Ora FAQ')
  into test_jaro (id, string) values (10, 'Ora FAQ')
  into test_jaro (id, string) values (11, 'Ora/ FAQ')
  into test_jaro (id, string) values (12, 'Ora/ FAQ')

  into test_jaro (id, string) values (13, 'a string')
select * from dual;


strings with 90% or above similarity are grouped together and a group id (string_group) is shows together with all mathing string ids (matching_string_ids) so that the final output would be:

string_group matching_string_ids 
1 1
1 2
1 3
1 4
1 5
1 6
2 7
2 8
2 9
2 10
2 11
2 12
3 13

Thanks,
Ferro

Report message to a moderator

Thu, 28 September 2023 00:02

OraFerro
Messages: 433
Registered: July 2011

Senior Member

Please see below my trivial trial and of course the performance is disastrous with larger samples:


select * from 
(
  select * FROM
    (
    SELECT
      t1.id AS string_id,
      t1.string AS base_string,
      t2.id AS matching_id,
      t2.string AS matching_string,
      UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
    FROM
      test_jaro t1
    JOIN
      test_jaro t2
    ON
      t1.id < t2.id
    )
    WHERE
      jaro_simi >= 90
)
where string_id not in 
(
 select matching_id from 
  (
    select * FROM
    (
    SELECT
      t1.id AS string_id,
      t1.string AS base_string,
      t2.id AS matching_id,
      t2.string AS matching_string,
      UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
    FROM
      test_jaro t1
    JOIN
      test_jaro t2
    ON
      t1.id < t2.id
    )
    WHERE
      jaro_simi >= 90
  )
);

[Updated on: Thu, 28 September 2023 00:22]

Report message to a moderator

Thu, 28 September 2023 00:59

Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...

Senior Member
Account Moderator

A simple solution could be (from your examples) to remove all non alpha-characters, using TRANSLATE for instance.

Report message to a moderator

Fri, 29 September 2023 12:14

mathguy
Messages: 107
Registered: January 2023

Senior Member

The relation "Jaro-Winkler score at 90% or above" is not transitive. (Nor is a similar definition for distance between points in the plane, for example - keep that in mind as an intuitive image.)

Therefore the "grouping" you are trying to output is nonsensical. Imagine you have many points in the plane and you want to create "groups" from points that are less than 1mm apart. How would that work? Points A and B are 0.8mm apart, B and C are also 0.8mm apart, but A and C are 1.3mm apart. So do A and C belong to the same group, or different groups?

There is a game, I think it's called "word ladder" or something like that, that illustrates the point exactly. You can go from HEAD to TAIL through one-letter changes; if you consider those to be "close", then HEAD and TAIL would be in the same group, even though they aren't similar at all.

Report message to a moderator

Sat, 30 September 2023 05:53

OraFerro
Messages: 433
Registered: July 2011

Senior Member

@Michel Cadot: what about switching between characters by mistake creating a new string (i.e. string ids: 7, Cool

. In the real case letters from French and German exist in addition to Arabic and English and punctuations are needed.

@Mathguy: I totally agree with you. In fact I should've ended by problem definition by asking for suggestions to solve my problem even if irrelevant to the technique I am suggesting. I need to help users in their exercise to filter the data and point out duplicates. My first version was just the first statement:

  select * FROM
    (
    SELECT
      t1.id AS string_id,
      t1.string AS base_string,
      t2.id AS matching_id,
      t2.string AS matching_string,
      UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
    FROM
      test_jaro t1
    JOIN
      test_jaro t2
    ON
      t1.id < t2.id
    )
    WHERE
      jaro_simi >= 90

However the list is large and providing a string with all possible combinations with other similar string (with a non-technology oriented user) makes this trial lose its point.

What do you both think?
Thanks for both of you.
Ferro

Report message to a moderator

Sat, 30 September 2023 09:31

Michel Cadot
Messages: 68647
Registered: March 2007
Location: Nanterre, France, http://...

Senior Member
Account Moderator

Quote:

@Michel Cadot: what about switching between characters by mistake creating a new string (i.e. string ids: 7

In this particular case, TRANSLATE remove all non-alphabetical characters so no problem.

Quote:

In the real case letters from French and German exist in addition to Arabic and English and punctuations are needed

Nothing prevent you from converting with TRANSLATE any character to any other one.
In one of my application I use the following to convert, among others, French characters to US7ASCII one:

translate (string,
            'ƒŠŒŽšœžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïøðñòóôõöùúûüýÿÞþ ',
            'fSEZsezYAAAAAAECEEEEIIIIDNOOOOOOUUUUYBaaaaaaeceeeeiiiioonooooouuuuyy')

You can then remove, in an outer TRANSLATE, the unwanted characters as I showed there.

The point is that this does not handle swapped characters, then a similarity algorithm will handle this more accurately than on the original value.

Report message to a moderator

Sun, 01 October 2023 00:53

OraFerro
Messages: 433
Registered: July 2011

Senior Member

@Michel
Quote:

The point is that this does not handle swapped characters

Exactly, which is what I meant by "switching two letters (or more) by mistake". In this case, if Quote:

similarity algorithm will handle this more accurately

, then what is the best technique/approach? the one I suggested has two options:
1- the inner query:

  select * FROM
    (
    SELECT
      t1.id AS string_id,
      t1.string AS base_string,
      t2.id AS matching_id,
      t2.string AS matching_string,
      UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) jaro_simi
    FROM
      test_jaro t1
    JOIN
      test_jaro t2
    ON
      t1.id < t2.id
    )
    WHERE
      jaro_simi >= 90

which produces logic results but the scope is very lengthy and repeating the string more than once confuses the average user.

2- remove repetition and group similar strings by the first occurrence but this will add an assumption that can be wrong and is very expensive in terms of execution time especially with large data.

Please tell me what you would recommend.

Regards,
Ferro

Report message to a moderator

Sun, 01 October 2023 03:57

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

The following produces the same results as your query, but in a slightly different format. It eliminates the possibility of having things that are not closely related in the same group, due to a chain of relations, by comparing collections and only eliminating those that are a subset of another. Due to the sample data, the results happen to be the same here.

SCOTT@orcl_12.1.0.2.0> CREATE OR REPLACE TYPE ids AS TABLE OF NUMBER(4);
  2  /

Type created.

SCOTT@orcl_12.1.0.2.0> CREATE OR REPLACE TYPE strings AS TABLE OF VARCHAR2(30);
  2  /

Type created.

SCOTT@orcl_12.1.0.2.0> WITH
  2    q AS
  3  	 (SELECT t1.id AS string_id,
  4  		 t1.string AS base_string,
  5  		 CAST (COLLECT (t2.id ORDER BY t2.id) AS ids) AS matching_ids,
  6  		 CAST (COLLECT (t2.string ORDER BY t2.id) AS strings)  AS matching_strings
  7  	  FROM	 test_jaro t1, test_jaro t2
  8  	  WHERE  t1.id < t2.id
  9  	  AND	 UTL_MATCH.JARO_WINKLER_SIMILARITY(t1.string, t2.string) > 90
 10  	  GROUP  BY t1.id, t1.string)
 11  SELECT q1.*
 12  FROM   q q1
 13  WHERE  NOT EXISTS
 14  	      (SELECT *
 15  	       FROM   q q2
 16  	       WHERE  q1.string_id != q2.string_id
 17  	       AND    q1.matching_strings SUBMULTISET OF q2.matching_strings)
 18  /

 STRING_ID BASE_STRING                    MATCHING_IDS
---------- ------------------------------ ------------------------------
MATCHING_STRINGS
----------------------------------------------------------------------------------------------------------------------------------
         1 My name is Fe"rro              IDS(2, 3, 4, 5, 6)
STRINGS('My nameis Ferro', 'My name is Ferro', 'My n ame is Ferro', 'My n/ame is Ferro', 'My name is Ferro')

         7 Or1a FAQ                       IDS(8, 9, 10, 11, 12)
STRINGS('Ora FAQ1', 'Ora FAQ', 'Ora FAQ', 'Ora/ FAQ', 'Ora/ FAQ')


2 rows selected.

Report message to a moderator

Previous Topic:	Inserted JSON into Column is being Truncated
Next Topic:	Improve select with inner JOIN

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Sat Apr 27 17:29:01 CDT 2024