Matchmaker

SQL is an easy language to learn.
A few commands to learn (SELECT/FROM/WHERE/GROUP BY) and you are good to go.
One downside of the simple syntax and small vocabulary is that it can lead many SQL users to the assumption that these building blocks are all that can be expressed with SQL.

This is an example to show that this notion is not true at all.

It is a question from StackOverflow:

QUESTION:
How can I find materials within document numbers which match at least by X-Percentage (e.g. >=50%), in other document numbers?


https://stackoverflow.com/questions/60409464/sap-hana-sql-to-compare-line-items

To illustrate the problem, let’s have a table that contains documents (we can think of it as customer orders, but the exact business document type does not matter here) that lists line-items of materials. Those could just be the items you ordered off Amazon in a single order.

The table could look like this (test data from the SO question):

DOCUMENTPOSNRMATERIAL
10010R3
100207000000
100307000010
100407000011
100507000160
20010SW
200207000000
200307000010
200407000011
200507000160
200607000036
200707000040
200807000066
200907000068
300017000160
300117000011
400107000033
400207000034
400507000068
400607000079

Some background – what could that be good for?

With this kind of data – but likely many more documents with many more materials – we are looking for which documents match other documents by a certain percentage.

Answering this question could also answer questions like “If my order is not available now, which other order contains P% of the items I ordered?“.

Screenshot of "Frequently bought together" section for SAP HANA Administration book on Amazon.
“Frequently bought together” on Amazon – thanks for the good rating of my book!

A related question could be “Given the current order, what are similar orders and what else did those orders include?“.
Which brings us to the well-known “Market Basket Analysis”/”Affinity Analysis” (that greets everyone in online shops in the “Frequently bought together” section.

However, typically MBA does not look for how well different orders match, so let’s leave this approach aside for this post.

How to start thinking about this

A first naive approach to manually compute an answer would probably look like this:

  1. start by taking one document
  2. comparing with all other documents
  3. note down the number of matching materials for each compared document
  4. compare with the number of materials in the first document
  5. rinse and repeat for all documents of interest.

Once we’re looking at more than a few documents, this approach gets tiresome.
One up from this we could make up a matrix like this:

document/
material
SWR37000000700001070000117000160
100XXXXX
200XXXXX
300XX
400

With such a matrix, comparing two documents is a lot easier.
We could represent the X as bits and use an AND-operation to find the materials both documents have in common.
Comparing the number of common materials with the number of materials that the first document includes gives the “matching percentage“.

Once the columns for all the materials are determined, the task of checking individual document could even be parallelized easily.

So, how do we do this in the database?

Such an approach is difficult in SQL because building up a sparse matrix takes a lot of time and memory and that SQL DBMS usually don’t provide efficient commands for this.

Here’s what we can do

Let’s start with the obvious:
we need to get the unique materials for all documents and we also need to know how many unique materials per document there are.

That’s easy:

select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
from
    mytable;

The result of this look like this:

DOCUMENTMATERIALMATERIAL_CNT
30070001602
30070000112
40070000334
40070000344
40070000684
40070000794
100R35
10070000005
10070000105
10070000115

Let’s call this set doc_elements and put it into a common table expression (WITH clause). You could also save this query in a view or a table variable if you’re working in SQLScript.

Next, we want to compare every document with every other document and find matching materials.
To do that, SQL comes with a handy language-construct: JOINs.

We’re going to join the doc_elements set with itself, but want to make sure that the documents are not the same.
To make it easier to think about it, one of the doc_elements-references is called side_a the other side_b.

Because it is possible that there are no matches, we need to use an OUTER JOIN here.
Otherwise, no matches would lead to a final result, that does not contain the matching percentage for a document_a against another document_b at all.
Instead of that, we’d rather see a 0% match.

select
     ... 
from 
                doc_elements side_a
left outer join doc_elements side_b
        on   side_a.material = side_b.material
        and side_a.document != side_b.document

Now we need to think about what information we want from this join.

I chose document and material from side_a and document and material_cnt from side_b.

Comparing documents

In addition, there’s a window aggregate function to count the number of matching materials.

Finally, we filter out those matches where the materials match, but the documents do not NOT MATCH (yes, double negative!) – i.e. the materials that belong to document_a:

select
      side_a.document as document_a
    , side_a.material
    , side_b.material_cnt as material_b_cnt
    , side_b.document doc_b
    , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
from 
                doc_elements side_a
left outer join doc_elements side_b
        on   side_a.material = side_b.material
        and side_a.document != side_b.document
where 
    side_b.document IS NOT NULL

The result of this looks like this:

DOCUMENT_AMATERIAL_AMATERIAL_B_CNTDOCUMENT_BMATCH_CNT
300700016092002
300700001192002
300700016051002
300700001151002
400700006892001
100700000092004
100700001092004
100700001192004
100700016092004
100700001123002
100700016023002

We can read this easily from left to right:
Document 300 has materials 7000160 and 7000011.
Matched against document 200, with a total of 9 materials of which 2 are matching with document 300.
Let’s call this set matched_materials.

That’s pretty good progress.

All that is missing now is to calculate the matching percentage.
This is straight-forward to do:

select distinct
      document_a
    , document_b
    , material_b_cnt
    , match_cnt
    , round( (100/material_b_cnt) * match_cnt, 2) as match_pct
from 
    matched_materials
order by
    document_a
  , document_b;
DOCUMENT_ADOCUMENT_BMATERIAL_B_CNTMATCH_CNTMATCH_PCT
1002009444.44
10030022100
2001005480
20030022100
2004004125
3001005240
3002009222.22
4002009111.11

So, here we see that the MATCH_PCT shows the percentage of how many of the materials from document_b match those of document_a.

For example, document 300 has only two materials and all of them are found in document 200 (with 9 materials in total).
Therefore, the MATCH_PCT for 200 -> 300 is 100% (all materials in document 300 are found in 200) but for 300 -> 200 its only 22.22% since only 2 of 9 materials are matched.

The final solution (for now)

The complete SQL looks like this:

with doc_elements 
(document, material, material_cnt)  
as  (select distinct
        document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    )  
, matched_materials 
(document_a, material_a, material_b_cnt, document_b, match_cnt)  
as  (select
        side_a.document as document_a
    , side_a.material as material_a
    , side_b.material_cnt as material_b_cnt
    , side_b.document as document_b
    , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
    from 
                        doc_elements side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
    where 
            side_b.document IS NOT NULL
    )      
select distinct
    document_a
   , document_b
   , material_b_cnt
   , match_cnt
   , round((100/material_b_cnt)*match_cnt, 2) as match_pct
from 
    matched_materials
order by
    document_a
  , document_b;

Mission accomplished.

This is the answer I posted to the SO’s question and it addresses the OP’s main concern to achieve the computation without using CURSORs.

Does this mean, this is the end of this development?
Not at all.

One thing we haven’t looked at is:
How long does this query take to finish when there is more than just the example data in the database?

I’ll look into this in the next blog post.

1 Trackback or Pingback

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.