SQL is an easy language to learn.
A few commands to learn (SELECT/FROM/WHERE/GROUP BY
) and you are good to go.
One downside of the simple syntax and small vocabulary is that it can lead many SQL users to the assumption that these building blocks are all that can be expressed with SQL.
This is an example to show that this notion is not true at all.
It is a question from StackOverflow:
QUESTION:
How can I find materials within document numbers which match at least by X-Percentage (e.g. >=50%), in other document numbers?
https://stackoverflow.com/questions/60409464/sap-hana-sql-to-compare-line-items
To illustrate the problem, let’s have a table that contains documents (we can think of it as customer orders, but the exact business document type does not matter here) that lists line-items of materials. Those could just be the items you ordered off Amazon in a single order.
The table could look like this (test data from the SO question):
DOCUMENT | POSNR | MATERIAL |
---|---|---|
100 | 10 | R3 |
100 | 20 | 7000000 |
100 | 30 | 7000010 |
100 | 40 | 7000011 |
100 | 50 | 7000160 |
200 | 10 | SW |
200 | 20 | 7000000 |
200 | 30 | 7000010 |
200 | 40 | 7000011 |
200 | 50 | 7000160 |
200 | 60 | 7000036 |
200 | 70 | 7000040 |
200 | 80 | 7000066 |
200 | 90 | 7000068 |
300 | 01 | 7000160 |
300 | 11 | 7000011 |
400 | 10 | 7000033 |
400 | 20 | 7000034 |
400 | 50 | 7000068 |
400 | 60 | 7000079 |
Some background – what could that be good for?
With this kind of data – but likely many more documents with many more materials – we are looking for which documents match other documents by a certain percentage.
Answering this question could also answer questions like “If my order is not available now, which other order contains P% of the items I ordered?“.
A related question could be “Given the current order, what are similar orders and what else did those orders include?“.
Which brings us to the well-known “Market Basket Analysis”/”Affinity Analysis” (that greets everyone in online shops in the “Frequently bought together” section.
However, typically MBA does not look for how well different orders match, so let’s leave this approach aside for this post.
How to start thinking about this
A first naive approach to manually compute an answer would probably look like this:
- start by taking one document
- comparing with all other documents
- note down the number of matching materials for each compared document
- compare with the number of materials in the first document
- rinse and repeat for all documents of interest.
Once we’re looking at more than a few documents, this approach gets tiresome.
One up from this we could make up a matrix like this:
document/ material | SW | R3 | 7000000 | 7000010 | 7000011 | 7000160 | … |
---|---|---|---|---|---|---|---|
100 | X | X | X | X | X | … | |
200 | X | X | X | X | X | … | |
300 | X | X | … | ||||
400 | … | … | … | … | … | … | … |
… | … | … | … | … | … | … | … |
With such a matrix, comparing two documents is a lot easier.
We could represent the X as bits and use an AND-operation to find the materials both documents have in common.
Comparing the number of common materials with the number of materials that the first document includes gives the “matching percentage“.
Once the columns for all the materials are determined, the task of checking individual document could even be parallelized easily.
So, how do we do this in the database?
Such an approach is difficult in SQL because building up a sparse matrix takes a lot of time and memory and that SQL DBMS usually don’t provide efficient commands for this.
Here’s what we can do
Let’s start with the obvious:
we need to get the unique materials for all documents and we also need to know how many unique materials per document there are.
That’s easy:
select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable;
The result of this look like this:
DOCUMENT | MATERIAL | MATERIAL_CNT |
---|---|---|
300 | 7000160 | 2 |
300 | 7000011 | 2 |
400 | 7000033 | 4 |
400 | 7000034 | 4 |
400 | 7000068 | 4 |
400 | 7000079 | 4 |
100 | R3 | 5 |
100 | 7000000 | 5 |
100 | 7000010 | 5 |
100 | 7000011 | 5 |
… | … | … |
Let’s call this set doc_elements
and put it into a common table expression (WITH clause). You could also save this query in a view or a table variable if you’re working in SQLScript.
Next, we want to compare every document with every other document and find matching materials.
To do that, SQL comes with a handy language-construct: JOINs.
We’re going to join the doc_elements
set with itself, but want to make sure that the documents are not the same.
To make it easier to think about it, one of the doc_elements-references is called side_a the other side_b.
Because it is possible that there are no matches, we need to use an OUTER JOIN here.
Otherwise, no matches would lead to a final result, that does not contain the matching percentage for a document_a against another document_b at all.
Instead of that, we’d rather see a 0% match.
select
...
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
Now we need to think about what information we want from this join.
I chose document
and material
from side_a and document
and material_cnt
from side_b.
Comparing documents
In addition, there’s a window aggregate function to count the number of matching materials.
Finally, we filter out those matches where the materials match, but the documents do not NOT MATCH (yes, double negative!) – i.e. the materials that belong to document_a:
select
side_a.document as document_a
, side_a.material
, side_b.material_cnt as material_b_cnt
, side_b.document doc_b
, count(*) OVER
(PARTITION BY side_a.document, side_b.document) as match_cnt
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
where
side_b.document IS NOT NULL
The result of this looks like this:
DOCUMENT_A | MATERIAL_A | MATERIAL_B_CNT | DOCUMENT_B | MATCH_CNT |
---|---|---|---|---|
300 | 7000160 | 9 | 200 | 2 |
300 | 7000011 | 9 | 200 | 2 |
300 | 7000160 | 5 | 100 | 2 |
300 | 7000011 | 5 | 100 | 2 |
400 | 7000068 | 9 | 200 | 1 |
100 | 7000000 | 9 | 200 | 4 |
100 | 7000010 | 9 | 200 | 4 |
100 | 7000011 | 9 | 200 | 4 |
100 | 7000160 | 9 | 200 | 4 |
100 | 7000011 | 2 | 300 | 2 |
100 | 7000160 | 2 | 300 | 2 |
… | … | … | … | … |
We can read this easily from left to right:
Document 300 has materials 7000160 and 7000011.
Matched against document 200, with a total of 9 materials of which 2 are matching with document 300.
Let’s call this set matched_materials
.
That’s pretty good progress.
All that is missing now is to calculate the matching percentage.
This is straight-forward to do:
select distinct
document_a
, document_b
, material_b_cnt
, match_cnt
, round( (100/material_b_cnt) * match_cnt, 2) as match_pct
from
matched_materials
order by
document_a
, document_b;
DOCUMENT_A | DOCUMENT_B | MATERIAL_B_CNT | MATCH_CNT | MATCH_PCT |
---|---|---|---|---|
100 | 200 | 9 | 4 | 44.44 |
100 | 300 | 2 | 2 | 100 |
200 | 100 | 5 | 4 | 80 |
200 | 300 | 2 | 2 | 100 |
200 | 400 | 4 | 1 | 25 |
300 | 100 | 5 | 2 | 40 |
300 | 200 | 9 | 2 | 22.22 |
400 | 200 | 9 | 1 | 11.11 |
So, here we see that the MATCH_PCT
shows the percentage of how many of the materials from document_b match those of document_a.
For example, document 300 has only two materials and all of them are found in document 200 (with 9 materials in total).
Therefore, the MATCH_PCT
for 200 -> 300 is 100% (all materials in document 300 are found in 200) but for 300 -> 200 its only 22.22% since only 2 of 9 materials are matched.
The final solution (for now)
The complete SQL looks like this:
with doc_elements
(document, material, material_cnt)
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable
)
, matched_materials
(document_a, material_a, material_b_cnt, document_b, match_cnt)
as (select
side_a.document as document_a
, side_a.material as material_a
, side_b.material_cnt as material_b_cnt
, side_b.document as document_b
, count(*) OVER
(PARTITION BY side_a.document, side_b.document) as match_cnt
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
where
side_b.document IS NOT NULL
)
select distinct
document_a
, document_b
, material_b_cnt
, match_cnt
, round((100/material_b_cnt)*match_cnt, 2) as match_pct
from
matched_materials
order by
document_a
, document_b;
Mission accomplished.
This is the answer I posted to the SO’s question and it addresses the OP’s main concern to achieve the computation without using CURSORs.
Does this mean, this is the end of this development?
Not at all.
One thing we haven’t looked at is:
How long does this query take to finish when there is more than just the example data in the database?
Pingback: Matchmaker - quick quick? - The Lars Breddemann Blog