SQL is an easy language to learn.
A few commands to learn (
SELECT/FROM/WHERE/GROUP BY) and you are good to go.
One downside of the simple syntax and small vocabulary is that it can lead many SQL users to the assumption that these building blocks are all that can be expressed with SQL.
This is an example to show that this notion is not true at all.
It is a question from StackOverflow:
To illustrate the problem, let’s have a table that contains documents (we can think of it as customer orders, but the exact business document type does not matter here) that lists line-items of materials. Those could just be the items you ordered off Amazon in a single order.
The table could look like this (test data from the SO question):
Some background – what could that be good for?
With this kind of data – but likely many more documents with many more materials – we are looking for which documents match other documents by a certain percentage.
Answering this question could also answer questions like “If my order is not available now, which other order contains P% of the items I ordered?“.
A related question could be “Given the current order, what are similar orders and what else did those orders include?“.
Which brings us to the well-known “Market Basket Analysis”/”Affinity Analysis” (that greets everyone in online shops in the “Frequently bought together” section.
However, typically MBA does not look for how well different orders match, so let’s leave this approach aside for this post.
How to start thinking about this
A first naive approach to manually compute an answer would probably look like this:
- start by taking one document
- comparing with all other documents
- note down the number of matching materials for each compared document
- compare with the number of materials in the first document
- rinse and repeat for all documents of interest.
Once we’re looking at more than a few documents, this approach gets tiresome.
One up from this we could make up a matrix like this:
With such a matrix, comparing two documents is a lot easier.
We could represent the X as bits and use an AND-operation to find the materials both documents have in common.
Comparing the number of common materials with the number of materials that the first document includes gives the “matching percentage“.
Once the columns for all the materials are determined, the task of checking individual document could even be parallelized easily.
So, how do we do this in the database?
Such an approach is difficult in SQL because building up a sparse matrix takes a lot of time and memory and that SQL DBMS usually don’t provide efficient commands for this.
Here’s what we can do
Let’s start with the obvious:
we need to get the unique materials for all documents and we also need to know how many unique materials per document there are.
select distinct document , material , count(*) OVER (PARTITION BY document) as MATERIAL_CNT from mytable;
The result of this look like this:
Let’s call this set
doc_elements and put it into a common table expression (WITH clause). You could also save this query in a view or a table variable if you’re working in SQLScript.
Next, we want to compare every document with every other document and find matching materials.
To do that, SQL comes with a handy language-construct: JOINs.
We’re going to join the
doc_elements set with itself, but want to make sure that the documents are not the same.
To make it easier to think about it, one of the doc_elements-references is called side_a the other side_b.
Because it is possible that there are no matches, we need to use an OUTER JOIN here.
Otherwise, no matches would lead to a final result, that does not contain the matching percentage for a document_a against another document_b at all.
Instead of that, we’d rather see a 0% match.
select ... from doc_elements side_a left outer join doc_elements side_b on side_a.material = side_b.material and side_a.document != side_b.document
Now we need to think about what information we want from this join.
material from side_a and
material_cnt from side_b.
In addition, there’s a window aggregate function to count the number of matching materials.
Finally, we filter out those matches where the materials match, but the documents do not NOT MATCH (yes, double negative!) – i.e. the materials that belong to document_a:
select side_a.document as document_a , side_a.material , side_b.material_cnt as material_b_cnt , side_b.document doc_b , count(*) OVER (PARTITION BY side_a.document, side_b.document) as match_cnt from doc_elements side_a left outer join doc_elements side_b on side_a.material = side_b.material and side_a.document != side_b.document where side_b.document IS NOT NULL
The result of this looks like this:
We can read this easily from left to right:
Document 300 has materials 7000160 and 7000011.
Matched against document 200, with a total of 9 materials of which 2 are matching with document 300.
Let’s call this set
That’s pretty good progress.
All that is missing now is to calculate the matching percentage.
This is straight-forward to do:
select distinct document_a , document_b , material_b_cnt , match_cnt , round( (100/material_b_cnt) * match_cnt, 2) as match_pct from matched_materials order by document_a , document_b;
So, here we see that the
MATCH_PCT shows the percentage of how many of the materials from document_b match those of document_a.
For example, document 300 has only two materials and all of them are found in document 200 (with 9 materials in total).
MATCH_PCT for 200 -> 300 is 100% (all materials in document 300 are found in 200) but for 300 -> 200 its only 22.22% since only 2 of 9 materials are matched.
The final solution (for now)
The complete SQL looks like this:
with doc_elements (document, material, material_cnt) as (select distinct document , material , count(*) OVER (PARTITION BY document) as MATERIAL_CNT from mytable ) , matched_materials (document_a, material_a, material_b_cnt, document_b, match_cnt) as (select side_a.document as document_a , side_a.material as material_a , side_b.material_cnt as material_b_cnt , side_b.document as document_b , count(*) OVER (PARTITION BY side_a.document, side_b.document) as match_cnt from doc_elements side_a left outer join doc_elements side_b on side_a.material = side_b.material and side_a.document != side_b.document where side_b.document IS NOT NULL ) select distinct document_a , document_b , material_b_cnt , match_cnt , round((100/material_b_cnt)*match_cnt, 2) as match_pct from matched_materials order by document_a , document_b;
This is the answer I posted to the SO’s question and it addresses the OP’s main concern to achieve the computation without using CURSORs.
Does this mean, this is the end of this development?
Not at all.
One thing we haven’t looked at is:
How long does this query take to finish when there is more than just the example data in the database?