In Matchmaker, I developed a solution to a not-so-obvious problem.
The solution worked and delivered the right answer: finding documents whose line items match those of a reference document to a certain degree.
Now, getting something like this to work on a test data set is important to see that the query logic is doing the expected thing. A carefully constructed set of test data can ensure the solution covers all requirements.
What about performance?
The performance of a given solution, however, cannot be checked with just a few records.
For that, we need data.
Lots of data, just like Keanu needs guns.
Generating volume test data
The structure of the data we want to artificially create is very simple. It’s a single table with just three columns:
CREATE COLUMN TABLE "MYTABLE" (
"DOCUMENT" NVARCHAR(10) DEFAULT '' NOT NULL
, "POSNR" NVARCHAR(6) DEFAULT '000000' NOT NULL
, "MATERIAL" NVARCHAR(40) DEFAULT '' NOT NULL
, PRIMARY KEY ("DOCUMENT", "POSNR")
)
“DOCUMENT” is just an arbitrary string, “POSNR” is a number stored as a string and unique for each “DOCUMENT”, and finally “MATERIAL” is an arbitrary string out of a set of possible “materials”.
Dummy data to go
To have documents with different numbers of line items and different materials, we can simply make use of the RAND()
function. This will of course not realistically emulate real-life data distributions, but is good enough to get an idea of what data volume does to our solution in terms of runtime and memory requirements.
select
TO_NVARCHAR(doc.generated_period_start) as document
, TO_NVARCHAR(pos.generated_period_start) as posnr
, TO_NVARCHAR(7000000 + ceil(rand()* 9999)) as material
from
series_generate_integer (10, 100, 10000000) as doc
cross join
series_generate_integer (10, 10, 1000) as pos
where
rand() < 0.75;
The statement above generates 999,990 document numbers and 99 line-item numbers, creates a random material number between ‘7000000’ and ‘7009999’, and then filters roughly 25% of all generated records out again (rand() < 0.75).
Thanks to the series_generate_integer
functions, this statement only takes about 2:42 minutes on my test machine to fill the table with ~49.5 million rows, i.e. ~305,555 records per second.
That's pretty quick, I'd say.
Checking what we have now
A quick check of the data in our table now shows: there's plenty.
select
count(distinct document) as doc_cnt
, count(distinct posnr) as posnr_cnt
, count(distinct material) as mat_cnt
, count(*)
from
mytable;
DOC_CNT POSNR_CNT MAT_CNT COUNT(*)
999,990 66 999,999 49,504,784
It also gives us an idea of how many documents we now have to compare with each other.
Apple's market cap size
Comparing all 999,990 documents with all the other 999,989 documents makes 999,979,000,110 comparisons, or nearly one trillion, as WolframAlpha explains quite nicely:

Let the testing begin
While this is not a realistic data set in terms of data distribution, it looks like we now have something to keep the HANA busy with.
Step 1: run the original solution
First off, we just run the SQL I posted as the solution in my previous post:
with doc_elements
(document, material, material_cnt)
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable
)
, matched_materials
(document_a, material_a, material_b_cnt, document_b, match_cnt)
as (select
side_a.document as document_a
, side_a.material as material_a
, side_b.material_cnt as material_b_cnt
, side_b.document as document_b
, count(*) OVER
(PARTITION BY side_a.document, side_b.document) as match_cnt
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
where
side_b.document IS NOT NULL
)
select distinct
document_a
, document_b
, material_b_cnt
, match_cnt
, round((100/material_b_cnt)*match_cnt, 2) as match_pct
from
matched_materials
order by
document_a
, document_b;
If I had thought a bit about the data volume beforehand, it would have been obvious that this might not work out very well.
So, after a couple of minutes, I got the following error:
Could not execute 'with doc_elements (document, material, material_cnt) as (select distinct document , material , ...' in 6:26.459 minutes.
SAP DBTech JDBC: [4]: cannot allocate enough memory: please check traces for further information
Step 2: Change to a more sensible query
Ok, so we cannot compute all document comparisons in one go and we probably don't want to anyhow.
How about we ask for documents that match one reference document instead? That reduces the size of the problem to 999,989 documents that need to be compared (plus, of course, their respective line items).
In addition to that, we limit the returned records to those with a minimum line-item match of 3% (for a document with 51 line items, that means at least two matching materials). Otherwise, we would just get a lot of uninteresting, barely matching documents.
with doc_elements
(document, material, material_cnt)
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable
)
, matched_materials
(document_a, material, material_b_cnt, document_b, match_cnt)
as (select
side_a.document as document_a
, side_a.material
, side_b.material_cnt as material_b_cnt
, side_b.document as document_b
, count(*) OVER
(PARTITION BY side_a.document, side_b.document) as match_cnt
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
where
side_b.document IS NOT NULL
)
select distinct
document_a
, document_b
, material_b_cnt
, match_cnt
, round((100/material_b_cnt)*match_cnt, 2) as match_pct
from
matched_materials
where
round((100/material_b_cnt)*match_cnt, 2) >= 3
and document_a = '1100'
order by
document_a
, document_b;
----
Statement 'with doc_elements (document, material, material_cnt) as (select distinct document , material , ...'
successfully executed in 1:32.720 minutes
(server processing time: 1:32.721 minutes)
----
DOCUMENT_A DOCUMENT_B MATERIAL_B_CNT MATCH_CNT MATCH_PCT
1100 3020570 51 2 3.92
1100 4802470 51 2 3.92
1100 5223960 32 1 3.13
1100 5470720 52 2 3.85
1100 6030740 47 2 4.26
A minute and a half is not a bad response time for such a query, but light-years away from anything one would want to put into an "interactive" usage scenario.
Step 3: look for options to improve performance
The first thing to do when trying to improve query performance is to find out where the time is spent during processing.
With SAP HANA the PlanViz trace is nearly always the way to go to learn about statement execution. While the XSA WebIDE provides some features to work with PlanViz traces, I have not yet seen a version that is half as usable as the Eclipse-based tool.
So for now, I stick with the "classic" PlanViz tool.
This is what we get from the SQL from Step 2:

The graph above shows how the data "flows" starting on the bottom left side up to the red "1" and continuing at the red "2" on the bottom right.
What we see is that the one filter we used (document_a = '1100') is not applied directly to the table. Instead, all the table rows are processed upstream and the filter is only applied on top of an internal table (ITab) much later.
As mentioned before, this version of the statement took about 90 seconds (1:30 min.), and the PlanViz Overview tab tells us that it allocated 22.3 GB during execution.
No filter push-down? Why?
To understand why the filter was not "pushed down", we need to go back to the SQL:
with doc_elements
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable )
, matched_materials
as (select
[...]
from
doc_elements side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
[...] )
select
[...]
from
matched_materials
where
[...]
document_a = '1100'
order by ...
The filter is defined on matched_materials.document_a, which is where we make the distinction between side_a and side_b of the comparison. But both sides are projections of the very same doc_elements derived table.
This one, in turn, calculates the number of materials per document in the window aggregate function. The result of this computation needs to be stored somewhere: this is what the internal table is for.
Now, let's fix this
Can we improve this? Well, if we could at least filter document_a early, then the filter could work on the table structure instead of the internal table, and the statement would not have to push the 49.5 million record IDs upstream twice.
An easy option, thanks to the way the SQL is structured in the first place, is to make up a new derived table that allows HANA to push the filter down to the table without affecting the data for side_b.
with doc_elements
(document, material, material_cnt)
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable
)
-- separate derived table for side_a
-- to allow filter push-down
, doc_elements_a
(document, material, material_cnt)
as (select distinct
document
, material
, count(*) OVER
(PARTITION BY document) as MATERIAL_CNT
from
mytable
)
--
, matched_materials
(document_a, material, material_b_cnt, document_b, match_cnt)
as (select
side_a.document as document_a
, side_a.material
, side_b.material_cnt as material_b_cnt
, side_b.document as document_b
, count(*) OVER
(PARTITION BY side_a.document, side_b.document) as match_cnt
from
doc_elements_a side_a
left outer join doc_elements side_b
on side_a.material = side_b.material
and side_a.document != side_b.document
where
side_b.document IS NOT NULL
)
select distinct
document_a
, document_b
, material_b_cnt
, match_cnt
, round((100/material_b_cnt)*match_cnt, 2) as match_pct
from
matched_materials
where
round((100/material_b_cnt)*match_cnt, 2) >= 3
and document_a = '1100'
order by
document_a
, document_b;
All we needed to do was copy the doc_elements table expression as doc_elements_a and reference this new derived table for side_a in the join.
Time for a test run
In PlanViz, this version of the statement took only 44.1 seconds, less than half of the time we needed before. The allocated memory this time around was only 9 GB, again less than half of the first version.
Not sure about you, but I think this is a great improvement.

The PlanViz graph above tells the story clearly: now the filter is applied for side_a, and even the inverted filter is used for the access on side_b. HANA can figure this out because of the join condition and the NOT NULL constraints on the columns, so no weird three-valued logic is required here.
Note how little data is now transferred upstream in the processing and how the HASH JOIN no longer requires a filter step on top of an internal table.
Step 4: Further options?
We've already cut the execution time and memory requirement in half. What else could we do?
Usual suspects for ideas would probably include things like "Creating an Index" or "Setting up partitioning".
But just trying things out is usually not a good way forward.
In this case, the lion's share of the runtime (in the revised statement) is not spent accessing the data in the table. And that is what indexing or partitioning would have addressed (maybe).
Instead, this time is spent grouping and calculating the number of materials and matching materials on top of the HASH JOIN result.
On my 4-core test system, this step is already running with 4 parallel threads, which means that there is no easy way left to speed this up.
Unless we find an alternative algorithm/SQL statement to reduce the amount of data that needs to be compared further, that's about as fast as this computation will go on this machine, off of this table structure.
And this is where it ends
For an interactive application, one may consider transforming the data or precomputing some of the results (see the sketch below), but as the OP's original question only asked for something better than a cursor-based SQLScript solution, I think we are done here.
And this is how you improve the performance of SQL by copy & paste 😉
There you go, now you know.