Matchmaker – quick quick?

In Matchmaker, I developed a solution to a not so obvious problem.

The solution worked and delivered the right answer: finding documents whose line-items match, to a certain degree, the line-items of a reference document.

Now, getting something like this to work on a test data set is important to see that the query logic is doing the expected thing. A carefully constructed set of test data can ensure the solution covers all requirements.

What about performance?

The performance of a given solution, however, cannot be checked with just a few records.
For that, we need data.
Lots of data, just like Keanu needs guns.

Generating volume test data

The structure of the data we want to artificially create is very simple. It’s a single table with just three columns:

CREATE COLUMN TABLE "MYTABLE" (
       "DOCUMENT" NVARCHAR(10) DEFAULT '' NOT NULL 
     , "POSNR" NVARCHAR(6) DEFAULT '000000' NOT NULL 
     , "MATERIAL" NVARCHAR(40) DEFAULT '' NOT NULL
     , PRIMARY KEY ("DOCUMENT", "POSNR")
     );

“DOCUMENT” is just an arbitrary string, “POSNR” a number stored as a string and unique for each “DOCUMENT”, and finally “MATERIAL” is an arbitrary string out of a set of possible “materials”.

Dummy data to go

In order to have documents with different numbers of line-items and different materials, we can simply make use of the RAND() function. This will of course not realistically emulate real-life data distributions, but is good enough to get an idea of what data volume does to our solution in terms of runtime and memory requirements.

select
     TO_NVARCHAR(doc.generated_period_start) as document
   , TO_NVARCHAR(pos.generated_period_start) as posnr
   , TO_NVARCHAR(7000000 + ceil(rand()* 9999)) as material
from
    series_generate_integer (10, 100, 10000000) as doc
    cross join 
    series_generate_integer (10, 10, 1000) as pos
where
    rand() < 0.75;

The statement above generates 999,990 document numbers and 99 line-item numbers, creates a random material number between '7000000' and '7009999', and then filters out 25% of all generated records again via the rand() < 0.75 predicate.
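For readers without a HANA system at hand, the same idea can be sketched in plain Python. Everything here is my own toy stand-in: nested range() calls play the role of series_generate_integer (with much smaller ranges), random.random() plays RAND(), and ~25% of the combinations are dropped again:

```python
import random

def generate_rows(seed=42):
    """Toy stand-in for the HANA generator statement above."""
    rng = random.Random(seed)
    for doc in range(100, 1100, 10):        # 100 "documents"
        for pos in range(10, 110, 10):      # 10 line-items per document
            if rng.random() < 0.75:         # keep ~75% of all combinations
                # material between '7000001' and '7009999'
                material = 7000000 + rng.randint(1, 9999)
                yield (str(doc), str(pos), str(material))

rows = list(generate_rows())
print(len(rows))   # roughly 750 of the 1,000 possible (document, posnr) pairs
```

Scaled up to the ranges used in the SQL statement, the same cross-join-plus-filter shape yields the tens of millions of rows used below.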

Thanks to the series_generate_integer function, this statement takes only about 2:42 minutes on my test machine to fill the table with ~49.5 million rows, i.e. ~305,555 records per second.
That’s pretty quick, I’d say.

Checking what we have now

A quick check of the data in our table now shows: there’s plenty.

select 
     count(distinct document) as doc_cnt
   , count(distinct posnr) as posnr_cnt
   , count(distinct material) as mat_cnt
   , count(*)
 from 
    mytable;

DOC_CNT   POSNR_CNT   MAT_CNT   COUNT(*)
999,990   66          999,999   49,504,784

It also gives us an idea of how many documents we now have to compare with each other.

Apple’s market cap size

Comparing all 999,990 documents with all the other 999,989 documents makes 999,979,000,110 comparisons, or nearly one trillion, as WolframAlpha explains quite nicely:

WolframAlpha showing the result of multiplying 999,990 by 999,989 = 999,979,000,110.
Neat Number Line feature! Nearly a trillion.
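Strictly speaking, WolframAlpha is optional here – the comparison count is just n·(n−1):

```python
# Each of the n documents gets compared against every other document,
# so the number of (ordered) document pairs is n * (n - 1).
n_docs = 999_990
comparisons = n_docs * (n_docs - 1)
print(comparisons)  # 999979000110 – just shy of one trillion
```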

Let the testing begin

While this is not a realistic data set in terms of data distribution, it looks like we now have something to keep HANA busy.

Step 1: run the original solution

First off, we just run the SQL I posted as the solution in my previous post:

with doc_elements 
(document, material, material_cnt)  
as  (select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    )  
, matched_materials 
(document_a, material_a, material_b_cnt, document_b, match_cnt)  
as  (select
         side_a.document as document_a
       , side_a.material as material_a
       , side_b.material_cnt as material_b_cnt
       , side_b.document as document_b
       , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
    from 
                        doc_elements side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
     where 
            side_b.document IS NOT NULL
    )      
select distinct
    document_a
  , document_b
  , material_b_cnt
  , match_cnt
  , round((100/material_b_cnt)*match_cnt, 2) as match_pct
from 
    matched_materials
order by
    document_a
  , document_b;
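To make the mechanics of this statement easy to check without a 49.5-million-row table, here is the same window-function logic replayed on a handful of made-up rows, using Python's bundled sqlite3 (SQLite ≥ 3.25 for window functions; an inner join replaces the left outer join + IS NOT NULL combination, and 100.0 forces the decimal division that HANA performs implicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (document TEXT, posnr TEXT, material TEXT)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("A", "1", "M1"), ("A", "2", "M2"), ("A", "3", "M3"),   # A: 3 materials
     ("B", "1", "M1"), ("B", "2", "M2"),                     # B: 2 materials
     ("C", "1", "M9")],                                      # C: no overlap
)

query = """
WITH doc_elements AS (
    SELECT DISTINCT document, material,
           COUNT(*) OVER (PARTITION BY document) AS material_cnt
    FROM mytable
),
matched_materials AS (
    SELECT side_a.document     AS document_a,
           side_b.material_cnt AS material_b_cnt,
           side_b.document     AS document_b,
           COUNT(*) OVER (PARTITION BY side_a.document, side_b.document)
                               AS match_cnt
    FROM doc_elements side_a
    JOIN doc_elements side_b
      ON  side_a.material  = side_b.material
      AND side_a.document != side_b.document
)
SELECT DISTINCT document_a, document_b, material_b_cnt, match_cnt,
       ROUND(100.0 / material_b_cnt * match_cnt, 2) AS match_pct
FROM matched_materials
ORDER BY document_a, document_b
"""
for row in con.execute(query):
    print(row)
# ('A', 'B', 2, 2, 100.0)  -> both of B's 2 materials appear in A
# ('B', 'A', 3, 2, 66.67)  -> 2 of A's 3 materials appear in B
```

Document C, sharing no material with anyone, drops out of the result entirely, just like the uninteresting documents in the full-size run.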

Had I thought a bit about the data volume beforehand, it would have been obvious that this might not work out very well.
And indeed, after a couple of minutes I got the following error:

Could not execute 'with doc_elements (document, material, material_cnt) as (select distinct document , material , ...' in 6:26.459 minutes . 
SAP DBTech JDBC: [4]: cannot allocate enough memory: please check traces for further information 

Step 2: Change to a more sensible query

Ok, so we cannot compute all document comparisons in one go and we probably don’t want to anyhow.

How about we ask for documents that match one reference document instead? That reduces the size of the problem to 999,989 documents that need to be compared (plus, of course, their respective line-items).

In addition to that, we limit the returned records to those where at least 3% of the line-items match. Otherwise, we would just get a lot of uninteresting, barely matching documents.

with doc_elements 
(document, material, material_cnt)  
as  (select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    )  
, matched_materials 
(document_a, material, material_b_cnt, document_b, match_cnt)  
as  (select
         side_a.document as document_a
       , side_a.material
       , side_b.material_cnt as material_b_cnt
       , side_b.document as document_b
       , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
    from 
                        doc_elements side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
     where 
            side_b.document IS NOT NULL
    )      
select distinct
    document_a
  , document_b
  , material_b_cnt
  , match_cnt
  , round((100/material_b_cnt)*match_cnt, 2) as match_pct
from 
    matched_materials 
where 
  round((100/material_b_cnt)*match_cnt, 2) >= 3
and document_a = '1100'
order by
    document_a
  , document_b;

----
Statement 'with doc_elements (document, material, material_cnt) as (select distinct document , material , ...' 
successfully executed in 1:32.720 minutes
(server processing time: 1:32.721 minutes)
----

DOCUMENT_A	DOCUMENT_B	MATERIAL_B_CNT	MATCH_CNT	MATCH_PCT
1100      	3020570   	51            	2        	3.92     
1100      	4802470   	51            	2        	3.92     
1100      	5223960   	32            	1        	3.13     
1100      	5470720   	52            	2        	3.85     
1100      	6030740   	47            	2        	4.26     

A minute and a half is not a bad response time for such a query, but it is light-years away from anything one would want to put into an “interactive” usage scenario.

Step 3: look for options to improve performance

The first thing to look for when trying to improve query performance is where the time is spent during processing.

With SAP HANA the PlanViz trace is nearly always the way to go to learn about statement execution. While the XSA WebIDE provides some features to work with PlanViz traces, I have not yet seen a version that is half as usable as the Eclipse-based tool.
So for now, I stick with the “classic” PlanViz tool.

This is what we get from the SQL from Step 2:

SAP HANA PlanViz graph from the first query. Table access without filter is shown.
PlanViz showing that the one filter condition we have is applied late.

The graph above shows how the data “flows” starting on the bottom left side up to the red “1” and continuing at the red “2” on the bottom right.

What we see is that the one filter we used (document_a = '1100') is not used directly on the table. Instead, all the table rows are processed upstream and the filter is only applied on top of an internal table (ITab) much later.
As mentioned before, this version of the statement took about 90 secs (1:30 min.) and the PlanViz Overview-tab tells us that it allocated 22.3 GB during the execution.

No filter push-down? Why?

To understand why the filter was not “pushed down”, we need to go back to the SQL:

with doc_elements 
as  (select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable  )  
, matched_materials
as  (select
    [...]
    from 
                        doc_elements side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
     [...]  )      

select  
 [...]
from 
    matched_materials 
where 
   [...]
    document_a = '1100'
order by ...

The filter is defined on matched_materials.document_a, which is where we made the distinction between side_a and side_b of the comparison. But both sides are projections of the very same doc_elements derived table.
This one, in turn, calculates the number of materials per document in the window aggregate function. The result of this computation needs to be stored somewhere: that is what the internal table is for.

Now, let’s fix this

Can we improve this? Well, if we could at least filter document_a early, the filter could work on the table structure instead of the internal table, and the statement would not have to push the 49.5 million record IDs upstream twice.

An easy option, thanks to the way the SQL is structured in the first place, is to make up a new derived table that allows HANA to push the filter down to the table without affecting the data for side_b.

with doc_elements 
(document, material, material_cnt)  
as  (select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    )
-- separate derived table for side_a 
-- to allow filter push-down  
, doc_elements_a 
(document, material, material_cnt)  
as  (select distinct
          document
        , material
        , count(*) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    ) 
-- 
, matched_materials 
(document_a, material, material_b_cnt, document_b, match_cnt)  
as  (select
         side_a.document as document_a
       , side_a.material
       , side_b.material_cnt as material_b_cnt
       , side_b.document as document_b
       , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
    from 
                       doc_elements_a side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
     where 
            side_b.document IS NOT NULL
    )      
select distinct
    document_a
  , document_b
  , material_b_cnt
  , match_cnt
  , round((100/material_b_cnt)*match_cnt, 2) as match_pct
from 
    matched_materials
where
  round((100/material_b_cnt)*match_cnt, 2) >= 3
and document_a = '1100'
order by
    document_a
  , document_b;

All we needed to do was copy the doc_elements table expression and reference the new doc_elements_a for side_a of the join.

Time for a test run

According to PlanViz, this version of the statement took only 44.1 seconds – about half of the time we needed before. The allocated memory this time around was only 9 GB, less than half of the first version.

Not sure about you, but I think this is a great improvement.

SAP HANA PlanViz graph from the revised query. Table access with filter and reduced data volume upstream is shown.
HANA was even able to push down the inverse filter, due to the join conditions and the NOT NULL constraints on the table.

The PlanViz graph above tells the story clearly: now the filter is applied for side_a, and even the inverted filter is used for the access on side_b. HANA can figure this out because of the join condition and the NOT NULL constraints on the columns, so no weird three-valued logic is required here.
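That three-valued-logic point is easy to demonstrate outside of HANA, too. In this sketch (plain sqlite3, made-up values), a NULL document is caught neither by doc != '1100' nor by its inverse – which is exactly why the NOT NULL constraints matter for safely inverting the filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# doc != '1100' evaluates to NULL (not false!) for a NULL value,
# so a WHERE clause would drop that row from the filter AND its inverse.
rows = con.execute("""
    SELECT doc, doc != '1100' AS a_side, NOT (doc != '1100') AS b_side
    FROM (SELECT '1100' AS doc UNION ALL SELECT '2200' UNION ALL SELECT NULL)
""").fetchall()
for row in rows:
    print(row)
# ('1100', 0, 1) and ('2200', 1, 0) are complementary as expected,
# but the NULL row comes back as (None, None, None).
```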

Note how little data is now transferred upstream in the processing and how the HASH JOIN no longer requires a filter step on top of an internal table.

Step 4: Further options?

We’ve already cut the execution time and memory requirement in half. What else could we do?
Usual suspects for ideas would probably include things like “Creating an Index” or “Setting up partitioning”.

But just trying things out usually is not a good way forward.

In this case, the lion’s share of the runtime (in the revised statement) is not spent accessing the data in the table – and that is what indexing or partitioning would (maybe) have addressed.
Instead, the time is spent grouping and calculating the number of materials and matching materials on top of the HASH JOIN result.

On my 4-core test system, this step is already running with 4 parallel threads, which means that there is no easy way left to speed this up.
Unless we find an alternative algorithm/SQL statement to reduce the amount of data that needs to be compared further, that’s about as fast as this computation will go on this machine, off of this table structure.

And this is where it ends

For an interactive application, one might consider transforming the data or precomputing some of the results. But as the OP’s original question was only to get something better than a cursor-based SQLScript solution, I think we are done here.

And this is how you improve the performance of SQL by copy&paste 😉

There you go, now you know.
