
SAP HANA middle of the road for Big Data or just average?

When SAP tries to explain HANA as a technology, the buzzwords “Big Data” have always been part of the story.
Commonly, “Big Data with HANA” then gets qualified to mean one thing or another – often to caveat what the platform is supposed to be able to do and which usage scenarios require what sort of special handling.

While HANA can deal with a lot of data SAP customers often have even more data in their systems – there always seems to be more data than what can be neatly dealt with. This is of course the reason for things like “archiving” and “multi-temperature driven” data management.

I guess there are many approaches that might help with understanding why HANA in some situations cannot function the way the user expects; as a user of the platform, I am more interested in how I could still work with my data.

Attention everyone, someone wants to process a lot of data!

A recent SCN community question provided a common example of this. The OP wanted to calculate the median of a very large data set of more than 2 billion records.

The median() function in HANA appears to implement the straightforward approach to computing the value (a SQL sketch of this follows the list):

  1. sort all input values
  2. find the value that lies at exactly the middle of the sorted data set (or calculate an average of the two values surrounding the middle)
  3. return result
  4. profit! (ok, maybe not just yet).
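
To illustrate the idea, this sort-based approach can also be spelled out in plain SQL with window functions. This is just a sketch (it is not how MEDIAN() is implemented internally, and it still sorts the complete data set, so it does not get around the size problem); it assumes a table bn2 with an integer column val:

/* sketch of the sort-based approach: order all values, pick the one(s) in the middle */
SELECT AVG(val) AS sorted_median
FROM
    (SELECT val
          , ROW_NUMBER() OVER (ORDER BY val) AS pos   -- position in the sorted data set
          , COUNT(*)     OVER ()             AS cnt   -- total number of records
     FROM bn2)
WHERE pos BETWEEN cnt / 2.0 AND cnt / 2.0 + 1;        -- the middle position(s)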

This approach requires sorting the data set (it says so in step 1), and that is what HANA tries to do. With this many records, however, it simply cannot. Instead, the OP received an error message:

column store error: search table error: [2617] executor:
plan operation execution failed with an exception … exceeded limit 2167466761.

or (the error message I got)

Could not execute 'select median (val) from bn2' in 69 ms 870 µs .
[129]: transaction rolled back by an internal error:
Search result size limit exceeded: 2250000000

The cause for the error is that the internal memory structures (itabs) that hold intermediate results in HANA cannot keep more than ca. 2 billion entries. This is similar to the general restriction of column store tables, but with those, one can use partitioning to get around it. The itabs cannot be partitioned, so the limit stands.

But what if I really, really want the result?

Now, one can go and accept that HANA is just not the right tool for this kind of data processing and move to some other technology. Alternatively, one can go back to the time when median() was a fancy feature that not all SQL databases provided.
In his very good book “SQL for Smarties”, Joe Celko has a whole chapter on how to compute the median of a data set without using the median() function.
I very much recommend both the book and the chapter (it is available online here).

One of the techniques Celko describes is called “Median with Characteristic Function” and works by cross joining the table to itself.

By checking pairwise how many values on one side of the join are smaller than (or equal to) the value on the other side, and comparing that count to half of the total number of records, this approach filters the data down to the “middle value” without sorting or a huge intermediate result table.

The statement for this looks like this (val is the column we want the median for):

SELECT
    AVG(DISTINCT val) AS DIY_MEDIAN
FROM
    (SELECT P1.val
     FROM bn2 AS P1
          CROSS JOIN bn2 AS P2
     GROUP BY P1.val
     HAVING SUM(CASE
                    WHEN P2.val <= P1.val
                    THEN TO_BIGINT(1)
                    ELSE TO_BIGINT(0)
                END) >= (COUNT(*) / 2.0)
        AND SUM(CASE
                    WHEN P2.val >= P1.val
                    THEN TO_BIGINT(1)
                    ELSE TO_BIGINT(0)
                END) >= (COUNT(*) / 2.0)
    );

The data type conversions (TO_BIGINT) are there to ensure that the SUMs and the COUNT(*)s can grow large enough.

Running this statement on a partitioned table with 2,250,000,000 records takes a few seconds, but at least provides the correct result.
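
If you want to convince yourself that the pattern really returns the median, it can be checked on a tiny made-up data set first. Here is a sketch with five hypothetical values (10, 20, 30, 40, 50), whose median is obviously 30; the TO_BIGINT casts are left out since overflow is not a concern at this size:

/* sanity check of the characteristic-function median on five made-up values */
WITH tiny AS (          SELECT 10 AS val FROM dummy
              UNION ALL SELECT 20        FROM dummy
              UNION ALL SELECT 30        FROM dummy
              UNION ALL SELECT 40        FROM dummy
              UNION ALL SELECT 50        FROM dummy)
SELECT
    AVG(DISTINCT val) AS DIY_MEDIAN    -- expected result: 30
FROM
    (SELECT P1.val
     FROM tiny AS P1
          CROSS JOIN tiny AS P2
     GROUP BY P1.val
     HAVING SUM(CASE WHEN P2.val <= P1.val THEN 1 ELSE 0 END) >= (COUNT(*) / 2.0)
        AND SUM(CASE WHEN P2.val >= P1.val THEN 1 ELSE 0 END) >= (COUNT(*) / 2.0)
    );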

For fellow SQL DIYers

In case you want to try this yourself but do not happen to have such a huge table ready, or only have a small test HANA instance (like yours truly – I still use my trusty NUC… ), here are the commands that I used:

select * from m_database;
/*
SYSTEM_ID	DATABASE_NAME	HOST   	START_TIME                  	VERSION               	USAGE
HXE      	HXE          	hxehost	14 Aug 2022, 10:24:44.019 am	2.00.057.00.1629894416	DEVELOPMENT
*/
/* smallest possible example - just a single int column and 15 different values */
drop table bn2;
create column table bn2
	(val int not null);
alter table bn2 partition by  range (val)
((partition value = 1
, partition value = 2
, partition value = 3
, partition value = 4
, partition value = 5
, partition value = 6
, partition value = 7
, partition value = 8
, partition value = 9
, partition value = 10
, partition value = 11
, partition value = 12
, partition value = 13
, partition value = 14
, partition value = 15));
/* we want to load a LOT of data - let's trigger the merging when it suits us */
alter table bn2 disable automerge;
/* insert priming data - this is to allow for compression optimisation
   and to have the dictionary value for each value defined when we insert the bulk of the data */
insert into bn2 partition (1) 
(select  1 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (2) 
(select  2 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (3) 
(select  3 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (4) 
(select  4 from series_generate_integer (1, 0, 10000000));
insert into bn2 partition (5) 
(select  5 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (6) 
(select  6 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (7) 
(select  7 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (8) 
(select  8 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (9) 
(select  9 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (10) 
(select 10  from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (11) 
(select 11 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (12) 
(select  12 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (13) 
(select  13 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (14) 
(select  14 from series_generate_integer (1, 0, 10000000) );
insert into bn2 partition (15) 
(select  15 from series_generate_integer (1, 0, 10000000));
/* do delta merge and compression optimisation to have optimal compression
 prefixed -> only single value stored per partition
 */
merge delta of bn2;
update bn2 with parameters ('optimize_compression'='force');
/* 
Checking the storage parameters for the table gives us this:
Total Memory Consumption (KB): 28,134
Number of Entries: 150,000,000
Size on Disk (KB): 9,480
Memory Consumption in Main Storage (KB): 27,659
Memory Consumption in Delta Storage (KB): 272
Estimated Maximum Memory Consumption (KB): 29,301
Details for Table
  Column Name | Part ID | Record Count | Distinct Records | Loaded | Compression Type
  VAL         | 1       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 2       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 3       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 4       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 5       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 6       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 7       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 8       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 9       | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 10      | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 11      | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 12      | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 13      | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 14      | 10,000,000   | 1                | TRUE   | PREFIXED
  VAL         | 15      | 10,000,000   | 1                | TRUE   | PREFIXED
*/
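
The storage figures above are taken from the table’s runtime details; roughly the same numbers can be pulled from the monitoring views. A sketch (it assumes the table was created in the current user’s schema):

/* per-partition size and record counts of the table */
select part_id, record_count
     , memory_size_in_total, memory_size_in_main, memory_size_in_delta
from m_cs_tables
where schema_name = current_schema
  and table_name  = 'BN2'
order by part_id;
/* per-partition record count, distinct values and compression type of the VAL column */
select part_id, "COUNT" as record_count, distinct_count, loaded, compression_type
from m_cs_columns
where schema_name = current_schema
  and table_name  = 'BN2'
  and column_name = 'VAL'
order by part_id;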

At this point MEDIAN still works

select median(val) from bn2;
/*
MEDIAN(VAL)
8
*/
select
	(SUM(CASE
			WHEN val <= 8 THEN 1
      		ELSE 0
      	 END)) as "smaller than 8"
, 	(SUM(CASE
			WHEN val > 8 THEN 1
      		ELSE 0
      	 END)) as "greater than 8"
, median (val) median
, count(*)
from bn2;

/*
smaller than 8	greater than 8	MEDIAN	COUNT(*)
80,000,000      70,000,000      8       150,000,000
*/

Now blow up the data set to > 2 bn

How often do we need to run those INSERT statements?

2,000,000,000 / 150,000,000 => 13.333…

Let’s do 14 rounds of inserts and merges!

As we scale the data up by the same factor for all values, the median stays the same!

do
begin
	declare I integer;
	for I in 1..14 do
		insert into bn2 partition (1) 
             (select  1 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (2) 
             (select  2 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (3)
             (select  3 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (4)
             (select  4 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (5) 
             (select  5 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (6) 
             (select  6 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (7) 
             (select  7 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (8) 
             (select  8 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (9) 
             (select  9 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (10) 
             (select 10 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (11) 
             (select 11 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (12) 
             (select 12 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (13) 
             (select 13 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (14) 
             (select 14 from series_generate_integer (1, 0, 10000000) );
		insert into bn2 partition (15) 
             (select 15 from series_generate_integer (1, 0, 10000000) );
		
        commit;
		merge delta of bn2;
	end for;
end;

select
	(SUM(CASE
			WHEN val <= 8 THEN 1
      		ELSE 0
      	 END)) as "smaller than 8"
, 	(SUM(CASE
			WHEN val > 8 THEN 1
      		ELSE 0
      	 END)) as "greater than 8"
-- , median (val) median
, count(*)
from bn2;
/*
smaller than 8	greater than 8	COUNT(*)
1,200,000,000 	1,050,000,000 	2,250,000,000
*/

The data is IN and the median does NOT work anymore:

select median (val) from bn2;
/*
Could not execute 'select median (val) from bn2' in 69 ms 870 µs .
[129]: transaction rolled back by an internal error: Search result size limit exceeded: 2250000000
OPERATOR_NAME     	OPERATOR_DETAILS             EXECUTION_ENGINE	TABLE_NAME TABLE_SIZE   	OUTPUT_SIZE
ROW SEARCH        	MEDIAN(BN2.VAL)              ROW             	?          ?            	1
  AGGREGATION     	AGGREGATION: MEDIAN(BN2.VAL) ROW             	?          ?            	1
    COLUMN SEARCH 	BN2.VAL                      COLUMN          	?          2,250,000,000	2,250,000,000
      COLUMN TABLE	                             COLUMN          	BN2        2,250,000,000	2,250,000,000
-> here 2,250,000,000 records should be fed into MEDIAN()
*/
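
The plan output shown here (and for the DIY statement below) has the same column layout as EXPLAIN_PLAN_TABLE; to capture something comparable on your own system, EXPLAIN PLAN can be used. A sketch (the statement name 'median_check' is just a made-up label):

/* record the plan without executing the statement */
explain plan set statement_name = 'median_check' for
select median(val) from bn2;
/* read it back in the same column layout as shown above */
select operator_name, operator_details, execution_engine
     , table_name, table_size, output_size
from explain_plan_table
where statement_name = 'median_check'
order by operator_id;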

Using the DIY median

SELECT
    AVG(DISTINCT val) AS DIY_MEDIAN
FROM
    (SELECT P1.val
     FROM bn2 AS P1
          CROSS JOIN bn2 AS P2
     GROUP BY P1.val
     HAVING SUM(CASE
                    WHEN P2.val <= P1.val
                    THEN TO_BIGINT(1)
                    ELSE TO_BIGINT(0)
                END) >= (COUNT(*) / 2.0)
        AND SUM(CASE
                    WHEN P2.val >= P1.val
                    THEN TO_BIGINT(1)
                    ELSE TO_BIGINT(0)
                END) >= (COUNT(*) / 2.0)
    );
/*
DIY_MEDIAN
8
OPERATOR_NAME             	OPERATOR_DETAILS                                                 OPERATOR_PROPERTIES                 TABLE_NAME TABLE_SIZE     OUTPUT_SIZE
ESX SEARCH                	DIY_MEDIAN                                                                                           ?          ?              1
  AGGREGATION             	AGGREGATION: AVG(P1.VAL)                                                                             ?          ?              1
    HAVING                	TO_DECIMAL(SUM(CASE WHEN P2.VAL <= P1.VAL THEN 1 ELSE 0 END
                                        * COUNT(*) * COUNT(*)), 26, 6)
                            >= TO_DECIMAL(SUM(COUNT(*) * COUNT(*))) / 2
                            AND TO_DECIMAL(SUM(CASE WHEN P2.VAL >= P1.VAL THEN 1 ELSE 0 END
                                        * COUNT(*) * COUNT(*)), 26, 6)
                            >= TO_DECIMAL(SUM(COUNT(*) * COUNT(*))) / 2                                                          ?          ?              2.40000000000
      AGGREGATION         	GROUPING: P1.VAL,
                            AGGREGATION: SUM(CASE WHEN P2.VAL <= P1.VAL THEN 1 ELSE 0 END
                                         * COUNT(*) * COUNT(*))
                                        , SUM(COUNT(*) * COUNT(*))
                                        , SUM(CASE WHEN P2.VAL >= P1.VAL THEN 1 ELSE 0 END
                                         * COUNT(*) * COUNT(*))                              ENUM_BY: DOUBLE_PREAGGR_BEFORE_JOIN ?          ?              15
        NESTED LOOP JOIN  	                                                                 ENUM_BY: DOUBLE_PREAGGR_BEFORE_JOIN ?          ?              225
          COLUMN SEARCH   	P1.VAL, COUNT(*)                                                 PARALLELIZED, ENUM_BY: CS_TABLE     ?          2,250,000,000  15
            AGGREGATION   	GROUPING: P1.VAL, AGGREGATION: COUNT(*)                          ENUM_BY: PREAGGR_BEFORE_JOIN        ?          ?              15
              COLUMN TABLE	                                                                                                     BN2        2,250,000,000  2,250,000,000
          COLUMN SEARCH   	P2.VAL, COUNT(*)                                                 PARALLELIZED, ENUM_BY: CS_TABLE     ?          2,250,000,000  15
            AGGREGATION   	GROUPING: P2.VAL, AGGREGATION: COUNT(*)                          ENUM_BY: PREAGGR_BEFORE_JOIN        ?          ?              15
              COLUMN TABLE	                                                                                                     BN2        2,250,000,000  2,250,000,000
*/

Parting thoughts

Would it be great if HANA could handle this kind of BIG DATA automagically? I think so. However, looking at how HANA gets used in the wild, I doubt that there is any appetite to make this happen. It turned out to be much more a database for SAP products than for general (and sometimes big) data processing after all.

There you go; now you know!

Lars
