What is this about?
In his blog post Why are scalar UDFs so slow? Jörg Brandeis wonders why user defined scalar functions (scalar UDFs) are slow compared to plain SQL. This is an important question since UDFs would make for very convenient development objects. SQL could be a lot easier to read and maintain if one could put complex or often reused computations into an UDF and just use this in queries, procedures and BW transformations.
Jörg’s scenario was simple enough: take the conversion of an ABAP DATS field (on DB-level that is a 8-character string in the format YYYYMMDD) to the BW CALMONTH fomat (a 6-character string in the format YYYYMM on DB-level) and create a scalar UDF that performs this conversion.
One possible conversion logic looks like this:
calmonth := TO_NVARCHAR(TO_DATE(:dats_date), 'YYYYMM');
First, the DATS-date information is converted to a SQL date type. Then, the SQL date is converted to a string again, only considering the year (YYYY) and month (MM) information.
This expression can be put directly into a SELECT statement or it can be implemented as a scalar UDF – which are exactly the two cases, Jörg compared with one another.
To his disappointment, the performance of the UDF version was a lot worse than that of the plain SQL expression. Which led to the titular question of his post: why are scalar UDFs so slow?
Answering the question
In the post, Jörg even showed the relevant PlanViz sections for both scenarios, but couldn’t find what the difference between both execution was – besides the much worse runtime for the UDF version.
I recreated a similar scenario on a HANA 2.00.045 (HANA Express Edition) and reproduced the behavior.
In my scenario, a table contains 10 million records with dates in both SQL date- and DATS-format. The dates range from 2019-01-01 to 2020-12-31, covering two years (730 days). Compared to the data volume, there are only few distinct date values.
I use a UDF similar to the one from Jörg’s post:
create or replace function dats_to_calmonth_deterministic (IN DATS_date nvarchar(8)) returns calmonth nvarchar(6) deterministic as begin calmonth := TO_NVARCHAR(TO_DATE(:dats_date), 'YYYYMM'); -- an alternative would be to just keep the 6 left-most characters -- left (:dats_date, 6) -- at the expense that this does not necessarily creates a valid SQL date end;
Then I run two equivalent SELECTs to compare the runtime:
-- plain SQL select top 500 guid , tx_date , TO_NVARCHAR(TO_DATE(tx_dats), 'YYYYMM') as calmonth from data_10m where TO_NVARCHAR(TO_DATE(tx_dats), 'YYYYMM') = '202008' order by guid ; -- planviz reported resource usage -- runtime / memory -- 3.5s / 1.4GB -- 3.4s / 1.4GB -- 3.4s / 1.4GB -- 3.2s / 1.4GB -- 3.2s / 1.4GB
-- deterministic UDF select top 500 guid , tx_date , dats_to_calmonth_deterministic(tx_dats) as calmonth from data_10m where dats_to_calmonth_deterministic(tx_dats) = '202008' order by guid ; -- planviz reported resource usage -- runtime / memory -- 17.9s / 2.5GB -- 17.2s / 2.5GB -- 17.0s / 2.5GB -- 17.3s / 2.5GB -- 17.4S / 2.5GB
Looking at the runtime/memory numbers we see how the UDF version completely stinks compared to the plain SQL.
The plain SQL statement uses more than 1 GB less memory and is about 5. 2 times faster; to produce the very same result.
What determines this performance?
Looking at the PlanViz we do find differences. First the plain SQL version:
And the UDF version:
Notable here is what the JECalculate plan operator (POP) is doing.
In the plain SQL version, we see the whole expression as the calculation to be performed.
In the UDF version, we see an evali-call referencing the UDF.
HANA clearly did not unroll the expression from the function in this case. The query plan handles the UDF as a black box – but why?
Let’s look at yet another version of the function:
create or replace function dats_to_calmonth (IN DATS_date nvarchar(8)) returns calmonth nvarchar(6) as begin calmonth := TO_NVARCHAR(TO_DATE(:dats_date), 'YYYYMM'); -- an alternative would be to just keep the 6 left-most characters -- left (:dats_date, 6) -- at the expense that this does not necessarily creates a valid SQL date end; -- standard UDF select top 500 guid , tx_date , dats_to_calmonth(tx_dats) as calmonth from data_10m where dats_to_calmonth(tx_dats) = N'202008' order by guid ; -- planviz reported resource usage -- runtime / memory -- 3.3s / 637MB -- 3.1s / 637MB -- 3.0s / 637MB -- 3.0s / 637MB -- 3.0S / 637MB
Now, how is this? This UDF version is slightly faster and uses a lot less memory than the plain SQL version.
What does the PlanViz for this version look like?
While this plan looks very different than the other two, we see that the expression from the UDF now is part of the POP (filter-pop this time). I’ll explain later why the POPs in this plan have different colours and names. For now, let’s focus on this “expression unfolding”.
Did you spot the important difference?
The two UDF versions are identical except for the DETERMINISTIC keyword. The HANA documentation explains that marking functions as DETERMINISTIC allows for additional result caching. And this sounds a lot like a potential turbo button (if you remember those, you’ve been computering a long time!)
Unfortunately, this feature does come with severe implications when it comes to query optimisation. To make the whole result-cache-feature working, DETERMINISTIC UDFs cannot be executed in the main SQL processing POPs. Instead, they need to run in their own little environment where they can fill the cache with new results and look up already existing results. Turns out, operating a cache comes with considerable efforts – even if the cache is only valid for the duration of the query as it is the case with DETERMINISTIC UDFs.
Looking at the filtering part of the deterministic UDF we find this oddly partitioned approach of Calculation and Dynamic Search. And those partitions are executed in sequence which leads to the horrific runtime.
What to do now?
I would say: don’t use the DETERMINISTIC keyword.
Truth be told, I don’t know any documentation about this feature beyond the short paragraph in the HANA documentation. So, I don’t know what scenarios would really benefit from it.
Even worse: I have not been able to construct a DETERMINISTIC UDF that was quicker or more efficient than the “standard” version.
I’m pretty sure that there would be scenarios, where this feature works a treat; but so far SAP has managed to keep the specific conditions for that to themselves.
And what was that about the colours in the PlanViz?
The different colours and names for the POPs in the “standard” UDF are the indicator for a ROW engine execution.
Looking at the EXPLAIN PLAN confirms this:
OPERATOR_NAME OPERATOR_DETAILS EXECUTION_ENGINE ROW SEARCH DATA_10M.GUID, DATA_10M.TX_DATE, __validation__(TO_NVARCHAR(TO_DATE(__validation__ (TO_NVARCHAR(DATA_10M.TX_DATS), 8)), 'YYYYMM'), 6) ROW LIMIT NUM RECORDS: 500 ROW ORDER BY DATA_10M.GUID ASC ROW FILTER __validation__ (TO_NVARCHAR(TO_DATE(__validation__ (TO_NVARCHAR(DATA_10M.TX_DATS), 8)), 'YYYYMM'), 6) = '202008' ROW COLUMN SEARCH DATA_10M.GUID, DATA_10M.TX_DATE, DATA_10M.TX_DATS COLUMN COLUMN TABLE COLUMN
The EXECUTION_ENGINE column shows ROW for all operations after COLUMN SEARCH (read from the bottom up).
As far as I can tell, this is due to the __validation__ function that appeared with HANA 2 SPS 03. There are some SAP notes about performance issues with table UDFs but none about scalar UDFs.
Also, the __validation__ function does not seem to be injected into the execution plan, when the input or output data types of the function is not a character string type. That is to say: when I changed the function to return INTEGER instead of VARCHAR, the __validation__ function did not occur and the query was executed in the COLUMN ENGINE.
And that’s all for today.
There you go, now you know!