Write better SQL!

When reading Some opinionated thoughts on SQL databases recently, I felt reminded of an observation that I have made in the past: the way SQL is commonly used and presented is way too low-level.

What I mean by that, is that the SQL code one can find in examples almost always shows some arbitrary and meaningless piece of SQL syntax. This is a bit like the mathematics formulas that turned away generations of students from engaging with maths.

To some, a statement like:

x * (1 + y) = z

might be all they need to implement the logic and be happy.

Personally, I prefer to understand the formula in the context it is used in.
This could be a VAT calculation:

net_price * (1 + vat_rate) = gross_price

Or it could be a bonus calculation:

base_salary * (1 + bonus_percentage) = total_salary

The structure of the formula stays the same, but its meaning is completely different.

Unfortunate habits

Some developers have a talent for taking business processes and requirements and turning those into executable code. Unfortunately, translating in the reverse direction – taking code and putting it into a context where the code is used – does seem to be a skill that is even rarer.

This becomes especially problematic if the software is not some purpose-built solution that addresses a specific need. In that case, the purpose and the need can provide a lot of the context by default. But for platform software, like databases or application servers this benefit does not exist.
If you think “He is probably referring to SAP HANA or the SAP ABAP application server“, you are spot on.

I have long lamented the poor quality of SAP’s product documentation and examples and even though I know that there are many talented and well-meaning individuals behind what help.sap.com is today, in my opinion, it is still a far cry from good documentation.

An example

In SAP Tech Bytes: SAP HANA Mass Input with Duplicates Thomas Jung, a longtime HANA developer advocate, demonstrates how a certain common difficulty with loading data can be solved by writing JavaScript, HANA CDS, and SQL. The code used for this exercise is publicly available on GitHub, which is really a good thing to do for example code.

Now, the overall solution works; what else is to be expected from Thomas? But there a few things in it, that I believe could be improved to make the example much better.

Let’s see how.

Thomas’ database logic is covered in three statements:

1. INSERT data into a temporary table

INSERT INTO #TEMP_STATS (ORG, REPOSITORY, DATE, VIEWS) VALUES (?, ?, TO_DATE(?, 'MM/DD/YYYY'), ?)

2. DELETE duplicates that are in the temporary table

DELETE FROM #TEMP_STATS WHERE "$rowid$" IN
       ( SELECT ROW_ID FROM (SELECT ROW_NUMBER() OVER (PARTITION BY ORG, REPOSITORY, "DATE") as RN,    
         "$rowid$" as ROW_ID, ORG, REPOSITORY, "DATE" FROM #TEMP_STATS ORDER BY 3, 2, 1)    
        WHERE RN>1)

3. MERGE the remaining data with the target table contents

MERGE INTO STATS AS T1 USING #TEMP_STATS AS T2
        ON T1.ORG = T2.ORG and T1.REPOSITORY = T2.REPOSITORY and T1.DATE = T2.DATE
        WHEN MATCHED THEN UPDATE SET T1.VIEWS = (T1.VIEWS + T2.VIEWS)
        WHEN NOT MATCHED THEN INSERT (ORG, REPOSITORY, DATE, VIEWS) VALUES(T2.ORG, T2.REPOSITORY, T2.DATE, T2.VIEWS)

Rough edges

As mentioned before, the code does what it is supposed to do. No complaint there. But looking at the three SQL commands, do you understand them? Are you clear about what the intention behind each command is and how it achieves this intent?

At the very least, one should get curious about the funny “$rowid$” column that seemingly can be pulled out of thin air in the DELETE statement. Longtime HANA users will be familiar with this internal column and be aware that its use is not supported by SAP and no guarantees are made about what data it contains and how to use it. Nevertheless, the column is often used when duplicates in a table should be “cleaned up”, i.e. one of the otherwise identical records should be deleted.

But how can the de-duplication be done without the “$rowid$” column?” – SQL has a special feature for this: it’s called SELECT DISTINCT.

Instead of changing the contents of the temporary table twice, we can simply tell HANA to only return DISTINCT records in the MERGE statement. No need for some “internal-column-black-magic” here.

Not so fast!“, I hear you say, “the SELECT DISTINCT only works if complete records are identical, but in Thomas’ example only the key-columns have to be considered“.
Right on!
This complication is, in fact, what makes the DELETE statement so hard to read.

Let’s look at the SELECTion part of the DELETE statement again.
But this time with some formatting:

( SELECT ROW_ID FROM 
   (SELECT ROW_NUMBER() 
           OVER (PARTITION BY ORG, REPOSITORY, "DATE") as RN,    
           "$rowid$" as ROW_ID
         , ORG
         , REPOSITORY
         , "DATE" 
    FROM 
          #TEMP_STATS 
    ORDER BY 3, 2, 1)    
WHERE RN > 1)

The outer SELECT simply selects ROW_IDs (the values of the “$rowid$” column) for which the ROW_NUMBER() function (column RN) returned a value larger than 1. These are the records that should be removed because they appear more than once; makes sense.

The inner SELECT is not as obvious, though.
First off, the ORDER BY at the end is superfluous and will be ignored by the outer statement. It has no effect on the execution whatsoever. Sub-selects, like this one, do not have any order, just like tables do not have any order.
In addition, the ill-practice of using column indexes (3, 2, 1) in the ORDER BY clause, is just a bad example. There is no situation when one should do that in a piece of code that gets saved into a file.
If you disagree, quick: tell me what the inner SELECT will be sorted by…

Still going back and forth between the ORDER BY clause and the selected columns? Thanks for proving my point.

Underspecified window clause

Using the ROW_NUMBER() window function is the smart thing to do when one wants to find duplicates on arbitrary candidate keys. But for that to work, we should be explicit about which records we want to call “duplicates”. Without an ORDER BY statement in the ROW_NUMBER() function declaration, we simply cannot tell which record of any duplicate will be assigned number 1, 2, etc.

In order to make the logic clear and unambiguous, this is something we want to specify:

SELECT 
     ROW_NUMBER() OVER 
        (PARTITION BY ORG, REPOSITORY, DDATE
         ORDER BY ORG, REPOSITORY, "DDATE", VIEWS DESC ) as RN
         , ORG
         , REPOSITORY
         , DDATE 
         , VIEWS 
 FROM #TEMP_STATS

The statement above delivers exactly the information we need to pick only the first occurrences of every entry, while being specific about that the records with the smallest VIEWS values will be picked for every key.
And, as we see, there is no “$rowid$” column required for this bit either.

So, how do you use this to DELETE the entries from the temp table?” – really good question. I don’t!

The very next step in the overall process is to MERGE the unique records from the #TEMP_STATS table with the STATS table. There is really no need to delete entries in the temporary table first and then plug the whole table into the MERGE statement.

The big MERGE

Instead, we can tell the database what we want to do in a single statement:

MERGE INTO STATS AS trgt
USING(
        SELECT ORG, REPOSITORY, DDATE, VIEWS 
        FROM (
            SELECT 
                ROW_NUMBER() OVER 
                    (PARTITION BY ORG, REPOSITORY, DDATE
                     ORDER BY ORG, REPOSITORY, "DDATE", VIEWS DESC ) as RN
                     , ORG
                     , REPOSITORY
                     , DDATE 
                     , VIEWS 
            FROM #TEMP_STATS)    
        WHERE RN = 1
        ) AS src
        ON (trgt.ORG, trgt.REPOSITORY, trgt.DDATE) 
         = (src.ORG , src.REPOSITORY , src.DDATE)

WHEN MATCHED THEN 
    UPDATE SET trgt.VIEWS = (trgt.VIEWS + src.VIEWS)

WHEN NOT MATCHED THEN 
    INSERT (ORG    , REPOSITORY    , DDATE    , VIEWS) 
    VALUES (src.ORG, src.REPOSITORY, src.DDATE, src.VIEWS);  

By inserting the de-duplication logic and rewriting the MERGE statement with meaningful table aliases and meaningful formatting, the whole process becomes a lot clearer:

  • table STATS is the target (trgt) table for the MERGE, while the de-duplication sub-query is the source (src).
MERGE INTO STATS AS trgt
USING(
     [...]
        ) AS src
  • the join-/matching condition is written in tuple-style and in subsequent lines on top of each other, making it visually clear which columns are matched:
ON (trgt.ORG, trgt.REPOSITORY, trgt.DDATE) 
 = (src.ORG , src.REPOSITORY , src.DDATE)
  • likewise the MATCHED and NOT MATCHED branches of the MERGE logic are clearly visually separated:
WHEN MATCHED THEN 
    UPDATE SET trgt.VIEWS = (trgt.VIEWS + src.VIEWS)

WHEN NOT MATCHED THEN 
    INSERT (ORG    , REPOSITORY    , DDATE    , VIEWS) 
    VALUES (src.ORG, src.REPOSITORY, src.DDATE, src.VIEWS); 

Looking back

Let’s have look at the original MERGE statement again for comparison:

MERGE INTO STATS AS T1 USING #TEMP_STATS AS T2
        ON T1.ORG = T2.ORG and T1.REPOSITORY = T2.REPOSITORY and T1.DATE = T2.DATE
        WHEN MATCHED THEN UPDATE SET T1.VIEWS = (T1.VIEWS + T2.VIEWS)
        WHEN NOT MATCHED THEN INSERT (ORG, REPOSITORY, DATE, VIEWS) VALUES(T2.ORG, T2.REPOSITORY, T2.DATE, T2.VIEWS)

I’d wager that the re-written statement is a lot easier to make sense of than these unformatted references to T1 and T2.

What gives

What is the point of pointing out potential corrections in this demo code?

The point is showing that SQL itself is not as bad and low-level as it is made out to be. There are many options to make SQL code very contextual and be literal about what the intention for it is. Others have written about this (see e.g. Markus Winand’s Literate SQL) or my session “Humane DB design and programming” from #sitMEL-2020 (also covered here: SAP Inside Track #sitMEL – Full Day Event in Melbourne).

Software vendors are in a rather unique position to be examples for what good code can look like and how their platform products should be used “in the real world”. Virtually every developer will try to write code that fits in with the existing style of the codebase; but if the platform is relatively new, there is no dominant code base to follow. So, the documentation and tutorials are used as examples of how to write code on this platform.
Ever participated in a SAP HANA, BW, or ABAP on HANA project and found yourself looking for example structures and naming conventions?
There is a need for guardrails like this. Showing such good examples in the documentation and tutorials, especially from central community figures, can be a long lever for moving towards better real-life code.

There you go…

… now you know, how I feel about this.

It should have become clear, that the purpose of this post is not to pick on Thomas’ code, but to showcase the opportunity that good examples provide.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.