In a recent blog post SAP presented several ways to load data into a HANA instance. The main approach shown was to use a combination of Python, SQLAlchemy, and Pandas (a Python data analysis library) as well as the Python-specific version of the SAP HANA client driver.
Nine reading minutes and some reasonably straight-forward Python code later, the demo data file chosen (from Kaggle, a website that hosts free data sets for data science and machine learning) could be loaded into HANA.
The author listed the time for the data load as around 25 minutes, which seemed like a long time for me.
Mind you, the file in question contains 22.489.348 records, which translates into a data loading speed of around 15.000 records/second.
Isn’t that a bit complicated?
Now, I thought this can be improved upon, but at the same time, I prefer easy and convenient over setting up a whole Python environment with dependencies, etc.
Like other bloggers in the SAP HANA space (e.g. here and here) I have grown fond of DBeaver, a great free database tool, which happens to have a very convenient data import tool.
Another big plus of this tool is that it is Eclipse/JAVA based, which means it works the same on macOS (which is what I use), Windows, and Linux.
Let’s walk through this, shall we?
Find the “Import Data” in the context menu
Choose what to import
There’s not too much to do for us here, just continue by clicking on “Next >”.
Pick the source data file
We select the source file via a standard file dialogue box and get the following Tables Mapping window:
This window is rather central to the whole data import process. Here we can specify which fields from the source file are mapped to what columns of the target table.
Speaking of which, this window also allows us to define the target table (or select an existing one). DBeaver conveniently generates “safe” default target column types (really, those are just huge character columns that we should adjust if we know more about the data).
Unlike the tools that SAP provides for dealing with HANA databases (HANA Studio & WEB IDE) DBeaver’s functionality was designed by somebody that actually wants to work with it. So, it’s not surprising that sensible features like checking the generated DDL for the target table are just a click away (button “Schema …” in the tables mapping window). Here we could make further adjustments if we wanted to.
Summary of our setup
Once we finished the mapping definitions, we get an overview window. Here we can use “Save task” to save this definition for later reuse.
Ever wanted to reuse the same mapping rules on multiple files and got surprised that HANA Studio just “forgot” how you did it the last time? Then this feature is for you.
“Make it so!” – start the data import
Now, about 5 dialog windows into it, we can finally kick off the actual data import. The final settings allow to truncate the target table before loading the data (or not) and to set after how many records a COMMIT will be sent to the database.
I chose to truncate before loading and committing after 500.000 rows.
And that’s it
After around 12 mins. the import was finished and the data loaded into my HANA Express VM instance. With no coding, minimal typing, and no setting up a new environment.
I’ve got no idea whether the improved speed (2x as fast as in the SAP blog post) was due to my setup of using a local VM vs. the instance in the SAP blog post was on HANA service on SAP Cloud Platform or due to something else. But unless I need to absolutely automate the data loading I prefer the easy UI option.
There you go, now you know!
thank you for the well described article. Next time I need to import a few million records I will compare universal DBeaver tools with HANA Studio and HANA Cockpit.
I noticed in the past the “do commit” value matters in SAP HANA database most.
Did you import column store or row store tables?
Best Regards from Switzerland
glad you liked the article.
The frequency of COMMITs does have a big influence on the loading performance, your observation is correct here.
But it’s not the full story.
You may want to check https://www.lbreddemann.org/merging-right-some-tests-with-data-loading-in-hana-from-down-under/ for a more complete picture.
And in my example I used the HANA 2 SP04 default table type (COLUMN store). Practically speaking, whenever you’re dealing with a lot of data that you want to analyse, you don’t want to use ROWSTORE.
Cheers from Melbourne,