Metadata-generated ETL

It is every ETL application developer’s dream to have a look at an input data extract, analyse its data, confirm primary keys, set up data rules, and then take the Generator, click the great green button, and all the required ETL mappings and processes would come into the digital world. All the ETL would get generated automatically out of metadata.

Common ETL development procedure in data warehouses has three main runners who pass on the relay. First, there is the business client; his requirements must be then processed by an analyst who prepares specifcations for the software developer. Then, the developer does the remaining dirty but mostly routine work. Ideally, the relay would only be passed on in one direction; but the reality mostly resembles a yo-yo moving there and back. Nevertheless, this ancient scheme will soon sound the death knell because new tools are appearing – generators which can take up analyst’s work and do what until recently only a capable developer could have done, i.e. create relevant ETL. And they can do it automatically, quickly, without mistakes and for just a fraction of the price and, therefore, to everyone’s happiness.

These tools take off the burden from developer’s shoulders and put it into analyst’s hands – and one crucial job is required from analysts – high-quality metadata creation/documenting, or saying what to do with the data.

The thing is that ETL generators are not human beings who can drop in at their colleague’s desk and if there is something unclear ask the analyst what exactly he meant by saying “entries are growing somewhere, so probably DISTINCT”. They need input source metadata to be correct and punctilious. And out of the three roles, the analyst is the only one who can provide them.

It is well known that nobody is heavily into drawing up the documentation but if it turns out that it can finally serve something greater that just for boring lying upon hard discs but for real ETL generation, analysts will be left with no space for excuses and will start producing documentation which will not only serve as source material for generators but will also guarantee that what is in the end running in data warehouses corresponds to what is kept in the documentation. Impact analyses will then shift from “probably nothing is going to happen” to “I am sure this attribute must come here”.

ETL process generation through automats based on analyst’s output (i.e. metadata) brings many advantages, such as e.g.:

Low-cost development – People with short-sighted vision may object that costs of analyses will rise because analyses will have to be more accurate and satisfy higher quality requirements. Similarly, to buy an ETL generator means extra initial costs. Nevertheless, there will be prominent money savings in a long-term perspective because the software developer’s role will no longer be needed or, more precisely, the developer will not need to be concerned with tedious development of ordinary jobs but will be able to go into specialities fully.

Bulletproof documentation – The solution suggested by the analyst will be implemented precisely. In the future, this will simplify impact analyses and data lineage creation to a great extent. Not to mention the fact that it will be easier to continue work after someone when it is be clear what he has actually created.

Mistake elimination – To err is human. ETL generators may give a faint smile to this and create an endless number of well-working mappings out of one tuned-in template.

Faster development – It is much faster to click a button than write a third email to an developer to hurry up.

Compliance with enterprise standards – generators are based on templates; this assures that new ETLs will correspond to corporate requirements and conventions rather than developer’s momentous mood.

Data structure descriptions are the simplest type of metadata and, although they may be created by the dullest modelling tool, it is surprising how much you can generate out of such information:

Data scrubbing and checking – This includes checking whether data types are correct as well as an opportunity to generate checking of values easily according to regular terms, or perhaps validity of personal numbers or phone numbers, etc.

SCD2 historization – It is interesting that such a complex system as correctly working procedure for SCD2 historization will make do for its running with just knowing the target table structure plus column flags to record data validity.

Recording extracts to stage – Again, these processes look as if copied through carbon paper and therefore generators are an ideal choice.

Data masking, exporting tables to a file system, transferring data to other databases, backup and many other functions.

The target goal and the Holy Grail in one lie in automatic generation of transformation mappings based exclusively on analyst-created metadata. However, first it needs to know the syntax to record rules for the specific ETL generator but once it has swallowed this bitter beginning, it will be rewarded not only by generated ETL process but also faultless documentation and perhaps also scripts required to create ETL dependences.

And, therefore, once all the universe of jobs, structures and processes running in data warehouse has been correctly reposed together in a metadata repository, it will be just up to analysts’ fantasies and ETL generators’ ingenuity what they can do about all that.

Vlastimil Menšík

Senior Consultant, GLOBTECH spol. s r.o.