Friday, April 3, 2015

Safari Haathi ke Sang..

Lately I have often been asked whether Data Integration platforms have a future in the world of BigData. Many developers who work with traditional ETL tools such as Informatica are greatly concerned about BigData tools and their rapid rise. To add to it, many BigData consultants have started dismissing the strengths of ETL tools as things of the past. I have been pondering this for some time, analyzing the complex enterprise projects I have worked on and consulted for. In this post I share why they all need to coexist in a larger ecosystem.

I have been through hand coding versus ETL tool battles at multiple places. There is clear evidence of each technique coming to the rescue in enterprises. Data Integration tool leaders such as Informatica attack the problem along multiple dimensions:
1. Graphical development environments
2. Ease of debugging
3. Clustering (Grid) and recovery capability
4. Operational statistic logging and monitoring
5. Data Quality management
6. Metadata management
7. Universal Connectivity

Remember, every aspect here can be hand coded by expert Java (or any other) programmers and optimized to peak performance. Then why Informatica (or any such tool)?!

I consider that Data Integration is not the same as Data Processing. Data Integration involves Extraction, Transformation and Loading. Data Processing, on the other hand, focuses on intensive programming to optimize recursive looping, work with large data collections and apply some transformation. This type of processing is seldom viable in Informatica. To give an example, on a weekend I wanted to process around 1000 CSV files, up to 3GB in total, containing solar weather predictions and actual observations of propagation conditions for my personal research. Because the job was one-time and the parsing recursive in nature, I chose 200 lines of Java code over Informatica Express edition. It took 4 hours to write the code and then additional hours to download the files from the internet sites of research organizations and process them (check my Tableau Public visualization of this work, Geo Wise - Hrly Prop Analysis). Yes, this could also have been done using Informatica with wrapper scripts. However, it was unnecessary for my purpose. As Architects, we need to make such smart choices all the time. Typical scenarios where Data Processing has an edge are pattern detection based on historical values, recommendations, statistical model processing etc.
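A one-off job of this kind can be sketched in a few lines. My original was written in Java and is not reproduced here; the Python sketch below is only an illustration, and the column names (band, predicted, observed) and the folder layout are hypothetical, not taken from the actual data set.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_csv_folder(folder):
    """Average absolute prediction error per band across every CSV file in a folder.

    Assumption (hypothetical): each file has the columns
    'band', 'predicted' and 'observed'.
    """
    totals = defaultdict(float)  # sum of |predicted - observed| per band
    counts = defaultdict(int)    # number of rows seen per band

    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                error = abs(float(row["predicted"]) - float(row["observed"]))
                totals[row["band"]] += error
                counts[row["band"]] += 1

    return {band: totals[band] / counts[band] for band in totals}
```

For a throwaway analysis like this, a single function that streams the files row by row is usually enough; nothing here needs scheduling, recovery or metadata management, which is exactly why a DI tool would have been overkill.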

The problem is that many consider BigData technologies a simple replacement for DI tools such as Informatica, and then start looking for other tools to extract and transform the data. Tools such as Informatica have been working on pushdown optimization for several years (yes, even before BigData came into the mainstream). Informatica gave developers the ability to run data intensive transformations (of course, mostly not recursive in nature) inside database platforms. In the last couple of years, pushdown to Hadoop for transformation has gained a lot of traction. Informatica does this cleverly by turning transformations into a series of Hive queries on Hadoop. "It will not cut it", "it has a lot of limitations" and so on are some of the noise I hear from time to time about this technique. The question I would like to ask is whether you have used it for the purpose pushdown is meant for. With the advent of fast streaming technologies from Informatica such as Vibe Data Stream (VDS) and HParser, a variety of complex, high volume data can now be processed at extremely fast rates. Everything has its place, and you have to know its positioning. The question to ask is how smartly you take advantage of them to get to the future!

For me the Elephant is a friend I can use for a smooth ride through the data jungle. My motivational theme for some time is going to be "Safari Haathi ke Sang" (a safari with the elephant).

Have a wonderful ride!

Wednesday, August 20, 2014

Informatica Grid options

One of the most frequent questions I receive as an Informatica Architect is when to go for grid. Will it not be sufficient to add more CPU and RAM to the same server instead? High Availability by automatic fail-over is an inherent property of grid, isn't it?

In this post I would like to shed light on what we are trying to achieve and which Informatica option is required for specific needs.

Before we get into the grid, let's go back to one of the early basics on session/CPU core computation. Going by many Informatica articles and actual usage, I have observed that a session needs around 1.2 dedicated CPU cores to run, i.e. 4 CPU cores can in theory support up to 3 sessions in parallel. What happens when more sessions are invoked on the same CPU configuration? CPU time slicing starts to happen, forcing some sessions to wait or slow down. Since most ETL sessions are more memory intensive than CPU intensive by the nature of their data movement, this session/CPU requirement comes mostly from the multithreaded architecture of PowerCenter.
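The arithmetic behind this rule of thumb can be written down as a small sketch. The 1.2 cores per session figure is the observation quoted above, not an official sizing number, and the function names are made up for illustration.

```python
import math
from fractions import Fraction

# Rule of thumb from the post: about 1.2 CPU cores per session.
# Stored as an exact fraction to avoid floating point rounding surprises.
CORES_PER_SESSION = Fraction(6, 5)


def max_parallel_sessions(cpu_cores, cores_per_session=CORES_PER_SESSION):
    """Sessions that can run in parallel before CPU time slicing kicks in."""
    return math.floor(cpu_cores / cores_per_session)


def cores_needed(sessions, cores_per_session=CORES_PER_SESSION):
    """CPU cores required to run the given number of sessions in parallel."""
    return math.ceil(sessions * cores_per_session)
```

With these definitions, max_parallel_sessions(4) gives 3, matching the example above, and cores_needed(5) gives 6.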

So, going by this computation, running more sessions in parallel requires more CPU cores. A large enterprise may well opt for high end multi core CPU monsters. However, scale that with cost in mind and one hits a barrier soon. Adding commodity servers on demand and expanding as demand grows becomes the smarter choice. Informatica Grid is such an option: you start with a minimum of 2 servers as individual nodes and grow the farm as demand increases.
One realizes some benefits as soon as one gets onto this configuration:
1. Smaller servers
2. Elasticity
3. Colocation for data
4. Specialized zones
5. Almost zero downtime during patch upgrades
6. Workflow level distributed computing
The following diagram shows the bare minimum architecture of a grid:

Gateway node
Worker node (Backup node)
Shared storage

Although grid provides the above benefits, it does not automatically perform failover recovery for sessions. It only enables node level failover. This is one of the misconceptions most people have: the Grid option by itself does not provide this High Availability feature. To have HA on Grid, the HA option needs to be procured, and all related components, viz. the repository and application databases, storage systems and networks, need to be HA compliant.

Finally, another option, which enables session level computing on the grid, is "Session on Grid". With this advanced option Informatica distributes individual transformation level tasks across the grid, which is useful for CPU intensive sessions.



Back to blogging

It's been a long time since I blogged here. I think it is the right time to get back. A lot has been happening in the Data Integration world while I was away from writing here: Informatica introduced its Virtual Data Machine, Vibe; data integration is moving towards the sources with streams; Data Security is being taken more seriously than ever; vendors are opening up their platforms to cope with exponentially growing types of data sources; and so on.

It's going to be an exciting path ahead. Keep reading...

Thursday, April 12, 2012

Dynamic Expression evaluation in Informatica

Informatica has been one of the leading Data Integration (ETL) tools in the market for several years now.
One objective all the big ETL tool companies strive for is data transfer speed combined with complex business rule transformations. Informatica provides various transformations for a typical source DB to target DB mapping. One of the recent advances makes expression evaluation dynamic, i.e. the expression string itself can now be supplied as a parameter in the parameter file.

This feature opens avenues for several ideas around changing rules dynamically. Let us take an example.

In this simple example, Name is an output port with the following expression: Name = First Name || Last Name

Normally, if the expression needs to change so that the target Name field contains only the initial character of Last Name, the mapping has to be modified, tested and moved to production. If instead a mapping parameter such as $$Name was created with its IsExprVar property set to TRUE, you could simply assign the required expression to the parameter in the parameter file and use the parameter in the Name port instead of the hardcoded concatenation string.
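The idea of keeping the rule as data can be illustrated outside Informatica. The Python sketch below is an analogy only and is not Informatica syntax: a dictionary stands in for the parameter file, the hypothetical evaluate_port function stands in for the Name port, and the sample row is invented.

```python
def evaluate_port(expression, row):
    """Evaluate an expression string against one row of data.

    eval() is used only to keep the illustration short; it must never
    be fed untrusted input.
    """
    return eval(expression, {"__builtins__": {}}, dict(row))


# Stand-in for the parameter file: the rule is data, not part of the mapping.
parameters = {"$$Name": "first_name + last_name"}

row = {"first_name": "Jane", "last_name": "Doe"}
full_name = evaluate_port(parameters["$$Name"], row)

# Changing the rule means editing the parameter only; the "mapping" is untouched.
parameters["$$Name"] = "first_name + last_name[0]"
short_name = evaluate_port(parameters["$$Name"], row)
```

The point of the sketch is the separation: evaluate_port never changes, while the expression assigned to $$Name can be swapped without a new release.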

This feature becomes more practical in situations like bonus calculations, price formulas, scoring etc., where expressions change more frequently.

Kiran Padiyar

Thursday, May 19, 2011

Shopper step analysis

With every square foot of retail space becoming expensive these days, retailers are continuously on the lookout for new ways to optimize floor space and enhance the shopper experience.
How many steps does a shopper take before finding the item? How many times does the shopper go back and forth? Which is the most visited spot? These are some of the key questions analysts need to answer to arrive at optimized placement of items and promotions.
Large retailers already do market basket analysis and placement predictions using data collected at the POS. Cart movement on the retail floor now provides another dimension. Following is a grid showing sample shopper cart movement data.

On the left side are the Cart ID and the shopper's steps (the stops a shopper makes; every stop is recorded by aisle RFID sensors, assuming the carts are fitted with RFID tags). The columns of the grid are the sections of the retail shop, marked with point numbers, i.e. in an ideal case the shopper navigates point 1 --> point 2 --> .. point 10. The grid cells show the sum of sales made (recorded at the POS).
Following is a visual showing the heatmap of sales with the customer's navigation behaviour.
Blank cells in grey are sections that shoppers visited without making any purchase. The same visualization can be overlaid on a shop floor graphic to show navigation behaviour.
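The grid and the questions it answers can be sketched in code. The sample below is hypothetical (cart IDs, point numbers and sales amounts are invented), and the sketch assumes one record per stop with the sales later attributed to that stop at the POS.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (cart_id, step, point, sales attributed to that stop)
movements = [
    ("C1", 1, 1, 0.0),
    ("C1", 2, 2, 12.5),
    ("C1", 3, 4, 0.0),
    ("C1", 4, 3, 7.0),
    ("C2", 1, 1, 3.0),
    ("C2", 2, 3, 0.0),
    ("C2", 3, 2, 0.0),
    ("C2", 4, 3, 9.0),
]


def analyze(movements):
    grid = defaultdict(dict)            # (cart_id, step) -> {point: sum of sales}
    steps = Counter()                   # cart_id -> number of stops
    backtracks = Counter()              # cart_id -> moves back to a lower point number
    footfall = Counter()                # point -> number of stops made there
    sales_by_point = defaultdict(float)
    previous_point = {}

    # Sorting the tuples orders the records by cart and then by step.
    for cart, step, point, sales in sorted(movements):
        cells = grid[(cart, step)]
        cells[point] = cells.get(point, 0.0) + sales
        steps[cart] += 1
        footfall[point] += 1
        sales_by_point[point] += sales
        if cart in previous_point and point < previous_point[cart]:
            backtracks[cart] += 1
        previous_point[cart] = point

    # Visited cells with no purchase: the grey cells of the heatmap.
    no_purchase = [
        (row, point)
        for row, cells in grid.items()
        for point, total in cells.items()
        if total == 0
    ]
    return {
        "grid": dict(grid),
        "steps": dict(steps),
        "backtracks": dict(backtracks),
        "busiest_point": footfall.most_common(1)[0][0],
        "sales_by_point": dict(sales_by_point),
        "no_purchase": no_purchase,
    }
```

Calling analyze(movements) returns the step count and back and forth count per cart, the most visited point, sales per point, and the visited cells without a purchase, which is the data a BI tool would need to draw the heatmap.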

I hope this post helps spark some new ideas for retail analysis using your BI solution.