Parsing delimited string in Redshift

SQL Server developers are very familiar with split-string functions. These functions generally parse a delimited string and return a single-column table type. Recently, SQL Server 2016 introduced the native function “STRING_SPLIT” for this kind of parsing.

Redshift provides the “split_part” function for parsing strings, which returns one part of a delimited string. But developers often want to convert a delimited string into table rows so that they can join against the result. This is also possible in Redshift by utilizing “split_part”.

First, we will have to create a number series. PostgreSQL has a nice function called “generate_series” to generate a series of integer values. Though Redshift was built from PostgreSQL 8.1.x, the “generate_series” function is not fully supported in Redshift. The code below generates a series of integer values between 0 and 255.

 CREATE TEMP TABLE numbers AS
 SELECT p0.n
 + p1.n * 2
 + p2.n * POWER(2,2)
 + p3.n * POWER(2,3)
 + p4.n * POWER(2,4)
 + p5.n * POWER(2,5)
 + p6.n * POWER(2,6)
 + p7.n * POWER(2,7)
 as num
 FROM
 (SELECT 0 as n UNION SELECT 1) p0,
 (SELECT 0 as n UNION SELECT 1) p1,
 (SELECT 0 as n UNION SELECT 1) p2,
 (SELECT 0 as n UNION SELECT 1) p3,
 (SELECT 0 as n UNION SELECT 1) p4,
 (SELECT 0 as n UNION SELECT 1) p5,
 (SELECT 0 as n UNION SELECT 1) p6,
 (SELECT 0 as n UNION SELECT 1) p7
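The cross-join trick above can be sketched in Python to see why it yields every integer from 0 to 255 (each two-row table contributes one bit of an 8-bit number; `generate_series_256` is an illustrative name, not a Redshift function):

```python
from itertools import product

def generate_series_256():
    # eight two-row {0,1} tables cross-joined give 2**8 combinations;
    # summing bit * 2**position turns each combination into a
    # distinct integer between 0 and 255
    return sorted(sum(bit << position for position, bit in enumerate(bits))
                  for bits in product([0, 1], repeat=8))

print(len(generate_series_256()))  # → 256
```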

Suppose we have a delimited string; we create a single-row temp table from it so that it can be joined easily.

 CREATE TEMP TABLE delimatedtagid AS
 SELECT '32,64,256' as tagidtext

Now the code below will convert the delimited text to table rows:

 CREATE TEMP TABLE tagidlist AS
 SELECT split_part(dti.tagidtext, ',', numbers.num + 1
    ) as tagid
 FROM delimatedtagid dti
 JOIN numbers 
 ON numbers.num <= regexp_count(dti.tagidtext, ',')
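The join logic can be sketched in Python as a local analogue of the SQL (assuming a comma delimiter; `split_to_rows` is an illustrative helper, not a Redshift function):

```python
def split_to_rows(text, delim=","):
    # one output row per join match: numbers.num runs from 0 up to
    # regexp_count(text, delim), i.e. the number of delimiters
    n_delims = text.count(delim)
    # split_part(text, delim, num + 1) is 1-based in Redshift;
    # Python list indexing is 0-based, so num maps directly
    parts = text.split(delim)
    return [parts[num] for num in range(n_delims + 1)]

print(split_to_rows("32,64,256"))  # → ['32', '64', '256']
```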


Complete script:

 CREATE TEMP TABLE numbers AS
 SELECT p0.n
 + p1.n * 2
 + p2.n * POWER(2,2)
 + p3.n * POWER(2,3)
 + p4.n * POWER(2,4)
 + p5.n * POWER(2,5)
 + p6.n * POWER(2,6)
 + p7.n * POWER(2,7)
 as num
 FROM
 (SELECT 0 as n UNION SELECT 1) p0,
 (SELECT 0 as n UNION SELECT 1) p1,
 (SELECT 0 as n UNION SELECT 1) p2,
 (SELECT 0 as n UNION SELECT 1) p3,
 (SELECT 0 as n UNION SELECT 1) p4,
 (SELECT 0 as n UNION SELECT 1) p5,
 (SELECT 0 as n UNION SELECT 1) p6,
 (SELECT 0 as n UNION SELECT 1) p7;

 CREATE TEMP TABLE delimatedtagid AS
 SELECT '32,64,256' as tagidtext;

 CREATE TEMP TABLE tagidlist AS
 SELECT split_part(dti.tagidtext, ',', numbers.num + 1
    ) as tagid
 FROM delimatedtagid dti
 JOIN numbers 
 ON numbers.num <= regexp_count(dti.tagidtext, ',');

select * from tagidlist;

The output is the delimited values as rows: 32, 64 and 256.


Install Hadoop in single node Vagrant

Vagrant is a nice tool for developers, especially those who love to play with new technologies. Recently I successfully installed Hadoop in a local Vagrant machine. The work was not smooth; I got stuck several times. So the objective of this post is to help people who want to explore Hadoop using Vagrant. Especially for Windows users, Vagrant can be a magnificent choice for learning Hadoop.

1. Prerequisite:

  • Latest version of Vagrant
  • Git
  • Putty

2. Prepare single node Vagrant:
Create a directory on the Windows machine, then clone the GitHub repository below using Git Bash in that directory,

cmd > git clone

I have created a Vagrant configuration file with the necessary settings to install Hadoop in the guest machine. Hadoop will be installed as a single node (I will write another post for multi-node). Below is the Vagrant node configuration:

  • Ubuntu version : Server 14.04
  • Vagrant box name: ubuntu/trusty64
  • Java: Sun Java java 8
  • Cpu: 2
  • Memory: 1024
  • Private IP:

Now it is time to bring up the Vagrant machine. Go to the project directory, open a Windows command prompt there, and run the command below,

cmd> vagrant up hnname

As we are using Vagrant on a Windows machine, we will have to connect to the Vagrant guest using an SSH client; I prefer to use PuTTY. The Vagrant command below connects to the guest over SSH (“vagrant ssh-config hnname” displays the connection details needed for PuTTY),

cmd> vagrant ssh hnname

3. Configure Ubuntu SSH server:
An SSH server comes pre-installed in the Vagrant Ubuntu guest. We will have to configure it for Hadoop because Hadoop manages distributed nodes over SSH. Here we create an SSH key; as we are preparing a local development environment, we can leave the passphrase blank.

$ ssh-keygen
cat ~/.ssh/ >> ~/.ssh/authorized_keys

4. Downloading Hadoop:
I have used hadoop-2.7.2.tar.gz for this article. Hadoop will be installed in the /usr/local/hadoop folder, but this is optional; Hadoop can be installed in another location. Hadoop will be installed under the default “vagrant” user, although a dedicated user for Hadoop is recommended.

$ wget
$ tar -zxvf hadoop-2.7.2.tar.gz
$ sudo cp -r hadoop-2.7.2 /usr/local/hadoop

5. Configure Ubuntu bash:
The Java home directory and Hadoop base path will have to be set in the Ubuntu bash profile. Below are the steps to modify the bash file,

# open bash in vi editor
$ sudo vi $HOME/.bashrc

# append below line in bash file
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# save and exit from vi (Esc + :wq!)
# reload bash
$ exec bash

6. Disable IPV6:
Hadoop does not support IPv6. In the Vagrant Ubuntu guest, IPv6 is enabled by default, so IPv6 support will have to be disabled following the instructions below.

6.1 Modify Hadoop Env setting:

# Edit hadoop env file
$ sudo vi /usr/local/hadoop/etc/hadoop/

# modify HADOOP_OPTS value

# Save and exit vi


6.2 Modify Ubuntu network setting to disable IPV6

# Modify sysctl.conf
$ sudo vi /etc/sysctl.conf

# Add below IPv6 configuration
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

# Save and exit vi

# Reload sysctl.conf configuration
$ sudo sysctl -p

# check IPv6 status; a value of 1 means IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

7. Ubuntu Hosts(/etc/hosts) file entry:
Hadoop does not require any special entries in the hosts file. But for better understanding, below I have added the working copy of my hosts file,	hnname	hnname localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

8. Configure Hadoop site setting:
This part gave me difficulty due to Vagrant. I tried configuring the sites using the host name “hnname” or “localhost”, but unfortunately I did not succeed in bringing up the Hadoop components properly with those host names. So I have configured the sites using the default IP.

Hadoop stores its configuration files under the /usr/local/hadoop/etc/hadoop directory. Each configuration file will have to be opened using vi; add the corresponding settings, then save the file.

8.1 core-site.xml file:


8.2 mapred-site.xml file:

# Create mapred-site.xml from the default template
$ sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

# Edit mapred-site.xml file
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

# add below setting

8.3 hdfs-site.xml file:




9. Create a temporary directory for hadoop:

$ sudo mkdir /home/hadoop/tmp
$ sudo chown vagrant /home/hadoop/tmp

# Set folder permissions
$ sudo chmod 750 /home/hadoop

10. Create data folder for data node:

$ sudo mkdir /home/hadoop/hdfs
$ sudo chown vagrant /home/hadoop/hdfs
$ sudo chmod 750 /home/hadoop/hdfs

The configuration is done. Now it is time to test Hadoop. Every process should start smoothly; we will have to check the logs (/usr/local/hadoop/logs) if any unexpected issue arises.

11. Starting services:

# Format the name node. This is required only once and will clean up the HDFS data folder.
$ hdfs namenode -format

#Start hdfs and yarn services.
$ /usr/local/hadoop/sbin/
$ /usr/local/hadoop/sbin/

# If everything is OK, the Java processes below will be running
$ jps

# dfs process
2034 DataNode
2263 SecondaryNameNode
1887 NameNode

# yarn process
2441 ResourceManager
2586 NodeManager

12. Test from Browser:
Hadoop has a basic web UI to view and track activities. After starting all the Hadoop processes, the Hadoop sites can be viewed from the Windows machine's browser,

If the site does not display in the browser, try to test the site using telnet from Ubuntu. If telnet connects successfully, then the Windows machine's browser should connect to the Hadoop web UI successfully.

$ telnet 50070

Please note that the private IP for the Vagrant machine is configured in the Vagrant config file.

The site below is for the Hadoop data node:

13. Running a job:

I like to create a word-count job in Hadoop to count the words in a source text file. A sample source file will be downloaded. The job will execute the MapReduce examples JAR, which is already in the installation path.

$ cd /usr/local/hadoop

# make a directory for sample data and download the test data
$ mkdir -p sampledata/science
$ cd sampledata/science

# Create a directory in dfs and put sample data
$ hdfs dfs -mkdir /project01
$ hdfs dfs -put /usr/local/hadoop/sampledata /project01

# view the DFS data list; you can also view the data from the data node web UI
$ hdfs dfs -ls /project01

# execute the example job for word count
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /project01/sampledata/science /project01/sampledata/science/output

# Show the output.
$ hdfs dfs -cat /project01/sampledata/science/output/part-r-00000
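For reference, the aggregation that the wordcount example performs can be sketched in Python (a local, single-process analogue of the distributed job, not the MapReduce implementation itself):

```python
from collections import Counter

def word_count(text):
    # map: emit one token per whitespace-separated word;
    # reduce: sum the counts per word -- Counter does both locally
    return Counter(text.split())

counts = word_count("to be or not to be")
print(counts.most_common(2))
```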

# stopping services
$ /usr/local/hadoop/sbin/
$ /usr/local/hadoop/sbin/

The working copies of the Hadoop and Ubuntu configuration files have been added to the git “config-files” folder. Hadoop is a highly configurable distributed system; the settings described above are the minimum needed to run Hadoop as a single-node cluster. On a production server, Hadoop is generally installed as a multi-node cluster.


Execute Hive Script in AWS Elastic MapReduce (EMR)

There are three ways we can execute a Hive script in EMR,

  • EMR Cluster Console
  • PuTTY or some other SSH client
  • Using own code (Python, Java, Ruby and .NET)


Below I have written a Hive script that exports data from DynamoDB to S3. Before running this script, you will have to create a DynamoDB table and an S3 bucket for the export file.

CREATE EXTERNAL TABLE ddbmember (id bigint,name string,city string,state string,age bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("" = "memberinfo",
"dynamodb.column.mapping" = "id:id,name:name,city:city,state:state,age:age"); 
CREATE EXTERNAL TABLE s3member (id int,name string,city string,state string,age int)
LOCATION 's3://test.emr/export/'; 
INSERT OVERWRITE TABLE s3member SELECT * FROM ddbmember;

drop table ddbmember;
drop table s3member; 

First, we create an external table for the DynamoDB table. The “id” field must have the same data type as the DynamoDB table's hash key (numeric type). Then we create an external table over the export S3 bucket. Finally, we initiate an “INSERT OVERWRITE” instruction to export the full DynamoDB table into the S3 bucket.


The Hive script file will have to be uploaded to an S3 bucket to continue with the next section's instructions. Below I have described three ways of implementing the Hive script,

EMR Console

Follow the steps below,

  • Navigate to EMR console>Cluster List>Waiting EMR cluster
  • Create a new Step.
  • Write the script's S3 location
  • Create

AWS will execute the script automatically and will report progress in the cluster console.



PuTTY

Using the PuTTY client, we can connect to the EMR instance directly and execute the Hive script just as with a traditional database. The article below describes how to configure PuTTY,

The article below describes how to connect PuTTY to the EMR cluster's Hive.


If I summarize the AWS developer notes, the steps are below,

  • Using puttygen.exe, create a private key from the Key Pair .pem file. Please make sure that your EMR cluster has been created with the same Key Pair file. PuTTYgen will create a .PPK file. Follow this post for details.
  • Under the Session tab, write the EMR cluster master node URL. Add the “hadoop@” prefix to the URL.
  • Under Connection>SSH>Auth, load the .PPK file
  • Under Tunnels, add the info below,
    • Destination: Master node URL:8888
    • Port: 8157
  • After adding the tunnel, click Open to connect
  • After connecting with EMR, type “hive”.
  • Execute the Hive script using the Hive console.



Using own code (.NET SDK)

We can create an EMR Step that attaches a script file using the AWS .NET SDK. The items below are prerequisites,

  • AWS .NET SDK – Core and EMR
  • EMR cluster instance
  • S3 bucket for Script

Below are the implementation steps,

  • Create a new EMR cluster just to execute this script, or assign an existing running EMR instance. You will have to collect the Job Flow Id of the running instance from EMR console>EMR Cluster List
  • Create a step
  • Attach the script to the Step
  • Wait for the EMR Step execution to complete
  • Terminate the EMR cluster if required.

Below is the .NET implementation,

public void RunHiveScriptStep(string activeWaitingJobFlowId, string scriptS3Location, bool isTerminateCluster)
{
    try
    {
        if (!string.IsNullOrEmpty(activeWaitingJobFlowId))
        {
            StepFactory stepFactory = new StepFactory(RegionEndpoint.EUWest1);
            StepConfig runHiveScript = new StepConfig()
            {
                Name = "Run Hive script",
                HadoopJarStep = stepFactory.NewRunHiveScriptStep(scriptS3Location),
                ActionOnFailure = "TERMINATE_JOB_FLOW"
            };
            AddJobFlowStepsRequest addHiveRequest = new AddJobFlowStepsRequest(activeWaitingJobFlowId, new List<StepConfig>() { runHiveScript });
            AddJobFlowStepsResponse addHiveResponse = EmrClient.AddJobFlowSteps(addHiveRequest);
            List<string> stepIds = addHiveResponse.StepIds;
            String hiveStepId = stepIds[0];

            DescribeStepRequest describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
            DescribeStepResponse describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
            Step hiveStep = describeHiveStepResult.Step;
            StepStatus hiveStepStatus = hiveStep.Status;
            string hiveStepState = hiveStepStatus.State.Value.ToLower();
            bool failedState = false;
            StepTimeline finalTimeline = null;
            // poll the step status until it completes or fails
            while (hiveStepState != "completed")
            {
                describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
                describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
                hiveStep = describeHiveStepResult.Step;
                hiveStepStatus = hiveStep.Status;
                hiveStepState = hiveStepStatus.State.Value.ToLower();
                finalTimeline = hiveStepStatus.Timeline;
                Console.WriteLine(string.Format("Current state of Hive script execution: {0}", hiveStepState));
                switch (hiveStepState)
                {
                    case "pending":
                    case "running":
                        break;
                    case "cancelled":
                    case "failed":
                    case "interrupted":
                        failedState = true;
                        break;
                }
                if (failedState)
                {
                    break;
                }
            }
            if (finalTimeline != null)
            {
                Console.WriteLine(string.Format("Hive script step {0} created at {1}, started at {2}, finished at {3}"
                    , hiveStepId, finalTimeline.CreationDateTime, finalTimeline.StartDateTime, finalTimeline.EndDateTime));
            }

            if (isTerminateCluster)
            {
                TerminateJobFlowsRequest terminateRequest =
                    new TerminateJobFlowsRequest(new List<string> { activeWaitingJobFlowId });
                TerminateJobFlowsResponse terminateResponse = EmrClient.TerminateJobFlows(terminateRequest);
            }
        }
        else
        {
            Console.WriteLine("No valid job flow could be created.");
        }
    }
    catch (AmazonElasticMapReduceException emrException)
    {
        Console.WriteLine("Hive script execution step has failed.");
        Console.WriteLine("Amazon error code: {0}",
            string.IsNullOrEmpty(emrException.ErrorCode) ? "None" : emrException.ErrorCode);
        Console.WriteLine("Exception message: {0}", emrException.Message);
    }
}

The method “RunHiveScriptStep” expects three parameters,

  • activeWaitingJobFlowId: the running instance's Job Flow Id. You can collect this ID from the EMR console
  • scriptS3Location: the script file's S3 location
  • isTerminateCluster: whether to terminate the cluster after execution.

AWS provides SDKs for some other languages like Python, Java and Ruby. You can implement the same thing in other programming languages.
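For example, the same step definition can be sketched for Python's boto3. This is a sketch only: the helper name is mine, and the `command-runner.jar` argument list is an assumption about how a Hive script is launched; check the EMR docs for your release's exact step arguments.

```python
def build_hive_step(script_s3_location):
    # dict shaped for boto3's emr.add_job_flow_steps(JobFlowId=..., Steps=[...]);
    # mirrors the StepConfig built in the .NET example
    return {
        "Name": "Run Hive script",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # assumption: command-runner launches hive-script
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", script_s3_location],
        },
    }

step = build_hive_step("s3://test.emr/script.q")
print(step["Name"])
```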


Reference: Using Amazon Elastic MapReduce with the AWS.NET API Part 4: Hive basics with Hadoop



Create and Configure AWS Elastic MapReduce (EMR) Cluster

The AWS EMR developer guide nicely describes how to set up and configure a new EMR cluster; please click here for the AWS manual. In this post I will emphasize two settings of an EMR cluster that can confuse beginners. Actually, one big reason I select a topic for my blog is that it is something I tried that did not work the first time.

Key Pair

This setting is optional but very important for EMR developers. A Key Pair is an encrypted key file which is required to connect to EMR from an SSH client like PuTTY. The Key Pair file can be created from the AWS EC2 console. Please follow the steps below to create a Key Pair file,

  • Navigate to EC2 Console>Key Pairs>Create Key Pairs
  • Put in a name for the file.
  • A .pem file will then be automatically downloaded for you
  • Store this file for future use.

Now the created “Key Pair” name will appear in the new EMR creation dropdown list under the “Key Pair” section. For more information on Key Pair files, click here.


EMR IAM Service and Job Flow Role

AWS provides an SDK for EMR. Using the SDK, a new EMR cluster can be created and managed. We require these two IAM roles to create an EMR cluster from code using the AWS SDK. Below I have noted the steps to create these two roles,

IAM Service Role

  • Navigate to IAM console>Roles>New Role
  • Write a name for the role
  • Select the “Amazon Elastic MapReduce” role type
  • Then attach this policy

IAM Job Flow Role

  • Navigate to IAM console>Roles>New Role
  • Write a name for the role
  • Select the “Amazon Elastic MapReduce for EC2” role type
  • Then attach this policy

The steps below are optional, but you can follow them if you are stuck with an AWS security exception during EMR cluster creation from code.


Create EMR Cluster using .NET SDK

Below are prerequisites,

  • AWS .NET SDK for Core and EMR
  • EMR service and Job flow role
  • S3 bucket

public string CreateEMRCluster()
{
    var stepFactory = new StepFactory();

    var enabledebugging = new StepConfig
    {
        Name = "Enable debugging",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewEnableDebuggingStep()
    };

    var installHive = new StepConfig
    {
        Name = "Install Hive",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewInstallHiveStep()
    };

    var instanceConfig = new JobFlowInstancesConfig
    {
        Ec2KeyName = "testemr",
        InstanceCount = 2,
        KeepJobFlowAliveWhenNoSteps = true,
        MasterInstanceType = "m3.xlarge",
        SlaveInstanceType = "m3.xlarge"
    };

    var request = new RunJobFlowRequest
    {
        Name = "Hive Interactive",
        Steps = { enabledebugging, installHive },
        AmiVersion = "3.8.0",
        LogUri = "s3://test.emr/",
        Instances = instanceConfig,
        ServiceRole = "emrServiceRule",
        JobFlowRole = "EMR_EC2_DefaultRole"
    };

    var result = EmrClient.RunJobFlow(request);
    return result.JobFlowId;
}

The method “CreateEMRCluster” will create an EMR cluster using the “testemr” key pair. The method returns the “Job Flow Id”, which is needed later if you want to create an EMR Step from code.
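The equivalent request can be sketched in Python. This is a sketch: `build_cluster_request` is an illustrative helper producing keyword arguments for boto3's `emr.run_job_flow(**request)`, and the role and instance values simply mirror the .NET example above.

```python
def build_cluster_request(key_name="testemr", log_uri="s3://test.emr/"):
    # mirrors the RunJobFlowRequest fields from the .NET example;
    # pass the result as emr.run_job_flow(**build_cluster_request())
    return {
        "Name": "Hive Interactive",
        "LogUri": log_uri,
        "Instances": {
            "Ec2KeyName": key_name,
            "InstanceCount": 2,
            "KeepJobFlowAliveWhenNoSteps": True,
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
        },
        "ServiceRole": "emrServiceRule",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }

request = build_cluster_request()
print(request["Name"])
```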

Related Post: Execute Hive Script in AWS EMR



SQL Server to Redshift Data Migration

Data migration may not always be as smooth as copy-paste. The complexity peaks when the data volume is very large and the error tolerance level is minimal. In this article, I am sharing some of my experience of data migration to Redshift. I hope this will help some people.

The basic working steps are below,

  1. Export a delimited flat file from SQL Server
  2. Upload the source file to S3
  3. Execute the COPY command

1. Export CSV from SQL Server
This can be done in a lot of ways, but you must be careful and will have to know some restrictions of Redshift before preparing the source file. Below I have listed some common considerations,

  • Please recheck your data for any possibility of junk characters. The Redshift COPY command will fail if there is any unsupported character in the source file. Data cleanup is recommended for a successful COPY command.

  • Redshift does not support datetime the way SQL Server does. Redshift's “date” type stores only the date part. If the time part is important to you, datetime data will have to be transformed to a timestamp.
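The junk-character check mentioned above can be sketched in Python (an illustrative pre-flight scan, assuming the export should be valid UTF-8; `find_undecodable_lines` is my own helper name):

```python
def find_undecodable_lines(path, encoding="utf-8"):
    # return the line numbers whose bytes do not decode as the target
    # encoding -- a common cause of Redshift COPY failures
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```

Run it over the exported flat file before uploading; any reported line needs cleanup.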

Here, I prefer two methods to export a flat file from SQL Server.

1.1 SQL Server Export Wizard

Log in using SQL Server Management Studio; you will then find “Export Data” under the Tasks menu by right-clicking on a database. Please do not forget to select UTF-8 as the destination file encoding.

1.2 BCP

An excellent technique is to use BCP to export the flat file. Below is a sample BCP command, which will have to be run in a command prompt. The data will be exported as a tab-delimited file.

cmd> bcp "SELECT * FROM [database].dbo.[tablename]" queryout D:\data.txt -S [HostIP] -U [DbUser] -P [DbPassword] -c

2. Upload CSV to S3
Gzip the file and then upload it into S3 using the AWS console.

3. Execute the COPY command
Now build your COPY command and execute it using a Redshift client. Below is a basic COPY command; you will find a lot of material in the Redshift documentation to build your custom COPY command.

COPY tablename
FROM 's3://data.txt' 
CREDENTIALS 'aws_access_key_id=[key];aws_secret_access_key=[secret]'
DELIMITER '\t' GZIP;

You may not be lucky enough to succeed with the COPY command on the first attempt :-). If you get an error, check the Redshift error log using the query below. After fixing the issue, execute the COPY command again,

select *
from stl_load_errors
order by starttime desc limit 100

The COPY command will be terminated during the analyzing phase if there is any error in the source data. This means no data will be imported if there is even a single error in the source data. But in some cases you may need to import data while ignoring invalid rows; then add “MAXERROR 1000” at the end of the COPY command. Actually, MAXERROR can be helpful for identifying invalid rows, because Redshift will log all ignored rows in stl_load_errors. So after identifying and fixing the errors, the COPY command should be executed without “MAXERROR 1000”.

COPY tablename
FROM 's3://testdata/data.csv'
CREDENTIALS 'aws_access_key_id=[key];aws_secret_access_key=[secret]'
MAXERROR 1000;

Best wishes !!
Khayer, Bangladesh

Email Open Tracking using Google Analytics

It is a really amazing feature that we can track email opens using Google Analytics. Previously, there was no supported way to submit an HTTP request to Google Analytics directly. But now Google has introduced a new feature, Universal Analytics, with its Measurement Protocol. The main focus of this feature is to track mobile app usage. Interestingly, we can take advantage of it for tracking email opens and other areas.

We should follow the basic steps below for open-mail tracking,

  1. Sign up for Google Analytics and create a GA tracking account
  2. Create an HTML-encoded email.
  3. Append an “<img>” tag at the top/bottom of the mail body content. The “<img>” tag's src will point to the Google Analytics URL with the required parameters.
  4. Monitor tracking from the Google Analytics dashboard.

First, we create the request URL for tracking data submission. Below is a sample URL with the required parameters.

If we hit the above URL directly in a web browser, GA will receive the data under the provided GA account ID (tid=your GA ID). Documentation for all the useful parameters can be found in the Measurement Protocol reference. Below I describe only the parameters required for the email open tracking task.

v = GA version number
tid = tracking ID / web property ID (the Google Analytics account ID)
cid = client ID (any unique number)
t = hit type
ec = event category
ea = event action
el = event label
cs = campaign source
cm = campaign medium
cn = campaign name

“tid” is the Google Analytics account tracking ID; it needs to be updated with your GA account ID.

“cid” is generally the application's user ID or any other unique ID. Here is a nice video explaining the reason for this parameter.
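Putting the parameters together, the tracking URL can be assembled in Python. A sketch only: the event category/action/label values are illustrative choices, “UA-XXXXXXX-1” is a placeholder for your own tracking ID, and the /collect endpoint is the Measurement Protocol's collection URL.

```python
from urllib.parse import urlencode

def build_tracking_url(tid, cid):
    # one "event" hit per email open; parameter names follow the
    # GA Measurement Protocol described above
    params = {
        "v": "1",             # protocol version
        "tid": tid,           # your GA tracking ID
        "cid": cid,           # unique per recipient
        "t": "event",
        "ec": "email",        # illustrative event category
        "ea": "open",         # illustrative event action
        "el": "open_email_tracking",
    }
    return "https://www.google-analytics.com/collect?" + urlencode(params)

print(build_tracking_url("UA-XXXXXXX-1", "12345"))
```

The returned URL is what goes into the “<img>” tag's src attribute.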

Below is the GA “Real Time” dashboard's live response after hitting the above URL directly in the browser.


Figure 1: GA Real Time dashboard

Sending HTML Email:

I noted previously that we can only track HTML email. So to test our tracking URL, we have to send an HTML email. We can use Gmail to send a test HTML email; Outlook is not recommended for sending HTML email. The tracking “<img>” tag below will have to be appended at the top/bottom of the HTML mail content.

<img src="" />

Here is a helpful video to learn how to send HTML mail using Gmail.

Now the GA “Real Time” dashboard will show a response as soon as the email is opened (figure 1). One special note: by default, Gmail and other mail clients strip the email body's “<img>” tag. So the email receiver must click the “display images” link to show images in the mail content. If the user does not click “display images”, the tracking request will NOT be sent to GA. This is a classic limitation of the “one pixel” image tracking technique.

View Analytics tracking report:

The “Real Time” dashboard lets us confirm that tracking is working during development. GA can take some hours to prepare the actual report. We will get our email tracking report under “Traffic Sources>Overview>Campaigns>open_email_tracking”.


Figure 2: Open Email tracking report

Please don’t forget to post a comment if you need further assistance 🙂