It's Khayer, Bangladesh

Execute Hive Script in AWS Elastic MapReduce (EMR)

Three ways we can execute Hive script in EMR,

  • EMR Cluster Console
  • PuTTy or some other SSL connector
  • Using own code (Python, Java, Ruby and .NET)


Below I have written a Hive script which will export data from DynamoDB to S3. So before run this script, you will have to create a DyanmoDB table and S3 bucket for export file.

CREATE EXTERNAL TABLE ddbmember (id bigint,name string,city string,state string,age bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("" = "memberinfo",
"dynamodb.column.mapping" = "id:id,name:name,city:city,state:state,age:age"); 
CREATE EXTERNAL TABLE s3member (id int,name string,city string,state string,age int)
LOCATION 's3://test.emr/export/'; 
FROM ddbmember;

drop table ddbmember;
drop table s3member; 

First, we have created an external table for DynamoDB table.  “Id” field data must have same data type with DynamoDB table hash key(numeric type). Then we have created an external table of export S3 bucket.  Finally initiate “INSERT OVERWRITE” instruction to export full DynamoDB table in S3 bucket.


The hive script file will have to upload in S3 bucket to continue next section instruction. Below I have described three way of Hive script implantations,

EMR Console

Follow below steps

  • Navigate to EMR console>Cluster List>Waiting EMR cluster
  • Create new Step.
  • Write S3 the script location
  • Create

The AWS will execute the script automatically and will notify progress in Cluster console.



Using PuTTy client we can connect to EMR instance directly and execute Hive script same as traditional database.   Below article describe how to configure putty,

Below article describe how to connect putty with EMR cluster Hive.


If I summarized the AWS developer notes, steps are below,

  • Using puttygen.exe, create private key from Key Pair .pem file. Please make sure that your EMR cluster has been created same Kay Pair file. Putty gen will create a .PPK file. Follow this Post for details.
  • Under Session tab, write EMR cluster master node URL. Add “hadoop@” prefix of the URL.
  • Under connection>SSH>Auth, load the .PPK file
  • Under Tunnels add below info,
    • Destination: Master node URL:8888
    • Port: 8157
  • After add tunnel, click Load
  • After connect with EMR, write Hive.
  • Execute hive script using Hive console.



We can create an EMR Step by attaching script file using AWS .NET SDK. Below items are prerequisites,

  • AWS .NET SDK – Core and EMR
  • EMR cluster instance
  • S3 bucket for Script

Below are implementation steps,

  • Create a new EMR cluster only to execute this script or assign existing running EMR instance. You will have to collect Job Flow Id of running instance from EMR console>EMR Cluster List
  • Create a step
  • Attached script with Step
  • Wait for EMR Step execution completion
  • Terminate EMR cluster if require.

Below is .NET implementation,

public void RunHiveScriptStep(string activeWaitingJobFlowId, string scriptS3Location, bool isTerminateCluster)

 if (!string.IsNullOrEmpty(activeWaitingJobFlowId))
 StepFactory stepFactory = new StepFactory(RegionEndpoint.EUWest1);
 StepConfig runHiveScript = new StepConfig()
 Name = "Run Hive script",
 HadoopJarStep = stepFactory.NewRunHiveScriptStep(scriptS3Location),
 ActionOnFailure = "TERMINATE_JOB_FLOW"
 AddJobFlowStepsRequest addHiveRequest = new AddJobFlowStepsRequest(activeWaitingJobFlowId, new List<StepConfig>() { runHiveScript });
 AddJobFlowStepsResponse addHiveResponse = EmrClient.AddJobFlowSteps(addHiveRequest);
 List<string> stepIds = addHiveResponse.StepIds;
 String hiveStepId = stepIds[0];

 DescribeStepRequest describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
 DescribeStepResponse describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
 Step hiveStep = describeHiveStepResult.Step;
 StepStatus hiveStepStatus = hiveStep.Status;
 string hiveStepState = hiveStepStatus.State.Value.ToLower();
 bool failedState = false;
 StepTimeline finalTimeline = null;
 while (hiveStepState != "completed")
 describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
 describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
 hiveStep = describeHiveStepResult.Step;
 hiveStepStatus = hiveStep.Status;
 hiveStepState = hiveStepStatus.State.Value.ToLower();
 finalTimeline = hiveStepStatus.Timeline;
 Console.WriteLine(string.Format("Current state of Hive script execution: {0}", hiveStepState));
 switch (hiveStepState)
 case "pending":
 case "running":
 case "cancelled":
 case "failed":
 case "interrupted":
 failedState = true;
 if (failedState)
 if (finalTimeline != null)
 Console.WriteLine(string.Format("Hive script step {0} created at {1}, started at {2}, finished at {3}"
 , hiveStepId, finalTimeline.CreationDateTime, finalTimeline.StartDateTime, finalTimeline.EndDateTime));

 if (isTerminateCluster)
 TerminateJobFlowsRequest terminateRequest =
 new TerminateJobFlowsRequest(new List<string> {activeWaitingJobFlowId});
 TerminateJobFlowsResponse terminateResponse = EmrClient.TerminateJobFlows(terminateRequest);
 Console.WriteLine("No valid job flow could be created.");
 catch (AmazonElasticMapReduceException emrException)
 Console.WriteLine("Hive script execution step has failed.");
 Console.WriteLine("Amazon error code: {0}",
 string.IsNullOrEmpty(emrException.ErrorCode) ? "None" : emrException.ErrorCode);
 Console.WriteLine("Exception message: {0}", emrException.Message);

Method “RunHiveScript” expect three parameters,

  • activeWaitingJobFlowId : Running instance Job flow id. You can collect this ID from EMR console
  • scriptS3Location: script file S3 location
  • isTerminateCluster: Terminate cluster after execution or not.

AWS has provided SDK for some other languages like Phython, Java, Ruby. You can impalement same thing with other programming languages.


Reference: Using Amazon Elastic MapReduce with the AWS.NET API Part 4: Hive basics with Hadoop



Create and Configure AWS Elastic MapReduce (EMR) Cluster

AWS EMR developer guide has nicely described how to setup and configure a new EMR cluster. Please click here to get the AWS manual. In this writing I will emphasize on two setting of EMR cluster that can confuse beginner. Actually one of big reason to select a tropic in my blog is that something I have tried but did not work first time.

Key Pair

This setting is optional but very important for EMR developer. Key Pair is an encrypted key file which is required to connect the EMR from SSL client like PuTTy.  Key Pair file can be creating from AWS EC2 console. Please follow below steps to create Key Pair file,

  • Navigate to EC2 Console>Kay Pairs>Create Key Pairs
  • Put a name of the file.
  • Then a .pem extension file will auto downloaded for you
  • Store this file for future use.

Now you will get created “Key Pair” name in New EMR creation dropdown list under “Key Pair” section. For more information on Key Pair file click here.


EMR IAM Service and Job Flow Role

AWS has provided SDK for EMR. Using SDK a new EMR cluster can be created and manage. We require this two IAM rule to create EMR cluster from code using AWS SDK. Below I have noted steps to create these two roles,

IAM Service Rule

  • Navigate to IAM console>Rules>New Role
  • Write a name for rule
  • Select “Amazon Elastic MapReduce” role type
  • Then attached this policy

IAM Job Flow Role

  • Navigate to IAM console>Rules>New Role
  • Write a name for rule
  • Select “Amazon Elastic MapReduce for EC2” role type
  • Then attached this policy

Below steps is optional but you can follow if your stack with AWS security exception during EMR cluster creation from code.


Create EMR Cluster using .NET SDK

Below are prerequisites,

  • AWS .NET SDK for Core and EMR
  • EMR service and Job flow role
  • S3 bucket
public string CreateEMRCluster()
 var stepFactory = new StepFactory();

 var enabledebugging = new StepConfig
 Name = "Enable debugging",
 ActionOnFailure = "TERMINATE_JOB_FLOW",
 HadoopJarStep = stepFactory.NewEnableDebuggingStep()

 var installHive = new StepConfig
 Name = "Install Hive",
 ActionOnFailure = "TERMINATE_JOB_FLOW",
 HadoopJarStep = stepFactory.NewInstallHiveStep()

 var instanceConfig = new JobFlowInstancesConfig
 Ec2KeyName = "testemr",
 InstanceCount = 2,
 KeepJobFlowAliveWhenNoSteps = true,
 MasterInstanceType = "m3.xlarge",
 SlaveInstanceType = "m3.xlarge"

 var request = new RunJobFlowRequest
 Name = "Hive Interactive",
 Steps = { enabledebugging, installHive },
 AmiVersion = "3.8.0",

 LogUri = "s3://test.emr/",
 Instances = instanceConfig,
 ServiceRole = "emrServiceRule",
 JobFlowRole = "EMR_EC2_DefaultRole"

 var result = EmrClient.RunJobFlow(request);
 return result.JobFlowId;

The method “CreateEMRCluster” will create a EMR cluster name “testemr”. This method will return “Flow Job Id” which has further use if you want to create EMR Step from code.

Related Post: Execute Hive Script in AWS EMR



SQL Server to Redshift Data Migration

Data migration may not always smooth like copy past. The complexity goes peak when data volume is very larger and error tolerance level is minimum. This article, i am sharing some my experience of data migration to redshift. Hope this will help some people.

Basic working steps are belows,

  1. Export delimited flat file from SQL server
  2. Upload source file in S3
  3. Execute copy command

1. Export CSV from SQL server
A lots of way this can be done. But you must be carefully and will have to know some restrictions of redshift before going to prepare source file. Below i have listed some common consideration,

Please recheck you data, if there any possibility to contain junk character. Redshift copy command will fail if there any unsupported character in source file. Data cleanup is recommended for succesfull copy command.

  • Redshift does not support datetime similarly SQL server. Redshift “date” type only store date part. If time part is important to you, datetime data will have to transform to timestamp.

Here, i prefer two methods to export flat file from Sql Server.

1.1 SQL Server Export Wizard

Will have to login using SQL Server management stadio, then you find Export data under Task menu by right clicking on database. Please do not forget to select utf-8 for destination file encoding.

1.2 BCC

Excellent technique to use BCC to export flat. Below i give a sample BCC command which will have to run in command promt. Data will be exported in Tab delimated file.

cmd> bcp "SELECT * FROM [database].dbo.[tablename]" queryout D:\data.txt -S[HostIP] -U[DbUser -PDbPassword] -c

2. Upload CSV in S3
Make Gzip and then upload the file into s3 using AWS console.

3. Execute Copy command
Now build your copy command and execute using Redshift client. Below i give basic copy command. You will get a lots of writing in Redshift documentation to build your custom copy command.

COPY tablename
FROM 's3://data.txt' 
CREDENTIALS 'aws_access_key_id=[key]

You may not lucky enough to succeed Copy command first attempt -): .  If you get error, see redshift error log using below query. After fixing, then again execute copy command,

select *
from stl_load_errors
order by starttime desc limit 100

Copy command will be terminated in analyzing part, if there any error in source data. This means, no data will not imported for single error in source data. But some case, you may need to import data ignoring invalid rows. Then add “MAXERROR 1000” at the end of copy command. Actually MAXERROR can helpfull to identified invalid row because Redshift will log all “ignore rows” in stl_load_errors. So after identified and fixing errors, copy command should execute without “MAXERROR 1000”.

COPY tablename
FROM 's3://testdata/data.csv'
CREDENTIALS 'aws_access_key_id=[key]

Best wishes !!
Khayer, Bangladesh

Email Open Tracking using Google Analytics

It is really amazing feature if we can track email opening by using Google Analytics. Previously, there is no suggested way to submit HTTP request to Google Analytics. But now Google has introduced new feature Universal Analyticss . The main focus for introducing this feature is to track mobile app uses. Interestingly we can take benefit of this new feature for tracking Open email and other area.

We should follow below basic step for open mail tacking,

  1. Sign up Google analytics and create a GA tracking account
  2. Create HTML encoding email.
  3. Append “<img>” tag at top/bottom of the mail body content. “<img>” tag src will be point to Google Analytics url with required parameter.
  4. Monitoring tacking from Google Analytics dashboard.

First, we can create request URL for tacking data submission. Below is a sample URL with required parameter.

If we hit above URL directly by web browser, GA will receive data under provided GA account ID(tid=your GA ID). All useful Parameters documentation can be found from Measurement Protocol. Below I only described some required parameter for email open tracking task.

v= GA version number

tid=tracking ID / web property ID (Google analytics account ID)

cid= Customer ID(Any unique number)

t= Hit type.

ec= Event Category.

ea= Event Action

el= Event Label

cs= Campaign Source

cm= The campaign medium.

cn= Campaign name

“tid” is Google Analytics account tracking id. It need to be updated by your GA account ID.

cid is generally application user ID or any other unique id. Here is a nice video to understand the reason of this parameter.

Below is GA “Real Time” dashboard live response after hitting above URL directly in browser.

GA Live Tracking

GA Live Tracking

Figure 1: GA Real Time dashboard

Sending HTML Email:

I previously noted that we can track only HTML emaill. So to test our created tracking URL, We have to send a  HTML email. We can use GMAIL to send a test HTML. Outlook is not recommended for HTML email sending. Below tracking “<img>” tag will have to append at top/bottom of the HTML mail content.

<img  src=”

”  />

Here is a helpful video to learn how to send HTML mail using Gmail.

Now GA “Real time” dashboard will show response as soon as email open (figure 1). One special note is that by default Gmail and others mail client trim email body “<img>” tag. So email receiver must have to click “display image” link to show images in mail consent. If user does not click “display image” link, tracking request will NOT send to GA. This is a classic limitation of “one pixel” image tracking technique.

View Analytics tracking report:

By “Real Time” dashboard we can ensure that the tracking is working during development time. GA can be taking some hours to prepare actual GA report. We will get our email tracking report under Traffic “Source>Overview> Campains>open_email_tracking”

GA tracking dashboad

                       Figure 2: Open Email tracking report

Please don’t forget to post comment if you need further assistance:)

Display Numeric Keypad for HTML input box

Tablets and mobile user always expect numeric keypad by default for number text box entry. HTML5 have introduced two text type named “tel” and “number”. All modern browsers supports this new type. Both “tel” and “number” will bring default numeric keypad for tablels like ipad/andoid tab.

<input type=”tel” class=”someclass”/ >
<input type=”number” class=”someclass”/ >

The basic difference between “tel” and “number” is that “tel” type does not allow decimal number, is useful for only telephone number entry.

One problem I have found when I have used this new type in jqery modal. Modal by default setfocus first text and if we use first textbox “tel” or “number” type, ipad does not show numeric keypad.
I do not find any good solution for this. Alternatively we can change default foucs of jqueryui modal to button instead of text box.

open: function(event, ui) {
        $(":button:contains('Submit')").focus(); // 'Submit' is the button text

Javascript string to integer conversion hell
February 9, 2012, 8:57 pm
Filed under: Javascript | Tags: , , ,

Javascript is great. But some over intelligency cause also great problem to the developer.

Here I give some string to integer conversion result in JavaScript,

1.  parseInt(“01”)=>1

2.  parseInt(“05”)=>5

3.  parseInt(“08”)=>0

4.  parseInt(“09”)=>0


Here javascript is giving wrong result for 3 and 4.


When javascript get Zero leading string as parseInt parameter, it takes 8 base system for conversion automatically instead of 10 base.


Simply set the base explicitly like below,

3. parseInt(“08”,10)=>8

4. parseInt(“09”,10)=>9

It is best practice in javascript to set the conversion “base” during any type number parsing.



Get every new post delivered to your Inbox.