Minikube on a Windows machine

As Windows users, we sometimes feel helpless when trying to use community products. Kubernetes is great, but its local test environment, Minikube, is not so great, and it can be very annoying, especially for Windows users. Here I would like to share my experience and the quick solutions I have learned. Like my other posts, this writing is not for experts, but it will save valuable hours for beginners like me.

 

Why we need Minikube

Minikube is a local development environment for Kubernetes. Except for some advanced features like load balancing, it is possible to test Kubernetes on a local PC.

 

What is kubectl

Kubectl is the Kubernetes command-line tool used to connect to a Kubernetes cluster (Minikube here), deploy apps, and manage cluster resources.

 

Install kubectl (Windows)

  • Find the latest version number from the URL below:

       https://storage.googleapis.com/kubernetes-release/release/stable.txt

  • Download the latest exe from the URL below, replacing {version number} with the actual version number:

      https://storage.googleapis.com/kubernetes-release/release/{version number}/bin/windows/amd64/kubectl.exe

     Example for version v1.15.1:

     https://storage.googleapis.com/kubernetes-release/release/v1.15.1/bin/windows/amd64/kubectl.exe

  • (Optional) Create a folder on the C drive, move the exe to that folder, and add the folder to your PATH.

     For more details: https://kubernetes.io/docs/tasks/tools/install-kubectl
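The download URL pattern above can be captured in a tiny helper, purely for illustration (the function name is mine, not part of any kubectl tooling):

```python
BASE_URL = "https://storage.googleapis.com/kubernetes-release/release"

def kubectl_download_url(version: str) -> str:
    """Build the Windows amd64 kubectl download URL for a given
    release version, following the {version number} pattern above."""
    return f"{BASE_URL}/{version}/bin/windows/amd64/kubectl.exe"

print(kubectl_download_url("v1.15.1"))
```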

 

Install Minikube (Windows)

  1. Make sure you have installed Hyper-V.
  2. Make sure Docker is running.
  3. Download minikube-installer.exe from the releases page below. Always download the latest version.

https://github.com/kubernetes/minikube/releases

  4. Add the install folder to the Windows PATH environment variable.
  5. Add a Hyper-V network switch. You can skip this step if you run the cluster on the default switch.

https://medium.com/@JockDaRock/minikube-on-windows-10-with-hyper-v-6ef0f4dc158c

Check the currently active Ethernet adapter using the PowerShell command Get-NetAdapter to configure the switch.

  6. Execute the command below in PowerShell in admin mode.
 
minikube start --vm-driver hyperv --hyperv-virtual-switch "Minikube Virtual Switch"

 

For the default switch, you can omit the virtual switch parameter.

 
minikube start --vm-driver hyperv

 

For a detailed log of Minikube activity, add the log parameter below:

 
minikube start --vm-driver hyperv --alsologtostderr

Any issue?

If you are not super lucky, there is a high chance that you will get an error during minikube start. So stay cool and read the error message carefully. Below I have listed some checks to overcome common errors:

 

  • If the Minikube Hyper-V image has already been created, uncheck dynamic memory for that image in the Windows Hyper-V Manager.
  • Make sure Docker is running. Also make sure you are using PowerShell in admin mode.
  • Always use the latest versions of the Minikube and kubectl tools.
  • If the system shows "waiting for SSH access" for a long time, delete the cluster, delete the private switch, and create them again. Or you can switch the VM to the default switch, then run the minikube start command again.

Cluster delete command:

 
minikube delete -p minikube
minikube start --vm-driver hyperv

Worked before but not today!!

It is a very common issue I have seen. Most of the time Minikube runs into problems when I try to start it after booting my laptop. Below is my to-do list:

 

  • Make sure Docker is running. Also make sure you are using PowerShell in admin mode.
  • Restart the Hyper-V Minikube image from Hyper-V Manager, then execute the minikube start command again.

minikube start

  • Delete the Minikube cluster (the Hyper-V image) and start Minikube again.

minikube delete -p minikube

Important links:

https://kubernetes.io/docs/tasks/tools/install-kubectl/

https://medium.com/@JockDaRock/minikube-on-windows-10-with-hyper-v-6ef0f4dc158c

https://medium.com/@mudrii/kubernetes-local-development-with-minikube-on-hyper-v-windows-10-75f52ad1ed42

https://kubernetes.io/docs/reference/kubectl/cheatsheet

Angular 5 and ASP.NET Core File Upload

There is a lot of help on the web for ASP.NET MVC file upload. But I ran into difficulty when I needed to implement it with Angular 5 and ASP.NET Core. Moreover, I have used the AspNetBoilerplate framework 🙂 So my file uploading technology stack is as below.

Back-end implementation:

A Boilerplate app service does not work for file upload, so I have created a standard controller for the file upload endpoint.

[DisableValidation]
public class TestAppFileUploadController : TestControllerBase
{
    [HttpPost]
    public async Task<IActionResult> UploadFile(IFormFile file)
    {
        if (file == null || file.Length == 0)
            return Content("file not selected");

        var filePath = Path.Combine(Directory.GetCurrentDirectory(), "TempUpload");

        if (!Directory.Exists(filePath))
        {
            Directory.CreateDirectory(filePath);
        }

        var fileUniqueId = Guid.NewGuid().ToString().ToLower().Replace("-", string.Empty);
        var uniqueFileName = $"{fileUniqueId}_{file.FileName}";

        using (var fileStream = new FileStream(Path.Combine(filePath, uniqueFileName), FileMode.Create))
        {
            await file.CopyToAsync(fileStream);
        }

        var result = new
        {
            UploadFileName = uniqueFileName
        };

        return new JsonResult(result);
    }
}

This file upload endpoint accepts a single file. If it needs to accept multiple files, the parameter should be of type List<IFormFile>. The parameter name "file" is important: it must match the name value in the Angular form data. Another important point is that we have applied the [DisableValidation] attribute to the controller (it also works at the action level) to avoid the form data validation of the Boilerplate framework middleware (Ref 2).

ASP.NET Core has removed the "Request.Files" collection from "HttpContext", so collecting the file object directly from "Request" is not possible in .NET Core. Instead, ASP.NET Core suggests adding an "IFormFile" parameter to the action. I think this is absolutely the right move 🙂
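To see why the field name matters, here is a small Python sketch that builds a minimal multipart/form-data body by hand (simplified; real clients such as the PrimeNG uploader do this for you). The part whose name does not match the action parameter is simply never bound:

```python
import uuid

def build_multipart_body(field_name: str, filename: str, content: bytes):
    """Build a minimal multipart/form-data request body by hand.

    The part's name attribute must match the action's parameter name
    ("file" in the controller above); otherwise model binding leaves
    the IFormFile parameter null and the action returns
    "file not selected".
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, head + content + tail
```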

Angular Implementation:

Component HTML template:

<p-fileUpload
    name="myFile[]"
    maxFileSize="1000000000"
    customUpload="true"
    auto="auto"
    (uploadHandler)="myUploader($event)"
    (onUpload)="onUpload($event)"
    (onBeforeSend)="onBeforeSend($event)">
    <ng-template pTemplate="content">
        <ul *ngIf="uploadedFiles.length">
            <li *ngFor="let file of uploadedFiles">{{file.name}} - {{file.size}} bytes</li>
        </ul>
    </ng-template>
</p-fileUpload>

Angular Component:


export class FileUploadComponent extends AppComponentBase {
    uploadUrl: string;
    uploadedFiles: any[] = [];

    constructor(
        injector: Injector,
        private http: Http
    ) {
        super(injector);
        this.uploadUrl = 'http://localhost:22742/TestAppFileUpload/UploadFile';
    }

    myUploader(event): void {
        console.log('My File upload', event);
        if (event.files.length == 0) {
            console.log('No file selected.');
            return;
        }

        const fileToUpload = event.files[0];
        const input = new FormData();
        input.append('file', fileToUpload);

        this.http
            .post(this.uploadUrl, input)
            .subscribe(res => {
                console.log(res);
            });
    }

    // upload completed event
    onUpload(event): void {
        for (const file of event.files) {
            this.uploadedFiles.push(file);
        }
    }

    onBeforeSend(event): void {
        event.xhr.setRequestHeader('Authorization', 'Bearer ' + abp.auth.getToken());
    }
}

The method myUploader is important here. We have used customUpload="true" for the PrimeNG upload, so in "myUploader" we are actually posting the upload request manually. The form data key "file" must match the parameter name of the back-end controller action method.

Reference:

  1. https://devblog.dymel.pl/2016/09/02/upload-file-image-angular2-aspnetcore
  2. forum.aspnetboilerplate.com
  3. https://www.primefaces.org/primeng/#/fileupload

Parsing delimited string in Redshift

SQL Server developers are very familiar with split-string functions. These functions generally parse a delimited string and return a single-column table. Recently SQL Server 2016 introduced the native function "STRING_SPLIT" for this.

Redshift provides the "split_part" function, which returns one part of a delimited string. But developers often like to convert a delimited string into table rows so that they can join against the result. Yes, this is also possible in Redshift by utilizing "split_part".

First, we will have to create a number series. PostgreSQL has a nice function called "generate_series" to generate a series of integer values. Though Redshift was built from PostgreSQL 8.0.2, the "generate_series" function is not fully supported in Redshift. The code below will generate a series of integer values between 0 and 255 (collected from here).

CREATE TEMPORARY TABLE numbers AS (
 SELECT 
 p0.n 
 + p1.n*2 
 + p2.n * POWER(2,2) 
 + p3.n * POWER(2,3)
 + p4.n * POWER(2,4)
 + p5.n * POWER(2,5)
 + p6.n * POWER(2,6)
 + p7.n * POWER(2,7) 
 as num
 FROM 
 (SELECT 0 as n UNION SELECT 1) p0,
 (SELECT 0 as n UNION SELECT 1) p1,
 (SELECT 0 as n UNION SELECT 1) p2,
 (SELECT 0 as n UNION SELECT 1) p3,
 (SELECT 0 as n UNION SELECT 1) p4,
 (SELECT 0 as n UNION SELECT 1) p5,
 (SELECT 0 as n UNION SELECT 1) p6,
 (SELECT 0 as n UNION SELECT 1) p7
);
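The cross join above builds each number from eight binary digits. The same construction can be sketched in Python to see that it covers exactly 0 to 255:

```python
from itertools import product

# Each of the eight cross-joined subqueries (p0..p7) contributes one
# binary digit; the weighted sum p0 + p1*2 + ... + p7*2^7 then
# reconstructs every 8-bit integer exactly once.
numbers = sorted(
    sum(bit * 2 ** i for i, bit in enumerate(bits))
    for bits in product([0, 1], repeat=8)
)

print(len(numbers), numbers[0], numbers[-1])
```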

Suppose we have a delimited string; we create a single-row temp table from it so we can use it in a join easily.

CREATE TEMPORARY TABLE delimatedtagid AS(
 SELECT '32,64,256' as tagidtext
);

Now the code below will convert the delimited text to table rows:

 SELECT 
 TRIM(
    split_part(dti.tagidtext, 
                ',',
               (numbers.num+1)::int)
    ) as tagid
 FROM delimatedtagid dti
 JOIN numbers 
 ON numbers.num <= regexp_count(dti.tagidtext, ',')
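The join logic above can be emulated in plain Python (a sketch, not Redshift itself): num ranges over 0 to the number of commas, and split_part with a 1-based index picks out each piece.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Emulate Redshift's split_part: 1-based part index, empty
    string when the index is out of range."""
    pieces = text.split(delimiter)
    return pieces[part - 1] if 1 <= part <= len(pieces) else ''

tagidtext = '32,64,256'
comma_count = tagidtext.count(',')  # plays the role of regexp_count(tagidtext, ',')
# num runs from 0 to comma_count, and split_part uses num + 1,
# exactly as in the JOIN condition above.
tagids = [split_part(tagidtext, ',', num + 1).strip()
          for num in range(comma_count + 1)]
print(tagids)
```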

 

Complete script:

CREATE TEMPORARY TABLE numbers AS (
 SELECT 
 p0.n 
 + p1.n*2 
 + p2.n * POWER(2,2) 
 + p3.n * POWER(2,3)
 + p4.n * POWER(2,4)
 + p5.n * POWER(2,5)
 + p6.n * POWER(2,6)
 + p7.n * POWER(2,7) 
 as num
 FROM 
 (SELECT 0 as n UNION SELECT 1) p0,
 (SELECT 0 as n UNION SELECT 1) p1,
 (SELECT 0 as n UNION SELECT 1) p2,
 (SELECT 0 as n UNION SELECT 1) p3,
 (SELECT 0 as n UNION SELECT 1) p4,
 (SELECT 0 as n UNION SELECT 1) p5,
 (SELECT 0 as n UNION SELECT 1) p6,
 (SELECT 0 as n UNION SELECT 1) p7
);

CREATE TEMPORARY TABLE delimatedtagid AS(
 SELECT '32,64,256' as tagidtext
);

CREATE TEMPORARY TABLE tagidlist AS( 
 SELECT 
 TRIM(
      split_part(dti.tagidtext, 
                 ',',
                 (numbers.num+1)::int)
    ) as tagid
 FROM delimatedtagid dti
 JOIN numbers 
 ON numbers.num <= regexp_count(dti.tagidtext, ',')
);

select * from tagidlist;

Output is like below:

[tagid]
..................
32
64
256

Install Hadoop in a single-node Vagrant box

Vagrant is a nice tool for developers, especially those who love to play with new technologies. Recently I successfully installed Hadoop in a local Vagrant box. The work was not smooth; I got stuck several times. So the objective of this post is to help people who want to explore Hadoop using Vagrant. Especially for Windows users, Vagrant can be a magnificent choice for learning Hadoop.

1. Prerequisite:

  • Latest version of Vagrant
  • Git
  • Putty

2. Prepare a single-node Vagrant machine:
Create a directory on the Windows machine and then clone the GitHub repository below using Git Bash in that directory:

cmd> git clone https://github.com/khayer117/hadoop-in-vagrant.git

I have created a Vagrant configuration file with the necessary settings to install Hadoop in the guest machine. Hadoop will be installed on a single node (I will write another post for multi-node). Below is the Vagrant node configuration:

  • Ubuntu version: Server 14.04
  • Vagrant box name: ubuntu/trusty64
  • Java: Oracle (Sun) Java 8
  • CPU: 2
  • Memory: 1024 MB
  • Private IP: 192.168.33.50

Now it is time to bring the Vagrant machine up. Go to the project directory, open a Windows command prompt there, and run the command below:

cmd> vagrant up hnname

As we are using Vagrant on a Windows machine, we will have to connect to the Vagrant guest using an SSH client. I prefer to use PuTTY. The Vagrant command below displays the SSH connection information:

cmd> vagrant ssh hnname

3. Configure Ubuntu SSH server:
An SSH server is already pre-installed in the Vagrant Ubuntu guest. We will have to configure the SSH server for Hadoop because Hadoop manages distributed nodes over SSH. Here we create an SSH key. As we are preparing a local development environment, we can leave the SSH passphrase blank.

$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

4. Downloading Hadoop:
I have used hadoop-2.7.2.tar.gz for this article. Hadoop will be installed in the /usr/local/hadoop folder, but this is optional; Hadoop is fine to install in another location. Hadoop will be installed under the default "vagrant" user, though a dedicated user for Hadoop is recommended.

$ wget http://apache.mirrors.pair.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
$ tar -zxvf hadoop-2.7.2.tar.gz
$ sudo cp -r hadoop-2.7.2 /usr/local/hadoop

5. Configure Ubuntu bash:
The Java home directory and Hadoop base path will have to be set in the Ubuntu bash profile. Below are the steps to modify the file:

# open .bashrc in the vi editor
$ sudo vi $HOME/.bashrc

# append the lines below to the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# save and exit from vi (Esc, then :wq!)
# reload bash
$ exec bash

6. Disable IPv6:
Hadoop does not support IPv6. In the Vagrant Ubuntu guest, IPv6 is enabled by default, so IPv6 support will have to be disabled following the instructions below.

6.1 Modify Hadoop Env setting:

# Edit hadoop env file
$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# modify the HADOOP_OPTS value
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

# Save and exit vi

 

6.2 Modify Ubuntu network setting to disable IPV6

# Modify sysctl.conf
$ sudo vi /etc/sysctl.conf

# Add below IPv6 configuration
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

# Save and exit vi

# Reload sysctl.conf configuration
$ sudo sysctl -p

# check IPv6 status. A value of 1 means IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

7. Ubuntu hosts (/etc/hosts) file entry:
Hadoop does not require any special entry in the hosts file. But for better understanding, below I have added a working copy of my hosts file:

127.0.0.1	hnname	hnname
127.0.0.1 localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

8. Configure Hadoop site settings:
In this part I had difficulty due to Vagrant. I tried configuring the sites using the host name "hnname" or "localhost"; unfortunately I did not succeed in bringing up the Hadoop components properly with those host names. So I have configured the sites using the default IP "0.0.0.0".

Hadoop stores configuration files under the /usr/local/hadoop/etc/hadoop directory. Each configuration file will have to be opened using vi; add the corresponding settings, then save the file.

8.1 core-site.xml file:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:10001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>

8.2 mapred-site.xml file:

# Create mapred-site.xml from the default template
$ sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

# Edit mapred-site.xml
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

# add the setting below
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>0.0.0.0:10002</value>
  </property>
</configuration>

8.3 hdfs-site.xml file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hdfs</value>
  </property>
</configuration>

9. Create a temporary directory for Hadoop (matching hadoop.tmp.dir above):

$ sudo mkdir /usr/local/hadoop/tmp
$ sudo chown vagrant /usr/local/hadoop/tmp

# Set folder permissions
$ sudo chmod 750 /usr/local/hadoop/tmp

10. Create the data folder for the data node (matching dfs.datanode.data.dir above):

$ sudo mkdir /usr/local/hadoop/hdfs
$ sudo chown vagrant /usr/local/hadoop/hdfs
$ sudo chmod 750 /usr/local/hadoop/hdfs

The configuration is done here. Now it is time to test Hadoop. Every process should start smoothly. We will have to check the logs (/usr/local/hadoop/logs) if any unexpected issue is raised.

11. Starting services:

# Format the name node. This is required only once. This will clean up the HDFS data folder.
$ hdfs namenode -format

# Start the HDFS and YARN services.
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

# If everything is OK, the Java processes below will be running
$ jps

# dfs process
2034 DataNode
2263 SecondaryNameNode
1887 NameNode

# yarn process
2441 ResourceManager
2586 NodeManager

12. Test from the browser:
Hadoop has a basic web UI to view and track activities. After starting all the Hadoop processes, the Hadoop sites can be viewed from the Windows machine's browser:
http://192.168.33.50:50070

If the site does not display in the browser, try to test the site using telnet from Ubuntu. If telnet connects successfully, then the Windows machine's browser should connect to the Hadoop web UI successfully.

$ telnet 192.168.33.50 50070

Please note that 192.168.33.50 is the private IP of the Vagrant machine, which is configured in the Vagrant config file.

HDFS (NameNode) site:
http://192.168.33.50:50070

YARN ResourceManager (job tracker) site:

http://192.168.33.50:8088

Important ports:

http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/

13. Running a job:

I would like to create a word count job in Hadoop to count the words in a source text file. A sample source file will be downloaded from textfiles.com. The job will execute the MapReduce examples jar, which is already in the installation path.

$ cd /usr/local/hadoop

# make a directory for the sample data and download test data from textfiles.com
$ mkdir -p sampledata/science
$ cd sampledata/science
$ wget http://www.textfiles.com/science/ast-list.txt

# Create a directory in dfs and put sample data
$ hdfs dfs -mkdir /project01
$ hdfs dfs -put /usr/local/hadoop/sampledata /project01

# view the DFS file list. You can also view the data from the NameNode web site (http://192.168.33.50:50070/explorer.html).
$ hdfs dfs -ls /

# execute example job for word count.
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /project01/sampledata/science /project01/sampledata/science/output

# The command below shows the output.
$ hdfs dfs -cat /project01/sampledata/science/output/part-r-00000

# stopping services
$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh
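The map/reduce logic of the wordcount example jar can be sketched in plain Python (a simplification: tokens split on whitespace, and everything runs in one process rather than distributed over HDFS):

```python
from collections import Counter

def word_count(lines):
    """Sketch of the MapReduce wordcount: the map phase emits a
    (word, 1) pair per whitespace-separated token, and the reduce
    phase sums the counts per word."""
    pairs = ((word, 1) for line in lines for word in line.split())
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count(["big data is big", "data is data"]))
```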

A working copy of the Hadoop and Ubuntu configuration files has been added to the git "config-files" folder. Hadoop is a highly configurable distributed system. The settings described above are the minimum level to run Hadoop in a single-node cluster. On production servers, Hadoop is generally installed in a multi-node cluster.

Execute Hive Script in AWS Elastic MapReduce (EMR)

There are three ways we can execute a Hive script in EMR:

  • EMR cluster console
  • PuTTY or some other SSH client
  • Using your own code (Python, Java, Ruby, or .NET)

 

Below I have written a Hive script that will export data from DynamoDB to S3. Before running this script, you will have to create a DynamoDB table and an S3 bucket for the export file.

CREATE EXTERNAL TABLE ddbmember (id bigint,name string,city string,state string,age bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "memberinfo",
"dynamodb.column.mapping" = "id:id,name:name,city:city,state:state,age:age"); 
 
CREATE EXTERNAL TABLE s3member (id int,name string,city string,state string,age int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://test.emr/export/'; 
 
INSERT OVERWRITE TABLE s3member SELECT * 
FROM ddbmember;

drop table ddbmember;
drop table s3member; 

First, we create an external table for the DynamoDB table. The "id" field must have the same data type as the DynamoDB table's hash key (numeric type). Then we create an external table over the export S3 bucket. Finally, the "INSERT OVERWRITE" instruction exports the full DynamoDB table to the S3 bucket.

 

The Hive script file will have to be uploaded to an S3 bucket before continuing with the next sections. Below I have described the three ways of executing the Hive script.

EMR Console

Follow the steps below:

  • Navigate to EMR console > Cluster List > a waiting EMR cluster
  • Create a new Step.
  • Enter the S3 location of the script
  • Create

AWS will execute the script automatically and will report progress in the cluster console.

 

PuTTY

Using the PuTTY client we can connect to the EMR instance directly and execute Hive scripts just as with a traditional database. The article below describes how to configure PuTTY:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html

The article below describes how to connect PuTTY to Hive on an EMR cluster.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-ssh.html

 

If I summarize the AWS developer notes, the steps are below:

  • Using puttygen.exe, create a private key from the Key Pair .pem file. Please make sure that your EMR cluster was created with the same Key Pair file. PuTTYgen will create a .PPK file. Follow this post for details.
  • Under the Session tab, enter the EMR cluster master node URL. Add the "hadoop@" prefix to the URL.
  • Under Connection > SSH > Auth, load the .PPK file
  • Under Tunnels, add the info below:
    • Destination: master node URL:8888
    • Source port: 8157
  • After adding the tunnel, click Open
  • After connecting to EMR, type hive.
  • Execute the Hive script using the Hive console.

 

.NET SDK

We can create an EMR Step that runs a script file using the AWS .NET SDK. The items below are prerequisites:

  • AWS .NET SDK – Core and EMR
  • An EMR cluster instance
  • An S3 bucket for the script

Below are the implementation steps:

  • Create a new EMR cluster just to execute this script, or use an existing running EMR instance. You will have to collect the Job Flow Id of the running instance from EMR console > EMR Cluster List
  • Create a step
  • Attach the script to the step
  • Wait for the EMR step execution to complete
  • Terminate the EMR cluster if required.

Below is the .NET implementation:

public void RunHiveScriptStep(string activeWaitingJobFlowId, string scriptS3Location, bool isTerminateCluster)
{
    try
    {
        if (!string.IsNullOrEmpty(activeWaitingJobFlowId))
        {
            StepFactory stepFactory = new StepFactory(RegionEndpoint.EUWest1);
            StepConfig runHiveScript = new StepConfig()
            {
                Name = "Run Hive script",
                HadoopJarStep = stepFactory.NewRunHiveScriptStep(scriptS3Location),
                ActionOnFailure = "TERMINATE_JOB_FLOW"
            };
            AddJobFlowStepsRequest addHiveRequest = new AddJobFlowStepsRequest(activeWaitingJobFlowId, new List<StepConfig>() { runHiveScript });
            AddJobFlowStepsResponse addHiveResponse = EmrClient.AddJobFlowSteps(addHiveRequest);
            List<string> stepIds = addHiveResponse.StepIds;
            String hiveStepId = stepIds[0];

            DescribeStepRequest describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
            DescribeStepResponse describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
            Step hiveStep = describeHiveStepResult.Step;
            StepStatus hiveStepStatus = hiveStep.Status;
            string hiveStepState = hiveStepStatus.State.Value.ToLower();
            bool failedState = false;
            StepTimeline finalTimeline = null;
            while (hiveStepState != "completed")
            {
                describeHiveStepRequest = new DescribeStepRequest() { ClusterId = activeWaitingJobFlowId, StepId = hiveStepId };
                describeHiveStepResult = EmrClient.DescribeStep(describeHiveStepRequest);
                hiveStep = describeHiveStepResult.Step;
                hiveStepStatus = hiveStep.Status;
                hiveStepState = hiveStepStatus.State.Value.ToLower();
                finalTimeline = hiveStepStatus.Timeline;
                Console.WriteLine(string.Format("Current state of Hive script execution: {0}", hiveStepState));
                switch (hiveStepState)
                {
                    case "pending":
                    case "running":
                        Thread.Sleep(10000);
                        break;
                    case "cancelled":
                    case "failed":
                    case "interrupted":
                        failedState = true;
                        break;
                }
                if (failedState)
                {
                    break;
                }
            }
            if (finalTimeline != null)
            {
                Console.WriteLine(string.Format("Hive script step {0} created at {1}, started at {2}, finished at {3}",
                    hiveStepId, finalTimeline.CreationDateTime, finalTimeline.StartDateTime, finalTimeline.EndDateTime));
            }

            if (isTerminateCluster)
            {
                TerminateJobFlowsRequest terminateRequest =
                    new TerminateJobFlowsRequest(new List<string> { activeWaitingJobFlowId });
                TerminateJobFlowsResponse terminateResponse = EmrClient.TerminateJobFlows(terminateRequest);
            }
        }
        else
        {
            Console.WriteLine("No valid job flow could be created.");
        }
    }
    catch (AmazonElasticMapReduceException emrException)
    {
        Console.WriteLine("Hive script execution step has failed.");
        Console.WriteLine("Amazon error code: {0}",
            string.IsNullOrEmpty(emrException.ErrorCode) ? "None" : emrException.ErrorCode);
        Console.WriteLine("Exception message: {0}", emrException.Message);
    }
}

The method "RunHiveScriptStep" expects three parameters:

  • activeWaitingJobFlowId: the Job Flow Id of the running instance. You can collect this ID from the EMR console
  • scriptS3Location: the S3 location of the script file
  • isTerminateCluster: whether to terminate the cluster after execution.

AWS provides SDKs for some other languages like Python, Java, and Ruby. You can implement the same thing in other programming languages.

Related Post: CREATE AND CONFIGURE AWS ELASTIC MAPREDUCE (EMR) CLUSTER

Reference: Using Amazon Elastic MapReduce with the AWS.NET API Part 4: Hive basics with Hadoop

…………………….

Khayer

Create and Configure AWS Elastic MapReduce (EMR) Cluster

The AWS EMR developer guide has nicely described how to set up and configure a new EMR cluster. Please click here to get the AWS manual. In this writing I will emphasize two settings of an EMR cluster that can confuse beginners. Actually, one big reason for selecting a topic on my blog is that it is something I tried but that did not work the first time.

Key Pair

This setting is optional but very important for EMR developers. A Key Pair is an encrypted key file which is required to connect to EMR from an SSH client like PuTTY. The Key Pair file can be created from the AWS EC2 console. Please follow the steps below to create a Key Pair file:

  • Navigate to EC2 Console > Key Pairs > Create Key Pair
  • Enter a name for the key pair.
  • A file with a .pem extension will then be downloaded automatically
  • Store this file for future use.

Now you will see the created "Key Pair" name in the dropdown list under the "Key Pair" section when creating a new EMR cluster. For more information on Key Pair files, click here.

 

EMR IAM Service and Job Flow Roles

AWS provides an SDK for EMR. Using the SDK, a new EMR cluster can be created and managed. We require these two IAM roles to create an EMR cluster from code using the AWS SDK. Below I have noted the steps to create the two roles:

IAM Service Role

  • Navigate to IAM console > Roles > New Role
  • Enter a name for the role
  • Select the "Amazon Elastic MapReduce" role type
  • Then attach this policy

IAM Job Flow Role

  • Navigate to IAM console > Roles > New Role
  • Enter a name for the role
  • Select the "Amazon Elastic MapReduce for EC2" role type
  • Then attach this policy

The steps below are optional, but you can follow them if you are stuck with an AWS security exception during EMR cluster creation from code.

 

Create EMR Cluster using .NET SDK

Below are the prerequisites:

  • AWS .NET SDK for Core and EMR
  • EMR service and Job Flow roles
  • An S3 bucket

public string CreateEMRCluster()
{
    var stepFactory = new StepFactory();

    var enabledebugging = new StepConfig
    {
        Name = "Enable debugging",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewEnableDebuggingStep()
    };

    var installHive = new StepConfig
    {
        Name = "Install Hive",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewInstallHiveStep()
    };

    var instanceConfig = new JobFlowInstancesConfig
    {
        Ec2KeyName = "testemr",
        InstanceCount = 2,
        KeepJobFlowAliveWhenNoSteps = true,
        MasterInstanceType = "m3.xlarge",
        SlaveInstanceType = "m3.xlarge"
    };

    var request = new RunJobFlowRequest
    {
        Name = "Hive Interactive",
        Steps = { enabledebugging, installHive },
        AmiVersion = "3.8.0",
        LogUri = "s3://test.emr/",
        Instances = instanceConfig,
        ServiceRole = "emrServiceRule",
        JobFlowRole = "EMR_EC2_DefaultRole"
    };

    var result = EmrClient.RunJobFlow(request);
    return result.JobFlowId;
}

The method "CreateEMRCluster" will create an EMR cluster using the EC2 key pair "testemr". It returns the "Job Flow Id", which is useful if you later want to create an EMR Step from code.

Related Post: Execute Hive Script in AWS EMR

………………………………

Khayer

SQL Server to Redshift Data Migration

Data migration is not always as smooth as copy-paste. The complexity peaks when the data volume is very large and the error tolerance is minimal. In this article I am sharing some of my experience of migrating data to Redshift. Hope this will help some people.

The basic working steps are below:

  1. Export a delimited flat file from SQL Server
  2. Upload the source file to S3
  3. Execute the COPY command

1. Export CSV from SQL Server
This can be done in many ways. But you must be careful and will have to know some restrictions of Redshift before preparing the source file. Below I have listed some common considerations:

  • Recheck your data for any junk characters. The Redshift COPY command will fail if there is any unsupported character in the source file. Data cleanup is recommended for a successful COPY command.

  • Redshift does not handle datetime the way SQL Server does. The Redshift "date" type stores only the date part. If the time part is important to you, datetime data will have to be transformed to timestamp.

Here, I prefer two methods to export a flat file from SQL Server.

1.1 SQL Server Export Wizard

Log in using SQL Server Management Studio; you will find Export Data under the Tasks menu by right-clicking on the database. Please do not forget to select UTF-8 as the destination file encoding.

1.2 BCP

BCP is an excellent technique to export a flat file. Below I give a sample BCP command, which will have to be run in a command prompt. Data will be exported to a tab-delimited file.

cmd> bcp "SELECT * FROM [database].dbo.[tablename]" queryout D:\data.txt -S [HostIP] -U [DbUser] -P [DbPassword] -c

2. Upload CSV to S3
Gzip the file and then upload it to S3 using the AWS console.
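The gzip step can be sketched in Python (the function name and paths are only illustrative):

```python
import gzip
import shutil

def gzip_file(src_path: str, dest_path: str) -> None:
    """Compress an exported flat file with gzip before uploading to
    S3; the COPY ... GZIP option expects a gzip-compressed object."""
    with open(src_path, 'rb') as src, gzip.open(dest_path, 'wb') as dest:
        shutil.copyfileobj(src, dest)

# Usage (paths are only examples):
# gzip_file(r'D:\data.txt', r'D:\data.txt.gz')
```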

3. Execute the COPY command
Now build your COPY command and execute it using a Redshift client. Below I give a basic COPY command. You will find plenty of material in the Redshift documentation to build your own custom COPY command.

COPY tablename
FROM 's3://data.txt'
CREDENTIALS 'aws_access_key_id=[key];aws_secret_access_key=[key]'
DELIMITER '\t' GZIP IGNOREHEADER 1

You may not be lucky enough to succeed with the COPY command on the first attempt :-). If you get an error, inspect the Redshift error log using the query below; after fixing the problem, execute the COPY command again.

select *
from stl_load_errors
order by starttime desc limit 100

The COPY command will be terminated in the analyzing phase if there is any error in the source data. This means no data will be imported if there is even a single error in the source data. But in some cases you may need to import data while ignoring invalid rows; then add "MAXERROR 1000" at the end of the COPY command. Actually, MAXERROR can be helpful for identifying invalid rows, because Redshift will log all ignored rows in stl_load_errors. So after identifying and fixing the errors, the COPY command should be executed without "MAXERROR 1000".

COPY tablename
FROM 's3://testdata/data.csv'
CREDENTIALS 'aws_access_key_id=[key];aws_secret_access_key=[key]'
DELIMITER ','
GZIP IGNOREHEADER 1 MAXERROR 1000

Best wishes !!
Khayer, Bangladesh