Create and Configure AWS Elastic MapReduce (EMR) Cluster


The AWS EMR developer guide describes nicely how to set up and configure a new EMR cluster. Please click here to get the AWS manual. In this post I will emphasize two settings of an EMR cluster that can confuse beginners. Actually, one big reason I pick a topic for my blog is that it is something I tried that did not work the first time.

Key Pair

This setting is optional but very important for an EMR developer. A key pair provides the private key file that is required to connect to the EMR cluster from an SSH client like PuTTY. The key pair file can be created from the AWS EC2 console. Please follow the steps below to create a key pair file:

  • Navigate to EC2 Console > Key Pairs > Create Key Pair
  • Enter a name for the key pair.
  • A file with a .pem extension will then be downloaded automatically.
  • Store this file for future use.

The new key pair name will now appear in the dropdown list under the “Key Pair” section when you create a new EMR cluster. For more information on key pair files, click here.
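The same key pair can also be created from code instead of the console. Below is a minimal sketch, assuming the .NET Framework build of the AWS SDK for EC2 (where the synchronous CreateKeyPair call is available; on .NET Core the CreateKeyPairAsync counterpart would be used instead). The class name KeyPairSketch, the helper names, and the file path are my own placeholders, not something from the AWS manual.

```csharp
using System.IO;
using Amazon.EC2;
using Amazon.EC2.Model;

public static class KeyPairSketch
{
    // Writes the private key text to disk so it can be reused later.
    public static void SaveKeyMaterial(string keyMaterial, string pemPath)
    {
        File.WriteAllText(pemPath, keyMaterial);
    }

    // Creates the key pair in the client's region and saves the .pem file.
    // AWS returns the private key only once, in the creation response.
    public static string CreateAndSaveKeyPair(IAmazonEC2 ec2Client, string keyName, string pemPath)
    {
        var response = ec2Client.CreateKeyPair(new CreateKeyPairRequest { KeyName = keyName });
        SaveKeyMaterial(response.KeyPair.KeyMaterial, pemPath);
        return response.KeyPair.KeyName;
    }
}
```

For example, KeyPairSketch.CreateAndSaveKeyPair(new AmazonEC2Client(), "testemr", "testemr.pem") would create the key pair name that the cluster code later in this post refers to.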

 

EMR IAM Service and Job Flow Role

AWS provides an SDK for EMR. Using the SDK, a new EMR cluster can be created and managed. We require these two IAM roles to create an EMR cluster from code using the AWS SDK. Below I have noted the steps to create these two roles:

IAM Service Role

  • Navigate to IAM console > Roles > New Role
  • Enter a name for the role
  • Select the “Amazon Elastic MapReduce” role type
  • Then attach this policy

IAM Job Flow Role

  • Navigate to IAM console > Roles > New Role
  • Enter a name for the role
  • Select the “Amazon Elastic MapReduce for EC2” role type
  • Then attach this policy
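
The two roles above can also be created from code. This is only a rough sketch, assuming the .NET Framework build of the AWS SDK for IAM (synchronous calls). The class and method names are my own, and the two managed policy ARNs are my assumption of what “this policy” means in the lists above, so please check them against your account before relying on them. Note that when a role is created through the API rather than the console, the instance profile for the job flow role has to be created explicitly.

```csharp
using Amazon.IdentityManagement;
using Amazon.IdentityManagement.Model;

public static class EmrRoleSketch
{
    // Builds a trust policy that lets the given AWS service assume the role.
    public static string BuildTrustPolicy(string servicePrincipal)
    {
        return "{\"Version\":\"2012-10-17\",\"Statement\":[{" +
               "\"Effect\":\"Allow\"," +
               "\"Principal\":{\"Service\":\"" + servicePrincipal + "\"}," +
               "\"Action\":\"sts:AssumeRole\"}]}";
    }

    // Service role: assumed by the EMR service itself.
    public static void CreateServiceRole(IAmazonIdentityManagementService iam, string roleName)
    {
        iam.CreateRole(new CreateRoleRequest
        {
            RoleName = roleName,
            AssumeRolePolicyDocument = BuildTrustPolicy("elasticmapreduce.amazonaws.com")
        });
        iam.AttachRolePolicy(new AttachRolePolicyRequest
        {
            RoleName = roleName,
            // Assumed managed policy; verify it matches the policy you intend to attach.
            PolicyArn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
        });
    }

    // Job flow role: assumed by the EC2 instances of the cluster.
    public static void CreateJobFlowRole(IAmazonIdentityManagementService iam, string roleName)
    {
        iam.CreateRole(new CreateRoleRequest
        {
            RoleName = roleName,
            AssumeRolePolicyDocument = BuildTrustPolicy("ec2.amazonaws.com")
        });
        iam.AttachRolePolicy(new AttachRolePolicyRequest
        {
            RoleName = roleName,
            // Assumed managed policy; verify it matches the policy you intend to attach.
            PolicyArn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
        });
        // EC2 instances receive a role through an instance profile, here given the same name.
        iam.CreateInstanceProfile(new CreateInstanceProfileRequest { InstanceProfileName = roleName });
        iam.AddRoleToInstanceProfile(new AddRoleToInstanceProfileRequest
        {
            InstanceProfileName = roleName,
            RoleName = roleName
        });
    }
}
```

With an AmazonIdentityManagementServiceClient instance named iam, calling EmrRoleSketch.CreateServiceRole(iam, "emrServiceRule") and EmrRoleSketch.CreateJobFlowRole(iam, "EMR_EC2_DefaultRole") would produce the role names used in the cluster code later in this post.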

The steps below are optional, but you can follow them if you are stuck with an AWS security exception during EMR cluster creation from code.

 

Create EMR Cluster using .NET SDK

The prerequisites are:

  • AWS .NET SDK for Core and EMR
  • EMR service and job flow roles
  • S3 bucket
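
using System.Collections.Generic;
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

// EmrClient is assumed to be an AmazonElasticMapReduceClient field of the containing class.
public string CreateEMRCluster()
{
    var stepFactory = new StepFactory();

    // Step 1: turn on EMR debugging so step logs are collected.
    var enableDebugging = new StepConfig
    {
        Name = "Enable debugging",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewEnableDebuggingStep()
    };

    // Step 2: install Hive on the cluster.
    var installHive = new StepConfig
    {
        Name = "Install Hive",
        ActionOnFailure = "TERMINATE_JOB_FLOW",
        HadoopJarStep = stepFactory.NewInstallHiveStep()
    };

    // Two instances in total; the cluster stays alive after the steps finish.
    var instanceConfig = new JobFlowInstancesConfig
    {
        Ec2KeyName = "testemr",               // key pair name created in the EC2 console
        InstanceCount = 2,
        KeepJobFlowAliveWhenNoSteps = true,
        MasterInstanceType = "m3.xlarge",
        SlaveInstanceType = "m3.xlarge"
    };

    var request = new RunJobFlowRequest
    {
        Name = "Hive Interactive",            // cluster name shown in the EMR console
        Steps = new List<StepConfig> { enableDebugging, installHive },
        AmiVersion = "3.8.0",
        LogUri = "s3://test.emr/",            // S3 bucket that receives the cluster logs
        Instances = instanceConfig,
        ServiceRole = "emrServiceRule",       // IAM service role
        JobFlowRole = "EMR_EC2_DefaultRole"   // IAM job flow role for the EC2 instances
    };

    var result = EmrClient.RunJobFlow(request);
    return result.JobFlowId;
}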

The method “CreateEMRCluster” will create an EMR cluster named “Hive Interactive” that uses the “testemr” key pair. It returns the job flow ID, which is needed later if you want to add EMR steps from code.
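
To show where the returned ID is used, here is a small sketch of adding a Hive step to the running cluster. It assumes the .NET Framework build of the AWS SDK for EMR (synchronous calls); the class name EmrStepSketch, the step name, and the script location passed in are hypothetical placeholders of mine.

```csharp
using System.Collections.Generic;
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

public static class EmrStepSketch
{
    // Builds a step that runs a Hive script stored in S3.
    public static StepConfig BuildHiveStep(string scriptS3Path)
    {
        return new StepConfig
        {
            Name = "Run Hive script",
            ActionOnFailure = "CONTINUE",   // keep the cluster alive if the script fails
            HadoopJarStep = new StepFactory().NewRunHiveScriptStep(scriptS3Path)
        };
    }

    // Adds the step to the cluster identified by the job flow ID
    // and returns the IDs of the steps that were created.
    public static List<string> AddHiveStep(IAmazonElasticMapReduce emr, string jobFlowId, string scriptS3Path)
    {
        var response = emr.AddJobFlowSteps(new AddJobFlowStepsRequest
        {
            JobFlowId = jobFlowId,
            Steps = new List<StepConfig> { BuildHiveStep(scriptS3Path) }
        });
        return response.StepIds;
    }
}
```

The jobFlowId argument here is the value returned by CreateEMRCluster, and the script path would point at a Hive script in your own S3 bucket.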

Related Post: Execute Hive Script in AWS EMR

………………………………

Khayer
