How to create an EMR job with multiple inputs using the ruby client.

I’m using the ruby client to launch Hadoop jobs on Amazon’s Elastic MapReduce framework. Things went very nicely until I tried scripting a job which draws input from multiple buckets. You can’t use the ‘-input’ option twice, and the advice around the internet is to use --args. So, I added:
--args -input,s3n://SomeOtherBucket
to the end of my command line and was displeased to see this option completely omitted from my job.

Grepping the source for ‘--args’ brings up:
commands.parse_options(step_commands + ["--bootstrap-action", "--stream"], [
[ ArgsOption, "--args ARGS", "A command separated list of arguments to pass to the step" ],
[ ArgOption, "--arg ARG", "An argument to pass to the step" ],
[ OptionWithArg, "--step-name STEP_NAME", "Set name for the step", :step_name ],
[ OptionWithArg, "--step-action STEP_ACTION", "Action to take when step finishes. One of CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE", :step_action ],
])

This shows my problem: --arg and --args are associated with a specific step in the job, and as such have to follow the --bootstrap-action or --stream option they belong to. They can’t be tacked on to the end.

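While on the subject, be wary of using --args: its value is treated as a comma-separated list, so an argument that itself contains a comma does not come through intact. Passing such an argument with --arg, which takes a single argument, should avoid the problem.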
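To make the comma problem concrete, here is a minimal sketch. It assumes that --args does nothing smarter than split its value on commas, which is what its help text suggests; I have not traced the client's actual parsing. The helper names split_args and as_arg_options and the bucket path are made up for illustration and are not part of the client.

```ruby
# Hypothetical model of how --args might break up its value:
# assumes a plain split on commas, with no escaping.
def split_args(value)
  value.split(",")
end

# Build repeated --arg options instead, one per argument,
# so a comma inside an argument is left alone.
# Assumes --arg accepts a value starting with a dash, as --args does.
def as_arg_options(arguments)
  arguments.flat_map { |argument| ["--arg", argument] }
end

# Made-up path containing a comma.
broken = split_args("-input,s3n://SomeBucket/part-a,part-b")
# Under the assumption above, the path is cut into two separate arguments.
puts broken.inspect

safe = as_arg_options(["-input", "s3n://SomeBucket/part-a,part-b"])
puts safe.inspect
```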

Finally, you need to include --input even if you are also using --arg/--args. Otherwise you get the default wordcount input. So my final command line looked something like:

~/amazon/elastic-mapreduce --create --stream --args -input,"s3n://SomeSecondInput" --enable-debugging --num-instances 16 --master-instance-type c1.xlarge --slave-instance-type c1.xlarge --name "Script Name" --mapper "s3n://MyBucket/map.rb" --reducer "s3n://MyBucket/reduce.rb" --log-uri "s3n://MyBucket/logs" --output "s3n://MyBucket/output" --input "s3n://FirstInput" --bootstrap-action "s3n://MyBucket/bootstrap.sh" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
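If you script these launches, the ordering rules can be kept in one place. The following is only a sketch of how I might assemble the same command from Ruby: streaming_command is a hypothetical helper, not part of the client, and it assumes the elastic-mapreduce script is on your PATH. It puts --args directly after --stream, always emits --input, and refuses inputs containing commas because of the --args caveat.

```ruby
require "shellwords"

# Hypothetical helper: assembles an elastic-mapreduce streaming command
# with the extra inputs attached to the --stream step.
def streaming_command(first_input:, extra_inputs:, options: {})
  command = ["elastic-mapreduce", "--create", "--stream"]

  # --args is comma separated, so an input containing a comma cannot be passed.
  extra_inputs.each do |input|
    raise ArgumentError, "comma in input: #{input}" if input.include?(",")
  end

  # Step arguments go right after --stream so they belong to that step.
  pairs = extra_inputs.flat_map { |input| ["-input", input] }
  command += ["--args", pairs.join(",")] unless pairs.empty?

  # --input is still required; without it the default wordcount input is used.
  command += ["--input", first_input]

  # Remaining options are appended as "--flag value" pairs.
  options.each { |flag, value| command += ["--#{flag}", value] }
  command
end

command = streaming_command(
  first_input: "s3n://FirstInput",
  extra_inputs: ["s3n://SomeSecondInput"],
  options: {
    "mapper"  => "s3n://MyBucket/map.rb",
    "reducer" => "s3n://MyBucket/reduce.rb",
    "output"  => "s3n://MyBucket/output"
  }
)

# Print a shell-escaped command line that can be pasted into a terminal.
puts Shellwords.join(command)
```

Bootstrap actions and instance settings are left out to keep the sketch short; they would be appended after the step options in the same way as in the command above.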

The last bootstrap option is magic, btw.
