Cloudera Exam CCA175 Topic 1 Question 31 Discussion

Actual exam question for Cloudera's CCA175 exam

Question #: 31
Topic #: 1

Problem Scenario 30 : You have been given three CSV files in HDFS, as below.

EmployeeName.csv with the fields (id, name)

EmployeeManager.csv with the fields (id, managerName)

EmployeeSalary.csv with the fields (id, salary)

Using Spark and its API, you have to generate a joined output as below and save it as a text file (comma-separated) for final distribution. The output must be sorted by id.

id,name,salary,managerName

EmployeeManager.csv

E01,Vishnu

E02,Satyam

E03,Shiv

E04,Sundar

E05,John

E06,Pallavi

E07,Tanvir

E08,Shekhar

E09,Vinod

E10,Jitendra

EmployeeName.csv

E01,Lokesh

E02,Bhupesh

E03,Amit

E04,Ratan

E05,Dinesh

E06,Pavan

E07,Tejas

E08,Sheela

E09,Kumar

E10,Venkat

EmployeeSalary.csv

E01,50000

E02,50000

E03,45000

E04,45000

E05,50000

E06,45000

E07,50000

E08,10000

E09,10000

E10,10000
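
To make the target output concrete, here is a minimal plain-Scala sketch (no Spark involved) that joins the first two rows of each sample file on id and sorts by id. The names `JoinSketch` and `joinedLines` are hypothetical and used only for illustration; the actual exercise expects the Spark RDD API.

```scala
// Plain-Scala illustration of the expected result; not the Spark solution itself.
object JoinSketch {
  // First two rows of each sample file from the problem statement, keyed by id.
  val names: Map[String, String]    = Map("E02" -> "Bhupesh", "E01" -> "Lokesh")
  val salaries: Map[String, String] = Map("E01" -> "50000", "E02" -> "50000")
  val managers: Map[String, String] = Map("E02" -> "Satyam", "E01" -> "Vishnu")

  // Inner join on id: keep ids present in all three maps, ordered by id,
  // and render each row as id,name,salary,managerName.
  def joinedLines: Seq[String] =
    names.keys.toSeq.sorted.flatMap { id =>
      for {
        salary  <- salaries.get(id)
        manager <- managers.get(id)
      } yield Seq(id, names(id), salary, manager).mkString(",")
    }

  def main(args: Array[String]): Unit = joinedLines.foreach(println)
}
```

For these two rows the sketch yields `E01,Lokesh,50000,Vishnu` followed by `E02,Bhupesh,50000,Satyam`, which is the line format the question asks for.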

A. Solution :
Step 1 : Create all three files in HDFS in a directory called spark1 (we will do this using Hue). However, you can first create them in the local filesystem and then upload them to HDFS.
Step 2 : Load the EmployeeManager.csv file from HDFS and create a pair RDD.
val manager = sc.textFile("spark1/EmployeeManager.csv")
val managerPairRDD = manager.map(x => (x.split(",")(0), x.split(",")(1)))
Step 3 : Load the EmployeeName.csv file from HDFS and create a pair RDD.
val name = sc.textFile("spark1/EmployeeName.csv")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
Step 4 : Load the EmployeeSalary.csv file from HDFS and create a pair RDD.
val salary = sc.textFile("spark1/EmployeeSalary.csv")
val salaryPairRDD = salary.map(x => (x.split(",")(0), x.split(",")(1)))
Step 5 : Join all pair RDDs.
val joined = namePairRDD.join(salaryPairRDD).join(managerPairRDD)
Step 6 : Now sort the joined results.
val joinedData = joined.sortByKey()
Step 7 : Now generate comma-separated data.
val finalData = joinedData.map(v => (v._1, v._2._1._1, v._2._1._2, v._2._2))
Step 8 : Save this output in HDFS as a text file.
finalData.saveAsTextFile("spark1/result.txt")

B. Solution :
Step 1 : Create all three files in HDFS in a directory called spark1 (we will do this using Hue). However, you can first create them in the local filesystem and then upload them to HDFS.
Step 2 : Load the EmployeeManager.csv file from HDFS and create a pair RDD.
val manager = sc.textFile("spark1/EmployeeManager.csv")
val managerPairRDD = manager.map(x => (x.split(",")(0), x.split(",")(1)))
Step 3 : Load the EmployeeSalary.csv file from HDFS and create a pair RDD.
val salary = sc.textFile("spark1/EmployeeSalary.csv")
val salaryPairRDD = salary.map(x => (x.split(",")(0), x.split(",")(1)))
Step 4 : Join all pair RDDs.
val joined = namePairRDD.join(salaryPairRDD).join(managerPairRDD)
Step 5 : Now sort the joined results.
val joinedData = joined.sortByKey()
Step 6 : Now generate comma-separated data.
val finalData = joinedData.map(v => (v._1, v._2._1._1, v._2._1._2, v._2._2))
Step 7 : Save this output in HDFS as a text file.
finalData.saveAsTextFile("spark1/result.txt")

Suggested Answer: A

by Mitsue at Sep 21, 2022, 07:49 PM
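
One detail worth checking in the solution: `saveAsTextFile` writes each element using its `toString`, so saving a 4-tuple produces lines wrapped in parentheses, such as `(E01,Lokesh,50000,Vishnu)`. If plain comma-separated lines are wanted, the joined rows can be turned into strings first. The sketch below is one possible way to do that; `CsvFormat`, `JoinedRow`, `toCsvLine` and `toPair` are hypothetical names that do not appear in the original answer, and `toPair` assumes every input line contains at least one comma.

```scala
// Pure helper functions; they can be passed to RDD.map in a Spark shell.
object CsvFormat {
  // Shape of one element after namePairRDD.join(salaryPairRDD).join(managerPairRDD):
  // (id, ((name, salary), managerName))
  type JoinedRow = (String, ((String, String), String))

  // Flatten the nested tuple into a single comma-separated line.
  def toCsvLine(row: JoinedRow): String = row match {
    case (id, ((name, salary), managerName)) =>
      Seq(id, name, salary, managerName).mkString(",")
  }

  // Split an "id,value" line once instead of calling split twice.
  // Assumption: the line has at least one comma; otherwise fields(1) throws.
  def toPair(line: String): (String, String) = {
    val fields = line.split(",", 2)
    (fields(0), fields(1))
  }
}
```

Under these assumptions, the last two steps of the solution could be written as `joinedData.map(CsvFormat.toCsvLine).saveAsTextFile("spark1/result.txt")`, and each pair RDD could be built with `sc.textFile(...).map(CsvFormat.toPair)`.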
