Cloudera Exam CCA175 Topic 5 Question 67 Discussion

Actual exam question for Cloudera's CCA175 exam

Question #: 67
Topic #: 5

[All CCA175 Questions]

Problem Scenario 32 : You have given three files as below.

spark3/sparkdir1/file1.txt

spark3/sparkd ir2ffile2.txt

spark3/sparkd ir3Zfile3.txt

Each file contain some text.

spark3/sparkdir1/file1.txt

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework

spark3/sparkdir2/file2.txt

The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

spark3/sparkdir3/file3.txt

his approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking

Now write a Spark code in scala which will load all these three files from hdfs and do the word count by filtering following words. And result should be sorted by word count in reverse order.

Filter words ("a","the","an", "as", "a","with","this","these","is","are","in", "for", "to","and","The","of")

Also please make sure you load all three files as a Single RDD (All three files must be loaded using single API call).

You have also been given following codec

import org.apache.hadoop.io.compress.GzipCodec

Please use above codec to compress file, while saving in hdfs.

ASolution :
Step 1 : Create all three files in hdfs (We will do using Hue). However, you can first create in local filesystem and then upload it to hdfs.
Step 2 : Load content from all files.
val content = sc.textFile('spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt') //Load the text file
Step 3 : Now create split each line and create RDD of words.
val flatContent = content.flatMap(word=>word.split(' '))
step 4 : Remove space after each word (trim it)
val trimmedContent = f1atContent.map(word=>word.trim)
Step 5 : Create an RDD from remove, all the words that needs to be removed.
val removeRDD = sc.parallelize(List('a','theM,ManM, 'as', 'a','with','this','these','is','are'\'in'\ 'for', 'to','and','The','of'))
Step 6 : Filter the RDD, so it can have only content which are not present in removeRDD. val filtered = trimmedContent.subtract(removeRDD}
Step 7 : Create a PairRDD, so we can have (word,1) tuple or PairRDD. val pairRDD = filtered.map(word => (word,1))
Step 8 : Now do the word count on PairRDD. val wordCount = pairRDD.reduceByKey(_ + _)
Step 9 : Now swap PairRDD.
val swapped = wordCount.map(item => item.swap)
Step 10 : Now revers order the content. val sortedOutput = swapped.sortByKey(false)
Step 11 : Save the output as a Text file. sortedOutput.saveAsTextFile('spark3/result')
Step 12 : Save compressed output.
import org.apache.hadoop.io.compress.GzipCodec
sortedOutput.saveAsTextFile('spark3/compressedresult', classOf[GzipCodec])

BSolution :
Step 1 : Create all three files in hdfs (We will do using Hue). However, you can first create in local filesystem and then upload it to hdfs.
Step 2 : Load content from all files.
val content = sc.textFile('spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt') //Load the text file
Step 3 : Now create split each line and create RDD of words.
val flatContent = content.flatMap(word=>word.split(' '))
step 4 : Remove space after each word (trim it)
val trimmedContent = f1atContent.map(word=>word.trim)
Step 5 : Create an RDD from remove, all the words that needs to be removed.
val removeRDD = sc.parallelize(List('a','theM,ManM, 'as', 'a','with','this','these','is','are'\'in'\ 'for', 'to','and','The','of'))
Step 6 : Filter the RDD, so it can have only content which are not present in removeRDD. val filtered = trimmedContent.subtract(removeRDD}
Step 7 : Create a PairRDD, so we can have (word,1) tuple or PairRDD. val pairRDD = filtered.map(word => (word,1))
Step 8 : Now do the word count on PairRDD. val wordCount = pairRDD.reduceByKey(_ + _)
Step 9 : Now swap PairRDD.
val swapped = wordCount.map(item => item.swap)

Show Suggested Answer

Suggested Answer: A

by Matthew at Aug 02, 2024, 05:51 PM

Limited Time Offer

25%

11 months ago

I think the correct answer is A. The conf parameter is used to pass the Spark-related configurations, and the provided example shows the correct usage of the extra Java options.

upvoted 0 times

Shawnda

10 months ago

Yes, you are right. Option A correctly specifies the extra Java options needed for the Spark application.

upvoted 0 times

...

Cortney

10 months ago

I think the correct answer is A. The conf parameter is used to pass the Spark-related configurations, and the provided example shows the correct usage of the extra Java options.

upvoted 0 times

...