
Cloudera Exam CCA175 Topic 4 Question 57 Discussion

Actual exam question for Cloudera's CCA175 exam
Question #: 57
Topic #: 4
[All CCA175 Questions]

Problem Scenario 68: You have been given the file below. The file contains some text, as shown.

spark75/file1.txt

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

For a slightly more complicated task, let's look into splitting up sentences from our documents into word bigrams. A bigram is a pair of successive tokens in some sequence. We will look at building bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones.

The first problem is that values in each partition of our initial RDD describe lines from the file rather than sentences, and sentences may be split over multiple lines. The glom() RDD method is used to create a single entry for each document containing the list of all lines; we can then join the lines up and resplit them into sentences using "." as the separator, using flatMap so that every object in our RDD is now a sentence.

A bigram is a pair of successive tokens in some sequence. Please build bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones.
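The pipeline described above (rejoin lines, resplit on ".", build bigrams, count) can be sketched in plain Python. This is a minimal illustration of the logic only; in the actual exam task each step would map onto an RDD operation (glom() to gather lines, flatMap to emit sentences and bigrams, reduceByKey to count), and the sample lines below are placeholders rather than the real file contents.

```python
from collections import Counter

# Lines as they might arrive from the input file: sentences can span
# multiple lines, so we first rejoin them, then resplit on ".".
lines = [
    "Apache Hadoop is an open-source software framework written",
    "in Java. Hadoop splits files into large blocks. All the",
    "modules in Hadoop are designed for failure.",
]

# glom()-style step: collapse the document's lines into one string.
document = " ".join(lines)

# flatMap-style step: split the text into sentences on ".".
sentences = [s.strip() for s in document.split(".") if s.strip()]

# Build bigrams: pairs of successive tokens within each sentence
# (bigrams never cross sentence boundaries).
bigrams = []
for sentence in sentences:
    tokens = sentence.split()
    bigrams.extend(zip(tokens, tokens[1:]))

# reduceByKey-style step: count bigrams and take the most frequent.
top = Counter(bigrams).most_common(3)
print(top)
```

In the RDD version, the per-sentence bigram construction would be a flatMap emitting `((w1, w2), 1)` pairs, followed by `reduceByKey(lambda a, b: a + b)` and a sort by count.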

Suggested Answer: B
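The comment thread below reconstructs a spark-submit invocation piece by piece. As a hedged sketch only: the class name, jar path, and trailing argument are taken from the comments (not from verified answer text), and $SPARK_HOME is assumed to point at the local Spark installation.

```shell
# Hypothetical reconstruction of the command discussed in the comments.
# The class name, jar location, and argument "10" are assumptions drawn
# from the thread, not confirmed answer text.
SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"
CMD="spark-submit --class com.hadoopexam.MyTask --master yarn --deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10"
echo "$CMD"
```

The --deploy-mode cluster flag launches the driver on one of the cluster nodes rather than on the submitting machine, which is the point the thread converges on.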

Contribute your Thoughts:

Cecilia
10 months ago
Got it, so the command should be spark-submit --class com.hadoopexam.MyTask --master yarn --deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10.
upvoted 0 times
...
Eura
10 months ago
And we should specify the jar file path as $SPARK_HOME/lib/hadoopexam.jar.
upvoted 0 times
...
Harris
10 months ago
Yes, that's necessary to launch the driver on one of the cluster nodes.
upvoted 0 times
...
Cecilia
10 months ago
Should we also include --deploy-mode cluster in the command?
upvoted 0 times
...
Eura
10 months ago
Yes, that makes sense. It specifies the main class of the application.
upvoted 0 times
...
Harris
11 months ago
I think I should use --class com.hadoopexam.MyTask in the spark-submit command.
upvoted 0 times
...
Belen
11 months ago
I think B) makes more sense because we need to specify deploy mode
upvoted 0 times
...
Cammy
11 months ago
I agree with Svetlana, A) seems like the right choice
upvoted 0 times
...
Curtis
11 months ago
No, I believe the correct answer is B)
upvoted 0 times
...
Svetlana
11 months ago
I think the answer is A)
upvoted 0 times
...
Twila
1 year ago
Hmm, I'm still a bit unsure. Maybe we should double-check the Spark documentation to be sure?
upvoted 0 times
...
Gail
1 year ago
Good point. I think the 'deploy-mode cluster' option is only necessary if you want to explicitly specify that the driver should run on a cluster node, rather than on your local machine.
upvoted 0 times
...
Ngoc
1 year ago
But wait, doesn't the 'spark-submit' command automatically launch the driver on a cluster node by default? Do we really need the 'deploy-mode cluster' part?
upvoted 0 times
...
Scarlet
1 year ago
Yeah, I agree. The 'deploy-mode cluster' option will ensure the driver runs on a cluster node, which is what the question is asking for.
upvoted 0 times
...
Fatima
1 year ago
Okay, let's think this through. The question says we want to launch the driver on one of the cluster nodes, so the 'deploy-mode cluster' option seems like the right choice.
upvoted 0 times
...
Cornell
1 year ago
Hmm, this question seems straightforward enough, but I'm a bit unsure about the 'deploy-mode' part. Does that mean the driver will run on a cluster node or on my local machine?
upvoted 0 times
Hannah
11 months ago
YYY: --deploy-mode cluster
upvoted 0 times
...
Ena
11 months ago
XXX: --class com.hadoopexam.MyTask
upvoted 0 times
...
...
