Problem Scenario 32 : You have given three files as below.
spark3/sparkdir1/file1.txt
spark3/sparkd ir2ffile2.txt
spark3/sparkd ir3Zfile3.txt
Each file contain some text.
spark3/sparkdir1/file1.txt
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework
spark3/sparkdir2/file2.txt
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.
spark3/sparkdir3/file3.txt
his approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking
Now write a Spark code in scala which will load all these three files from hdfs and do the word count by filtering following words. And result should be sorted by word count in reverse order.
Filter words ("a","the","an", "as", "a","with","this","these","is","are","in", "for", "to","and","The","of")
Also please make sure you load all three files as a Single RDD (All three files must be loaded using single API call).
You have also been given following codec
import org.apache.hadoop.io.compress.GzipCodec
Please use above codec to compress file, while saving in hdfs.
Problem Scenario 79 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of products table : (product_id | product categoryid | product_name | product_description | product_prtce | product_image )
Please accomplish following activities.
1. Copy "retaildb.products" table to hdfs in a directory p93_products
2. Filter out all the empty prices
3. Sort all the products based on price in both ascending as well as descending order.
4. Sort all the products based on price as well as product_id in descending order.
5. Use the below functions to do data ordering or ranking and fetch top 10 elements top()
takeOrdered() sortByKey()
Problem Scenario 80 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.products
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of products table : (product_id | product_category_id | product_name | product_description | product_price | product_image )
Please accomplish following activities.
1. Copy "retaildb.products" table to hdfs in a directory p93_products
2. Now sort the products data sorted by product price per category, use productcategoryid colunm to group by category
Problem Scenario 94 : You have to run your Spark application on yarn with each executor 20GB and number of executors should be 50.Please replace XXX, YYY, ZZZ
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
-class com.hadoopexam.MyTask \
xxx\
-deploy-mode cluster \ # can be client for client mode
YYY\
222 \
/path/to/hadoopexam.jar \
1000
Problem Scenario 88 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in solution.
1. It is possible that, same product can be supplied by multiple supplier. Now find each product, its price according to each supplier.
2. Find all the supllier name, who are supplying 'Pencil 3B'
3. Find all the products , which are supplied by ABC Traders.
Olga
8 months agoLarae
9 months agoDaisy
10 months agoSharmaine
10 months ago