Question 1

This question is about reliability and scalability in Hadoop.

(a) Consider the following scenario. While executing a MapReduce job, two replicas of an HDFS block with a replication factor of 3 are marked as corrupt, a NodeManager fails, and the Application Manager also fails. Does this prevent the job from running successfully? For the job to be successful, the results must be the same as they would have been if no failures occurred. Justify your answer.

(b) There are seven stages to a MapReduce job:

• Job setup
• Load split
• Map
• Copy
• Merge
• Reduce
• Write part

Consider a scenario where there are 10,000 mapping tasks, each with a duration of 5 seconds, which yield 20 intermediate keys to the reduce tasks. Each mapping task requires 1 second to access its load split. The reducer tasks have a duration of 30 seconds. Each reducer requires 2 seconds to write its output to HDFS. The job setup time is 20 seconds, and the copy/merge stages have a duration of 90 seconds. Assume the duration of the copy/merge stages does not depend on the number of reducers. Consider a cluster of 1,000 workers. You can assume that the Application Manager is not on one of the workers. What is the minimum execution time of the workload on this cluster? Explain the reasoning behind your figure.

(c) Suppose the duration of the copy/merge stages in the above scenario is altered and can be calculated as:

f(r) = 30r²

where r is the number of reducer workers. Should you use a different number of reducers? Explain why.
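The arithmetic in part (b) can be sanity-checked with a short script. This is a sketch under explicit assumptions, not a model answer: it assumes the phases run strictly in sequence, tasks within a phase run in parallel waves across the workers, and there is one reduce task per intermediate key.

```python
import math

# Timeline sketch for the figures in Question 1(b).
workers = 1_000
map_tasks = 10_000
map_task_time = 1 + 5        # 1 s load-split access + 5 s map work
reduce_tasks = 20            # one reduce task per intermediate key (assumption)
reduce_task_time = 30 + 2    # 30 s reduce work + 2 s HDFS write
setup_time = 20
copy_merge_time = 90

map_waves = math.ceil(map_tasks / workers)        # 10 waves of mappers
reduce_waves = math.ceil(reduce_tasks / workers)  # all 20 reducers fit in 1 wave
total = (setup_time + map_waves * map_task_time
         + copy_merge_time + reduce_waves * reduce_task_time)
print(total)  # 20 + 10*6 + 90 + 1*32 = 202 seconds under these assumptions
```

Whether these assumptions (no overlap between map waves and the copy stage, for example) hold is exactly what the question asks you to reason about.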
Question 2

You have a dataset of tweets annotated with the personality of the author, according to the Myers-Briggs Type Indicator (MBTI for short). MBTI is a personality type system that divides everyone into 16 distinct personality types across 4 axes (introversion, intuition, thinking, perceiving). The dataset is a collection of rows, each one containing the following information for a single user:

userId;;;mbtiType;;;messages

The messages field contains the last 50 tweets the user posted, with each tweet separated by ;;; (3 semicolon characters).

(a) Write a MapReduce program that computes the average length of the tweet messages for each personality type. Use pseudocode for the program specification. You must clearly define the input and output of each of your functions. State in your solution any assumptions made as part of the program, as well as the behaviour of any custom functions you deem necessary. The code flow must be explained, discussing the input and output of each function that has been defined. You may use a diagram to illustrate the overall data flow.

(b) Would the job above benefit from a combiner? Justify your answer.
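As an illustration of the shape such a program might take (not a model answer), here is Python-style pseudocode assuming records arrive as the `userId;;;mbtiType;;;messages` lines described above, with tweets also separated by `;;;`. The function names and the tiny local driver standing in for the shuffle stage are assumptions made for illustration only.

```python
from collections import defaultdict

def map_fn(line):
    # Input record: "userId;;;mbtiType;;;tweet1;;;tweet2;;;..."
    parts = line.split(";;;")
    mbti_type, tweets = parts[1], parts[2:]
    for tweet in tweets:
        yield mbti_type, len(tweet)   # one (type, tweet length) pair per tweet

def reduce_fn(mbti_type, lengths):
    lengths = list(lengths)
    yield mbti_type, sum(lengths) / len(lengths)  # average length per type

def run(lines):
    # Minimal local driver standing in for the shuffle/group-by-key stage.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reduce_fn(key, groups[key]))
```

For example, `run(["u1;;;INTJ;;;hi;;;hello"])` yields `{"INTJ": 3.5}`. Emitting one (key, length) pair per tweet keeps the map function simple; whether an averaging job of this shape can benefit from a combiner is precisely what part (b) asks you to discuss.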
Question 3

You are given a dataset on the Greatest Albums of All Time. It is in CSV (comma-separated value) format and consists of the following fields:

AlbumRanking, ReleaseYear, AlbumName, ArtistName, Genre

An example of this data is as follows:

1, 1967, Sgt. Pepper's Lonely Hearts Club Band, The Beatles, Rock
2, 1966, Pet Sounds, The Beach Boys, Rock
3, 1966, Revolver, The Beatles, Rock
4, 1965, Highway 61 Revisited, Bob Dylan, Folk
5, 1971, What's Going On, Marvin Gaye, Funk

Suppose you execute the following Python Spark job on the dataset:

lines = sc.textFile("hdfs://inputPath")
glines = lines.filter(lambda l: len(l.split(",")) == 5)
fields = glines.map(lambda l: l.split(","))
occurrences = fields.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
results = occurrences.takeOrdered(10, key=lambda x: -x)

We provide for reference the following information about the Python and Spark-specific functions appearing in the program:

filter is a Spark transformation that, for each input element, will either generate the same element as output if the condition expressed in the function is true, or no element at all if it is false.

map is a Spark transformation that, for each input element, generates one output element, whose value is dictated by the function.

reduceByKey is a Spark transformation that takes as.............
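For readers without a Spark cluster at hand, the filter/map/reduceByKey/takeOrdered semantics described above can be mimicked on the sample rows with plain Python collections. This is a sketch only: the choice to count occurrences of the Genre field (index 4) and the variable names are assumptions, not a statement of what the exam's job computes.

```python
import heapq
from collections import Counter

# Sample rows from the question (whitespace around fields removed).
lines = [
    "1,1967,Sgt. Pepper's Lonely Hearts Club Band,The Beatles,Rock",
    "2,1966,Pet Sounds,The Beach Boys,Rock",
    "3,1966,Revolver,The Beatles,Rock",
    "4,1965,Highway 61 Revisited,Bob Dylan,Folk",
    "5,1971,What's Going On,Marvin Gaye,Funk",
]

# filter: keep only well-formed rows with exactly 5 comma-separated fields.
glines = [l for l in lines if len(l.split(",")) == 5]

# map: extract one key per row; here the Genre field (an assumption).
genres = [l.split(",")[4] for l in glines]

# map to (key, 1) then reduceByKey with addition: count occurrences per key.
occurrences = Counter(genres)

# takeOrdered(10, key=-count): the 10 keys with the highest counts first.
results = heapq.nsmallest(10, occurrences.items(), key=lambda x: -x[1])
print(results[0])  # ('Rock', 3)
```

Note one detail worth checking against the listing above: pairing the full split list with 1, as in `(x, 1)`, would not produce a usable key for reduceByKey in real Spark, since lists are unhashable; a single field must be extracted first, as done here.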
I have 4 questions in total.