SWS3004-Lab Exercise 2

This Lab is about Hadoop and Spark, using AWS EMR and S3.

Exercise 2.1

Run and compare the execution time of WordCount on Wikipedia's dump with both Hadoop MapReduce and Spark. You can use either IaaS or PaaS, but make sure you use the same type of setup for both Hadoop and Spark (e.g., if you use EMR for Hadoop MapReduce, then use EMR for Spark also). You have to use the provided input of size 12 GB. Is there any difference in the programming model and ease of programming? Is there any difference in performance? Please explain it in maximum 3 paragraphs. You can include up to 2 performance plots.

Input dataset address on AWS S3: s3://sws3004-2023/input/enwiki-12GB.xml

You must use this input dataset for both Hadoop MapReduce and Spark.

(Tip: use the entire address s3://sws3004-2023/input/enwiki-12GB.xml as parameter to your MapReduce job)

Part 1. Hadoop MapReduce

First, I create an S3 bucket and upload the files:

S3 bucket

Then I create an EMR cluster, select the S3 bucket created in the previous step as the cluster's S3 folder, and choose m4.large instances with the default count of 3.

EMR

Now I add a step that runs WordCount.jar on the input s3://sws3004-2023/input/enwiki-12GB.xml.

EMR Step
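
WordCount.jar is assumed here to follow the classic Hadoop WordCount example; below is a minimal sketch of what such a job looks like (the class and method names are assumptions, not the JAR's actual source). Note how args[0] can be the full S3 address, as suggested in the tip above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] can be the full S3 path, e.g. s3://sws3004-2023/input/enwiki-12GB.xml
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}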

Once the step finishes, I can check the output in my S3 bucket.

Check the Output

Hadoop MapReduce performance: the job takes 42 minutes in total.

Hadoop MapReduce Performance

Part 2. Spark

I clear the S3 bucket and upload the Spark files:

Spark

Then I create a new EMR cluster for Spark and add a step.

EMR Step
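
For the Spark step, the same WordCount can be expressed much more compactly. Below is a minimal sketch in Java, submitted via spark-submit; the class name SparkWordCount and the argument order are assumptions, not the exact code used in the step.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the 12 GB dump directly from S3; each RDD element is one line of the file.
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split lines into words, map each word to (word, 1), then sum the counts per word.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile(args[1]); // write the results back to S3
        sc.stop();
    }
}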

Spark performance: the job takes 16 minutes in total.

Spark clearly outperforms Hadoop MapReduce on this job (16 minutes versus 42 minutes). One reason is that Hadoop MapReduce writes intermediate data to disk, while Spark keeps intermediate data in memory whenever possible, which reduces I/O time. In addition, MapReduce always sorts the intermediate key-value pairs during its shuffle phase, whereas Spark only sorts during a shuffle when the operation actually requires it (e.g., sortByKey), so its shuffle is usually cheaper.

Is there any difference in the programming model and ease of programming?

Hadoop MapReduce: Hadoop MapReduce is a programming model designed for distributed data processing on large clusters of commodity hardware. It processes data in two user-defined phases, Map and Reduce, with a framework-managed Shuffle and Sort step in between.

  • Map: In the Map phase, the input data is divided into splits, and each split is processed independently by multiple mapper tasks in parallel. The mapper tasks extract key-value pairs from the input data and emit intermediate key-value pairs.
  • Shuffle and Sort: The intermediate key-value pairs emitted by the mappers are shuffled and sorted based on the keys. This step ensures that all values for the same key are grouped together and sent to the same reducer task.
  • Reduce: In the Reduce phase, the sorted and shuffled intermediate data is processed by reducer tasks. Each reducer task processes a subset of the intermediate data, grouped by keys. The reducer tasks aggregate the values associated with each key and produce the final output.

Spark: Spark is a more general distributed data processing engine that extends the MapReduce model with greater versatility and better performance. Spark introduces the concept of Resilient Distributed Datasets (RDDs), which are the fundamental data abstraction in Spark. RDDs are distributed collections of data that can be processed in parallel.

Spark provides a more general programming model compared to Hadoop MapReduce. It supports not only Map and Reduce operations but also various other transformations and actions on RDDs, such as filter, join, groupByKey, reduceByKey and so on. Additionally, Spark offers specialized libraries like Spark SQL for structured data processing, Spark Streaming for real-time data streaming, and MLlib for machine learning tasks.
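
To illustrate how these extra operations compose, here is a small, self-contained sketch (not part of the lab job) that chains filter, mapToPair, and reduceByKey on an RDD; the data and names are purely illustrative.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddOperationsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("RddOperationsSketch"));

        // A tiny in-memory dataset standing in for log lines.
        JavaRDD<String> logs = sc.parallelize(Arrays.asList(
                "ERROR disk full", "INFO started", "ERROR timeout", "INFO done"));

        // filter + mapToPair + reduceByKey chained in a single pipeline,
        // with no need to split the work into separate jobs.
        JavaPairRDD<String, Integer> errorCounts = logs
                .filter(line -> line.startsWith("ERROR"))                 // keep only error lines
                .mapToPair(line -> new Tuple2<>(line.split(" ")[1], 1))   // key by the word after ERROR
                .reduceByKey(Integer::sum);                               // count per error type

        errorCounts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        sc.stop();
    }
}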

Difference in Programming Model and Ease of Programming:

I found a comparison table in the lecture slides:

Comparison

  1. Programming Model:
    • Hadoop MapReduce has a more rigid programming model: data is processed in two fixed phases (Map and Reduce) with a framework-imposed shuffle and sort in between, and more complex computations must be expressed as chains of separate MapReduce jobs.
    • Spark provides a more flexible and expressive programming model with RDDs, allowing users to perform complex operations on distributed data through a wide range of transformations and actions.
  2. Ease of Programming:
    • Spark generally offers better ease of programming due to its high-level APIs and expressive transformations and actions on RDDs. It simplifies the development of distributed data processing applications, and its concise syntax often leads to shorter and more readable code compared to Hadoop MapReduce.
    • Hadoop MapReduce, being more low-level, might require developers to write additional code for tasks like intermediate data serialization and deserialization, which can make the development process more cumbersome.

In summary, Spark provides a more powerful and user-friendly programming model compared to Hadoop MapReduce. Spark's RDDs and higher-level APIs make it easier for developers to write distributed data processing applications, leading to faster development cycles and more efficient data processing.

Exercise 2.2

Write and run on AWS EMR a MapReduce program that computes the total number of followers and followees for each user in a Twitter dataset. The dataset is provided to you in the file twitter_combined.txt taken from http://snap.stanford.edu/data/egonets-Twitter.html. Each line of this file contains two user ids A and B meaning “User A follows User B”. For example, the first line is “214328887 34428380” and it means that “User 214328887 follows User 34428380”.

My code is below:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for a line "A B" (user A follows user B), emit (B, -1) to count B's followers
// and (A, +1) to count A's followees.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] users = value.toString().split(" "); // users[0] follows users[1]
        IntWritable followers = new IntWritable(-1);  // negative marks a follower of users[1]
        IntWritable follows = new IntWritable(1);     // positive marks a followee of users[0]
        context.write(new Text(users[1]), followers);
        context.write(new Text(users[0]), follows);
    }
}

// Reducer: sum the positive values as followees and the negative values as followers.
public static class Reduce extends Reducer<Text, IntWritable, Text, Text> {
    public void reduce(Text user, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int follows = 0;
        int followers = 0;
        for (IntWritable value : values) {
            if (value.get() > 0) {
                follows += value.get();
            } else {
                followers -= value.get();
            }
        }
        // output the result for each user
        context.write(user, new Text(String.format("Followers %d", followers)));
        context.write(user, new Text(String.format("Follows %d", follows)));
    }
}
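
To run this on EMR, the two classes need a driver that configures the job. Below is a minimal driver sketch, assuming the Map and Reduce classes above are nested in a class named FollowerCount; the class name, argument order, and the S3 paths in the comments are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FollowerCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "follower count");
        job.setJarByClass(FollowerCount.class);
        job.setMapperClass(FollowerCount.Map.class);
        job.setReducerClass(FollowerCount.Reduce.class);
        // Map output and final output have different value types, so both must be set
        // explicitly; for the same reason the reducer cannot be reused as a combiner here.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3://<bucket>/twitter_combined.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. s3://<bucket>/output/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}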

Then I create an S3 bucket and upload the files:

S3

Now I can add a step in EMR and view the final output:

Step

Result

So User 214328887 has 628 followers, and follows 951 users.