From 1c7defe1c3cf96d0b12a98f79a86e5a9cfdcb73b Mon Sep 17 00:00:00 2001
From: Olivier Blanvillain <olivier.blanvillain@epfl.ch>
Date: Sun, 9 May 2021 14:00:03 +0200
Subject: [PATCH] Add labs/lab8-wikipedia

---
 labs/lab8-wikipedia/README.md | 177 ++++++++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 labs/lab8-wikipedia/README.md

diff --git a/labs/lab8-wikipedia/README.md b/labs/lab8-wikipedia/README.md
new file mode 100644
index 0000000..707cd13
--- /dev/null
+++ b/labs/lab8-wikipedia/README.md
@@ -0,0 +1,177 @@
+# Wikipedia
+
+Use the following commands to make a fresh clone of your repository:
+
+```
+git clone -b wikipedia git@gitlab.epfl.ch:lamp/student-repositories-s21/cs206-GASPAR.git cs206-wikipedia
+```
+
+## Useful links
+
+  * [The API documentation of Spark](http://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html)
+  * [The API documentation of the Scala standard library](https://www.scala-lang.org/files/archive/api/2.13.4)
+  * [The API documentation of the Java standard library](https://docs.oracle.com/en/java/javase/15/docs/api/index.html)
+
+**If you have issues with the IDE, try [reimporting the
+build](https://gitlab.epfl.ch/lamp/cs206/-/blob/master/labs/example-lab.md#ide-features-like-type-on-hover-or-go-to-definition-do-not-work),
+if you still have problems, use `compile` in sbt instead.**
+
+## Introduction
+
+For this assignment, you will need to download the wikipedia dataset (68 MB):
+
+[http://alaska.epfl.ch/~dockermoocs/bigdata/wikipedia-grading.dat](http://alaska.epfl.ch/~dockermoocs/bigdata/wikipedia-grading.dat)
+
+and place it in the folder: `src/main/resources/wikipedia` in your
+project directory.
+
+In this assignment, you will get to know Spark by exploring full-text Wikipedia
+articles.
+
+Gauging how popular a programming language is important for companies judging
+whether or not they should adopt an emerging programming language. For that reason,
+industry analyst firm RedMonk has bi-annually computed a ranking of programming
+language popularity using a variety of data sources, typically from websites like
+GitHub and StackOverflow. See their
+[top-20 ranking for June 2016](http://redmonk.com/sogrady/2016/07/20/language-rankings-6-16/)
+as an example.
+
+In this assignment, we'll use our full-text data from Wikipedia to produce a
+rudimentary metric of how popular a programming language is, in an effort to see
+if our Wikipedia-based rankings bear any relation to the popular Red Monk rankings.
+
+You'll complete this exercise on just one node (your laptop).
+
+## Set up Spark
+
+For the sake of simplified logistics, we'll be running Spark in "local" mode. This
+means that your full Spark application will be run on one node, locally, on your
+laptop.
+
+To start, we need a `SparkContext`. A `SparkContext` is the
+"handle" to your cluster. Once you have a `SparkContext`, you can use it
+to create and populate RDDs with data.
+
+To create a `SparkContext`, you need to first create a
+`SparkConf` instance. A `SparkConf` represents the
+configuration of your Spark application. It's here that you must specify that you
+intend to run your application in "local" mode. You must also name your Spark
+application at this point. For help, see the
+[Spark API documentation](http://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html).
+
+Configure your cluster to run in local mode by implementing `val conf`
+and `val sc`.
+
+## Read-in Wikipedia Data
+
+There are several ways to read data into Spark. The simplest way to read in data
+is to convert an existing collection in memory to an RDD using the
+`parallelize` method of the Spark context.
+
+We have already implemented a method `parse` in the object
+`WikipediaData` object that parses a line of the dataset and turns it into a `WikipediaArticle`.
+
+Create an `RDD` (by implementing `val wikiRdd`) which contains
+the `WikipediaArticle` objects of `articles`.
+
+## Compute a ranking of programming languages
+
+We will use a simple metric for determining the popularity of a programming
+language: the number of Wikipedia articles that mention the language at least once.
+
+### Rank languages attempt #1: rankLangs
+
+**Computing** `occurrencesOfLang`
+
+Start by implementing a helper method `occurrencesOfLang` which computes
+the number of articles in an `RDD` of type `RDD[WikipediaArticles]`
+that mention the given language at least once. For the sake of simplicity we check
+that it least one word (delimited by spaces) of the article text is equal to the given
+language.
+
+**Computing the ranking,** `rankLangs`
+
+Using `occurrencesOfLang`, implement a method `rankLangs` which
+computes a list of pairs where the second component of the pair is the number of
+articles that mention the language (the first component of the pair is the name of
+the language).
+
+An example of what `rankLangs` might return might look like this, for
+example:
+
+```scala
+List(("Scala", 999999), ("JavaScript", 1278), ("LOLCODE", 982), ("Java", 42))
+```
+
+The list should be sorted in descending order. That is, according to this ranking,
+the pair with the highest second component (the count) should be the first element
+of the list.
+
+Pay attention to roughly how long it takes to run this part! (It should take
+tens of seconds.)
+
+### Rank languages attempt #2: rankLangsUsingIndex
+
+**Compute an inverted index**
+
+An inverted index is an index data structure storing a mapping from content,
+such as words or numbers, to a set of documents. In particular, the purpose of
+an inverted index is to allow fast full text searches. In our use-case, an
+inverted index would be useful for mapping from the names of programming
+languages to the collection of Wikipedia articles that mention the name at
+least once.
+
+To make working with the dataset more efficient and more convenient, implement
+a method that computes an "inverted index" which maps programming language names
+to the Wikipedia articles on which they occur at least once.
+
+Implement method `makeIndex` which returns an RDD of the following type:
+`RDD[(String, Iterable[WikipediaArticle])]`. This RDD contains pairs,
+such that for each language in the given `langs` list there is at most
+one pair. Furthermore, the second component of each pair (the `Iterable`)
+contains the `WikipediaArticles` that mention the language at least once.
+
+_Hint: You might want to use methods **`flatMap`** and
+**`groupByKey`** on **`RDD`** for this part._
+
+**Computing the ranking, `rankLangsUsingIndex`**
+
+Use the `makeIndex` method implemented in the previous part to
+implement a faster method for computing the language ranking.
+
+Like in part 1, `rankLangsUsingIndex` should compute a list of pairs
+where the second component of the pair is the number of articles that mention
+the language (the first component of the pair is the name of the language).
+
+Again, the list should be sorted in descending order. That is, according to
+this ranking, the pair with the highest second component (the count) should
+be the first element of the list.
+
+_Hint: method **`mapValues`** on **`PairRDD`** could be useful
+for this part._
+
+_Can you notice a performance improvement over attempt #1? Why?_
+
+### Rank languages attempt #3: rankLangsReduceByKey
+
+In the case where the inverted index from above is _only_ used for computing
+the ranking and for no other task (full-text search, say), it is more efficient
+to use the `reduceByKey` method to compute the ranking directly,
+without first computing an inverted index. Note that the `reduceByKey`
+method is only defined for RDDs containing pairs (each pair is interpreted as
+a key-value pair).
+
+Implement the `rankLangsReduceByKey` method, this time computing the
+ranking without the inverted index, using `reduceByKey`.
+
+Like in part 1 and 2, `rankLangsReduceByKey` should compute a list
+of pairs where the second component of the pair is the number of articles that
+mention the language (the first component of the pair is the name of the language).
+
+Again, the list should be sorted in descending order. That is, according to
+this ranking, the pair with the highest second component (the count) should
+be the first element of the list.
+
+_Can you notice an improvement in performance compared to measuring both the
+computation of the index and the computation of the ranking as we did in
+attempt #2? If so, can you think of a reason?_
-- 
GitLab