Olivier Blanvillain authored
StackOverflow
Use the following command to make a fresh clone of your repository:
git clone -b stackoverflow git@gitlab.epfl.ch:lamp/student-repositories-s21/cs206-GASPAR.git cs206-stackoverflow
Useful links
- The API documentation of Spark
- The API documentation of the Scala standard library
- The API documentation of the Java standard library
If you have issues with the IDE, try reimporting the build. If you still have problems, use compile in sbt instead.
Introduction
For this assignment, you will need to download the StackOverflow dataset (84 MB):
http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow-grading.csv
and place it in the folder src/main/resources/stackoverflow in your project directory.
The overall goal of this assignment is to implement a distributed k-means algorithm which clusters posts on the popular question-answer platform StackOverflow according to their score. Moreover, this clustering should be executed in parallel for different programming languages, and the results should be compared.
The motivation is as follows: StackOverflow is an important source of documentation. However, user-provided answers may receive very different ratings (user votes) depending on their perceived value. Therefore, we would like to look at the distribution of questions and their answers. For example, how many highly-rated answers do StackOverflow users post, and how high are their scores? Are there big differences between higher-rated answers and lower-rated ones?
Finally, we are interested in comparing these distributions for different programming language communities. Differences in distributions could reflect differences in the availability of documentation. For example, StackOverflow could have better documentation for a certain library than that library's API documentation. However, to avoid invalid conclusions we will focus on the well-defined problem of clustering answers according to their scores.
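To make the clustering goal above concrete, here is a minimal single-machine sketch of k-means on one-dimensional scores. It is not the distributed implementation this assignment asks for, and the names KMeansSketch, step and run, as well as the eps and maxIter parameters, are hypothetical. It only illustrates the idea of repeatedly assigning each score to its nearest mean and then recomputing the means.

```scala
object KMeansSketch {
  // One iteration: assign every score to its nearest mean,
  // then replace each mean by the average of its assigned scores.
  // A mean with no assigned scores is left unchanged.
  def step(scores: Seq[Double], means: Vector[Double]): Vector[Double] = {
    val byCluster: Map[Int, Seq[Double]] =
      scores.groupBy(s => means.indices.minBy(i => math.abs(s - means(i))))
    means.indices.map { i =>
      byCluster.get(i).map(c => c.sum / c.size).getOrElse(means(i))
    }.toVector
  }

  // Repeat until no mean moves by more than eps, or maxIter is exhausted.
  // Assumes `means` is non-empty.
  @annotation.tailrec
  def run(scores: Seq[Double],
          means: Vector[Double],
          eps: Double = 1e-6,
          maxIter: Int = 100): Vector[Double] = {
    val next = step(scores, means)
    val moved = means.zip(next).map { case (a, b) => math.abs(a - b) }.max
    if (moved < eps || maxIter <= 1) next else run(scores, next, eps, maxIter - 1)
  }
}
```

For example, KMeansSketch.run(Seq(0.0, 1.0, 2.0, 100.0, 101.0, 102.0), Vector(0.0, 100.0)) separates the low scores from the high ones. In the distributed setting, the assignment and averaging steps would be expressed as operations on a Spark RDD rather than on a local Seq.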
The Data
You are given a CSV (comma-separated values) file with information about StackOverflow posts. Each line in the provided text file has the following format:
<postTypeId>,<id>,[<acceptedAnswer>],[<parentId>],<score>,[<tag>]
A short explanation of the comma-separated fields follows.
<postTypeId>: Type of the post. Type 1 = question, type 2 = answer.
<id>: Unique id of the post (regardless of type).
<acceptedAnswer>: Id of the accepted answer post. This information is optional, so it may be missing, indicated by an empty string.
<parentId>: For an answer: id of the corresponding question. For a question: missing, indicated by an empty string.
<score>: The StackOverflow score (based on user votes).
<tag>: The tag indicates the programming language that the post is about, in case it's a question, or missing in case it's an answer.
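As an illustration of the field layout described above, the following sketch parses one CSV line into a record. The Post case class, its field names, and the PostParser object are hypothetical and are not taken from the assignment skeleton. Optional fields are modelled as Option, with an empty string mapped to None.

```scala
// Hypothetical record type; field names are illustrative only.
final case class Post(
    postTypeId: Int,              // 1 = question, 2 = answer
    id: Int,                      // unique id of the post
    acceptedAnswer: Option[Int],  // None when the field is empty
    parentId: Option[Int],        // None for questions
    score: Int,                   // StackOverflow score
    tag: Option[String]           // None for answers
)

object PostParser {
  def parseLine(line: String): Post = {
    // A limit of -1 keeps trailing empty fields, such as the
    // missing tag at the end of an answer line.
    val f = line.split(",", -1)
    def optInt(s: String): Option[Int] = if (s.isEmpty) None else Some(s.toInt)
    Post(
      postTypeId = f(0).toInt,
      id = f(1).toInt,
      acceptedAnswer = optInt(f(2)),
      parentId = optInt(f(3)),
      score = f(4).toInt,
      tag = if (f.length > 5 && f(5).nonEmpty) Some(f(5)) else None
    )
  }
}
```

With made-up sample input, PostParser.parseLine("2,5484340,,5494879,1,") yields an answer whose parentId is Some(5494879) and whose tag is None. The sketch assumes well-formed lines and does no error handling for malformed numbers.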
You will see the following code in the main class:
val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow-grading.csv")
val raw = rawPostings(lines)
val grouped = groupedPostings(raw)
val scored = scoredPostings(grouped)
val vectors = vectorPostings(scored)