StackOverflow

Use the following command to make a fresh clone of your repository:

git clone -b stackoverflow git@gitlab.epfl.ch:lamp/student-repositories-s21/cs206-GASPAR.git cs206-stackoverflow

Useful links

If you have issues with the IDE, try reimporting the build. If you still have problems, use compile in sbt instead.

Introduction

For this assignment, you will need to download the stackoverflow dataset (84 MB):

http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow-grading.csv

and place it in the folder: src/main/resources/stackoverflow in your project directory.

The overall goal of this assignment is to implement a distributed k-means algorithm which clusters posts on the popular question-answer platform StackOverflow according to their score. Moreover, this clustering should be executed in parallel for different programming languages, and the results should be compared.

The motivation is as follows: StackOverflow is an important source of documentation. However, different user-provided answers may receive very different ratings (based on user votes), depending on their perceived value. Therefore, we would like to look at the distribution of questions and their answers. For example, how many highly-rated answers do StackOverflow users post, and how high are their scores? Are there big differences between higher-rated answers and lower-rated ones?

Finally, we are interested in comparing these distributions for different programming language communities. Differences in distributions could reflect differences in the availability of documentation. For example, StackOverflow could have better documentation for a certain library than that library's API documentation. However, to avoid invalid conclusions we will focus on the well-defined problem of clustering answers according to their scores.
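
To make the clustering idea concrete, here is a minimal, non-distributed sketch of a single k-means iteration on plain score values. It is only an illustration: the names `KMeansSketch` and `step` are hypothetical and not part of the assignment code, and it assumes one-dimensional points held in local collections rather than Spark RDDs.

```scala
object KMeansSketch {
  // One k-means iteration on 1-D points (hypothetical helper):
  // assign every point to its nearest mean, then replace each mean
  // by the average of the points assigned to it.
  def step(points: Seq[Double], means: Vector[Double]): Vector[Double] = {
    // Group points by the index of the closest mean.
    val byCluster: Map[Int, Seq[Double]] =
      points.groupBy(p => means.indices.minBy(i => math.abs(p - means(i))))

    // A mean with no assigned points is left unchanged.
    means.indices.map { i =>
      byCluster.get(i).map(ps => ps.sum / ps.size).getOrElse(means(i))
    }.toVector
  }
}
```

Repeating `step` until the means stop moving (or a maximum number of iterations is reached) gives the usual k-means loop. A distributed version would have to express the same assignment and averaging with operations on distributed collections.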

The Data

You are given a CSV (comma-separated values) file with information about StackOverflow posts. Each line in the provided text file has the following format:

<postTypeId>,<id>,[<acceptedAnswer>],[<parentId>],<score>,[<tag>]

A short explanation of the comma-separated fields follows.

<postTypeId>:     Type of the post. Type 1 = question,
                  type 2 = answer.

<id>:             Unique id of the post (regardless of type).

<acceptedAnswer>: Id of the accepted answer post. This
                  information is optional, so it may be missing,
                  indicated by an empty string.

<parentId>:       For an answer: id of the corresponding
                  question. For a question: missing, indicated
                  by an empty string.

<score>:          The StackOverflow score (based on user
                  votes).

<tag>:            The tag indicates the programming language
                  that the post is about, in case it's a
                  question, or missing in case it's an answer.
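
As an illustration of this format, the following sketch parses a single line into a record with optional fields. The names `PostParsing`, `PostRecord`, and `parseLine` are hypothetical and are not the types used by the assignment skeleton. The sample lines in the usage note are made up for the example, not taken from the dataset. The sketch also assumes that ids and scores fit into an `Int`.

```scala
object PostParsing {
  // Hypothetical record for one CSV line; optional fields become Option values.
  final case class PostRecord(
    postTypeId: Int,
    id: Int,
    acceptedAnswer: Option[Int],
    parentId: Option[Int],
    score: Int,
    tag: Option[String]
  )

  def parseLine(line: String): PostRecord = {
    // A limit of -1 keeps trailing empty fields, such as the missing tag of an answer.
    val fields = line.split(",", -1)

    // An empty string marks a missing optional value.
    def optInt(s: String): Option[Int] = if (s.isEmpty) None else Some(s.toInt)

    PostRecord(
      postTypeId     = fields(0).toInt,
      id             = fields(1).toInt,
      acceptedAnswer = optInt(fields(2)),
      parentId       = optInt(fields(3)),
      score          = fields(4).toInt,
      tag            = if (fields.length > 5 && fields(5).nonEmpty) Some(fields(5)) else None
    )
  }
}
```

For example, `PostParsing.parseLine("1,100,101,,5,Scala")` yields a question with an accepted answer and a tag, while `PostParsing.parseLine("2,101,,100,3,")` yields an answer whose `parentId` is `Some(100)` and whose `tag` is `None`.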

You will see the following code in the main class:

val lines   = sc.textFile("src/main/resources/stackoverflow/stackoverflow-grading.csv")
val raw     = rawPostings(lines)
val grouped = groupedPostings(raw)
val scored  = scoredPostings(grouped)
val vectors = vectorPostings(scored)