Skip to content
Snippets Groups Projects
solutions-6.md 1.12 KiB
Newer Older
# Exercise 1 : Spark Fundamentals

## Question 1

```scala
rdd.map(text => text.split(" ").length).reduce(_ + _)
```

## Question 2

```scala
val errorLogs = rawLogs.map(toLog(_)).filter(isError(_)).cache()
val numberErrors = errorLogs.count()
val messages = errorLogs.filter(isRecent).map(message(_)).collect()
```

## Question 3

In the first case, all the strings are displayed on the master. In the second case, records are displayed on the standard output of the worker nodes.

## Question 4

In the case of 2., the map operations are pipelined and only applied to 10 elements.

# Exercise 2 : Demographics


## Question 1

```scala
val adultAges = people.map(_.age).filter(_ >= 18).cache()
groups.map {
  case (lower, upper) => adultAges.filter { (age: Int) =>
    lower <= age && age <= upper
  }.count()
}
```

## Question 2

```scala
val groups: Array[(Int, Int)] =
  people.map(_.age)
        .filter(_ >= 18)
        .map((age: Int) => (groupOf(age), 1))
        .reduceByKey(_ + _)
        .collect()

// Now, we have a small collection
// and can work on a single machine.
groups.sortBy(_._1)
      .map(_._2)
      .toList()
```