We often need to aggregate data. Hadoop/Spark map-reduce is a classic example of this (the reduce part). There are other more trivial examples - e.g. summing the line items in a shopping basket to calculate the total basket price
Semigroup
encapsulates the reduce part of map reduce.You can find complete examples of the concepts discussed on my blog in my Github repo . In particular, check out the Semigroup example
Semigroups are useful in themselves, but they also allow us to customise other type classes and data types in the Cats ecosystem (e.g. Validated), so it’s worth learning about them, even if you don’t see an immediate need
Introduction
In functional programming terms a Semigroup is a concept which encapsulates this combining/aggregating process. The Semigroup type class defines a simple method:
def combine(x: A, y: A): A
and this is how we use it:
// The type class definition
import cats.kernel.Semigroup
// Bring an instance for Int into scope
import cats.instances.int._
val onePlusTwo = Semigroup[Int].combine(1, 2)
Given two values of type A
, combine them to produce a single A
. There is one additional constraint that’s not
represented in the signature though:
The associative property
A Semigroup has an associative binary operation
What exactly does this mean? The binary part is easy to understand (combine takes two parameters). Associative simply means
combine(combine(a, b), c) == combine(a, combine(b, c))
We know Integer addition is associative i.e. (a + b) + c == a + (b + c)
but integer subtraction is
not: (a - b) - c != a - (b - c)
Why is the associative requirement so important?
At this stage you’re probably wondering what all the fuss is about? Associativity allows us to partition the data any way we want
and potentially parallelize the operations e.g. given a list of numbers: 1, 2, 3, 4, 5
we can do something like this:
// Could be run in different threads
val group1 = Semigroup[Int].combine(1, 2)
val group2 = Semigroup[Int].combine(3, 4)
val group3 = Semigroup[Int].combine(group1, group2)
val total = Semigroup[Int].combine(group3, 5)
or
val group1 = Semigroup[Int].combine(2, 3)
val group2 = Semigroup[Int].combine(4, 5)
val group3 = Semigroup[Int].combine(group1, group2)
val total = Semigroup[Int].combine(1, group3)
Writing a custom Semigroup
Cats includes Semigroup implementations for most of the common types in the Scala ecosystem (String
, Int
, List
etc).
Most of the time the behaviour is predictable and obvious. The combine
implementation for Int
will add the two parameters,
the String implementation will concatenate them. What if we want to multiply two numbers? We can easily write our own implementation:
import cats.kernel.Semigroup
implicit val multiplicationSemigroup = new Semigroup[Int] {
override def combine(x: Int, y: Int): Int = x * y
}
// uses our implicit Semigroup instance above
val four = Semigroup[Int].combine(2, 2)
Cats also includes a factory function to make the code a bit less verbose:
implicit val multiplicationSemigroup = Semigroup.instance[Int](_ * _)
Scala 2.12 also allows us to use Java 8’s Single Abstract Method pattern so if you’re using Scala 2.12 and Java 8 you could also do something like:
implicit val multiplicationSemigroup: Semigroup[Int] = (x: Int, y: Int) => x * y
// or more succintly
implicit val multiplicationSemigroup: Semigroup[Int] = (x, y) => x * y
// or even
implicit val multiplicationSemigroup: Semigroup[Int] = _ * _
Of course the same principle applies if we want to combine our own types however we have to remember
that its (A, A) => A
not (A, A) => B
so we can’t do something like
implicit val lineItemSemigroup = Semigroup.instance[LineItem] { (item1, item2) =>
item1.price + item2.price // returning a Float not a LineItem
}
however we could do something like
implicit val orderSummarySemigroup = Semigroup.instance[OrderSummary] { (summary1, summary2) =>
Summary(
items = summary1.items ++ summary2.items,
total = summary1.total + summary2.total
)
}
String gotchas
I mentioned that the Cats implementation for String concatenates the two parameters. Be aware that it doesn’t include a space between them which isn’t very useful:
import cats.instances.string._
import cats.instances.string._
Semigroup[String].combine("john", "doe")
res7: String = "johndoe"
but we can bring in our own instance easily:
implicit val mySemigroup = Semigroup.instance[String](_ + " " + _)
mySemigroup: Semigroup[String] = cats.kernel.Semigroup$$anon$1@32c0fecc
Semigroup[String].combine("john", "doe")
res3: String = "john doe"
Working with collections
Given the associative constraint we can build more useful constructs from the simple combine(x, y)
method. We can use recursion
directly or make use of scala’s fold()
to operate on a collection of values:
import cats.kernel.Semigroup
import cats.instances.string._ // bring a Semigroup into scope for String
def combineStrings(collection: Seq[String], acc: String = ""): String = {
collection.headOption match {
case Some(head) =>
val updatedAcc = Semigroup[String].combine(acc, head)
combineColectionOfStrings(collection.tail, updatedAcc)
case None => acc
}
}
or
def combineStrings(collection: Seq[String]): String = {
collection.foldLeft("")(Semigroup[String].combine)
}
Notice how in both cases I need to pass a default or fallback value, in this case an empty String “”. This limitation means
that I can’t easily write a generic method combineStuff[A](collection: Seq[A])
because the fallback value will depend
on the type of A ("" for Strings, zero for Ints etc).
There is a solution to this problem though, it’s called the Monoid …