Toby Hobson

What is a Semigroup?

banner.png
What is a Semigroup?
Estimated reading time: 5 minutes

We often need to aggregate data. Hadoop/Spark map-reduce is a classic example of this (the reduce part). There are other more trivial examples - e.g. summing the line items in a shopping basket to calculate the total basket price

Semigroups allow us to parallelize operations on large and small data sets, then combine the results together. In essence Semigroup encapsulates the reduce part of map reduce.

You can find complete examples of the concepts discussed on my blog in my Github repo . In particular, check out the Semigroup example

Semigroups are useful in themselves, but they also allow us to customise other type classes and data types in the Cats ecosystem (e.g. Validated), so it’s worth learning about them, even if you don’t see an immediate need

Introduction

In functional programming terms a Semigroup is a concept which encapsulates this combining/aggregating process. The Semigroup type class defines a simple method:

def combine(x: A, y: A): A

and this is how we use it:

// The type class definition
import cats.kernel.Semigroup
// Bring an instance for Int into scope
import cats.instances.int._

val onePlusTwo = Semigroup[Int].combine(1, 2)

Given two values of type A, combine them to produce a single A. There is one additional constraint that’s not represented in the signature though:

The associative property

A Semigroup has an associative binary operation

What exactly does this mean? The binary part is easy to understand (combine takes two parameters). Associative simply means

combine(combine(a, b), c) == combine(a, combine(b, c))

We know Integer addition is associative i.e. (a + b) + c == a + (b + c) but integer subtraction is not: (a - b) - c != a - (b - c)

Why is the associative requirement so important?

At this stage you’re probably wondering what all the fuss is about? Associativity allows us to partition the data any way we want and potentially parallelize the operations e.g. given a list of numbers: 1, 2, 3, 4, 5 we can do something like this:

// Could be run in different threads
val group1 = Semigroup[Int].combine(1, 2)
val group2 = Semigroup[Int].combine(3, 4)
val group3 = Semigroup[Int].combine(group1, group2)
val total  = Semigroup[Int].combine(group3, 5)

or

val group1 = Semigroup[Int].combine(2, 3)
val group2 = Semigroup[Int].combine(4, 5)
val group3 = Semigroup[Int].combine(group1, group2)
val total  = Semigroup[Int].combine(1, group3)

Writing a custom Semigroup

Cats includes Semigroup implementations for most of the common types in the Scala ecosystem (String, Int, List etc). Most of the time the behaviour is predictable and obvious. The combine implementation for Int will add the two parameters, the String implementation will concatenate them. What if we want to multiply two numbers? We can easily write our own implementation:

import cats.kernel.Semigroup

implicit val multiplicationSemigroup = new Semigroup[Int] {
  override def combine(x: Int, y: Int): Int = x * y
} 

// uses our implicit Semigroup instance above
val four = Semigroup[Int].combine(2, 2)

Cats also includes a factory function to make the code a bit less verbose:

implicit val multiplicationSemigroup = Semigroup.instance[Int](_ * _)

Scala 2.12 also allows us to use Java 8’s Single Abstract Method pattern so if you’re using Scala 2.12 and Java 8 you could also do something like:

implicit val multiplicationSemigroup: Semigroup[Int] = (x: Int, y: Int) => x * y

// or more succintly

implicit val multiplicationSemigroup: Semigroup[Int] = (x, y) => x * y

// or even

implicit val multiplicationSemigroup: Semigroup[Int] = _ * _

Of course the same principle applies if we want to combine our own types however we have to remember that its (A, A) => A not (A, A) => B so we can’t do something like

implicit val lineItemSemigroup = Semigroup.instance[LineItem] { (item1, item2) => 
  item1.price + item2.price // returning a Float not a LineItem
}

however we could do something like

implicit val orderSummarySemigroup = Semigroup.instance[OrderSummary] { (summary1, summary2) => 
  Summary(
    items = summary1.items ++ summary2.items, 
    total = summary1.total + summary2.total
  )
}

String gotchas

I mentioned that the Cats implementation for String concatenates the two parameters. Be aware that it doesn’t include a space between them which isn’t very useful:

import cats.instances.string._
import cats.instances.string._

Semigroup[String].combine("john", "doe") 
res7: String = "johndoe"

but we can bring in our own instance easily:

implicit val mySemigroup = Semigroup.instance[String](_ + " " + _) 
mySemigroup: Semigroup[String] = cats.kernel.Semigroup$$anon$1@32c0fecc

Semigroup[String].combine("john", "doe") 
res3: String = "john doe"

Working with collections

Given the associative constraint we can build more useful constructs from the simple combine(x, y) method. We can use recursion directly or make use of scala’s fold() to operate on a collection of values:

import cats.kernel.Semigroup
import cats.instances.string._ // bring a Semigroup into scope for String

def combineStrings(collection: Seq[String], acc: String = ""): String = {
  collection.headOption match {
    case Some(head) =>
      val updatedAcc = Semigroup[String].combine(acc, head)
      combineColectionOfStrings(collection.tail, updatedAcc)
    case None => acc
  }
}

or 

def combineStrings(collection: Seq[String]): String = {
  collection.foldLeft("")(Semigroup[String].combine)
}

Notice how in both cases I need to pass a default or fallback value, in this case an empty String “”. This limitation means that I can’t easily write a generic method combineStuff[A](collection: Seq[A]) because the fallback value will depend on the type of A ("" for Strings, zero for Ints etc).

There is a solution to this problem though, it’s called the Monoid

comments powered by Disqus

Need help with your project?

Do you need some help or guidance with your project? Reach out to me (email is best)