Higher-Order Functions, map, reduce, filter. Yeah, the ones used in Big Data Frameworks like Apache Spark and Hadoop

In my previous post about Functional Programming in Java, I mentioned higher-order functions that are often used in big data frameworks like Hadoop, and Spark.  We’ll be discussing only a small subset of the possible functions available in Spark and Hadoop because this subset are the functions I’ve found most useful as a developer not working in big data.  However, as we’ll see, by the end we’ll have fundamental enough knowledge to apply the function calls to a big data framework if necessary.

map()

We’ll begin with perhaps the easiest function to grasp, and the first in the all too familiar phrase map-reduce: the map function.  The way I’ve often heard map described is that we’re “mapping an input to a specified output”.  To do this mapping we’ll need a function that maps a set of inputs to a set of outputs.  In essence, you’re iterating over every object in a data structure and mapping it to a new object.  It often feels very much to me like a for each loop that forces the programmer to do something inside of it.  Here’s a quick example:

    var arr = ['1','2','3','4','5'];
    arr.map( i => console.log(i));

Here the function we’re supplying to map the inputs (‘1′,’2’,…, etc) is a lambda, that takes i as a parameter and maps this input to the console as it’s output.  It’s obviously the same as:


arr.forEach(function(i) {
  console.log(i);
});

or even the age old:


for(var i in arr){
  console.log(arr[i]);
}

Obviously the example using the forEach could’ve used a lambda in place of the function, and the map function can take a non lambda function, but I wanted to illustrate the many different ways to write it.  We can also add an arbitrarily long function in place of the console log.


arr.map( i => {
  var iTimes20 = i * 20;
  if(i > 3){
    console.log((iTimes20 % 20) == 0);
  }
});

“What happens in map, stays in map”

One thing that always bites me in the ass is I think I’m altering the object passed in by the lambda.  In this case one of the strings ‘1’, ‘2’, …. just as you would be when using the old for loop, but this isn’t the case.  If you do something like this:

arr.map( i => i = 100);

and print out the result of arr, you’ll see it remains unchanged.  This is one of the main tenets of functional programming.  You never want to “cause side effects”.  What this means is, you want to avoid changing the state of a program in unintentional places.  What happens in map, stays in map.  If you want to alter the state of the object in arr, you need to return out a new copy of the original array:


var newArr = arr.map( i => return 100 );

Now if you console log newArr, you’ll see an array of 100’s in place of the original array, and nothing is changed in arr itself.  This is one advantage map has over the old school for loop, you can be certain the state of the original container holding the objects will be the same after the for loop as it was before.

This idea is difficult to wrap your head around as a college student (at least it was for me).  You’re preparing for interviews and space vs time complexity is beat over your head again and again.  The above code looks atrocious from this perspective.  You’re unnecessarily creating a new array, making the space 2n (where n is the size of the array).  Yes, you’re correct.  However, as you’ll see when you get into industry, writing bug free code is often much more important than shaving off a factor of n.  In reality, the code is still O(n), and you can always come back to refactor this bit of code if the bottleneck of the software ends up being this line.  It’s often the case that the bottlenecks appear elsewhere in the software architecture though, and they’ll have been discovered in design.

reduce()

I often think of reduce as a concise replacement for this programming construct:


var finalSum = 0;

var arr = [10,20,30,40,50];

for(var i in arr){
  finalSum += arr[i];
}

The same thing is accomplished using reduce:


var finalSum = add.reduce((countSoFar,currentVal) => {
  return countSoFar + currentVal;
});

Here, countSoFar is an “accumulator” which carries the returned value throughout the function calls, and currentVal is the current object in the collection.  So we’re adding the accumulated sum we’ve seen up to this point, to the current value in the array, and returning this to the countSoFar for the next iteration.  This particular example is kind of trivial, however, since you have access to the accumulator you can do some really interesting things.  For example:


var doubleNestedArray = [
  ['Bob', 'White'],
  ['Clark', 'Kent'],
  ['Bilbo','Baggins']
];

var toMap = doubleNestedArray.reduce((carriedObject, currentArrayValue) => {
  carriedObject[currentArrayValue[0]] = currentArrayValue[1];
  return carriedObject;
}, {});

This will return the array as a map object that looks like this: { Bob: ‘White’, Clark: ‘Kent’, Bilbo: ‘Baggins’ }

Here we see a feature of reduce that I didn’t mention previously.  The second argument after the lambda function is the initial value for the reduce.  In our case it’s an empty javascript object, however you could’ve easily added an initial person to our map:


var toMap= doubleNestedArray.reduce(carriedObject, currentArray) => {
  carriedObject[currentArray[0]] = currentArray[1];
  return carriedObject;
}, {Bruce: 'Wayne'});

I often use reduce on large JSON objects returned by an API, where I want to sum over one attribute across all the objects.  For example, getting the star count from a list of GitHub repos:


return repos.data.reduce(function(count,repo){
  return count + repo.stargazers_count;
},0);

filter()

Finally, lets throw in one more for good measure, as it’s another that very frequently comes up in big data computing (I think I’ve heard the joke somewhere: “it should be map-filter-reduce but that doesn’t roll off the tongue quite like map-reduce”).  The filter function is a replacement for this construct:


var arr = [10,20,30,40];

var lessThan12 = [];

for(var i in arr){
  if(arr[i] < 12){
    lessThan12.push(arr[i]);
  }
}

This can be shortened to:

var lessThan12 = arr.filter( num => return num < 12; );

Naturally, any sort or predicate logic can be put in place of the if statement to select elements from an arbitrary container.

One thing I’ll often forget is you have to return the predicate.  It seems odd because you’re returning a boolean, but getting the actual value that evaluates to true in the returned container.

A big data example

A famous “hello world” example from big data is the word count.  Let’s use our new found knowledge on this problem.


var TheCrocodile = "How doth the little crocodile
Improve his shining tail
And pour the waters of the Nile
On every golden scale
How cheerfully he seems to grin
How neatly spreads his claws
And welcomes little fishes in
With gently smiling jaws";

var stringCount = TheCrocodile
                     .toLowerCase()
                     .split(' ')
                     .reduce((count, word) => {
                       count[word] = count[word] + 1 || 1;
                       return count;
                     }, {});

 

Hopefully now the power of functional programming is beginning to be more apparent.  Counting the words in a string took us 4 lines of code.  Not only this, but this code could be parallelized to multiple machines using Hadoop or Spark.

That’s all for functional programming!  In the next post, we’ll finally start talking about a fundamental topic associated with this blog: cryptocurrency.  I’m particularly interested in Ethereum as a developer, and therefore the solidity programming language.  We’ll start this broad and deeply interesting topic with a brief explanation of what the blockchain is, and move forward from there.  See you then!

Advertisements

One thought on “Higher-Order Functions, map, reduce, filter. Yeah, the ones used in Big Data Frameworks like Apache Spark and Hadoop

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s