Tuesday, December 14, 2021

Java Spark RDD reduce() Examples - sum, min and max operations

1. Overview

In this tutorial, we will learn how to use the Spark RDD reduce() method using the Java programming language. Most developers use the same reduce() method in PySpark, but in this article we will see how to perform sum, min and max operations on a Java RDD.


2. Java Spark RDD - reduce() method


First, let's understand the syntax of the Spark reduce() method.

public T reduce(scala.Function2<T,T,T> f)
 

This signature comes from Spark's core (Scala) RDD class, where Function2 is Scala's two-argument function trait. The Java API's JavaRDD exposes an equivalent reduce() that takes org.apache.spark.api.java.function.Function2, a Java 8 style functional interface, so we can pass a lambda expression to it.

Function2 takes two arguments as input and returns one value. For reduce(), the input and output types must all be the same (T).
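To make that contract concrete, here is a minimal sketch (not from the original post) showing the explicit Java-API equivalent of the lambda used later in this article:

import org.apache.spark.api.java.function.Function2;

// Explicit equivalent of the lambda (value1, value2) -> value1 + value2.
// Function2<T1, T2, R> declares a single method: R call(T1 v1, T2 v2).
Function2<Integer, Integer, Integer> sum = new Function2<Integer, Integer, Integer>() {
	@Override
	public Integer call(Integer value1, Integer value2) {
		return value1 + value2;
	}
};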


3. Java Spark RDD reduce() Example to find the sum


In the examples below, we first create a SparkConf and JavaSparkContext in local mode for testing purposes.

The meaning of each step is explained in the program's comments.

We must pass a lambda expression to the reduce() method. If you are new to Java, please read the in-depth article on Java 8 lambda expressions.

You might be surprised by the logic behind the reduce() method; below is an explanation of its internals. As a developer, you should have a basic idea of what is going on under the hood.

reduce() is called on the RDD with the logic value1 + value2. This function is applied to the values in each partition until each partition is reduced to a single value.

If there is more than one partition, the partial results of all partitions are brought together, and the same value1 + value2 logic is applied to them to produce the final result.

If the input file or dataset has only one partition, the output of that single partition is returned as the final result.
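As a plain-Java illustration of these two phases (no Spark involved; the two-way split of the numbers is only an assumption for the example):

import java.util.stream.IntStream;

public class ReducePhasesDemo {

	public static void main(String[] args) {
		// Phase 1: each "partition" is reduced independently.
		int partial1 = IntStream.of(1, 2, 3, 4).reduce(Integer::sum).getAsInt();    // 10
		int partial2 = IntStream.of(5, 6, 7, 8, 9).reduce(Integer::sum).getAsInt(); // 35

		// Phase 2: the partial results are combined with the same function.
		int total = Integer.sum(partial1, partial2); // 10 + 35 = 45
		System.out.println(total); // prints 45
	}
}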


package com.javaprogramto.rdd.reduce;

import java.util.Arrays;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDReduceExample {

	public static void main(String[] args) {
		
		// to remove the unwanted loggers from output.
		Logger.getLogger("org.apache").setLevel(Level.WARN);

		// Getting the numbers list.
		List<Integer> numbersList = getSampleData();
		
		// Creating the SparkConf object
		SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");

		// Creating the JavaSparkContext object
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// Converting List into JavaRDD.
		JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
		
		// Getting the sum of all numbers using reduce() method
		Integer sumResult = integersRDD.reduce( (value1, value2) -> value1 + value2);

		// printing the sum
		
		System.out.println("Sum of RDD numbers using reduce() : "+sumResult);
		
		// closing Spark Context
		sc.close();
		
	}

	/**
	 * Returns a list of integer numbers.
	 *
	 * @return a list of sample integers
	 */
	private static List<Integer> getSampleData() {

		return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);

	}

}

 
Output:
Sum of RDD numbers using reduce() : 45
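As an aside, a minimal alternative sketch (assuming the same integersRDD as above): the sum can also be obtained by converting the RDD to a JavaDoubleRDD and calling its built-in sum():

// Alternative: use JavaDoubleRDD's built-in sum() instead of reduce().
double sumViaDoubleRDD = integersRDD.mapToDouble(Integer::doubleValue).sum();
System.out.println("Sum via JavaDoubleRDD.sum() : " + sumViaDoubleRDD); // 45.0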


 

4. Java Spark RDD reduce() min and max Examples


Next, let us find the min and max values from the RDD.

package com.javaprogramto.rdd.reduce;

import java.util.Arrays;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDReduceMinMaxExample {

	public static void main(String[] args) {
		
		// to remove the unwanted loggers from output.
		Logger.getLogger("org.apache").setLevel(Level.WARN);

		// Getting the numbers list.
		List<Integer> numbersList = getSampleData();
		
		// Creating the SparkConf object
		SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");

		// Creating the JavaSparkContext object
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// Converting List into JavaRDD.
		JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
		
		// Finding Min and Max values using reduce() method
		
		Integer minResult = integersRDD.reduce( (value1, value2) -> Math.min(value1, value2));
		
		System.out.println("Min of RDD numbers using reduce() : "+minResult);
		
		Integer maxResult = integersRDD.reduce( (value1, value2) -> Math.max(value1, value2));
		
		System.out.println("Max of RDD numbers using reduce() : "+maxResult);
		
		// closing Spark Context
		sc.close();
		
	}

	/**
	 * Returns a list of integer numbers.
	 *
	 * @return a list of sample integers
	 */
	private static List<Integer> getSampleData() {

		return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);

	}

}

 
Output:
Min of RDD numbers using reduce() : 1
Max of RDD numbers using reduce() : 9
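Worth noting as an alternative (not part of the example above): JavaRDD also provides min() and max() methods that take a java.util.Comparator, which should produce the same results:

// Assumes the same integersRDD as in the example above.
// Comparator.naturalOrder() is serializable in the JDK, which Spark requires.
Integer minValue = integersRDD.min(Comparator.naturalOrder());
Integer maxValue = integersRDD.max(Comparator.naturalOrder());
System.out.println("Min via min() : " + minValue);
System.out.println("Max via max() : " + maxValue);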
 

5. Conclusion


In this post, we've seen how to use the reduce() aggregate operation on an RDD to find the sum, min and max values, with example programs in Java.



