How to Read a Large File Efficiently In Java

1. Overview


In this tutorial, you'll learn how to read a large file quickly and efficiently in Java.

Before getting to our topic: every developer works with some technology such as Java, Python, C++, or Swift, building either web applications or mobile applications. User data and operations end up stored in files, databases, in-memory stores, or as images. The challenging part is doing this with performance in mind, and that is what you are going to see here: how to read a file efficiently in Java.

This is also a popular interview question for Java developers at all levels. We will look at both good and bad ways to read large files.




2. Reading the File Into Memory


Java's NIO package provides the Files class for file operations. Its lines() method reads all the lines from a file and holds them as strings in memory. This consumes a lot of memory and can kill the application.

If we are processing a 3 GB file, it occupies that much memory once the file is loaded. Eventually, the application ends up with an OutOfMemoryError. Once we get an OutOfMemoryError, the application stops functioning properly; we then have to free up the application's memory or restart the application immediately.

Example of reading into memory:


Here we load the address.json file in Java using the Files.lines() method.
If you are not familiar with JSON: JSON (JavaScript Object Notation) is a lightweight format used to store data as key-value pairs.

address.json:

{ "name"   : "John Smith",
  "sku"    : "20223",
  "price"  : 23.95,
  "shipTo" : { "name" : "Jane Smith",
               "address" : "123 Maple Street",
               "city" : "Pretendville",
               "state" : "NY",
               "zip"   : "12345" },
  "billTo" : { "name" : "John Smith",
               "address" : "123 Maple Street",
               "city" : "Pretendville",
               "state" : "NY",
               "zip"   : "12345" }
}

Program:

package com.java.w3schools.blog.java12.files;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.stream.Stream;

public class ReadInMemory {

 public static void main(String[] args) throws IOException {
  // try-with-resources ensures the underlying file handle is closed.
  try (Stream<String> fileContent = Files.lines(Paths.get("files", "address.json"))) {
   Iterator<String> iterator = fileContent.iterator();
   while (iterator.hasNext()) {
    System.out.println(iterator.next());
   }
  }
 }
}

Output:

{ "name"   : "John Smith",
  "sku"    : "20223",
  "price"  : 23.95,
  "shipTo" : { "name" : "Jane Smith",
               "address" : "123 Maple Street",
               "city" : "Pretendville",
               "state" : "NY",
               "zip"   : "12345" },
  "billTo" : { "name" : "John Smith",
               "address" : "123 Maple Street",
               "city" : "Pretendville",
               "state" : "NY",
               "zip"   : "12345" }
}

Here the input is a small file. If it were a large file of several gigabytes, the chances of failure and performance issues would be much higher.

So this is not the suggested way to read large files.

3. Apache Commons and Guava API (In-Memory)


As you saw in the section above, that approach does not work well for large files. Apache Commons (org.apache.commons) and Guava provide similar methods.

Files.readLines(new File(path), Charsets.UTF_8);
 
FileUtils.readLines(new File(path));

Technical architects advise against using these methods on large files, because the data is loaded into memory all at once.
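
For context, here is a minimal sketch of how those two calls might look in a complete class. The class name and file path are just placeholders, and it assumes both Guava and Commons IO are on the classpath:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.commons.io.FileUtils;

import com.google.common.base.Charsets;

public class InMemoryLibraryRead {

 public static void main(String[] args) throws IOException {
  File file = new File("files/address.json");

  // Guava: reads every line into a List<String> that lives entirely in memory.
  List<String> guavaLines = com.google.common.io.Files.readLines(file, Charsets.UTF_8);

  // Apache Commons IO: also loads the whole file into memory at once.
  List<String> commonsLines = FileUtils.readLines(file, StandardCharsets.UTF_8);

  System.out.println("Guava read " + guavaLines.size() + " lines");
  System.out.println("Commons IO read " + commonsLines.size() + " lines");
 }
}

Both calls return the entire file as a List<String>, which is why they carry the same OutOfMemoryError risk described in the previous section.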

4. Reading the File Line by Line


In this approach, you read only one line at a time: lines are retrieved sequentially, and only the current line is held in memory.

package com.java.w3schools.blog.files;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

public class ScannerExample {

 public static void main(String[] args) throws IOException {

  // try-with-resources closes the scanner and the underlying stream automatically.
  try (FileInputStream inputStream = new FileInputStream("files/address.json");
    Scanner scanner = new Scanner(inputStream, "UTF-8")) {
   while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    System.out.println(line.toUpperCase());
   }
  }
 }

}

This loop runs through every line in the file, letting you handle each line as it is read, without keeping references to previous lines in memory.
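
For comparison, a BufferedReader offers another common way to read line by line with the same memory profile. Below is a minimal sketch; the class name is just an illustration:

package com.java.w3schools.blog.files;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BufferedReaderExample {

 public static void main(String[] args) throws IOException {
  // Only the current line (plus a small read buffer) is held in memory at a time.
  try (BufferedReader reader = Files.newBufferedReader(Paths.get("files", "address.json"), StandardCharsets.UTF_8)) {
   String line;
   while ((line = reader.readLine()) != null) {
    System.out.println(line.toUpperCase());
   }
  }
 }

}

Files.newBufferedReader() is part of the same NIO Files API used earlier, so no extra dependency is needed.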

5. Reading Efficiently with Apache Commons IO


The same approach can also be achieved with the Apache Commons IO library, which provides a custom line iterator.

Add the following dependency to the pom.xml file:


<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>


Code:

package com.java.w3schools.blog.files;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class ApacheCommonsCustomIterator {

 public static void main(String[] args) throws IOException {
  LineIterator it = FileUtils.lineIterator(new File("files//address.json"), "UTF-8");
  try {
   while (it.hasNext()) {
    String line = it.nextLine();
    System.out.println(line.toLowerCase());
   }
  } finally {
   LineIterator.closeQuietly(it);
  }
 }

}

In this approach, as in the previous one, the entire file is not loaded into memory, so memory is used efficiently.

6. Split the File and Process in Parallel


Reading the file line by line is memory efficient, but it takes a lot of time, so you should consider processing time as well. For high-traffic websites, processing time is crucial to the business.

You should divide the file into chunks, much as Hadoop does internally when storing files in HDFS. Our focus here is on reading the file effectively, so we won't go deep into Hadoop and HDFS. Just remember that Hadoop splits the file, runs the same logic on each split, and finally aggregates the output from all the splits.

Example code to divide a file into chunks of a given number of MB:


Constants declaration:

private static final String dir = "/tmp/";
private static final String suffix = ".splitPart";

Split files logic:

/**
 * Split a file into multiples files.
 *
 * @param fileName   Name of file to be split.
 * @param mBperSplit maximum number of MB per file.
 * @throws IOException
 */
public static List<Path> splitFile(final String fileName, final int mBperSplit) throws IOException {

    if (mBperSplit <= 0) {
        throw new IllegalArgumentException("mBperSplit must be more than zero");
    }

    List<Path> partFiles = new ArrayList<>();
    final long sourceSize = Files.size(Paths.get(fileName));
    final long bytesPerSplit = 1024L * 1024L * mBperSplit;
    final long numSplits = sourceSize / bytesPerSplit;
    final long remainingBytes = sourceSize % bytesPerSplit;
    int position = 0;

    try (RandomAccessFile sourceFile = new RandomAccessFile(fileName, "r");
         FileChannel sourceChannel = sourceFile.getChannel()) {

        for (; position < numSplits; position++) {
            //write multipart files.
            writePartToFile(bytesPerSplit, position * bytesPerSplit, sourceChannel, partFiles);
        }

        if (remainingBytes > 0) {
            writePartToFile(remainingBytes, position * bytesPerSplit, sourceChannel, partFiles);
        }
    }
    return partFiles;
}

Write files example:

private static void writePartToFile(long byteSize, long position, FileChannel sourceChannel, List<Path> partFiles) throws IOException {
    Path fileName = Paths.get(dir + UUID.randomUUID() + suffix);
    try (RandomAccessFile toFile = new RandomAccessFile(fileName.toFile(), "rw");
         FileChannel toChannel = toFile.getChannel()) {
        sourceChannel.position(position);
        toChannel.transferFrom(sourceChannel, 0, byteSize);
    }
    partFiles.add(fileName);
}


The above code performs the file division. Once the file is split, run the line-by-line reading logic on each split in parallel. This minimizes the processing time.
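
As a rough sketch of the parallel step, the split parts can be handed to a thread pool and each one read line by line. This assumes the splitFile() method above sits in a hypothetical FileSplitter class, and process(line) stands in for your own per-line logic; the file name is a placeholder:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ParallelSplitProcessor {

 public static void main(String[] args) throws IOException, InterruptedException {
  // Hypothetical: splitFile() is the method shown above, placed in a FileSplitter class.
  List<Path> parts = FileSplitter.splitFile("files/bigdata.log", 100);

  ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
  for (Path part : parts) {
   pool.submit(() -> {
    // Each split is still read line by line, so memory stays low even with many threads.
    try (Stream<String> lines = Files.lines(part, StandardCharsets.UTF_8)) {
     lines.forEach(line -> {
      // process(line); // placeholder for the real per-line business logic
     });
    } catch (IOException e) {
     e.printStackTrace();
    }
   });
  }
  pool.shutdown();
  pool.awaitTermination(1, TimeUnit.HOURS);
 }

}

Note that splitting on raw byte offsets can cut a line across two parts, so real per-line logic may need to stitch the boundary lines back together.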

7. Conclusion


In this article, we've seen how to read a large file efficiently.

Covered areas:

How to load the entire file into memory
The drawbacks of reading the whole file into memory
How to read line by line using the traditional Java API
Reading line by line using the Apache Commons IO API (recommended for big files when processing time is not a concern)
The drawbacks of reading line by line
The best approach: splitting the file and processing the splits in parallel

If you have any questions, please leave a comment.

