Memory Efficient File Reader

Jul 6, 2018 · 9 minute read

TL;DR

  • The FileHandle class offers APIs for reading a file one chunk at a time
  • This capability comes in handy when dealing with very large files that cannot be read completely into main memory.
  • We can combine the FileHandle and Sequence APIs to build a FileReader class that helps us read the contents of a file one line at a time while keeping the memory footprint low.
***

Reading the contents of a file using Swift’s standard library can be done via the String initializer init(contentsOf: URL, encoding: String.Encoding). This API loads the entire contents of the file at the given URL into main memory and interprets it using the given encoding (e.g. .utf8).

// ... 
let fileURL = URL(...)
let fileContents = try? String(contentsOf: fileURL, encoding: .utf8)
// ...

This approach will work perfectly fine when dealing with relatively small files whose contents can be completely loaded into memory. The same approach, however, will have a much harder time dealing with larger files.

Imagine, for instance, that you are given a very large file (~5 GB) that needs to be processed in some way – e.g. you need to extract the last N lines of a log file, or find the most expensive transaction in a transaction log. In this case, loading the entire file into memory may not be possible at all.

Enter: FileHandle

FileHandle is a Foundation class that can be used, among other things, to access data associated with a file.

According to the documentation:

You use file handle objects to access data associated with files, sockets, pipes, and devices. For files, you can read, write, and seek within the file. For sockets, pipes, and devices, you can use a file handle object to monitor the device and process data asynchronously [1]

For our purposes, the most relevant capability of a FileHandle object is the ability to read specific chunks of a file using the readData(ofLength: Int) -> Data and seek(toFileOffset: UInt64) APIs.

These two APIs allow us to write a function that reads the contents of a file one chunk at a time, passing the data it reads to a processing block.

func processFile(path: String, chunkSize: Int, block: (Data) -> Void) {
    guard let fileHandle = FileHandle(forReadingAtPath: path) else {
        debugPrint("Could not open file handle for file: \(path)")
        return
    }
    // Make sure the handle is closed once we are done with the file.
    defer { fileHandle.closeFile() }
    fileHandle.seek(toFileOffset: 0)
    var hasData: Bool = true
    repeat {
        // Read the next chunk; an empty result signals the end of the file.
        let data = fileHandle.readData(ofLength: chunkSize)
        hasData = data.count > 0
        if hasData {
            block(data)
        }
    } while hasData
}

The advantage of this approach over the one based on the String initializer is that, no matter how big the file is, we will always be able to go through its contents and perform some kind of processing while keeping the memory footprint very low.
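For instance, here is a sketch of how processFile could be used to count the lines of a huge file while holding at most one chunk in memory at a time (the file path and the 1 MB chunk size are placeholders, not part of the implementation above):

// Counts the newline characters in a potentially huge file,
// reading it in 1 MB chunks. The path below is a placeholder.
let newline = UInt8(ascii: "\n")
var lineCount = 0
processFile(path: "/var/log/transactions.log", chunkSize: 1_024 * 1_024) { data in
    lineCount += data.filter { $0 == newline }.count
}
print("Lines: \(lineCount)")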

Using this approach is not completely free, though.

Now, instead of paying the memory cost, we pay the latency involved in accessing the file system several times, which detracts from the performance of the solution. For some problems – e.g. retrieving the last N lines of a log file – we will be able to get away with accessing the file once or twice. For other problems – e.g. finding the most expensive transaction in a transaction log – we may have to access the file many more times, which is expensive, but is a better option than not being able to solve the problem at all due to lack of memory.

Fortunately, in cases where we are dealing with a file that contains a collection of records – e.g. a transaction log – the cost of accessing the file can be amortized [2] over a sequence of read operations by increasing the amount of data fetched each time. This way, the cost of reading the first line will be rather large (since it will first need to load a chunk of the file into main memory), while the subsequent M reads will be much faster, balancing out the cost paid by the first read.
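To make the trade-off concrete: with the 4_096-byte chunks we will use below and lines averaging, say, 80 bytes, a single file access serves roughly 50 consecutive line reads, cutting the per-line I/O cost by about that factor.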

Implementing a memory efficient FileReader

So far we have seen how the FileHandle class offers a couple of APIs that can be used to efficiently read the contents of a very large file. In the rest of this post we will build a FileReader class that leverages these APIs to read the contents of a text file one line at a time. To make this class easier to work with, we will also add conformance to the Sequence protocol, which brings in default implementations for common collection operations like map, filter, forEach, prefix, etc.

Ultimately, the client code we are aiming for should look like this:

func readFile(url: URL, lines: Int, block: (String) -> Void) throws {
    guard let fileReader = try? FileReader(file: url) else {
        debugPrint("Could not read file at: \(url)")
        return
    }
    fileReader.prefix(lines).forEach(block)
}

Let’s start off by defining the initializer of our class. Even though in the snippet above we used an initializer with the file URL as its only parameter, there are several configuration options that we should offer to client code, namely:

  • delimiter: The token used for chunking the read data. We will use the newline character \n as the default value
  • encoding: The expected string encoding of the file. The default value will be .utf8
  • chunkSize: The amount of data fetched from the file on each read. This parameter gives us control over how much memory we use versus how many times we access the file. A small value implies lower memory usage but more frequent file reads. A large value, in contrast, implies higher memory usage but fewer file accesses. Depending on the use case we may wish to favor one variable over the other. For our implementation we will use the rather arbitrary value of 4_096 bytes.

Finally, our initializer should also take care of initializing two variables: the fileHandle instance that we will use for reading the file, and a private buffer where we will store the data read from the file.

class FileReader {

    // MARK: - Private Members

    private let file: URL
    private let delimiter: Data
    private let chunkSize: Int
    private let encoding: String.Encoding
    private let fileHandle: FileHandle
    private var buffer: Data

    // MARK: - Initializers

    init(file: URL,
         delimiter: String = "\n",
         encoding: String.Encoding = .utf8,
         chunkSize: Int = 4_096) throws {
        self.file = file
        // Encode the delimiter with the same encoding as the file, so it
        // can be matched against the raw bytes read into the buffer.
        self.delimiter = delimiter.data(using: encoding)!
        self.encoding  = encoding
        self.chunkSize = chunkSize
        self.buffer = Data()
        self.fileHandle = try FileHandle(forReadingFrom: file)
    }

    deinit {
        fileHandle.closeFile()
    }

}

We will now focus on the implementation of the readLine() -> String? API, which is where most of the work happens.

class FileReader {

    // ...

    func readLine() -> String? {
        var nextLine: Data? = nil
        var eof: Bool = false
        while eof == false && nextLine == nil {
            if let range = buffer.range(of: delimiter, options: []) {
                // Split the buffer: everything before the delimiter is the
                // next line; everything after it stays in the buffer.
                nextLine = buffer.prefix(upTo: range.lowerBound)
                buffer = buffer.suffix(from: range.upperBound)
            } else {
                // No delimiter in the buffer yet: fetch another chunk.
                let newData = fileHandle.readData(ofLength: chunkSize)
                eof = newData.count == 0
                buffer += newData
            }
        }
        return nextLine.flatMap { String(data: $0, encoding: encoding) }
    }
}

At the core of this method there is a loop that reads chunks of data from the file into the buffer until the next instance of the delimiter is found or the end of the file (eof) is reached. When the delimiter is found, the data in the buffer is split in two parts: one part containing the nextLine, and the other containing the data that will be preserved in the buffer.

Notice the important role that the buffer variable plays in reducing the number of times we access the file system. Since we don’t know where the next delimiter will be, we first read 4_096 bytes and then perform a linear search for the next appearance of the delimiter. Given the moderate amount of bytes read, it is likely that the data will contain several instances of the delimiter. Therefore, instead of extracting the data of the nextLine and throwing away the rest of the bytes, we store the rest of the bytes in the buffer and repeat the search until there are no more delimiter instances left in it, at which point we load more data from the underlying file and repeat the previous process.

Finally, a word of caution. The stop conditions in the above implementation are a bit simplified. If we are not careful – for instance, if the file contains no delimiter at all – the main loop may end up reading the whole file into memory (which is precisely what we are trying to avoid). One way to prevent this issue is to define a maximum number of reads that can be performed before giving up on finding the next delimiter and simply returning all the contents of the buffer, as sketched below. For the sake of simplicity, the rest of this post will look past this issue. You can find a complete implementation of the FileReader class that manages this and other cases in this gist.
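To sketch the idea – the maxChunksPerLine cap below is an assumption for illustration, not part of the implementation above – the loop could be bounded like this:

// A sketch of the loop with a cap on consecutive chunk reads.
// `maxChunksPerLine` is a hypothetical limit, e.g. 1_024 chunks (~4 MB).
var chunksRead = 0
while eof == false && nextLine == nil {
    if let range = buffer.range(of: delimiter, options: []) {
        nextLine = buffer.prefix(upTo: range.lowerBound)
        buffer = buffer.suffix(from: range.upperBound)
    } else if chunksRead < maxChunksPerLine {
        let newData = fileHandle.readData(ofLength: chunkSize)
        eof = newData.count == 0
        buffer += newData
        chunksRead += 1
    } else {
        // Give up on finding the delimiter and flush the whole buffer.
        nextLine = buffer
        buffer = Data()
    }
}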

Adding Sequence conformance

If you think about it, the FileReader class is essentially a sequence of Strings. The fact that these Strings come from a file, and that the file is read in chunks, is just an implementation detail.

As we have discussed in previous posts (Generators, Queues), having our collection types conform to Sequence can bring in a lot of functionality for free, allowing client code to use familiar APIs like map, filter, forEach and prefix on our data structures.

One of the best features of the Sequence protocol is that adding conformance to it in custom types is usually very easy, especially when the type can also conform to IteratorProtocol. In fact, making our FileReader class conform to it only requires us to declare conformance to the Sequence and IteratorProtocol protocols, and rename our readLine API to next. That’s it!
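In code, the conformance could look as follows (here next() forwards to readLine() instead of renaming it, which is equivalent):

extension FileReader: Sequence, IteratorProtocol {

    // `IteratorProtocol` only requires `next()`. Because `FileReader`
    // acts as its own iterator, `Sequence` needs no extra work:
    // `makeIterator()` comes for free.
    func next() -> String? {
        return readLine()
    }
}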

With these changes in place, we will be able to write the following client code:

func readFile(url: URL, lines: Int, block: (String) -> Void) throws {
    guard let fileReader = try? FileReader(file: url) else {
        debugPrint("Could not read file at: \(url)")
        return
    }
    // We can use `prefix` and `forEach` thanks to the conformance to `Sequence`
    fileReader.prefix(lines).forEach(block)
}
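
And since the whole Sequence API is now available, other operations read just as naturally. For example, a hypothetical filter over a log file (the path and the "ERROR" marker are illustrative assumptions):

// Collect the lines that mention an error. The file path and the
// "ERROR" marker are illustrative assumptions.
if let reader = try? FileReader(file: URL(fileURLWithPath: "/var/log/app.log")) {
    let errorLines = reader.filter { $0.contains("ERROR") }
    print("Found \(errorLines.count) error lines")
}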