Memory Efficient File Reader
Jul 6, 2018 · 9 minute read
TL;DR
- The `FileHandle` class offers APIs for reading a file one chunk at a time.
- This capability comes in handy when dealing with very large files that cannot be read completely into main memory.
- We can combine the `FileHandle` and `Sequence` APIs to build a `FileReader` class that helps us read the contents of a file one line at a time while keeping the memory footprint low.
Reading the contents of a file using Swift’s standard library can be done via the `String` initializer `init(contentsOf: URL, encoding: String.Encoding)`. This API loads the entire contents of the file at the given `URL` into main memory and interprets it using the given encoding (e.g. `.utf8`).
```swift
// ...
let fileURL = URL(...)
let fileContents = try? String(contentsOf: fileURL, encoding: .utf8)
// ...
```
This approach will work perfectly fine when dealing with relatively small files whose contents can be completely loaded into memory. The same approach, however, will have a much harder time dealing with larger files.
Imagine, for instance, that you are given a very large file (~5 GB) that needs to be processed in some way: e.g. you need to extract the last N lines of a log file, or find the most expensive transaction in a transaction log. In this case, loading the entire file into memory may not be possible at all.
Enter: `FileHandle`
`FileHandle` is a Foundation class that can be used, among other things, to access data associated with a file.
According to the documentation:
> You use file handle objects to access data associated with files, sockets, pipes, and devices. For files, you can read, write, and seek within the file. For sockets, pipes, and devices, you can use a file handle object to monitor the device and process data asynchronously. [1]
For our case, the most relevant capability of a `FileHandle` object is the possibility of reading specific chunks of a file using the `readData(ofLength: Int) -> Data` and `seek(toFileOffset: UInt64)` APIs.
These two APIs allow us to write a function that reads the contents of a file one chunk at a time, passing the read contents to a processing block.
```swift
func processFile(path: String, chunkSize: Int, block: (Data) -> Void) {
    guard let fileHandle = FileHandle(forReadingAtPath: path) else {
        debugPrint("Could not open file handle for file: \(path)")
        return
    }

    // Start from the beginning of the file and read it chunk by chunk,
    // handing each chunk to the processing block.
    fileHandle.seek(toFileOffset: 0)
    var hasData = true
    repeat {
        let data = fileHandle.readData(ofLength: chunkSize)
        hasData = data.count > 0
        if hasData {
            block(data)
        }
    } while hasData
}
```
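For instance, here is a sketch of how this function could be used to count the lines of a large file without ever loading it whole; the file path is just a placeholder:

```swift
// Count newline characters chunk by chunk; memory usage stays
// bounded by `chunkSize` regardless of the file's total size.
var lineCount = 0
processFile(path: "/path/to/large.log", chunkSize: 4_096) { chunk in
    lineCount += chunk.filter { $0 == UInt8(ascii: "\n") }.count
}
print("Lines: \(lineCount)")
```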
The advantage of this approach over the one based on the `String` initializer is that, no matter how big the file is, we can always go through its contents and perform some kind of processing while keeping the memory footprint very low.
Using this approach is not completely free, though.
Now, instead of paying the memory cost, we pay the latency involved in accessing the file system several times, which detracts from the performance of the solution. In some problems, e.g. retrieving the last N lines of a log file, we will be able to get away with accessing the file one or two times. In other problems, e.g. finding the most expensive transaction in a transaction log, we may have to access the file many more times, which is expensive, but is a better option than not being able to solve the problem at all due to lack of memory.
Fortunately, in cases where we are dealing with a file that contains a collection of records, e.g. a transaction log, the cost of accessing the file can be amortized [2] over a sequence of read operations by increasing the amount of data fetched each time. This way, the cost of reading the first line will be rather large (since it will first need to load a chunk of the file into main memory), while the subsequent `M` reads will be much faster, thus balancing out the cost paid by the first read. For instance, with a 4,096-byte chunk and lines averaging 64 bytes, a single file access yields roughly 64 lines, so the file-access cost per line drops by about the same factor.
Implementing a memory-efficient `FileReader`
So far we have seen how the `FileHandle` class offers a couple of APIs that can be used to efficiently read the contents of a very large file. In the rest of this post we will build a `FileReader` class that leverages these APIs to read the contents of a text file one line at a time. To make the API of this class easier to work with, we will also add conformance to the `Sequence` protocol, which brings in default implementations for common collection operations like `map`, `filter`, `forEach`, `prefix`, etc.
Ultimately, the client code we are aiming for should look like this:
```swift
func readFile(url: URL, lines: Int, block: (String) -> Void) {
    guard let fileReader = try? FileReader(file: url) else {
        debugPrint("Could not read file at: \(url)")
        return
    }
    fileReader.prefix(lines).forEach(block)
}
```
Let’s start off by defining the initializer of our class. Even though in the snippet above we used an initializer with the file `URL` as the only parameter, there are several configuration options that we should offer to client code, namely:
- `delimiter`: The token used for chunking the read data. We will use the newline character `\n` as the default value.
- `encoding`: The expected string encoding of the file. The default value should be `.utf8`.
- `chunkSize`: The amount of data that should be fetched from the file each time. This parameter gives us control over how much memory we use versus how many times we access the file. A small value implies lower memory usage but more frequent file reads. A large value, in contrast, implies higher memory usage but less frequent file accesses. Depending on the use case we may wish to favor one variable over the other. For our implementation we will use the rather arbitrary value of `4_096` bytes.
Finally, our initializer should also take care of initializing two variables: the `fileHandle` instance that we will use for reading the file, and a private `buffer` where we will store the data read from the file.
```swift
import Foundation

class FileReader {

    // MARK: - Private Members

    private let file: URL
    private let delimiter: Data
    private let chunkSize: Int
    private let encoding: String.Encoding
    private let fileHandle: FileHandle
    private var buffer: Data

    // MARK: - Initializers

    init(file: URL,
         delimiter: String = "\n",
         encoding: String.Encoding = .utf8,
         chunkSize: Int = 4_096) throws {
        self.file = file
        // Encode the delimiter with the same encoding as the file,
        // so the byte-level search in `readLine` matches correctly.
        self.delimiter = delimiter.data(using: encoding)!
        self.encoding = encoding
        self.chunkSize = chunkSize
        self.buffer = Data()
        self.fileHandle = try FileHandle(forReadingFrom: file)
    }
}
```
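As a quick (hypothetical) illustration of these options, client code could tweak the delimiter and chunk size when creating a reader; the path below is just a placeholder:

```swift
// A reader for a comma-separated file, with a smaller chunk size
// to keep memory usage to a minimum.
let csvURL = URL(fileURLWithPath: "/path/to/records.csv")
let reader = try? FileReader(file: csvURL, delimiter: ",", chunkSize: 1_024)
```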
We will now focus on the implementation of the `readLine() -> String?` API, which is where most of the work happens.
```swift
class FileReader {
    // ...

    func readLine() -> String? {
        var nextLine: Data? = nil
        var eof = false

        while !eof && nextLine == nil {
            if let range = buffer.range(of: delimiter, options: []) {
                // The buffer contains a delimiter: split it into the next
                // line and the remainder, which we keep for later calls.
                nextLine = buffer.prefix(upTo: range.lowerBound)
                buffer = buffer.suffix(from: range.upperBound)
            } else {
                // No delimiter in the buffer yet: pull another chunk from the file.
                let newData = fileHandle.readData(ofLength: chunkSize)
                eof = newData.isEmpty
                buffer += newData
            }
        }

        // If the file does not end with a delimiter, flush whatever is
        // left in the buffer as the last line.
        if eof, nextLine == nil, !buffer.isEmpty {
            nextLine = buffer
            buffer = Data()
        }

        return nextLine.flatMap { String(data: $0, encoding: encoding) }
    }
}
```
At the core of this method there is a loop that attempts to read chunks of data from the file into the `buffer` until the next instance of the `delimiter` is found or until the end of the file (`eof`) is reached. In case the `delimiter` is found, the data in the `buffer` is split in two parts: one part containing the `nextLine`, and the other containing the data that will be preserved in the `buffer`. (The final check after the loop handles a file that does not end with a delimiter, flushing the remaining bytes as the last line.)
Notice the important role that the `buffer` variable plays in helping us reduce the number of times we access the file system. Since we don’t know where the next `delimiter` will be, we first read `4_096` bytes and then perform a linear search looking for the next appearance of the `delimiter`. Given the moderate number of bytes read, it is likely that the data will contain several instances of the `delimiter`. Therefore, instead of extracting the data of the `nextLine` and throwing away the rest of the bytes, we store the rest of the bytes in the `buffer` and repeat the search process until there are no more `delimiter` instances left in it, at which point we try to load more data from the underlying file and repeat the previous process.
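To make the flow concrete, here is a small (hypothetical) usage sketch that drains a file line by line; again, the path is a placeholder:

```swift
// Stream a large log file one line at a time.
if let reader = try? FileReader(file: URL(fileURLWithPath: "/path/to/large.log")) {
    while let line = reader.readLine() {
        // At any point we hold at most one chunk plus one line in memory.
        print(line)
    }
}
```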
Finally, a word of caution. The stop conditions in the above implementation are a bit simplified. If we are not careful, the main loop may end up reading the whole file into memory (which is precisely what we are trying to avoid), e.g. when the file does not contain the delimiter at all. One way to prevent this issue is to define a maximum number of reads that can be performed before giving up on finding the next delimiter and simply returning all the contents in the buffer. For the sake of simplicity, in this post we will look past this issue. You can find a complete implementation of the `FileReader` class that handles this and other cases in this gist.
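For reference, here is a minimal sketch of how such a cap could look. The `maxChunks` parameter is an assumption of this sketch (not part of the implementation above), and the extension is assumed to live in the same file as `FileReader` so it can access its private members:

```swift
extension FileReader {

    // Variant of `readLine` that gives up after `maxChunks` chunk reads,
    // returning whatever has been buffered so far as one "line" instead
    // of growing the buffer without bound.
    func readLine(maxChunks: Int) -> String? {
        var nextLine: Data? = nil
        var eof = false
        var chunksRead = 0

        while !eof && nextLine == nil && chunksRead < maxChunks {
            if let range = buffer.range(of: delimiter, options: []) {
                nextLine = buffer.prefix(upTo: range.lowerBound)
                buffer = buffer.suffix(from: range.upperBound)
            } else {
                let newData = fileHandle.readData(ofLength: chunkSize)
                eof = newData.isEmpty
                buffer += newData
                chunksRead += 1
            }
        }

        // Either EOF or the read cap was hit: flush the buffered data.
        if nextLine == nil && !buffer.isEmpty {
            nextLine = buffer
            buffer = Data()
        }

        return nextLine.flatMap { String(data: $0, encoding: encoding) }
    }
}
```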
Adding `Sequence` conformance
If you think about it, the `FileReader` class is essentially a sequence of `String`s. The fact that these `String`s come from a file, and that the file is read in chunks, is just an implementation detail.
As we have discussed in previous posts (Generators, Queues), having our collection types conform to `Sequence` can bring in a lot of functionality for free, allowing client code to use familiar APIs like `map`, `filter`, `forEach`, and `prefix` on our data structures.
One of the best features of the `Sequence` protocol is that adding conformance to it in custom types is usually very easy, especially when a type can also conform to the `IteratorProtocol` protocol. In fact, making our `FileReader` class conform only requires us to declare conformance to the `Sequence` and `IteratorProtocol` types, and to rename our `readLine` API to `next`. That’s it!
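Here is a minimal sketch of what that conformance could look like. Instead of literally renaming `readLine`, we keep it and forward to it from `next()`, which is equivalent:

```swift
extension FileReader: Sequence, IteratorProtocol {

    // `next()` is the only requirement of `IteratorProtocol`. Since
    // `FileReader` acts as its own iterator, `Sequence` conformance
    // comes for free via the default `makeIterator()` implementation.
    func next() -> String? {
        return readLine()
    }
}
```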
With these changes in place, we will be able to write the following client code:
```swift
func readFile(url: URL, lines: Int, block: (String) -> Void) {
    guard let fileReader = try? FileReader(file: url) else {
        debugPrint("Could not read file at: \(url)")
        return
    }
    // We can use `prefix` and `forEach` thanks to the conformance to `Sequence`.
    fileReader.prefix(lines).forEach(block)
}
```
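Calling this function is then as simple as, for example (with a placeholder path):

```swift
// Print the first 10 lines of a large log file.
readFile(url: URL(fileURLWithPath: "/path/to/large.log"), lines: 10) { line in
    print(line)
}
```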