About the Feather Data Format

data format big data October 5, 2021

Towards Data Science, a Medium publication, posted “Stop Using CSVs for Storage — This File Format Is 150 Times Faster”, an article about the Feather file format. Despite the blatant, manipulative clickbait in the title, it’s a useful article, if only because it demonstrates the use of Feather.

The summary of the Medium page is pretty… blunt:

CSV’s are costing you time, disk space, and money. It’s time to end it.

The opening paragraph is pretty apt, and while it’s still rather clickbait-y, it’s a lot less so than the title and the summary:

CSV is not the only data storage format out there. In fact, it’s likely the last one you should consider. If you don’t plan to edit the saved data manually, you’re wasting both time and money by sticking to it.

Given that the blog’s focus is on data science, it’s referring to using CSV for large datasets, not CSV in general. Under that presumption (large datasets, no manual editing), it’s probably pretty accurate: CSV is a text format, so every value has to be parsed on read and serialized on write, which is wasteful and slow at that scale, and binary formats like Feather might be more appropriate.

What exactly is Feather?

Put simply, it’s a data format for storing data frames (think Pandas). It’s designed around a simple premise — to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

Implementations for using Feather exist for many programming languages in common use, including Python (obviously), Java, JavaScript, Go, C and C++, C#, Ruby, Rust, and a few others.

