Date: October 26 2020
Summary: Language-agnostic in-memory columnar format for analytical query engines and data frames
Keywords: ##zettel #apache #arrow #dataframe #data #storage #format #archive
Not Available
Launched in 2016
Purpose: Language-independent open standards and libraries to accelerate and simplify in-memory computing
Columnar Memory Format
Ideal for Columnar Storage (small size)
Supports flat and nested schemas
Organized for cache-efficient access on CPUs and GPUs and takes advantage of parallel processing
Optimized for scan and random access
Binary Messaging Protocol
Record batch is a part of a table
Can be streamed
Encapsulated Protocol Messages (IPC)
Specifies the memory location of each column in the table
[1]
The arrow format allows for larger-than-memory datasets. When writing a dataset to arrow format, the data and metadata is laid in a descriptive layout. The data is written in pre-determined, binary formats by supported type. When reading, Arrow memory maps data from arrow memory. This means the OS gives access to memory which is swapped into RAM upon requests. (Jacob Quinn – correspondence on Apache Arrow mailing list)
Zelko, Jacob. Apache Arrow. https://jacobzelko.com/10262020041544-apache-arrow. October 26 2020.
[1] Apache Arrow and the Future of Data Frames, (2020).Available: https://www.youtube.com/watch?v=fyj4FyH3XdU&t=281s