Apache Arrow

Date: October 26 2020

Summary: Language-agnostic in-memory columnar format for analytical query engines and data frames

Keywords: #zettel #apache #arrow #dataframe #data #storage #format #archive

Bibliography

Not Available

How To Cite
References
Discussion:

Launched in 2016
Purpose: Language-independent open standards and libraries to accelerate and simplify in-memory computing
Columnar Memory Format
- Well suited to columnar storage (compact representation)
- Supports flat and nested schemas
- Organized for cache-efficient access on CPUs and GPUs, and for parallel processing
- Optimized for both sequential scans and random access
Binary Messaging Protocol
- A record batch is one part (a chunk of rows) of a table
- Record batches can be streamed
Encapsulated Protocol Messages (IPC)
- Specifies the memory location of each column in the table

[1]
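
The ideas in the list above (columns stored contiguously, a table split into record batches, batches streamed as framed messages) can be sketched with the Python standard library alone. This is a toy illustration and not Arrow's actual wire format: the framing (a 4-byte length prefix per column buffer) and the helper names `to_batches`, `write_stream`, and `read_stream` are assumptions made up for this example.

```python
import io
import struct
from array import array


def to_batches(columns, batch_size):
    """Split a dict of equal-length columns into record batches (row ranges)."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}


def write_stream(sink, batches, names):
    """Write every batch as a run of length-prefixed column buffers (hypothetical framing)."""
    for batch in batches:
        for name in names:
            payload = batch[name].tobytes()  # one contiguous buffer per column
            sink.write(struct.pack("<I", len(payload)))
            sink.write(payload)


def read_stream(source, schema):
    """Yield batches one at a time; schema is a list of (name, array typecode)."""
    while True:
        batch = {}
        for name, typecode in schema:
            header = source.read(4)
            if not header:  # end of stream
                return
            (size,) = struct.unpack("<I", header)
            col = array(typecode)
            col.frombytes(source.read(size))
            batch[name] = col
        yield batch


# Usage: a two-column table streamed as batches of at most two rows.
schema = [("id", "q"), ("score", "d")]  # int64 and float64 columns
table = {
    "id": array("q", range(5)),
    "score": array("d", [0.5, 1.5, 2.5, 3.5, 4.5]),
}
buf = io.BytesIO()
write_stream(buf, to_batches(table, batch_size=2), [name for name, _ in schema])
buf.seek(0)
batches = list(read_stream(buf, schema))
```

Because each batch is self-contained, a reader can process one batch at a time without holding the whole table in memory.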

The Arrow format allows for larger-than-memory datasets. When a dataset is written in Arrow format, the data and metadata are laid out in a self-describing layout, with the data written in predetermined binary formats for each supported type. When reading, Arrow memory-maps the data from the Arrow file. This means the OS exposes the file as addressable memory, and pages are loaded into RAM on request. (Jacob Quinn, correspondence on the Apache Arrow mailing list)
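
The memory-mapping behaviour described above can be sketched with Python's standard `mmap` module. This is a minimal sketch under stated assumptions, not how an Arrow library is implemented: the 16-byte header (type tag plus row count), the single int64 column, and the helper names `write_column` and `open_column` are hypothetical, and the code assumes a little-endian host.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header: 8-byte type tag followed by a uint64 row count.
HEADER = struct.Struct("<8sQ")


def write_column(path, values):
    """Write int64 values as a fixed-width binary buffer preceded by the header."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"int64", len(values)))
        f.write(struct.pack(f"<{len(values)}q", *values))


def open_column(path):
    """Memory-map the file and return (mmap, typed view) without copying the data."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    tag, n_rows = HEADER.unpack_from(mm, 0)
    assert tag.rstrip(b"\0") == b"int64"
    # cast("q") reinterprets the bytes as native int64 (little-endian host assumed).
    view = memoryview(mm)[HEADER.size:HEADER.size + 8 * n_rows].cast("q")
    return mm, view


# Usage: only the pages backing the rows that are touched need to be in RAM.
path = os.path.join(tempfile.mkdtemp(), "col.bin")
write_column(path, list(range(1_000)))
mm, col = open_column(path)
total = sum(col[10:20])
first, last = col[0], col[-1]
col.release()  # release the view before closing the mapping
mm.close()
```

The file is never read in full; the OS pages in the regions the view touches, which is what lets a dataset exceed available RAM.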
