Arrow type system as base for the type system #4472

Open
snth opened this issue May 14, 2024 · 1 comment

snth commented May 14, 2024

Motivation

Arrow is increasingly becoming the lingua franca for data interchange. It aims to provide efficient in-memory representations of data structures that allow zero-copy sharing of data between different systems (e.g. between Python and R). Its creators have put a lot of thought into making the format memory- and cache-efficient, drawing on the hurdles they encountered in their previous work (e.g. pandas, dplyr, ...). As such it benefits from a wealth of experience and is geared towards modern analytics workloads. It is therefore no surprise that it is increasingly being adopted as the basis for new data tools such as DataFusion, Polars, InfluxDB v3 and Lance v2, to name a few that I have personally encountered.

I therefore think that it would be a good starting point for the type system for PRQL as tracked in #1965.

Resources

Further work

Compare it with the RFC in #1964 and highlight similarities and differences for further discussion.

snth commented May 14, 2024

For the impatient, below is a summary of the Arrow Schema.fbs specification generated by gemini-1.5-pro-api-preview:


This document describes the Arrow type system, which is used to represent structured data like tables or JSON objects. You can think of it as a schema definition language similar to SQL schemas or JSON Schema.

Here's a breakdown:

  • Basic Types: Arrow supports common data types like integers (Int), floating-point numbers (FloatingPoint), strings (Utf8, LargeUtf8), binary data (Binary, LargeBinary), booleans (Bool), and null values (Null).
  • Complex Types:
    • Structs (Struct_): Like a table row or a JSON object, containing multiple fields with different data types.
    • Lists (List, LargeList): Ordered collections of values of the same type.
    • FixedSizeLists (FixedSizeList): Lists where each entry has a fixed number of elements.
    • Maps (Map): Key-value pairs, similar to dictionaries in Python or objects in JavaScript.
    • Unions (Union): Can hold values of different types, but only one at a time.
  • Logical Types: These provide additional semantic meaning to the underlying data types.
    • Decimal (Decimal): Represents exact decimal values, useful for financial calculations.
    • Date (Date): Represents a date, either as days or milliseconds since the Unix epoch.
    • Time (Time): Represents time of day, with different units like seconds, milliseconds, microseconds, or nanoseconds.
    • Timestamp (Timestamp): Represents a specific point in time, with optional timezone information.
  • Other features:
    • Run-End Encoded (RunEndEncoded): A compressed representation for data with repeating values.
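    • View Types (Utf8View, BinaryView): An alternative layout for variable-length strings and binary data, in which each value is a fixed-size view that stores short values inline and points into a separate buffer for longer ones.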
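
To make the breakdown above a bit more concrete, here is a minimal sketch of how a few of these types could be modelled as a small type tree. This is a hypothetical, simplified model written with plain Python dataclasses; it is not the Arrow API and not a proposal for PRQL's actual representation. The class names mirror the type names in the summary, while the field names, the `render` helper and the `invoice` example are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical model, NOT the Arrow API: class names follow the summary
# above, but the fields are simplified for illustration.


@dataclass(frozen=True)
class Int:
    bit_width: int = 64
    is_signed: bool = True


@dataclass(frozen=True)
class Utf8:
    pass


@dataclass(frozen=True)
class Decimal:
    precision: int  # total number of digits
    scale: int      # digits after the decimal point


@dataclass(frozen=True)
class Date:
    unit: str = "DAY"  # "DAY" or "MILLISECOND" since the Unix epoch


@dataclass(frozen=True)
class Timestamp:
    unit: str = "MICROSECOND"
    timezone: Optional[str] = None  # None means no timezone information


@dataclass(frozen=True)
class Field:
    name: str
    type: object
    nullable: bool = True


@dataclass(frozen=True)
class List:
    item: Field  # every element shares this one type


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]  # like a table row or JSON object


def render(t) -> str:
    """Render a type tree as a compact, human-readable string."""
    if isinstance(t, Int):
        return f"{'int' if t.is_signed else 'uint'}{t.bit_width}"
    if isinstance(t, Utf8):
        return "utf8"
    if isinstance(t, Decimal):
        return f"decimal({t.precision}, {t.scale})"
    if isinstance(t, Date):
        return f"date[{t.unit.lower()}]"
    if isinstance(t, Timestamp):
        tz = f", tz={t.timezone}" if t.timezone else ""
        return f"timestamp[{t.unit.lower()}{tz}]"
    if isinstance(t, List):
        return f"list<{render(t.item.type)}>"
    if isinstance(t, Struct):
        inner = ", ".join(f"{f.name}: {render(f.type)}" for f in t.fields)
        return f"struct<{inner}>"
    raise TypeError(f"unsupported type: {t!r}")


# A made-up relation schema combining basic, complex and logical types.
invoice = Struct((
    Field("id", Int(64), nullable=False),
    Field("amount", Decimal(12, 2)),
    Field("issued", Date("DAY")),
    Field("tags", List(Field("item", Utf8()))),
))

print(render(invoice))
```

The point of the sketch is only that complex types nest other types through named fields, while logical types such as Decimal, Date and Timestamp carry a few extra parameters (precision and scale, unit, timezone) on top of a simple physical value.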

This specification also defines metadata versions to ensure compatibility between different implementations of Arrow. Additionally, it outlines features that may not be fully supported by all implementations, allowing for forward compatibility and negotiation between clients and servers.
