How to Read Metadata of Parquet File in Python

Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.

Fundamental features of parquet are:

  • it's cross platform
  • it's a recognised file format used by many systems
  • it stores data in a column layout
  • it stores metadata

The latter two points allow for efficient storage and querying of data.

Column Storage

Suppose we have a simple data frame:

    tibble::tibble(id = 1:3,
                   name = c("n1", "n2", "n3"),
                   age = c(20, 35, 62))
    #> # A tibble: 3 × 3
    #>      id name    age
    #>   <int> <chr> <dbl>
    #> 1     1 n1       20
    #> 2     2 n2       35
    #> 3     3 n3       62

If we stored this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format. This is row storage. This is efficient for queries such as,

    SELECT * FROM table_name WHERE id == 2

We simply go to the second row and retrieve that data. It's also very easy to append rows to the data set - we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row relates to age, and extract that value.

Parquet uses column storage. In column layouts, column data are stored sequentially.
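As a rough illustration using the data frame above, the two layouts store the same values in a different order on disk (this is purely conceptual, not parquet's actual byte layout):

    # Row layout (CSV-style): one record after another
    # 1, n1, 20
    # 2, n2, 35
    # 3, n3, 62

    # Column layout (parquet-style): one column after another
    # id:   1, 2, 3
    # name: n1, n2, n3
    # age:  20, 35, 62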

With this layout, queries such as

    SELECT * FROM dd WHERE id == 2

are now inconvenient. But if we want to sum up all the ages, we simply go to the third block - the age column - and add up the numbers.

Reading and writing parquet files

In R, we read and write parquet files using the {arrow} package.

    # install.packages("arrow")
    library("arrow")
    packageVersion("arrow")
    #> [1] '5.0.0'

To create a parquet file, we use write_parquet()

    # Use the penguins data set
    data(penguins, package = "palmerpenguins")
    # Create a temporary file for the output
    parquet = tempfile(fileext = ".parquet")
    write_parquet(penguins, sink = parquet)

To read the file, we use read_parquet() - a short sketch is given after the list below. One of the benefits of using parquet is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:

  • File compression. This is specified via the compression argument in write_parquet(). The default is snappy.
  • Clever storage of values (the next section).
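For the read side, a minimal sketch reusing the parquet path created above:

    # Read the parquet file back in as a data frame
    penguins2 = read_parquet(parquet)
    head(penguins2)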

Parquet Encoding

Since parquet uses column storage, values of the same type are stored together. This opens up a whole world of optimisation tricks that aren't available when we save data as rows, e.g. CSV files.

Run length encoding

Suppose a column only contains a single value repeated on every row. Instead of storing the same number over and over (as a CSV file would), we can simply record "value X repeated N times". This means that even when N gets very large, the storage costs remain small. If we had more than one value in a column, then we can use a simple look-up table. In parquet, this is known as run length encoding. If we have the following column

    c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)
    #>  [1] 4 4 4 4 4 1 2 2 2 2

This would be stored as

  • value 4, repeated 5 times
  • value 1, repeated once
  • value 2, repeated 4 times

To see this in action, let's create a simple example, where the character A is repeated multiple times in a data frame column:

    x = data.frame(x = rep("A", 1e6))

We can then create a couple of temporary files for our experiment

    parquet = tempfile(fileext = ".parquet")
    csv = tempfile(fileext = ".csv")

and write the data to the files

    arrow::write_parquet(x, sink = parquet, compression = "uncompressed")
    readr::write_csv(x, file = csv)

Using the {fs} package, we extract the size

    # Could also use file.info()
    fs::file_info(c(parquet, csv))[, "size"]
    #> # A tibble: 2 × 1
    #>          size
    #>   <fs::bytes>
    #> 1        1015
    #> 2       1.91M

We see that the parquet file is tiny, whereas the CSV file is almost 2MB. That is roughly a 2000 fold reduction in file space.
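To compute that ratio directly (a quick sketch, assuming the two temporary files from above still exist):

    # Ratio of CSV size to parquet size
    as.numeric(fs::file_size(csv)) / as.numeric(fs::file_size(parquet))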

Dictionary encoding

Suppose we had the following character vector

                          c("Jumping Rivers",              "Jumping Rivers",              "Jumping Rivers")              #> [i] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"                      

If we want to save storage, then we could replace Jumping Rivers with the number 0 and have a table that maps between 0 and Jumping Rivers. This would significantly reduce storage, especially for long vectors.
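R's factors use the same idea, which makes for a handy analogy (an analogy only - it is not how parquet stores its dictionary internally):

    company = c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers")
    f = factor(company)
    # Underneath, a factor is integer codes plus a look-up table of levels
    unclass(f)
    #> [1] 1 1 1
    #> attr(,"levels")
    #> [1] "Jumping Rivers"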

    x = data.frame(x = rep("Jumping Rivers", 1e6))
    arrow::write_parquet(x, sink = parquet)
    readr::write_csv(x, file = csv)
    fs::file_info(c(parquet, csv))[, "size"]
    #> # A tibble: 2 × 1
    #>          size
    #>   <fs::bytes>
    #> 1       1.09K
    #> 2      14.31M

Delta encoding

This encoding is typically used in conjunction with timestamps. Times are typically stored as Unix times, which is the number of seconds that have elapsed since January 1st, 1970. This storage format isn't particularly helpful for humans, so typically it is pretty-printed to make it more palatable for us. For example,

    (time = Sys.time())
    #> [1] "2021-09-21 17:05:08 BST"
    unclass(time)
    #> [1] 1632240309

If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing

    c(1628426074, 1628426078, 1628426080)
    #> [1] 1628426074 1628426078 1628426080

we would store

    c(0, 4, 6)
    #> [1] 0 4 6

with the corresponding offset 1628426074.
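In R terms, the transformation is just a subtraction (a sketch of the idea, not parquet's actual implementation):

    times = c(1628426074, 1628426078, 1628426080)
    offset = min(times)
    deltas = times - offset
    deltas
    #> [1] 0 4 6
    # The original values can be recovered exactly
    all(deltas + offset == times)
    #> [1] TRUE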

Other encodings

There are a few other tricks that parquet uses. Their GitHub page gives a complete overview.

If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn't trivial and does take some time.
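parquet-mr reports the low-level encodings; if you only need schema-level metadata (column names and types), the {arrow} package can show that without any extra tools. A sketch, reusing the parquet path from the examples above:

    # Read as an Arrow Table rather than a data frame, then inspect it
    tbl = arrow::read_parquet(parquet, as_data_frame = FALSE)
    tbl$schema    # column names and types stored in the file
    tbl$num_rows  # number of rows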

Feather vs Parquet

The obvious question that comes to mind when discussing parquet is how it compares to the feather format. Feather is optimised for speed, whereas parquet is optimised for storage. It's also worth noting that the Apache Arrow file format is feather.

Parquet vs RDS Formats

The RDS file format is used by readRDS()/saveRDS() and load()/save(). It is a file format native to R and can only be read by R. The main benefit of using RDS is that it can store any R object - environments, lists, and functions.

If we are solely interested in rectangular data structures, e.g. data frames, then reasons for using RDS files are

  • the file format has been around for a long time and isn't likely to change. This means it is backwards compatible
  • it doesn't depend on any external packages; just base R.

The advantages of using parquet are

  • the file size of parquet files is slightly smaller. If you want to compare file sizes, make sure you set compression = "gzip" in write_parquet() for a fair comparison (see the sketch after this list).
  • parquet files are cross platform
  • in my experiments, parquet files, as you would expect, are slightly smaller. For some use cases, an additional saving of 5% may be worth it. But, as always, it depends on your particular use case.
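A rough sketch of the comparison suggested above (the exact numbers will vary with your data):

    # Compare gzip-compressed parquet against RDS for the same data frame
    data(penguins, package = "palmerpenguins")
    rds = tempfile(fileext = ".rds")
    parquet = tempfile(fileext = ".parquet")
    saveRDS(penguins, rds)  # saveRDS() gzip-compresses by default
    arrow::write_parquet(penguins, sink = parquet, compression = "gzip")
    fs::file_info(c(parquet, rds))[, "size"]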

References

  • The default compression algorithm used by Parquet.
  • A nice talk by Raoul-Gabriel Urma.
  • Parquet-tools for interrogating Parquet files.
  • The official list of file optimisations
  • Stackoverflow questions on Parquet: Feather & Parquet and Arrow and Parquet

Source: https://www.jumpingrivers.com/blog/parquet-file-format-big-data-r/
