How to Read Metadata of Parquet File in Python
Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.
Fundamental features of parquet are:
- it's cross platform
- it's a recognised file format used by many systems
- it stores data in a column layout
- it stores metadata
The latter two points allow for efficient storage and querying of data.
Column Storage
Suppose we have a simple data frame:
tibble::tibble(id = 1:3, name = c("n1", "n2", "n3"), age = c(20, 35, 62))
#> # A tibble: 3 × 3
#>      id name    age
#>   <int> <chr> <dbl>
#> 1     1 n1       20
#> 2     2 n2       35
#> 3     3 n3       62
If we stored this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format. This is row storage. This is efficient for queries such as
SELECT * FROM table_name WHERE id == 2
We simply go to the second row and retrieve that data. It's also very easy to append rows to the data set - we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row relates to age, and extract that value.
Parquet uses column storage. In column layouts, column data are stored sequentially.
With this layout, queries such as
SELECT * FROM dd WHERE id == 2
are now inconvenient. But if we want to sum all the ages, we simply go to the block containing the age column and add up the numbers.
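The trade-off can be sketched in plain Python (a toy illustration of the two layouts, not how parquet actually stores bytes):

```python
# Row storage: each record is kept together, as in a CSV file.
rows = [
    {"id": 1, "name": "n1", "age": 20},
    {"id": 2, "name": "n2", "age": 35},
    {"id": 3, "name": "n3", "age": 62},
]

# Column storage: each column's values are stored contiguously.
columns = {
    "id": [1, 2, 3],
    "name": ["n1", "n2", "n3"],
    "age": [20, 35, 62],
}

# SELECT * WHERE id == 2 is natural in row storage ...
record = next(r for r in rows if r["id"] == 2)
print(record)  # {'id': 2, 'name': 'n2', 'age': 35}

# ... whereas summing ages is natural in column storage:
total_age = sum(columns["age"])
print(total_age)  # 117
```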
Reading and writing parquet files
In R, we read and write parquet files using the {arrow} package.
# install.packages("arrow")
library("arrow")
packageVersion("arrow")
#> [1] '5.0.0'
To create a parquet file, we use write_parquet()
# Use the penguins data set
data(penguins, package = "palmerpenguins")
# Create a temporary file for the output
parquet = tempfile(fileext = ".parquet")
write_parquet(penguins, sink = parquet)
To read the file, we use read_parquet(). One of the benefits of using parquet is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:
- File compression. This is specified via the compression argument in write_parquet(). The default is snappy.
- Clever storage of values (the next section).
Parquet Encoding
Since parquet uses column storage, values of the same type are stored together. This opens up a whole world of optimisation tricks that aren't available when we save data as rows, e.g. CSV files.
Run length encoding
Suppose a column only contains a single value repeated on every row. Instead of storing the same number over and over (as a CSV file would), we can simply record "value X repeated N times". This means that even when N gets very large, the storage costs remain small. If we had more than one value in a column, then we can use a simple look-up table. In parquet, this is known as run length encoding. If we have the following column
c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)
#> [1] 4 4 4 4 4 1 2 2 2 2
This would be stored as
- value 4, repeated five times
- value 1, repeated once
- value 2, repeated four times
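A toy version of this idea in Python (illustrative only; parquet's actual run length encoding is a hybrid bit-packed/RLE scheme):

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

# The column from above.
column = [4, 4, 4, 4, 4, 1, 2, 2, 2, 2]
print(run_length_encode(column))  # [(4, 5), (1, 1), (2, 4)]
```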
To see this in action, let's create a simple example, where the character A is repeated multiple times in a data frame column:
x = data.frame(x = rep("A", 1e6))
We can then create a couple of temporary files for our experiment
parquet = tempfile(fileext = ".parquet")
csv = tempfile(fileext = ".csv")
and write the data to the files
arrow::write_parquet(x, sink = parquet, compression = "uncompressed")
readr::write_csv(x, file = csv)
Using the {fs} package, we extract the file sizes
# Could also use file.info()
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1        1015
#> 2       1.91M
We see that the parquet file is tiny, whereas the CSV file is almost 2MB. That is nearly a 2000-fold reduction in file size.
Dictionary encoding
Suppose we had the following character vector
c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers")
#> [1] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"
If we want to save storage, then we could replace Jumping Rivers with the number 0 and have a table to map between 0 and Jumping Rivers. This would significantly reduce storage, especially for long vectors.
x = data.frame(x = rep("Jumping Rivers", 1e6))
arrow::write_parquet(x, sink = parquet)
readr::write_csv(x, file = csv)
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1       1.09K
#> 2      14.31M
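Dictionary encoding itself is easy to sketch in Python (a toy version; parquet builds the dictionary per column chunk and then run-length encodes the indices):

```python
def dictionary_encode(values):
    """Replace each value with an index into a small lookup table."""
    lookup = {}
    indices = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(lookup)  # first sighting gets the next index
        indices.append(lookup[v])
    return indices, list(lookup)

indices, dictionary = dictionary_encode(["Jumping Rivers"] * 3)
print(dictionary)  # ['Jumping Rivers']
print(indices)     # [0, 0, 0]
```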
Delta encoding
This encoding is typically used in conjunction with timestamps. Times are typically stored as Unix time, which is the number of seconds that have elapsed since January 1st, 1970. This storage format isn't particularly helpful for humans, so it is typically pretty-printed to make it more palatable for us. For example,
(time = Sys.time())
#> [1] "2021-09-21 17:05:08 BST"
unclass(time)
#> [1] 1632240309
If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing
c(1628426074, 1628426078, 1628426080)
#> [1] 1628426074 1628426078 1628426080
we would store
c(0, 4, 6)
with the corresponding offset 1628426074.
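The arithmetic is simple enough to show directly (a toy Python sketch; parquet's delta encodings additionally bit-pack the small offsets):

```python
def delta_encode(timestamps):
    """Store the minimum once, plus each value's offset from it."""
    base = min(timestamps)
    return base, [t - base for t in timestamps]

base, deltas = delta_encode([1628426074, 1628426078, 1628426080])
print(base)    # 1628426074
print(deltas)  # [0, 4, 6]
```

The offsets are small numbers, so they need far fewer bits to store than the original ten-digit timestamps.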
Other encodings
There are a few other tricks that parquet uses. Their GitHub page gives a complete overview.
If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn't trivial and does take some time.
Feather vs Parquet
The obvious question that comes to mind when discussing parquet is how it compares to the feather format. Feather is optimised for speed, whereas parquet is optimised for storage. It's also worth noting that the Apache Arrow file format is feather.
Parquet vs RDS Formats
The RDS file format is used by readRDS()/saveRDS() and load()/save(). It is a file format native to R and can only be read by R. The main benefit of using RDS is that it can store any R object - environments, lists, and functions.
If we are solely interested in rectangular data structures, e.g. data frames, then reasons for using RDS files are
- the file format has been around for a long time and isn't likely to change. This means it is backwards compatible
- it doesn't depend on any external packages; just base R.
The advantages of using parquet are
- the file size of parquet files is slightly smaller. If you want to compare file sizes, make sure you set compression = "gzip" in write_parquet() for a fair comparison
- parquet files are cross platform
- in my experiments, parquet files, as you would expect, are slightly smaller. For some use cases, an additional saving of 5% may be worth it. But, as always, it depends on your particular use cases.
References
- The default compression algorithm used by Parquet.
- A nice talk by Raoul-Gabriel Urma.
- Parquet-tools for interrogating Parquet files.
- The official list of file optimisations
- Stack Overflow questions on Parquet: Feather & Parquet, and Arrow and Parquet
Source: https://www.jumpingrivers.com/blog/parquet-file-format-big-data-r/