My server storing cryptocurrency data started to overflow, even though I was storing compressed CSV files. Trying to come up with a quick solution, I figured I should just switch to a more effective compression algorithm (the data was stored using gzip). But how to quickly figure out which one would be better?
Fast forward: I made the package shrynk for compression using machine learning! It helps you by choosing (and applying) the format to compress your dataframes, JSON, or actually, files in general.
Given example data, a mixed strategy chosen by machine learning uses roughly 30% less disk space overall than the best single compression algorithm (meaning: choosing any single compression algorithm to compress everything would be worse).
You can try it for yourself at https://shrynk.ai.
Bonus: If the algorithm gets it wrong, the features of the data (such as number of rows and number of columns, but not the data itself) will be added to the python package on the next release! This allows the algorithm to learn when to apply which compression algorithm.
Next, I will explain Compression, Machine Learning and the library I’ve built in Python.
Compression
Compression is about reducing the size of data. The obvious reason for this is to be able to store more data. Since I/O and networking (to disk or over the internet) are often the bottleneck, not CPU processing, it makes sense to compress data before it goes in transit and to spend some time decompressing it afterwards.
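As a minimal illustration of that trade-off, here is a quick sketch using Python’s built-in gzip module (not shrynk) that compresses some repetitive text before it goes in transit and decompresses it on the other side:

import gzip

text = ("name,price\n" + "bitcoin,8000\n" * 1000).encode("utf-8")
compressed = gzip.compress(text)        # pay CPU time here...
print(len(text), len(compressed))       # ...to move far fewer bytes
restored = gzip.decompress(compressed)  # and pay again on the receiving end
assert restored == text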
Different algorithms exist for different types of data, such as images, video and audio, but also text and general-purpose files. Any data can be compressed as long as there is a pattern to use. For example, a compression algorithm might find that in images there is a lot of repetition: there will be areas of pixels with the same color, and it is possible to use compression to avoid having to store every single pixel individually.
In video, it is not just “2D” but there is a time dimension as well, and one successful way to compress is to only store the delta between frames, that is, only where a frame differs from the previous one. Yet another example is whitespace in JSON files, which can be compressed away.
Compressing tabular data
The first case of shrynk has to do with matrix/tabular data: rows and columns of data. Here is an example file called toy_data.csv:
gender,age
female, 20
female, 30
female, 40
male, 50
male, 60
Let’s concern ourselves with the gender variable only. For each row, storing female or male is not optimal given that we know these are the only values in this column. One improvement could be to take all the values in the column and do a replacement: F for female, and M for male. The shorter the string, the better, right? Of course, to decompress you need to add the extra data that F means female and M means male, but this way you only have to store the “longer” string once. This is a part of what is called dictionary encoding and is used in the Parquet format.
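To make the idea concrete, here is a rough sketch in plain Python (not the actual Parquet implementation) of dictionary-encoding the gender column:

column = ["female", "female", "female", "male", "male"]

# build the dictionary once: each distinct value gets a short code
dictionary = {"female": "F", "male": "M"}
encoded = [dictionary[value] for value in column]  # ['F', 'F', 'F', 'M', 'M']

# to decompress, store the reverse mapping once and expand the codes
reverse = {code: value for value, code in dictionary.items()}
decoded = [reverse[code] for code in encoded]
assert decoded == column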
Another optimization of Parquet is to use run-length encoding, which makes use of the observation that very often the same value occurs in sequence. Oversimplifying a bit, we could encode the column gender as 3F2M. Decompressing this would expand it back into the original data.
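A similarly simplified sketch of run-length encoding on the dictionary-encoded column, using itertools.groupby:

from itertools import groupby

encoded = ["F", "F", "F", "M", "M"]

# store each value once, together with how often it repeats in sequence
runs = [(value, len(list(group))) for value, group in groupby(encoded)]
print(runs)  # [('F', 3), ('M', 2)] -- roughly the "3F2M" from the text

# decompressing expands the runs back into the original sequence
decoded = [value for value, count in runs for _ in range(count)]
assert decoded == encoded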
Note that Parquet is not necessarily better for floating-point values (like pi): since such values are mostly unique, we cannot use these tricks.
Tabular data in Python
In the case of tabular data, we often use pandas.read_csv to read in data from disk and produce a DataFrame.
However, csv is often uncompressed. A dataframe can easily be written back to disk using compression though. Here are several available options, with and without compression:
fastparquet+UNCOMPRESSED
fastparquet+GZIP
fastparquet+SNAPPY
fastparquet+LZO
fastparquet+LZ4
fastparquet+ZSTD
pyarrow+brotli
pyarrow+gzip
pyarrow+lz4
pyarrow+snappy
pyarrow+zstd
csv+gzip
csv+bz2
csv+zip
csv+xz
csv
The first part before the + is the “engine” (csv is handled by df.to_csv, and pyarrow and fastparquet are different libraries used for df.to_parquet), whereas the latter part is the compression that is being used.
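For example, writing the same DataFrame with a few of these engine/compression combinations looks roughly like this (a sketch using the standard pandas API; the file names are just illustrative):

import pandas as pd

df = pd.read_csv("toy_data.csv")

# csv engine with different compressions (df.to_csv)
df.to_csv("toy_data.csv.gz", index=False, compression="gzip")
df.to_csv("toy_data.csv.bz2", index=False, compression="bz2")

# parquet engines with different compressions (df.to_parquet)
df.to_parquet("toy_data.snappy.parquet", engine="pyarrow", compression="snappy")
df.to_parquet("toy_data.gzip.parquet", engine="fastparquet", compression="gzip")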
It turns out that it is difficult to find out when to use which format, that is, to find the right “boundaries” when choosing between so many options. It depends on a lot of factors.
It would be great to know which compression would be best, wouldn’t it?
Running benchmarks
First of all, shrynk can help with running benchmarks for your own file, or of course, for a collection of files.
I took the gender / age toy data from before and show below how to get the actual profiled results per compression for a file.
from shrynk import show_benchmark
show_benchmark("toy_data.csv") # takes either a(n un)compressed filename, or DataFrame/object
# Output (sorted on size)
kwargs size write read
csv+bz2 87 0.135 0.138
csv 89 0.136 0.139
csv+gzip 97 0.137 0.140
csv+xz 112 0.140 0.143
csv+zip 218 0.133 0.137
fastparquet+LZ4 515 0.277 0.279
fastparquet+GZIP 518 0.284 0.286
fastparquet+SNA.. 520 0.295 0.298
fastparquet+ZSTD 522 0.273 0.275
fastparquet+LZO 526 0.281 0.284
fastparquet+UNC.. 556 0.438 0.675
pyarrow+brotli 1146 0.016 0.023
pyarrow+lz4 1152 0.015 0.018
pyarrow+snappy 1171 0.012 0.016
pyarrow+zstd 1180 0.013 0.018
pyarrow+gzip 1209 0.017 0.023
You can see that in the case of this file, it is recommended to store the data as csv+bz2, as it yields a very small file on disk (and it uses the default settings).
Running all the compression benchmarks only took 3 seconds on my laptop, which might make you wonder why we wouldn’t just always run the benchmarks. Well, for any sizeable dataframe you do not want to run the benchmarks, as this can take a very long time - especially when an unfortunate compression algorithm struggles with the particular data at hand. At the same time, why run the benchmarks when we can predict? Enter Machine Learning.
Machine Learning
Computers are good at automating human decisions. Instead of being explicitly programmed with decision rules, a machine learning model leverages historical data, such as characteristics and the decisions actually made, to predict the best decision for new characteristics in the future.
For example, Machine Learning helps in deciding whether a new email is spam: users have labeled loads of other emails in the past as either spam or not spam (the decisions), and the words (the characteristics) that occur more frequently in the spam emails (e.g. “Nigerian Prince”) as opposed to the non-spam emails (e.g. “Hi John”) drive the decision. Well, assuming that the person’s name was actually John, of course, otherwise it’s probably spam again.
The usual downside of machine learning is that it is very costly to gather the correct decisions - it is often manual labor by people.
The cool thing about machine learning for compression is that we can try all compressions for a file to find out what the optimal decision would be, without much cost (except time).
In this case, ~3000 files have been used to model the compression. For each file, its characteristics have been recorded, together with the quantities we would like to learn to minimize:
- how large it becomes (size on disk),
- how long a write takes (write time), and
- how long a read takes (read time)
It could be that, based on some data of let’s say 5000 rows and 10 columns of price data, compression A is best in terms of size, but if it concerns 500 rows and 100 columns of time data, compression B would be best. Perhaps depending on other factors, such as the amount of text columns, the situation will again be different.
To give more example features: currently it will consider the percentage of null values, the average string length, how distinct variables are (cardinality), and even a few more. These are all used to figure out which compression to apply in a similar context. An overview of the features used can be found in the shrynk repository.
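The exact feature set lives in the shrynk source; purely as an illustration, a few of the features mentioned above could be computed along these lines (a hypothetical helper, not shrynk’s actual code):

import pandas as pd

def example_features(df: pd.DataFrame) -> dict:
    # rough stand-ins for the kind of characteristics shrynk records
    object_cols = df.select_dtypes(include="object")
    avg_str_len = (
        object_cols.astype(str).apply(lambda col: col.str.len().mean()).mean()
        if not object_cols.empty else 0.0
    )
    return {
        "num_rows": df.shape[0],
        "num_cols": df.shape[1],
        "percent_null": float(df.isnull().mean().mean()),
        "avg_string_length": float(avg_str_len),
        "cardinality": float(df.nunique().mean()),
    }

example_features(pd.read_csv("toy_data.csv"))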
Machine Learning in Shrynk
Note that there are other attempts at using machine learning to do compression… most notably in compressing images. Shrynk uses a meta approach instead, as it only uses already existing compression methods.
For each file, shrynk applies every compression algorithm, gathers the resulting values for the size, write and read columns, and converts them to z-scores: this makes them comparable in each dimension and is a novel approach as far as I know. The compression with the lowest sum of z-scores, given the user weights, gets chosen as the classification label. Let’s see that in slow motion.
First to see how converting to z-scores works:
from sklearn.preprocessing import scale
scale([1, 2, 3])
# array([-1.224, 0, 1.224])
scale([100, 200, 300])
# array([-1.224, 0, 1.224])
You can see that the scale does not matter but the relative difference does: (1, 2, 3) and (100, 200, 300) get the same scores even though they are 100x larger. Also note that here we are ignoring the unit (bytes vs seconds).
Here is a fake example showing the scores of 3 compressions for a single imaginary file, considering only size and write:
size write
compr A 100kb 2s
compr B 200kb 1s
compr C 300kb 3s
Converting per column to Z-scores:
z-size z-write
comp A -1.22 0
comp B 0 -1.22
comp C 1.22 1.22
Then we multiply the z-scores by the user weights (Size=1, Write=2):
u-s u-w | user-z-sum
-1.22 0 | -1.22
0 -2.44 | -2.44
1.22 2.44 | 3.66
In the last column you can see the sum over the rows to get a weighted z-score for each compression.
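Putting those same numbers into code, a minimal sketch of the weighted z-score selection (the real label generation in shrynk may differ in its details):

import numpy as np
from sklearn.preprocessing import scale

# rows: compression A, B, C; columns: size (kb), write time (s)
results = np.array([[100, 2],
                    [200, 1],
                    [300, 3]])
weights = np.array([1, 2])                # user weights: size=1, write=2

z = scale(results)                        # z-score each column independently
weighted_sum = (z * weights).sum(axis=1)  # approx. [-1.22, -2.44, 3.66]
best = ["A", "B", "C"][int(np.argmin(weighted_sum))]
print(best)                               # 'B' -> becomes the classification label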
Given the example data and s=1 and w=2, compression B would have the lowest summed z-score and thus be best! This means that the characteristics of this data (such as num_rows etc.) together with the label compression B will be used to train a classification model.
In the end, the input will be this result for each file (so the sample size is number_of_files; not number_of_files * number_of_compressions).
A sample data set might look like (completely made up):
num_cols num_rows missing | best_label
19 1000 0.1 | csv+bz
5 10000 0 | fastparquet+GZIP
55 2333 0.2 | pyarrow+brotli
190 500 0.05 | csv+gzip
... and so on ....
A simple RandomForestClassifier will be trained on the included benchmark data for 3000 files, based on the user weights.
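As a rough sketch of what that training step amounts to (using the made-up features and labels from the table above; shrynk’s own train_model call wraps all of this for you):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# characteristics per file plus the winning compression as label
benchmark = pd.DataFrame({
    "num_cols": [19, 5, 55, 190],
    "num_rows": [1000, 10000, 2333, 500],
    "missing": [0.1, 0.0, 0.2, 0.05],
    "best_label": ["csv+bz2", "fastparquet+GZIP", "pyarrow+brotli", "csv+gzip"],
})

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(benchmark[["num_cols", "num_rows", "missing"]], benchmark["best_label"])

# predict the best compression for a new file's characteristics
clf.predict(pd.DataFrame({"num_cols": [10], "num_rows": [5000], "missing": [0.0]}))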
Knowing that, let’s look at some usage examples.
Usage
Basic example of saving (which predicts the best type) and loading:
import pandas as pd
from shrynk import save, load
# save dataframe compressed
my_df = pd.DataFrame({"a": [1]})
file_path = save(my_df, "mypath.csv")
# e.g. mypath.csv.bz2
# load compressed file
loaded_df = load(file_path)
If you just want the prediction, you can also use infer:
import pandas as pd
from shrynk import infer
infer(pd.DataFrame({"a": [1]}))
# {"engine": "csv", "compression": "bz2"}
For more control, it is possible to use shrynk classes directly:
import pandas as pd
from shrynk import PandasCompressor
df = pd.DataFrame({"a": [1,2,3]})
pdc = PandasCompressor()
pdc.train_model(3, 1, 1)  # weights for size, write and read (here: 3, 1, 1)
pdc.predict(df)
pdc.show_benchmark(df, 3, 1, 1)
After installing, it is also available on the command line as shrynk.
Use it to automagically compress, decompress and run benchmarks for files:
shrynk compress mydata.csv
shrynk decompress mydata.json.gz
shrynk benchmark mydata.png
Data comes packaged
Note that the data comes packaged in shrynk: this is a feature, not a bug. It allows people to provide their own requirements for size, write and read speeds. It will then train a model on the spot (if not trained earlier).
I also hope that including the data encourages others to see if they can improve the compression score :)
You can also use it for validating improvements to the algorithm, or generically, the approach.
How well does it work? Cross-validation
I have built in a validate function, available to all shrynk classes, so you can test your own strategies or train and verify results on your own data.
It uses the available data for the given class and produces cross-validation results, taking into account the user-defined weights.
Validation was a tough one, mostly because it took me a while to figure out how to express the results in a meaningful way when allowing the use of weights. In the end I chose to keep the dimensions of interest, size, read and write, and show how the aggregate of shrynk’s predictions per object compares (in proportions) against always choosing a single strategy. I hope the example makes it clear.
See the code below:
from shrynk import PandasCompressor
pdc = PandasCompressor()
weights = (1, 0, 0)  # (size, write, read): optimize for size only
acc, result = pdc.validate(*weights)
Note that shrynk in the example below is not only 31% better in size (check the arrows), but also much better in terms of read and write time compared to always applying the single best strategy, csv+xz.
The 1.001 value indicates it is only 0.1% away from what would be achievable in terms of size if it always chose the best compression per file in the validation set. At the same time, it is 6.653 times slower in terms of reading time compared to always choosing the best. It was, after all, optimizing for size.
[shrynk] s=1 w=0 r=0
----------------
it 0/5: accuracy shrynk 82.49%
it 1/5: accuracy shrynk 100.0%
it 2/5: accuracy shrynk 78.88%
it 3/5: accuracy shrynk 83.33%
it 4/5: accuracy shrynk 99.95%
Avg Accuracy: 0.889
results sorted on SIZE, shown in proportion increase vs ground truth best
size read write
shrynk ---> 1.001 6.653 7.443
csv+xz ---> 1.311 25.180 31.659
csv+bz2 1.370 12.698 14.549
csv+zip 2.163 11.646 13.701
fastparquet+ZSTD 4.845 5.069 6.625
fastparquet+GZIP 4.958 6.866 8.805
fastparquet+LZO 5.666 4.959 6.446
fastparquet+LZ4 5.858 4.874 6.395
fastparquet+SNA. 6.008 5.105 6.611
csv 7.454 9.214 10.663
csv+gzip 7.455 9.114 10.543
pyarrow+brotli 8.418 2.163 3.045
pyarrow+zstd 8.715 1.219 1.495
pyarrow+gzip 9.044 2.038 2.864
pyarrow+lz4 9.320 1.126 1.366
pyarrow+snappy 9.356 1.165 1.433
fastparquet+UNC 16.360 5.851 6.870
Lastly, let’s look at user weights of size=3 and write=1 to see a mixed approach:
size write z
shrynk 1.422 4.069 -4.258
csv+bz2 1.330 9.621 -3.139
csv+zip 2.148 7.857 -3.063
fastparquet+ZSTD 6.149 6.379 -1.183
fastparquet+GZIP 6.239 7.307 -0.939
fastparquet+LZO 6.968 6.470 -0.716
fastparquet+LZ4 7.035 6.424 -0.689
csv+xz 1.272 22.527 -0.452
fastparquet+SNA. 7.564 6.481 -0.387
pyarrow+zstd 10.790 1.453 0.320
pyarrow+brotli 10.400 2.806 0.391
pyarrow+lz4 11.323 1.307 0.581
pyarrow+snappy 11.365 1.338 0.610
pyarrow+gzip 11.147 2.469 0.729
csv+gzip 11.243 6.598 1.652
csv 11.263 6.564 1.655
fastparquet+UNC 24.420 6.694 8.887
Conclusion
Since having to come up with manual rules for when to use which compression would be very complex and time-costly, this was a great case for machine learning. It is also very scalable: when a new compression algorithm shows up on the market, shrynk can simply benchmark it, and no new rules have to be figured out.
Note that it currently requires 30% less disk space compared to if I had used only the single best strategy (in reality, you might not even always use the single best strategy but a worse one). Though, granted, this was on rather heterogeneous data: small, big, and really versatile, e.g. in terms of text and missingness. To prevent bias, the training data was balanced so that each compression algorithm occurs equally often as the best.
You know you’re in for a win with machine learning when you can get (or even create) data cheaply.
It also illustrates an awesome way to enhance coding: by having a computer predict the best way to do “something” - in this case, compress.
What’s next
If you’ve liked this, feel free to follow me on github and leave a star for shrynk.
If you haven’t tried it yet, you can see what it looks like in action at https://shrynk.ai
And please help out if you’re interested, as any attempts at Pull Requests will be welcomed: it’s a python project, by the community, for the community.
- Further investigation of which variables are most predictive (and cheap to compute)
- The stats computation could probably be sped up as well once we further investigate which features are most successful.
- Add other data types to the framework (JSON and Bytes have recently been added)
- I’ve been working on a package to generate artificial data. Imagine creating slight variations based on requirements, or on existing dataframes, to have data evolve in the direction we want, so we can learn better boundaries!
- Other improvements?