Encrypting Large Files
We have a client doing interesting data science work that depends on processing very large files (100GB) that also have to be transferred between parties.
This workflow was the inspiration for our S3S2 open source project, which aims to make it easy to share files securely.
As we got deeper into the problem, we realized that we hadn’t fully examined how resource-intensive handling such large files would be. This post shares some of the initial results.
Getting Started with Large Files
First we created a very large file by doing this:
mkfile -n 100g ./100g_test_file.txt
This is, of course, naive, because the file we just created is all 0’s, so we won’t see realistic compression. Still, it is a real 100GB file, so let’s start here.
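If we wanted a less compressible test file, one option (which we didn’t use here) would be to fill it from /dev/urandom instead; the filename below is illustrative, and bs=1m is the macOS/BSD spelling (GNU dd on Linux wants bs=1M):

# ~100GB of random, effectively incompressible data (slow to generate).
dd if=/dev/urandom of=./100g_random_file.txt bs=1m count=102400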
Next, we’re curious how hard it is to just compress the file.
om:s3s2 mk$ time gzip 100g_test_file.txt

real    9m9.417s
user    7m53.403s
sys     0m58.835s
OK, so we’re talking ~10 minutes to gzip a 100GB file on my local drive with an SSD. It could be faster in the cloud, but for our use case some folks want the data locally and some want to use it in the cloud, so it’s at least a reasonable first test.
So… what do you think: is the encryption going to take a lot longer than that?
GPG RSA Encryption
Let’s see! Here we’ll encrypt the same data with RSA.
om:s3s2 mk$ time gzip 100g_test_file.txt

real    9m9.417s
user    7m53.403s
sys     0m58.835s
Turns out it’s pretty much the same: ~10 minutes. This is RSA with a 4096-bit key.
Cool. OK, well that’s good. I wouldn’t want the encryption to be much slower than that or this whole approach is going to be unusable.
Let’s check out the decryption.
om:s3s2 mk$ time gpg -d 100g_test_file.txt.gpg

You need a passphrase to unlock the secret key for
user: "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
4096-bit RSA key, ID A497AEC4, created 2019-04-11 (main key ID 5664D905)

gpg: encrypted with 4096-bit RSA key, ID A497AEC4, created 2019-04-11
      "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"

real    206m31.857s
user    49m31.655s
sys     43m35.630s
Oh no! 200+ MINUTES!? As my kids would say: WHAT ON PLANET EARTH!? That is definitely unusable.
OK. Step back. Let’s think here. With encryption, people usually use symmetric algorithms like AES for the bulk data after exchanging the key with asymmetric ones like RSA… because symmetric encryption is faster. Cool. Let’s see what that looks like then.
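To make that pattern concrete, here’s a rough sketch of doing it by hand with openssl (the filenames and recipient key are placeholders; gpg does something similar internally when you encrypt to a recipient):

# Generate a random 256-bit data key.
openssl rand -hex 32 > data.key

# Encrypt the big file with AES-256, using the random key material as the passphrase (needs OpenSSL 1.1.1+ for -pbkdf2).
openssl enc -aes-256-cbc -pbkdf2 -pass file:data.key -in 100g_test_file.txt -out 100g_test_file.txt.enc

# Encrypt the small data key with the recipient's RSA public key, then send both files along.
openssl pkeyutl -encrypt -pubin -inkey recipient_pub.pem -in data.key -out data.key.enc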
Symmetric Encryption
Let’s try encrypting that 100GB file with AES-256, which is presumably very fast because processors have been optimized to do it efficiently.
om:s3s2 mk$ time gpg --cipher-algo AES256 -c 100g_test_file.txt

real    14m36.886s
user    12m35.381s
sys     1m14.609s
14+ minutes is longer than the gzip or the RSA encryption, but not terrible. What does it look like going the other way? My guess would have been about the same as on the way in. That just shows how much I know…
om:s3s2 mk$ time gpg --output 100g_test_file.aesdecrypted.txt --decrypt 100g_test_file.txt.gpg

gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    49m34.661s
user    44m6.992s
sys     3m32.160s
50 minutes! That’s more than 3x the time to encrypt and too long to be practically useful.
Compression
Per a colleague’s suggestion, we thought maybe the compression was slowing us down a lot, so we tested with compression turned off.
om:s3s2 mk$ time gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt

real    23m28.978s
user    18m29.866s
sys     3m22.856s

om:s3s2 mk$ time gpg -d 100g_test_file.nocompress.txt.gpg

gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    205m4.581s
user    65m6.715s
sys     47m45.610s
Everything got slower. So that was a wrong turn.
It is possible that using a library that takes better advantage of processor support for AES (i.e., AES-NI) would significantly improve the result. We’re not sure how portable that would be, so we’re setting it aside for now.
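For the curious, a quick way to sanity-check whether hardware AES is available and what it buys you (a rough sketch; the exact flags and output vary by platform and OpenSSL version):

# Look for the AES CPU feature flag.
sysctl -a | grep -i machdep.cpu.features   # macOS (Intel)
grep -m1 -o aes /proc/cpuinfo              # Linux

# Compare AES-256 throughput via the plain vs. hardware-accelerated (EVP) code paths.
openssl speed aes-256-cbc
openssl speed -evp aes-256-cbc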
Other Approaches
We tried 7z with AES in stream… but that was also just slower.
om:s3s2 mk$ time 7za a -p 100g_test_file.txt.7z 100g_test_file.txt
...
Files read from disk: 1
Archive size: 15768178 bytes (16 MiB)
Everything is Ok

real    43m48.769s
user    102m6.180s
sys     4m51.758s
Decompressing and decrypting:
om:s3s2 mk$ time 7z x 100g_test_file.txt.7z -p
...
Size:       107374182400
Compressed: 15768178

real    7m29.579s
user    5m17.954s
sys     1m11.559s
Other Archiving: Zstd
Kudos to @runako who pointed me to the zstd library.
om:s3s2 mk$ time zstd -o 100g_test_file.txt.zstd 100g_test_file.txt
100g_test_file.txt : 0.01% (107374182400 => 9817620 bytes, 100g_test_file.txt.zstd)

real    1m22.134s
user    0m55.896s
sys     1m5.896s
Whoa - that’s fast! I guess sometimes the new tricks are worth learning. If we can apply really good compression this fast, it will also make the encryption step much faster, since there is far less data left to encrypt.
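For reference, the decompression side and a multi-threaded variant look like this (commands only; we haven’t benchmarked these here, and -T0 and -19 are just illustrative options):

# Decompress.
zstd -d -o 100g_test_file.decompressed.txt 100g_test_file.txt.zstd

# Use all available cores (-T0) and a higher compression level to trade CPU time for ratio.
zstd -T0 -19 -o 100g_test_file.txt.zstd 100g_test_file.txt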
Conclusion
I don’t know as much about encryption or compression as I thought. 😀
Dealing with very large files is still a legitimate and relevant challenge, given the day-to-day tasks of data scientists working on huge data sets. How we approach the problem has real consequences for how effectively we can handle the data.
Building a solution that pulls some of these effective practices together was a bigger challenge than we thought it might be, but it’s bearing fruit with our S3S2 project. By using Zstd with gpg and S3, we’ve settled on what is, at least so far, the fastest way we’ve found to safely work on and share lots of very large files.
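To give a flavor of that flow, here’s a hand-rolled sketch (not S3S2 itself; the recipient, bucket, and filenames are placeholders):

# Sending side: compress, encrypt to the recipient's public key, upload.
zstd -o bigfile.zst bigfile.csv
gpg -e -r recipient@example.com bigfile.zst
aws s3 cp bigfile.zst.gpg s3://example-bucket/bigfile.zst.gpg

# Receiving side: download, decrypt, decompress.
aws s3 cp s3://example-bucket/bigfile.zst.gpg .
gpg -o bigfile.zst -d bigfile.zst.gpg
zstd -d -o bigfile.csv bigfile.zst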
We’d love to hear your input.