Encrypting Large Files

We have a client doing interesting data science that depends on processing very large files (100GB), which also need to be transferred between parties.

This workflow was actually the inspiration for our open source S3S2 project, which aims to make it easy to share files securely.

As we got deeper into the problem, we realized that we hadn’t fully examined how resource intensive it would be to handle such large files. This post shares some of the initial results.

Getting Started with Large Files

First we created a very large file by doing this:

mkfile -n 100g ./100g_test_file.txt

This is, of course, naive, because we won't see realistic compression: the file we just created is all zeros. However, it is a 100GB file, so let's start here.
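
For what it's worth, random data would give us the opposite extreme: a file that barely compresses at all, with real data sets falling somewhere in between. If we wanted to test that, a rough sketch (sizes and paths are illustrative, and generating this much random data is itself slow):

# ~10GB of incompressible random data (macOS dd syntax; use bs=1M on Linux)
dd if=/dev/urandom of=./10g_random_file.txt bs=1m count=10240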

Next, we’re curious how hard it is to just compress the file.

om:s3s2 mk$ time gzip 100g_test_file.txt
real    9m9.417s
user    7m53.403s
sys     0m58.835s

OK, so we're talking ~10 minutes to gzip a 100GB file on my local SSD. It could be faster in the cloud, but for our use case some folks want the data locally and some want to use it in the cloud, so it's at least a reasonable first test.

So… what do you think: is the encryption going to take a lot longer than that?

GPG RSA Encryption

Let’s see! Here we’ll encrypt the same data with RSA.

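The invocation is gpg's standard public-key encryption to a recipient's key, along these lines (using our s3s2 test key as the recipient):

# Encrypt to the recipient's public key (gpg picks a symmetric session key under the hood)
time gpg -e -r mkonda@jemurai.com 100g_test_file.txt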

It turned out to be pretty much the same: about 10 minutes, using RSA with a 4096-bit key.

Cool. OK, well that’s good. I wouldn’t want the encryption to be much slower than that or this whole approach is going to be unusable.

Let’s check out the decryption.

om:s3s2 mk$ time gpg -d 100g_test_file.txt.gpg
You need a passphrase to unlock the secret key for
user: "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
4096-bit RSA key, ID A497AEC4, created 2019-04-11 (main key ID 5664D905)
gpg: encrypted with 4096-bit RSA key, ID A497AEC4, created 2019-04-11
      "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
real    206m31.857s
user    49m31.655s
sys     43m35.630s

Oh no! 200+ MINUTES!? As my kids would say: WHAT ON PLANET EARTH!? That is definitely unusable.

OK. Step back. Let's think here. With encryption, people usually use symmetric algorithms like AES for the bulk data after exchanging the key with asymmetric ones like RSA, because symmetric ciphers are much faster. Cool. Let's see what that looks like then.
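
To make that pattern concrete, here's a minimal sketch of the hybrid approach using openssl (file names and the recipient key recipient_rsa_pub.pem are illustrative); gpg's public-key mode does essentially the same thing internally with a session key:

# Generate a random 256-bit session key
openssl rand -hex 32 > session.key
# Encrypt the large file symmetrically with AES-256 using that key
openssl enc -aes-256-cbc -pbkdf2 -in 100g_test_file.txt -out 100g_test_file.txt.enc -pass file:session.key
# Encrypt the small session key with the recipient's RSA public key
openssl pkeyutl -encrypt -pubin -inkey recipient_rsa_pub.pem -in session.key -out session.key.enc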

Symmetric Encryption

Let’s try encrypting that 100GB file with AES-256, which is presumably very fast because processors have been optimized to do it efficiently.

om:s3s2 mk$ time gpg --cipher-algo AES256 -c 100g_test_file.txt
real    14m36.886s
user    12m35.381s
sys     1m14.609s

14+ minutes is longer than gzip or the RSA encryption, but not terrible. What does it look like going the other way? My guess would have been about the same as on the way in. That just shows how much I know…

om:s3s2 mk$ time gpg --output 100g_test_file.aesdecrypted.txt --decrypt 100g_test_file.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
real    49m34.661s
user    44m6.992s
sys     3m32.160s

50 minutes! That’s more than 3x the time to encrypt and too long to be practically useful.

Compression

Per a colleague's suggestion, we wondered whether compression was slowing us down: gpg compresses data before encrypting it by default, so we tested with that turned off.

om:s3s2 mk$ time gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt
real    23m28.978s
user    18m29.866s
sys     3m22.856s

om:s3s2 mk$ time gpg -d 100g_test_file.nocompress.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
real    205m4.581s
user    65m6.715s
sys     47m45.610s

Everything got slower, so that was a wrong turn. In hindsight it makes sense: our zero-filled test file compresses down to almost nothing, so with compression on, gpg has far less data to actually encrypt and decrypt.

It is possible that using a library that better leverages processor support, such as AES-NI, would significantly improve the result. We're not sure how portable that would be, so we're setting it aside for now.
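
A quick way to see whether hardware acceleration matters on a given machine is openssl's built-in benchmark: the -evp path uses AES-NI when the CPU supports it, while the plain path uses the software implementation. We haven't benchmarked this for the post; it's just a sanity check:

# Software AES vs. the EVP path (which uses AES-NI where available)
openssl speed aes-256-cbc
openssl speed -evp aes-256-cbc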

Other Approaches

We tried 7z with AES encryption applied in the archiving stream… but creating the archive was also just slower.

om:s3s2 mk$ time 7za a -p 100g_test_file.txt.7z 100g_test_file.txt
...
Files read from disk: 1
Archive size: 15768178 bytes (16 MiB)
Everything is Ok
real    43m48.769s
user    102m6.180s
sys     4m51.758s

Decompressing and decrypting:

om:s3s2 mk$ time 7z x 100g_test_file.txt.7z -p
...
Size:       107374182400
Compressed: 15768178
real    7m29.579s
user    5m17.954s
sys     1m11.559s

Other Archiving: Zstd

Kudos to @runako who pointed me to the zstd library.

om:s3s2 mk$ time zstd -o 100g_test_file.txt.zstd 100g_test_file.txt
100g_test_file.txt   :  0.01%   (107374182400 => 9817620 bytes, 100g_test_file.txt.zstd)
real    1m22.134s
user    0m55.896s
sys     1m5.896s

Woah - that’s fast! I guess sometimes the new tricks are worth learning. If we can apply really good compression this fast, it will also make the encryption step much faster.
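
For example, the two steps can be chained so that only the (much smaller) compressed stream gets encrypted. A sketch we haven't timed yet, with gpg's own compression turned off since zstd has already done that work:

# Compress on all cores, then symmetrically encrypt the compressed stream
zstd -T0 -c 100g_test_file.txt | gpg --symmetric --cipher-algo AES256 --compress-algo none -o 100g_test_file.txt.zst.gpg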

Conclusion

I don’t know as much about encryption or compression as I thought. 😀

Dealing with very large files is still a legitimate and relevant challenge in the day-to-day work of data scientists handling huge data sets. How we approach the problem will have real consequences for how effectively we can handle that data.

Having a solution that pulls some of the effective practices together was a bigger challenge than we thought it might be, but it's bearing fruit with our S3S2 project. By using zstd with gpg and S3, we've settled on what is, at least so far, the fastest way we've found to safely work on and share lots of very large files.
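
Concretely, the end-to-end flow looks roughly like this (the bucket name is illustrative, and S3S2 aims to automate these steps):

# Sender: compress, encrypt, upload
zstd -T0 100g_test_file.txt
gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt.zst
aws s3 cp 100g_test_file.txt.zst.gpg s3://example-shared-bucket/

# Receiver: download, decrypt, decompress
aws s3 cp s3://example-shared-bucket/100g_test_file.txt.zst.gpg .
gpg -o 100g_test_file.txt.zst -d 100g_test_file.txt.zst.gpg
zstd -d 100g_test_file.txt.zst

Whether the encryption uses a shared passphrase (-c) or a recipient's public key (-e -r) depends on how the parties exchange keys.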

We’d love to hear your input.
