Monday, January 21, 2013

Mega and Encrypted Cloud Deduplication

The new "Mega" service has crypto flaws, but "deduplication" isn't necessarily one of them.

The idea behind Mega is that the CLIENT encrypts the files before uploading them. Therefore, the FBI cannot come after Mega, because the SERVER never has the keys to decrypt the files, and hence, doesn't know their content.

In the few days since its release, people have pointed out numerous problems with Mega, such as cross-site scripting (XSS), weak random number generation, and sending password hashes via email, among others. Thus, at the moment, you cannot trust Mega. But hopefully they'll fix these problems soon.

Deduplication

One issue that people are talking about is "deduplication" (storing only one copy if multiple users upload the same file). They think it's impossible without the server knowing how to decrypt the file. It's actually quite possible. I'll suggest a naive method here:

Step #1: Hash the original file. Then, use this hash as the AES-KEY to encrypt the file.
Step #2: Hash the encrypted file. Then, use this hash as the FILENAME of the file on the server.

If two people upload a file with the same FILENAME (derived in step 2 by hashing the encrypted file), then we know it's a duplicate, even though we don't know how to decrypt it. Only the client knows the AES-KEY, the server never does, so the server can never decrypt it, even though it can detect duplicates.
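Here is a minimal sketch of the two steps in Python. It makes some assumptions that are not in the scheme above: SHA-256 is used as the hash, and because Python's standard library has no AES, a toy SHA-256 keystream cipher stands in for AES. The function names (`keystream`, `xor_cipher`, `convergent_encrypt`) are made up for illustration, and the stand-in cipher is not something to use for real data.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key || counter). Stand-in for AES, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypts or decrypts: XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def convergent_encrypt(plaintext: bytes):
    # Step 1: the hash of the original file becomes the AES-KEY.
    aes_key = hashlib.sha256(plaintext).digest()
    ciphertext = xor_cipher(aes_key, plaintext)
    # Step 2: the hash of the encrypted file becomes the FILENAME.
    filename = hashlib.sha256(ciphertext).hexdigest()
    return aes_key, filename, ciphertext

# Two users encrypting the same bytes independently get the same FILENAME.
alice_key, alice_name, alice_blob = convergent_encrypt(b"same movie bytes")
bob_key, bob_name, bob_blob = convergent_encrypt(b"same movie bytes")
assert alice_name == bob_name
# Only someone holding the AES-KEY can recover the original.
assert xor_cipher(alice_key, alice_blob) == b"same movie bytes"
```

The server only ever sees `filename` and `ciphertext`; the key stays on the client.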

FBI-MPAA detection

In order to save on bandwidth, the client can first send the FILENAME to the server to see if it exists, and if it does, then not upload it again. The problem with this approach is that the FBI or MPAA could do the same thing in order to easily find copyrighted content. This will cause them to flood Mega with millions of hashes, trolling for content, and in the end, probably use more bandwidth than it saves.

Therefore, Mega shouldn't leak information. It should force the client to upload the file every time, and not reveal to the client whether the file was a duplicate or not.

Update: So couldn't the FBI serve a warrant on Mega in order to find out the owners of a particular file? Yes, they could. This is encryption but not anonymization.
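As a rough sketch of what "deduplicate storage but don't leak" could look like on the server side, assuming a hypothetical in-memory `BlobStore` class (nothing here reflects how Mega actually stores files): the client always uploads the full ciphertext, and the reply is the same whether or not the blob was already there.

```python
import hashlib

class BlobStore:
    """Hypothetical server-side store, for illustration only."""

    def __init__(self):
        self.blobs = {}   # FILENAME -> ciphertext, kept once per distinct blob
        self.owners = {}  # FILENAME -> set of users who uploaded it

    def upload(self, user: str, ciphertext: bytes) -> str:
        # The server computes the FILENAME itself from the uploaded bytes.
        filename = hashlib.sha256(ciphertext).hexdigest()
        # Store the blob only if it is new; a duplicate costs no extra space.
        self.blobs.setdefault(filename, ciphertext)
        self.owners.setdefault(filename, set()).add(user)
        # Same return value for new and duplicate uploads: nothing is revealed.
        return filename

store = BlobStore()
first = store.upload("alice", b"encrypted bytes")
second = store.upload("malory", b"encrypted bytes")
assert first == second and len(store.blobs) == 1
```

Note that the `owners` table is exactly what a warrant would be after: the server can't decrypt the blob, but it does know who uploaded it.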


Password changes

BTW, the user encrypts the AES-KEY with their password. Thus, the user can change their password without having to re-encrypt all the files. I point this out because it's often described how the user "encrypts files with their password". That simple phrase ignores the intermediate step. What actually happens is "encrypts AES-KEY with their password, encrypts files with AES-KEY".
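A small sketch of that intermediate step, under some assumptions of my own: the password is stretched with PBKDF2 (from Python's `hashlib`), and the AES-KEY is "encrypted" by XORing it with the derived bytes. The XOR wrap is a toy stand-in (a real client would use a proper key-wrapping mode, and XOR with one derived key is unsafe for wrapping more than one AES-KEY). The names `derive_kek`, `wrap_key`, and `change_password` are hypothetical.

```python
import hashlib
import os

def derive_kek(password: str, salt: bytes) -> bytes:
    """Stretch the password into a 32-byte key-encryption key."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def wrap_key(aes_key: bytes, password: str, salt: bytes) -> bytes:
    """Toy wrap: XOR the 32-byte AES-KEY with the password-derived key."""
    kek = derive_kek(password, salt)
    return bytes(a ^ b for a, b in zip(aes_key, kek))

unwrap_key = wrap_key  # XOR is its own inverse

def change_password(wrapped, old_pw, old_salt, new_pw, new_salt):
    """Re-wrap the AES-KEY under a new password; no file is touched."""
    aes_key = unwrap_key(wrapped, old_pw, old_salt)
    return wrap_key(aes_key, new_pw, new_salt)

aes_key = os.urandom(32)
salt, new_salt = os.urandom(16), os.urandom(16)
wrapped = wrap_key(aes_key, "old password", salt)
rewrapped = change_password(wrapped, "old password", salt, "new password", new_salt)
assert unwrap_key(rewrapped, "new password", new_salt) == aes_key
```

Only the small wrapped key changes on a password change; every file encrypted under the AES-KEY stays as it is.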

File sharing

What happens if one user wants to share the file with another user? This works with standard public/private key cryptography, such as with the RSA algorithm. The recipient publishes their PUBLIC-RSA key. The sender then uses this to encrypt the AES-KEY generated above, and then sends this to the recipient, who uses their PRIVATE-RSA key to decrypt it.
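To make the exchange concrete, here is a toy sketch using textbook RSA with Python's built-in `pow` (Python 3.8+). Everything about the key is an assumption for illustration: two well-known Mersenne primes stand in for randomly generated ones, there is no padding, and the helper names are hypothetical. Real RSA needs random primes and a padding scheme such as OAEP, so treat this only as a picture of who holds which key.

```python
# Recipient's key pair. Fixed, publicly known primes: insecure, demo only.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                          # part of the PUBLIC-RSA key
E = 65537                          # part of the PUBLIC-RSA key
D = pow(E, -1, (P - 1) * (Q - 1))  # PRIVATE-RSA exponent, kept by the recipient

def rsa_encrypt_key(aes_key: bytes, n: int, e: int) -> int:
    """Sender: encrypt the AES-KEY with the recipient's public key."""
    return pow(int.from_bytes(aes_key, "big"), e, n)

def rsa_decrypt_key(wrapped: int, n: int, d: int, length: int = 32) -> bytes:
    """Recipient: recover the AES-KEY with the private exponent."""
    return pow(wrapped, d, n).to_bytes(length, "big")

aes_key = bytes(range(32))                # the sender's 32-byte AES-KEY
wrapped = rsa_encrypt_key(aes_key, N, E)  # this is what gets sent
assert rsa_decrypt_key(wrapped, N, D) == aes_key
```

The file itself never moves or gets re-encrypted; the sender only passes along the wrapped AES-KEY (plus the FILENAME so the recipient knows what to fetch).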

File publication

Alternatively, a user might publish the file for everyone to read. In this case, they can simply publish the AES-KEY and FILENAME, and let anybody access the file. The problem with this is that movie studios can then grab the file, too, and send Mega a takedown notice.

Conclusion

These are all my naive algorithms. I came up with the deduplication algorithm in 30 seconds in response to an e-mail asking how it could be possible. In addition, I skip steps. I doubt these are precisely how Mega implements their stuff, or that these are the optimal algorithms.

My point is this: Mega's features are possible cryptographically. Their problem is implementation, not theory.



Update: Bah! Almost immediately after posting this, people have pointed out that the algorithm is called "convergent encryption", and is widely known and used, including in well-known projects such as Tahoe-LAFS.

6 comments:

Anonymous said...

Excuse me please, I might be misunderstanding something, but as far as I understand, for deduplication to work with encryption, one would need same files to result in same ciphertexts (convergent encryption is the term, I think)

So, let's say I'm Alice and I host, I don't know, X.mp3, which is pirated

So, when Malory the RIAA lawyer (who has a copy of X.mp3, by virtue of being a RIAA bloodhound ;-) ) comes to mega and says, "Alice is hosting infringing content", Mega actually won't be able to do the whack-a-mole "show us the download link and the decryption key" routine.

Because as soon as they do this, Malory will simply produce his copy of X.mp3, upload it to Mega, and their de-duplication routine will rat on me by demonstrating that ciphertexts resulting from my X.mp3 and Malory's X.mp3, well, converge

Am I correct ?

Robert Graham said...

Correct.

That's why deduplication should only be for storage, not bandwidth. Mega shouldn't provide any clue to the client that there is a duplication. Thus, Malory can't detect something as a duplicate on the server. It's possible they could create a list of millions of known compromised files (such as from Pirate Bay), but then this becomes a burden on Mega, because it must now parse millions of hashes while not knowing if any are on their servers.

Anonymous said...

Couldn't Malory use, well, the court system to gain access to the storage, or demand that Mega implement a preemptive "infringing content filter" at storage level (with a large set of "template infringing plaintexts" supplied by RIAA or whatever) ?

P.S.: Unrelated but...
Just saw a pic with a number in your ReCaptcha!
Cool! First time I saw those...

Christopher said...

Anonymity is taken care of simply by using a TOR-enabled browser to upload your files and to sign up with the e-mail service you use to sign up for the account.

Andrew said...

The thing is, if they CAN screen content (even if at storage level) and CAN be pressured into "rounding up and shooting" all files of a particular type, then what's the point of all this ?

If I have to both 1) encrypt with a third party app 2) hide behind tor, then I could use Dropbox. Or Bitcassa.

The only way Mega is different from those is that it offers plugin-less in-browser encryption (rather questionable one at that), and that is just a marketing gimmick.

A dangerous marketing gimmick (http://www.matasano.com/articles/javascript-cryptography/)

Anonymous said...

You may want to read some docs on deduplication, deduplicating on blocks can have a big impact on storage even though everything is encrypted.