MSC4016: Streaming E2EE file transfers with random access #4016
@@ -0,0 +1,266 @@ | ||
# MSC4016: Streaming E2EE file transfer with random access and zero latency | ||
|
||
## Problem | ||
|
||
* File transfers currently take twice as long as they could, as they must first be uploaded in their entirety to the | ||
sender’s server before being downloaded via the receiver’s server. | ||
* As a result, relative to a dedicated file-copying system (e.g. scp) they feel sluggish. For instance, you can’t | ||
incrementally view a progressive JPEG or voice or video file as it’s being uploaded for “zero latency” file | ||
transfers. | ||
* You can’t skip within them without downloading the whole thing (if they’re streamable content, such as an .opus file) | ||
* For instance, you can’t do realtime broadcast of voice messages via Matrix, or skip within them (other than splitting | ||
them into a series of separate file transfers). | ||
* Another example is sharing document snapshots for real-time collaboration. If a user uploads 100MB of glTF in Third | ||
Room to edit a scene, you want all participants to be able to receive the data and stream-decode it with minimal | ||
latency. | ||
|
||
Closes [https://github.com/matrix-org/matrix-spec/issues/432](https://github.com/matrix-org/matrix-spec/issues/432) | ||
|
||
N.B. this MSC is *not* needed to do a streaming decryption or encryption of E2EE files (as opposed to streaming | ||
transfer). The current APIs let you stream a download of AES-CTR data and incrementally decrypt it without loading the | ||
whole thing into RAM, calculating the hash as you go, and then either surfacing or deleting the decrypted result at the | ||
end if the hash matches. | ||
|
||
Relatedly, v2 MXC attachments can't be stream-transferred, even if combined with
[MSC2246](https://github.com/matrix-org/matrix-spec-proposals/pull/2246), given you won't be able to send the hash in the event
contents until you've uploaded the media.
Review comment on lines +25 to +27:

> What is a v2 MXC attachment?

> EncryptedFile with
||
|
||
## Solution sketch | ||
|
||
* Upload content in a single file made up of contiguous blocks of AES-GCM content. | ||
|
||
* Typically constant block size (e.g. 32KB) | ||
* Or variable block size (to allow time-based block sizes for low-latency seeking in streamable content) - e.g. one
block per opus frame. Otherwise a 32KB block ends up being 8s of typical opus latency.
* This would then require a registration sequence to identify block boundaries when seeking
randomly (potentially escaping the bitstream to avoid registration code collisions).
* Unlike today’s AES-CTR attachments, AES-GCM makes the content self-authenticating, in that it includes an | ||
authentication tag (AEAD) to hash the contents and protect against substitution attacks (i.e. where an attacker flips | ||
some bits in the encrypted payload to strategically corrupt the plaintext, and nobody notices as the content isn’t | ||
hashed). | ||
* (The only reason Matrix currently uses AES-CTR is that native AES-GCM primitives weren’t widespread enough on | ||
Android back in 2016) | ||
* To protect against reordering attacks, each AES-GCM block has to include an encrypted block header which includes a
sequence number, so we can be sure that when we request block N, we’re actually getting block N back - or
equivalent.
* XXX: is there still a vulnerability here? Other approaches use Merkle trees to hash the AEADs rather than simple | ||
sequence numbers, but why? | ||
* We then use normal [HTTP Range](https://datatracker.ietf.org/doc/html/rfc2616#section-14.35.1) headers to seek while | ||
downloading | ||
* We could also use [Youtube-style](https://developers.google.com/youtube/v3/guides/using_resumable_upload_protocol)
off-standard Content-Range headers on POST when uploading for resumable/incremental uploads.
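
As a rough check on the block-size trade-off in the list above, the sketch below computes how much audio one block holds at a constant bitrate. It assumes a 32 kbps stream (the figure this proposal uses for Opus in its detailed section) and treats 1KB as 1,000 bytes; the function name is illustrative only.

```python
def block_latency_seconds(block_bytes: int, bitrate_bps: int) -> float:
    """Seconds of constant-bitrate audio needed to fill one block."""
    return block_bytes * 8 / bitrate_bps


# A 32KB block at 32 kbps holds about 8 seconds of audio.
print(block_latency_seconds(32_000, 32_000))  # 8.0

# A 1KB block at 32 kbps holds about 250 ms of audio.
print(block_latency_seconds(1_000, 32_000))   # 0.25
```

Smaller blocks lower the latency but raise the relative cost of per-block headers and authentication tags.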
|
||
## Advantages | ||
|
||
* Backwards compatible with current implementations at the HTTP layer | ||
* Fully backwards compatible for unencrypted transfers | ||
* Relatively minor changes needed from AES-CTR to sequence-of-AES-GCM-blocks for implementations like | ||
[https://github.com/matrix-org/matrix-encrypt-attachment](https://github.com/matrix-org/matrix-encrypt-attachment) | ||
* We automatically maintain a serverside E2EE store of the file as normal, while also getting 1:many streaming | ||
semantics | ||
* Provides streaming transfer for any file type - not just media formats | ||
* Minimises memory usage in Matrix clients for large file transfers. Currently all(?) client implementations store the | ||
whole file in RAM in order to check hashes and then decrypt, whereas this would naturally lend itself to processing | ||
files incrementally in blocks. | ||
* Leverages AES-GCM’s existing primitives and hashing rather than inventing our own hashing strategy | ||
* We already had Range/Content-Range resumable/seekable zero-latency HTTP transfer implemented and working excellently | ||
pre-E2EE and pre-Matrix in our ‘glow’ codebase. | ||
* Random access could enable torrent-like semantics in future (i.e. servers doing parallel downloads of different chunks | ||
from different servers, with appropriate coordination) | ||
|
||
## Limitations | ||
Review comment:

> Another limitation here is that the custom file format means that interop with other E2EE systems (e.g. Android Messages) becomes significantly harder and less likely to work out of the box. At least using a preexisting format like AES-CTR means that interop is easier rather than trying to persuade WhatsApp or Signal or whoever to adopt an entirely novel format like MSC4016.
||
|
||
* Enterprisey features like content scanning and CDGs require visibility on the whole file, so would eliminate the
advantages of streaming by having to buffer it up in order to scan it. (Clientside scanners would benefit from
file transfer latency halving but wouldn't be able to show mid-transfer files)
* When applied to unencrypted files, server-side content scanning (for trust & safety etc) would be unable to scan until | ||
it’s too late. | ||
* Cancelled file uploads will still leak a partial file transfer to receivers who start to stream, which could be | ||
awkward if the sender sent something sensitive, and then can’t tell who downloaded what before they hit the cancel | ||
button | ||
* Small bandwidth overhead for the additional AEADs and block headers - with the framing proposed below, a 16-byte
header plus the AES-GCM authentication tag (typically 16 bytes) per block.
* Out of the box it wouldn't be able to adapt streaming to network conditions (no HLS or DASH style support for multiple | ||
bitstreams) | ||
* Might not play nice with CDNs? (I haven't checked if they pass through Range headers properly) | ||
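
To put the bandwidth overhead in the list above into numbers, here is a minimal sketch. It assumes the 16-byte block header described in the detailed proposal and a full-length 16-byte AES-GCM tag; the tag length is an assumption, as the proposal does not fix it.

```python
HEADER_BYTES = 16   # four 32-bit header fields per block
GCM_TAG_BYTES = 16  # assumption: full-length 128-bit authentication tag


def overhead_fraction(plaintext_block_bytes: int) -> float:
    """Framing overhead relative to the plaintext carried in one block."""
    return (HEADER_BYTES + GCM_TAG_BYTES) / plaintext_block_bytes


# 32 KiB blocks: roughly 0.1% overhead.
print(f"{overhead_fraction(32 * 1024):.4%}")

# 1 KiB blocks (low-latency audio): roughly 3% overhead.
print(f"{overhead_fraction(1024):.4%}")
```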
|
||
## Detailed proposal | ||
|
||
The file is uploaded asynchronously using [MSC2246](https://github.com/matrix-org/matrix-spec-proposals/pull/2246). | ||
|
||
The encrypted file block looks like: | ||
|
||
```json5
"file": {
    "v": "org.matrix.msc4016.v3",
    "key": {
        "alg": "A256GCM",
        "ext": true,
        "k": "cngOuL8OH0W7lxseExjxUyBOavJlomA7N0n1a3RxSUA",
        "key_ops": [
            "encrypt",
            "decrypt"
        ],
        "kty": "oct"
    },
    "iv": "HVTXIOuVEax4E+TB", // 96-bit base-64 encoded initialisation vector
    "url": "mxc://example.com/raAZzpGSeMjpAYfVdTrQILBI",
},
```
|
||
N.B. there is no longer a `hashes` key, as AES-GCM includes its own hashing to enforce the integrity of the file | ||
transfer. Therefore we can authenticate the transfer by the fact we can decrypt it using its key & IV (unless an | ||
attacker who controls the same key & IV has substituted it for another file - but the benefit of doing so is | ||
questionable). | ||
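
A receiving client would presumably sanity-check this structure before fetching any media. The sketch below is one way to do that, assuming the JWK `k` value is unpadded base64url (as is usual for JWKs) and `iv` is unpadded standard base64; `check_file_block` and its error messages are hypothetical names, not part of the proposal.

```python
import base64
from typing import Tuple


def _b64decode_unpadded(value: str, urlsafe: bool = False) -> bytes:
    """Decode base64 that may have had its '=' padding stripped."""
    padded = value + "=" * (-len(value) % 4)
    if urlsafe:
        return base64.urlsafe_b64decode(padded)
    return base64.b64decode(padded)


def check_file_block(file_block: dict) -> Tuple[bytes, bytes]:
    """Return (key, iv) after checking the version, algorithm and lengths."""
    if file_block["v"] != "org.matrix.msc4016.v3":
        raise ValueError("unsupported EncryptedFile version")
    jwk = file_block["key"]
    if jwk["alg"] != "A256GCM" or jwk["kty"] != "oct":
        raise ValueError("unexpected key algorithm or type")
    key = _b64decode_unpadded(jwk["k"], urlsafe=True)
    iv = _b64decode_unpadded(file_block["iv"])
    if len(key) != 32:
        raise ValueError("AES-256 key must be 32 bytes")
    if len(iv) != 12:
        raise ValueError("file IV must be 96 bits")
    return key, iv


example = {
    "v": "org.matrix.msc4016.v3",
    "key": {
        "alg": "A256GCM",
        "ext": True,
        "k": "cngOuL8OH0W7lxseExjxUyBOavJlomA7N0n1a3RxSUA",
        "key_ops": ["encrypt", "decrypt"],
        "kty": "oct",
    },
    "iv": "HVTXIOuVEax4E+TB",
    "url": "mxc://example.com/raAZzpGSeMjpAYfVdTrQILBI",
}
key, iv = check_file_block(example)
print(len(key), len(iv))  # 32 12
```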
|
||
We split the file stream into blocks of AES-256-GCM, with the following simple framing: | ||
|
||
* File header with a magic number of: 0x4D, 0x58, 0x43, 0x03 ("MXC" 0x03) - just so `file` can recognise it. | ||
* 1..N blocks, each with a header of: | ||
* a 32-bit field: 0xFFFFFFFF (a registration code to let a parser handle random access within the file)
* a 32-bit field: block sequence number (starting at zero, used to calculate the IV of the block, and to aid random | ||
access) | ||
* a 32-bit field: the length in bytes of the encrypted data in this block. | ||
* a 32-bit field: a CRC32 checksum of the prior data. This is used when randomly seeking as a consistency check to | ||
confirm that the registration code really did indicate the beginning of a valid frame of data. It is not used | ||
for cryptographic integrity. | ||
* the actual AES-GCM bitstream for that block. | ||
* the plaintext block size can be variable; 32KB is a good default for most purposes. | ||
* Audio streams may want to use a smaller block size (e.g. 1KB blocks for a CBR 32kbps Opus stream will give | ||
250ms of streaming latency). Audio streams should be CBR to avoid leaking audio waveform metadata via block | ||
size. | ||
* The block is encrypted using an IV formed by concatenating the block sequence number with
the IV from the `file` block (forming a 128-bit IV, which AES-GCM hashes internally with GHASH to derive its
initial counter block). This avoids IV reuse (at least until it wraps after 2^32-1 blocks, which at 32KB per block is
137TB (18 hours of 8k raw video), or at 1KB per block is 4TB (34 years of 32kbps audio)).
* Implementations MUST terminate a stream if the seqnum is exhausted, to prevent IV reuse.
* XXX: Alternatively, we could use a 64-bit seqnum, but spending 8 bytes of header on seqnums feels like a waste
of bandwidth just to support massive transfers. And we'd have to manually hash it with the 96-bit IV
rather than use the GCM implementation.
Review comment on lines +149 to +151:

> Do you really need a 96-bit "IV" for the file? I am really asking here -- this is a bit beyond the level of my expertise. (Really it's more of a nonce anyway, but the "iv" key is already there in the existing JSON structure so whatever...)

> 96-bit is the default for GCM - anything bigger gets hashed down internally to 96-bits. The proposed actual IV used is (i'm calling it an IV even though it's a nonce 'cos that's what webcrypto calls it too :)

> The problem is that the block ID's repeat for every file, right? So you don't ever want to have the main "IV" be repeated for two different files. With 96 bits you should be fine. The birthday bound says you can do up to about 2^48 files with that. The only question is whether you want to push your luck by allocating a few more bits to the block ID and a few less to the main IV/nonce. Like 80 / 48 instead of 96 / 32. Then you could do up to 2^40 files. Again, probably not worth bothering about it. 4 TB should be enough for anybody.
||
* The block is encrypted including the 32-bit block sequence number as Additional Authenticated Data, thus | ||
stopping encrypted blocks from impersonating each other. | ||
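
The per-block IV and Additional Authenticated Data described above could be derived as in the following sketch. The proposal does not specify byte order or which value comes first in the concatenation, so big-endian encoding and "sequence number first" are assumptions here, and the function names are illustrative.

```python
import struct

MAX_SEQNUM = 0xFFFFFFFF  # largest value of the 32-bit sequence number field


def block_iv(file_iv: bytes, seqnum: int) -> bytes:
    """Build the 128-bit per-block IV: seqnum (assumed big-endian) + file IV."""
    if len(file_iv) != 12:
        raise ValueError("file IV must be 96 bits")
    if not 0 <= seqnum <= MAX_SEQNUM:
        # The stream MUST be terminated rather than wrapping and reusing an IV.
        raise OverflowError("sequence number exhausted; terminate the stream")
    return struct.pack(">I", seqnum) + file_iv


def block_aad(seqnum: int) -> bytes:
    """Additional Authenticated Data: the 32-bit block sequence number."""
    return struct.pack(">I", seqnum)


file_iv = bytes(range(12))
print(block_iv(file_iv, 1).hex())  # 00000001000102030405060708090a0b
```

The resulting 16-byte IV and 4-byte AAD would then be passed to whichever AES-256-GCM implementation the client uses.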
|
||
Or graphically, each frame is: | ||
|
||
```
protocol "Registration Code (0xFFFFFFFF):32,Block sequence number:32,Encrypted block length:32,CRC32:32,AES-GCM encrypted Data:64"

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Registration Code (0xFFFFFFFF)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Block sequence number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Encrypted block length                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             CRC32                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    AES-GCM encrypted Data                     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
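
The 16-byte block header in the diagram can be packed and parsed as sketched below. Two details are assumptions rather than anything the proposal states: fields are encoded big-endian, and the CRC32 "of the prior data" is taken to cover the first 12 bytes of the header.

```python
import struct
import zlib
from typing import Tuple

REGISTRATION_CODE = 0xFFFFFFFF
HEADER_LEN = 16  # four 32-bit fields


def pack_header(seqnum: int, encrypted_len: int) -> bytes:
    """Serialise registration code, seqnum and length, then append their CRC32."""
    prefix = struct.pack(">III", REGISTRATION_CODE, seqnum, encrypted_len)
    crc = zlib.crc32(prefix) & 0xFFFFFFFF
    return prefix + struct.pack(">I", crc)


def parse_header(buf: bytes) -> Tuple[int, int]:
    """Return (seqnum, encrypted_len), or raise ValueError if the header is invalid."""
    if len(buf) < HEADER_LEN:
        raise ValueError("truncated block header")
    reg, seqnum, enc_len, crc = struct.unpack(">IIII", buf[:HEADER_LEN])
    if reg != REGISTRATION_CODE:
        raise ValueError("missing registration code")
    if zlib.crc32(buf[:12]) & 0xFFFFFFFF != crc:
        raise ValueError("header CRC mismatch")
    return seqnum, enc_len


header = pack_header(seqnum=7, encrypted_len=32768)
print(parse_header(header))  # (7, 32768)
```

The CRC only guards against mistaking payload bytes for a header when seeking; integrity of the content itself still rests on the AES-GCM tag.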
|
||
The actual file upload can then be streamed in the request body in the PUT (requires HTTP/2 in browsers). Similarly, the | ||
download can be streamed in the response body. The download should stream as rapidly as possible from the media | ||
server, letting the receiver view it incrementally as the upload happens, providing "zero-latency" - while also storing | ||
the stream to disk. | ||
Review comment on lines +178 to +181:

> There's a financial limitation here, at least for media servers using a CDN. The CDN is primarily intended to reduce bandwidth costs, but if the server is being asked to download a piece of media that hasn't finished uploading yet, then the CDN could cache a partial file. The media server is then forced to serve the partial file itself from storage, which may incur additional bandwidth fees. Especially so if the storage is network-operated as well.
>
> For small files (<100mb), the async upload endpoint is probably fine enough. There's no real need for zero latency file transfers because the files are already transferred pretty quickly. For larger files, it's more likely that the file doesn't need to be sent instantly between two parties as there is likely a delay in when the receiver even notices the file being uploaded. This affords the sender some time to finish the entire upload process.
>
> Or in short: the cost of bandwidth outweighs the cost of a "slow" download, imo.

> I agree that streaming file transfers don't play that nicely with CDNs (but might be okay - need to actually see how they interact). I don't think i follow the logic here, though: typical use cases I have in mind here are:
>
> Now, once any upload has completed, then normal CDN semantics can kick in. So, yes: it's possible CDNs won't be able to cache downloads which are still uploading. But I think the financial cost of supporting zero-transfer uploads could be worth it as a value-added feature for the edge-case where people download concurrently with the upload. And if the server admin doesn't want to risk that cost, they can simply turn it off on their media repo.

> I'm not 100% sure that I understand the use case for this. I mean, it's cool, but ...

> Yup, these are all fair points. The use case is more: "file transfers appear instantly to the recipient on the sender hitting send, making the app feel magically fast (a bit like instagram's hack of proactively uploading files to the server in the background while the user's still typing the caption)". So not only would the blurhash pop up instantly, but the 800x600 thumbnail would then replace it as immediately as possible, even on crap networks, quite aside from the full-res transfer. Now, totally agreed this is completely finetuning perf and UX, but (amazingly) we're pretty much at the point where this level of UX snappiness and polish is where the battle's at.
>
> For audio and video pseudo-streaming, we could absolutely send a series of M3U-esque playlist updates over instead, which is pretty much what MSC3888 does. However, this does feel a bit fiddly, and I'm not sure that the benefits of being able to format it as real M3U and pass it straight into an HLS player (complete with crappy unauthed CBC encryption) are really worth it. (I guess you could get support for variable bandwidth by including different stream resolutions, though - and you get the potential benefit of commitment hashes, as you mention below). In practice, simply being able to decrypt a single stream of data as per this MSC and pass it into an
>
> For file transfer: sure, you could do WebRTC, but having the server relay means that you a) can do one-to-many transfer efficiently, b) you can do resumable uploads to a single place, c) you can do resumable downloads from a single place, d) you don't need the sender & recipient online at the same time, e) you hopefully get CDN for free, f) you don't need to worry about TURN, g) your client doesn't need a webrtc stack. If plain old HTTP file transfers give you this for free already, why not use it?
>
> I'll plonk this all into the alternatives tho.
||
|
||
For resumable uploads (or to upload in blocks for HTTP clients which don't support streaming request bodies), the client | ||
can use `Content-Range` headers as per https://developers.google.com/youtube/v3/guides/using_resumable_upload_protocol#Resume_Upload. | ||
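
A client uploading in chunks needs to build a `Content-Range` value for each request. The helper below is a minimal sketch using the standard `bytes first-last/total` syntax, with `*` when the final size is not yet known (as in a live stream); the function name is hypothetical, and whether a given media server accepts `*` here is an assumption.

```python
from typing import Optional


def content_range(start: int, chunk_len: int, total: Optional[int]) -> str:
    """Build a Content-Range header value for one uploaded chunk."""
    if chunk_len <= 0:
        raise ValueError("chunk must not be empty")
    end = start + chunk_len - 1  # byte positions are inclusive
    size = "*" if total is None else str(total)
    return f"bytes {start}-{end}/{size}"


# First 1000 bytes of a 5000-byte file.
print(content_range(0, 1000, 5000))    # bytes 0-999/5000

# Next 500 bytes of a stream whose final length is unknown.
print(content_range(1000, 500, None))  # bytes 1000-1499/*
```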
|
||
TODO: the media API needs to advertise if it supports resumable uploads. | ||
|
||
For resumable downloads, we then use normal [HTTP Range](https://datatracker.ietf.org/doc/html/rfc2616#section-14.35.1) headers to seek and | ||
resume while downloading. | ||
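
When resuming or seeking to an arbitrary byte offset, the receiver will usually land mid-block and has to resynchronise on the next block header. The sketch below builds an open-ended `Range` header and scans a downloaded buffer for the registration code, using the header CRC32 as the consistency check. It assumes big-endian fields and a CRC over the first 12 header bytes; neither is specified by the proposal, and the function names are illustrative.

```python
import struct
import zlib

MARKER = b"\xff\xff\xff\xff"  # registration code
HEADER_LEN = 16


def range_header(offset: int) -> dict:
    """Request everything from `offset` to the end of the (possibly growing) file."""
    return {"Range": f"bytes={offset}-"}


def find_block_start(data: bytes) -> int:
    """Return the index of the first valid-looking block header, or -1."""
    pos = data.find(MARKER)
    while pos != -1:
        header = data[pos:pos + HEADER_LEN]
        if len(header) == HEADER_LEN:
            (crc,) = struct.unpack(">I", header[12:])
            # A matching CRC suggests this is a real header and not payload
            # bytes that happen to contain the marker.
            if zlib.crc32(header[:12]) & 0xFFFFFFFF == crc:
                return pos
        pos = data.find(MARKER, pos + 1)
    return -1


print(range_header(4096))  # {'Range': 'bytes=4096-'}
```

A match found this way is only a plausible header; the block should not be trusted until AES-GCM decryption with the expected sequence number succeeds.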
|
||
TODO: We need a way to mark a transfer as complete or cancelled (via a relation?). If cancelled, the sender should | ||
delete the partial upload (but the partial contents will have already leaked to the other side, of course). | ||
|
||
TODO: While we're at it, let's actually let users DELETE their file transfers, at last. | ||
Review comment:

> (this seems best as a dedicated MSC - feels too detached from streaming to include here)
||
|
||
## Alternatives | ||
|
||
* We could use an existing streaming encrypted framing format of some kind instead (SRTP perhaps, which would give us
timestamps for easier random access for audio/video streams) - but this feels a bit strange for plain old file
streams.
* Alternatively, we could descope random access entirely, given it only makes sense for AV streams, and requires | ||
timestamps to work nicely - and simply being able to stream encryption/decryption is a win in its own right. For | ||
instance, glow doesn't let you seek randomly within files which are mid transfer; only tail. | ||
* Split files into a series of separate m.file uploads which the client then has to glue back together (as the | ||
[voice broadcast feature](https://github.com/vector-im/element-meta/discussions/632) does in Element today). | ||
* Pros: | ||
* Works automatically with antivirus & CDGs | ||
* Could be made to map onto HLS or DASH? (by generating an .m3u8 which contains a bunch of MXC urls? This could | ||
also potentially solve the glitching problems we’ve had, by reusing existing HLS players augmented with our | ||
E2EE support) | ||
* Cons: | ||
* Is always going to be high latency (e.g. Element currently splits into ~30s chunks) given rate limits on | ||
sending file events | ||
* Can be a pain to glue media uploads back together without glitching | ||
* Transfer files via streaming P2P file transfer via WebRTC data channels | ||
(https://github.com/matrix-org/matrix-spec/issues/189) | ||
* Pros: | ||
* Easy to implement with Matrix’s existing WebRTC signalling | ||
* Could use MSC3898-inspired media control to seek in the stream | ||
* Cons: | ||
* You don’t get a serverside copy of the data | ||
* Hard for clients to implement relative to a simple HTTP download | ||
* You expose client IPs to each other if going P2P rather than via TURN | ||
* Do streaming voice/video messages/broadcast via WebRTC media channels instead | ||
* Pros: | ||
* Lowest latency | ||
* Could use media control to seek | ||
* Supports multiple senders | ||
* Works with CDGs and other enterprisey scanners which know how to scan VOIP payloads | ||
* Could automatically support variable streams via SFU to adapt to network conditions | ||
* If the SFU does E2EE and archiving, you get that for free. | ||
* Cons: | ||
* Complex; you can’t just download the file via HTTP | ||
* Requires client to have a WebRTC stack | ||
* A suitable SFU still doesn’t exist yet | ||
* Transfer files out of band using a protocol which already provides streaming transfers (e.g. IPFS?) | ||
* Could use tus.io as an almost-standard format for HTTP resumable uploads (PATCH + Upload-Offset headers) instead, | ||
although tus servers don't seem to stream. | ||
|
||
## Security considerations | ||
|
||
* Variable size blocks could leak metadata for VBR audio. Mitigation is to use CBR if you care about leaking voice | ||
traffic patterns (constant size blocks isn’t necessarily enough, as you’d still leak the traffic patterns) | ||
Review comment on lines +251 to +252:

> Two comments here:

> for audio i've been using a window size of 20ms, to keep it nice and low latency :) good point for VBR video
||
* Is encrypting a sequence number in block header (with authenticated encryption) sufficient to mitigate reordering | ||
attacks? | ||
* The resulting lack of atomicity on file transfer means that accidentally uploaded files may leak partial contents to | ||
other users, even if they're cancelled. | ||
* Clients may well wish to scan untrusted inbound file transfers for malware etc, which means buffering the inbound | ||
transfer and scanning it before presenting it to the user. | ||
* Removing the `hashes` entry on the EncryptedFile description means that an attacker who controls the key & IV of the | ||
original file transfer could strategically substitute the file contents. This could be desirable for CDGs wishing to | ||
switch a file for a sanitised version without breaking the Matrix event hashes. For other scenarios it could be | ||
undesirable. An alternative might be for the sender to keep sending new hashes in related matrix events as the | ||
stream uploads, but it's unclear if this is worth it. | ||
|
||
## Conclusion | ||
|
||
For the voice/video broadcast use case, it's a bit unclear whether this is actually an improvement over splitting files
into multiple file uploads, or over
[MSC3888](https://github.com/matrix-org/matrix-spec-proposals/blob/weeman1337/voice-broadcast/proposals/3888-voice-broadcast.md)
style behaviour. It's also unfortunate that the benefits of the MSC are reduced with content scanners and CDGs.
|
||
However, for halving the transfer time for large videos and files (and the magic "zero latency" of being able to see | ||
file transfers instantly start to download as they upload) it still feels like a worthwhile MSC. Switching to GCM is | ||
desirable too in terms of providing authenticated encryption and avoiding having to calculate out-of-band hashes for | ||
file transfer. Finally, implementing this MSC will force implementations to stream their file encryption/decryption | ||
and avoid the temptation to load the whole file into RAM (which doesn't scale, especially in constrained environments | ||
such as iOS Share Extensions). | ||
|
||
## Dependencies | ||
|
||
This MSC depends on [MSC2246](https://github.com/matrix-org/matrix-spec-proposals/pull/2246), which has now landed in
the spec. It extends [MSC3469](https://github.com/matrix-org/matrix-spec-proposals/pull/3469).
|
||
## Unstable prefixes | ||
|
||
| Unstable prefix | Stable prefix | | ||
| --------------------- | ------------------- | | ||
| org.matrix.msc4016.v3 | v3 | |
Review comments:

> (they can if your server supports `Range` requests - MMR supports this, all other homeservers don't)

> MMR doesn't support Range headers into downloads which are still being uploaded does it, though?

> No, but this point in the MSC implies that Range headers aren't supported anywhere.