This article describes the end-to-end encryption used for Telegram voice and video calls.
Before a call is ready, some preliminary actions have to be performed. The calling party needs to contact the party to be called and check whether it is ready to accept the call. Besides that, the parties have to negotiate the protocols to be used, learn the IP addresses of each other or of the Telegram relay servers to be used (so-called reflectors), and generate a one-time encryption key for this voice call with the aid of Diffie--Hellman key exchange. All of this is accomplished in parallel with the aid of several Telegram API methods and related notifications. This document covers details related to key generation, encryption and security.
The Diffie-Hellman key exchange, as well as the whole protocol used to create a new voice call, is quite similar to the one used for Secret Chats. We recommend studying the linked article before proceeding.
However, we have introduced some important changes to facilitate the key verification process. Below is the entire exchange between the two communicating parties, the Caller (A) and the Callee (B), through the Telegram servers (S).
g_a_hash:bytes
, among others. For this call, this field is to be filled with g_a_hash, not g_a itself.g_a:bytes
and key_fingerprint:long
.key_fingerprint:long
received from the other side, as an implementation sanity check.At this point, the Diffie--Hellman key exchange is complete, and both parties have a 256-byte shared secret key key which is used to encrypt all further exchanges between A and B.
It is of paramount importance to accept each update only once for each instance of the key generation protocol, discarding any duplicates or alternative versions of already received and processed messages (updates).
This document describes encryption in voice and video calls as implemented in Telegram apps with versions 7.0 and above. See this document for details on encryption used in voice calls in app versions released before August 14, 2020.
The Telegram Voice and Video Call Library uses an optimized version of MTProto 2.0 to send and receive packets, consisting of one or more end-to-end encrypted messages of various types (ice candidates list, video formats, remote video status, audio stream data, video stream data, message ack or empty).
This document describes only the encryption process, leaving out encoding and network-dependent parts.
The library starts working with:
key
shared between the parties, as generated above.Both data transfer channels are unreliable (messages may get lost), but signaling is slower and more reliable.
The body of a packet (decrypted_body
) consists of several messages and their respective seq
numbers concatenated together.
Each decrypted_body
is unique because no two seq
numbers of the first message can be the same. If only old messages need to be re-sent, an empty message with new unique seq
is added to the packet first.
The encryption key key
is used to compute a 128-bit msg_key
and then a 256-bit aes_key
and a 128-bit aes_iv
:
x
depends on whether the call is outgoing or incoming and on the connection type:
This allows apps to decide which packet types will be sent to which connections and work in these connections independently (with each having its own seq
counter).
The resulting aes_key
and aes_iv
are used to encrypt decrypted_body
:
The packet that gets sent consists of msg_key
and encrypted_body
:
When received, the packet gets decrypted using key
and msg_key
, after which msg_key
is checked against the relevant SHA256
substring. If the check fails, the packet must be discarded.
Each of the peers maintains its own 32-bit monotonically increasing counter for outgoing messages, seq
, starting with 1
. This seq
counter is prepended to each sent message and increased by 1
for each new message. No two seq
numbers of the first message in a packet can be the same. If only old messages need to be re-sent, an empty message with a new unique seq
is added to the packet first. When the seq
counter reaches 2^30
, the call must be aborted. Each peer stores seq
values of all the messages it has received (and processed) which are larger than max_received_seq - 64
, where max_received_seq
is the largest seq
number received so far.
If a packet is received, the first message of which has a seq
that is smaller or equal to max_received_seq - 64
or its seq
had already been received, the message is discarded. Otherwise, the seq
values of all incoming messages are memorized and max_received_seq
is adjusted. This guarantees that no two packets will be processed twice.
To verify the key, and ensure that no MITM attack is taking place, both parties concatenate the secret key key with the value g_a of the Caller ( A ), compute SHA256 and use it to generate a sequence of emoticons. More precisely, the SHA256 hash is split into four 64-bit integers; each of them is divided by the total number of emoticons used (currently 333), and the remainder is used to select specific emoticons. The specifics of the protocol guarantee that comparing four emoticons out of a set of 333 is sufficient to prevent eavesdropping (MiTM attack on DH) with a probability of 0.9999999999.
This is because instead of the standard Diffie-Hellman key exchange which requires only two messages between the parties:
we use a three-message modification thereof that works well when both parties are online (which also happens to be a requirement for voice calls):
The idea here is that A commits to a specific value of a (and of g_a) without disclosing it to B. B has to choose its value of b and g_b without knowing the true value of g_a, so that it cannot try different values of b to force the final key (g_a)^b to have any specific properties (such as fixed lower 32 bits of SHA256(key)). At this point, B commits to a specific value of g_b without knowing g_a. Then A has to send its value g_a; it cannot change it even though it knows g_b now, because the other party B would accept only a value of g_a that has a hash specified in the very first message of the exchange.
If some impostor is pretending to be either A or B and tries to perform a Man-in-the-Middle Attack on this Diffie--Hellman key exchange, the above still holds. Party A will generate a shared key with B -- or whoever pretends to be B -- without having a second chance to change its exponent a depending on the value g_b received from the other side; and the impostor will not have a chance to adapt his value of b depending on g_a, because it has to commit to a value of g_b before learning g_a. The same is valid for the key generation between the impostor and the party B.
The use of hash commitment in the DH exchange constrains the attacker to only one guess to generate the correct visualization in their attack, which means that using just over 33 bits of entropy represented by four emoji in the visualization is enough to make a successful attack highly improbable.
For a slightly more user-friendly explanation of the above see: How are calls authenticated?