_ _ _/B\_ _/W\_ (* *) Phrack #64 file 10 (* *) | - | | - | | | Cryptanalysis of DPA-128 | | | | | | | | By SysK | | | | | | | | syskall@phreaker.net | | (____________________________________________________) --[ Contents 1 - Introduction 2 - A short word about block ciphers 3 - Overview of block cipher cryptanalysis 4 - Veins' DPA-128 4.1 - Bugs in the implementation 4.2 - Weaknesses in the design 5 - Breaking the linearized version 6 - On the non linearity of addition modulo n in GF(2) 7 - Exploiting weak keys 7.1 - Playing with a toy cipher 7.2 - Generalization and expected complexity 7.3 - Cardinality of |W 8 - Breaking DPA based unkeyed hash function 8.1 - Introduction to hash functions 8.2 - DPAsum() algorithm 8.3 - Weaknesses in the design/implementation 8.4 - A (2nd) preimage attack 9 - Conclusion 10 - Greetings 11 - Bibliography --[ 1 - Introduction While the cracking scene has grown with cryptology thanks to the evolution of binary protection schemes, the hacking scene mostly hasn't. This fact is greatly justified by the fact that there were globally no real need. Indeed it's well known that if a hacker needs to decrypt some files then he will hack into the box of its owner, backdoor the system and then use it to steal the key. A cracker who needs to break a protection scheme will not have the same approach: he will usually try to understand it fully in order to find and exploit design and/or implementation flaws. Although the growing of the security industry those last years changed a little bit the situation regarding the hacking community, nowadays there are still too many people with weak knowledge of this science. What is disturbing is the broadcast of urban legends and other hoax by some paranoids among them. For example, haven't you ever heard people claiming that government agencies were able to break RSA or AES? A much more clever question would have been: what does "break" mean? A good example of paranoid reaction can be found in M1lt0n's article [FakeP63]. The author who is probably skilled in hacking promotes the use of "home made cryptographic algorithms" instead of standardized ones such as 3DES. The corresponding argument is that since most so-called security experts lake coding skills then they aren't able to develop appropriate tools for exotic ciphers. While I agree at least partially with him regarding the coding abilities, I can't possibly agree with the main thesis. Indeed if some public tools are sufficient to break a 3DES based protection then it means that a design and/or an implementation mistake was/were made since, according to the state of the art, 3DES is still unbroken. The cryptosystem was weak from the beginning and using "home made cryptography" would only weaken it more. It is therefore extremely important to understand cryptography and to trust the standards. In a previous Phrack issue (Phrack 62), Veins exposed to the hacking community a "home made" block cipher called DPA (Dynamic Polyalphabetic Algorithms) [DPA128]. In the following paper, we are going to analyze this cipher and demonstrate that it is not flawless - at least from a cryptanalytic perspective - thus fitting perfectly with our talk. --[ 2 - A short word about block ciphers Let's quote a little bit the excellent HAC [MenVan]: "A block cipher is a function which maps n-bit plaintext blocks to n-bit ciphertext blocks; n is called the blocklength. It may be viewed as a simple substitution cipher with large character size. The function is parametrized by a k-bit key K, taking values from a subset |K (the key space) of the set of all k-bit vectors Vk. It is generally assumed that the key is chosen at random. Use of plaintext and ciphertext blocks of equal size avoids data expansion." Pretty clear isn't it? :> So what's the purpose of such a cryptosystem? Obviously since we are dealing with encryption this class of algorithms provides confidentiality. Its construction makes it particularly suitable for applications such as large volumes encryption (files or HD for example). Used in special modes such as CBC (like in OpenSSL) then it can also provide stream encryption. For example, we use AES-CBC in the WPA2, SSL and SSH protocols. Remark: When used in conjunction with other mechanisms, block ciphers can also provide services such as authentication or integrity (cf part 8 of the paper). An important point is the understanding of the cryptology utility. While cryptography aims at designing best algorithms that is to say secure and fast, cryptanalysis allows the evaluation of the security of those algorithms. The more an algorithm is proved to have weaknesses, the less we should trust it. --[ 3 - Overview of block cipher cryptanalysis The cryptanalysis of block ciphers evolved significantly in the 90s with the apparition of some fundamental methods such as the differential [BiSha90] and the linear [Matsui92] cryptanalysis. In addition to some more recent ones like the boomerang attack of Wagner or the chi square cryptanalysis of Vaudenay [Vaud], they constitute the set of so-called statistical attacks on block ciphers in opposition to the very recent and still controverted algebraic ones (see [CourtAlg] for more information). Today the evolution of block cipher cryptanalysis tends to stabilize itself. However a cryptographer still has to acquire quite a deep knowledge of those attacks in order to design a cipher. Reading the Phrack paper, we think - actually we may be wrong - that the author mostly based his design on statistical tests. Although they are obviously necessary, they can't possibly be enough. Every component has to be carefully chosen. We identified several weaknesses and think that some more may still be left. --[ 4 - Veins' DPA-128 description DPA-128 is a 16 rounds block cipher providing 128 bits block encryption using an n bits key. Each round encryption is composed of 3 functions which are rbytechain(), rbitshift() and S_E(). Thus for each input block, we apply the E() function 16 times (one per round) : void E (unsigned char *key, unsigned char *block, unsigned int shift) { rbytechain (block); rbitshift (block, shift); S_E (key, block, shift); } where: - block is the 128b input - shift is a 32b parameter dependent of the round subkey - key is the 128b round subkey Consequently, the mathematical description of this cipher is: f: |P x |K ----> |C where: - |P is the set of all plaintexts - |K is the set of all keys - |C is the set of all ciphertexts For p element of |P, k of |K and c of |C, we have c = f(p,k) with f = EE...EE = E^16 and meaning the composition of functions. We are now going to describe each function. Since we sometimes may need mathematics to do so, we will assume that the reader is familiar with basic algebra ;> rbytechain() is described by the following C function: void rbytechain(unsigned char *block) { int i; for (i = 0; i < DPA_BLOCK_SIZE; ++i) block[i] ^= block[(i + 1) % DPA_BLOCK_SIZE]; return; } where: - block is the 128b input - DPA_BLOCK_SIZE equals 16 Such an operation on bytes is called linear mixing and its goal is to provide the diffusion of information (according to the well known Shannon theory). Mathematically, it's no more than a linear map between two GF(2) vector spaces of dimension 128. Indeed, if U and V are vectors over GF(2) representing respectively the input and the output of rbytechain() then V = M.U where M is a 128x128 matrix over GF(2) of the linear map where coefficients of the matrix are trivial to find. Now let's see rbitshift(). Its C version is: void rbitshift(unsigned char *block, unsigned int shift) { unsigned int i; unsigned int div; unsigned int mod; unsigned int rel; unsigned char mask; unsigned char remainder; unsigned char sblock[DPA_BLOCK_SIZE]; if (shift) { mask = 0; shift %= 128; div = shift / 8; mod = shift % 8; rel = DPA_BLOCK_SIZE - div; for (i = 0; i < mod; ++i) mask |= (1 << i); for (i = 0; i < DPA_BLOCK_SIZE; ++i) { remainder = ((block[(rel + i - 1) % DPA_BLOCK_SIZE]) & mask) << (8 - mod); sblock[i] = ((block[(rel + i) % DPA_BLOCK_SIZE]) >> mod) | remainder; } } memcpy(block, sblock, DPA_BLOCK_SIZE); } where: - block is the 128b input - DPA_BLOCK_SIZE equals 16 - shift is derived from the round subkey Veins describes it in his paper as a key-related shifting (in fact it has to be a key-related 'rotation' since we intend to be able to decrypt the ciphertext ;)). A careful read of the code and several tests confirmed that it was not erroneous (up to a bug detailed later in this paper), so we can describe it as a linear map between two GF(2) vector spaces of dimension 128. Indeed, if V and W are vectors over GF(2) representing respectively the input and the output of rbitshift() then: W = M'.V where M' is the 128x128 matrix over GF(2) of the linear map where, unlike the previous function, coefficients of the matrix are unknown up to a probability of 1/128 per round. Such a function also provides diffusion of information. Finally, the last operation S_E() is described by the C code: void S_E (unsigned char *key, unsigned char *block, unsigned int s) { int i; for (i = 0; i < DPA_BLOCK_SIZE; ++i) block[i] = (key[i] + block[i] + s) % 256; return; } where: - block is the 128b input - DPA_BLOCK_SIZE equals 16 - s is the shift parameter described in the previous function - key is the round subkey The main idea of veins' paper is the so-called "polyalphabetic substitution" concept, whose implementation is supposed to be the S_E() C function. Reading the code, it appears to be no more than a key mixing function over GF(2^8). Remark: We shall see later the importance of the mathematical operation know as 'addition' over GF(2^8). Regarding the key scheduling, each cipher round makes use of a 128b subkey as well as of a 32b one deriving from it called "shift". The following pseudo code describes this operation: skey(0) = checksum128(master_key) for i = 0, nbr_round-2: skey(i+1) = checksum128(skey(i)) skey(0) = skey(15) for i = 0, nbr_round-1: shift(nbr_round-1 - i) = hash32(skey(i)) where skey(i) is the i'th subkey. It is not necessary to explicit the checksum128() and hash32(), the reader just has to remind this thing: whatever the weakness there may be in those functions, we will now consider them being true oneway hash functions providing perfect entropy. As a conclusion, the studied cipher is closed to being a SPN (Substitution - Permutation Network) which is a very generic and well known construction (AES is one for example). --[ 4.1 - Bugs in the implementation Although veins himself honestly recognizes that the cipher may be weak and "strongly discourages its use" to quote him [DPA128], some people could nevertheless decide to use it as a primitive for encryption of personal and/or sensitive data as an alternative to 'already-cracked-by-NSA' ciphers [NSA2007]. Unfortunately for those theoretical people, we were able to identify a bug leading to a potentially incorrect functioning of the cryptosystem (with a non negligible probability). We saw earlier that the bitshift code skeleton was the following: /* bitshift.c */ void {r,l}bitshift(unsigned char *block, unsigned int shift) { [...] // SysK : local vars declaration unsigned char sblock[DPA_BLOCK_SIZE]; if (shift) { [...] // SysK : sblock initialization } memcpy(block, sblock, DPA_BLOCK_SIZE); } Clearly, if 'shift' is 0 then 'block' is fed with stack content! Obviously in such a case the cryptosystem can't possibly work. Since shift is an integer, such an event occurs with at least a theoretical probability of 1/2^32 per round. Now let's study the shift generation function: /* hash32.c */ /* * This function computes a 32 bits output out a variable length input. It is * not important to have a nice distribution and low collisions as it is used * on the output of checksum128() (see checksum128.c). There is a requirement * though, the function should not consider \0 as a key terminator. */ unsigned long hash32(unsigned char *k, unsigned int length) { unsigned long h; for (h = 0; *k && length; ++k, --length) h = 13 * h + *k; return (h); } As stated in the C code commentary, hash32() is the function which produces the shift. Although the author is careful and admits that the output distribution may not be completely uniform (not exactly equal probability for each byte value to appear) it is obvious that a strong bias is not desirable (Cf 7.3). However what happens if the first byte pointed by k is 0 ? Since the loop ends for k equal to 0, then h will be equal to 13 * 0 + 0 = 0. Assuming that the underlying subkey is truly random, such an event should occur with a probability of 1/256 (instead of 1/2^32). Since the output of hash32() is an integer as stated in the comment, this is clearly a bug. We could be tempted to think that this implementation failure leads to a weakness but a short look at the code tells us that: struct s_dpa_sub_key { unsigned char key[DPA_KEY_SIZE]; unsigned char shift; }; typedef struct s_dpa_sub_key DPA_SUB_KEY; Therefore since shift is a char object, the presence of "*k &&" in the code doesn't change the fact that the cryptosystem will fail with a probability of 1/256 per round. Since the bug may appear independently in each round, the probability of failure is even greater: p("fail") = 1 - p("ok") = 1 - Mul( p("ok in round i") ) = 1 - (255/256)^16 = 0.0607... where i is element of [0, (nbr_rounds - 1)] It's not too far from 1/16 :-) Remark: We shall see later that the special case where shift is equal to 0 is part of a general class of weak keys potentially allowing an attacker to break the cryptosystem. Hunting weaknesses and bugs in the implementation of cryptographic primitives is the common job of some reverse engineers since it sometimes allows to break implementations of algorithms which are believed to be theoretically secure. While those flaws mostly concern asymmetric primitives of digital signature or key negotiation/generation, it can also apply in some very specific cases to the block cipher world. From now, we will consider the annoying bug in bitshift() fixed. --[ 4.2 - Weaknesses in the design When designing a block cipher, a cryptographer has to be very careful about every details of the algorithm. In the following section, we describe several design mistakes and explain why in some cases, it can reduce the security of the cipher. a) We saw earlier that the E() function was applied to each round. However such a construction is not perfect regarding the first round. Since rbytechain() is a linear mixing operating not involving key material, it shouldn't be used as the first operation on the input buffer since its effect on it can be completely canceled. Therefore, if a cryptanalyst wants to attack the bitshift() component of the first round, he just have to apply lbytechain() (the rbytechain() inverse function) to the input vector. It would thus have been a good idea to put a key mixing as the first operation. b) The rbitshift() operation only need the 7 first bits of the shift character whereas the S_E() uses all of them. It is also generally considered a bad idea to use the same key material for several operations. c) If for some reason, the attacker is able to leak the second (not the first) subkey then it implies the compromising of all the key material. Of course the master key will remain unknown because of the onewayness of checksum128() however we do not need to recover it in order to encrypt and/or decrypt datas. d) In the bitshift() function, a loop is particularly interesting: for (i = 0; i < mod; ++i) mask |= (1 << i); What is interesting is that the time execution of the loop is dependent of "mod" which is derived from the shift. Therefore we conclude that this loop probably allows a side channel attack against the cipher. Thanks to X for having pointed this out ;> In the computer security area, it's well known that a single tiny mistake can lead to the total compromising of an information system. In cryptography, the same rules apply. --[ 5 - Breaking the linearized version Even if we regret the non justification of addition operation employment, it is not the worst choice in itself. What would have happen if the key mixing had been done with a xor operation over GF(2^8) instead as it is the case in DES or AES for example? To measure the importance of algebraic consideration in the security of a block cipher, let's play a little bit with a linearized version of the cipher. That is to say that we replace the S_E() function with the following S_E2() where : void S_E2 (unsigned char *key, unsigned char *block, unsigned int s) { int i; for (i = 0; i < DPA_BLOCK_SIZE; ++i) block[i] = (key[i] ^ block[i] ^ s) % 256; [1] // + is replaced by xor return; } If X, Y and K are vectors over GF(2^8) representing respectively the input, the output of S_E2() and the round key material then Y = X xor K. Remark: K = sK xor shift. We use K for simplification purpose. Now considering the full round we have : V = M.U [a] (rbytechain) W = M'.V [b] (rbitshift) Y = W xor K [c] (S_E2) Linear algebra allows the composition of applications rbytechain() and rbitshift() since the dimensions of M and M' match but W in [b] is a vector over GF(2) whereas W in [c] is clearly over GF(2^8). However, due to the use of XOR in [c], Y, W and K can also be seen as vectors over GF(2). Therefore, S_E2() is a GF(2) affine map between two vector spaces of dimension 128. We then have: Y = M'.M.U xor K The use of differential cryptanalysis will help us to get rid of the key. Let's consider couples (U0,Y0 = E(U0)) and (U1,Y1 = E(U1)) then: DELTA(Y) = Y0 xor Y1 = (M'.M.U0 xor K) xor (M'.M.U1 xor K) = (M'.M.U0 xor M'.M.U1) xor K xor K (commutativity & associativity of xor) = (M'.M).(U0 xor U1) (distributivity) = (M'.M).DELTA(U) Such a result shows us that whatever sK and shift are, there is always a linear map linking an input differential to the corresponding output differential. The generalization to the 16 rounds using matrix multiplication is obvious. Therefore we have proved that there exists a 128x128 matrix Mf over GF(2) such as DELTA(Y) = Mf.DELTA(X) for the linearized version of the cipher. Then assuming we know one couple (U0,Y0) and Mf, we can encrypt any input U. Indeed, Y xor Y0 = Mf.(U xor U0) therefore Y = (Mf.(U xor U0)) xor Y0. Remark 1: The attack doesn't give us the knowledge of subkeys and shifts but such a thing is useless. The goal of an attacker is not the key in itself but rather the ability of encrypting/decrypting a set of plaintexts/ciphertexts. Furthermore, considering the key scheduling operation, if we really needed to recover the master key, it would be quite a pain in the ass considering the fact that checksum128() is a one way function ;-) Remark 2: Obviously in order to decrypt any output Y we need to calculate Mf^-1 which is the inverse matrix of Mf. This is somewhat more interesting isn't it ? :-) Because of rbitshift(), we are unable to determine using matrix multiplications the coefficients of Mf. An exhaustive search is of course impossible because of the huge complexity (2^16384) however, finding them is equivalent to solving 128 systems (1 system per row of Mf) of 128 variables (1 variable per column) in GF(2). To build such a system, we need 128 couples of (cleartext,ciphertext). The described attack was implemented using the nice NTL library ([SHOUP]) and can be found in annexe A of this paper. $ g++ break_linear.cpp bitshift.o bytechain.o key.c hash32.o checksum128.o -o break_linear -lntl -lcrypto -I include $ ./break_linear [+] Generating the plaintexts / ciphertexts [+] NTL stuff ! [+] Calculation of Mf [+] Let's make a test ! [+] Well done boy :> Remark: Sometimes NTL detects a linear relation between chosen inputs (DELTA_X) and will then refuse to work. Indeed, in order to solve the 128 systems, we need a situation where every equations are independent. If it's not the case, then obviously det(M) is equal to 0 (with probability 1/2). Since inputs are randomly generated, just try again until it works :-) $ ./break_linear [+] Generating the plaintexts / ciphertexts [+] NTL stuff ! det(M) = 0 As a conclusion we saw that the linearity over GF(2) of the xor operation allowed us to write an affine relation between two elements of GF(2)^128 in the S_E2() function and then to easily break the linearized version using a 128 known plaintext attack. The use of non linearity is crucial in the design. Fortunately for DPA-128, Veins chose the addition modulo 256 as the key mixer which is naturally non linear over GF(2). --[ 6 - On the non linearity of addition modulo n over GF(2) The bitshift() and bytechain() functions can be described using matrix over GF(2) therefore it is interesting to use this field for algebraic calculations. The difference between addition and xor laws in GF(2^n) lies in the carry propagation: w(i) + k(i) = w(i) xor k(i) xor carry(i) where w(i), k(i) and carry(i) are elements of GF(2). We note w(i) as the i'th bit of w and will keep this notation until the end. carry(i), written c(i) for simplification purpose, is defined recursively: c(i+1) = w(i).k(i) xor w(i).c(i) xor k(i).c(i) with c(0) = 0 Using this notation, it would thus be possible to determine a set of relations over GF(2) between input/output bits which the attacker controls using a known plaintext attack and the subkey bits (which the attacker tries to guess). However, recovering the subkey bits won't be that easy. Indeed, to determine them, we need to get rid of the carries replacing them by multivariate polynomials were unknowns are monomials of huge order. Remark 1: Because of the recursivity of the carry, the order of monomials grows up as the number of input bits per round as well as the number of rounds increases. Remark 2: Obviously we can not use intermediary input/output bits in our equations. This is because unlike the subkey bits, they are dependent of the input. We are thus able to express the cryptosystem as a multivariate polynomial system over GF(2). Solving such a system is NP-hard. There exists methods for system of reasonable order like groebner basis and relinearization techniques but the order of this system seems to be far too huge. However for a particular set of keys, the so-called weak keys, it is possible to determine the subkeys quite easily getting rid of the complexity introduced by the carry. --[ 7 - Exploiting weak keys Let's first define a weak key. According to wikipedia: "In cryptography, a weak key is a key which when used with a specific cipher, makes the cipher behave in some undesirable way. Weak keys usually represent a very small fraction of the overall keyspace, which usually means that if one generates a random key to encrypt a message weak keys are very unlikely to give rise to a security problem. Nevertheless, it is considered desirable for a cipher to have no weak keys." Actually we identified a particular subset |W of |K allowing us to deal quite easily with the carry problem. A key "k" is part of |W if and only if for each round the shift parameter is a multiple of 8. The reader should understand why later. We will first present the attack on a reduced version of DPA for simplicity purpose and generalize it later to the full version. --[ 7.1 - Playing with a toy cipher Our toy cipher is a 2 rounds DPA. Moreover, the cipher takes as input 4*8 bits instead of 16*8 = 128 bits which means that DPA_BLOCK_SIZE = 4. We also make a little modification in bytechain() operation. Let's remember the bytechain() function: void rbytechain(unsigned char *block) { int i; for (i = 0; i < DPA_BLOCK_SIZE; ++i) block[i] ^= block[(i + 1) % DPA_BLOCK_SIZE]; return; } Since block is both input AND output of the function then we have for DPA_BLOCK_SIZE = 4: V(0) = U(0) xor U(1) V(1) = U(1) xor U(2) V(2) = U(2) xor U(3) V(3) = U(3) xor V(0) = U(0) xor U(1) xor U(3) Where V(x) is the x'th byte element. Thus with our modification: V(0) = U(0) xor U(1) V(1) = U(1) xor U(2) V(2) = U(2) xor U(3) V(3) = U(3) xor U(0) Regarding the mathematical notation (pay your ascii !@#): - U,V,W,Y vector notation of section 5 remains. - Xj(i) is the i'th bit of vector Xj where j is j'th round. - U0 vector is equivalent to P where P is a plaintext. - m is the shift of round 0 - n is the shift of round 1 - xor will be written '+' since calculation is done in GF(2) - All calculation of subscript will be done in the ring ZZ_32 How did we choose |W? Using algebra in GF(2) implies to deal with the carry. However, if k is a weak key (part of |W), then we can manage the calculation so that it's not painful anymore. Let i be the lowest bit of any input byte. Therefore for each i part of the set {0,8,16,24} we have: u0(i) = p(i) v0(i) = p(i) + p(i+8) w0(i+m) = v0(i) y0(i) = w0(i) + k0(i) + C0(i) y0(i+m) = w0(i+m) + k0(i+m) + C0(i+m) y0(i+m) = p(i) + p(i+8) + k0(i+m) + C0(i+m) /* carry(0) = 0 */ y0(i+m) = p(i) + p(i+8) + k0(i+m) u1(i) = y0(i) v1(i) = y0(i) + y0(i+8) w1(i+n) = v1(i) y1(i) = w1(i) + k1(i) + C1(i) y1(i+n) = w1(i+n) + k1(i+n) + C1(i+n) y1(i+n) = y0(i) + y0(i+8) + k1(i+n) + C1(i+n) y1(i+n+m) = y0(i+m) + y0(i+m+8) + k1(i+n+m) + C1(i+n+m) /* carry(0) = 0 */ y1(i+n+m) = p(i) + p(i+8) + k0(i+m) + p(i+8) + p(i+16) + k0(i+m+8) + k1(i+n+m) y1(i+n+m) = p(i) + k0(i+m) + p(i+16) + k0(i+m+8) + k1(i+n+m) As stated before, i is part of the set {0,8,16,24} so we can write: y1(n+m) = p(0) + k0(m) + p(16) + k0(m+8) + k1(n+m) y1(8+n+m) = p(8) + k0(8+m) + p(24) + k0(m+16) + k1(8+n+m) y1(16+n+m) = p(16) + k0(16+m) + p(0) + k0(m+24) + k1(16+n+m) y1(24+n+m) = p(24) + k0(24+m) + p(8) + k0(m) + k1(24+n+m) In the case of a known plaintext attack, the attacker has the knowledge of a set of couples (P,Y1). Therefore considering the previous system, the lowest bit of K0 and K1 vectors are the unknowns. Here we have a system which is clearly underdefined since it is composed of 4 equations and 4*2 unknowns. It will give us the relations between each lowest bit of Y and the lowest bits of K0 and K1. Remark 1: n,m are unknown. A trivial approach is to determine them which costs a complexity of (2^4)^2 = 2^8. Although it may seem a good idea, let's recall the reader that we are considering a round reduced cipher! Indeed, applying the same idea to the full 16 rounds would cost us (2^4)^16 = 2^64! Such a complexity is a pain in the ass even nowadays :-) A much better approach is to guess (n+m) as it costs 2^4 what ever the number of rounds. It gives us the opportunity to write relations between some input and output bits. We do not need to know exactly m and n. The knowledge of the intermediate variables k0(x+m) and k1(y+n+m) is sufficient. Remark 2: An underdefined system brings several solutions. We are thus able to choose arbitrarily 4 variables thus fixing them with values of our choice. Of course we have to choose so that we are able to solve the system with remaining variables. For example taking k0(m), k0(m+8) and k1(n+m) together is not fine because of the first equation. However, fixing all the k0(x+m) may be a good idea as it automatically gives the k1(y+n+m) corresponding ones. Now let's go further. Let i be part of the set {1,9,17,25}. We can write: u0(i) = p(i) v0(i) = p(i) + p(i+8) w0(i+m) = v0(i) y0(i) = w0(i) + k0(i) + w0(i-1)*k0(i-1) y0(i+m) = w0(i+m) + k0(i+m) + w0(i+m-1)*k0(i+m-1) y0(i+m) = p(i) + p(i+8) + k0(i+m) + w0(i+m-1)*k0(i+m-1) y0(i+m) = p(i) + p(i+8) + k0(i+m) + (p(i-1) + p(i-1+8))*k0(i+m-1) u1(i) = y0(i) v1(i) = y0(i) + y0(i+8) w1(i+n) = v1(i) y1(i) = w1(i) + k1(i) + C1(i) y1(i) = w1(i) + k1(i) + w1(i-1)*k1(i-1) y1(i+n) = w1(i+n) + k1(i+n) + w1(i-1+n)*k1(i-1+n) y1(i+n) = y0(i) + y0(i+8) + k1(i+n) + (y0(i-1) + y0(i+8-1)) * k1(i-1+n) y1(i+n+m) = y0(i+m) + y0(i+m+8) + k1(i+m+n) + (y0(i+m-1) + y0(i+m+8-1)) * k1(i+m+n-1) y1(i+n+m) = p(i) + p(i+8) + k0(i+m) + (p(i-1) + p(i-1+8)) * k0(i+m-1) + p(i+8) + p(i+16) + k0(i+m+8) + (p(i+8-1) + p(i-1+16)) * k0(i+m-1+8) + k1(i+n+m) + k1(i+m+n-1) * [p(i-1) + p(i+8-1) + k0(i+m-1)] + k1(i+m+n-1) * [p(i-1+8) + p(i+16-1) + k0(i+m-1+8)] y1(i+n+m) = p(i) + k0(i+m) + (p(i-1) + p(i-1+8)) * k0(i+m-1) + p(i+16) + k0(i+m+8) + (p(i+8-1) + p(i-1+16)) * k0(i+m-1+8) + k1(i+n+m) + k1(i+m+n-1)*[p(i-1) + k0(i+m-1)] + k1(i+m+n-1)*[p(i-1+16) + k0(i+m-1+8)] Thanks to the previous system resolution, we have the knowledge of k0(i+m+n-1+x) and k1(i+m-1+y) variables. Therefore, we can reduce the previous equation to: A(i) = k0(i+m) + k0(i+m+8) + k1(i+n+m) (alpha) where A(i) is a known value for the attacker. Remark 1: This equation represents the same system as found in case of i being the lowest bit! Therefore all previous remarks remain. Remark 2: If we hadn't have the knowledge of k0(i+m+n-1+x) and k1(i+m-1+y) bits then the number of variables would have grown seriously. Moreover we would have had to deal with some degree 2 monomials :-/. We can thus conjecture that the equation alpha will remain true for each i part of {a,a+8,a+16,a+24} where 0 <= a < 8. --[ 7.2 - Generalization and expected complexity Let's deal with the real bytechain() function now. As stated before and for DPA_BLOCK_SIZE = 4 we have: V(0) = U(0) xor U(1) V(1) = U(1) xor U(2) V(2) = U(2) xor U(3) V(3) = U(0) xor U(1) xor U(3) This is clearly troublesome as the last byte V(3) is NOT calculated like V(0), V(1) and V(2). Because of the rotations involved, we wont be able to know when the bit manipulated is part of V(3) or not. Therefore, we have to use a general formula: V(i) = U(i) + U(i+1) + a(i).U(i+2) where a(i) = 1 for i = 24 to 31 For i part of {0,8,16,24} we have: u0(i) = p(i) v0(i) = p(i) + p(i+8) + a0(i).p(i+16) w0(i+m) = v0(i) y0(i) = w0(i) + k0(i) + C0(i) y0(i+m) = w0(i+m) + k0(i+m) + C0(i+m) y0(i+m) = p(i) + p(i+8) + a0(i).p(i+16) + k0(i+m) + C0(i+m) /*carry(0) = 0*/ y0(i+m) = p(i) + p(i+8) + a0(i).p(i+16) + k0(i+m) So in the second round: u1(i) = y0(i) v1(i) = y0(i) + y0(i+8) + a1(i).y0(i+16) w1(i+n) = v1(i) y1(i) = w1(i) + k1(i) + C1(i) y1(i+n) = w1(i+n) + k1(i+n) + C1(i+n) y1(i+n) = y0(i) + y0(i+8) + a1(i).y0(i+16) + k1(i+n) + C1(i+n) y1(i+n+m) = y0(i+m) + y0(i+m+8) + a1(i+m).y0(i+m+16) + k1(i+n+m) y1(i+n+m) = p(i) + p(i+8) + a0(i).p(i+16) + k0(i+m) + p(i+8) + p(i+16) + a0(i).p(i+24) + k0(i+m+8) + a1(i+m).[p(i+16) + p(i+24) + a0(i).p(i) + k0(i+m+16)] + k1(i+n+m) y1(i+n+m) = p(i) + a0(i).p(i+16) + k0(i+m) + p(i+16) + a0(i).p(i+24) + k0(i+m+8) + a1(i+m).[p(i+16) + p(i+24) + a0(i).p(i) + k0(i+m+16)] + k1(i+n+m) a0(i) is not a problem since we know it. This is coherent with the fact that the first operation of the cipher is rbytechain() which is invertible for the attacker. However, the problem lies in the a1(i+m) variables. Guessing a1(i+m) is out of question as it would cost us a complexity of (2^4)^15 = 2^60 for the 16 rounds! The solution is to consider a1(i+m) as an other set of 4 variables. We can also add the equation to our system: a1(m) + a1(m+8) + a1(m+16) + a1(m+24) = 1 This equation will remain true for other bits. So what is the global complexity? Obviously with DPA_BLOCK_SIZE = 16 each system is composed of 16+1 equations of 16+1 variables (we fixed the others). Therefore, the complexity of the resolution is: log(17^3) / log(2) ~ 2^13. We will solve 8 systems since there are 8 bits per byte. Thus the global complexity is around (2^13)*8 = 2^16. Remark: We didn't take into account the calculation of equation as it is assumed to be determined using a formal calculation program such as pari-gp or magma. --[ 7.3 - Cardinality of |W What is the probability of choosing a weak key? We have seen that our weak key criterion is that for each round, the rotation parameter needs to be multiple of 8. Obviously, it happens with 16 / 128 = 1/8 theoretical probability per round. Since we consider subkeys being random, the generation of rotation parameters are independent which means that the overall probability is (1/16)^16 = 1/2^64. Although a probability of 1/2^64 means a (huge) set of 2^64 weak keys, in the real life, there are very few chances to choose one of them. In fact, you probably have much more chances to win lottery ;) However, two facts must be noticed: - We presented one set of weak keys but there be some more! - We illustrated an other weakness in the conception of DPA-128 Remark: A probability of 1/8 per round is completely theoretic as it supposes a uniform distribution of hash32() output. Considering the extreme simplicity of the hash32() function, it wouldn't be too surprising to be different in practice. Therefore we made a short test to compute the real probability (Annexe B). $ gcc test.hash32.c checksum128.o hash32.o -o test.hash32 -O3 -fomit-frame-pointer $ time ./test.hash32 [+] Probability is 0.125204 real 0m14.654s user 0m14.649s sys 0m0.000s $ gp -q ? (1/0.125204) ^ 16 274226068900783.2739747241633 ? log(274226068900783.2739747241633) / log(2) 47.96235905375676878381741198 ? This result tells us clearly that the probability of shift being multiple of 8 is around 1/2^2.99 ~ 1/8 per round which is assimilated to the theoretical one since the difference is too small to be significant. In order to improve the measure, we used checksum128() as an input of hash32(). Furthermore, we also tried to test hash32() without the "*k &&" bug mentioned earlier. Both tests gave similar results which means that the bug is not important in practice and that checksum128() doesn't seem to be particularly skewed. This is a good point for DPA! :-D --[ 8 - Breaking DPA-based unkeyed hash function In his paper, Veins also explains how a hash function can be built out of DPA. We will analyze the proposed scheme and will show how to completely break it. --[ 8.1 - Introduction to hash functions Quoting once again the excellent HAC [MenVan]: "A hash function is a function h which has, as a minimum, the following two properties: 1. compression - h maps an input x of arbitrary finit bitlength, to an output h(x) of fixed bitlength n. 2. ease of computation - given h and an input x, h(x) is easy to compute. In cryptography there are essentially two families of hash functions: 1. The MAC (Message Authentication Codes). They are keyed ones and provides both authentication (of source) and integrity of messages. 2. The MDC (Modification Detection Code), sometimes referred as MIC. They are unkeyed and only provide integrity. We will focus on this kind of functions. When designing his hash function, the cryptographer generally wants it to satisfy the three properties: - preimage resistance. For any y, it should not be possible (that is to say computationally infeasible) to find an x such as h(x) = y. Such a property implies that the function has to be non invertible. - 2nd preimage resistance. For any x, it should not be possible to find an x' such as h(x) = h(x') when x and x' are different. - collision resistance. It should not be possible to find an x and an x' (with x different of x') such that h(x) = h(x'). Remark 1: Properties 1 and 2 and essentials when dealing with binary integrity. Remark 2: The published attacks on MD5 and SHA-0/SHA-1 were dealing with the third property. While it is true that finding collisions on a hash function is enough for the crypto community to consider it insecure (and sometimes leads to a new standard [NIST2007]), for most of usages it still remains sufficient. There are many way to design an MDC function. Some functions are based on MD4 function such as MD5 or SHA* functions which heavily rely on boolean algebra and operations in GF(2^32), some are based on NP problems such as RSA and finally some others are block cipher based. The third category is particularly interesting since the security of the hash function can be reduced to the one of the underlying block cipher. This is of course only true with a good design. --[ 8.2 - DPAsum() algorithm The DPA-based hash function lies in the functions DPA_sum() and DPA_sum_write_to_file() which can be found respectively in file sum.c and data.c. Let's detail them a little bit using pseudo code: Let M be the message to hash, let M(i) be the i'th 128b block message. Let N = DPA_BLOCK_SIZE * i + j be the size in bytes of the message where i and j are integers such as i = N / DPA_BLOCK_SIZE and 0 <= j < 16. Let C be an array of 128 bits elements were intermediary results of hash calculation are stored. The last element of this array is the hash of the message. func DPA_sum(K0,M,C): K0 = key("deadbeef"); IV = "0123456789abcdef"; C(0) = E( IV , K0); C(1) = E( IV xor M(0) , K0); FOR a = 1 to i-1: C(a+1) = E( C(a) xor M(a) , K0); if j == 0: C(i+1) = E( C(i) xor 000...000 , K0) else C(i+1) = E( C(i) xor PAD( M(i) ); C(i+2) = E( C(i+1) xor 000...00S , K0) /* s = 16-j */ return; func DPA_sum_write_to_file(C, file): write(file,C(last_element)); return; --[ 8.3 - Weaknesses in the design/implementation We noticed several implementation mistakes in the code: a) Using the algorithm of hash calculation, every element of array C is defined recursively however C(0) is never used in calculation. This doesn't impact security in itself but is somewhat strange and could let us think that the function was not designed before being programmed. b) When the size of M is not a multiple of DPA_BLOCK_SIZE (j is not equal to 0) then the algorithms calculates the last element using a xor mask where the last byte gives information on the size of the original message. However, what is included in the padding is not the size of the message in itself but rather the size of padding. If we take the example of the well known Merkle-Damgard construction on which are based MD{4,5} and SHA-{0,1} functions, then the length of the message was initially appended in order to prevent collisions attacks for messages of different sizes. Therefore in the DPASum() case, appending j to the message is not sufficient as it would be possible to find collisions for messages of size (DPA_BLOCK_SIZE*a + j) and (DPA_BLOCK_SIZE*b + j) were obviously a and b are different. Remark: The fact that the IV and the master key are initially fixed is not a problem in itself since we are dealing with MDC here. --[ 8.4 - A (2nd) preimage attack Because of the hash function construction properties, being given a message X, it is trivial to create a message X' such as h(X) = h(X'). This is called building a 2nd preimage attack. We built a quick & dirty program to illustrate it (Annexe C). It takes a 32 bytes message as input and produces an other 32 bytes one with the same hash: $ cat to.hack | hexdump -C 00000000 58 41 4c 4b 58 43 4c 4b 53 44 4c 46 4b 53 44 46 |XALKXCLKSDLFKSDF| 00000010 58 4c 4b 58 43 4c 4b 53 44 4c 46 4b 53 44 46 0a |XLKXCLKSDLFKSDF.| 00000020 $ ./dpa -s to.hack 6327b5becaab3e5c61a00430e375b734 $ gcc break_hash.c *.o -o break_hash -I ./include $ ./break_hash to.hack > hacked $ ./dpa -s hacked 6327b5becaab3e5c61a00430e375b734 $ cat hacked | hexdump -C 00000000 43 4f 4d 50 4c 45 54 45 4c 59 42 52 4f 4b 45 4e |COMPLETELYBROKEN| 00000010 3e bf de 93 d7 17 7e 1d 2a c7 c6 70 66 bb eb a3 |>.....~.*..pf...| 00000020 Nice isn't it ? :-) We were able to write arbitrary data in the first 16 bytes and then to calculate the next 16 bytes so that the 'hacked' file had the exact same hash. But how did we do such an evil thing? Assuming the size of both messages is 32 bytes then: h(Mi) = E(E(Mi(0) xor IV,K0) xor Mi(1),K0) Therefore, it is obvious that: h(M1) = h(M2) is equivalent to E(E(M1(0) xor IV,K0) xor M1(1),K0) = E(E(M2(0) xor IV,K0) xor M2(1),K0) Which can be reduced to: E(M1(0) xor IV,K0) xor M1(1) = E(M2(0) xor IV,K0) xor M2(1) Which therefore gives us: M2(1) = E(M2(0) xor IV,K0) xor E(M1(0) xor IV,K0) xor M1(1) [A] Since M1,IV,K0 are known parameters then for a chosen M2(0), [A] gives us M2(1) so that h(M1) = h(M2). Remark 1: Actually such a result can be easily generalized to n bytes messages. In particular, the attacker can put anything in his message and "correct it" using the last blocks (if n >= 32). Remark 2: Of course building a preimage attack is also very easy. We mentioned previously that we had for a 32 bytes message: h(Mi) = E(E(Mi(0) xor IV,K0) xor Mi(1),K0) Therefore, Mi(1) = E^-1(h(Mi),K0) xor E(Mi(0) xor IV,K0) [B] The [B] equation tells us how to generate Mi(1) so that we have h(Mi) in output. It doesn't seem to be really a one way hash function does it ? ;-) Building a hash function out of a block cipher is a well known problem in cryptography which doesn't only involve the security of the underlying block cipher. One should rely on one of the many well known and heavily analyzed algorithms for this purpose instead of trying to design one. --[ 9 - Conclusion We put into evidence some weaknesses of the cipher and were also able to totally break the proposed hash function built out of DPA. In his paper, Veins implicitly set the bases of a discussion to which we wish to deliver our opinion. We claim that it is necessary to understand properly cryptology. The goal of this paper wasn't to illustrate anything else but that fact. Being hacker or not, paranoid or simply careful, the rule is the same for everybody in this domain: nothing should be done without reflexion. --[ 10 - Greetings #TF crypto dudes for friendly and smart discussions and specially X for giving me a lot of hints. I learned a lot from you guys :-) #K40r1 friends for years of fun ;-) Hi all :) Finally but not least my GF and her kindness which is her prime characteristic :> (However if she finds out the joke in the last sentence I may die :|) --[ 11 - Bibliography [DPA128] A Polyalphabetic Substitution Cipher, Veins, Phrack 62. [FakeP63] Keeping 0day Safe, m1lt0n, Phrack(.nl) 63. [MenVan] Handbook of Applied Cryptography, Menezes, Oorschot & Vanstone. [Knud99] Correlation in RC6, L. Knudsen & W. Meier. [CrypTo] Two balls ownz one, http://fr.wikipedia.org/wiki/Cryptorchidie [Vaud] An Experiment on DES - Statistical Cryptanalysis, S. Vaudenay. [Ryabko] Adaptative chi-square test and its application to some cryptographic problems, B. Ryabko. [CourtAlg] How Fast can be Algebraic Attacks on Block Ciphers ?, Courtois. [BiSha90] Differential Cryptanalysis of DES-like Cryptosystems, E. Biham & A. Shamir, Advances in Cryptology - CRYPTO 1990. [Matsui92] A new method for known plaintext attack of FEAL cipher, Matsui & A. Yamagishi, EUROCRYPT 1992. [NSA2007] Just kidding ;-) [SHOUP] NTL library, V. Shoup, http://www.shoup.net/ntl/ [NIST2007] NIST, http://www.csrc.nist.gov/pki/HashWorkshop/index.html, 2007 --[ Annexe A - Breaking the linearised version 8<- - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - /* Crappy C/C++ source. I'm in a hurry for the paper redaction so don't * blame me toooooo much please ! :> */ #include #include #include #include #include #include #include #include #include #include "dpa.h" using namespace NTL; void S_E2 (unsigned char *key, unsigned char *block, unsigned int s) { int i; for (i = 0; i < DPA_BLOCK_SIZE; ++i) { block[i] ^= (key[i] ^ s) % 256; } return; } void E2 (unsigned char *key, unsigned char *block, unsigned int shift) { rbytechain (block); rbitshift (block, shift); S_E2 (key, block, shift); } void DPA_ecb_encrypt (DPA_KEY * key, unsigned char * src, unsigned char * dst) { int j; memcpy (dst, src, DPA_BLOCK_SIZE); for (j = 0; j < 16; j++) E2 (key->subkey[j].key, dst, key->subkey[j].shift); return; } void affichage(unsigned char *chaine) { int i; for(i=0; i<16; i++) printf("%.2x",(unsigned char )chaine[i]); printf("\n"); } unsigned char test_p[] = "ABCD_ABCD_12____"; unsigned char test_c1[16]; unsigned char test_c2[16]; DPA_KEY key; RC4_KEY rc4_key; struct vect { unsigned char plaintxt[16]; unsigned char ciphertxt[16]; }; struct vect toto[128]; unsigned char src1[16], src2[16]; unsigned char block1[16], block2[16]; int main() { /* Key */ unsigned char str_key[] = " _323DFF?FF4cxsdé&"; DPA_set_key (&key, str_key, DPA_KEY_SIZE); /* Init our RANDOM generator */ char time_key[16]; snprintf(time_key, 16, "%d%d",(int)time(NULL), (int)time(NULL)); RC4_set_key(&rc4_key, strlen(time_key), (unsigned char *)time_key); /* Let's crypt 16 plaintexts */ printf("[+] Generating the plaintexts / ciphertexts\n"); int i=0; int a=0; for(; i<128; i++) { RC4(&rc4_key, 16, src1, src1); // Input is nearly random :) DPA_ecb_encrypt (&key, src1, block1); RC4(&rc4_key, 16, src2, src2); // Input is nearly random :) DPA_ecb_encrypt (&key, src2, block2); for(a=0;a<16; a++) { toto[i].plaintxt[a] = src1[a] ^ src2[a]; toto[i].ciphertxt[a] = block1[a] ^ block2[a]; } } /* Now the NTL stuff */ printf("[+] NTL stuff !\n"); vec_GF2 m2(INIT_SIZE,128); vec_GF2 B(INIT_SIZE,128); mat_GF2 M(INIT_SIZE,128,128); mat_GF2 Mf(INIT_SIZE,128,128); // The final matrix ! clear(Mf); clear(M); clear(m2); clear(B); /* Lets fill M correctly */ int k=0; int j=0; for(k=0; k<128; k++) // each row ! { for(i=0; i<16; i++) { for(j=0; j<8; j++) M.put(i*8+j,k,(toto[k].plaintxt[i] >> j)&0x1); } } GF2 d; determinant(d,M); /* if !det then it means the vector were linearly linked :'( */ if(IsZero(d)) { std::cout << "det(M) = 0\n" ; exit(1); } /* Let's solve the 128 system :) */ printf("[+] Calculation of Mf\n"); for(k=0; k<16; k++) { for(j=0; j<8; j++) { for(i=0; i<128; i++) { B.put(i,(toto[i].ciphertxt[k] >> j)&0x1); } solve(d, m2, M, B); #ifdef __debug__ std::cout << "m2 is " << m2 << "\n"; #endif int b=0; for(;b<128;b++) Mf.put(k*8+j,b,m2.get(b)); } } #ifdef __debug__ std::cout << "Mf = " << Mf << "\n"; #endif /* Now that we have Mf, let's make a test ;) */ printf("[+] Let's make a test !\n"); bzero(test_c1, 16); bzero(test_c2, 16); char DELTA_X[16]; char DELTA_Y[16]; bzero(DELTA_X, 16); bzero(DELTA_Y, 16); DPA_ecb_encrypt (&key, test_p, test_c1); // DELTA_X ! unsigned char U0[] = "ABCDEFGHABCDEFG1"; unsigned char Y0[16]; DPA_ecb_encrypt (&key, U0, Y0); for(i=0; i<16; i++) { DELTA_X[i] = test_p[i] ^ U0[i]; } // DELTA_Y ! vec_GF2 X(INIT_SIZE,128); vec_GF2 Y(INIT_SIZE,128); clear(X); clear(Y); for(k=0; k<16; k++) { for(j=0; j<8; j++) { X.put(k*8+j,(DELTA_X[k] >> j)&0x1); } } Y = Mf * X; #ifdef __debug__ std::cout << "X = " << X << "\n"; std::cout << "Y = " << Y << "\n"; #endif GF2 z; for(k=0; k<16; k++) { for(j=0; j<8; j++) { z = Y.get(k*8+j); if(IsOne(z)) DELTA_Y[k] |= (1 << j); } } // test_c2 ! for(i=0; i<16; i++) test_c2[i] = DELTA_Y[i] ^ Y0[i]; /* Compare the two vectors */ if(!memcmp(test_c1,test_c2,16)) printf("\t=> Well done boy :>\n"); else printf("\t=> Hell !@#\n"); #ifdef __debug__ affichage(test_c1); affichage(test_c2); #endif return 0; } 8<- - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - --[ Annexe B - Probability evaluation of (hash32()%8 == 0) 8<- - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - - #include #include #include #include #define NBR_TESTS 0xFFFFF int main() { int i = 0, j = 0; char buffer[16]; int cmpt = 0; int rand = (time_t)time(NULL); float proba = 0; srandom(rand); for(;i #include #include #include #include #include #include "dpa.h" void E2 (unsigned char *key, unsigned char *block, unsigned int shift) { rbytechain (block); rbitshift (block, shift); S_E (key, block, shift); } void DPA_ecb_encrypt (DPA_KEY * key, unsigned char * src, unsigned char * dst) { int j; memcpy (dst, src, DPA_BLOCK_SIZE); for (j = 0; j < 16; j++) E2 (key->subkey[j].key, dst, key->subkey[j].shift); return; } void affichage(unsigned char *chaine) { int i; for(i=0; i<16; i++) printf("%.2x",(unsigned char )chaine[i]); printf("\n"); } int main(int argc, char **argv) { DPA_KEY key; unsigned char str_key[] = "deadbeef"; unsigned char IV[] = "0123456789abcdef"; unsigned char evil_payload[] = "COMPLETELYBROKEN"; unsigned char D0[16],D1[16]; unsigned char final_message[32]; int fd_r = 0; int i = 0; if(argc < 2) { printf("Usage : %s \n",argv[0]); exit(EXIT_FAILURE); } DPA_set_key (&key, str_key,8); if((fd_r = open(argv[1], O_RDONLY)) < 0) { printf("[+] Fuck !@#\n"); exit(EXIT_FAILURE); } if(read(fd_r, D0, 16) != DPA_BLOCK_SIZE) { printf("Too short !@#\n"); exit(EXIT_FAILURE); } if(read(fd_r, D1, 16) != DPA_BLOCK_SIZE) { printf("Too short 2 !@#\n"); exit(EXIT_FAILURE); } close(fd_r); memcpy(final_message, evil_payload, DPA_BLOCK_SIZE); blockchain(evil_payload, IV); DPA_ecb_encrypt (&key, evil_payload, evil_payload); blockchain(D0,IV); DPA_ecb_encrypt (&key, D0, D0); blockchain(D0,D1); blockchain(evil_payload, D0); memcpy(final_message+DPA_BLOCK_SIZE, evil_payload, DPA_BLOCK_SIZE); for(i=0; i