Inconsistent operand order between SSE and AVX
Created by: YValeri
There is currently a problem with the way DynamoRIO orders the operands of AVX and SSE floating-point instructions, explained in this post: (https://groups.google.com/forum/#!topic/DynamoRIO-Users/EwjJLo-fBdo). For AVX, the Intel manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf) specifies the operands as "instr destination, source0, source1", so all the operands are given explicitly. For SSE, the form differs a bit: "instr destination(=source0), source1". The destination and the first source are the same register, so the instruction operates in place.
For AVX, DynamoRIO's ordering is the correct one: instr_get_src(instr, 0) returns source0 and instr_get_src(instr, 1) returns source1 (and instr_get_dst works fine as well). For SSE, on the contrary, since the first operand is a destination AND a source at the same time, DynamoRIO treats the destination as an explicit operand and the matching source as implicit. The implicit source is therefore placed last in the instruction's source list, so instr_get_src(instr, 0) actually returns source1, while instr_get_src(instr, 1) returns source0.
This is counterintuitive, but above all it is a real problem when overloading instructions, because one has to handle SSE and AVX instructions differently because of DynamoRIO, not because of the Intel manual. Also, since subtraction and division are not commutative, reading the sources in the wrong order changes the computed result, which is extremely problematic. Many instructions are affected. Here is a non-exhaustive list:
SSE -> {add|mul|sub|div}{s|p}{s|d}
AVX -> v{add|mul|sub|div}{s|p}{s|d}
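
To make the effect concrete, here is a minimal sketch in plain C. It does not use the DynamoRIO API: `model_instr_t` and every function below are hypothetical names invented for this illustration, and the source ordering is simply the one described in this report. It shows a client that computes "first source minus second source" getting the wrong sign for the SSE form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a decoded instruction: the source values in the
 * order a client would read them with instr_get_src(instr, i). */
typedef struct {
    int is_sse;      /* 1: two-operand SSE form, 0: three-operand AVX form */
    double srcs[2];  /* source values in reported order */
} model_instr_t;

/* "subsd dst, src1" as described in this report: the explicit source1
 * comes first and the implicit dst/source0 comes last. */
model_instr_t model_sse_sub(double dst_value, double src1_value) {
    model_instr_t in = { 1, { src1_value, dst_value } };
    return in;
}

/* "vsubsd dst, src0, src1": both sources are explicit, in Intel order. */
model_instr_t model_avx_sub(double src0_value, double src1_value) {
    model_instr_t in = { 0, { src0_value, src1_value } };
    return in;
}

/* Naive client: always computes srcs[0] - srcs[1]. */
double naive_sub(const model_instr_t *in) {
    assert(in != NULL);
    return in->srcs[0] - in->srcs[1];
}

/* Order-aware client: swaps the sources back for the SSE form. */
double ordered_sub(const model_instr_t *in) {
    assert(in != NULL);
    if (in->is_sse)
        return in->srcs[1] - in->srcs[0];
    return in->srcs[0] - in->srcs[1];
}
```

With a register holding 5.0 as source0 and 2.0 as source1, `naive_sub` gives 3.0 for the AVX model but -3.0 for the SSE model. The special case in `ordered_sub` is exactly the per-encoding workaround this report would like to avoid.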
Moreover, as stated in the same post, this is even more obvious for FMA instructions, because Intel defines them the same way as SSE instructions: "instr destination(=source0), source1, source2". FMA is also a special case because the role of each operand is encoded in the mnemonic itself. For instance: "VFMADD231SS xmm0, xmm1, xmm2" -> multiply xmm1 by xmm2, add xmm0, put the result in xmm0. Since source0 is implicit, DynamoRIO places it at the end of the sources, so in third position. If we don't handle this case specifically, "VFMADD231SS xmm0, xmm1, xmm2" becomes for DynamoRIO "VFMADD231SS xmm1, xmm2, xmm0" -> multiply xmm2 by xmm0, add xmm1, put the result in xmm0. Obviously, this is not at all the wanted result. Now consider that there is not only one order for FMA instructions, but 3 (231, 213 and 132), that there are scalar and packed instructions, and instructions for double and float. So you get a whopping total of 3 * 2 * 2 = 12 special cases to take into account. Add the FMSUB, FNMADD, FNMSUB, FMADDSUB and FMSUBADD variants, and the number of special cases becomes far too large to handle one by one.
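
The VFMADD231SS example above can be worked through with a small sketch in plain C. Again, this is not DynamoRIO code: `fma_form_t`, `eval_fma` and `restore_intel_order` are hypothetical helpers, and the "reported" ordering {source1, source2, source0} is an assumption taken from the description in this report. Plain multiply and add stand in for a real fused operation, which is enough to show the operand mix-up.

```c
#include <assert.h>
#include <stddef.h>

/* The three operand orderings encoded in the FMA mnemonic. */
typedef enum { FMA_132, FMA_213, FMA_231 } fma_form_t;

/* Evaluate a multiply-add from operands given in Intel order:
 * ops[0] = destination/source0, ops[1] = source1, ops[2] = source2.
 * (A real FMA rounds only once; this model uses plain * and +.) */
double eval_fma(fma_form_t form, const double ops[3]) {
    assert(ops != NULL);
    switch (form) {
    case FMA_132: return ops[0] * ops[2] + ops[1];
    case FMA_213: return ops[1] * ops[0] + ops[2];
    case FMA_231: return ops[1] * ops[2] + ops[0];
    }
    return 0.0; /* not reached for valid forms */
}

/* Undo the ordering described in this report, where the implicit
 * destination/source0 is reported last: {source1, source2, source0}. */
void restore_intel_order(const double reported[3], double intel[3]) {
    intel[0] = reported[2];
    intel[1] = reported[0];
    intel[2] = reported[1];
}
```

Taking xmm0 = 1, xmm1 = 2 and xmm2 = 3, the 231 form should give 2 * 3 + 1 = 7. Feeding the sources in the reported order {2, 3, 1} gives 3 * 1 + 2 = 5 instead, and only after `restore_intel_order` does the model return 7.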
Finally, as said in the same post, this is also not consistent for PEXT instructions. The mask is the second source (source1), and source0 (the value to which the mask is applied) is also the destination. So source0 is implicit and DynamoRIO puts it at the end of the sources. If we don't adapt to this particular case, we would use source0 as the mask instead of source1, and the result would be completely wrong.
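
A short software model of the bit extraction shows why swapping value and mask matters. This is a sketch only: `soft_pext` is a hypothetical helper written for this report, not a DynamoRIO or compiler intrinsic, and it models the 64-bit form.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a parallel bit extract: for each set bit of `mask`,
 * from lowest to highest, copy the matching bit of `value` into the next
 * low-order bit of the result. */
uint64_t soft_pext(uint64_t value, uint64_t mask) {
    uint64_t result = 0;
    uint64_t out_bit = 1;
    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set bit */
        if (value & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                     /* clear lowest set bit */
    }
    return result;
}
```

For example, `soft_pext(0xB2, 0xF0)` extracts the high nibble and yields `0xB`, while the swapped call `soft_pext(0xF0, 0xB2)` yields `0xE`. A client reading the two sources in the wrong order would silently compute the second value.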
Overall, the problem comes from the fact that when one of the sources is also the destination, DynamoRIO considers it implicit and puts it at the end of the source list. One therefore has to reorder the sources to get the correct result, and that same reordering would be wrong for instructions that have no implicit source, so both cases must be handled separately. I'd advise that DynamoRIO change the way implicit operands are managed when they are also the destination, and put them at their normal position, following the Intel manual to the letter. When one of the sources is implicit but is not the destination (e.g. push, which takes only one explicit source but has another implicit source, rsp), the current behaviour is fine in my opinion. This problem is important for floating-point arithmetic, as the author of the post first stated, but it is even more troublesome for FMA and PEXT.