i#5483 add support for avx512 bf16 instructions (!5489)

Merged prasun3 requested to merge i5483-add-support-for-avx512-bf16-instructions into master May 10, 2022

Added support for AVX512 bfloat16 instructions

The AVX512_BF16 extension defines the following three instructions.

VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data

EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)
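To make the encoding notation above concrete, here is a minimal sketch of how `EVEX.512.F2.0F38.W0 72 /r` maps to bytes for the register-register form. The function and enum names are hypothetical and are not part of DynamoRIO; memory operands and broadcast are left out, and the field layout is the standard EVEX one as I understand it.

```c
#include <stdint.h>

/* pp field: which legacy prefix the EVEX payload stands for. */
enum { PP_NONE = 0, PP_66 = 1, PP_F3 = 2, PP_F2 = 3 };
/* L'L field: vector length. */
enum { VL_128 = 0, VL_256 = 1, VL_512 = 2 };

/* Hypothetical helper (not a DynamoRIO API): writes the 6 bytes of a
 * register-register EVEX.0F38.W0 instruction. reg, vvvv and rm are vector
 * register numbers 0..31, mask is k0..k7.
 */
static void
evex_0f38_w0_rr(uint8_t out[6], unsigned pp, unsigned vl, uint8_t opcode,
                unsigned reg, unsigned vvvv, unsigned rm, unsigned mask,
                int zeroing)
{
    out[0] = 0x62; /* EVEX escape byte */
    /* P0: ~R ~X ~B ~R' 0 0 m m, with mm = 10b selecting the 0f 38 map. */
    out[1] = (uint8_t)((((~reg >> 3) & 1) << 7) | (((~rm >> 4) & 1) << 6) |
                       (((~rm >> 3) & 1) << 5) | (((~reg >> 4) & 1) << 4) |
                       0x02);
    /* P1: W ~vvvv 1 pp, with W = 0. */
    out[2] = (uint8_t)(((~vvvv & 0xf) << 3) | 0x04 | (pp & 3));
    /* P2: z L'L b ~V' aaa, with b = 0 (no broadcast). */
    out[3] = (uint8_t)(((zeroing ? 1u : 0u) << 7) | ((vl & 3) << 5) |
                       (((~vvvv >> 4) & 1) << 3) | (mask & 7));
    out[4] = opcode;
    /* ModRM with mod = 11b (register operand). */
    out[5] = (uint8_t)(0xc0 | ((reg & 7) << 3) | (rm & 7));
}
```

Under these assumptions, `evex_0f38_w0_rr(out, PP_F2, VL_512, 0x72, 1, 2, 3, 1, 1)` describes `vcvtne2ps2bf16 zmm1{k1}{z}, zmm2, zmm3`.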

VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data

EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst
EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst
EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2
A       Full    ModRM:reg (w)   ModRM:r/m (r)
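For context on what the conversion does per element, the sketch below models a single fp32 to bf16 conversion with round-to-nearest-even, following my reading of the pseudocode in Intel's instruction reference (denormal inputs become signed zero, NaNs are quieted). The function names are hypothetical, and this is an illustration rather than anything the decoder needs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one fp32 -> bf16 conversion, operating on
 * the raw IEEE-754 bit pattern.
 */
static uint16_t
fp32_bits_to_bf16(uint32_t x)
{
    uint32_t exponent = (x >> 23) & 0xffu;
    uint32_t fraction = x & 0x007fffffu;
    if (exponent == 0) {
        /* Zero or denormal input: keep only the sign. */
        return (uint16_t)((x >> 16) & 0x8000u);
    }
    if (exponent == 0xffu) {
        if (fraction == 0)
            return (uint16_t)(x >> 16); /* infinity: truncate */
        return (uint16_t)((x >> 16) | 0x0040u); /* NaN: truncate and quiet */
    }
    /* Normal number: add a rounding bias so that ties go to even. */
    uint32_t lsb = (x >> 16) & 1u;
    return (uint16_t)((x + 0x7fffu + lsb) >> 16);
}

/* Convenience wrapper taking a float. */
static uint16_t
fp32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return fp32_bits_to_bf16(bits);
}
```

Each 32-bit source element yields a 16-bit result, which is why the destination register in the 256-bit and 512-bit forms is half the width of the source.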

VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision

EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)

List of places to update

From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53

 * When adding new instructions, be sure to update all of these places:
 *   1) decode_table op_instr array
 *   2) decode_table decoding table entries
 *   3) OP_ enum (here) via x86opnums.pl
 *   4) update OP_LAST at end of enum here
 *   5) decode_fast tables if necessary (they are conservative)
 *   6) instr_create macros
 *   7) suite/tests/api/ir* tests
 *   8) add binutils tests in third_party/binutils/test_decenc

Step 1: update op_instr array

Added entries to op_instr. These point directly to evex_Wb_extensions because these instructions have only an EVEX encoding.

Step 2: add decode_table entries

- Updated the third_byte_38 table to point to prefix_extensions, since these instructions share opcodes and differ only in prefix.
  - The instructions VCVTNEPS2BF16 and VCVTNE2PS2BF16 have three-byte opcodes starting with 0f 38, so the decoder looks at third_byte_38[third_byte_38_index[opcode]].
  - Since these instructions have the same opcode (72) and differ only in the prefix (f2/f3), we need to point the third_byte_38 entry to prefix_extensions, which in turn points to the appropriate EVEX_Wb entries.
  - The instruction VDPBF16PS has the same opcode (52) as the VNNI instruction vpdpwssd, and they differ only in the prefix (f3/66). We need to update that entry to point to prefix_extensions instead of e_vex_extensions. This orphans the e_vex_extensions entry (e_vex ext 151). Do we remove this entry?
- Added entries in prefix_extensions to point to the appropriate vex/evex entries.
- Added leaf entries in evex_Wb_extensions.
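The prefix-based dispatch described above can be sketched as follows. This is a simplified, self-contained model: the types, table, and function names are hypothetical and do not match DynamoRIO's real decode tables, which have more columns and further levels (e_vex, EVEX_Wb). It only shows the idea that one opcode byte selects a row and the prefix selects the instruction.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the decode tables; these are not
 * DynamoRIO's real types.
 */
typedef enum { PFX_NONE, PFX_F3, PFX_66, PFX_F2, PFX_COUNT } prefix_t;

typedef struct {
    unsigned char opcode;             /* third opcode byte after 0f 38 */
    const char *by_prefix[PFX_COUNT]; /* NULL: not modelled in this sketch */
} prefix_row_t;

static const prefix_row_t prefix_rows[] = {
    /* opcode   none  f3               66          f2               */
    { 0x72, { NULL, "vcvtneps2bf16", NULL,       "vcvtne2ps2bf16" } },
    { 0x52, { NULL, "vdpbf16ps",     "vpdpwssd", NULL } },
};

/* Returns the instruction name for a 0f 38 opcode and prefix, or NULL if
 * the combination is not in the sketch table.
 */
static const char *
lookup_0f38(unsigned char opcode, prefix_t pfx)
{
    size_t i;
    for (i = 0; i < sizeof(prefix_rows) / sizeof(prefix_rows[0]); i++) {
        if (prefix_rows[i].opcode == opcode)
            return prefix_rows[i].by_prefix[pfx];
    }
    return NULL;
}
```

With this model, `lookup_0f38(0x52, PFX_F3)` and `lookup_0f38(0x52, PFX_66)` land on different instructions from the same opcode row, which is the reason the 52 entry has to go through a prefix table first.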

Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency.

Step 3: add OP_ enums

Done

Step 4: update OP_LAST

Not needed since OP_LAST already points to the last enum.

Step 5: decode_fast tables if necessary

Not done

Step 6: instr_create macros

Added 1dst_3src macros for VCVTNE2PS2BF16 and VDPBF16PS since they write to operand 1 and read from mask register, operand 2, and operand 3.

Added 1dst_2src macro for VCVTNEPS2BF16 since it writes to operand 1 and reads from mask register and operand 2. We are setting the destination size explicitly since this writes to "half" the destination.
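As a small illustration of why the destination size is set explicitly, the sketch below computes how many bytes VCVTNEPS2BF16 writes for a given source vector size. The helper name is hypothetical and is not part of the instr_create macros.

```c
#include <stddef.h>

/* Hypothetical helper (not a DynamoRIO API): each 4-byte single-precision
 * source element produces one 2-byte bf16 element, so the instruction
 * writes half as many bytes as it reads.
 */
static size_t
vcvtneps2bf16_dst_bytes(size_t src_vector_bytes)
{
    size_t num_elements = src_vector_bytes / 4; /* fp32 elements read */
    return num_elements * 2;                    /* bf16 bytes written */
}
```

For a 64-byte zmm source this gives 32 bytes (a ymm destination), and for a 16-byte xmm source it gives 8 bytes, that is, only the lower half of the xmm destination.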

Step 7: suite/tests/api/ir tests

Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h.

The VCVTNEPS2BF16 test is currently commented out because the destination size needs to be set explicitly.

Step 8: binutils tests

Added binutils tests that encode each instruction using the instr_create_.. APIs and compare the result against the expected opcode bytes. The tests go in this direction, rather than decoding the bytes and comparing disassembly, because our disassembly does not match binutils disassembly exactly.

These currently have two workarounds:

- set dest size explicitly
- set zeroing prefix explicitly