Full backports of the following patches:

commit b97eb2bdb1ed72982a7821c3078be591051cef59
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 16 14:58:43 2015 -0700

    Preserve bound registers in _dl_runtime_resolve
    
    We need to add a BND prefix before indirect branch at the end of
    _dl_runtime_resolve to preserve bound registers.

commit ddd85a65b6e3d6ec1e756c1f78559f99a2c943ca
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Jul 7 05:23:24 2015 -0700

    Add and use sysdeps/i386/link-defines.sym
    
    Define macros for fields in La_i86_regs and La_i86_retval and use them
    in dl-trampoline.S, instead of hardcoded values.
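
As a rough illustration of what the generated constants evaluate to, the following standalone C sketch mirrors the La_i86_regs layout from sysdeps/x86/bits/link.h; the gen-as-const machinery evaluates the same offsetof/sizeof expressions over the real struct at build time and emits them into link-defines.h (the struct copy here is for illustration only, not the build's source of truth):

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of La_i86_regs (see sysdeps/x86/bits/link.h).  The
   field order matches the LR_*_OFFSET entries in link-defines.sym.  */
typedef struct La_i86_regs_mirror
{
  uint32_t lr_edx;
  uint32_t lr_ecx;
  uint32_t lr_eax;
  uint32_t lr_ebp;
  uint32_t lr_esp;
} La_i86_regs_mirror;

/* On i386 this gives LR_EDX_OFFSET == 0, LR_ESP_OFFSET == 16 and
   LR_SIZE == 20, which dl-trampoline.S previously hardcoded.  */
```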

commit 14c5cbabc2d11004ab223ae5eae761ddf83ef99e
Author: Igor Zamyatin <igor.zamyatin@intel.com>
Date:   Thu Jul 9 06:50:12 2015 -0700

    Preserve bound registers for pointer pass/return
    
    We need to save/restore bound registers and add a BND prefix before
    branches in _dl_runtime_profile so that bound registers for pointer
    pass and return are preserved when LD_AUDIT is used.

commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 25 04:33:54 2015 -0700

    Save and restore vector registers in x86-64 ld.so
    
    This patch adds SSE, AVX and AVX512 versions of _dl_runtime_resolve
    and _dl_runtime_profile, which save and restore the first 8 vector
    registers used for parameter passing.  elf_machine_runtime_setup
    selects the proper _dl_runtime_resolve or _dl_runtime_profile based
    on _dl_x86_cpu_features.  It avoids race condition caused by
    FOREIGN_CALL macros, which are only used for x86-64.
    
    Performance impact of saving and restoring 8 vector registers are
    negligible on Nehalem, Sandy Bridge, Ivy Bridge and Haswell when
    ld.so is optimized with SSE2.

commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]
    
    There is transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.
    
    To avoid SSE transition penalty, if only the lower 128 bits of the first
    8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
    with the zero upper bits.
    
    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend upper bits of vector registers.
    
    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
    _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.
    
    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.
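
The XINUSE test described above can be sketched in C. This is a hand-written illustration, not glibc code: the helper takes the bitmap that XGETBV with ECX == 1 would return, while the real _dl_runtime_resolve_*_opt variants perform the same bit test in assembly.

```c
#include <stdint.h>

/* XGETBV with ECX == 1 (available when CPUID.(EAX=0DH,ECX=1):EAX[2]
   is set) returns the XINUSE bitmap.  Bit 2 tracks the upper halves
   of the YMM registers: when it is clear, their upper 128 bits are
   zero, so restoring only %xmm0 - %xmm7 with 128-bit VEX-encoded
   loads (which zero-extend the destination) is sufficient.  */
static int
ymm_upper_bits_zero (uint64_t xinuse)
{
  return (xinuse & (UINT64_C (1) << 2)) == 0;
}
```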

commit 3403a17fea8ccef7dc5f99553a13231acf838744
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Feb 9 12:19:44 2017 -0800

    x86-64: Verify that _dl_runtime_resolve preserves vector registers
    
    On x86-64, _dl_runtime_resolve must preserve the first 8 vector
    registers.  Add 3 _dl_runtime_resolve tests to verify that SSE,
    AVX and AVX512 registers are preserved.

commit c15f8eb50cea7ad1a4ccece6e0982bf426d52c00
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 21 10:59:31 2017 -0700

    x86-64: Improve branch predication in _dl_runtime_resolve_avx512_opt [BZ #21258]
    
    On Skylake server, _dl_runtime_resolve_avx512_opt is used to preserve
    the first 8 vector registers.  The code layout is
    
      if only %xmm0 - %xmm7 registers are used
         preserve %xmm0 - %xmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %zmm0 - %zmm7 registers
    
    Branch predication always executes the fallthrough code path to preserve
    %zmm0 - %zmm7 registers speculatively, even though only %xmm0 - %xmm7
    registers are used.  This leads to lower CPU frequency on Skylake
    server.  This patch changes the fallthrough code path to preserve
    %xmm0 - %xmm7 registers instead:
    
      if whole %zmm0 - %zmm7 registers are used
        preserve %zmm0 - %zmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %xmm0 - %xmm7 registers
    
    Tested on Skylake server.
    
            [BZ #21258]
            * sysdeps/x86_64/dl-trampoline.S (_dl_runtime_resolve_opt):
            Define only if _dl_runtime_resolve is defined to
            _dl_runtime_resolve_sse_vex.
            * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt):
            Fallthrough to _dl_runtime_resolve_sse_vex.
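
Taken together, the commits above make elf_machine_runtime_setup pick a trampoline from CPU features. A condensed model of that decision tree follows; the enum and function names are illustrative, the real code (in the dl-machine.h hunk of this patch) writes the chosen routine's address into GOT[2], with Use_dl_runtime_resolve_opt corresponding to the xgetbv_ecx1 flag and Use_dl_runtime_resolve_slow to avoid_sse_penalty.

```c
/* Condensed sketch of the resolver selection logic; illustrative
   names, not the glibc API.  */
enum resolver
{
  RESOLVE_SSE,
  RESOLVE_AVX,
  RESOLVE_AVX_SLOW,
  RESOLVE_AVX_OPT,
  RESOLVE_AVX512,
  RESOLVE_AVX512_OPT
};

static enum resolver
select_resolver (int avx512f_usable, int avx_usable,
                 int xgetbv_ecx1, int avoid_sse_penalty)
{
  if (avx512f_usable)
    return xgetbv_ecx1 ? RESOLVE_AVX512_OPT : RESOLVE_AVX512;
  if (avx_usable)
    {
      if (xgetbv_ecx1)
        return RESOLVE_AVX_OPT;
      if (avoid_sse_penalty)
        return RESOLVE_AVX_SLOW;
      return RESOLVE_AVX;
    }
  return RESOLVE_SSE;
}
```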

Index: glibc-2.17-c758a686/nptl/sysdeps/x86_64/tcb-offsets.sym
===================================================================
--- glibc-2.17-c758a686.orig/nptl/sysdeps/x86_64/tcb-offsets.sym
+++ glibc-2.17-c758a686/nptl/sysdeps/x86_64/tcb-offsets.sym
@@ -15,7 +15,6 @@ VGETCPU_CACHE_OFFSET	offsetof (tcbhead_t
 #ifndef __ASSUME_PRIVATE_FUTEX
 PRIVATE_FUTEX		offsetof (tcbhead_t, private_futex)
 #endif
-RTLD_SAVESPACE_SSE	offsetof (tcbhead_t, rtld_savespace_sse)
 
 -- Not strictly offsets, but these values are also used in the TCB.
 TCB_CANCELSTATE_BITMASK	 CANCELSTATE_BITMASK
Index: glibc-2.17-c758a686/nptl/sysdeps/x86_64/tls.h
===================================================================
--- glibc-2.17-c758a686.orig/nptl/sysdeps/x86_64/tls.h
+++ glibc-2.17-c758a686/nptl/sysdeps/x86_64/tls.h
@@ -67,12 +67,13 @@ typedef struct
 # else
   int __unused1;
 # endif
-  int rtld_must_xmm_save;
+  int __glibc_unused1;
   /* Reservation of some values for the TM ABI.  */
   void *__private_tm[5];
   long int __unused2;
-  /* Have space for the post-AVX register size.  */
-  __128bits rtld_savespace_sse[8][4] __attribute__ ((aligned (32)));
+  /* Must be kept even if it is no longer used by glibc since programs,
+     like AddressSanitizer, depend on the size of tcbhead_t.  */
+  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));
 
   void *__padding[8];
 } tcbhead_t;
@@ -380,41 +381,6 @@ typedef struct
 # define THREAD_GSCOPE_WAIT() \
   GL(dl_wait_lookup_done) ()
 
-
-# ifdef SHARED
-/* Defined in dl-trampoline.S.  */
-extern void _dl_x86_64_save_sse (void);
-extern void _dl_x86_64_restore_sse (void);
-
-# define RTLD_CHECK_FOREIGN_CALL \
-  (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) != 0)
-
-/* NB: Don't use the xchg operation because that would imply a lock
-   prefix which is expensive and unnecessary.  The cache line is also
-   not contested at all.  */
-#  define RTLD_ENABLE_FOREIGN_CALL \
-  int old_rtld_must_xmm_save = THREAD_GETMEM (THREAD_SELF,		      \
-					      header.rtld_must_xmm_save);     \
-  THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 1)
-
-#  define RTLD_PREPARE_FOREIGN_CALL \
-  do if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save))	      \
-    {									      \
-      _dl_x86_64_save_sse ();						      \
-      THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 0);	      \
-    }									      \
-  while (0)
-
-#  define RTLD_FINALIZE_FOREIGN_CALL \
-  do {									      \
-    if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) == 0)	      \
-      _dl_x86_64_restore_sse ();					      \
-    THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save,		      \
-		   old_rtld_must_xmm_save);				      \
-  } while (0)
-# endif
-
-
 #endif /* __ASSEMBLER__ */
 
 #endif	/* tls.h */
Index: glibc-2.17-c758a686/sysdeps/i386/Makefile
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/Makefile
+++ glibc-2.17-c758a686/sysdeps/i386/Makefile
@@ -33,6 +33,7 @@ sysdep-CFLAGS += -mpreferred-stack-bound
 else
 ifeq ($(subdir),csu)
 sysdep-CFLAGS += -mpreferred-stack-boundary=4
+gen-as-const-headers += link-defines.sym
 else
 # Likewise, any function which calls user callbacks
 uses-callbacks += -mpreferred-stack-boundary=4
Index: glibc-2.17-c758a686/sysdeps/i386/configure
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/configure
+++ glibc-2.17-c758a686/sysdeps/i386/configure
@@ -179,5 +179,32 @@ fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_cc_novzeroupper" >&5
 $as_echo "$libc_cv_cc_novzeroupper" >&6; }
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for Intel MPX support" >&5
+$as_echo_n "checking for Intel MPX support... " >&6; }
+if ${libc_cv_asm_mpx+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if { ac_try='${CC-cc} -c $ASFLAGS conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_mpx" >&5
+$as_echo "$libc_cv_asm_mpx" >&6; }
+if test $libc_cv_asm_mpx == yes; then
+  $as_echo "#define HAVE_MPX_SUPPORT 1" >>confdefs.h
+
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h
 
Index: glibc-2.17-c758a686/sysdeps/i386/configure.in
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/configure.in
+++ glibc-2.17-c758a686/sysdeps/i386/configure.in
@@ -53,6 +53,21 @@ LIBC_TRY_CC_OPTION([-mno-vzeroupper],
 		   [libc_cv_cc_novzeroupper=no])
 ])
 
+dnl Check whether asm supports Intel MPX
+AC_CACHE_CHECK(for Intel MPX support, libc_cv_asm_mpx, [dnl
+cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c $ASFLAGS conftest.s 1>&AS_MESSAGE_LOG_FD); then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*])
+if test $libc_cv_asm_mpx == yes; then
+  AC_DEFINE(HAVE_MPX_SUPPORT)
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
Index: glibc-2.17-c758a686/sysdeps/i386/dl-trampoline.S
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/dl-trampoline.S
+++ glibc-2.17-c758a686/sysdeps/i386/dl-trampoline.S
@@ -17,6 +17,13 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <sysdep.h>
+#include <link-defines.h>
+
+#ifdef HAVE_MPX_SUPPORT
+# define PRESERVE_BND_REGS_PREFIX bnd
+#else
+# define PRESERVE_BND_REGS_PREFIX .byte 0xf2
+#endif
 
 	.text
 	.globl _dl_runtime_resolve
@@ -161,24 +168,47 @@ _dl_runtime_profile:
 	    +4      free
 	   %esp     free
 	*/
-	subl $20, %esp
-	cfi_adjust_cfa_offset (20)
-	movl %eax, (%esp)
-	movl %edx, 4(%esp)
-	fstpt 8(%esp)
-	fstpt 20(%esp)
+#if LONG_DOUBLE_SIZE != 12
+# error "long double size must be 12 bytes"
+#endif
+	# Allocate space for La_i86_retval and subtract 12 free bytes.
+	subl $(LRV_SIZE - 12), %esp
+	cfi_adjust_cfa_offset (LRV_SIZE - 12)
+	movl %eax, LRV_EAX_OFFSET(%esp)
+	movl %edx, LRV_EDX_OFFSET(%esp)
+	fstpt LRV_ST0_OFFSET(%esp)
+	fstpt LRV_ST1_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov %bnd0, LRV_BND0_OFFSET(%esp)
+	bndmov %bnd1, LRV_BND1_OFFSET(%esp)
+#else
+	.byte 0x66,0x0f,0x1b,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1b,0x4c,0x24,LRV_BND1_OFFSET
+#endif
 	pushl %esp
 	cfi_adjust_cfa_offset (4)
-	leal 36(%esp), %ecx
-	movl 56(%esp), %eax
-	movl 60(%esp), %edx
+	# Address of La_i86_regs area.
+	leal (LRV_SIZE + 4)(%esp), %ecx
+	# PLT2
+	movl (LRV_SIZE + 4 + LR_SIZE)(%esp), %eax
+	# PLT1
+	movl (LRV_SIZE + 4 + LR_SIZE + 4)(%esp), %edx
 	call _dl_call_pltexit
-	movl (%esp), %eax
-	movl 4(%esp), %edx
-	fldt 20(%esp)
-	fldt 8(%esp)
-	addl $60, %esp
-	cfi_adjust_cfa_offset (-60)
+	movl LRV_EAX_OFFSET(%esp), %eax
+	movl LRV_EDX_OFFSET(%esp), %edx
+	fldt LRV_ST1_OFFSET(%esp)
+	fldt LRV_ST0_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov LRV_BND0_OFFSET(%esp), %bnd0
+	bndmov LRV_BND1_OFFSET(%esp), %bnd1
+#else
+	.byte 0x66,0x0f,0x1a,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1a,0x4c,0x24,LRV_BND1_OFFSET
+#endif
+	# Restore stack before return.
+	addl $(LRV_SIZE + 4 + LR_SIZE + 4), %esp
+	cfi_adjust_cfa_offset (-(LRV_SIZE + 4 + LR_SIZE + 4))
+	PRESERVE_BND_REGS_PREFIX
 	ret
 	cfi_endproc
 	.size _dl_runtime_profile, .-_dl_runtime_profile
Index: glibc-2.17-c758a686/sysdeps/i386/link-defines.sym
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/i386/link-defines.sym
@@ -0,0 +1,20 @@
+#include "link.h"
+#include <stddef.h>
+
+--
+LONG_DOUBLE_SIZE	sizeof (long double)
+
+LR_SIZE			sizeof (struct La_i86_regs)
+LR_EDX_OFFSET		offsetof (struct La_i86_regs, lr_edx)
+LR_ECX_OFFSET		offsetof (struct La_i86_regs, lr_ecx)
+LR_EAX_OFFSET		offsetof (struct La_i86_regs, lr_eax)
+LR_EBP_OFFSET		offsetof (struct La_i86_regs, lr_ebp)
+LR_ESP_OFFSET		offsetof (struct La_i86_regs, lr_esp)
+
+LRV_SIZE		sizeof (struct La_i86_retval)
+LRV_EAX_OFFSET		offsetof (struct La_i86_retval, lrv_eax)
+LRV_EDX_OFFSET		offsetof (struct La_i86_retval, lrv_edx)
+LRV_ST0_OFFSET		offsetof (struct La_i86_retval, lrv_st0)
+LRV_ST1_OFFSET		offsetof (struct La_i86_retval, lrv_st1)
+LRV_BND0_OFFSET		offsetof (struct La_i86_retval, lrv_bnd0)
+LRV_BND1_OFFSET		offsetof (struct La_i86_retval, lrv_bnd1)
Index: glibc-2.17-c758a686/sysdeps/x86/bits/link.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/bits/link.h
+++ glibc-2.17-c758a686/sysdeps/x86/bits/link.h
@@ -38,6 +38,8 @@ typedef struct La_i86_retval
   uint32_t lrv_edx;
   long double lrv_st0;
   long double lrv_st1;
+  uint64_t lrv_bnd0;
+  uint64_t lrv_bnd1;
 } La_i86_retval;
 
 
Index: glibc-2.17-c758a686/sysdeps/x86/cpu-features.c
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/cpu-features.c
+++ glibc-2.17-c758a686/sysdeps/x86/cpu-features.c
@@ -130,6 +130,20 @@ init_cpu_features (struct cpu_features *
 	      break;
 	    }
 	}
+
+      /* To avoid SSE transition penalty, use _dl_runtime_resolve_slow.
+         If XGETBV supports ECX == 1, use _dl_runtime_resolve_opt.  */
+      cpu_features->feature[index_Use_dl_runtime_resolve_slow]
+	|= bit_Use_dl_runtime_resolve_slow;
+      if (cpu_features->max_cpuid >= 0xd)
+	{
+	  unsigned int eax;
+
+	  __cpuid_count (0xd, 1, eax, ebx, ecx, edx);
+	  if ((eax & (1 << 2)) != 0)
+	    cpu_features->feature[index_Use_dl_runtime_resolve_opt]
+	      |= bit_Use_dl_runtime_resolve_opt;
+	}
     }
   /* This spells out "AuthenticAMD".  */
   else if (ebx == 0x68747541 && ecx == 0x444d4163 && edx == 0x69746e65)
Index: glibc-2.17-c758a686/sysdeps/x86/cpu-features.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/cpu-features.h
+++ glibc-2.17-c758a686/sysdeps/x86/cpu-features.h
@@ -34,6 +34,9 @@
 #define bit_AVX512DQ_Usable		(1 << 13)
 #define bit_Prefer_MAP_32BIT_EXEC	(1 << 16)
 #define bit_Prefer_No_VZEROUPPER	(1 << 17)
+#define bit_Use_dl_runtime_resolve_opt	(1 << 20)
+#define bit_Use_dl_runtime_resolve_slow	(1 << 21)
+
 
 /* CPUID Feature flags.  */
 
@@ -95,6 +98,9 @@
 # define index_AVX512DQ_Usable		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_Prefer_MAP_32BIT_EXEC	FEATURE_INDEX_1*FEATURE_SIZE
 # define index_Prefer_No_VZEROUPPER	FEATURE_INDEX_1*FEATURE_SIZE
+# define index_Use_dl_runtime_resolve_opt FEATURE_INDEX_1*FEATURE_SIZE
+# define index_Use_dl_runtime_resolve_slow FEATURE_INDEX_1*FEATURE_SIZE
+
 
 # if defined (_LIBC) && !IS_IN (nonlib)
 #  ifdef __x86_64__
@@ -273,6 +279,8 @@ extern const struct cpu_features *__get_
 # define index_AVX512DQ_Usable		FEATURE_INDEX_1
 # define index_Prefer_MAP_32BIT_EXEC	FEATURE_INDEX_1
 # define index_Prefer_No_VZEROUPPER     FEATURE_INDEX_1
+# define index_Use_dl_runtime_resolve_opt FEATURE_INDEX_1
+# define index_Use_dl_runtime_resolve_slow FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
Index: glibc-2.17-c758a686/sysdeps/x86_64/Makefile
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/Makefile
+++ glibc-2.17-c758a686/sysdeps/x86_64/Makefile
@@ -21,6 +21,11 @@ endif
 ifeq ($(subdir),elf)
 sysdep-dl-routines += tlsdesc dl-tlsdesc
 
+tests += ifuncmain8
+modules-names += ifuncmod8
+
+$(objpfx)ifuncmain8: $(objpfx)ifuncmod8.so
+
 tests += tst-quad1 tst-quad2
 modules-names += tst-quadmod1 tst-quadmod2
 
@@ -34,18 +39,32 @@ tests-pie += $(quad-pie-test)
 $(objpfx)tst-quad1pie: $(objpfx)tst-quadmod1pie.o
 $(objpfx)tst-quad2pie: $(objpfx)tst-quadmod2pie.o
 
+tests += tst-sse tst-avx tst-avx512
+test-extras += tst-avx-aux tst-avx512-aux
+extra-test-objs += tst-avx-aux.o tst-avx512-aux.o
+
 tests += tst-audit10
-modules-names += tst-auditmod10a tst-auditmod10b
+modules-names += tst-auditmod10a tst-auditmod10b \
+		 tst-ssemod tst-avxmod tst-avx512mod
 
 $(objpfx)tst-audit10: $(objpfx)tst-auditmod10a.so
 $(objpfx)tst-audit10.out: $(objpfx)tst-auditmod10b.so
 tst-audit10-ENV = LD_AUDIT=$(objpfx)tst-auditmod10b.so
 
+$(objpfx)tst-sse: $(objpfx)tst-ssemod.so
+$(objpfx)tst-avx: $(objpfx)tst-avx-aux.o $(objpfx)tst-avxmod.so
+$(objpfx)tst-avx512: $(objpfx)tst-avx512-aux.o $(objpfx)tst-avx512mod.so
+
+CFLAGS-tst-avx-aux.c += $(AVX-CFLAGS)
+CFLAGS-tst-avxmod.c += $(AVX-CFLAGS)
+
 ifeq (yes,$(config-cflags-avx512))
 AVX512-CFLAGS = -mavx512f
 CFLAGS-tst-audit10.c += $(AVX512-CFLAGS)
 CFLAGS-tst-auditmod10a.c += $(AVX512-CFLAGS)
 CFLAGS-tst-auditmod10b.c += $(AVX512-CFLAGS)
+CFLAGS-tst-avx512-aux.c += $(AVX512-CFLAGS)
+CFLAGS-tst-avx512mod.c += $(AVX512-CFLAGS)
 endif
 endif
 
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-machine.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-machine.h
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-machine.h
@@ -66,8 +66,15 @@ static inline int __attribute__ ((unused
 elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
 {
   Elf64_Addr *got;
-  extern void _dl_runtime_resolve (ElfW(Word)) attribute_hidden;
-  extern void _dl_runtime_profile (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_sse (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx_slow (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx_opt (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx512 (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx512_opt (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_sse (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_avx (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_avx512 (ElfW(Word)) attribute_hidden;
 
   if (l->l_info[DT_JMPREL] && lazy)
     {
@@ -95,7 +102,12 @@ elf_machine_runtime_setup (struct link_m
 	 end in this function.  */
       if (__builtin_expect (profile, 0))
 	{
-	  *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile;
+	  if (HAS_ARCH_FEATURE (AVX512F_Usable))
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx512;
+	  else if (HAS_ARCH_FEATURE (AVX_Usable))
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx;
+	  else
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_sse;
 
 	  if (GLRO(dl_profile) != NULL
 	      && _dl_name_match_p (GLRO(dl_profile), l))
@@ -104,9 +116,34 @@ elf_machine_runtime_setup (struct link_m
 	    GL(dl_profile_map) = l;
 	}
       else
-	/* This function will get called to fix up the GOT entry indicated by
-	   the offset on the stack, and then jump to the resolved address.  */
-	*(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_resolve;
+	{
+	  /* This function will get called to fix up the GOT entry
+	     indicated by the offset on the stack, and then jump to
+	     the resolved address.  */
+	  if (HAS_ARCH_FEATURE (AVX512F_Usable))
+	    {
+	      if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_opt))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx512_opt;
+	      else
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx512;
+	    }
+	  else if (HAS_ARCH_FEATURE (AVX_Usable))
+	    {
+	      if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_opt))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx_opt;
+	      else if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_slow))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx_slow;
+	      else
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx;
+	    }
+	  else
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_resolve_sse;
+	}
     }
 
   if (l->l_info[ADDRIDX (DT_TLSDESC_GOT)] && lazy)
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.S
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-trampoline.S
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.S
@@ -18,28 +18,52 @@
 
 #include <config.h>
 #include <sysdep.h>
+#include <cpu-features.h>
 #include <link-defines.h>
 
-#if (RTLD_SAVESPACE_SSE % 32) != 0
-# error RTLD_SAVESPACE_SSE must be aligned to 32 bytes
+#ifndef DL_STACK_ALIGNMENT
+/* Due to GCC bug:
+
+   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
+
+   __tls_get_addr may be called with 8-byte stack alignment.  Although
+   this bug has been fixed in GCC 4.9.4, 5.3 and 6, we can't assume
+   that stack will be always aligned at 16 bytes.  We use unaligned
+   16-byte move to load and store SSE registers, which has no penalty
+   on modern processors if stack is 16-byte aligned.  */
+# define DL_STACK_ALIGNMENT 8
 #endif
 
+#ifndef DL_RUNIME_UNALIGNED_VEC_SIZE
+/* The maximum size of unaligned vector load and store.  */
+# define DL_RUNIME_UNALIGNED_VEC_SIZE 16
+#endif
+
+/* True if _dl_runtime_resolve should align stack to VEC_SIZE bytes.  */
+#define DL_RUNIME_RESOLVE_REALIGN_STACK \
+  (VEC_SIZE > DL_STACK_ALIGNMENT \
+   && VEC_SIZE > DL_RUNIME_UNALIGNED_VEC_SIZE)
+
+/* Align vector register save area to 16 bytes.  */
+#define REGISTER_SAVE_VEC_OFF	0
+
 /* Area on stack to save and restore registers used for parameter
    passing when calling _dl_fixup.  */
 #ifdef __ILP32__
-/* X32 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX.  */
-# define REGISTER_SAVE_AREA	(8 * 7)
-# define REGISTER_SAVE_RAX	0
+# define REGISTER_SAVE_RAX	(REGISTER_SAVE_VEC_OFF + VEC_SIZE * 8)
+# define PRESERVE_BND_REGS_PREFIX
 #else
-/* X86-64 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as BND0,
-   BND1, BND2, BND3.  */
-# define REGISTER_SAVE_AREA	(8 * 7 + 16 * 4)
 /* Align bound register save area to 16 bytes.  */
-# define REGISTER_SAVE_BND0	0
+# define REGISTER_SAVE_BND0	(REGISTER_SAVE_VEC_OFF + VEC_SIZE * 8)
 # define REGISTER_SAVE_BND1	(REGISTER_SAVE_BND0 + 16)
 # define REGISTER_SAVE_BND2	(REGISTER_SAVE_BND1 + 16)
 # define REGISTER_SAVE_BND3	(REGISTER_SAVE_BND2 + 16)
 # define REGISTER_SAVE_RAX	(REGISTER_SAVE_BND3 + 16)
+# ifdef HAVE_MPX_SUPPORT
+#  define PRESERVE_BND_REGS_PREFIX bnd
+# else
+#  define PRESERVE_BND_REGS_PREFIX .byte 0xf2
+# endif
 #endif
 #define REGISTER_SAVE_RCX	(REGISTER_SAVE_RAX + 8)
 #define REGISTER_SAVE_RDX	(REGISTER_SAVE_RCX + 8)
@@ -48,376 +72,71 @@
 #define REGISTER_SAVE_R8	(REGISTER_SAVE_RDI + 8)
 #define REGISTER_SAVE_R9	(REGISTER_SAVE_R8 + 8)
 
-	.text
-	.globl _dl_runtime_resolve
-	.type _dl_runtime_resolve, @function
-	.align 16
-	cfi_startproc
-_dl_runtime_resolve:
-	cfi_adjust_cfa_offset(16) # Incorporate PLT
-	subq $REGISTER_SAVE_AREA,%rsp
-	cfi_adjust_cfa_offset(REGISTER_SAVE_AREA)
-	# Preserve registers otherwise clobbered.
-	movq %rax, REGISTER_SAVE_RAX(%rsp)
-	movq %rcx, REGISTER_SAVE_RCX(%rsp)
-	movq %rdx, REGISTER_SAVE_RDX(%rsp)
-	movq %rsi, REGISTER_SAVE_RSI(%rsp)
-	movq %rdi, REGISTER_SAVE_RDI(%rsp)
-	movq %r8, REGISTER_SAVE_R8(%rsp)
-	movq %r9, REGISTER_SAVE_R9(%rsp)
-#ifndef __ILP32__
-	# We also have to preserve bound registers.  These are nops if
-	# Intel MPX isn't available or disabled.
-# ifdef HAVE_MPX_SUPPORT
-	bndmov %bnd0, REGISTER_SAVE_BND0(%rsp)
-	bndmov %bnd1, REGISTER_SAVE_BND1(%rsp)
-	bndmov %bnd2, REGISTER_SAVE_BND2(%rsp)
-	bndmov %bnd3, REGISTER_SAVE_BND3(%rsp)
-# else
-	.byte 0x66,0x0f,0x1b,0x44,0x24,REGISTER_SAVE_BND0
-	.byte 0x66,0x0f,0x1b,0x4c,0x24,REGISTER_SAVE_BND1
-	.byte 0x66,0x0f,0x1b,0x54,0x24,REGISTER_SAVE_BND2
-	.byte 0x66,0x0f,0x1b,0x5c,0x24,REGISTER_SAVE_BND3
-# endif
-#endif
-	# Copy args pushed by PLT in register.
-	# %rdi: link_map, %rsi: reloc_index
-	movq (REGISTER_SAVE_AREA + 8)(%rsp), %rsi
-	movq REGISTER_SAVE_AREA(%rsp), %rdi
-	call _dl_fixup		# Call resolver.
-	movq %rax, %r11		# Save return value
-#ifndef __ILP32__
-	# Restore bound registers.  These are nops if Intel MPX isn't
-	# avaiable or disabled.
-# ifdef HAVE_MPX_SUPPORT
-	bndmov REGISTER_SAVE_BND3(%rsp), %bnd3
-	bndmov REGISTER_SAVE_BND2(%rsp), %bnd2
-	bndmov REGISTER_SAVE_BND1(%rsp), %bnd1
-	bndmov REGISTER_SAVE_BND0(%rsp), %bnd0
-# else
-	.byte 0x66,0x0f,0x1a,0x5c,0x24,REGISTER_SAVE_BND3
-	.byte 0x66,0x0f,0x1a,0x54,0x24,REGISTER_SAVE_BND2
-	.byte 0x66,0x0f,0x1a,0x4c,0x24,REGISTER_SAVE_BND1
-	.byte 0x66,0x0f,0x1a,0x44,0x24,REGISTER_SAVE_BND0
-# endif
+#define VEC_SIZE		64
+#define VMOVA			vmovdqa64
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			vmovdqa64
+#else
+# define VMOV			vmovdqu64
 #endif
-	# Get register content back.
-	movq REGISTER_SAVE_R9(%rsp), %r9
-	movq REGISTER_SAVE_R8(%rsp), %r8
-	movq REGISTER_SAVE_RDI(%rsp), %rdi
-	movq REGISTER_SAVE_RSI(%rsp), %rsi
-	movq REGISTER_SAVE_RDX(%rsp), %rdx
-	movq REGISTER_SAVE_RCX(%rsp), %rcx
-	movq REGISTER_SAVE_RAX(%rsp), %rax
-	# Adjust stack(PLT did 2 pushes)
-	addq $(REGISTER_SAVE_AREA + 16), %rsp
-	cfi_adjust_cfa_offset(-(REGISTER_SAVE_AREA + 16))
-	jmp *%r11		# Jump to function address.
-	cfi_endproc
-	.size _dl_runtime_resolve, .-_dl_runtime_resolve
-
-
-#ifndef PROF
-	.globl _dl_runtime_profile
-	.type _dl_runtime_profile, @function
-	.align 16
-	cfi_startproc
-
-_dl_runtime_profile:
-	cfi_adjust_cfa_offset(16) # Incorporate PLT
-	/* The La_x86_64_regs data structure pointed to by the
-	   fourth paramater must be 16-byte aligned.  This must
-	   be explicitly enforced.  We have the set up a dynamically
-	   sized stack frame.  %rbx points to the top half which
-	   has a fixed size and preserves the original stack pointer.  */
-
-	subq $32, %rsp		# Allocate the local storage.
-	cfi_adjust_cfa_offset(32)
-	movq %rbx, (%rsp)
-	cfi_rel_offset(%rbx, 0)
-
-	/* On the stack:
-		56(%rbx)	parameter #1
-		48(%rbx)	return address
-
-		40(%rbx)	reloc index
-		32(%rbx)	link_map
-
-		24(%rbx)	La_x86_64_regs pointer
-		16(%rbx)	framesize
-		 8(%rbx)	rax
-		  (%rbx)	rbx
-	*/
-
-	movq %rax, 8(%rsp)
-	movq %rsp, %rbx
-	cfi_def_cfa_register(%rbx)
-
-	/* Actively align the La_x86_64_regs structure.  */
-	andq $0xfffffffffffffff0, %rsp
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
-	/* sizeof(La_x86_64_regs).  Need extra space for 8 SSE registers
-	   to detect if any xmm0-xmm7 registers are changed by audit
-	   module.  */
-	subq $(LR_SIZE + XMM_SIZE*8), %rsp
-# else
-	subq $LR_SIZE, %rsp		# sizeof(La_x86_64_regs)
-# endif
-	movq %rsp, 24(%rbx)
-
-	/* Fill the La_x86_64_regs structure.  */
-	movq %rdx, LR_RDX_OFFSET(%rsp)
-	movq %r8,  LR_R8_OFFSET(%rsp)
-	movq %r9,  LR_R9_OFFSET(%rsp)
-	movq %rcx, LR_RCX_OFFSET(%rsp)
-	movq %rsi, LR_RSI_OFFSET(%rsp)
ce426f
-	movq %rdi, LR_RDI_OFFSET(%rsp)
ce426f
-	movq %rbp, LR_RBP_OFFSET(%rsp)
ce426f
-
ce426f
-	leaq 48(%rbx), %rax
ce426f
-	movq %rax, LR_RSP_OFFSET(%rsp)
ce426f
-
ce426f
-	/* We always store the XMM registers even if AVX is available.
ce426f
-	   This is to provide backward binary compatility for existing
ce426f
-	   audit modules.  */
ce426f
-	movaps %xmm0,		   (LR_XMM_OFFSET)(%rsp)
ce426f
-	movaps %xmm1, (LR_XMM_OFFSET +   XMM_SIZE)(%rsp)
ce426f
-	movaps %xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)
ce426f
-	movaps %xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)
ce426f
-	movaps %xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)
ce426f
-	movaps %xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)
ce426f
-	movaps %xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)
ce426f
-	movaps %xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)
ce426f
-
ce426f
-# ifndef __ILP32__
ce426f
-#  ifdef HAVE_MPX_SUPPORT
ce426f
-	bndmov %bnd0, 		   (LR_BND_OFFSET)(%rsp)  # Preserve bound
ce426f
-	bndmov %bnd1, (LR_BND_OFFSET +   BND_SIZE)(%rsp)  # registers. Nops if
ce426f
-	bndmov %bnd2, (LR_BND_OFFSET + BND_SIZE*2)(%rsp)  # MPX not available
ce426f
-	bndmov %bnd3, (LR_BND_OFFSET + BND_SIZE*3)(%rsp)  # or disabled.
ce426f
-#  else
ce426f
-	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET)
ce426f
-	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
ce426f
-	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
ce426f
-	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
ce426f
-#  endif
ce426f
-# endif
ce426f
-
ce426f
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
ce426f
-	.data
ce426f
-L(have_avx):
ce426f
-	.zero 4
ce426f
-	.size L(have_avx), 4
ce426f
-	.previous
ce426f
-
ce426f
-	cmpl	$0, L(have_avx)(%rip)
ce426f
-	jne	L(defined)
ce426f
-	movq	%rbx, %r11		# Save rbx
ce426f
-	movl	$1, %eax
ce426f
-	cpuid
ce426f
-	movq	%r11,%rbx		# Restore rbx
ce426f
-	xorl	%eax, %eax
ce426f
-	// AVX and XSAVE supported?
ce426f
-	andl	$((1 << 28) | (1 << 27)), %ecx
ce426f
-	cmpl	$((1 << 28) | (1 << 27)), %ecx
ce426f
-	jne	10f
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	// AVX512 supported in processor?
ce426f
-	movq	%rbx, %r11		# Save rbx
ce426f
-	xorl	%ecx, %ecx
ce426f
-	mov	$0x7, %eax
ce426f
-	cpuid
ce426f
-	andl	$(1 << 16), %ebx
ce426f
-#  endif
ce426f
-	xorl	%ecx, %ecx
ce426f
-	// Get XFEATURE_ENABLED_MASK
ce426f
-	xgetbv
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	test	%ebx, %ebx
ce426f
-	movq	%r11, %rbx		# Restore rbx
ce426f
-	je	20f
ce426f
-	// Verify that XCR0[7:5] = '111b' and
ce426f
-	// XCR0[2:1] = '11b' which means
ce426f
-	// that zmm state is enabled
ce426f
-	andl	$0xe6, %eax
ce426f
-	cmpl	$0xe6, %eax
ce426f
-	jne	20f
ce426f
-	movl	%eax, L(have_avx)(%rip)
ce426f
-L(avx512):
ce426f
-#   define RESTORE_AVX
ce426f
-#   define VMOV    vmovdqu64
ce426f
-#   define VEC(i)  zmm##i
ce426f
-#   define MORE_CODE
ce426f
-#   include "dl-trampoline.h"
ce426f
-#   undef VMOV
ce426f
-#   undef VEC
ce426f
-#   undef RESTORE_AVX
ce426f
-#  endif
ce426f
-20:	andl	$0x6, %eax
ce426f
-10:	subl	$0x5, %eax
ce426f
-	movl	%eax, L(have_avx)(%rip)
ce426f
-	cmpl	$0, %eax
ce426f
-
ce426f
-L(defined):
ce426f
-	js	L(no_avx)
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	cmpl	$0xe6, L(have_avx)(%rip)
ce426f
-	je	L(avx512)
ce426f
-#  endif
ce426f
-
ce426f
-#  define RESTORE_AVX
ce426f
-#  define VMOV    vmovdqu
ce426f
-#  define VEC(i)  ymm##i
ce426f
-#  define MORE_CODE
ce426f
-#  include "dl-trampoline.h"
ce426f
-
ce426f
-	.align 16
ce426f
-L(no_avx):
ce426f
-# endif
ce426f
-
ce426f
-# undef RESTORE_AVX
ce426f
-# include "dl-trampoline.h"
ce426f
-
ce426f
-	cfi_endproc
ce426f
-	.size _dl_runtime_profile, .-_dl_runtime_profile
ce426f
+#define VEC(i)			zmm##i
ce426f
+#define _dl_runtime_resolve	_dl_runtime_resolve_avx512
ce426f
+#define _dl_runtime_profile	_dl_runtime_profile_avx512
ce426f
+#define RESTORE_AVX
ce426f
+#include "dl-trampoline.h"
ce426f
+#undef _dl_runtime_resolve
ce426f
+#undef _dl_runtime_profile
ce426f
+#undef VEC
ce426f
+#undef VMOV
ce426f
+#undef VMOVA
ce426f
+#undef VEC_SIZE
ce426f
+
ce426f
+#define VEC_SIZE		32
ce426f
+#define VMOVA			vmovdqa
ce426f
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
ce426f
+# define VMOV			vmovdqa
ce426f
+#else
ce426f
+# define VMOV			vmovdqu
ce426f
 #endif
ce426f
-
ce426f
-
ce426f
-#ifdef SHARED
ce426f
-	.globl _dl_x86_64_save_sse
ce426f
-	.type _dl_x86_64_save_sse, @function
ce426f
-	.align 16
ce426f
-	cfi_startproc
ce426f
-_dl_x86_64_save_sse:
ce426f
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
ce426f
-	cmpl	$0, L(have_avx)(%rip)
ce426f
-	jne	L(defined_5)
ce426f
-	movq	%rbx, %r11		# Save rbx
ce426f
-	movl	$1, %eax
ce426f
-	cpuid
ce426f
-	movq	%r11,%rbx		# Restore rbx
ce426f
-	xorl	%eax, %eax
ce426f
-	// AVX and XSAVE supported?
ce426f
-	andl	$((1 << 28) | (1 << 27)), %ecx
ce426f
-	cmpl	$((1 << 28) | (1 << 27)), %ecx
ce426f
-	jne	1f
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	// AVX512 supported in a processor?
ce426f
-	movq	%rbx, %r11              # Save rbx
ce426f
-	xorl	%ecx,%ecx
ce426f
-	mov	$0x7,%eax
ce426f
-	cpuid
ce426f
-	andl	$(1 << 16), %ebx
ce426f
-#  endif
ce426f
-	xorl	%ecx, %ecx
ce426f
-	// Get XFEATURE_ENABLED_MASK
ce426f
-	xgetbv
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	test	%ebx, %ebx
ce426f
-	movq	%r11, %rbx		# Restore rbx
ce426f
-	je	2f
ce426f
-	// Verify that XCR0[7:5] = '111b' and
ce426f
-	// XCR0[2:1] = '11b' which means
ce426f
-	// that zmm state is enabled
ce426f
-	andl	$0xe6, %eax
ce426f
-	movl	%eax, L(have_avx)(%rip)
ce426f
-	cmpl	$0xe6, %eax
ce426f
-	je	L(avx512_5)
ce426f
-#  endif
ce426f
-
ce426f
-2:	andl	$0x6, %eax
ce426f
-1:	subl	$0x5, %eax
ce426f
-	movl	%eax, L(have_avx)(%rip)
ce426f
-	cmpl	$0, %eax
ce426f
-
ce426f
-L(defined_5):
ce426f
-	js	L(no_avx5)
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	cmpl	$0xe6, L(have_avx)(%rip)
ce426f
-	je	L(avx512_5)
ce426f
-#  endif
ce426f
-
ce426f
-	vmovdqa %ymm0, %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE
ce426f
-	vmovdqa %ymm1, %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE
ce426f
-	vmovdqa %ymm2, %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE
ce426f
-	vmovdqa %ymm3, %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE
ce426f
-	vmovdqa %ymm4, %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE
ce426f
-	vmovdqa %ymm5, %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE
ce426f
-	vmovdqa %ymm6, %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE
ce426f
-	vmovdqa %ymm7, %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE
ce426f
-	ret
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-L(avx512_5):
ce426f
-	vmovdqu64 %zmm0, %fs:RTLD_SAVESPACE_SSE+0*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm1, %fs:RTLD_SAVESPACE_SSE+1*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm2, %fs:RTLD_SAVESPACE_SSE+2*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm3, %fs:RTLD_SAVESPACE_SSE+3*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm4, %fs:RTLD_SAVESPACE_SSE+4*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm5, %fs:RTLD_SAVESPACE_SSE+5*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm6, %fs:RTLD_SAVESPACE_SSE+6*ZMM_SIZE
ce426f
-	vmovdqu64 %zmm7, %fs:RTLD_SAVESPACE_SSE+7*ZMM_SIZE
ce426f
-	ret
ce426f
-#  endif
ce426f
-L(no_avx5):
ce426f
-# endif
ce426f
-	movdqa	%xmm0, %fs:RTLD_SAVESPACE_SSE+0*XMM_SIZE
ce426f
-	movdqa	%xmm1, %fs:RTLD_SAVESPACE_SSE+1*XMM_SIZE
ce426f
-	movdqa	%xmm2, %fs:RTLD_SAVESPACE_SSE+2*XMM_SIZE
ce426f
-	movdqa	%xmm3, %fs:RTLD_SAVESPACE_SSE+3*XMM_SIZE
ce426f
-	movdqa	%xmm4, %fs:RTLD_SAVESPACE_SSE+4*XMM_SIZE
ce426f
-	movdqa	%xmm5, %fs:RTLD_SAVESPACE_SSE+5*XMM_SIZE
ce426f
-	movdqa	%xmm6, %fs:RTLD_SAVESPACE_SSE+6*XMM_SIZE
ce426f
-	movdqa	%xmm7, %fs:RTLD_SAVESPACE_SSE+7*XMM_SIZE
ce426f
-	ret
ce426f
-	cfi_endproc
ce426f
-	.size _dl_x86_64_save_sse, .-_dl_x86_64_save_sse
ce426f
-
ce426f
-
ce426f
-	.globl _dl_x86_64_restore_sse
ce426f
-	.type _dl_x86_64_restore_sse, @function
ce426f
-	.align 16
ce426f
-	cfi_startproc
ce426f
-_dl_x86_64_restore_sse:
ce426f
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
ce426f
-	cmpl	$0, L(have_avx)(%rip)
ce426f
-	js	L(no_avx6)
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-	cmpl	$0xe6, L(have_avx)(%rip)
ce426f
-	je	L(avx512_6)
ce426f
-#  endif
ce426f
-
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE, %ymm1
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE, %ymm2
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE, %ymm3
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE, %ymm4
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE, %ymm5
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE, %ymm6
ce426f
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE, %ymm7
ce426f
-	ret
ce426f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
ce426f
-L(avx512_6):
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+0*ZMM_SIZE, %zmm0
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+1*ZMM_SIZE, %zmm1
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+2*ZMM_SIZE, %zmm2
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+3*ZMM_SIZE, %zmm3
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+4*ZMM_SIZE, %zmm4
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+5*ZMM_SIZE, %zmm5
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+6*ZMM_SIZE, %zmm6
ce426f
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+7*ZMM_SIZE, %zmm7
ce426f
-	ret
ce426f
-#  endif
ce426f
-L(no_avx6):
ce426f
-# endif
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+0*XMM_SIZE, %xmm0
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+1*XMM_SIZE, %xmm1
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+2*XMM_SIZE, %xmm2
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+3*XMM_SIZE, %xmm3
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+4*XMM_SIZE, %xmm4
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+5*XMM_SIZE, %xmm5
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+6*XMM_SIZE, %xmm6
ce426f
-	movdqa	%fs:RTLD_SAVESPACE_SSE+7*XMM_SIZE, %xmm7
ce426f
-	ret
ce426f
-	cfi_endproc
ce426f
-	.size _dl_x86_64_restore_sse, .-_dl_x86_64_restore_sse
ce426f
+#define VEC(i)			ymm##i
ce426f
+#define _dl_runtime_resolve	_dl_runtime_resolve_avx
ce426f
+#define _dl_runtime_resolve_opt	_dl_runtime_resolve_avx_opt
ce426f
+#define _dl_runtime_profile	_dl_runtime_profile_avx
ce426f
+#include "dl-trampoline.h"
ce426f
+#undef _dl_runtime_resolve
ce426f
+#undef _dl_runtime_resolve_opt
ce426f
+#undef _dl_runtime_profile
ce426f
+#undef VEC
ce426f
+#undef VMOV
ce426f
+#undef VMOVA
ce426f
+#undef VEC_SIZE
ce426f
+
ce426f
+/* movaps/movups is 1-byte shorter.  */
ce426f
+#define VEC_SIZE		16
ce426f
+#define VMOVA			movaps
ce426f
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
ce426f
+# define VMOV			movaps
ce426f
+#else
ce426f
+# define VMOV			movups
ce426f
+#endif
+#define VEC(i)			xmm##i
+#define _dl_runtime_resolve	_dl_runtime_resolve_sse
+#define _dl_runtime_profile	_dl_runtime_profile_sse
+#undef RESTORE_AVX
+#include "dl-trampoline.h"
+#undef _dl_runtime_resolve
+#undef _dl_runtime_profile
+#undef VMOV
+#undef VMOVA

+/* Used by _dl_runtime_resolve_avx_opt/_dl_runtime_resolve_avx512_opt
+   to preserve the full vector registers with zero upper bits.  */
+#define VMOVA			vmovdqa
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			vmovdqa
+#else
+# define VMOV			vmovdqu
 #endif
+#define _dl_runtime_resolve	_dl_runtime_resolve_sse_vex
+#define _dl_runtime_resolve_opt	_dl_runtime_resolve_avx512_opt
+#include "dl-trampoline.h"
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-trampoline.h
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.h
@@ -1,6 +1,5 @@
-/* Partial PLT profile trampoline to save and restore x86-64 vector
-   registers.
-   Copyright (C) 2009, 2011 Free Software Foundation, Inc.
+/* PLT trampolines.  x86-64 version.
+   Copyright (C) 2009-2015 Free Software Foundation, Inc.
    This file is part of the GNU C Library.

    The GNU C Library is free software; you can redistribute it and/or
@@ -17,16 +16,355 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */

-#ifdef RESTORE_AVX
+#undef REGISTER_SAVE_AREA_RAW
+#ifdef __ILP32__
+/* X32 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as VEC0 to
+   VEC7.  */
+# define REGISTER_SAVE_AREA_RAW	(8 * 7 + VEC_SIZE * 8)
+#else
+/* X86-64 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as
+   BND0, BND1, BND2, BND3 and VEC0 to VEC7. */
+# define REGISTER_SAVE_AREA_RAW	(8 * 7 + 16 * 4 + VEC_SIZE * 8)
+#endif
+
+#undef REGISTER_SAVE_AREA
+#undef LOCAL_STORAGE_AREA
+#undef BASE
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
+# define REGISTER_SAVE_AREA	(REGISTER_SAVE_AREA_RAW + 8)
+/* Local stack area before jumping to function address: RBX.  */
+# define LOCAL_STORAGE_AREA	8
+# define BASE			rbx
+# if (REGISTER_SAVE_AREA % VEC_SIZE) != 0
+#  error REGISTER_SAVE_AREA must be a multiple of VEC_SIZE
+# endif
+#else
+# define REGISTER_SAVE_AREA	REGISTER_SAVE_AREA_RAW
+/* Local stack area before jumping to function address:  All saved
+   registers.  */
+# define LOCAL_STORAGE_AREA	REGISTER_SAVE_AREA
+# define BASE			rsp
+# if (REGISTER_SAVE_AREA % 16) != 8
+#  error REGISTER_SAVE_AREA must be an odd multiple of 8
+# endif
+#endif
+
+	.text
+#ifdef _dl_runtime_resolve_opt
+/* Use the smallest vector registers to preserve the full YMM/ZMM
+   registers to avoid SSE transition penalty.  */
+
+# if VEC_SIZE == 32
+/* Check if the upper 128 bits in %ymm0 - %ymm7 registers are non-zero
+   and preserve %xmm0 - %xmm7 registers with the zero upper bits.  Since
+   there is no SSE transition penalty on AVX512 processors which don't
+   support XGETBV with ECX == 1, _dl_runtime_resolve_avx512_slow isn't
+   provided.   */
+	.globl _dl_runtime_resolve_avx_slow
+	.hidden _dl_runtime_resolve_avx_slow
+	.type _dl_runtime_resolve_avx_slow, @function
+	.align 16
+_dl_runtime_resolve_avx_slow:
+	cfi_startproc
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
+	vorpd %ymm0, %ymm1, %ymm8
+	vorpd %ymm2, %ymm3, %ymm9
+	vorpd %ymm4, %ymm5, %ymm10
+	vorpd %ymm6, %ymm7, %ymm11
+	vorpd %ymm8, %ymm9, %ymm9
+	vorpd %ymm10, %ymm11, %ymm10
+	vpcmpeqd %xmm8, %xmm8, %xmm8
+	vorpd %ymm9, %ymm10, %ymm10
+	vptest %ymm10, %ymm8
+	# Preserve %ymm0 - %ymm7 registers if the upper 128 bits of any
+	# %ymm0 - %ymm7 registers aren't zero.
+	PRESERVE_BND_REGS_PREFIX
+	jnc _dl_runtime_resolve_avx
+	# Use vzeroupper to avoid SSE transition penalty.
+	vzeroupper
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 128 bits
+	# when the upper 128 bits of %ymm0 - %ymm7 registers are zero.
+	PRESERVE_BND_REGS_PREFIX
+	jmp _dl_runtime_resolve_sse_vex
+	cfi_adjust_cfa_offset(-16) # Restore PLT adjustment
+	cfi_endproc
+	.size _dl_runtime_resolve_avx_slow, .-_dl_runtime_resolve_avx_slow
+# endif
+
+/* Use XGETBV with ECX == 1 to check which bits in vector registers are
+   non-zero and only preserve the non-zero lower bits with zero upper
+   bits.  */
+	.globl _dl_runtime_resolve_opt
+	.hidden _dl_runtime_resolve_opt
+	.type _dl_runtime_resolve_opt, @function
+	.align 16
+_dl_runtime_resolve_opt:
+	cfi_startproc
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
+	pushq %rax
+	cfi_adjust_cfa_offset(8)
+	cfi_rel_offset(%rax, 0)
+	pushq %rcx
+	cfi_adjust_cfa_offset(8)
+	cfi_rel_offset(%rcx, 0)
+	pushq %rdx
+	cfi_adjust_cfa_offset(8)
+	cfi_rel_offset(%rdx, 0)
+	movl $1, %ecx
+	xgetbv
+	movl %eax, %r11d
+	popq %rdx
+	cfi_adjust_cfa_offset(-8)
+	cfi_restore (%rdx)
+	popq %rcx
+	cfi_adjust_cfa_offset(-8)
+	cfi_restore (%rcx)
+	popq %rax
+	cfi_adjust_cfa_offset(-8)
+	cfi_restore (%rax)
+# if VEC_SIZE == 32
+	# For YMM registers, check if YMM state is in use.
+	andl $bit_YMM_state, %r11d
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 128 bits if
+	# YMM state isn't in use.
+	PRESERVE_BND_REGS_PREFIX
+	jz _dl_runtime_resolve_sse_vex
+# elif VEC_SIZE == 16
+	# For ZMM registers, check if YMM state and ZMM state are in
+	# use.
+	andl $(bit_YMM_state | bit_ZMM0_15_state), %r11d
+	cmpl $bit_YMM_state, %r11d
+	# Preserve %zmm0 - %zmm7 registers if ZMM state is in use.
+	PRESERVE_BND_REGS_PREFIX
+	jg _dl_runtime_resolve_avx512
+	# Preserve %ymm0 - %ymm7 registers with the zero upper 256 bits if
+	# ZMM state isn't in use.
+	PRESERVE_BND_REGS_PREFIX
+	je _dl_runtime_resolve_avx
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 384 bits if
+	# neither YMM state nor ZMM state are in use.
+# else
+#  error Unsupported VEC_SIZE!
+# endif
+	cfi_adjust_cfa_offset(-16) # Restore PLT adjustment
+	cfi_endproc
+	.size _dl_runtime_resolve_opt, .-_dl_runtime_resolve_opt
+#endif
+	.globl _dl_runtime_resolve
+	.hidden _dl_runtime_resolve
+	.type _dl_runtime_resolve, @function
+	.align 16
+	cfi_startproc
+_dl_runtime_resolve:
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
+# if LOCAL_STORAGE_AREA != 8
+#  error LOCAL_STORAGE_AREA must be 8
+# endif
+	pushq %rbx			# push subtracts stack by 8.
+	cfi_adjust_cfa_offset(8)
+	cfi_rel_offset(%rbx, 0)
+	mov %RSP_LP, %RBX_LP
+	cfi_def_cfa_register(%rbx)
+	and $-VEC_SIZE, %RSP_LP
+#endif
+	sub $REGISTER_SAVE_AREA, %RSP_LP
+	cfi_adjust_cfa_offset(REGISTER_SAVE_AREA)
+	# Preserve registers otherwise clobbered.
+	movq %rax, REGISTER_SAVE_RAX(%rsp)
+	movq %rcx, REGISTER_SAVE_RCX(%rsp)
+	movq %rdx, REGISTER_SAVE_RDX(%rsp)
+	movq %rsi, REGISTER_SAVE_RSI(%rsp)
+	movq %rdi, REGISTER_SAVE_RDI(%rsp)
+	movq %r8, REGISTER_SAVE_R8(%rsp)
+	movq %r9, REGISTER_SAVE_R9(%rsp)
+	VMOV %VEC(0), (REGISTER_SAVE_VEC_OFF)(%rsp)
+	VMOV %VEC(1), (REGISTER_SAVE_VEC_OFF + VEC_SIZE)(%rsp)
+	VMOV %VEC(2), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 2)(%rsp)
+	VMOV %VEC(3), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 3)(%rsp)
+	VMOV %VEC(4), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 4)(%rsp)
+	VMOV %VEC(5), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 5)(%rsp)
+	VMOV %VEC(6), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 6)(%rsp)
+	VMOV %VEC(7), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 7)(%rsp)
+#ifndef __ILP32__
+	# We also have to preserve bound registers.  These are nops if
+	# Intel MPX isn't available or disabled.
+# ifdef HAVE_MPX_SUPPORT
+	bndmov %bnd0, REGISTER_SAVE_BND0(%rsp)
+	bndmov %bnd1, REGISTER_SAVE_BND1(%rsp)
+	bndmov %bnd2, REGISTER_SAVE_BND2(%rsp)
+	bndmov %bnd3, REGISTER_SAVE_BND3(%rsp)
+# else
+#  if REGISTER_SAVE_BND0 == 0
+	.byte 0x66,0x0f,0x1b,0x04,0x24
+#  else
+	.byte 0x66,0x0f,0x1b,0x44,0x24,REGISTER_SAVE_BND0
+#  endif
+	.byte 0x66,0x0f,0x1b,0x4c,0x24,REGISTER_SAVE_BND1
+	.byte 0x66,0x0f,0x1b,0x54,0x24,REGISTER_SAVE_BND2
+	.byte 0x66,0x0f,0x1b,0x5c,0x24,REGISTER_SAVE_BND3
+# endif
+#endif
+	# Copy args pushed by PLT in register.
+	# %rdi: link_map, %rsi: reloc_index
+	mov (LOCAL_STORAGE_AREA + 8)(%BASE), %RSI_LP
+	mov LOCAL_STORAGE_AREA(%BASE), %RDI_LP
+	call _dl_fixup		# Call resolver.
+	mov %RAX_LP, %R11_LP	# Save return value
+#ifndef __ILP32__
+	# Restore bound registers.  These are nops if Intel MPX isn't
+	# available or disabled.
+# ifdef HAVE_MPX_SUPPORT
+	bndmov REGISTER_SAVE_BND3(%rsp), %bnd3
+	bndmov REGISTER_SAVE_BND2(%rsp), %bnd2
+	bndmov REGISTER_SAVE_BND1(%rsp), %bnd1
+	bndmov REGISTER_SAVE_BND0(%rsp), %bnd0
+# else
+	.byte 0x66,0x0f,0x1a,0x5c,0x24,REGISTER_SAVE_BND3
+	.byte 0x66,0x0f,0x1a,0x54,0x24,REGISTER_SAVE_BND2
+	.byte 0x66,0x0f,0x1a,0x4c,0x24,REGISTER_SAVE_BND1
+#  if REGISTER_SAVE_BND0 == 0
+	.byte 0x66,0x0f,0x1a,0x04,0x24
+#  else
+	.byte 0x66,0x0f,0x1a,0x44,0x24,REGISTER_SAVE_BND0
+#  endif
+# endif
+#endif
+	# Get register content back.
+	movq REGISTER_SAVE_R9(%rsp), %r9
+	movq REGISTER_SAVE_R8(%rsp), %r8
+	movq REGISTER_SAVE_RDI(%rsp), %rdi
+	movq REGISTER_SAVE_RSI(%rsp), %rsi
+	movq REGISTER_SAVE_RDX(%rsp), %rdx
+	movq REGISTER_SAVE_RCX(%rsp), %rcx
+	movq REGISTER_SAVE_RAX(%rsp), %rax
+	VMOV (REGISTER_SAVE_VEC_OFF)(%rsp), %VEC(0)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE)(%rsp), %VEC(1)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 2)(%rsp), %VEC(2)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 3)(%rsp), %VEC(3)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 4)(%rsp), %VEC(4)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 5)(%rsp), %VEC(5)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 6)(%rsp), %VEC(6)
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 7)(%rsp), %VEC(7)
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
+	mov %RBX_LP, %RSP_LP
+	cfi_def_cfa_register(%rsp)
+	movq (%rsp), %rbx
+	cfi_restore(%rbx)
+#endif
+	# Adjust stack(PLT did 2 pushes)
+	add $(LOCAL_STORAGE_AREA + 16), %RSP_LP
+	cfi_adjust_cfa_offset(-(LOCAL_STORAGE_AREA + 16))
+	# Preserve bound registers.
+	PRESERVE_BND_REGS_PREFIX
+	jmp *%r11		# Jump to function address.
+	cfi_endproc
+	.size _dl_runtime_resolve, .-_dl_runtime_resolve
+
+
+/* To preserve %xmm0 - %xmm7 registers, dl-trampoline.h is included
+   twice, for _dl_runtime_resolve_sse and _dl_runtime_resolve_sse_vex.
+   But we don't need another _dl_runtime_profile for XMM registers.  */
+#if !defined PROF && defined _dl_runtime_profile
+# if (LR_VECTOR_OFFSET % VEC_SIZE) != 0
+#  error LR_VECTOR_OFFSET must be a multiple of VEC_SIZE
+# endif
+
+	.globl _dl_runtime_profile
+	.hidden _dl_runtime_profile
+	.type _dl_runtime_profile, @function
+	.align 16
+_dl_runtime_profile:
+	cfi_startproc
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
+	/* The La_x86_64_regs data structure pointed to by the
+	   fourth parameter must be VEC_SIZE-byte aligned.  This must
+	   be explicitly enforced.  We have to set up a dynamically
+	   sized stack frame.  %rbx points to the top half which
+	   has a fixed size and preserves the original stack pointer.  */
+
+	sub $32, %RSP_LP	# Allocate the local storage.
+	cfi_adjust_cfa_offset(32)
+	movq %rbx, (%rsp)
+	cfi_rel_offset(%rbx, 0)
+
+	/* On the stack:
+		56(%rbx)	parameter #1
+		48(%rbx)	return address
+
+		40(%rbx)	reloc index
+		32(%rbx)	link_map
+
+		24(%rbx)	La_x86_64_regs pointer
+		16(%rbx)	framesize
+		 8(%rbx)	rax
+		  (%rbx)	rbx
+	*/
+
+	movq %rax, 8(%rsp)
+	mov %RSP_LP, %RBX_LP
+	cfi_def_cfa_register(%rbx)
+
+	/* Actively align the La_x86_64_regs structure.  */
+	and $-VEC_SIZE, %RSP_LP
+# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
+	/* sizeof(La_x86_64_regs).  Need extra space for 8 SSE registers
+	   to detect if any xmm0-xmm7 registers are changed by audit
+	   module.  */
+	sub $(LR_SIZE + XMM_SIZE*8), %RSP_LP
+# else
+	sub $LR_SIZE, %RSP_LP		# sizeof(La_x86_64_regs)
+# endif
+	movq %rsp, 24(%rbx)
+
+	/* Fill the La_x86_64_regs structure.  */
+	movq %rdx, LR_RDX_OFFSET(%rsp)
+	movq %r8,  LR_R8_OFFSET(%rsp)
+	movq %r9,  LR_R9_OFFSET(%rsp)
+	movq %rcx, LR_RCX_OFFSET(%rsp)
+	movq %rsi, LR_RSI_OFFSET(%rsp)
+	movq %rdi, LR_RDI_OFFSET(%rsp)
+	movq %rbp, LR_RBP_OFFSET(%rsp)
+
+	lea 48(%rbx), %RAX_LP
+	movq %rax, LR_RSP_OFFSET(%rsp)
+
+	/* We always store the XMM registers even if AVX is available.
+	   This is to provide backward binary compatibility for existing
+	   audit modules.  */
+	movaps %xmm0,		   (LR_XMM_OFFSET)(%rsp)
+	movaps %xmm1, (LR_XMM_OFFSET +   XMM_SIZE)(%rsp)
+	movaps %xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)
+	movaps %xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)
+	movaps %xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)
+	movaps %xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)
+	movaps %xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)
+	movaps %xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)
+
+# ifndef __ILP32__
+#  ifdef HAVE_MPX_SUPPORT
+	bndmov %bnd0, 		   (LR_BND_OFFSET)(%rsp)  # Preserve bound
+	bndmov %bnd1, (LR_BND_OFFSET +   BND_SIZE)(%rsp)  # registers. Nops if
+	bndmov %bnd2, (LR_BND_OFFSET + BND_SIZE*2)(%rsp)  # MPX not available
+	bndmov %bnd3, (LR_BND_OFFSET + BND_SIZE*3)(%rsp)  # or disabled.
+#  else
+	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET)
+	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
+	.byte 0x66,0x0f,0x1b,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
+	.byte 0x66,0x0f,0x1b,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
+#  endif
+# endif
+
+# ifdef RESTORE_AVX
 	/* This is to support AVX audit modules.  */
-	VMOV %VEC(0),		      (LR_VECTOR_OFFSET)(%rsp)
-	VMOV %VEC(1), (LR_VECTOR_OFFSET +   VECTOR_SIZE)(%rsp)
-	VMOV %VEC(2), (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
-	VMOV %VEC(3), (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
-	VMOV %VEC(4), (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
-	VMOV %VEC(5), (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
-	VMOV %VEC(6), (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
-	VMOV %VEC(7), (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)
+	VMOVA %VEC(0),		      (LR_VECTOR_OFFSET)(%rsp)
+	VMOVA %VEC(1), (LR_VECTOR_OFFSET +   VECTOR_SIZE)(%rsp)
+	VMOVA %VEC(2), (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
+	VMOVA %VEC(3), (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
+	VMOVA %VEC(4), (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
+	VMOVA %VEC(5), (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
+	VMOVA %VEC(6), (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
+	VMOVA %VEC(7), (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)

 	/* Save xmm0-xmm7 registers to detect if any of them are
 	   changed by audit module.  */
@@ -38,7 +376,7 @@
 	vmovdqa %xmm5, (LR_SIZE + XMM_SIZE*5)(%rsp)
 	vmovdqa %xmm6, (LR_SIZE + XMM_SIZE*6)(%rsp)
 	vmovdqa %xmm7, (LR_SIZE + XMM_SIZE*7)(%rsp)
-#endif
+# endif

 	mov %RSP_LP, %RCX_LP	# La_x86_64_regs pointer to %rcx.
 	mov 48(%rbx), %RDX_LP	# Load return address if needed.
@@ -63,21 +401,7 @@
 	movaps (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp), %xmm6
 	movaps (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp), %xmm7

-#ifndef __ILP32__
-# ifdef HAVE_MPX_SUPPORT
-	bndmov 		    (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
-	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
-	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
-	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
-# else
-	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
-	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
-	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
-	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
-# endif
-#endif
-
-#ifdef RESTORE_AVX
+# ifdef RESTORE_AVX
 	/* Check if any xmm0-xmm7 registers are changed by audit
 	   module.  */
 	vpcmpeqq (LR_SIZE)(%rsp), %xmm0, %xmm8
@@ -86,7 +410,7 @@
 	je 2f
 	vmovdqa	%xmm0, (LR_VECTOR_OFFSET)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET)(%rsp), %VEC(0)
+2:	VMOVA (LR_VECTOR_OFFSET)(%rsp), %VEC(0)
 	vmovdqa	%xmm0, (LR_XMM_OFFSET)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE)(%rsp), %xmm1, %xmm8
@@ -95,7 +419,7 @@
 	je 2f
 	vmovdqa	%xmm1, (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp), %VEC(1)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp), %VEC(1)
 	vmovdqa	%xmm1, (LR_XMM_OFFSET + XMM_SIZE)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*2)(%rsp), %xmm2, %xmm8
@@ -104,7 +428,7 @@
 	je 2f
 	vmovdqa	%xmm2, (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp), %VEC(2)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp), %VEC(2)
 	vmovdqa	%xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*3)(%rsp), %xmm3, %xmm8
@@ -113,7 +437,7 @@
 	je 2f
 	vmovdqa	%xmm3, (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp), %VEC(3)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp), %VEC(3)
 	vmovdqa	%xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*4)(%rsp), %xmm4, %xmm8
@@ -122,7 +446,7 @@
 	je 2f
 	vmovdqa	%xmm4, (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp), %VEC(4)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp), %VEC(4)
 	vmovdqa	%xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*5)(%rsp), %xmm5, %xmm8
@@ -131,7 +455,7 @@
 	je 2f
 	vmovdqa	%xmm5, (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp), %VEC(5)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp), %VEC(5)
 	vmovdqa	%xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*6)(%rsp), %xmm6, %xmm8
@@ -140,7 +464,7 @@
 	je 2f
 	vmovdqa	%xmm6, (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp), %VEC(6)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp), %VEC(6)
 	vmovdqa	%xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)

 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*7)(%rsp), %xmm7, %xmm8
@@ -149,13 +473,29 @@
 	je 2f
 	vmovdqa	%xmm7, (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)
 	jmp 1f
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp), %VEC(7)
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp), %VEC(7)
 	vmovdqa	%xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)

 1:
-#endif
+# endif
+
+# ifndef __ILP32__
+#  ifdef HAVE_MPX_SUPPORT
+	bndmov              (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
+	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
+	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
ce426f
+	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
ce426f
+#  else
ce426f
+	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
ce426f
+	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
ce426f
+	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
ce426f
+	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
ce426f
+#  endif
ce426f
+# endif
ce426f
+
ce426f
 	mov  16(%rbx), %R10_LP	# Anything in framesize?
ce426f
 	test %R10_LP, %R10_LP
ce426f
+	PRESERVE_BND_REGS_PREFIX
ce426f
 	jns 3f
ce426f
 
ce426f
 	/* There's nothing in the frame size, so there
ce426f
@@ -166,14 +506,15 @@
ce426f
 	movq LR_RSI_OFFSET(%rsp), %rsi
ce426f
 	movq LR_RDI_OFFSET(%rsp), %rdi
ce426f
 
ce426f
-	movq %rbx, %rsp
ce426f
+	mov %RBX_LP, %RSP_LP
ce426f
 	movq (%rsp), %rbx
ce426f
-	cfi_restore(rbx)
ce426f
+	cfi_restore(%rbx)
ce426f
 	cfi_def_cfa_register(%rsp)
ce426f
 
ce426f
-	addq $48, %rsp		# Adjust the stack to the return value
ce426f
+	add $48, %RSP_LP	# Adjust the stack to the return value
ce426f
 				# (eats the reloc index and link_map)
ce426f
 	cfi_adjust_cfa_offset(-48)
ce426f
+	PRESERVE_BND_REGS_PREFIX
ce426f
 	jmp *%r11		# Jump to function address.
ce426f
 
ce426f
 3:
ce426f
@@ -186,13 +527,13 @@
ce426f
 	   temporary buffer of the size specified by the 'framesize'
ce426f
 	   returned from _dl_profile_fixup */
ce426f
 
ce426f
-	leaq LR_RSP_OFFSET(%rbx), %rsi	# stack
ce426f
-	addq $8, %r10
ce426f
-	andq $0xfffffffffffffff0, %r10
ce426f
-	movq %r10, %rcx
ce426f
-	subq %r10, %rsp
ce426f
-	movq %rsp, %rdi
ce426f
-	shrq $3, %rcx
ce426f
+	lea LR_RSP_OFFSET(%rbx), %RSI_LP # stack
ce426f
+	add $8, %R10_LP
ce426f
+	and $-16, %R10_LP
ce426f
+	mov %R10_LP, %RCX_LP
ce426f
+	sub %R10_LP, %RSP_LP
ce426f
+	mov %RSP_LP, %RDI_LP
ce426f
+	shr $3, %RCX_LP
ce426f
 	rep
ce426f
 	movsq
ce426f
 
ce426f
@@ -200,23 +541,24 @@
ce426f
 	movq 32(%rdi), %rsi
ce426f
 	movq 40(%rdi), %rdi
ce426f
 
ce426f
+	PRESERVE_BND_REGS_PREFIX
ce426f
 	call *%r11
ce426f
 
ce426f
-	mov 24(%rbx), %rsp	# Drop the copied stack content
ce426f
+	mov 24(%rbx), %RSP_LP	# Drop the copied stack content
ce426f
 
ce426f
 	/* Now we have to prepare the La_x86_64_retval structure for the
ce426f
 	   _dl_call_pltexit.  The La_x86_64_regs is being pointed by rsp now,
ce426f
 	   so we just need to allocate the sizeof(La_x86_64_retval) space on
ce426f
 	   the stack, since the alignment has already been taken care of. */
ce426f
-#ifdef RESTORE_AVX
ce426f
+# ifdef RESTORE_AVX
ce426f
 	/* sizeof(La_x86_64_retval).  Need extra space for 2 SSE
ce426f
 	   registers to detect if xmm0/xmm1 registers are changed
ce426f
 	   by audit module.  */
ce426f
-	subq $(LRV_SIZE + XMM_SIZE*2), %rsp
ce426f
-#else
ce426f
-	subq $LRV_SIZE, %rsp	# sizeof(La_x86_64_retval)
ce426f
-#endif
ce426f
-	movq %rsp, %rcx		# La_x86_64_retval argument to %rcx.
ce426f
+	sub $(LRV_SIZE + XMM_SIZE*2), %RSP_LP
ce426f
+# else
ce426f
+	sub $LRV_SIZE, %RSP_LP	# sizeof(La_x86_64_retval)
ce426f
+# endif
ce426f
+	mov %RSP_LP, %RCX_LP	# La_x86_64_retval argument to %rcx.
ce426f
 
ce426f
 	/* Fill in the La_x86_64_retval structure.  */
ce426f
 	movq %rax, LRV_RAX_OFFSET(%rcx)
ce426f
@@ -225,26 +567,26 @@
ce426f
 	movaps %xmm0, LRV_XMM0_OFFSET(%rcx)
ce426f
 	movaps %xmm1, LRV_XMM1_OFFSET(%rcx)
ce426f
 
ce426f
-#ifdef RESTORE_AVX
ce426f
+# ifdef RESTORE_AVX
ce426f
 	/* This is to support AVX audit modules.  */
ce426f
-	VMOV %VEC(0), LRV_VECTOR0_OFFSET(%rcx)
ce426f
-	VMOV %VEC(1), LRV_VECTOR1_OFFSET(%rcx)
ce426f
+	VMOVA %VEC(0), LRV_VECTOR0_OFFSET(%rcx)
ce426f
+	VMOVA %VEC(1), LRV_VECTOR1_OFFSET(%rcx)
ce426f
 
ce426f
 	/* Save xmm0/xmm1 registers to detect if they are changed
ce426f
 	   by audit module.  */
ce426f
 	vmovdqa %xmm0,		  (LRV_SIZE)(%rcx)
ce426f
 	vmovdqa %xmm1, (LRV_SIZE + XMM_SIZE)(%rcx)
ce426f
-#endif
ce426f
+# endif
ce426f
 
ce426f
-#ifndef __ILP32__
ce426f
-# ifdef HAVE_MPX_SUPPORT
ce426f
+# ifndef __ILP32__
ce426f
+#  ifdef HAVE_MPX_SUPPORT
ce426f
 	bndmov %bnd0, LRV_BND0_OFFSET(%rcx)  # Preserve returned bounds.
ce426f
 	bndmov %bnd1, LRV_BND1_OFFSET(%rcx)
ce426f
-# else
ce426f
+#  else
ce426f
 	.byte  0x66,0x0f,0x1b,0x81;.long (LRV_BND0_OFFSET)
ce426f
 	.byte  0x66,0x0f,0x1b,0x89;.long (LRV_BND1_OFFSET)
ce426f
+#  endif
ce426f
 # endif
ce426f
-#endif
ce426f
 
ce426f
 	fstpt LRV_ST0_OFFSET(%rcx)
ce426f
 	fstpt LRV_ST1_OFFSET(%rcx)
ce426f
@@ -261,49 +603,47 @@
ce426f
 	movaps LRV_XMM0_OFFSET(%rsp), %xmm0
ce426f
 	movaps LRV_XMM1_OFFSET(%rsp), %xmm1
ce426f
 
ce426f
-#ifdef RESTORE_AVX
ce426f
+# ifdef RESTORE_AVX
ce426f
 	/* Check if xmm0/xmm1 registers are changed by audit module.  */
ce426f
 	vpcmpeqq (LRV_SIZE)(%rsp), %xmm0, %xmm2
ce426f
 	vpmovmskb %xmm2, %esi
ce426f
 	cmpl $0xffff, %esi
ce426f
 	jne 1f
ce426f
-	VMOV LRV_VECTOR0_OFFSET(%rsp), %VEC(0)
ce426f
+	VMOVA LRV_VECTOR0_OFFSET(%rsp), %VEC(0)
ce426f
 
ce426f
 1:	vpcmpeqq (LRV_SIZE + XMM_SIZE)(%rsp), %xmm1, %xmm2
ce426f
 	vpmovmskb %xmm2, %esi
ce426f
 	cmpl $0xffff, %esi
ce426f
 	jne 1f
ce426f
-	VMOV LRV_VECTOR1_OFFSET(%rsp), %VEC(1)
ce426f
+	VMOVA LRV_VECTOR1_OFFSET(%rsp), %VEC(1)
ce426f
 
ce426f
 1:
ce426f
-#endif
ce426f
+# endif
ce426f
 
ce426f
-#ifndef __ILP32__
ce426f
-# ifdef HAVE_MPX_SUPPORT
ce426f
-	bndmov LRV_BND0_OFFSET(%rcx), %bnd0  # Restore bound registers.
ce426f
-	bndmov LRV_BND1_OFFSET(%rcx), %bnd1
ce426f
-# else
ce426f
-	.byte  0x66,0x0f,0x1a,0x81;.long (LRV_BND0_OFFSET)
ce426f
-	.byte  0x66,0x0f,0x1a,0x89;.long (LRV_BND1_OFFSET)
ce426f
+# ifndef __ILP32__
ce426f
+#  ifdef HAVE_MPX_SUPPORT
ce426f
+	bndmov LRV_BND0_OFFSET(%rsp), %bnd0  # Restore bound registers.
ce426f
+	bndmov LRV_BND1_OFFSET(%rsp), %bnd1
ce426f
+#  else
ce426f
+	.byte  0x66,0x0f,0x1a,0x84,0x24;.long (LRV_BND0_OFFSET)
ce426f
+	.byte  0x66,0x0f,0x1a,0x8c,0x24;.long (LRV_BND1_OFFSET)
ce426f
+#  endif
ce426f
 # endif
ce426f
-#endif
ce426f
 
ce426f
 	fldt LRV_ST1_OFFSET(%rsp)
ce426f
 	fldt LRV_ST0_OFFSET(%rsp)
ce426f
 
ce426f
-	movq %rbx, %rsp
ce426f
+	mov %RBX_LP, %RSP_LP
ce426f
 	movq (%rsp), %rbx
ce426f
-	cfi_restore(rbx)
ce426f
+	cfi_restore(%rbx)
ce426f
 	cfi_def_cfa_register(%rsp)
ce426f
 
ce426f
-	addq $48, %rsp		# Adjust the stack to the return value
ce426f
+	add $48, %RSP_LP	# Adjust the stack to the return value
ce426f
 				# (eats the reloc index and link_map)
ce426f
 	cfi_adjust_cfa_offset(-48)
ce426f
+	PRESERVE_BND_REGS_PREFIX
ce426f
 	retq
ce426f
 
ce426f
-#ifdef MORE_CODE
ce426f
-	cfi_adjust_cfa_offset(48)
ce426f
-	cfi_rel_offset(%rbx, 0)
ce426f
-	cfi_def_cfa_register(%rbx)
ce426f
-# undef MORE_CODE
ce426f
+	cfi_endproc
ce426f
+	.size _dl_runtime_profile, .-_dl_runtime_profile
ce426f
 #endif
ce426f
Index: glibc-2.17-c758a686/sysdeps/x86_64/ifuncmain8.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/ifuncmain8.c
@@ -0,0 +1,32 @@
+/* Test IFUNC selector with floating-point parameters.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdlib.h>
+
+extern float foo (float);
+
+static int
+do_test (void)
+{
+  if (foo (2) != 3)
+    abort ();
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
Index: glibc-2.17-c758a686/sysdeps/x86_64/ifuncmod8.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/ifuncmod8.c
@@ -0,0 +1,36 @@
+/* Test IFUNC selector with floating-point parameters.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <emmintrin.h>
+
+void * foo_ifunc (void) __asm__ ("foo");
+__asm__(".type foo, %gnu_indirect_function");
+
+static float
+foo_impl (float x)
+{
+  return x + 1;
+}
+
+void *
+foo_ifunc (void)
+{
+  __m128i xmm = _mm_set1_epi32 (-1);
+  asm volatile ("movdqa %0, %%xmm0" : : "x" (xmm) : "xmm0" );
+  return foo_impl;
+}
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx-aux.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx-aux.c
@@ -0,0 +1,47 @@
+/* Test case for preserved AVX registers in dynamic linker, -mavx part.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <immintrin.h>
+#include <stdlib.h>
+#include <string.h>
+
+int
+tst_avx_aux (void)
+{
+#ifdef __AVX__
+  extern __m256i avx_test (__m256i, __m256i, __m256i, __m256i,
+			   __m256i, __m256i, __m256i, __m256i);
+
+  __m256i ymm0 = _mm256_set1_epi32 (0);
+  __m256i ymm1 = _mm256_set1_epi32 (1);
+  __m256i ymm2 = _mm256_set1_epi32 (2);
+  __m256i ymm3 = _mm256_set1_epi32 (3);
+  __m256i ymm4 = _mm256_set1_epi32 (4);
+  __m256i ymm5 = _mm256_set1_epi32 (5);
+  __m256i ymm6 = _mm256_set1_epi32 (6);
+  __m256i ymm7 = _mm256_set1_epi32 (7);
+  __m256i ret = avx_test (ymm0, ymm1, ymm2, ymm3,
+			  ymm4, ymm5, ymm6, ymm7);
+  ymm0 =  _mm256_set1_epi32 (0x12349876);
+  if (memcmp (&ymm0, &ret, sizeof (ret)))
+    abort ();
+  return 0;
+#else  /* __AVX__ */
+  return 77;
+#endif  /* __AVX__ */
+}
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx.c
@@ -0,0 +1,49 @@
+/* Test case for preserved AVX registers in dynamic linker.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <cpuid.h>
+
+int tst_avx_aux (void);
+
+static int
+avx_enabled (void)
+{
+  unsigned int eax, ebx, ecx, edx;
+
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx) == 0
+      || (ecx & (bit_AVX | bit_OSXSAVE)) != (bit_AVX | bit_OSXSAVE))
+    return 0;
+
+  /* Check the OS has AVX and SSE saving enabled.  */
+  asm ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
+
+  return (eax & 6) == 6;
+}
+
+static int
+do_test (void)
+{
+  /* Run AVX test only if AVX is supported.  */
+  if (avx_enabled ())
+    return tst_avx_aux ();
+  else
+    return 77;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../../test-skeleton.c"
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512-aux.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512-aux.c
@@ -0,0 +1,48 @@
+/* Test case for preserved AVX512 registers in dynamic linker,
+   -mavx512 part.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <immintrin.h>
+#include <stdlib.h>
+#include <string.h>
+
+int
+tst_avx512_aux (void)
+{
+#ifdef __AVX512F__
+  extern __m512i avx512_test (__m512i, __m512i, __m512i, __m512i,
+			      __m512i, __m512i, __m512i, __m512i);
+
+  __m512i zmm0 = _mm512_set1_epi32 (0);
+  __m512i zmm1 = _mm512_set1_epi32 (1);
+  __m512i zmm2 = _mm512_set1_epi32 (2);
+  __m512i zmm3 = _mm512_set1_epi32 (3);
+  __m512i zmm4 = _mm512_set1_epi32 (4);
+  __m512i zmm5 = _mm512_set1_epi32 (5);
+  __m512i zmm6 = _mm512_set1_epi32 (6);
+  __m512i zmm7 = _mm512_set1_epi32 (7);
+  __m512i ret = avx512_test (zmm0, zmm1, zmm2, zmm3,
+			     zmm4, zmm5, zmm6, zmm7);
+  zmm0 =  _mm512_set1_epi32 (0x12349876);
+  if (memcmp (&zmm0, &ret, sizeof (ret)))
+    abort ();
+  return 0;
+#else  /* __AVX512F__ */
+  return 77;
+#endif  /* __AVX512F__ */
+}
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512.c
@@ -0,0 +1,57 @@
+/* Test case for preserved AVX512 registers in dynamic linker.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <cpuid.h>
+
+int tst_avx512_aux (void);
+
+static int
+avx512_enabled (void)
+{
+#ifdef bit_AVX512F
+  unsigned int eax, ebx, ecx, edx;
+
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx) == 0
+      || (ecx & (bit_AVX | bit_OSXSAVE)) != (bit_AVX | bit_OSXSAVE))
+    return 0;
+
+  __cpuid_count (7, 0, eax, ebx, ecx, edx);
+  if (!(ebx & bit_AVX512F))
+    return 0;
+
+  asm ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
+
+  /* Verify that ZMM, YMM and XMM states are enabled.  */
+  return (eax & 0xe6) == 0xe6;
+#else
+  return 0;
+#endif
+}
+
+static int
+do_test (void)
+{
+  /* Run AVX512 test only if AVX512 is supported.  */
+  if (avx512_enabled ())
+    return tst_avx512_aux ();
+  else
+    return 77;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../../test-skeleton.c"
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512mod.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512mod.c
@@ -0,0 +1,48 @@
+/* Test case for x86-64 preserved AVX512 registers in dynamic linker.  */
+
+#ifdef __AVX512F__
+#include <stdlib.h>
+#include <string.h>
+#include <immintrin.h>
+
+__m512i
+avx512_test (__m512i x0, __m512i x1, __m512i x2, __m512i x3,
+	     __m512i x4, __m512i x5, __m512i x6, __m512i x7)
+{
+  __m512i zmm;
+
+  zmm = _mm512_set1_epi32 (0);
+  if (memcmp (&zmm, &x0, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (1);
+  if (memcmp (&zmm, &x1, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (2);
+  if (memcmp (&zmm, &x2, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (3);
+  if (memcmp (&zmm, &x3, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (4);
+  if (memcmp (&zmm, &x4, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (5);
+  if (memcmp (&zmm, &x5, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (6);
+  if (memcmp (&zmm, &x6, sizeof (zmm)))
+    abort ();
+
+  zmm = _mm512_set1_epi32 (7);
+  if (memcmp (&zmm, &x7, sizeof (zmm)))
+    abort ();
+
+  return _mm512_set1_epi32 (0x12349876);
+}
+#endif
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-ssemod.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-ssemod.c
@@ -0,0 +1,46 @@
+/* Test case for x86-64 preserved SSE registers in dynamic linker.  */
+
+#include <stdlib.h>
+#include <string.h>
+#include <immintrin.h>
+
+__m128i
+sse_test (__m128i x0, __m128i x1, __m128i x2, __m128i x3,
+	  __m128i x4, __m128i x5, __m128i x6, __m128i x7)
+{
+  __m128i xmm;
+
+  xmm = _mm_set1_epi32 (0);
+  if (memcmp (&xmm, &x0, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (1);
+  if (memcmp (&xmm, &x1, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (2);
+  if (memcmp (&xmm, &x2, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (3);
+  if (memcmp (&xmm, &x3, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (4);
+  if (memcmp (&xmm, &x4, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (5);
+  if (memcmp (&xmm, &x5, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (6);
+  if (memcmp (&xmm, &x6, sizeof (xmm)))
+    abort ();
+
+  xmm = _mm_set1_epi32 (7);
+  if (memcmp (&xmm, &x7, sizeof (xmm)))
+    abort ();
+
+  return _mm_set1_epi32 (0x12349876);
+}
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-sse.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-sse.c
@@ -0,0 +1,46 @@
+/* Test case for preserved SSE registers in dynamic linker.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <immintrin.h>
+#include <stdlib.h>
+#include <string.h>
+
+extern __m128i sse_test (__m128i, __m128i, __m128i, __m128i,
+                        __m128i, __m128i, __m128i, __m128i);
+
+static int
+do_test (void)
+{
+  __m128i xmm0 = _mm_set1_epi32 (0);
+  __m128i xmm1 = _mm_set1_epi32 (1);
+  __m128i xmm2 = _mm_set1_epi32 (2);
+  __m128i xmm3 = _mm_set1_epi32 (3);
+  __m128i xmm4 = _mm_set1_epi32 (4);
+  __m128i xmm5 = _mm_set1_epi32 (5);
+  __m128i xmm6 = _mm_set1_epi32 (6);
+  __m128i xmm7 = _mm_set1_epi32 (7);
+  __m128i ret = sse_test (xmm0, xmm1, xmm2, xmm3,
+                         xmm4, xmm5, xmm6, xmm7);
+  xmm0 =  _mm_set1_epi32 (0x12349876);
+  if (memcmp (&xmm0, &ret, sizeof (ret)))
+    abort ();
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../../test-skeleton.c"
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avxmod.c
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avxmod.c
@@ -0,0 +1,49 @@
+
+/* Test case for x86-64 preserved AVX registers in dynamic linker.  */
+
+#ifdef __AVX__
+#include <stdlib.h>
+#include <string.h>
+#include <immintrin.h>
+
+__m256i
+avx_test (__m256i x0, __m256i x1, __m256i x2, __m256i x3,
+	  __m256i x4, __m256i x5, __m256i x6, __m256i x7)
+{
+  __m256i ymm;
+
+  ymm = _mm256_set1_epi32 (0);
+  if (memcmp (&ymm, &x0, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (1);
+  if (memcmp (&ymm, &x1, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (2);
+  if (memcmp (&ymm, &x2, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (3);
+  if (memcmp (&ymm, &x3, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (4);
+  if (memcmp (&ymm, &x4, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (5);
+  if (memcmp (&ymm, &x5, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (6);
+  if (memcmp (&ymm, &x6, sizeof (ymm)))
+    abort ();
+
+  ymm = _mm256_set1_epi32 (7);
+  if (memcmp (&ymm, &x7, sizeof (ymm)))
+    abort ();
+
+  return _mm256_set1_epi32 (0x12349876);
+}
+#endif