Tree - rpms/glibc - CentOS Git server

rpms / glibc

Blame SOURCES/glibc-rh1880670.patch

		edfbb2	`commit d3c57027470b78dba79c6d931e4e409b1fecfc80`
		edfbb2	`Author: Patrick McGehearty <patrick.mcgehearty@oracle.com>`
		edfbb2	`Date: Mon Sep 28 20:11:28 2020 +0000`
		edfbb2
		edfbb2	`Reversing calculation of __x86_shared_non_temporal_threshold`
		edfbb2
		edfbb2	`The __x86_shared_non_temporal_threshold determines when memcpy on x86`
		edfbb2	`uses non_temporal stores to avoid pushing other data out of the last`
		edfbb2	`level cache.`
		edfbb2
		edfbb2	`This patch proposes to revert the calculation change made by H.J. Lu's`
		edfbb2	`patch of June 2, 2017.`
		edfbb2
		edfbb2	`H.J. Lu's patch selected a threshold suitable for a single thread`
		edfbb2	`getting maximum performance. It was tuned using the single threaded`
		edfbb2	`large memcpy micro benchmark on an 8 core processor. The last change`
		edfbb2	`changes the threshold from using 3/4 of one thread's share of the`
		edfbb2	`cache to using 3/4 of the entire cache of a multi-threaded system`
		edfbb2	`before switching to non-temporal stores. Multi-threaded systems with`
		edfbb2	`more than a few threads are server-class and typically have many`
		edfbb2	`active threads. If one thread consumes 3/4 of the available cache for`
		edfbb2	`all threads, it will cause other active threads to have data removed`
		edfbb2	`from the cache. Two examples show the range of the effect. John`
		edfbb2	`McCalpin's widely parallel Stream benchmark, which runs in parallel`
		edfbb2	`and fetches data sequentially, saw a 20% slowdown with this patch on`
		edfbb2	`an internal system test of 128 threads. This regression was discovered`
		edfbb2	`when comparing OL8 performance to OL7. An example that compares`
		edfbb2	`normal stores to non-temporal stores may be found at`
		edfbb2	`https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test`
		edfbb2	`shows performance loss of 400 to 500% due to a failure to use`
		edfbb2	`nontemporal stores. These performance losses are most likely to occur`
		edfbb2	`when the system load is heaviest and good performance is critical.`
		edfbb2
		edfbb2	`The tunable x86_non_temporal_threshold can be used to override the`
		edfbb2	`default for the knowledgeable user who really wants maximum cache`
		edfbb2	`allocation to a single thread in a multi-threaded system.`
		edfbb2	`The manual entry for the tunable has been expanded to provide`
		edfbb2	`more information about its purpose.`
		edfbb2
		edfbb2	`modified: sysdeps/x86/cacheinfo.c`
		edfbb2	`modified: manual/tunables.texi`
		edfbb2
		edfbb2	`Conflicts:`
		edfbb2	`manual/tunables.texi`
		edfbb2	`(Downstream uses the glibc.tune namespace, upstream uses`
		edfbb2	`glibc.cpu.)`
		edfbb2	`sysdeps/x86/cacheinfo.c`
		edfbb2	`(Downstream does not have rep_movsb_threshold,`
		edfbb2	`x86_rep_stosb_threshold tunables.)`
		edfbb2
		edfbb2	`diff --git a/manual/tunables.texi b/manual/tunables.texi`
		edfbb2	`index 3dc6f9a44592c030..3e1e519dff153b09 100644`
		edfbb2	`--- a/manual/tunables.texi`
		edfbb2	`+++ b/manual/tunables.texi`
		edfbb2	`@@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.`
		edfbb2
		edfbb2	`@deftp Tunable glibc.tune.x86_non_temporal_threshold`
		edfbb2	`The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user`
		edfbb2	`-to set threshold in bytes for non temporal store.`
		edfbb2	`+to set threshold in bytes for non temporal store. Non temporal stores`
		edfbb2	`+give a hint to the hardware to move data directly to memory without`
		edfbb2	`+displacing other data from the cache. This tunable is used by some`
		edfbb2	`+platforms to determine when to use non temporal stores in operations`
		edfbb2	`+like memmove and memcpy.`
		edfbb2
		edfbb2	`This tunable is specific to i386 and x86-64.`
		edfbb2	`@end deftp`
		edfbb2	`diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c`
		edfbb2	`index b9444ddd52051e05..42b468d0c4885bad 100644`
		edfbb2	`--- a/sysdeps/x86/cacheinfo.c`
		edfbb2	`+++ b/sysdeps/x86/cacheinfo.c`
		edfbb2	`@@ -778,14 +778,20 @@ intel_bug_no_cache_info:`
		edfbb2	`__x86_shared_cache_size = shared;`
		edfbb2	`}`
		edfbb2
		edfbb2	`- /* The large memcpy micro benchmark in glibc shows that 6 times of`
		edfbb2	`- shared cache size is the approximate value above which non-temporal`
		edfbb2	`- store becomes faster on a 8-core processor. This is the 3/4 of the`
		edfbb2	`- total shared cache size. */`
		edfbb2	`+ /* The default setting for the non_temporal threshold is 3/4 of one`
		edfbb2	`+ thread's share of the chip's cache. For most Intel and AMD processors`
		edfbb2	`+ with an initial release date between 2017 and 2020, a thread's typical`
		edfbb2	`+ share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4`
		edfbb2	`+ threshold leaves 125 KBytes to 500 KBytes of the thread's data`
		edfbb2	`+ in cache after a maximum temporal copy, which will maintain`
		edfbb2	`+ in cache a reasonable portion of the thread's stack and other`
		edfbb2	`+ active data. If the threshold is set higher than one thread's`
		edfbb2	`+ share of the cache, it has a substantial risk of negatively`
		edfbb2	`+ impacting the performance of other threads running on the chip. */`
		edfbb2	`__x86_shared_non_temporal_threshold`
		edfbb2	`= (cpu_features->non_temporal_threshold != 0`
		edfbb2	`? cpu_features->non_temporal_threshold`
		edfbb2	`- : __x86_shared_cache_size * threads * 3 / 4);`
		edfbb2	`+ : __x86_shared_cache_size * 3 / 4);`
		edfbb2	`}`
		edfbb2
		edfbb2	`#endif`
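
The arithmetic changed by the cacheinfo.c hunk above can be sketched in isolation. This is a minimal illustration, not glibc code: the function names `old_default_threshold` and `new_default_threshold`, their parameters, and the sample sizes mentioned afterwards are hypothetical. Only the two expressions (`* threads * 3 / 4` versus `* 3 / 4`) and the rule that a nonzero tunable value overrides the default are taken from the diff. The sketch assumes, as the new comment states, that the shared cache size passed in is already one thread's share of the cache.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the expression REMOVED by the patch:
   3/4 of the whole shared cache (per-thread share times thread count),
   unless the tunable supplies a nonzero override.  */
static unsigned long
old_default_threshold (unsigned long share_per_thread, unsigned int threads,
                       unsigned long tunable_override)
{
  return (tunable_override != 0
          ? tunable_override
          : share_per_thread * threads * 3 / 4);
}

/* Hypothetical helper mirroring the expression ADDED by the patch:
   3/4 of one thread's share of the cache, unless the tunable supplies
   a nonzero override.  */
static unsigned long
new_default_threshold (unsigned long share_per_thread,
                       unsigned long tunable_override)
{
  return (tunable_override != 0
          ? tunable_override
          : share_per_thread * 3 / 4);
}
```

With an assumed per-thread share of 1 MiB (1048576 bytes) and 16 threads, the old expression yields 12582912 bytes (12 MiB) while the new one yields 786432 bytes (768 KiB), so a copy switches to non-temporal stores at a size 16 times smaller. Passing a nonzero override returns that value unchanged in both versions, which matches the role of the x86_non_temporal_threshold tunable described in the commit message.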
