Description: Makes trimming work consistently across arenas.
Author: Mel Gorman <mgorman@suse.de>
Origin: git://sourceware.org/git/glibc.git
Bug-RHEL: N/A
Bug-Fedora: N/A
Bug-Upstream: #17195
Upstream status: committed

Part of commit 8a35c3fe122d49ba76dff815b3537affb5a50b45 is also included
to allow the use of ALIGN_UP within malloc/arena.c.

commit c26efef9798914e208329c0e8c3c73bb1135d9e3
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Apr 2 12:14:14 2015 +0530

    malloc: Consistently apply trim_threshold to all heaps [BZ #17195]

    Trimming heaps is a balance between saving memory and the system overhead
    required to update page tables and discard allocated pages. The malloc
    option M_TRIM_THRESHOLD is a tunable that users are meant to use to decide
    where this balance point is, but it is only applied to the main arena.

    For scalability reasons, glibc malloc has per-thread heaps, but these are
    shrunk with madvise() if there is one page free at the top of the heap.
    In some circumstances this can lead to high system overhead if a thread
    has a control flow like

        while (data_to_process) {
            buf = malloc(large_size);
            do_stuff();
            free(buf);
        }

    For a large size, the free() will call madvise (pagetable teardown, page
    free and TLB flush) every time, followed immediately by a malloc (fault,
    kernel page alloc, zeroing and charge accounting). The kernel overhead
    can dominate such a workload.

    This patch allows the user to tune when madvise gets called by applying
    the trim threshold to the per-thread heaps and using similar logic to the
    main arena when deciding whether to shrink. Alternatively, if the dynamic
    brk/mmap threshold gets adjusted then the new values will be obeyed by
    the per-thread heaps.

    Bug 17195 was a test case motivated by a problem encountered in scientific
    applications written in python that performed badly due to high page fault
    overhead. The basic operation of such a program was posted by Julian Taylor
    https://sourceware.org/ml/libc-alpha/2015-02/msg00373.html

    With this patch applied, the overhead is eliminated. All numbers in this
    report are in seconds and were recorded by running Julian's program 30
    times.

    pyarray
                                     glibc               madvise
                                      2.21                    v2
    System  min             1.81 (  0.00%)        0.00 (100.00%)
    System  mean            1.93 (  0.00%)        0.02 ( 99.20%)
    System  stddev          0.06 (  0.00%)        0.01 ( 88.99%)
    System  max             2.06 (  0.00%)        0.03 ( 98.54%)
    Elapsed min             3.26 (  0.00%)        2.37 ( 27.30%)
    Elapsed mean            3.39 (  0.00%)        2.41 ( 28.84%)
    Elapsed stddev          0.14 (  0.00%)        0.02 ( 82.73%)
    Elapsed max             4.05 (  0.00%)        2.47 ( 39.01%)

                   glibc     madvise
                    2.21          v2
    User          141.86      142.28
    System         57.94        0.60
    Elapsed       102.02       72.66

    Note that almost a minute's worth of system time is eliminated and the
    program completes 28% faster on average.

    To illustrate the problem without python, this is a basic test case for
    the worst case scenario where every free is a madvise followed by an alloc:

    /* gcc bench-free.c -lpthread -o bench-free */
    #include <pthread.h>
    #include <stdlib.h>

    static int num = 1024;

    void __attribute__((noinline,noclone)) dostuff (void *p)
    {
    }

    void *worker (void *data)
    {
      int i;

      for (i = num; i--;)
        {
          void *m = malloc (48*4096);
          dostuff (m);
          free (m);
        }

      return NULL;
    }

    int main()
    {
      pthread_t t;
      void *ret;
      if (pthread_create (&t, NULL, worker, NULL))
        exit (2);
      if (pthread_join (t, &ret))
        exit (3);
      return 0;
    }

    Before the patch, this resulted in 1024 calls to madvise. With the patch applied,
    madvise is called twice because the default trim threshold is high enough to avoid
    this.

    This is a more complex case where there is a mix of frees. It's simply a different
    worker function for the test case above:

    void *worker (void *data)
    {
      int i;
      int j = 0;
      void *free_index[num];

      for (i = num; i--;)
        {
          void *m = malloc ((i % 58) * 4096);
          dostuff (m);
          if (i % 2 == 0) {
            free (m);
          } else {
            free_index[j++] = m;
          }
        }
      /* j is one past the last saved pointer, so step back first.  */
      for (j--; j >= 0; j--)
        {
          free (free_index[j]);
        }

      return NULL;
    }

    glibc 2.21 calls madvise 90305 times but with the patch applied, it's
    called 13438 times. Increasing the trim threshold will decrease the number
    of times it's called, with the option of eliminating the overhead.
    
    ebizzy is meant to generate a workload resembling common web application
    server workloads. It is threaded with a large working set that at its core
    has an allocation, do_stuff, free loop that also hits this case. The primary
    metric of the benchmark is records processed per second. This is running on
    my desktop, which is a single socket machine with an i7-4770 and 8 cores.
    Each thread count was run for 30 seconds. It was only run once as the
    performance difference is so high that the variation is insignificant.

                    glibc 2.21              patch
    threads 1            10230              44114
    threads 2            19153              84925
    threads 4            34295             134569
    threads 8            51007             183387

    Note that the saving happens to be a coincidence as the size allocated
    by ebizzy was less than the default threshold. If a different number of
    chunks were specified then it may also be necessary to tune the threshold
    to compensate.

    This roughly quadruples the performance of this benchmark. The difference
    in system CPU usage illustrates why.

    ebizzy running 1 thread with glibc 2.21
    10230 records/s 306904
    real 30.00 s
    user  7.47 s
    sys  22.49 s

    22.49 seconds was spent in the kernel for a workload running 30 seconds.
    With the patch applied:

    ebizzy running 1 thread with patch applied
    44126 records/s 1323792
    real 30.00 s
    user 29.97 s
    sys   0.00 s

    System CPU usage was zero with the patch applied. strace shows that glibc
    running this workload calls madvise approximately 9000 times a second. With
    the patch applied, madvise was called twice during the workload (or 0.06
    times per second).

    2015-02-10  Mel Gorman  <mgorman@suse.de>

      [BZ #17195]
      * malloc/arena.c (free): Apply trim threshold to per-thread heaps
        as well as the main arena.
Index: glibc-2.17-c758a686/malloc/arena.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/arena.c
+++ glibc-2.17-c758a686/malloc/arena.c
@@ -661,7 +661,7 @@ heap_trim(heap_info *heap, size_t pad)
   unsigned long pagesz = GLRO(dl_pagesize);
   mchunkptr top_chunk = top(ar_ptr), p, bck, fwd;
   heap_info *prev_heap;
-  long new_size, top_size, extra, prev_size, misalign;
+  long new_size, top_size, top_area, extra, prev_size, misalign;
 
   /* Can this heap go away completely? */
   while(top_chunk == chunk_at_offset(heap, sizeof(*heap))) {
@@ -695,9 +695,16 @@ heap_trim(heap_info *heap, size_t pad)
     set_head(top_chunk, new_size | PREV_INUSE);
     /*check_chunk(ar_ptr, top_chunk);*/
   }
+
+  /* Uses similar logic for per-thread arenas as the main arena with systrim
+     by preserving the top pad and at least a page.  */
   top_size = chunksize(top_chunk);
-  extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if(extra < (long)pagesz)
+  top_area = top_size - MINSIZE - 1;
+  if (top_area <= pad)
+    return 0;
+
+  extra = ALIGN_DOWN(top_area - pad, pagesz);
+  if ((unsigned long) extra < mp_.trim_threshold)
     return 0;
   /* Try to shrink. */
   if(shrink_heap(heap, extra) != 0)
Index: glibc-2.17-c758a686/malloc/malloc.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/malloc.c
+++ glibc-2.17-c758a686/malloc/malloc.c
@@ -236,6 +236,8 @@
 /* For va_arg, va_start, va_end.  */
 #include <stdarg.h>
 
+/* For ALIGN_UP.  */
+#include <libc-internal.h>
 
 /*
   Debugging: