Description: Makes trimming work consistently across arenas.
Author: Mel Gorman <mgorman@suse.de>
Origin: git://sourceware.org/git/glibc.git
Bug-RHEL: N/A
Bug-Fedora: N/A
Bug-Upstream: #17195
Upstream status: committed

Part of commit 8a35c3fe122d49ba76dff815b3537affb5a50b45 is also included
to allow the use of ALIGN_UP within malloc/arena.c.
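
For reference, ALIGN_UP and ALIGN_DOWN are power-of-two alignment helpers
from <libc-internal.h>; a minimal sketch of their definitions (size must be
a power of two):

  #define ALIGN_DOWN(base, size) ((base) & -((__typeof__ (base)) (size)))
  #define ALIGN_UP(base, size)   ALIGN_DOWN ((base) + (size) - 1, (size))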

commit c26efef9798914e208329c0e8c3c73bb1135d9e3
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Apr 2 12:14:14 2015 +0530

    malloc: Consistently apply trim_threshold to all heaps [BZ #17195]

    Trimming heaps is a balance between saving memory and the system overhead
    required to update page tables and discard allocated pages. The malloc
    option M_TRIM_THRESHOLD is a tunable that lets users decide where this
    balance point lies, but it is only applied to the main arena.

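    The threshold is set with mallopt; a minimal sketch (the 64MB value is
    illustrative only, and note that an explicit M_TRIM_THRESHOLD setting
    also disables glibc's dynamic mmap/trim threshold adjustment):

        #include <malloc.h>

        int main (void)
        {
          /* Keep up to 64MB free at the top of a heap before trimming
             is considered.  mallopt returns 0 on failure.  */
          if (mallopt (M_TRIM_THRESHOLD, 64 * 1024 * 1024) == 0)
            return 1;
          /* ... allocation-heavy workload runs here ... */
          return 0;
        }

    The same value can be set without recompiling via the
    MALLOC_TRIM_THRESHOLD_ environment variable.
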
    For scalability reasons, glibc malloc has per-thread heaps, but these are
    shrunk with madvise() whenever there is as little as one free page at the
    top of the heap. In some circumstances this can lead to high system
    overhead if a thread has a control flow like

        while (data_to_process) {
            buf = malloc(large_size);
            do_stuff();
            free(buf);
        }

    For a large size, the free() will call madvise (page table teardown, page
    freeing and TLB flush) every time, followed immediately by a malloc (fault,
    kernel page allocation, zeroing and charge accounting). The kernel overhead
    can dominate such a workload.

    This patch allows the user to tune when madvise gets called by applying
    the trim threshold to the per-thread heaps, using logic similar to the
    main arena when deciding whether to shrink. Likewise, if the dynamic
    brk/mmap thresholds are adjusted, the new values are obeyed by the
    per-thread heaps.

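    For reference, a hypothetical standalone model of the new trim decision
    (names mirror the heap_trim hunk below; the MINSIZE value here is
    illustrative, standing in for malloc's minimum chunk size):

        #define MINSIZE 32   /* illustrative; the real value is per-arch */
        #define ALIGN_DOWN(base, size) \
          ((base) & -((__typeof__ (base)) (size)))

        static int
        should_trim (long top_size, long pad, long pagesz,
                     unsigned long trim_threshold)
        {
          long top_area = top_size - MINSIZE - 1;
          if (top_area <= pad)
            return 0;                  /* preserve the requested top pad */
          long extra = ALIGN_DOWN (top_area - pad, pagesz);
          /* Only shrink (and madvise) once at least trim_threshold bytes
             of whole pages can be released.  */
          return (unsigned long) extra >= trim_threshold;
        }
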
    Bug 17195 was a test case motivated by a problem encountered in scientific
    applications written in Python that performed badly due to high page-fault
    overhead. The basic operation of such a program was posted by Julian Taylor
    https://sourceware.org/ml/libc-alpha/2015-02/msg00373.html

    With this patch applied, the overhead is eliminated. All numbers in this
    report are in seconds and were recorded by running Julian's program 30
    times.

    pyarray
                                     glibc               madvise
                                      2.21                    v2
    System  min             1.81 (  0.00%)        0.00 (100.00%)
    System  mean            1.93 (  0.00%)        0.02 ( 99.20%)
    System  stddev          0.06 (  0.00%)        0.01 ( 88.99%)
    System  max             2.06 (  0.00%)        0.03 ( 98.54%)
    Elapsed min             3.26 (  0.00%)        2.37 ( 27.30%)
    Elapsed mean            3.39 (  0.00%)        2.41 ( 28.84%)
    Elapsed stddev          0.14 (  0.00%)        0.02 ( 82.73%)
    Elapsed max             4.05 (  0.00%)        2.47 ( 39.01%)

                   glibc     madvise
                    2.21          v2
    User          141.86      142.28
    System         57.94        0.60
    Elapsed       102.02       72.66

    Note that almost a minute's worth of system time is eliminated and the
    program completes 28% faster on average.

    To illustrate the problem without Python, this is a basic test case for
    the worst-case scenario where every free triggers a madvise followed
    immediately by an alloc:

    /* gcc bench-free.c -lpthread -o bench-free */
    #include <pthread.h>
    #include <stdlib.h>

    static int num = 1024;

    void __attribute__((noinline,noclone)) dostuff (void *p)
    {
    }

    void *worker (void *data)
    {
      int i;

      for (i = num; i--;)
        {
          void *m = malloc (48 * 4096);
          dostuff (m);
          free (m);
        }

      return NULL;
    }

    int main()
    {
      pthread_t t;
      void *ret;
      if (pthread_create (&t, NULL, worker, NULL))
        exit (2);
      if (pthread_join (t, &ret))
        exit (3);
      return 0;
    }

    Before the patch, this resulted in 1024 calls to madvise. With the patch
    applied, madvise is called twice because the default trim threshold is
    high enough to avoid this.

    This is a more complex case with a mix of frees. It is simply a different
    worker function for the test case above:

    void *worker (void *data)
    {
      int i;
      int j = 0;
      void *free_index[num];

      for (i = num; i--;)
        {
          void *m = malloc ((i % 58) * 4096);
          dostuff (m);
          if (i % 2 == 0) {
            free (m);
          } else {
            free_index[j++] = m;
          }
        }
      /* j holds the count of stored pointers, so start at the last
         valid index.  */
      for (j--; j >= 0; j--)
        {
          free (free_index[j]);
        }

      return NULL;
    }

    glibc 2.21 calls madvise 90305 times but with the patch applied, it's
    called 13438 times. Increasing the trim threshold will decrease the
    number of calls further, with the option of eliminating the overhead
    entirely.

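    For this test case the remaining calls could be eliminated by raising
    the threshold in main() before the worker thread is created; a minimal
    sketch (the 4MB value is illustrative only; needs <malloc.h>):

        /* Before pthread_create: keep up to 4MB free at the top of a
           per-thread heap before madvise is considered.  */
        mallopt (M_TRIM_THRESHOLD, 4 * 1024 * 1024);
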
    ebizzy is meant to generate a workload resembling common web application
    server workloads. It is threaded with a large working set that at its core
    has an allocation, do_stuff, free loop that also hits this case. The primary
    metric of the benchmark is records processed per second. This is running on
    my desktop, a single-socket machine with an i7-4770 (four cores, eight
    hardware threads). Each thread count was run for 30 seconds. It was only
    run once as the performance difference is so high that the variation is
    insignificant.

                    glibc 2.21              patch
    threads 1            10230              44114
    threads 2            19153              84925
    threads 4            34295             134569
    threads 8            51007             183387

    Note that the saving happens to be a coincidence, as the size allocated
    by ebizzy was less than the default threshold. If a different number of
    chunks were specified then it may also be necessary to tune the threshold
    to compensate.

    This roughly quadruples the performance of this benchmark. The difference
    in system CPU usage illustrates why.

    ebizzy running 1 thread with glibc 2.21
    10230 records/s 306904
    real 30.00 s
    user  7.47 s
    sys  22.49 s

    22.49 seconds were spent in the kernel for a workload running for 30
    seconds. With the patch applied:

    ebizzy running 1 thread with patch applied
    44126 records/s 1323792
    real 30.00 s
    user 29.97 s
    sys   0.00 s

    System CPU usage was zero with the patch applied. strace shows that glibc
    running this workload calls madvise approximately 9000 times a second;
    with the patch applied, madvise was called twice during the workload (or
    0.06 times per second).

    2015-02-10  Mel Gorman  <mgorman@suse.de>

      [BZ #17195]
      * malloc/arena.c (heap_trim): Apply trim threshold to per-thread heaps
        as well as the main arena.

Index: glibc-2.17-c758a686/malloc/arena.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/arena.c
+++ glibc-2.17-c758a686/malloc/arena.c
@@ -661,7 +661,7 @@ heap_trim(heap_info *heap, size_t pad)
   unsigned long pagesz = GLRO(dl_pagesize);
   mchunkptr top_chunk = top(ar_ptr), p, bck, fwd;
   heap_info *prev_heap;
-  long new_size, top_size, extra, prev_size, misalign;
+  long new_size, top_size, top_area, extra, prev_size, misalign;
 
   /* Can this heap go away completely? */
   while(top_chunk == chunk_at_offset(heap, sizeof(*heap))) {
@@ -695,9 +695,16 @@ heap_trim(heap_info *heap, size_t pad)
     set_head(top_chunk, new_size | PREV_INUSE);
     /*check_chunk(ar_ptr, top_chunk);*/
   }
+
+  /* Uses similar logic for per-thread arenas as the main arena with systrim
+     by preserving the top pad and at least a page.  */
   top_size = chunksize(top_chunk);
-  extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if(extra < (long)pagesz)
+  top_area = top_size - MINSIZE - 1;
+  if (top_area <= pad)
+    return 0;
+
+  extra = ALIGN_DOWN(top_area - pad, pagesz);
+  if ((unsigned long) extra < mp_.trim_threshold)
     return 0;
   /* Try to shrink. */
   if(shrink_heap(heap, extra) != 0)
Index: glibc-2.17-c758a686/malloc/malloc.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/malloc.c
+++ glibc-2.17-c758a686/malloc/malloc.c
@@ -236,6 +236,8 @@
 /* For va_arg, va_start, va_end.  */
 #include <stdarg.h>
 
+/* For ALIGN_UP.  */
+#include <libc-internal.h>
 
 /*
   Debugging: