Stack, Heap, Pool

Update 10/7/15: Because this post lacked a context-setting introduction, I wrote a prequel for it after the fact. Please read the prequel before reading this post.

Three Ways

The figure below shows three ways of allocating memory from within a C++ application: the stack, the heap, and a custom-written pool.

[Figure: the three memory allocation methods – stack, heap, and custom pool]

Comparing Performance

The code below measures the performance of the three allocation methods. It computes and prints the time to allocate 1000 1MB objects from the stack, from the C++ free store, and from a custom-written memory pool (the implementation of Pool appears at the end of this post).

#include <chrono>
#include <cstdint> //for int32_t
#include <iostream>
#include <array>
#include "Pool.h"

class BigObject {

public:

  //Each BigObject occupies ~1MB of RAM
  static const int32_t NUM_BYTES{1000 * 1024};

  void reset() {} //Required by the Pool class template

  void setFirstChar(char c) {byteArray[0] = c;}

private:

  std::array<char, NUM_BYTES> byteArray{};

};

//Compare the performance of three types of memory allocation:
//Stack, free store (heap), custom-written pool
int main() {

  using namespace std::chrono; //Reduce verbosity of subsequent code
  constexpr int32_t NUM_ALLOCS{1000};
  void processObj(BigObject* obj); //Declared here, defined after main()

  //Time heap allocations
  auto ts = system_clock::now();
  for(int32_t i=0; i<NUM_ALLOCS; ++i) {
    BigObject* obj{new BigObject{}};
    //Do stuff.......
    processObj(obj);
  }
  auto te = system_clock::now();
  std::cout << "Heap Allocations took "
            << duration_cast<milliseconds>(te-ts).count()
            << " milliseconds"
            << std::endl;

  //Time stack allocations
  ts = system_clock::now();
  for(int32_t i=0; i<NUM_ALLOCS; ++i) {
    BigObject obj{};
    //Do stuff.......
    processObj(&obj);
  } //obj is auto-destroyed here
  te = system_clock::now();
  std::cout << "Stack Allocations took "
            << duration_cast<milliseconds>(te-ts).count()
            << " milliseconds"
            <<  std::endl;

  //Pool construction time
  ts = system_clock::now();
  Pool<BigObject, NUM_ALLOCS> objPool;
  te = system_clock::now();
  std::cout << "Pool construction took "
            << duration_cast<milliseconds>(te-ts).count()
            << " milliseconds"
            <<  std::endl;

  //Time pool allocations
  ts = system_clock::now();
  for(int32_t i=0; i<NUM_ALLOCS; ++i) {
    BigObject* poolObj{objPool.acquire()};
    //Do Stuff.......
    processObj(poolObj);
  }
  te = system_clock::now();
  std::cout << "Pool Allocations took "
            << duration_cast<milliseconds>(te-ts).count()
            << " milliseconds"
            << std::endl;

}

void processObj(BigObject* obj) {

  obj->setFirstChar('a');

}

The results for a typical program run (gcc 4.9.2, Cygwin/Windows 10) are shown below:

[Figure: benchmark results for the -O0 (Debug) and -O3 (Release) builds]

As expected, stack allocation is much faster than heap allocation (about 3x in this case) because a stack allocation is just a pointer adjustment; there is no free-block search and no such thing as a “fragmented” stack. The interesting result is how much faster custom pool allocation is than either of the two “standard” out-of-the-box methods. Since each object is 1MB in size, and all of the memory managed by the pool is allocated only once (during construction of the pool object), dynamically acquiring and releasing pointers to the pre-constructed pool objects at runtime is blazingly fast compared to creating a lumbering 1MB object every time the user code needs one.

After you peruse the Pool implementation that follows, I’d love to hear your comments on this post. I’d also appreciate it if you could try to replicate the results I got.

The Pool Class Template – std::array Implementation

Here is my implementation of the Pool class used in the above test code. Note that the Pool class does not use any dynamically resizable STL containers in its implementation. It uses std::array, the smarter C++ counterpart of the plain ole statically allocated C array. Some safety-critical domains require that no dynamic memory allocation be performed post-initialization – all memory must be allocated either at compile time or during program initialization.

//Pool.h
#ifndef POOL_H_
#define POOL_H_

#include <array>
#include <algorithm>
#include <cstdint> //for int32_t
#include <memory>

// The OBJ template argument supplied by the user
// must have a publicly defined default ctor and a publicly defined
// OBJ::reset() member function that cleans the object innards so that it
// can be reused again (if needed).
template<typename OBJ, int32_t NUM_OBJS>
class Pool {

public:

  //RAII: Fill up the _available array with newly allocated objects.
  //Since all objects are available for the user to "acquire"
  //on initialization, fill up the _inUse array with nullptrs.
  Pool(){

    //Fill the _available array with default-constructed objects
    std::for_each(std::begin(_available), std::end(_available),
        [](OBJ*& mem){mem = new OBJ{};});

    //Since no objects are currently in use, fill the _inUse array 
    //with nullptrs
    std::for_each(std::begin(_inUse), std::end(_inUse),
        [](OBJ*& mem){mem = nullptr;});

  }

  //Allows a user to acquire an object to manipulate as she wishes
  OBJ* acquire() {

    //Try to find an available object
    auto iter = std::find_if(std::begin(_available), std::end(_available),
        [](OBJ* mem){return (mem not_eq nullptr);});

    //Is the pool exhausted?
    if(iter == std::end(_available)) {

      return nullptr; //Replace this with a "throw" if you need to.

    }
    else {

      //Whoo hoo! An object is available for usage by the user
      OBJ* mem{*iter};

      //Mark it as in-use in the _available array
      *iter = nullptr;

      //Find a spot for it in the _inUse list
      iter = std::find(std::begin(_inUse), std::end(_inUse), nullptr);
  
      //Insert it into the _inUse list. We don't
      //have to check for iter == std::end(_inUse)
      //because we are fully in control!
      *iter = mem;

      //Finally, return the object to the user
      return mem;

    }

  }

  //Returns an object to the pool
  void release(OBJ* obj) {

    //Ensure that the object is indeed in-use. If the user
    //hands us an address that we don't own and we blindly
    //accept it, we could leak and double-delete memory
    //during Pool destruction.
    auto iter = std::find(std::begin(_inUse), std::end(_inUse), obj);

    //Simply return control back to the user
    //if we didn't find the object in our
    //in-use list
    if(iter == std::end(_inUse)) {

        return;

    }
    else {

      //Clear out the innards of the object being returned
      //so that the next user doesn't have to "remember" to do it.
      obj->reset();

      //Temporarily hold onto the object while we mark its
      //slot as available in the _inUse array
      OBJ* mem{*iter};

      //Mark it!
      *iter = nullptr;

      //Find a spot for it in the _available array
      iter = std::find(std::begin(_available), std::end(_available), nullptr);

      //Insert the object's address into the _available array
      *iter = mem;
    }

  }

  ~Pool() {

    //Cleanup after ourself: RAII
    std::for_each(std::begin(_available), std::end(_available),
        std::default_delete<OBJ>());

    std::for_each(std::begin(_inUse), std::end(_inUse),
        std::default_delete<OBJ>());

  }

private:

  //Keeps track of all objects available to users
  std::array<OBJ*, NUM_OBJS> _available;

  //Keeps track of all objects in-use by users
  std::array<OBJ*, NUM_OBJS> _inUse;

};

#endif /* POOL_H_ */

The Pool Class Template – std::multiset Implementation

For grins, I also implemented the Pool class below using std::multiset to keep track of the available and in-use objects. For large numbers of objects, the acquire()/release() calls may be faster than their peers in the std::array implementation, but the implementation violates the “no post-initialization dynamically allocated memory allowed” requirement – if you have one.

Since the implementation does perform dynamic memory allocation at runtime, it would probably make more sense to pass the size of the pool via the constructor rather than as a template argument. If so, then functionality can be added to allow users to resize the pool at runtime if needed.

//Pool.h
#ifndef POOL_H_
#define POOL_H_

#include <algorithm>
#include <cstdint> //for int32_t
#include <set>
#include <memory>

// The OBJ template argument supplied by the user
// must have a publicly defined default ctor and a publicly defined
// OBJ::reset() member function that cleans the object innards so that it
// can be reused again (if needed).
template<typename OBJ, int32_t NUM_OBJS>
class Pool {

public:

  //RAII: Fill up the _available set with newly allocated objects.
  //Since all objects are available for the user to "acquire"
  //on initialization, fill up the _inUse set with nullptrs.
  Pool(){

    for(int32_t i=0; i<NUM_OBJS; ++i) {

      _available.insert(new OBJ{});
      _inUse.insert(nullptr);

    }

  }

  //Allows a user to acquire an object to manipulate as she wishes
  OBJ* acquire() {

    //Try to find an available object
    auto iter = std::find_if(std::begin(_available), std::end(_available),
        [](OBJ* mem){return (mem not_eq nullptr);});

    //Is the pool exhausted?
    if(iter == std::end(_available)) {

      return nullptr; //Replace this with a "throw" if you need to.

    }
    else {

      //Whoo hoo! An object is available for usage by the user
      OBJ* mem{*iter};

      //Mark it as in-use in the _available set
      _available.erase(iter);
      _available.insert(nullptr);

      //Find a spot for it in the _inUse set
      iter = std::find(std::begin(_inUse), std::end(_inUse), nullptr);

      //Insert it into the _inUse set. We don't
      //have to check for iter == std::end(_inUse)
      //because we are fully in control!
      _inUse.erase(iter);
      _inUse.insert(mem);

      //Finally, return the object to the user
      return mem;

    }

  }

  //Returns an object to the pool
  void release(OBJ* obj) {

    //Ensure that the object is indeed in-use. If the user
    //hands us an address that we don't own and we blindly
    //accept it, we could leak and double-delete memory
    //during Pool destruction.
    auto iter = std::find(std::begin(_inUse), std::end(_inUse), obj);

    //Simply return control back to the user
    //if we didn't find the object in our
    //in-use set
    if(iter == std::end(_inUse)) {

        return;

    }
    else {

      //Clear out the innards of the object being returned
      //so that the next user doesn't have to "remember" to do it.
      obj->reset();

      //Temporarily hold onto the object while we mark its
      //slot as available in the _inUse set
      OBJ* mem{*iter};

      //Mark it!
      _inUse.erase(iter);
      _inUse.insert(nullptr);

      //Find a spot for it in the _available set
      iter = std::find(std::begin(_available), std::end(_available), nullptr);

      //Insert the object's address into the _available set
      _available.erase(iter);
      _available.insert(mem);
    }

  }

  ~Pool() {

    //Cleanup after ourself: RAII
    std::for_each(std::begin(_available), std::end(_available),
        std::default_delete<OBJ>());

    std::for_each(std::begin(_inUse), std::end(_inUse),
        std::default_delete<OBJ>());

  }

private:

  //Keeps track of all objects available to users
  std::multiset<OBJ*> _available;

  //Keeps track of all objects in-use by users
  std::multiset<OBJ*> _inUse;

};

#endif /* POOL_H_ */

Unit Testing

In case you were wondering, I used Phil Nash’s lightweight Catch unit testing framework to verify/validate the behavior of both Pool class implementations:

TEST_CASE( "Allocations/Deallocations", "[Pool]" ) {

    constexpr int32_t NUM_ALLOCS{2};
    Pool<BigObject, NUM_ALLOCS> objPool;

    BigObject* obj1{objPool.acquire()};
    REQUIRE(obj1 not_eq nullptr);

    BigObject* obj2 = objPool.acquire();
    REQUIRE(obj2 not_eq nullptr);

    objPool.release(obj1);
    BigObject* obj3 = objPool.acquire();
    REQUIRE(obj3 == obj1);

    BigObject* obj4 = objPool.acquire();
    REQUIRE(obj4 == nullptr);

    objPool.release(obj2);
    BigObject* obj5 = objPool.acquire();
    REQUIRE(obj5 == obj2);

}

==============================================
All tests passed (5 assertions in 1 test case)

C++11 Features

For fun, I listed the C++11 features I used in the source code:

    * auto
    * std::array<T, N>
    * std::begin(), std::end()
    * nullptr
    * lambda functions
    * Brace initialization { }
    * std::chrono::system_clock, std::chrono::duration_cast<T>
    * std::default_delete<T>

Did I miss any features/classes? How would the C++03 implementations stack up against these C++11 implementations in terms of readability/understandability?

  1. Artur
    September 15, 2015 at 5:55 am

    I’m sorry, but it seems to me that the wrong deduction has been made from the times each operation took, that is, stack, heap and pool. Pool “acquiring and manipulation” was split by you into two parts. In total it took 369 ms – the longest of all the methods. So what’s your point then? That as soon as you have allocated memory, moving pointers is fast? Surely that’s obvious? But nonetheless you must spend time on allocating memory on the heap. There is no way of avoiding that. I really don’t see the point of this exercise.

    • September 15, 2015 at 6:17 am

      I don’t think you understand the purpose of memory pools. Yes, pool construction takes a long time (unsurprisingly as long as allocating from the heap), but in a real program it only has to be performed once – on initialization. Thereafter, the application can acquire/release objects from the pool much more quickly than either of the other two techniques. Performance critical applications use memory pools all the time. Check out Wikipedia (https://en.wikipedia.org/wiki/Memory_pool), Boost.Pool (http://www.boost.org/doc/libs/1_59_0/libs/pool/doc/html/index.html) and do some more grokking on when/why pre-allocated memory pools are used in applications. Thanks for your comment.

  2. Andrew
    September 15, 2015 at 9:31 am

    I agree with Artur here. The comparison isn’t correct. There are four distinct steps that need to be considered: allocation, construction, destruction, and deallocation. In your test cases above, only the stack allocation test case is executing all four. The point of the pool allocators is typically to reduce the time taken during allocation and deallocation. In fact, you should refer to the links provided. They all mention allocation, not construction. Further, the speed up from pool allocation is typically realized with small objects (i.e. less than one cache line, depending on your hardware), not large objects.

    With your pool allocator you’ve proven that it may be useful to allocate and construct a large number of objects up front. You’ve proven that it’s a great optimization if that is the workflow of your application. However, this fails when you need to change the parameters that are passed into the constructor. The STL allocator concept allows you to do this because they recognize that allocators need to address the four steps listed above.

    • September 15, 2015 at 9:45 am

      I’m gonna go out on a limb here and assert that runtime allocation/de-allocation from pre-constructed pools beats standard heap and stack allocation/deallocation every time; and regardless of the size of the objects being used.

      • Andrew
        September 18, 2015 at 9:25 am

        “Further, the speed up from pool allocation is typically realized with small objects (i.e. less than one cache line, depending on your hardware), not large objects.” Not quite worded the way I intended it to be, but in the context given, I agree with your remark.

  3. Artur
    September 15, 2015 at 2:26 pm

    I’m gonna say that this:
    “Yes, pool construction takes a long time (unsurprisingly as long as allocating from the heap), but in a real program it only has to be performed once – on initialization.”
    is purely academic. I’ve never met a real world program which needs to do it just once in its whole lifetime.
    But that is not really important. The important thing here is that you do not present your information in one uniform way, and therefore it is misleading.
    Sure, the technique you’re using is useful (less often than you’d think, though) and, contrary to what you rather curtly suggested, I do understand what the purpose of memory pools is. The problem I have is with you presenting your information incorrectly.

    • September 15, 2015 at 2:33 pm

      I’ve encountered, and written, many real world programs that have to use memory pools (initialized once at startup) because of performance concerns. I’m working on one right now – that is what triggered the post.

      Well, I guess I should take the post down because it’s misleading and not useful. Beside Andrew, I wonder how many other readers agree with you. If I get enough of them then I will take the post down.

      Someone on the isocpp.org site (https://isocpp.org/blog/2015/09/stack-heap-pool-tony-bulldozer00-bd00-dasilva) linked to this post. Maybe you want to go there and tell them to take down the link?

      • September 16, 2015 at 2:35 am

        Tony, the article is good – please keep it 🙂

        Maybe you could be more explicit very early on what kind of application would benefit from this and why? (not for me, but for those who are disbelievers)

      • September 16, 2015 at 3:29 am

        Thank you. I agree, I should have written a more comprehensive introduction.

    • September 16, 2015 at 2:32 am

      Artur, believe me, this is standard procedure in my kinds of applications – you might just not have been exposed to that segment yet. Real-time communication software, like VoIP applications, for instance, uses and relies on this heavily for guaranteed performance.

  4. Artur
    September 15, 2015 at 3:29 pm

    A) I don’t want to go anywhere and tell anyone to take down anything

    B) I find your attitude unprofessional and mildly (and inappropriately) confrontational judging by your posts directed to me. I, in my first post, politely asked you (normal?) questions. Your reply was? I do not understand the subject. Because of that behavior of yours I do not wish to have any further encounters in any form with you.
    C) Try to avoid sounding hysterical – your second paragraph in your last post is great example of things you should avoid in order to not sound like that.

    Best regards

    • September 15, 2015 at 3:31 pm

      Thanks for the lecture. Time for both of us to move on.

  5. encyclopedist
    September 15, 2015 at 8:13 pm

    Doing any benchmarks with optimizations off is completely useless for C++.
    Your pool implementation is extremely strange and it is not what other people mean when talking about a “pool allocator”. A classic implementation would allocate one big contiguous region of memory (and not many separate small ones as you do). This is crucial for cache locality and also avoids any slow std::multisets for managing it. The classic way to keep free objects is a linked list. That way both allocation and deallocation operations are O(1), compared to O(n) for your pool.
    Even for this algorithm the implementation is low quality. Using std::find_if is highly suboptimal, because it is a linear search. You should use the find method instead.

    • September 16, 2015 at 3:35 am

      I agree. Since I was contemplating using a std::set of std::shared_ptr for my app instead of a pre-allocated pool, I wanted to get a quick feel for the performance difference between a pool and “new”. So I whipped up the pool implementations in relatively short order. Do you have an implementation(s) I can learn from? Thanks for the info.

      I re-ran the code with -O3 and posted both the -O0 and -O3 results. The relative performance differences between the 3 methods seem to be the same for both optimizer settings.

  6. Bartosz
    September 20, 2015 at 4:25 pm

    Nice post!
    I have just one aesthetic point. The source code of the multiset-based pool is incorrectly formatted, but that’s not a big deal.

    As @encyclopedist pointed out, if the pool will hold a big chunk of memory and provide objects on demand (slab allocation https://en.wikipedia.org/wiki/Slab_allocation) this might impact results but I don’t know how.

    There is also something which got me thinking. Making the Pool allocator multithread-friendly will not go unnoticed, because of the mutex. If we decide not to use mutexes on pool allocation, and use atomics instead, then the multiset implementation is out of the question. But then again, the use of atomics might still leave some footprint on the results. In fact, I think I might want to check that. Also, cache locality will play a role in a multithreaded environment.
    Unless you wanted to show the use of pool allocator only on single thread?

    • September 20, 2015 at 8:03 pm

      Thanks Bartosz. If you’re Bartosz Milewski, then I’m honored that you visited my blog. I think your concurrency videos are brilliant! You were born to teach.

      I really hadn’t considered the concurrency aspects of the pool implementations when I wrote the post. I’m actually quite surprised at the conversations that this post triggered.

      I think (hope) I reformatted the code so that it displays correctly now. I’ve had trouble rendering code on WordPress.com for years. It’s very frustrating. It looks fine one minute, and then it goes to hell on the next rendering.

      • Bartosz
        September 21, 2015 at 7:33 am

        I am not Bartosz Milewski. Thank you for your reply and edit.

      • September 21, 2015 at 8:25 am

        Ur welcome. Thanks for your helpful comments.

  7. October 2, 2015 at 11:26 am

    Hi,

    Thanks for the food for thought. I found your blog post thanks to CppCast and got curious.

    I cannot help but want to point out something in your test code that looks odd to me. Maybe I have missed something. But let me try to lay out my observations and conclusions.

    A/ In the stack allocation test you are measuring the time to create and destruct the BigObject on the same stack frame over and over again. On each pass the object gets created on the stack and then cleaned up at the end of the loop body. In the end the code will never have 1000 objects allocated in memory at the same time. So the code is measuring the following steps: (1) extend the virtual memory backing the stack by 1MB once. (2) construct the object in the same piece of memory 1000 times, initializing the memory each time. (3) destruct the object 1000 times.

    B/ In the heap allocation test you are allocating 1 MB more from the heap on each iteration. But you never delete the objects when the loop iteration ends. So at one point we have 1 GB of memory allocated on the heap. This test therefore measures the following steps: (1) allocate 1 MB of memory 1000 times. (2) construct an object in the allocated memory 1000 times. (3) go to the operating system to grow the heap.

    C/ In the pool allocator you allocate the same amount of memory as in the heap example, which is why the initialization time of the pool is comparable. In the actual test you are then acquiring all previously initialized objects, but never releasing any of them. This test therefore measures, during initialization, the following steps: (1) allocate 1 MB of memory 1000 times. (2) construct an object in the allocated memory 1000 times. (3) go to the operating system to grow the heap. (4) construct your management structures and perform additional memory allocations to add 1000 instances of your BigObject. During the actual usage of the pool allocator you are then measuring only the following step: (1) call a function that moves one pointer from one list to another list.

    D/ The implementation of the reset() method in BigObject is not actually doing a real reset of the content. The heap and stack solutions both guarantee memory that was cleaned up before you get your object, but the pool allocator does not. I do agree that as a developer you can make more informed choices in BigObject about what absolutely needs a content reset and what does not. But omitting this step here skews the results.

    IMHO you have to address the above issues for this micro benchmark to give us real-world information. Otherwise this test is measuring how long your OS requires to allocate 1 GB of memory, not how the memory management structures used (stack vs pool vs heap) impact your performance.

    E/ In all tests you are always allocating objects of the same size. I think there are very few applications that have objects of only one size. To make this measurement more real-world, you have to test with allocating objects of varying sizes. Only then can you actually measure the impact of heap fragmentation, and how to avoid the overhead of the heap management data structures by using an allocator that has specialized bucket sizes for a set number of objects. And then you can compare that to different memory allocators like e.g. gmalloc or tcmalloc, which to my knowledge implement some form of memory bucket mechanism.

    The results you posted for compiler optimization level -O3 reminded me of the talk Chandler Carruth gave at CppCon 2015 (https://www.youtube.com/watch?v=nXaxk27zwlk). I think the optimizer will eliminate a lot of the memory operations and the movements of the pointers from one list to the other, because you never actually read the memory again. I’m not really sure what is happening there, but I recommend looking at the assembly to see what the optimizer actually created from the code above.

    Hope to read an update from you soon!
    Cheers!

    • October 3, 2015 at 5:44 pm

      Thanks for your interest and your deep, thoughtful analysis, cceelen. I didn’t expect this post to generate as much interest as it did. For my current project, I needed to choose between using the free store or a pre-allocated pool of fixed-size, large objects (3D cartesian grids) that I will have to reuse over and over again at high speed during runtime. Using objects allocated/deallocated from the stack wasn’t even an option. As you point out, comparing stack allocation/deallocation with the two dynamic allocation schemes is an apples-to-oranges comparison. I simply threw it in because it was easy to do. Thanks again.

  8. April 24, 2017 at 2:54 pm

    Great article! I do a lot of 3D rendering, video/audio processing and optics calculations which can require a lot of repetition on data. Your post was quite illuminating from my perspective – thanks!!

