Archive

Posts Tagged ‘concurrency’

Concurrency Support

February 3, 2012 Leave a comment

Assuming that I remain a lowly, banana-eating programmer and I don’t catch the wanna-be-uh-manager-supervisor-director-executive fever, I’m excited about the new features and library additions provided in the C++11 standard.

Specifically, I’m thrilled by the support for “dangerousmulti-threaded programming that C++11 serves up.

For more info on the what, why, and how of these features and library additions, check out Scott Meyers’ pre-book training package, Anthony Williams’ new book, and Bjarne’s C++11 FAQ page.

Persistent Discomfort

November 3, 2011 Leave a comment

As part of the infrastructure of the distributed, multi-process, multi-threaded system that my team is developing, a parameterized, mutex protected, inter-thread message queue class has been written and dropped into a general purpose library. To unburden application component developers from having to do it, the library-based queue class manages a reusable pool of message buffers that functionally “flow” from one thread to the next.

On the “push” side of the queue, usage is as follows:

  • Thread acquires a handle to the next empty Message buffer
  • Thread fills Message buffer
  • Thread returns handle to the queue (push)

On the “pop” side of the queue, usage is as follows:

  • Thread acquires a handle to the next full Message buffer (pop)
  • Thread processes the Message content
  • Thread returns handle to the queue

So far, so good, right? I thought so too – at the beginning of the project. But as I’ve moved forward during the development of my application component, I’ve been experiencing a growing and persistent discomfort. D’oh!

Using the figure below, I’m gonna share the cause of my “inner thread” discomfort with you.

In order to functionally process an input message and propagate it forward, the inner thread must do the following work:

  • Acquire a handle to the next input Message buffer from queue 1 (pop)
  • Acquire a handle to the next empty output Message buffer from queue 2
  • Utilize the content of the Message from queue 1 to compute/fill in the Message to queue 2
  • Return the handle of the input message to queue 1
  • Return the handle of the output message to queue 2 (push)

For small messages and/or when the messages are of different types, I don’t see much wrong with this inter-thread message passing approach. However, when the messages are big and of the same type, my discomfort surfaces. In this case (as we shall see), the “utilize” bullet amounts to an unnecessary copy. The more “inner” threads there are in the pipeline, the more performance degradation there is from unnecessary copies.

So, how can the copies be eliminated and system performance increased? One way, as the figure below shows, is to move message buffer management responsibility out of the local queue class and into a global, shared message pool class.

In this memory-less queue design, the two pipeline end point threads explicitly assume the responsibility of acquiring and releasing the Message buffer handles from the mutex protected, shared message pool. The first thread “acquires” and the last thread “releases” message buffer handles. Each inner thread, i, in the pipeline performs the following work:

  • Pop the handle to the next input Message buffer from queue i-1
  • Process the message
  • Push the Message buffer handle to queue i

The key to avoiding unessential inner thread copies is that the messages must be intentionally designed to be of the same type.

As soon as I get some schedule breathing room (which may be never), I’m gonna refactor my application infrastructure design and rewrite the code to implement the memoryless queue + global message pool approach. That is, unless someone points out a fatal flaw in my reasoning and/or presents a superior inter-thread message communication pattern.

The Parallel Patterns Library

August 24, 2011 Leave a comment

Kate Gregory doesn’t work for Microsoft, but she’s a certified MVP for C++. In her talk at Tech Ed North America, “Modern Native C++ Development for Maximum Productivity”, she introduced the Parallel Patterns Library (PPL) and the concurrency runtime provided in Microsoft Visual C++ 2010.

The graphic below shows how a C++ developer interfaces with the concurrency runtime via the facilities provided in the PPL. Casting aside the fact that the stack only runs on Windows platforms, it’s the best of both worlds – parallelism and concurrency.

Note that although they are provided via the “synchronization types” facilities in the PPL, writing multi-threaded programs doesn’t require programmers to use error-prone locks. The messaging primitives (for inter-process communication) and concurrent collections (for intra-process communication) provide easy-to-use abstractions over sockets and locks programming. The messy, high-maintenance details are buried out of programmer sight inside the concurrency runtime.

I don’t develop for Microsoft platforms, but if I did, it would be cool to use the PPL. It would be nice if the PPL + runtime stack was platform independent. But the way Microsoft does business, I don’t think we’ll ever see linux or RTOS versions of the stack. Bummer.

DataLoggerThread

August 8, 2011 Leave a comment

The figure below models a program in which a pipeline of worker threads communicate with each other via message passing. The accordion thingies ‘tween the threads are message queues that keep the threads loosely coupled and prevent message bursts from overwhelming  downstream threads.

During the process of writing one of these multi-threaded programs to handle bursty, high rate, message streams, I needed a way to periodically extract state information from each thread so that I could “see” and evaluate what the hell was happening inside the system during runtime. Thus, I wrote a generic “Data Logger” thread and added periodic state reporting functionality to each worker thread to round out the system:

Because the reporting frequency is low (it’s configurable for each worker thread and the default value is once every 5 seconds) and the state report messages are small, I didn’t feel the need to provide a queue per worker thread – YAGNI.

The figure below shows a more detailed design model of the data logging facility in the form of a bent” UML class diagram. Upon construction, each DataLoggerThread object can be configured to output state messages to a user named disk file and/or the global console during runtime. The rate at which a DataLoggerThread object “pops” state report messages from its input queue is also configurable.

The DataLoggerThread class provides two different methods of access to user code at runtime:

void DataLoggerThread::record_txt_block(const Data&)

and

void DataLoggerThread::operator<<(const Data&).

Objects of the DataLoggerThread class run in their own thread of execution – transparently in the background to mainline user code. On construction, each object instance creates a mutex-protected, inter-thread queue and auto-starts its own thread of operation behind the scenes. On destruction, the object gracefully self-terminates. During runtime, each DataLoggerThread object polls its input queue and formats/writes the queue entries to the global console (which is protected from simultaneous, multiple thread access by a previously developed CoutMonitor class) and/or to a user-named disk log file. The queue is drained of all entries on each (configurable,) periodic activation by the underlying (Boost) threads library.

DataLoggerThread objects pre-pend a “milliseconds since midnight” timestamp to each log entry just prior to being pushed onto the queue and a date-time stamp is pre-pended to each user supplied filespec so that file access collisions don’t occur between multiple instances of the class.

That’s all I’m gonna disclose for now, but that’s OK because every programmer who writes soft, real-time, multi-threaded code has their own homegrown contraption, no?

My Erlang Learning Status – IV

July 21, 2011 4 comments

I haven’t progressed forward at all on my previously stated goal of learning how to program idiomatically in Erlang. I’m still at the same point in the two books (“Erlang And OTP In Action“; “Erlang Programming“) that I’m using to learn the language and I’m finding it hard to pick them up and move forward.

I’m still a big fan (from afar) of Erlang and the Erlang community, but my initial excitement over discovering this terrific language has waned quite a bit. I think it’s because:

  1. I work in C++ everyday
  2. C++11 is upon us and learning it has moved up to number 1 on my priority list.
  3. There are no Erlang projects in progress or in the planning stages where I work. Most people don’t even know the language exists.

Because of the excuses, uh, reasons above, I’ve lowered my expectations. I’ve changed my goal from “learning to program idiomatically” in Erlang to finishing reading the two terrific books that I have at my disposal.

Note: If you’re interested in reading my previous Erlang learning status reports, here are the links:

Centralized, Federated, Decentralized

1 Prelude

A colleague on LinkedIn.com pointed me toward this Doug Schmidt, et al, paper: “Evaluating Technologies for Tactical Information Management in Net-Centric Systems“. In it, Doug and crew qualitatively (scalability, availability, configurability) and quantitatively (latency, jitter) evaluate three different architectural implementations of the Object Management Group‘s (OMG) Data Distribution Service (DDS): centralized, federated, and decentralized.

The stacked trio of figures below model the three DDS architecture types. They’re slightly enhanced renderings of the sketches in the paper.

2 Quantitative Comparisons

DDS was specifically designed to meet the demanding latency and jitter (the standard deviation of latency) performance attributes that are characteristic of streaming, Distributed Real-Time Event (DRE) systems like defense and air traffic control radars. Unlike most client-server, request-response systems, if data required for human or computer decision-making is not made available in a timely fashion, people could die. It’s as simple and potentially horrible as that.

Applying the systems thinking idiom of “purposeful, selective ignorance“, the pics below abstract away the unimportant details of the pics above so that the architecture types can be compared in terms of latency and jitter performance.

By inspecting the figures, it’s a no brainer, right? The steady-state latency and jitter performance of the decentralized and centralized architectures should exceed that of the federated architecture. There is no “middleman“, a daemon, for application layer data messages to pass through.

Sure enough, on their two node test fixture (I don’t know why they even bothered with the one node fixture since that really isn’t a “distributed” system in my mind) , the Schmidt et al measurements indicate that the latency/jitter performance of the decentralized and centralized architectures exceed that of the federated architecture. The performance difference that they measured was on the order of 2X.

3 Qualitative Comparisons

In all distributed systems, both DRE and Client-Server types, achieving high operational availability is a huge challenge. Hell, when the system goes bust, fuggedaboud the timeliness of the data, no freakin’ work can get done and panic can and usually does set in. D’oh!

With that scary aspect in mind, let’s look at each of the three architectures in terms of their ability to withstand faults.

3.1 Decentralized Architecture Availability

In a decentralized architecture, there are no invasive daemons that “leak” into the application plane so we can’t talk about daemon crashes. Thus, right off the bat we can “arguably” say that a decentralized architecture is more resistant to faults than the federated or centralized architectures.

As the picture below shows, when a user application layer process dies, the others can continue to communicate with each other. Depending on what the specific application is required to do during operation, at least some work may be able to still get accomplished even though one or more app components go kaput.

3.2 Federated Architecture Availability

In a federated architecture, when a daemon process dies, a whole node and all the subscriber application user processes running on it are severed from communicating with the user processes running on the other nodes (see the sketch below) in the system. Thus, the federated architecture is “arguably” less fault tolerant than the decentralized and (as we’ll see) centralized architectures. However, through judicious “allocation” of user processes to nodes (the fewer the better – which sort of defeats the purpose of choosing a federation for per node intra-communication performance optimization), some work still may be able to get accomplished when a node’s daemon crashes.

If the node daemons stay viable but a user application dies, then the behavior of a federated architecture, DDS-based system is the same as that of a decentralized architecture

3.3 Centralized Architecture Availability

Finally, we come to the robustness of a centralized DDS architecture. As shown below, since the single daemon overlord in the system is not (or should not be) involved in inter-process application layer data communications, if it crashes, then the system can continue to do its full workload. When a user process crashes instead of, or in addition to, the daemon, then the system’s behavior is the same as a decentralized architecture.

4 BD00 Commentary

Because he works on data streaming DRE radar systems, Jimmy likes, I mean BD00 likes, the DDS pub-sub architectural style over broker-based, distributed communication technologies like C/S CORBA and JMS queues. It should be obvious that the latter technologies are not a good match for high availability and low latency DRE applications. Thus, trying to jam fit a new DRE application into a CORBA or JMS communication platform “just because we have one” is a dumb-ass thing to do and is sure to lead to high downstream maintenance costs and a quicker route to archeosclerosis.

Within the DDS space, BD00 prefers the decentralized architecture over the federated and centralized styles because of the semi-objective conclusions arrived at and documented in this post.

Using The New C++ DDS API

June 10, 2011 2 comments

PrismTech‘s Angelo Corsaro is a passionate and tireless promoter of the OMG’s Data Distribution Service (DDS) distributed system communication middleware technology. (IMHO, the real power of DDS over other messaging services and messaging-based languages like Erlang (which I love) is the rich set of Quality of Service (QoS) settings it supports). In his terrific “The Present and Future of DDS” pitch, Angelo introduces the new, more streamlined,  C++ and Java DDS APIs.

The UML activity diagram below illustrates the steps for setting up and sending (receiving) topic samples over DDS: define a domain; create a domain participant; create a QoS configured, domain-resident publisher (or subscriber); create a QoS configured and type-safe topic; create a QoS configured, publisher-resident, and topic-specific dataWriter; and then go!

Concrete C++ implementations of the activity diagram, snipped from Angelo’s presentation, are presented below. Note the incorporation of overloaded insertion operators and class templates in the new API. Relatively short and sweet, no?

Even though the sample code shows non-blocking, synchronous send/receive usage, DDS, like any other communication service worth its salt, provides API support for blocked synchronous and asynchronous notification usage.

So, what are you waiting for? Skidaddle over to PrismTech’s OpenSplice DDS page, download the community edition libraries for your platform, and start experimenting!

%d bloggers like this: