Sunday, October 26, 2014

Introduction to Multithreaded and Parallel Development - Part 3

Locks and Synchronizations

Thread synchronization is the act of notifying all executing code of the latest information regarding the resources they need; locks exist to make that coordination easier.

Why is locking bad?

- Priority inversion, when a low priority thread holds a lock that a high priority thread needs, and there is no easy way of telling it to let go.
- Convoying, when a thread that holds a lock is paused, all other threads are stuck waiting for it.
- Deadlocks, when two or more threads each hold a lock the other needs, so none of them can proceed.
 
We have several types of locks in the .NET framework, each one designed to address a different problem, but all locks are designed to give us a way to update a shared resource or to read a stateful resource, such as a file stream that requires seeking to a specific location before reading a value.

Monitor - a good general purpose locking and synchronization mechanism. It provides both locking and notification: Monitor.Enter and Monitor.Exit provide the locking, while Monitor.Wait and Monitor.Pulse or Monitor.PulseAll provide the notification part. Use it for relatively long term locking (more than a few instructions, or around I/O).
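As a sketch of how these pair up (the queue type and names here are made up for illustration), a minimal blocking queue can be built from Enter/Exit plus Wait/Pulse:

```csharp
using System.Collections.Generic;
using System.Threading;

// Monitor.Enter/Exit guard the shared queue; Monitor.Wait releases the
// lock while sleeping, and Monitor.Pulse wakes one waiting thread.
public class TinyBlockingQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly object _sync = new object();

    public void Enqueue(T item)
    {
        Monitor.Enter(_sync);
        try
        {
            _items.Enqueue(item);
            Monitor.Pulse(_sync); // wake one thread blocked in Dequeue
        }
        finally
        {
            Monitor.Exit(_sync);
        }
    }

    public T Dequeue()
    {
        Monitor.Enter(_sync);
        try
        {
            while (_items.Count == 0)
                Monitor.Wait(_sync); // releases the lock until a Pulse arrives
            return _items.Dequeue();
        }
        finally
        {
            Monitor.Exit(_sync);
        }
    }
}
```

Note the while loop around Wait: a woken thread must re-check the condition, since another thread may have consumed the item first.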

lock - a syntactic sugar for Monitor which is a type of scoped lock if you're familiar with C++ locking.

Interlocked - easy access to atomic, bus-locked operations; mostly used for Add, CompareExchange / Exchange and Increment/Decrement.
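For illustration, the common Interlocked operations look like this (the method and variable names are just for the example):

```csharp
using System.Threading;

public static class InterlockedDemo
{
    public static int Run()
    {
        int counter = 0;
        Interlocked.Increment(ref counter);   // counter = 1, atomically
        Interlocked.Add(ref counter, 10);     // counter = 11, atomically
        // Replace 11 with 100 only if the value is still 11; returns the old value.
        Interlocked.CompareExchange(ref counter, 100, 11);
        return counter;                       // 100
    }
}
```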

Mutex - mostly used for synchronization across processes; since this lock is serviced by the kernel, it's relatively slow.

SpinLock - a very fast lock, but CPU intensive as it uses a busy wait; use it only for very short durations, such as a few instructions for data structure modifications, variable value updates, etc.

ReaderWriterLock and ReaderWriterLockSlim are locks for managing read/write resources; use ReaderWriterLock for long term locking and ReaderWriterLockSlim for short term locking, as the slim lock spin-waits internally.

Semaphore and SemaphoreSlim can be used as locks, but are best used to limit how much concurrency a certain resource should get; for example, if you would like only 5 open files at the same time, you can use a semaphore to block the 6th request.
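A sketch of that 5-open-files example (the class, gate and method names are hypothetical):

```csharp
using System.Threading;

public static class FileGateDemo
{
    // Allow at most 5 threads inside the "open file" section at once.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(5, 5);

    public static void ProcessFile(string path)
    {
        Gate.Wait();        // the 6th caller blocks here until a slot frees up
        try
        {
            // ... open and read the file ...
        }
        finally
        {
            Gate.Release(); // always free the slot, even on failure
        }
    }
}
```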
 

Multithreading Implementations

.NET strives to provide us with tools for whichever multithreaded job we'd like; a few of them are:

Thread - a thread is a single execution unit. You can create a thread with new Thread(new ThreadStart(..)), Start it and Abort it; the abort can be caught inside the thread as a ThreadAbortException and then cancelled with ResetAbort. You can wait for a thread to end with Join; threads can give up CPU with Sleep or give up the rest of their time slice with Yield.
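Putting those pieces together in one small sketch (the work itself is a stand-in):

```csharp
using System;
using System.Threading;

public static class ThreadDemo
{
    public static bool Run()
    {
        bool finished = false;
        var worker = new Thread(new ThreadStart(() =>
        {
            try
            {
                Thread.Sleep(10);    // stand-in for real work
                finished = true;
            }
            catch (ThreadAbortException)
            {
                // Abort() lands here; Thread.ResetAbort() would cancel it
            }
        }));
        worker.Name = "Worker";      // named threads are easier to debug
        worker.Start();
        worker.Join();               // block until the thread ends
        return finished;             // true
    }
}
```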

Threadpool - a threadpool is useful if you want to queue work items (QueueUserWorkItem) for execution; you can then be notified when they are done if you'd like.

BackgroundWorker - A wrapper for common threading concepts, it creates a thread, you can pass the code to execute with DoWork and have events for RunWorkerCompleted and ProgressChanged, you can also CancelAsync the job. You start the background worker with RunWorkerAsync and get results and exceptions inside the RunWorkerCompleted.

Threading.Timer/Timers.Timer - both provide a way to execute a piece of code at certain intervals. Both fire their callbacks on threadpool threads by default (Timers.Timer can marshal its Elapsed event to a specific thread via SynchronizingObject, while Windows.Forms.Timer runs on the UI thread), so the right choice depends on your use case.

Parallel LINQ (PLINQ) - An extension to LINQ which can parallelize LINQ queries with a simple AsParallel method.
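For example, a sum of squares parallelized with a single AsParallel call (the range and math here are arbitrary):

```csharp
using System.Linq;

public static class PlinqDemo
{
    public static long Run()
    {
        // AsParallel partitions the source across cores; the order of side
        // effects is not guaranteed, but aggregates like Sum are deterministic.
        return Enumerable.Range(1, 1000)
                         .AsParallel()
                         .Select(n => (long)n * n)
                         .Sum();     // 333833500
    }
}
```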

The Task Parallel Library (TPL) - Tasks can be created (or StartNew), chained (ContinueWith), started and waited for. Tasks use the TaskScheduler and by default execute on the threadpool. The Parallel class is also part of the TPL and contains Parallel.For, which runs a parallel for loop, Parallel.ForEach, which runs a parallel foreach, and Parallel.Invoke, which executes multiple actions in parallel.
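A compact sketch of those TPL pieces together (the numbers are arbitrary):

```csharp
using System.Threading.Tasks;

public static class TplDemo
{
    public static int Run()
    {
        // Create a task, chain a continuation, and wait for the result.
        Task<int> task = Task.Factory.StartNew(() => 21)
                             .ContinueWith(t => t.Result * 2);

        int[] squares = new int[10];
        Parallel.For(0, 10, i => squares[i] = i * i); // parallel for loop

        Parallel.Invoke(                              // two actions in parallel
            () => { /* action A */ },
            () => { /* action B */ });

        return task.Result + squares[3];              // 42 + 9 = 51
    }
}
```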

Data Containers and Types

ConcurrentDictionary - a threadsafe implementation of a Dictionary. The main difference is that it's preferred to use its Try methods - TryAdd, TryGetValue, TryRemove, TryUpdate; the implementation uses Monitor locks for write/count operations and no locks for read operations.
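The Try methods in action (the key and values here are arbitrary):

```csharp
using System.Collections.Concurrent;

public static class ConcurrentDictionaryDemo
{
    public static int Run()
    {
        var cache = new ConcurrentDictionary<string, int>();
        cache.TryAdd("visits", 1);                    // false if the key exists
        cache.TryUpdate("visits", 2, 1);              // only if current value is 1
        cache.TryGetValue("visits", out int current); // current == 2
        // AddOrUpdate combines the check and the write in one atomic call.
        int updated = cache.AddOrUpdate("visits", 1, (key, old) => old + 1);
        cache.TryRemove("visits", out _);
        return current + updated;                     // 2 + 3 = 5
    }
}
```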

ConcurrentQueue - A threadsafe implementation of a Queue (FIFO).

ConcurrentStack - A threadsafe implementation of a Stack (LIFO).

ConcurrentBag - A threadsafe implementation of an unordered list of items.

BlockingCollection - a threadsafe implementation of a producer-consumer collection; it will block on Add and Take when the collection is full/empty.
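A minimal producer-consumer sketch over a bounded BlockingCollection (the capacity and item count are arbitrary):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ProducerConsumerDemo
{
    public static int Run()
    {
        // Bounded to 2 items: Add blocks when full, Take blocks when empty.
        var pipeline = new BlockingCollection<int>(boundedCapacity: 2);
        int sum = 0;

        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 5; i++)
                pipeline.Add(i);
            pipeline.CompleteAdding();   // lets the consumer loop finish
        });

        var consumer = Task.Run(() =>
        {
            foreach (int item in pipeline.GetConsumingEnumerable())
                sum += item;
        });

        Task.WaitAll(producer, consumer);
        return sum;                      // 1+2+3+4+5 = 15
    }
}
```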

ThreadStaticAttribute - an attribute that works on static fields; it creates a separate copy of the variable for each thread - almost the opposite of volatile, so instead of synchronizing a value across threads, each thread gets its own.

ThreadLocal - A helper for creating a local copy of a variable for each thread that uses it.

Lazy<T> with isThreadSafe true - a helper for lazy initialization of resources; when used with the isThreadSafe: true parameter, it will initialize the value in a threadsafe manner and create a single instance.
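A small sketch of both helpers (the values are placeholders for a real per-thread counter and an expensive resource):

```csharp
using System;
using System.Threading;

public static class PerThreadDemo
{
    public static int Run()
    {
        // Every thread touching .Value gets its own independent counter.
        using (var perThread = new ThreadLocal<int>(() => 0))
        {
            perThread.Value++;           // only affects the current thread

            // The factory runs exactly once even if threads race on .Value.
            var resource = new Lazy<int>(() => 42, isThreadSafe: true);
            return perThread.Value + resource.Value;  // 1 + 42 = 43
        }
    }
}
```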

Lock-Free algorithms - lock-free algorithms synchronize with Interlocked instructions instead of explicit locks; CompareExchange (which has an implicit memory barrier) maintains state by trying, and retrying, to change a value atomically. Lock-free algorithms are usually slightly faster than locking in low contention situations and can gain even more performance over locking algorithms in high contention situations.
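The classic retry loop looks like this (AtomicMax is a hypothetical example operation, not a framework method):

```csharp
using System.Threading;

public static class LockFree
{
    // Read a snapshot, compute the new value, then publish it with
    // CompareExchange; if another thread changed the value in between,
    // the snapshot no longer matches and we simply retry.
    public static int AtomicMax(ref int target, int candidate)
    {
        while (true)
        {
            int snapshot = target;
            if (candidate <= snapshot)
                return snapshot;         // nothing to do
            if (Interlocked.CompareExchange(ref target, candidate, snapshot) == snapshot)
                return candidate;        // our write went through
        }
    }
}
```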

John Hanna implemented a nice collection of lock-free algorithms - https://hackcraft.github.io/Ariadne/

[MethodImpl(MethodImplOptions.Synchronized)] - the lazy person's locking scheme: it locks the instance around each method the attribute is applied to. It's like making a class threadsafe without the hassle of knowing what and how multithreading works; performance is equivalent to executing serially on a single thread, plus the time it takes to lock on each method.

Browser Multithreading

In the old days (286, anyone?), DOS multitasking software such as DESQview provided a basic multitasking experience by saving the current state and switching to a different task. These days software is a lot more complex, but the same principle of context switching remains, and that idea can serve us well when trying to multitask in browsers.
 
While all browsers provide setTimeout and setInterval, which just schedule execution on the browser's event loop, the HTML5 standard provides us with web workers, which bring real multithreading to browsers.
 
Based on the same principles as DESQview, a developer can provide a rich experience to the user with async programming with the help of setTimeout/setInterval, and if richer, more CPU-intensive processing is required, start a new web worker and give it a task to process in the background.
 
For example, jQuery can do animations thanks to setInterval; it splits the animation into multiple steps which are executed each time setInterval fires (Neeraj Singh wrote an article about it), which gives the illusion that multiple things are happening at the same time.

Exceptions

Thread exceptions are handled silently; the starting thread doesn't know that a certain thread failed. This creates a unique situation in which an application needs to handle thread failures explicitly. To make matters worse, if a thread fails while holding a lock, that lock may stay locked forever; on top of that, the default thread abort mechanism works by throwing an exception inside the thread, which can make the unreleased locks problem even worse.
 
You should always handle thread exceptions: a thread body should always be wrapped in a try-catch with, at the very least, a catch for ThreadAbortException, and all locks should be released in a finally section.

try
{
    // thread body
}
catch (ThreadAbortException) { /* the abort can be cancelled with Thread.ResetAbort() */ }
finally { /* release any locks held by this thread */ }
 
An exception to this is if you're only using the lock syntactic sugar, as it's released automatically, but good habits prevent future problems in case you decide to use a different lock.

Debugging Threads

Debugging threads is not a trivial task; the Visual Studio Performance Profiler can do concurrency profiling, which helps in understanding what threads are doing and how locks are affecting them.

Visual Studio Profiler



Moreover, you should name your threads, so when looking at the Threads window in Visual Studio you can easily locate the threads you're interested in, and you can select which thread you would like to step into next.

Threads Window


Summary

Threads are not a magic bullet for performance problems; their use should be weighed heavily against the cost in complexity, development and testing time, making sure the extra effort is worth the potential benefits. POC and profile your design so you won't be disappointed.

Many algorithms have faster alternatives which should be investigated first. Proper multithreaded development requires both care and experience, and if you're not careful it can introduce concurrency-based bugs (or Heisenbugs, as some like to call them). Always check the documentation for whether a certain piece of code is threadsafe per instance or per method before making decisions and adding locks, and when you do use locks, use the most lightweight lock that fits each situation.

I've come to the conclusion that solving multithreaded issues is a tedious job and it's best to avoid these problems altogether, both in the concurrency thought process and for performance: every lock carries a performance penalty, and while lock-free collections are both safer and quicker, their implementation is more prone to memory-model errors and can be a challenge when facing random failures. IMHO, the best multithreaded application is one in which simplicity is balanced with performance, which can mostly be achieved with messaging, tasks and separation of concerns.

Avoid modifying structs and arrays with elements smaller than the memory alignment (usually 32/64 bits); a small element could be written to memory as 'get the whole word, modify only 1 byte, push the modified word back', and you can see the problem in this - the same goes for arrays. In general I would recommend splitting the arrays/structs so the data each thread modifies is separate, and joining the modifications back into the struct when all threads are done.

For small tasks, the more common scenarios can be solved using a task library or by dividing the work into smaller tasks and executing them in threadpools.

For larger tasks/collections, like matrix calculations, it is better to divide the tasks to ranges and let each thread handle a separate range to avoid the mandatory locks and CPU cache issues.

Starting/stopping threads to perform tasks is usually more costly, as the price of starting threads is significant, and therefore a threadpool of some sort should be used in most cases.
In addition, for tight loops, trying to avoid code branches might pay off, as branch prediction is not 100% accurate and each miss incurs a penalty.

Lastly, lock acquisition is more a matter of statistical probability than a perfect queue where the first thread to wait gets the lock; welcome to the era of programming with statistical probabilities.

Further Reading

Great presentation slides from Gaël Fraiteur - http://www.slideshare.net/sharpcrafters/multithreading-fundamentals
Article about locks - http://locklessinc.com/articles/locks/
Threading in C# by Joseph Albahari - http://www.albahari.com/threading/
https://stackoverflow.com/questions/3342941/kill-child-process-when-parent-process-is-killed
PLINQ Partitioning - http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx
MSDN Overview of Synchronization Primitives  - http://msdn.microsoft.com/en-us/library/ms228964(v=vs.110).aspx
Microsoft CHESS tool for testing concurrency issues, Microsoft released the source code but the project seems abandoned - https://research.microsoft.com/en-us/projects/chess/
 
Problems and Solutions
- Producer Consumer/Bounded Buffers, easy to implement with BlockingCollection or ConcurrentQueue.
- Readers / Writers, easy to implement with ReaderWriterLock and ReaderWriterLockSlim.
- Dining Philosophers, an interesting implementation from Microsoft.
 
 
Synchronization Algorithms - some are more or less relevant to .NET, but all are an interesting read.
- Thundering herd problem - https://en.wikipedia.org/wiki/Thundering_herd_problem
- Lock Convoy https://en.wikipedia.org/wiki/Lock_convoy
- Sleeping barber problem - https://en.wikipedia.org/wiki/Sleeping_barber_problem
- Cigarette smokers problem - https://en.wikipedia.org/wiki/Cigarette_smokers_problem
- Readers-writers problem - https://en.wikipedia.org/wiki/Readers-writers_problem
- ABA problem - https://en.wikipedia.org/wiki/ABA_problem
- Readers-writer lock - https://en.wikipedia.org/wiki/Read/write_lock_pattern
- Lamport's bakery algorithm - https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
- Dekker's algorithm - https://en.wikipedia.org/wiki/Dekker%27s_algorithm
- Eisenberg & McGuire algorithm - https://en.wikipedia.org/wiki/Eisenberg_%26_McGuire_algorithm
- Peterson's algorithm - https://en.wikipedia.org/wiki/Peterson%27s_algorithm
- Szymanski's Algorithm - https://en.wikipedia.org/wiki/Szymanski%27s_Algorithm
- Spinlock - https://en.wikipedia.org/wiki/Spinlock
- Seqlock / frlock / ticket lock
 


Introduction to Multithreaded and Parallel Development - Part 2

Basic Multithreading Issues with Compilers

From Wikipedia:
"A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language...)..."

When we compile a line such as

x=x + y

It's first compiled to about four IL instructions:

ldsfld int32 x
ldsfld int32 y
add
stsfld int32 x
 
and to about two assembly instructions:

mov         eax,dword ptr ds:[00650C4Ch] 
add         dword ptr ds:[650C48h],eax 
 
We have no problem with the IL code, as it's not really executed by the CPU; it's only later compiled to machine code.

We might encounter the following problems:

- A thread could be scheduled off after the 1st line of machine code; another thread will start and modify [00650C4Ch] with a different value, leaving eax with an incorrect value. This is called a race condition, as the threads are racing with each other; the result is usually non-deterministic and unpredictable, and the problem might not show immediately, as the development machine's CPU might be slower or faster, or have fewer or more cores, than the executing machines.

- Variables might be cached by the compiler, which doesn't understand that other threads might want to use them; this might actually speed things up in a single-threaded program, but it creates many points of stale data, leading to incorrect results. In .NET this can be solved with volatile (which has a performance cost, as it prevents this caching and constrains reordering, forcing reads and writes to go to memory).

- Compilers change; with almost every new version of the .NET framework, Microsoft works hard to make the compiled code execute faster. These optimizations can cause your code to misbehave, as new race conditions may be surfaced, and that's one of the reasons it's important to write correct multithreaded code and not rely on the fact that your code currently does not crash or produce incorrect results.

Introduction to Threads and Synchronization

 
Critical sections are sections of code that can only be executed by one thread at a time; if multiple threads attempt to enter a critical section, the first one will execute and the rest will wait for it to finish, then execute serially one by one.

Atomic operations are operations that execute without external elements affecting their state/data; so with a CPU-atomic operation in the example above, the value of x will always be x + y.

Threads are execution units; they have their own stack and execute a method that was passed to them. When threads are created and started, they are scheduled in the OS scheduler for execution, but they do not all execute concurrently with every other thread in the system, as there are fewer cores than threads.

What happens is that the OS has a list of threads to execute, and by switching threads on and off it creates an illusion of concurrency. The amount of actual concurrency depends on the number of CPU cores, so if we have 2 cores and 10 threads, only 2 threads execute concurrently at any given moment. Which threads execute depends on the process's priority and the thread's priority. This process of scheduling threads on and off is called context switching, and it is not free: the more threads we have in the system, the bigger the performance penalty, as less actual work is done and more CPU resources are spent on context switching.

Locks guarantee that the locked code block is executed serially, one thread at a time; locks are implementations of critical sections. There are many types of locks, each suitable for a different purpose.

Interlocked operations are atomic operations. Consider the following: on a single-core CPU, a single CPU instruction is effectively atomic, while a single line of code can be many CPU instructions. On a multi-core CPU, a single instruction is not atomic with respect to memory, because other cores can access the same memory location; to prevent that, the CPU can lock the bus. So assuming we want the instruction INC (increment by one): alone it will work on a single-core CPU, but on a multi-core CPU, LOCK INC must be executed instead, which first locks the bus, increments by one and unlocks the bus.

The problem most developers find with threads is not the simplicity with which threads share memory and execute code, but the timing and how memory is written. When you write sequential code you don't care whether the compiler decides to cache a variable or how it writes and decomposes a long stream of bytes. When you write to a memory location, you don't expect other bytes to be read and written, only the bytes you want, but those bytes could be stale and not contain the real value, or they could be part of a bigger memory region, like an array, a struct or just part of a collection of bytes defined under a different data type.
 

- Why shouldn't we lock every variable?

Lock acquisition is slow, and even if it were fast, you would lose most of the gains you could have gotten from having multiple threads, as the locked block is executed serially.

Locks need to be thought about, as deadlocks are very easy to create with multiple locks.

If you're locking a single variable and not a process, it's easier to use an Interlocked operation if it applies.
 

- Shared State vs. Shared Memory

While shared memory is what we have in a process, there is no problem writing to a memory location whenever we want; the CPU doesn't make us lock that memory location or anything like that. What we do have to consider is that writing to a memory location could affect other parts of our software: those parts look at variable values to determine their state and what they need to do next. Hardware problems aside, memory doesn't get corrupted; what gets corrupted is the state our threads see, and that state needs to be modified as a whole.

Let's take a simple Dictionary as an example. First, items are divided into buckets; then buckets get full and links to other buckets are created. Assuming we don't lock our dictionary's STATE, multiple threads can each create their own buckets, each one holding only one item, so when they are added to the Dictionary, the end result could be only one bucket with one item, or any kind of mutation of the dictionary's state. To make matters worse, we might have a counter that counts how many items are in the dictionary, and that counter can show 5 items while we have only one. Now let's make it even worse by saying we have a process that reorders the buckets after 10 inserts, and... well, you understand the gravity of our situation.

So that's another reason why locking individual variables is not a good practice.
 

- How to divide the work?

The easiest and most common way is to give each task its own thread. In the example above, one task listens for incoming documents, another does the conversion, etc.

Another way is to divide long loops into shorter loops, each executing in its own thread.
But if all threads are updating the same variable, that variable becomes a bottleneck: all cores need to synchronize their cached value from the same memory location, and this process of updating the cache is relatively slow, slowing things down and possibly making the whole process slower than a single thread.

You should take into account that starting threads takes time, so if the thread does not have a long-term task, you should use a threadpool implementation; there is one provided by the framework, but if your tasks depend on each other, you might want to consider implementing multiple TaskScheduler(s).

A multithreaded application usually doesn't scale linearly, as there are races for resources - sometimes ones you're not even aware of; the CPU cache is one of them. You'll usually get the best performance if you don't share any modifiable resources between threads (read-only resources are fine) and, at the end of the task, join the data in the calling thread.
 

- What happens if a thread fails?

Exception handling is important in a multithreaded environment; threads, like any other code, can fail. Plan for failure: always catch exceptions and know how to handle them. In our example, what happens if the conversion fails? Do we retry? How many times? Do we mark the document as faulty and try again later? Do we give up?

But most importantly, what are the reasons for failure? Can we prevent them with a better technique? Strive to avoid problems, as problems might indicate your logic is not fully thought through.

Our threads are just like any other block of code, so we can try-catch and report back to the main thread what just happened; in the case of BackgroundWorker, RunWorkerCompletedEventArgs can report errors back.

If all else fails, application-wide exception handling can be done with AppDomain.UnhandledException, but it should not be used for error handling, only for error reporting.

In my view, there are two important reasons to monitor failed threads: the first is releasing locks (the lock syntactic sugar will release the lock; any other type of lock needs to be explicitly unlocked), and the second is to make sure that if another thread waits for input from the failed thread, it will not wait indefinitely, never doing its task and causing the program to misbehave.
 

- What resource do threads take?

The most obvious one is CPU: when you add another thread, the time slice your application gets is divided between the threads you have, so more threads doesn't necessarily mean more work done. There is a point at which more threads mean less work, as context switching takes more CPU than the actual work being done; the sweet spot is usually between the number of CPU cores and about twice that, depending on the ratio between work time and sleep time.

A less obvious one is memory: each new thread takes about 4KB of committed memory and reserves 1MB for its stack. While this is not much on a modern machine, it's still something to take into account. Note that if 1MB is not enough, it's usually because the thread misbehaves or uses extreme recursion (which could be replaced with queues).
 

- How to gracefully terminate threads?

To gracefully terminate threads, you should have an exit strategy. The simplest would be a volatile boolean field which all threads check so they can finish their cycle, release locks and other resources (files, network connections, etc.) and end gracefully; you can then wait for the threads' termination with Join.
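A sketch of that exit strategy (the class and field names are made up, and the 10-second timeout matches the backup plan below):

```csharp
using System;
using System.Threading;

public class StoppableWorker
{
    private volatile bool _shouldStop;   // every thread sees the latest value

    public void Run()
    {
        while (!_shouldStop)
        {
            Thread.Sleep(1);             // stand-in for one unit of work
        }
        // release files, locks and connections here before returning
    }

    public void Stop(Thread thread)
    {
        _shouldStop = true;
        if (!thread.Join(TimeSpan.FromSeconds(10)))  // the backup plan:
            thread.Abort();                          // force it if it hangs
    }
}
```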

You should also have a backup plan: if any of the threads do not terminate in a decent amount of time (for example, 10 seconds), you can Abort them, but keep in mind that locks (other than the lock syntactic sugar) might stay locked, so make sure you release them in a finally block.
 

- How to have multiple threads access the same resource?

If they are only reading data, it's usually OK, but to know that, you really have to know the internal structure of the objects you're accessing. A misbehaving example would be an LFU caching object keeping track of which item is used the most: if the internal collection is not threadsafe (a threadsafe method is one that is safe to call from multiple threads in a defined and predictable manner), the object might fail either gracefully or disastrously.

If they read and write, then don't. In most cases it's simply not worth the time, potential bugs, locks and headache of having the same resource shared by multiple threads; the performance improvements could be significant, but then you wouldn't be reading an 'introduction to multithreading' but advanced material.

Instead of using read/write shared resources, create a service from these resources; have each thread send a request and wait for an answer from the service, which does all the work for you. It is easy to implement using queues, as each thread queues a request and waits for an answer. It might be slower, but the time that would otherwise be spent fixing concurrency bugs usually doesn't pay off in a business sense.
 

- Threadpool

Threads take time to start and end, their context switching also has a cost, and you end up with a whole bunch of CPU taken away from you just by using them.

The solution is a threadpool, which is an optimized way of using multiple threads.

The idea is that you have a number of threads roughly equal to the number of cores (depending on the implementation) waiting on a queue; since pushing and pulling jobs from a queue is relatively quick, you can push very small jobs and still enjoy the benefit of using multiple threads.
 

- What is this map/reduce thing I've been reading about everywhere? They say it's 'web scale'.

I dislike buzzwords, but the basic idea is that machines/processes are responsible for collecting only the information relevant (hence map) to a particular operation (hence reduce); this way you can have large numbers of machines/processes, each one doing a small amount of work, and thus gain performance improvements while working on a cumulatively large dataset.

Common Problems and Solutions 

Deadlocks

Deadlocks occur when multiple threads attempt to take different locks which depend on each other; all threads involved lock up, as none of them release their first lock, preventing the others from using the resources they need - similar to the gridlocks that occur in big cities when drivers enter an occupied junction.

There are a few ways of resolving deadlocks: smart mechanisms that abort the whole process and retry later, preventing locks from being taken if the entire list of locks is unavailable, or keeping track of lock wait time and aborting/retrying when a lock waits for too long. You can also try to avoid deadlocks by avoiding nested locks and keeping a consistent locking order, but these are just guidelines and might not prevent deadlocks in edge cases you haven't thought about.

In my experience, avoiding locks is a lot better and easier than trying to overcome deadlocks, but as system complexity rises higher and higher, avoiding locks might not be possible and deadlock prevention algorithms will be needed.

Locks in .NET include the following: Monitor, Mutex, SpinLock, ReaderWriterLock, Semaphore, SemaphoreSlim and the Interlocked operations, which perform atomic changes to variables; we'll get to all of these later.

For further reading, take a look at Deadlock prevention algorithms.

Livelock

A livelock is similar to a deadlock, except that where a deadlock blocks a thread from advancing, a livelock does work only to fail at the locked block; the negative side of a livelock is that although it's not changing anything it shouldn't, it is also not doing any work to advance the program - similar to pressing both the brake and the throttle pedals at full power in your car.

Race Conditions/Data Races

Race conditions occur when multiple threads write to the same memory location, making non-deterministic changes; the simplest example would be i = i + 1. You might think this one line executes in one clock cycle as one CPU instruction, but it can compile to more than one instruction.
For the sake of simplicity, let's think of it as:

1. get value from i into register a;
2. add 1 to register a;
3. save value back from register a to variable i;
 
So if we have two threads executing the same line, our results will vary depending on when line 1 was executed: register a might contain a stale value if both threads already executed line 1, and when they reach line 3, both will write the same value, defeating the purpose of i = i + 1.

The easiest way to avoid race conditions is using locks, but locks make everything run serially, preventing us from really parallelizing; they take time to acquire and release, and some of them even ask the kernel for the actual lock, which takes even more time.

The quicker, hardware-based solution is to use Interlocked commands, such as Interlocked.Increment / Interlocked.Add. Keep in mind that interlocked commands also run serially across cores: what they do is lock the bus, do their thing and cause a cache miss for the other cores, which then need to reload the data from main memory. If you execute many Interlocked commands, it may be more beneficial to keep a thread-local counter and update the main counter at certain intervals.
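The difference is easy to demonstrate in a sketch (the counter names and iteration count are arbitrary):

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class CounterRace
{
    public static int Run()
    {
        int racyCounter = 0, safeCounter = 0;
        Parallel.For(0, 100000, i =>
        {
            racyCounter++;                           // load/add/store: updates get lost
            Interlocked.Increment(ref safeCounter);  // atomic read-modify-write
        });
        // safeCounter is always 100000; racyCounter usually ends up lower.
        return safeCounter;
    }
}
```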

CPU Execution Order

CPU out-of-order execution has existed in CPUs since the 90s. It means the CPU optimizes a block of instructions to execute in a more efficient manner, memory-wise: if a few instructions already have the memory they need and a few don't, the CPU will execute the instructions whose data is available, fetch the memory needed by the rest, and then execute them, so less waiting happens.

This actually causes problems in multithreaded applications, since the programmer assumes the CPU executes the instructions in the order they were written; to make matters worse, some CPU reorderings actually write to memory a lot later than the programmer intended, such as at the end of a block of instructions.

So if you have a few threads that rely on the fact that variable a is assigned before variable b, you might need to use memory barriers, especially on weak-consistency memory model CPUs (ARM/Itanium).

volatile, the lock syntactic sugar, Thread.MemoryBarrier and the Interlocked commands are your friends in this situation, but like all synchronization primitives, they carry a performance penalty.

Logic Execution Order

Threads take time to start, a threadpool might already be executing other jobs, and the whole system might be busy at 100% CPU utilization, making everything slow and execution a little bit iffy; never count on one thread's result being ready - use messages, locks, flags, etc. to signal the current state.

Again, remember that the memory update order could be different from what you wrote, so use memory barriers where it's important.

Cached Values  - Cache Coherence

Main memory access is orders of magnitude slower than CPU L1 or L2 cache access; for example, accessing L1 cache takes ~4 cycles, L2 takes ~10 cycles, L3 takes about 40-75 cycles and main memory takes ~100ns. On the Nehalem (Intel i7) series of processors, the L1 cache size is 64KB, L2 is 256KB and L3 is about 4-12MB.

Considering the size and access time of these caches, it's important to know where your values are coming from and where they are going.

Let's think of the following situation: we have two threads, one writing and the other reading from the same memory location. You might assume that the CPU knows when its cache is getting stale, and it does, but only when it has been told to watch those memory locations. That is what the Interlocked operations do; they cause the CPU to signal a cache change when these values change.

The situation gets worse when two threads are intensively writing to the same memory location: each write makes the other core's cached copy stale and it needs to be refreshed, and that process, repeated many times, actually slows the whole thing down instead of speeding it up.
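To see both points at once, here is a hedged sketch (the `Counters` class is made up for illustration) comparing a plain increment, which is a racy read-modify-write, with `Interlocked.Increment`, which is atomic but pays the cache-invalidation cost on every write:

```csharp
using System.Threading;
using System.Threading.Tasks;

class Counters
{
    public long UnsafeCount;
    public long SafeCount;

    public void Run(int iterations)
    {
        Parallel.For(0, iterations, i =>
        {
            UnsafeCount++;                         // racy read-modify-write: updates can be lost
            Interlocked.Increment(ref SafeCount);  // atomic, but every write invalidates the cache line
        });
    }
}
```

On a multicore machine the unsafe counter usually ends up below the iteration count, while the safe one is always exact.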

Access time - https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory

Further reading - https://stackoverflow.com/questions/2538070/atomic-operation-cost

Excessive Parallelization - Excessive Context Switching

Sometimes threading is thought to be the magic solution to all problems, but threading comes with a cost: startup, teardown and context switches are all part of that cost. Starting 100 threads might not affect the system negatively, but starting 10,000 threads can cause some systems to slow down significantly; having the CPU spend most of its time on context switches actually wastes CPU rather than using it.

If you're starting 10,000 threads, you might be doing something wrong; check if a threadpool can do a better job of scheduling your jobs or whether async programming suits your needs. The idea behind Node.js is actually a solution to excessive context switching: Node.js uses a single thread for processing the JS code, so opening a new connection simply adds another job to the event queue. If that job is waiting indefinitely for a database response, it doesn't hold on to a thread that might or might not be switched in and then switched back out, doing nothing but wasting CPU cycles on the context switch itself.

Reactive Extensions for .NET could be a nice read.

Starvation/Lock Contention/Convoying

These are related, if not the same cause and effect. When you have multiple threads attempting to access the same resource, two things actually happen: first, all the threads that don't hold the lock can't do their job, so they either wait or spin; second, there is a race toward acquiring the lock. It's not first come, first served, it's more like pushing into a crowded train car: some get in on the first try, some only after the third. Basically it means that a waiting thread's lock acquisition time is bounded by the slowest thread's lock duration.

To make sure your locks are not killing your performance, do the minimal amount of work inside them and move as much logic as possible outside. As always, profiling is your friend, so check whether two lock/release cycles do a better job than having your logic in one large locked block.
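A minimal sketch of that idea (the `Aggregator` class and its "conversion" step are hypothetical stand-ins): the expensive work runs outside the lock, and the lock is held only for the shared-state update.

```csharp
using System.Collections.Generic;

class Aggregator
{
    readonly object _sync = new object();
    readonly List<string> _results = new List<string>();

    public void Process(string document)
    {
        // Expensive work happens outside the lock...
        string converted = document.ToUpperInvariant();

        // ...the lock is held only long enough to update shared state.
        lock (_sync)
        {
            _results.Add(converted);
        }
    }

    public int ResultCount
    {
        get { lock (_sync) return _results.Count; }
    }
}
```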

In systems where fairness is important, one way to solve it is to use a ticket lock. At the time of this article, .NET does not implement ticket locks, but one should be relatively easy to implement (though I'm not sure how such a lock would perform).
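For illustration, here is a rough ticket lock sketch (my own naive implementation, not a tuned or production-ready one): threads atomically take a ticket and spin until their number is being served, so acquisition is strictly first come, first served.

```csharp
using System.Threading;

class TicketLock
{
    int _nextTicket;  // next ticket to hand out
    int _nowServing;  // ticket currently allowed in

    public void Enter()
    {
        // Take a ticket atomically, then spin until it is being served.
        int myTicket = Interlocked.Increment(ref _nextTicket) - 1;
        var spinner = new SpinWait();
        while (Volatile.Read(ref _nowServing) != myTicket)
            spinner.SpinOnce();  // adaptive: spins, then yields, then sleeps
    }

    public void Exit()
    {
        // Only the lock holder writes _nowServing, so a plain add is safe here.
        Volatile.Write(ref _nowServing, _nowServing + 1);
    }
}
```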

Spinning

There are two waiting schemes: one is to wait for a notification, the other is to spin. When spinning, a loop waits for a condition to change, like a variable changing value. Spinning has both advantages and disadvantages.

The biggest advantage is that it is very fast, while the disadvantage is that it eats CPU; imagine 10 threads busy waiting (or spinning) on a variable value, this can take 100% of an 8-core CPU, which is wasteful.

Spinning locks are best used for very short locks, like for changing a variable, or updating a dictionary, never use them for IO locks.

To prevent spinning waits from affecting a system too badly, Microsoft actually implemented them in an adaptive form: the first few spins are CPU only, then a yield is added and eventually a sleep, so a programmer's mistake won't affect performance so much.

For spinning waiting schemes, you can use SpinWait.
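A small sketch of the spinning scheme (class and member names are made up for the example), using `SpinWait`'s adaptive back-off instead of a raw busy loop:

```csharp
using System.Threading;

class SpinExample
{
    bool _done;

    public void Signal()
    {
        Volatile.Write(ref _done, true);
    }

    public void WaitUntilDone()
    {
        var spinner = new SpinWait();
        while (!Volatile.Read(ref _done))
            spinner.SpinOnce();  // pure spin first, then yields, eventually sleeps
    }
}
```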

For notification waiting scheme, you can use WaitHandle/CountdownEvent/AutoResetEvent/ManualResetEvent/ManualResetEventSlim or Monitor's Wait and Pulse/PulseAll.
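And a matching sketch of the notification scheme (again, hypothetical class names) using `ManualResetEventSlim`, which itself spins briefly before falling back to a kernel wait:

```csharp
using System.Threading;

class NotificationExample
{
    readonly ManualResetEventSlim _ready = new ManualResetEventSlim(false);
    int _result;

    public int Compute()
    {
        var worker = new Thread(() =>
        {
            _result = 42;  // produce the result
            _ready.Set();  // wake up any waiters
        });
        worker.Start();

        _ready.Wait();     // spins briefly, then waits on a kernel event
        return _result;    // the result is safely visible after the wait
    }
}
```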

Branching

While this is not directly related to multithreaded development but rather to performance, it's good to know that the CPU executes "if" commands by comparing values; if the values match it jumps to one address, if not, to another. There is also a part of the CPU that predicts which way a program will go and fetches the next predicted address. A misprediction takes time, and when mispredictions occur often they can affect the program's performance negatively.

In very tight loops, avoiding multiple ifs and switches can be beneficial; the predictor works by analyzing past results and is about 90% accurate, but if your conditionals produce unpredictable outcomes, performance will suffer.

Programming Errors

There are many programming errors that can occur with multithreading, but pay particular attention to resource allocation: which thread creates which resources, and which threads destroy them. To keep things simple, keep resources as local as possible, and let only the creating thread destroy them; this way your application logic remains as simple as possible, while complex code will most probably lead to bugs and time-consuming debugging.

You might want to use a resource pool, so each thread is only responsible for releasing resources while the pool is responsible for destroying them; use try/catch/finally or using statements so your resources are always disposed, even in case of an exception.

Create global resources before the threads start if possible, or use Lazy<T> with isThreadSafe set to true.
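For example, a minimal lazily initialized global (the `Config` class here is a hypothetical placeholder for whatever shared resource you have):

```csharp
using System;

class Config
{
    // With isThreadSafe: true the factory runs at most once,
    // even if several threads race on the first access.
    static readonly Lazy<Config> _instance =
        new Lazy<Config>(() => new Config(), isThreadSafe: true);

    public static Config Instance
    {
        get { return _instance.Value; }
    }
}
```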

Lock re-entrancy means that a thread might acquire the same lock more than once. Most .NET locks are reentrant, which means you can lock multiple times; what you do have to remember is to release the lock the same number of times. Please read the documentation before relying on this, as Semaphore is not reentrant and ReaderWriterLockSlim can be configured either way (see LockRecursionPolicy).
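A tiny sketch of Monitor's re-entrancy (hypothetical class, for illustration only): the same thread takes the same lock twice without deadlocking, and both acquisitions are released when their blocks exit.

```csharp
class Reentrancy
{
    readonly object _sync = new object();
    int _count;

    public int Outer()
    {
        lock (_sync)        // Monitor is reentrant: the owning thread...
        {
            return Inner(); // ...may take the same lock again
        }
    }

    int Inner()
    {
        lock (_sync)        // no deadlock; fully released when both blocks exit
        {
            return ++_count;
        }
    }
}
```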

Introduction to Multithreaded and Parallel Development - Part 1

In recent years, as CPU clock speeds have hit their limits, manufacturers have been adding cores to continue delivering faster processors; as a result, multithreaded development is more and more necessary to fully utilize a machine's capabilities.

Back when computers had a single-core CPU, normal procedural and even object-oriented programming was pretty straightforward. While multithreaded development requires a different thought process, it is simple and similar enough to single-core programming that you can get started very quickly. In addition, Microsoft has been keen on adding simplified ways of using threads with the BackgroundWorker, Task and PLINQ libraries.

There are two distinctive concepts with regard to multithreading:

Concurrency - when two distinct processes execute at the same time, sharing time on the CPU, taking turns.

Parallelism - when two distinct processes execute at the same time without sharing time, meaning each one executes on a separate core.
 
On a single core there is no way to execute tasks in parallel, as tasks need to share the CPU time to execute concurrently. On a multicore CPU both concurrency and parallelism take place: cores can execute parallel tasks, but they also share time, as most systems have more threads than CPUs.

There are a great number of resources on the web for learning how to write the exact code, so I thought I'd save us both time by not rewriting all the examples and instead explain the underlying decisions that led to how and what is in those examples; IMHO understanding is more important.

Let's start with the basics.


What are the basic building blocks?

A Thread - a single execution unit; a thread may execute or sleep. A thread can also wait for a wakeup from another thread or process, can give up the rest of its allotted time slice, and can access all of the process's memory and resources.
A Process - a collection of threads; it has 1 or more threads.
An Application - a collection of processes; it has 1 or more processes.
A Job - a job allows grouping of processes, making it possible to manage a few processes together. An example would be the Chrome browser: if you kill the main process, all the other processes are also killed. While not implemented directly in .NET, it is advantageous to know about and use in the right circumstances (an example by Matt Howells).

For a sense of completeness, there are also fibers and UMS, but neither is implemented in the .NET framework. Fibers are basically light threads, and UMS is highly specialized for high performance, highly multithreaded applications; the work involved in using it makes it cost prohibitive for most developers interested in pursuing this adventure.
 
Lets imagine a system we would like to use as an example:
- It's a document management system.
- Documents are coming in via multiple ways, e.g. email, ftp, web uploads.
- Documents need to be converted to a common format
- Documents need to be scanned for information
- Both the converted documents and the scanned information need to be saved for later processing
 
We have the following options for architecture:
- create a process for all tasks, execute sequentially.
- create a process for each task.
- create a process for all tasks and use threads for each task.
 
What ways do we have to communicate between these tasks?
- File system
- Shared Memory
- IPC (named pipes)
- Messaging (like MSMQ)
 
While databases can also be used to communicate state between systems, IMHO it's a less than ideal solution; while it can provide as good an experience as messaging, it is not as simple as a messaging mechanism. However, Microsoft SQL Server has supported queuing mechanisms since 2005; if your requirements are that the queues be backed up regularly with SQL and that the messages be part of a database workflow and/or transactions, it might be beneficial to look into Microsoft SQL Service Broker.
 
 
File System
- Scalability: a typical directory can hold many files.
- Safety: parallel writes are not easily implemented; file system crashes are relatively rare with mirrored/RAIDed systems.
- Speed: limited by disk speed; with today's SSDs we're typically talking 100's of MB/s or 10's of thousands of IOPS.
- Disadvantages: relatively slow.

Shared Memory
- Scalability: limited to an individual system's memory.
- Safety: one error can take the whole application with it; if data is not persisted correctly, data loss will result.
- Speed: limited by the system's memory bus speed and the application architecture.
- Disadvantages: requires understanding of how memory is used by the compiler/processor and multithreaded programming knowledge.

IPC/Named Pipes
- Scalability: limited to an individual machine; between machines a better solution would be sockets/WCF.
- Safety: considered to be safe.
- Speed: less than shared memory but more than messaging.
- Disadvantages: limited to the same machine, or slow in the case of remoting.

Messaging
- Scalability: accepted around the software industry as the scalability choice.
- Safety: multiple subscribers/publishers can go down; as long as the queue is still alive, the system will be resilient to crashes.
- Speed: typically limited by hardware; with multiple queues and multiple servers it can scale horizontally beyond current needs.
- Disadvantages: requires a different thought process, in terms of 'packets' rather than 'functions'.
 
What do these mean in the context of our example application?
 

File System

File system can be used to communicate between different applications or different parts of our application. For example, all incoming documents can be saved to one folder as soon as they arrive, before any processing; then they can be converted, again saving the converted files to another folder; then they can be scanned and the scanning results saved to a third location.

This is simple, but will probably grind our disks on a high load system.

Advantages:
- The biggest advantage in my opinion is that if the application/server crashes, we have the last state saved to disk and we can pick up where we left off.
 
Disadvantages:
- The biggest disadvantage is that file system operations are relatively slow.
- We need to implement a file locking and retry mechanism.
- While two processes can read from the same file, it is hard to implement a scalable solution; file locks can help, but .NET reports lock conflicts via exceptions, and handling them is a slow process.
- A lot of CPU is wasted on the file system, try/catch and file saving and loading, so for a low profile application it might do the trick, but if every CPU cycle is important to you, avoid this solution.
 
Implementation orientation:
- Use proper locking, if you only want one process to access one file, use exclusive locks (FileShare.None).
- Use try/catch on all file accesses, as locks throw exceptions.
- Use FileSystemWatcher instead of constantly polling for changes. Also, FileSystemWatcher is known to be not 100% reliable when its internal buffer is full, so either implement a scanner and execute it at timed intervals to make sure you didn't miss a change, or increase FileSystemWatcher.InternalBufferSize.
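A minimal sketch of that setup (the `IncomingWatcher` wrapper is a hypothetical name; the buffer size and event handling are one reasonable choice, not the only one):

```csharp
using System;
using System.IO;

static class IncomingWatcher
{
    public static FileSystemWatcher Watch(string folder)
    {
        var watcher = new FileSystemWatcher(folder);
        watcher.InternalBufferSize = 64 * 1024;  // largest allowed buffer, fewer dropped events
        watcher.Created += (s, e) => Console.WriteLine("New file: " + e.FullPath);
        watcher.Error += (s, e) => Console.WriteLine("Buffer overflowed, a rescan is needed");
        watcher.EnableRaisingEvents = true;
        // A periodic Directory.EnumerateFiles scan should back this up,
        // since events can still be missed under heavy load.
        return watcher;
    }
}
```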
 

Shared Memory

Different parts of our application can do everything in memory, making things faster, for example, incoming document contents (or pointer to content) can be passed around to multiple functions, conversion and scanning can be done in memory and very quickly.
 
Advantages:
- Speed
 
Disadvantages:
- Need to synchronize memory accesses, multiple readers can read the same memory location but multiple writes should be synchronized.
- Application/server crashes leads to loss of data, so a program might need to retry uncompleted documents and still save state when a batch of documents is completed to save time in case of a crash.
 
Implementation orientation:
- Shared memory across processes can be implemented using Memory Mapped Files.
- Shared memory in the same process could be implemented using variables and concurrent collections with threads, threadpools and Tasks.
- Since writes should be synchronized, and since in most cases reads should either be blocked while a write is in progress or be allowed to return stale data instead, a ReaderWriterLock should be considered.
- It is the purpose of this article to introduce the ideas behind this solution, so read on.
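As a small sketch of the in-process variant (the `DocumentIndex` class is a hypothetical example for this document system), `ReaderWriterLockSlim` lets many readers proceed in parallel while writes are exclusive:

```csharp
using System.Collections.Generic;
using System.Threading;

class DocumentIndex
{
    readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    readonly Dictionary<int, string> _titles = new Dictionary<int, string>();

    public void Add(int id, string title)
    {
        _lock.EnterWriteLock();            // exclusive: blocks readers and writers
        try { _titles[id] = title; }
        finally { _lock.ExitWriteLock(); }
    }

    public string TryGet(int id)
    {
        _lock.EnterReadLock();             // shared: many readers in parallel
        try
        {
            string title;
            return _titles.TryGetValue(id, out title) ? title : null;
        }
        finally { _lock.ExitReadLock(); }
    }
}
```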
 

IPC/Named Pipes

Multiple parts of our application can receive the information they need from named pipes, each one is doing their task and pushing the results to the next pipe.
 
Advantages:
- Relative speed and separation of concerns.
 
Disadvantages:
- IPC and socket programming can be tricky; implementing retries and knowing when to listen and when to open/close connections to achieve optimum performance is not trivial. While WCF provides a simplified way of doing things, it still requires much attention to security and retry mechanisms.
- Using Named Pipes on the same machine is fast, but using sockets between different machines is relatively slow and network speed will affect the performance of the application greatly.
 
Implementation orientation:
- Named pipes could be implemented using NamedPipeServerStream and NamedPipeClientStream.
- I recommend using WCF facilities for this, although it adds overhead, the configuration options, flexibility and optional security and encryption will save a lot of time in the future in case the application becomes a large scale operation.
 

Messaging

Multiple parts of our application are listening to different queues, these parts are also responsible for pushing the next message to the next queue, this goes on and on until the incoming documents are handled to completion.
 
Advantages:
- Horizontally scalable (depending on the queuing mechanism and architecture, of course)
- Simple programming and thought processes.
- Adding publishers/subscribers to our queue is relatively simple; more listeners mean easier concurrent processing, so if we have two listeners on the same queue, both can work on the same task but with different data at the same time.
 
Disadvantages:
- While messaging provides high-bandwidth processing, it also has relatively high latency, meaning it takes time to pass a document through all the queues.
 
Implementation orientation:
- For in-app implementation, a simple ConcurrentQueue with a waiting thread could do the trick (and is very fast), add waiting threads if your application demands it AND you have available cores.
- For a more robust solution use MSMQ, ActiveMQ, ZeroMQ etc'. Note that I have only used MSMQ, so I can't recommend a specific product; one size doesn't fit all and you should do your own research on which one is best for you. Many cloud providers also offer a messaging solution, so it might be beneficial to take a look at them.
- Passing only a document id is much preferred to passing the whole document.
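A sketch of the in-app variant (the `InProcQueue` helper is a made-up name): wrapping a `ConcurrentQueue` in a `BlockingCollection` gives a waiting consumer thread without hand-rolled signalling, and the items passed are document ids rather than whole documents.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class InProcQueue
{
    public static List<int> Drain(IEnumerable<int> docIds)
    {
        var queue = new BlockingCollection<int>(new ConcurrentQueue<int>());
        var processed = new List<int>();

        var consumer = Task.Run(() =>
        {
            foreach (int id in queue.GetConsumingEnumerable()) // blocks while empty
                processed.Add(id);  // process a document id, not the document itself
        });

        foreach (int id in docIds)
            queue.Add(id);
        queue.CompleteAdding();     // lets the consuming loop finish
        consumer.Wait();
        return processed;
    }
}
```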
 

So what are Concurrency Models?

Concurrency models are the basic building blocks for adding concurrency to your application, consider the following:

- Shared Memory, which relies on the basic premise that all code in the program can access all memory; in a shared memory model we need to synchronize access to shared resources, as conflicts might cause a program to misbehave due to non-deterministic results.

- Immutable Data Structures, which rely on all data being immutable (i.e. unchangeable). If you want to add the value 1 to an array containing 1,2,3, the value is not simply added; instead a copy of the array is created, 1 is added and the new array is returned. Multiple writes will each return a different array with different values, valid only for the calling function; to sync all of these arrays, yet another one is created by combining all the results. This is an easier model to understand, but with huge data structures it can be a problem. The biggest advantage of immutable data structures is that they don't need to be synchronized or locked for shared access.

- Message Passing, which relies on each process being responsible for its own actions; want to tell it to do something? Send a message. JavaScript multithreading (Web Workers) is based on this.
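The immutable data structure idea above can be sketched in a few lines (the `ImmutableOps` helper is made up for illustration; .NET later shipped real immutable collections in System.Collections.Immutable):

```csharp
using System;

static class ImmutableOps
{
    // Add never mutates the source; it returns a new array,
    // so concurrent readers of the old array need no lock.
    public static int[] Add(int[] source, int value)
    {
        var copy = new int[source.Length + 1];
        Array.Copy(source, copy, source.Length);
        copy[source.Length] = value;
        return copy;
    }
}
```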