Thursday, July 31, 2014

A newer kernel is required to run this binary. (__kernel_cmpxchg64 helper)

While porting multithreaded C++ code to Android, I've encountered this error:

A newer kernel is required to run this binary. (__kernel_cmpxchg64 helper)


If you also encounter this issue, its important to understand all the elements involved in this error.

  • It is part of gcc 4.7 and up
  • it will fail below kernel 3.1, only then __kernel_cmpxchg64 was implemented
  • The problem is that 64 bit atomic operation is implemented using the kernel on arm unless armv7 is used.
  • Using arm instead of thumb will not solve your problem.
  • Enabling arm v7 hard-float might solve your problem (it solved mine).
  • Using  -mfpu=vfp and -mfloat-abi=hard might solve your problem (i didn't try as armv7 was desired).

Tuesday, June 24, 2014

Performance, Profiling and Optimization 101

This article is about common sense sprinkled with personal opinion and experience, don’t follow everything to the letter, remember:

There is an exception to every rule

In my opinion, optimization is for either one of two cases, customer demands it, or you’re doing it for fun. We’ll look at most of these components and attitudes in this article, some optimizations are more for a challenge than a real income as the performance gains could be great but will benefit no one other than our own pat on our backs. I enjoy both attitudes as some of my past articles show, I’ll optimize string format for fun, but check if I can accelerate an entire processing job for business if my customer demands the minimalistic latency that a computer can possibly achieve, but I will give them the information regarding which process will take how many developer resources and what’s the maximum amount of benefit that can be gained so they will decide if that second is worth 2 days of development time. 

Though everyone knows the value of tests, in performance tuning there is a special weight for it, not only that we improved the performance but also to make sure the outcome is the correct one, as performance tuning makes things more complicated it is sometimes hard to keep everything perfectly together, an outside inspector (tests) can help you keep your mistakes to a minimum.

What is performance analysis?

- From Wikipedia: profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or frequency and duration of function calls. The most common use of profiling information is to aid program optimization.

What is software optimization? 

- From Wikipedia: software optimization  is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. In general, a computer program may be optimized so that it executes more rapidly, or is capable of operating with less memory storage or other resources, or draw less power.

Let us assume

  • The more work you do, the more resources it will take, where resources are any combination of CPU time, IO (network/disk) and memory.
  • For the same amount of data, a few big chunks load faster than many small chunks, where chunks can be files on disk or network, the two major reasons are latency and overhead.
  • Don’t optimize unless you have to!  Many times the optimization is idealized in the developer’s eyes, write this optimized, its more efficient, write that optimized, its less resources, most of the time optimizations make things more complicated and less maintainable and you don’t want that overhead when a real bug comes in and you need to figure out what went wrong. Many times optimizations without profiling is time wasted, think like a businessman in that sense, if you don’t have the extra cash, don’t spend it, if you're considering a loan, then be prepared to pay the interest (delays).
  • For many server applications its cheaper to throw more hardware on the problem than optimize.
  • For many client applications the customer’s time is more expensive than developer time! If you’re offering a free application you might encourage your users to abandon it, if it’s a paid application, you should think of your customers, their time is not free and if their productivity is lowered, they won’t want to spend much time/money on your application, when thinking about customer time, double the amount of users with the time wasted, is it worth more than developer time and potential income?
  • Understand before optimization. Always understand what the application is trying to accomplish before you try to optimize, a big picture will show you that perhaps the whole process isn't necessary to achieve the end result, now you've cut a significant amount of time and resources. Algorithms are bottlenecks, is there a more efficient one?
  • Your hardware has finite resources. 

David Knuth wrote in The Art of Computer Programming - “premature optimization is the root of all evil”.

  • It is hard to be certain where a bottleneck will be in production system, you might assume that a certain piece of code slows things down, but in reality there could be so many variables affecting the system, that the piece you thought was the problem is negligible in the grand scheme of things, its also inefficient to optimize tasks you don’t care they are slow and tasks that execute rarely, unless these task need to provide an answer in real time.
  • Design with performance in mind. Its much cheaper to develop something that will withstand or almost withstand the planned performance requirements and then optimize rather than develop a cheap solution that will require much more work to change it later for the designed performance, find your balance for time to market vs performance. I know this looks like a contradiction to premature optimization (...), but really its not, assume you have a web application that needs to display 1000’s of records in a grid, planning to display 10 and later optimizing it will waste time, infinite scroll or other common practice is much more suitable.
  • Do a preliminary profiling as early as possible for the busiest parts so you’ll have a big picture of what smells and needs another peek. Do your profiling later when the system goes beta/production so you’ll know where the real bottlenecks are.
  • Working, Correctly, Fast. Your correct development flow should be first to make things work. Then work correctly and only then look for optimizations. Its easier and more accurate to do your performance optimization after everything works correctly.
  • Optimization usually complicates rather than simplifies. Optimizing code for performance usually requires caches, dictionaries, arrays and other less readable methodologies, that’s why it should be done at the end, so that the first working code will actually work correctly, complexity usually defies readability and ease of maintenance.
  • Waste not, want not. Don’t waste resources, don’t over-calculate stuff, don’t store things you don’t need, don’t retrieve data you’re not going to use. This will leave you with more resources for the stuff that really needs it. When you load more data than you need, IO needs to work more, depending on latency and throughput, in turn this uses more CPU and sometimes also memory, in the end, you paid a high price for doing unnecessary work.
  • Don’t waste memory resources. High memory usage leads to swapping, don’t cache what you don’t need to cache or might rarely use, cache needs to be maintained so it won’t become stale. Overusing memory leads to swapping and leaving less memory for file system cache which could make the whole system slow. One of the first things I check when I have a slow server is how much memory is available and if SQL is not configured with a hard limit. If you suspect your program is misbehaving, profile it!
  • Profile, Optimize and Validate, repeat as needed, don’t assume anything is faster/slower and don’t assume your improvement actually improves anything, under the right circumstances slow things can be fast and vice versa. Validate, I can’t stress that enough.
  • Don’t keep your code if it’s a minuscule improvement, most of us have a sentiment to the code we write, thinking we’ve spent an hour or more on something that it shouldn’t go to waste but remember that optimizations usually impair readability and maintainability and keeping the code will do more harm than good, if you’re really attached to it, open a graveyard blog and put it on display, this way others can learn and you get to keep it.
  • Why code metrics is not enough. Code metrics can be great to detect complexity and maintainability hotspots in your code, but it can’t measure performance, the fact that your code is complex or simple doesn't mean its slow or fast. A lot of performance bottlenecks can only be detected based on the data that goes through the program’s pipeline, that’s why its even more important to check the performance with real-world data rather than a mockup.
  • Understand Big O notation for algorithmic use. Big O notation is a measurement that describes the performance a certain function will have based on input length. The closer it is to 1, the more chance a long input will affect performance linearly.

How to get performance improvements?

  • Tweaking, usually minuscule, while reordering commands might affect CPU cache performance and can speed things up and using SIMD commands can even double the performance for certain actions,  but unless these commands execute in big loops or on very big data, these tweaks will probably save a second here and there, if you see a significant amount of work being done on blocks of data, like matrix/vector calculations, you should look into SIMD programming, this is completely outside the scope of this article and you’ll probably need to go native – C++, but there is a also a .NET Implementation.
  • Memory access - Understanding reference/value. All references are 32/64bits (depending on architecture), primitives and structs are value based and you should know that calling a method with a parameters by value will copy the contents, this is a type of micro-optimizations but in large loops can have an effect on the performance. In case you didn't know, strings are passed by reference but strings are also immutable, which means that any manipulation on strings creates a copy and discards the old reference, strings are a nice meal for the garbage collector. In any case, the time spent on copying structs is directly related to their sizes and worse case, you can always pass them by reference with ref.
  • Algorithmic. Usually big improvements, in memory and/or time. For example, different sorting algorithms, Search trees, etc'.
  • Object reuse. Object pools were created because we know that allocating memory and creating objects is time consuming (not to mention destroying them), its probably going to be more efficient to use object pools, you can request an object and when you’re done with it, return it to the pool.
  • Exceptions. try-catch is usually handled by saving the current state, executing the try body and in case there’s an error, the state is rolled back, this is expensive in terms of CPU resources and try-catch is advertised as a low performance mechanism. In extreme cases where performance is critical, it might be wrong in design sense but right in performance sense to use other error handling methods, again, if you have such significant performance loses due to exceptions, you should reconsider your design anyway.
  • Caching. Processing power costs money, memory costs money, time costs money, Caching can be a balance between the three, but it can also take your application to the wrong direction. Don’t cache everything, maintaining a dirty cache is as bad as no cache. Distributed cache can be a scale out solution but it can also be a design pitfall. With cache – design for failure, never assume something is already in the cache or fresh enough, keep a timestamp or other marks to make sure your cache is not stale.
  • Deferred execution. If you don’t need real time handling, don’t do real time. Processing things now is costly and most servers are not busy 24/7, if you can postpone your reporting until the server is idle, you’ll look more professional than letting your users wait for an undetermined amount of time. Don’t promise the report to run in real time and get your system frozen until it finishes and don’t display the due time and miss it, regularly. Its unprofessional.
  • Serial execution/Queues. If your application cannot run concurrently without using too many locks and creating random deadlocks (other than the fact that its misbehaving), sometimes its more beneficial to use queued execution, for example, sometimes its even faster to execute jobs in serial as none of them is attempting any locks. 
  • Push vs Polling. Both have their pros and cons, while push is usually faster and in some sense its less resource intensive since its not being queried every certain amount of time, its also keeping a connection open all the time. You should consider both options and decide based on how much delay you're willing to accept and how much resources you're willing to invest. Push can be more complicated to implement but solutions such as SignalR/Socket.IO/WCF callbacks provide enough infrastructure to never use that excuse again.
  • Casts. Casts take time, not a lot, but in some cases can affect performance. You should use Generics or Interfaces to make the program more readable and not worry about casts too much.
  • Accuracy. If you only need float, don't use double. This has varying results on different architectures, sometimes double is faster, sometimes its slower, I've contemplated if I should add it as an optimization, if your code is heavy on floating point operations you should at least look at it.

Where to optimize?

This is Visual Studio Profiler. I've made a little test program that compares several sorting algorithms, we can see clearly which ones take more CPU, the faster ones don't even show up, this is just to show you how easy it is to look for performance bottlenecks with this tool.
Note the views at the top, each view help you determine which are the most probable sources for bottlenecks, each one has a different purpose so explore all of them.

Internet Explorer

IE 10 Profiler on
What should we look at?

  • setAttribute is executed 51k times and takes 329ms. perhaps its affecting the UI's performance? is there something we can optimize in that function or the calling functions?
  • elementFromPoint is not executed too many times but still takes 316ms, what does it do? does it have internal loops?
  • hideChartView is executed only twice but take 192ms. What else does it do beside hiding a chart view?
From first glance it doesn't look that this web application is taking too much CPU, so unless our customer demands it, we will probably not optimize anything.

But wait, we should also look at the call tree:

Here we can actually see which function is calling which, it can help us to understand some of the hotspots calling tree and we see that the most significant slow down is when a mouse is moving, something is rendering, creating a rectangle and setting stroke and fill. With some more digging, perhaps its even possible to speed up the execution.


While IE provides a basic profiler, Chrome has a more robust profiler, I recommend going over the documentation from google, Network, Timeline, Javascript and Memory.

Chrome can also show you how CSS selectors slow down your application, it is part of the style recalculation.

Javascript Profiling

Timeline - Events Profiling


Yahoo's YSlow is a nice plugin, it gives you a general performance overview, if you loaded too many files, where you link you styles and other common issues, it doesn't give you a thorough knowledge of the application, but its still needed information for optimizing your application/website.

There are many more tools that can help you diagnose and optimize your websites and applications, but I've come to like Chrome and Visual Studio and for now they are exactly what I needed.

Methods of profiling:

  • Sampling  - Sampling takes a snapshot of the currently executing threads stacks, by statistically analyzing which functions are caught more in these snapshots, you can see which method takes more CPU resources.
  • Instrumentation – Instrumentation inserts interception points in the code so its being measured whenever a function is called. This is more accurate than statistical sampling, but also slows down the execution significantly, it should be used in environments where the system being profiled is under load and its not known how many resources the application actually have at its disposal, the reason is that if the application have unstable CPU resources, the results will be skewed whenever there is or there isn't any load on the system, fast methods will seem like slow methods. Its also useful in multithreaded applications where some threads are affecting other threads performances though concurrency profiling might be more suitable.
  • Performance Counters – Application performance can also be monitored by windows Performance Monitor, the user can also assign triggers to values you know have bad effect on performance, but this way you leave it to the administrator instead of implementing your own. Its very useful in server environments where these performance counters can be collected and analyzed later. 
  • Memory Profiling - Performance issues may arise as more and more memory is being allocated and the system goes into swapping, but that's not the only concern, each allocation incurs an overhead, it is usually better to allocate more than needed right now than to allocate very small chunks. Like everything, the key is balance. Visual Studio Profiler allows you to see which functions allocated memory, how many allocations and their total size.
  • Resource Contention (concurrency) Profiling - When developing multithreaded applications, sometimes its not clear why a certain function is slow, you may see that there is a lot of time wasted on locks, but to understand how the interaction between the threads is hampering the application, you can use Resource Contention (concurrency) profiling, which visualizes how the application behaves.

Understanding trade-offs

There is no “best” solution, there are trade-offs, either memory efficiency, high performance or readability and maintainability. It is rare to have more than two, in a cost-effective planning you should decide which is more important to you for each module.
  • List vs Dictionary. While list is usually more memory efficient, pulling non-sequential information from it, is usually slower, not to mention a search, which is O(n) on the other hand an ideal dictionary is close to O(1) but nothing is really ideal, on the other hand, adding a value to a list is quicker than adding it to a dictionary as a dictionary has to maintain an internal state for its key.
  • Buffering. Buffering data can do two things, it can either prepare a larger block for an IO operation or it can buffer incoming blocks to achieve the same result in one large operation instead of multiple small operations, Buffering is usually done to reduce overheads at the cost of delay and memory. Buffering can be used in many ways, you can buffer a number of messages you want to pass to the client and then push one big bulk of messages, you can buffer incoming information so you'll execute a parser only once to show a few.
  • Filtering/Grouping. If you do store a dataset in the cache, consider breaking it down to what your queries are going to use, storing multiple groups can be more efficient for later processing, but remember that storing multiple groups can add more latency so then again, it can also be less efficient.
  • Precompute. Precomputing results can save time if these results will be used multiple times, partial precomputing can also help, again, take into account the problems stale data might create.
  • Caching. Caching can help with static or partially static data especially when using a slow storage medium, this might give you a rough idea of what to expect when you think about caching:
Read 1 MB sequentially from memory... 250,000 ns   = 250 µs
Read 1 MB sequentially from SSD* .... 1,000,000 ns   = 1 ms
Disk seek ........................... 10,000,000 ns  = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  = 20 ms
Send packet CA->Netherlands->CA ..... 150,000,000 ns = 150 ms

Note: that it talks about a packet, not 1 MB.
  • Hashing. Lets think of this scenario, you want to find duplicate files in a file system based on content, for that you can really compare each file content to each file content, which will give you the best results. Lets think of optimizations for that program, first, we’ll remove all the 0 size files from comparison, because you can’t really compare 0 to 0. Then we can only compare same file sizes, if the file size doesn't match, the files are not the same anyway. Then we’re left with all the files with the same size. But what do we do with these 100mb files? We have 20 of them, comparing them is the same as comparing 19GB of data! We can save some time by hashing all the files, that’s only 2GB, then comparing all the hashes and only if any of the hashes match, compare the file contents, so now we’re comparing a minimum of 2GB and already know if there’s a chance anything else will match, we can also decide if our hash is enough for us or we want to compare the actual files. Now, a bigger question should be, should we optimize the hash function? Well, does it hash slower than your disk drive? How many comparisons like that are we going to execute each day? 1? Not worth it. All day? Perhaps worth it. Is it critical we’ll compare everything in the shortest amount of time? Could be worth it. Who is paying for that? You get the point.
  • Parallelism. For a single user its probably more efficient to use their entire CPU with all their cores, but consider a situation on a server that you optimize a certain operation to use all its cores, a 2nd user comes along and gets access to all its cores, the performance effects are unpredictable, especially if the application is designed to create more and more threads, context switching is a real issue when using too many threads. I would advise against using multiple threads in web applications for the same request. Instead, you can use queuing mechanism and keep the heavy lifting on a different server.
  • Inlining. Inlining has a minuscule performance gain, with that in mind, Inlining is a method the JIT compiler uses to speed things up for very small methods, when a program calls a method, a stack has to be filled with all parameters and the processor jumps to the method’s body. Inlining saves some of these steps by pushing the method’s body directly into the calling method, for very small methods it can actually speed things up, for medium to large methods, its negligible. .NET 4.0 introduced flags to force inlining (AggressiveInlining), previous frameworks only supplied an attribute to avoid inlining (NoInlining).
  • Lambda/LINQ limitations. Lambda and Linq (which could use lambda) are great, I love them, especially for RAD, but since all lambda methods can’t be optimized at JIT compiler time, inlining will never happen. Take that into account if you’re processing many records. You should also consider using CompiledQuery for your LINQ.

Most probable performance hotspots:

  • Loops/For/for each/recursion.  Any loop has the potential to occupy the CPU and use a lot of memory, You should check how many objects are being processed and how much time each loop takes.
  • IO (DISK/Network/Database) . Disk/Network/Database IO are many times slower than memory access, see if you can load the data in the background or load less data.
  • UI. UI updates takes time, each layout has to be calculated and each control has to draw itself.
  • Program flow. You should make sure a 2nd command is not executed if it depends on the first one to be successful - you are just throwing resources away, if attempting a retry for something, don’t discard the work that was already done if you can.

Most probable memory hogs

  • Indexes, Hash-tables, Lists, Arrays. Containers of any sort hold data by definition, check if its really needed, offload it to cache if its giving you performance benefits but still want it to be off the application's memory.  Be careful with distributed cache/out of process cache, network latency and throughput takes away the benefit of storing large collections as they need to be serialized and deserialized for going over the network, this is also important when using ASP.NET state service.
  • RAW data, file contents (audio, video, text files). Most probably they don’t need to be stored in memory, the file system cache does a great job of using the unused memory.

Basic SQL Optimizations

  • IO is expensive! You can minimize the amount of data being transferred inside the engine by using index includes and using the smallest possible temporary tables. Table variables are suitable for 1000 records or less in most cases.
  • Functions are expensive! Especially when they run on each row, reconsider every function! Also, in that context, avoid joining on expressions, filtering by expressions and selecting with expressions, if you do, check the execution plans and make sure these queries are not too heavy on the CPU and not falling back to table scans.
  • Avoid comparing different data types, some of them will cause a scan, it might be better to have an indexed computed column.
  • Consider using Persisted Computed Columns with indexes and consider executing your queries on them instead of expressions, persisted computed columns execute the function on row update and also update their associated indexes like regular columns which makes them very efficient for a data-warehouse application. 
  • Execution plans are expensive! Use query parameters or stored procedures.
  • Cross joins are expensive! Cartesian products are expensive, for each row in table ‘a’ join each row on table ‘b’. that’s a times b rows!
  • Avoid Cursors! Cursors are slow, sometimes they are the only way to go but for many problems there are faster options than cursors.
  • Indexes can make your life either heaven or hell depending on your understanding of them. Indexes can help you quickly retrieve data but each index adds overhead for each update, insert and delete. Use index includes (include columns in each index for quicker retrieval – avoids Key Lookup) when the benefits are greater than the downsides, index includes duplicate the data, that means more memory and IO, you can even add indexes on persisted calculated columns and have indexed views (schemabinding).
  • Partitioning. Partitioning is an old trick, RAID0 is a type of partitioning without redundancy, RAID5 is a 3 disk partitioning. While SATA3/SAS removed most of the bottlenecks controller wise, more common disk drives are still relatively slow, by splitting your database files on multiple disks, you can have more IO/s available for your database, also, datacenters are starting to use SSD more and more, which speed things up even further.
  • Memory. SQL assumes all the server’s memory belongs to it, it might free some memory when the server is in a really bad situation, but from my experience that’s usually too little too late and can bring a server to its knees. Its best to tell it how much memory to use, reserve about 1 GB of for Windows and even more if the server is not a dedicated database server. Ideally you should have at least enough memory for all the active database tables + enough memory for their most common queries. These days memory is cheap enough that saving money on it doesn't pay off but actually causes you to waste money on a slow server.
  • Avoid negative comparisons. Negative comparisons will most likely cause a scan since it can’t use an index, consider your negative comparisons and the table sizes, Check your execution plans!
  • Retrieve only what you need, join only what you have to. Retrieving more adds overhead which is usually CPU/memory/network/disk waste, save these resources for operations that needs them.
  • Avoid Dynamic SQL, use prepared statements if you have to. All queries are compiled to execution plans, if these plans contain changing values then a new plan has to be created each time the query executes instead of reusing the previous query.
  • Avoid Like comparison, especially with wildcards. Like with wildcards causes a table scan, if you must search for text, use full text search or avoid wildcards.
  • Learn to read execution plans! I can’t stress that enough!
  • Once your database is running for a while, get the top 10 most expensive queries to get a clue whats taking most of your resources.
  • SQL Profiler is a great tool for knowing what's going on in real time, if I have performance issues, I'm setting it up to show everything above 100ms. Bear in mind that running a profiler on a server is not without risks, I've had to restart the SQL Server service once after a profiler froze and got the database to misbehave.

Basic HTML/Javascript Optimizations

  • Most expensive is IO and UI! Every time you show/hide/create/delete an element, the browser recalculates where everything should be, its called a reflow.
  • Big/Multiple loops are slow, use associative arrays if you need a dictionary and regular arrays if you want fast iteration over the elements (for.. in is slower).
  • On old browsers, DOM updates/parsing is very slow, innerHTML on IE is faster than on Chrome and Firefox.
  • Avoid updating elements which cause reflows, padding on IE6-IE8 has the highest performance penalty (x4), while DOM manipulations generally trigger reflows, multiple DOM updates which create reflows are slow to terrible performance, absolute positioned objects have only their own performance penalties rather than affecting the whole document. 
  • Updating elements is faster before they are part of the DOM for the same reason - reflows.
  • CSS wildcards are slower, use the most selective selectors
  • Avoid IFRAMEs, browser in browser have a similar penalty/overhead to having a new browser window open. Also, IFRAMES block onload until they are done.
  • Nested DIVs are slower in certain circumstances, deeper elements to reflow.
  • Mind your scope! local scope is faster than global scope, prefix local variables with var.
  • Avoid using nested properties in loops, they are not optimized. Make local variables and use them instead.
  • Always measure! Don’t assume anything in the browser, there are too many variables! 
var start = (new Date());
console.log("executed " + ((new Date()) - start).toString() + "ms");

  • Use CDNs where needed. Browsers have a limitation for how many concurrent open connections they have, loading resources from multiple domains can overcome that limitation, in addition, CDNs can provide a closer to browser server, which can serve resources faster.
  • Avoid eval(s), not only for performance issues but also security.
  • Minify your Javascript files and CSSs
  • Know when to use setInterval and recursive setTimeout. setInterval raises an event every x ms, but if your browser is a bit slow then it will be unresponsive as its executing the interval events. Another option is to use setTimeout and part of the function, call setTimeout again, this way you can tell the browser how much time to rest between events instead.

C# Optimizations

  • Most expensive is IO and UI!
  • C# is very fast, in some conditions competitive with C++
  • Use the data types you need, avoid structs as method parameters or use ref if applicable, structs are considered values, a copy is created for every function call.
  • Avoid dynamic. While its not directly related to performance, and it is a static type, its not checked at compile time, dynamic types are making your code less readable and could introduce problems. 
  • Avoid Reflection, accessing object type on runtime is slow, if you have to, cache accessors and learn about dynamic proxies.
  • Avoid casting, generics is typesafe and avoids unneeded boxing/unboxing.
  • Avoid COM, prefer managed code.
  • Use Dictionaries for fast retrieval, avoid searching/linq on lists in critical sections, compare with CompiledQuery(ies).
  • StringBuilder. Use StringBuilder if you have long list of manipulations, this can save both CPU time and memory as strings are immutable.
  • If using multithreading, understand different locking mechanisms, use busy waits for very quick operations and wait locks for long locks.
  • If creating/destroying many/large objects, understand garbage collection, generations, Dispose/Finalize, SuppressFinalize and Large Object Heap. While garbage collection provides a simple memory management for .NET applications, its important to know that reference rich and large objects have a certain penalty, reference rich objects needs to go through more work for unreferencing them and large objects are stored in the Large Object Heap which is unable to relocate objects and therefore doesn't free memory as often. While garbage collection could be a performance hit, it should only be optimized if collection times are affecting performance significantly
  • Like everything else, measure! (StopWatch)
  • Defer IO execution until the system is either idle or you must. IO takes CPU time, but it also uses system interrupts, which sometimes lock the whole system for the duration of the interrupt, more IO means less CPU for other tasks.
  • Virtual methods/Interfaces. They have a performance penalty of about 10%, I would not recommend avoiding them as most of the time, the readability and maintainability they provide well exceeds their penalty, on the other hand, a property override might have a higher penalty (using 'new' keyword).
  • Use System.Diagnostics.Debug if you want the compiler to remove the calls when compiling in Release and System.Diagnostics.Trace if you want to keep.

Guidelines for designing a new application/feature

  • Ask/Decide on minimum requirements. Don’t optimize prematurely but design for requirements!
  • Prefer library/framework methods, most of them have been already optimized but don’t follow blindly, when you profile, also look at the libraries.
  • Avoid unnecessary function calls, where necessary could also be necessary right now.
  • Prefer positive checks rather than negative in SQL. 
  • Learn to use threading mechanisms, Tasks, Actions, Async.
  • Don’t waste!

Ask yourselves:

  • Is the execution order optimal?
  • Does it have to run now?
  • Am I discarding data?
  • Is the application doing too much work in this point in time? How can I minimize it?
  • Does the code trigger too many actions? Can I combine them to a single execution?

Further reading:

Monday, April 28, 2014

Debugging FFMPEG V4L2 on Android

So I've figured out what's wrong with V4L2 in the master branch: git://

In the past few days I've been trying to figure out why compiling FFMPEG on Android didn't work properly with any Android/v4l2 webcams.

I activated the uvcvideo module's trace with this:

echo 2047 > /sys/module/uvcvideo/parameters/trace

And fired up another terminal with this so I can see the kernel messages in real time:

cat /proc/kmsg

Which produced a result similar to this:

<.>[.] uvcvideo: Queuing buffer 8.
<.>[.] uvcvideo: uvc_v4l2_ioctl(VIDIOC_DQBUF)
<.>[.] uvcvideo: Frame complete (EOF found).
<.>[.] uvcvideo: Dequeuing buffer 9 (4, 614400 bytes).
<.>[.] uvcvideo: uvc_v4l2_ioctl(VIDIOC_QBUF)
<.>[.] uvcvideo: Queuing buffer 9.
<.>[.] uvcvideo: uvc_v4l2_ioctl(VIDIOC_DQBUF)

Which looks ok, so no hint there.

Then I dug in uvccapture which seemed to work, but could only capture very slow framerates. so I've changed that (from sleep to usleep in the capture loop) and tried again - seemed fine to me, time to find the differences...

So I've compared v4l2uvc.c from uvccapture and v4l2.c from ffmpeg:

At first glance they seemed the same, except for the O_NONBLOCK flag. so I've read a little about it and then seen that line in v4l2.c:

while ((res = v4l2_ioctl(s->fd, VIDIOC_DQBUF, &buf)) < 0 && (errno == EINTR));

Whenever I see a while loop without a CPU yield it looks suspicious... 

So I've attempted to remove O_NONBLOCK and recompile.

Great, now ffmpeg is not crashing anymore, but the video file looks weird, like its not getting enough frames and the video freezes but still not the full 30fps potential.

Not knowing what's wrong, I've started reading through the v4l2 documentation, pausing once every to check the code in ffmpeg and uvccapture is actually following the specs.

Everything looked fine, but I've really wanted to see what I'm getting from the uvcvideo device, so I've added this right after the VIDIOC_DQBUF line:

av_log(ctx, AV_LOG_DEBUG, "VIDIOC_DQBUF index: %d type: %d bytesused: %d flags: %d field: %d\n",
buf.index, buf.type, buf.bytesused, buf.flags, buf.field);

Again, the results looked fine, but I really wanted to see if the data was the same or different for every packet, I've looked through the docs and found out about framecrc:

ffmpeg -loglevel debug -f v4l2 -i /dev/video0 -f framecrc -

The result was curious, apparently, only when buf.flags has V4L2_BUF_FLAG_MAPPED there's also a change in buffer contents, which puzzled me because everything else seemed the same and Camera ICS was getting high FPS and smooth video.

I've started going over the method calls again and looking at the compile log, which showed a warning about long incompatible with int64 which raised another warning light, so I've looked through the header files to see which function wasn't getting the correct parameters.

I've overlooked this before, but this time it drew my attention:

int (*open_f)(const char *file, int oflag, ...);
int (*close_f)(int fd);
int (*dup_f)(int fd);
int (*ioctl_f)(int fd, unsigned long int request, ...);
ssize_t (*read_f)(int fd, void *buffer, size_t n);
void *(*mmap_f)(void *start, size_t length, int prot, int flags, int fd, int64_t offset);
int (*munmap_f)(void *_start, size_t length);


#define v4l2_open s->open_f
#define v4l2_close s->close_f
#define v4l2_dup s->dup_f
#define v4l2_ioctl s->ioctl_f
#define v4l2_read s->read_f
#define v4l2_mmap s->mmap_f
#define v4l2_munmap s->munmap_f

and decided to modify it a bit to call the native functions:

#define v4l2_open   open
#define v4l2_close  close
#define v4l2_dup    dup
#define v4l2_ioctl  ioctl
#define v4l2_read   read
#define v4l2_mmap   mmap
#define v4l2_munmap munmap

Which won't work with libv4l2 but I just wanted to test and see if it could be it... said naaa a few times while it compiled and tried it again:

0,          0,          0,        1,   614400, 0xa7c24d33
0,          1,          1,        1,   614400, 0xa39f740e
0,          2,          2,        1,   614400, 0x4a917832
0,          3,          3,        1,   614400, 0x0a972f6f
0,          4,          4,        1,   614400, 0x0d7511f3
0,          5,          5,        1,   614400, 0xa6507276
0,          6,          6,        1,   614400, 0xb6fb52dd

Beautiful. It works.

Saturday, March 16, 2013


Ever since .NET 4.0 came out I've been wanting to play with MemoryMappedFile.

MemoryMappedFile is a way to map a file to memory address, so you're able to access it just like you would a byte array (for example).

At first I've attempted to create a simple document store like library, but MemoryMappedFile has its own quirks and working around them seemed a like a task for a library.

So feel free to use this FileStore library in your own projects, mind you, this library wasn't tested beyond simple sanity tests, no unit tests or integration tests have been done and in its current form it is in no way thread safe.

There are two main classes in the library:
1. MemoryMapper which is a wrapper around MemoryMappedFile, designed to allow file growing and multiple views/streams create/cleanup.
It has CreateView/CreateStream for creating accessors and Flush which is used to free unused accessors and flush the accessors changes to disk.

2. FileStoreManager - a document store-like object which stores data on one file and the data table with guid key on another file, benchmarks are pretty good for its scope, main bottlenecks are the binary formatter and the fact that there's no index so it has to scan through the data table file, I've added a dictionary for faster access, also there's no handling for modifying or deleting objects.

So here are a few conclusions from that project:

1. CreateViewAccessor and CreateViewStream are VERY slow, use blocks as large as possible over creating views/streams for each block.
2. There's a limit to how many Views and Streams you can create, since its mapping to a contiguous memory area, it might pose a problem for the calling program to allocate memory for other things or throw an exception itself ("Not enough storage is available to process this command."). this is more likely to happen in x86 because of the short address space of 32bit applications.
3. Disposing the accessors flushes their changes to disk, if you're doing many of these, it could slow things down.
4. Resizing the file requires disposing the MemoryMappedFile and its Streams/Views, doing it is an expensive operation so you should try to minimize file resizes.

You might want to take a look at RaptorDB by Mehdi Gholam.

You can find the project here:

Saturday, February 9, 2013

NuGet Package for RTree

I've just created a NuGet Package for the RTree porting I've done a few years ago, I've seen people are still interested in that project so why not make their lives easier?

I've also updated the project at (Just SVN):

I've added simple tests and changed the log4net dependency to NuGet.

I've also seen that Aled Morris updated his java documentation at:

Its not exactly the same anymore as the version I've ported is from 2009.

Friday, February 1, 2013

Passing Multiple Values to SQL

This article started as a collection of ways to pass multiple values to SQL, some of the articles I've written in the past became benchmarks because I just can't resist the temptation of checking how much can one thing do over the other and what's the best uses for each case.

So allow me to show you some of my findings.

In the past (SQL 2005 era) when we wanted to pass multiple values, we probably used splits, XML, dynamic queries and whatnot.

SQL 2008 introduced user defined table types which allowed us to pass whole tables as a SQL parameter.

Starting SQL 2008 we got merge which could be used on user defined table types and XML for bulk inserts/updates, which, in my opinion, is a great possibility.

So lets see what we're going to cover.

1. Split

2. SQLBulkCopy

3. XML Parameters

4. Dynamic SQL

5. Pre-rendered SQL (which is just a fancy way of saying I'm going to convert SqlCommand to string so we can see if that's the way ADO.NET passes the values).

6. User Defined Table Types

So Lets Start:

1. Split

Lets say we have a column with multiple values in the same varchar field, but we need to join on these values to get a report our boss needs.

What do we do?

Well, here's a split function that Mladen Prajdić wrote:

CREATE FUNCTION dbo.Split(@data NVARCHAR(MAX), @delimiter NVARCHAR(5))
    SELECT    @textXML = CAST('<d>' + REPLACE(@data, @delimiter, '</d><d>') + '</d>' AS XML);

    INSERT INTO @t(data)
    SELECT  T.split.value('.', 'nvarchar(max)') AS data
    FROM    @textXML.nodes('/d') T(split)

Performance is not so bad but also no great. so great, we have that report, but what if they decide to make that report available to clients?

Well, I've got a solution for that but I'm not sure you're gonna like it, it has something to do with triggers.

How then? well, create a cache table and triggers to update it, then you can index that cache table and get the job done.

2. SQLBulkCopy

One of the oldest ways of bulk inserts is with BULK INSERT , but its more of a DBA role than a developer, but ADO.NET does provide a way for bulk inserts, its called SQLBulkCopy.

We can set the batch size, timeouts, column mappings and some more settings.

using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default,null))
    bulkCopy.BatchSize = this.BulkOperationsCount;
    bulkCopy.DestinationTableName = "Entities";

3. XML Parameters

Starting SQL 2005, there is XML data type, which meant we could pass XML to stored procedures, store it in tables and query by it.

This made SQL server very robust as there is a possibility to pass multiple values without resorting to workarounds.

with upusers (EntityId, Name, Phone, Email, Address, City, Zip, State, Country, BirthDate)
        T.val.value('(@EntityId)','int') as EntityId,
        T.val.value('(@Name)','nvarchar(max)') as Name,
        T.val.value('(@Phone)','nvarchar(max)') as Phone,
        T.val.value('(@Email)','nvarchar(max)') as Email,
        T.val.value('(@Address)','nvarchar(max)') as Address,
        T.val.value('(@City)','nvarchar(max)') as City,
        T.val.value('(@Zip)','nvarchar(max)') as Zip,
        T.val.value('(@State)','nvarchar(max)') as State,
        T.val.value('(@Country)','nvarchar(max)') as Country,
        T.val.value('(@BirthDate)','nvarchar(max)') as BirthDate

    FROM @xmlval.nodes('/root/item') T(val)

merge Entities as target
using (select * from upusers) as source (EntityId, Name, Phone, Email, Address, City, Zip, State, Country, BirthDate)
on (target.EntityId = source.EntityId)
when matched then
        set Name = source.Name,
         Phone = source.Phone,
         Email = source.Email,
         Address = source.Address,
         City = source.City,
         Zip = source.Zip,
         State = source.State,
         Country = source.Country,
         BirthDate = source.Birthdate
when Not Matched Then
    insert (Name, Phone, Email, Address, City, Zip, State, Country, Birthdate)
    values (source.Name, source.Phone, source.Email, source.Address, source.City, source.Zip, source.State, source.Country, source.Birthdate)

4. Dynamic SQL

Dynamic was always an ugly love child of programmers, it made everything simpler from a program flow point of view, it made things hacky and ugly when it wasn't made properly and a hell to fix whenever there's a problem.

Dynamic SQL done wrong is the main cause of SQL injection, but like many good things Microsoft brought to us in SQL 2005 which sp_executesql was one of them, its possible to write dynamic SQL and avoid most of the injection problems.

It should be noted that writing proper insert queries is also affecting performance, which you can see in NoOptimizationsDynamicExecution.

So an unoptimized insert query would look like this:

INSERT INTO tablename(col1, col2,...)
     VALUES (..,..,..);
INSERT INTO tablename(col1, col2,...)
     VALUES (..,..,..);
INSERT INTO tablename(col1, col2,...)
     VALUES (..,..,..);  

while optimized insert query would look like this:

INSERT INTO tablename(col1, col2,...)
     VALUES (..,..,..),(..,..,..),(..,..,..)...

5. Pre-render SQL

RenderSQL.cs is a class from the IQueryable to SQL project, I've converted it to SqlCommand to see if I can better understand how the query is converted from parameters to sp_executesql.

For reference, the benchmarks is executing the User Defined Table Type query, passed through RenderSQL.

6. User Defined Table Type

User Defined Table Types provides us with reasonable performance and IMHO, huge readability and program flow benefits. Instead of going through multiple layers, converting the data from X to Y and doing the same in SQL, you just pass a List of SqlDataRecord, SqlDataReader or a DataTable and on SQL side, you just use the parameter as a normal table variable.

so assuming we have this:

string testcommand = @"
select * from Entities
where Entities.EntityId in ({0})

using (SqlConnection connection = new SqlConnection(this.ConnectionString))
    using (SqlCommand command = new SqlCommand(string.Format(testcommand, string.Join(",", this.SelectRecords.Select(i => i.GetValue(0)).ToArray())),connection)) 
        command.CommandType = CommandType.Text;

        using (var reader = command.ExecuteReader())

We just need to use this instead:

string testcommand = @"
select * from Entities
join @intt 
    on [@intt].Id = Entities.EntityId

using (SqlConnection connection = new SqlConnection(ConfigurationManager.ConnectionStrings["TestConnectionString"].ConnectionString))
    using (SqlCommand command = new SqlCommand(testcommand, connection))
        command.CommandType = CommandType.Text;

        var param = command.CreateParameter();
        param.SqlDbType = SqlDbType.Structured;
        param.ParameterName = "@intt";
        param.TypeName = "udt_inttable";

        var dt = new DataTable();

        param.Value = dt;


        using (var reader = command.ExecuteReader())

Simple, right?

After digging around (and looking at SQL Statistics, Compilations/Recompilations and this article), it seems that the query is not cached and according to this, its by design.


And Benchmarks, I've attempted to make exaggerated caching so it will affect the benchmarks as little as possible, in the real world, actually getting the data to SQL might be your bottleneck.

During the test I got this very nice error message:

The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a
very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.