Discussing Optimization Approaches
Feedback on my Problem Synopsis
Since he brought up interesting aspects, I asked him if he'd be okay if I shared his mail and my reply on this blog; he said yes, so here we go:
I like the approach you take by looking at what real life users will want from gnuradio when transitioning from an academic perspective to realtime system. Measurements are always not enough to understand the specifics of your system so I'm looking forward to see how your project provides measurements to the gnuradio user.
Building on block computing performance measurements there is one thing I would like to see in gnuradio and that is a flowgraph optimizer.
To be more specific, a flowgraph optimizer would try to adapt the parameters of the blocks (e.g. the data chunk passed to each block) in order to optimize one/more parameter(s) of a flowgraph (e.g. overall processing time). In a normal way this optimizer should be run just once to determine the optimum parameters that will be used subsequently. If we see the problem to solve from a general perspective the optimizer would fall in the category of multi-objective optimization which has a numerous solutions and has been thoroughly discussed in the academia and industry (gaming is usualy doing multi-bjective optimization through AI). Another real-life example would be the optimizer in the 4Nec2 antenna simulation program that uses AI to optimize the antenna when a set of objectives (variables) is set by the user, e.g. minimum SWR, Z close to a value, etc.
In my opinion gnuradio will really benefit from such an optimizer as the values of block parameters can provide quite different end results.
Not sure if this can be part of your GSOC project but I thought it worth mentioning to you and gnuradio users on this list. Maybe can be part of the next GSOC.
A few Thoughts on How GNU Radio inherently optimizes
This brought up a lot of interesting points, so here's my reply:
thanks for your comment :)
Such an optimizer would be really, really fancy.
In a way, though, GNU Radio already does this when running a flow graph: It just asks blocks to compute a reasonable amount of items to fill up the downstream buffer. This actually (conceptually simple) approach is one great strength, because it just keeps the computer "busy" as much as possible.
There might be space for optimization, though, I agree: Maybe it would be better for some blocks just to wait longer (and thus, not utilize the CPU) if it was computationally beneficial to work on larger chunks, as long as there are enough other blocks competing for CPU power.
However, this leads to the problem of balancing average throughput with latency.
What the GNU Radio infrastructure does to approach this is actually quite simple: 1. Although it might be "fastest" to process all data at once, buffer lengths set a natural limit to the chunk size, and thus latency. So we have an upper boundary. 2. It is best to be closest possible to that upper boundary. To achieve that, block developers are always encouraged to consume as many input items and produce as much output as possible, even if the overhead of having multiple (general_)work calls is minute. This ensures that adjacent blocks don't get asked to produce / consume small item chunks (which will happen if they were in a waiting state and a small number of items was produced or a small amount of output buffer was marked as read).
Optimizing this will be hard. Maybe one could profile the same flowgraph with a lot of different settings of per-block maximum output chunk sizes, but I do believe this will only give as little more information than what the block developer already knew when he optimized the block in the first place. If he didn't optimize, his main interest will be if his block poses a problem at at all; for that, standard settings should be employed.
To give developers an API to inform the runtime of item amount preference, different methods exist, however. I'll give a short rundown of them.
1. Most notable are the fixed_rate properties of gr::block, as implemented in sync_block, and the decimator and interpolator block types 2. If your block will only produce a multiples of a certain number of items, the set_output_multiple is a method that will potentially decrease overhead introduced by pointless calls to forecast and/or work. 3. In hardware optimization, alignment is often the performance critical factor. To account for that, set_alignment was introduced. It's working very similar to set_output_multiple, but does not enforce the multiples, but sets an unaligned flag if non-multiple consumption occurred. The runtime will always try to achieve that the start of your current item chunk is memory-aligned to a certain item multiple. If however less was produced, your block might still be called, to keep the data flowing.
To properly apply these flags, you'll basically need a human understanding of what the block does. It may, nevertheless, be very helpful to understand how well your block performs with different item chunk sizes. To realize that, some mechanism to change scheduling behavior
I will look into that; I think it should be possible to manipulate the block_executors to manipulate them into changing their forecasting/work calling behavior at runtime, but I'm quite sure that this will bring new code into the main tree .
All in all, right now I'm really stuck with what I actually want to improve with the performance analysis of GNU Radio flowgraphs offered by performance counters/gr-perf-monitorx, because they address many of these issues already. Your execution-per-item over chunk size idea is excellent!
 I'll really have to take a deeeep look at block_executor and the tpb scheduler to tell; if I decide to add functionality that introduces significant runtime overhead or changes too much of internal behaviour, noone will be pleased, so I might take this slow and will have to discuss it with experienced core developers. I'm not very hesitant when it comes to fiddling with in-tree source code, but my workings almost never make it to the public, because I always figure they don't address a problem properly or break too much in comparison to what they can possibly improve.
To my pleasant surprise, I've found a real extension of the measurement scope of the currently existing performance counters. This will actually help me in getting to find a first point of attack for gr-benchmark, which is proving to be more versatile than I thought when writing my proposal.