A Measurement Toolbox for GNU Radio - My Google Summer of Code project


While writing my application for Google Summer of Code, I considered a lot of things that I had wanted to tackle in GNU Radio for quite a while. Among optimization, improved documentation, distributable flow graphs and a lot of other things, one issue stood out:

GNU Radio is actually easy to use, but it's hard to prove that you're using it properly. To explain this, we'd have to take a look at the purposes GNU Radio serves today:

1. GNU Radio is used for academic purposes. Almost always this starts with creating simulations of some communication system, and very often continues with converting the simulated system into a real-world software defined radio. Here, proving that you're getting the right results is absolutely crucial. After all, this is science, and you can hardly call it progress if you can't document, reproduce and communicate your findings in a way that makes your audience confident in your credibility.

2. GNU Radio has an ever-growing audience of communication applications. This means that people buy hardware and invest a lot of engineering time into developing real-time operational systems. This leads to a high demand for optimization; GNU Radio already tackles that with approaches like the Vector-Optimized Library of Kernels (VOLK), which greatly increases the performance of specific numeric operations on large sets of data.

The following will try to put my understanding of the state of the art into words, explaining along the way what I plan for the Measurement Toolbox.

Doing Measurements for Science

A very common task is to measure the performance of a transceiver system, for example with respect to bit error rate (BER) over a varying signal-to-noise ratio (SNR).

Approaches to generating a BER curve

To illustrate, I'll introduce the idea of a purely simulated transceiver with a channel model that defines noise as additive.

To create the BER-over-SNR graph, the user can do a lot of things in GNU Radio:

  • Build the simulation flow graph, and add instrumentation sinks for BER, as well as GUI input widgets (e.g. sliders) to vary the noise power. Then, change the noise power manually, note down the BER value shown in the graphical sink, rinse, repeat. In the end, use the gathered data to generate the graph. Obviously, this is tedious and a little error-prone.
  • Build the simulation flow graph, equip it with a head block to limit execution duration, use a file sink to write out the BER, and run that flow graph repeatedly with different noise powers. This can, for example, be done by constructing the flow graph in GRC and executing the resulting Python program with varying command-line options. This sounds a lot nicer, because we can script that to generate the table of data necessary for graph generation.
  • Use the same flow graph, but instead of running the program repeatedly from the command line, modify the generated Python file to set the noise power. This sounds much more elegant and will increase speed, but we have to realize that the researcher now not only has to define the flow graph, but also has to find suitable conditions or hooks for extracting the current BER and setting a new noise power. That might not be that hard, but it will lead to flow graphs that no longer match the GRC file, and you will always have to write (or at least copy&paste) update-and-extract code for each simulation you might ever run.
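The scripted approach above boils down to a simple sweep loop. The following is a minimal, self-contained numpy sketch (not GNU Radio code, and not the Measurement Toolbox) that simulates BPSK over an AWGN channel and collects one BER value per SNR point; `ber_for_snr` and its parameters are hypothetical names used only for illustration:

```python
import numpy as np

def ber_for_snr(snr_db, n_bits=100_000, rng=None):
    """Simulate BPSK over an AWGN channel and return the bit error rate."""
    rng = rng or np.random.default_rng(42)
    bits = rng.integers(0, 2, n_bits)
    symbols = 2.0 * bits - 1.0                       # BPSK mapping: 0 -> -1, 1 -> +1
    noise_std = np.sqrt(0.5 / 10 ** (snr_db / 10))   # Eb/N0 in dB -> noise sigma
    received = symbols + noise_std * rng.normal(size=n_bits)
    decided = (received > 0).astype(int)             # hard decision at 0
    return float(np.mean(decided != bits))

# Sweep the SNR and collect (snr, ber) pairs: the scripted equivalent of
# re-running the GRC-generated program with different command-line options.
curve = [(snr, ber_for_snr(snr)) for snr in range(0, 9, 2)]
```

Replacing `ber_for_snr` with "run the flow graph, read the file sink" gives exactly the second bullet point above.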

What does the user really want?

In the end, as a user, you don't really want to do any of this. What you'd want is more along these lines:

  • Define some characteristics that should be changed. This could be the aforementioned noise power, but it could also be something less numeric - maybe different impairment models, etc.
  • Define ranges for these characteristics, and define conditions on which a measurement will end, e.g. number of processed samples, running time, total errors occurred, etc. Make that definition flexible yet clear.
  • Define some properties to observe. Again, this is not limited to BER; it might be an item stream or a message port, and it could also be a function to probe or some other metric.
  • You want your system to collect and keep the data itself. No writing down. No figuring out later that having the average is nice, but you should also have calculated the standard deviation or the like. Storage is cheap. You wanted these characteristics, and we can assume there's not going to be a flood of values, so why not store them all?
  • Ignore where your measurements will run. A test case with some hundred different parametrizations (e.g. you vary noise power and the frequency behaviour of the channel and try to find out how the combination of these effects affects your system) might take some time to run - why not use a room full of computers to run portions of the overall workload in parallel?
  • Select data to visualize and get graphs. Export these graphs to your favourite graphics format, or paste the data into a file that your colleague might use for his own purposes.
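As a sketch of this wish list, here is a hypothetical, plain-Python sweep driver: it takes parameter ranges, runs a measurement function for every combination, and keeps every raw observation instead of only an average. All names (`run_sweep`, `fake_ber`, `repeats`) are made up for illustration and are not part of any existing GNU Radio API:

```python
import itertools

def run_sweep(parameters, measure, repeats=1):
    """Hypothetical sweep driver: call `measure` for every combination
    of the given parameter ranges and keep every raw value."""
    results = {}
    for combo in itertools.product(*parameters.values()):
        point = dict(zip(parameters.keys(), combo))
        # Store all observations; storage is cheap, so no premature averaging.
        results[tuple(combo)] = [measure(**point) for _ in range(repeats)]
    return results

# Example: sweep a stand-in "channel" over noise power and impairment model.
def fake_ber(noise_power, impairment):
    return noise_power * (2.0 if impairment == "multipath" else 1.0)

data = run_sweep(
    {"noise_power": [0.1, 0.2, 0.4], "impairment": ["awgn", "multipath"]},
    fake_ber,
    repeats=3,
)
```

Note that the parameters need not be numeric: the impairment model here is swept just like the noise power, matching the first bullet point above.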

Enter the Measurement Toolbox: Defining these test cases, getting the data out of the running flow graph into a database and extracting it from there later is a core milestone. Integration into the GNU Radio Companion and common export formats will make understanding, sharing and publishing easier; as you will see in the next section, the ability to run on remote computers will come for free.

Doing (Computational) Performance Evaluation

A lot of work is currently going into speeding up the GNU Radio infrastructure (VOLK, the gr-trellis GSoC project, ongoing work on the buffer architecture of GNU Radio to allow smarter and directly device-accessible buffers, and a lot more). This is becoming a more vital part of the GNU Radio project as the limits of realizable systems approach real-world medium access control, GNU Radio sees more usage on embedded devices, and GNU Radio's capabilities have reached a point where using GNU Radio-based SDRs is competitive with using hardware transceivers. The objectives of many GNU Radio users have shifted from building proof-of-concept systems and simulation software to high-throughput systems.

However, to optimize the computational performance (or just performance from here on), one has to know where time is spent. This should happen on different levels:

  • A component level. In GNU Radio, this means looking at which blocks use how much of the total CPU time.
  • An algorithmic level. This means that different implementations of the same algorithm should be compared. That implies looking at things like VOLK kernels in isolation, and finding out how well they work.
  • A systemic level. A computer running a set of GNU Radio blocks is not really executing things fully in parallel without any performance effects: How far does increasing buffer size, and thus latency, increase performance due to en-bloc processing? What performance hits occur when switching between block threads, each using a part of the CPU caches?

I will say a few words on each of these problems; because things get a little technical here, I assume you have worked with GNU Radio before and know that a block is an instance of a subclass of gr::block, and that its general_work method gets called in order to produce items.

Evaluating Block Performance

The most interesting question here is: Of the total CPU time used by GNU Radio, how much time does each block consume?

To understand a bit of the problems involved, we must go through a little background:

If you look at GNU Radio, the standard (and only fully featured) scheduler (as of today!) is the thread-per-block scheduler. What happens here is that the GNU Radio runtime spawns a thread for each block instance your flowgraph uses.

Initially, these threads are in a waiting state, because none of them have input, which means that aside from the sources, none can do anything. The source blocks' threads are notified, and each block_executor calls its block's general_work, supplying the amount of available space in the output buffers as noutput_items (or at least a reasonable portion of it).

Having finished its work, the block then notifies the downstream blocks' threads that things have changed; and since the whole output buffer is free, the downstream blocks can start working away. The moment they finish, GNU Radio knows how many items they consumed, freeing space for items produced by the sources.

Thus, as soon as the first items have propagated from source to sink, the whole flowgraph is running at maximum throughput: While a downstream block is still processing data, an upstream block might already be producing more items. The more CPU cores you have, the more blocks can execute in parallel -- as long as there are no bottlenecks.

As soon as there is a single point of congestion, the buffers upstream of that block are full, and the downstream buffers have been completely emptied, our application becomes effectively single-threaded.
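The notify-and-wait pattern described above can be sketched with a toy bounded buffer. This is not GNU Radio's actual buffer implementation (the real ones are circular and memory-mapped); the sketch only illustrates how a full buffer blocks the upstream thread and an empty buffer blocks the downstream one:

```python
import threading
from collections import deque

class Buffer:
    """Toy single-producer/single-consumer buffer standing in for the
    buffers between two GNU Radio blocks."""
    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity
        self.cond = threading.Condition()

    def produce(self, item):
        with self.cond:
            while len(self.items) >= self.capacity:  # upstream blocks here
                self.cond.wait()                     # when downstream is slow
            self.items.append(item)
            self.cond.notify_all()                   # wake the consumer

    def consume(self):
        with self.cond:
            while not self.items:                    # downstream blocks here
                self.cond.wait()                     # until items arrive
            item = self.items.popleft()
            self.cond.notify_all()                   # free space upstream
            return item

buf = Buffer(capacity=4)
out = []

def source():
    for i in range(10):
        buf.produce(i)

def sink():
    for _ in range(10):
        out.append(buf.consume())

threads = [threading.Thread(target=source), threading.Thread(target=sink)]
for t in threads: t.start()
for t in threads: t.join()
```

With a capacity of 4 and 10 items, the source thread is forced to wait on the condition variable several times, which is exactly the congestion effect described above.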

So obviously, there was great interest in finding out how blocks interact and how much time they spend computing. Sadly, performance measurements in multi-threaded applications are quite non-trivial if not done from within.

Thus, ControlPort was introduced. It offers the ability to get performance counters (along with other control information) out of the scheduler framework using RPC calls. Aside from some installation hassle, it works quite nicely - and that's what makes it so powerful.

Together with the UIs for ControlPort, one is able to understand where CPU cycles go when executing a whole flow graph. This alone is vastly useful, but bundled with the ability to evaluate algorithmic performance, an automatable framework was written: gr-benchmark.

gr-benchmark can execute a flow graph and collect performance data; it can condense this into JSON files and even has an upload mechanism to submit results to http://stats.gnuradio.org for analysis. What it lacks is the ability to run GNU Radio applications under an easy-to-define set of conditions, to integrate well with other tools, and to automatically distribute workload: issues I plan to tackle.
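The condense-into-JSON idea can be illustrated with a few lines of standard-library Python; `benchmark` and the record fields below are invented for this sketch and do not mirror gr-benchmark's actual schema:

```python
import json
import time

def benchmark(fn, repeats=5):
    """Sketch of the gr-benchmark idea: time a callable repeatedly and
    condense the raw timings into a JSON-serializable record."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "timings_s": timings,                 # keep the raw samples, too
        "min_s": min(timings),
        "mean_s": sum(timings) / repeats,
    }

record = benchmark(lambda: sum(range(10_000)))
payload = json.dumps(record)  # ready for storage or upload
```

In the real tool, `fn` would be a whole flow graph run and the record would carry ControlPort performance counters rather than wall-clock timings of a lambda.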

Evaluating Numerical Performance

As soon as a bottleneck has been identified, developers may strive to optimize the computations involved. Digital signal processing in general and GNU Radio in particular leave great room for optimization: often, vast amounts of high-bandwidth data have to be processed under near real-time conditions, and often, mathematically basic operations (such as convolutions and dot products) leave much space for optimization.

Luckily, modern hardware offers means to improve performance: from cached RAM and speculative execution up to rather massive SIMD instructions. Compilers also try to optimize as much as possible, and the generated assembly often looks massively different from what the source code would dictate if compiled verbatim.

GNU Radio tries to make use of these features by employing kernels. These are implementations of an algorithm, either in plain C++ or employing hardware optimization through compiler intrinsics. This means that for each VOLK protokernel there are multiple realizations of the same functionality: one that should run on every platform, and a specialized one for each supported hardware type.

GNU Radio comes with a tool, volk_profile, that executes every kernel implementation (i.e. the generic C++ one and all machine-specific implementations that could be compiled for the current machine) and measures the time each consumes, to find the optimal kernel for each functionality. This leads to interesting results: sometimes, the generic implementation outperforms all optimized versions on an SSE4 machine; on other SSE4-enabled CPU models, things are different.
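The selection logic of volk_profile (time every available implementation of one kernel, keep the fastest) can be mimicked in a few lines. Here, two pure-Python dot-product variants stand in for the generic and machine-specific protokernel implementations; all names are illustrative, not VOLK API:

```python
import time

def dot_generic(a, b):
    """Portable 'generic' implementation: a plain Python loop."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_builtin(a, b):
    """An 'optimized' variant leaning on the interpreter's C-level machinery."""
    return sum(x * y for x, y in zip(a, b))

def profile(impls, a, b, repeats=20):
    """volk_profile-style selection: time every implementation of the same
    kernel on this machine and return the name of the fastest one."""
    best_name, best_time = None, float("inf")
    for name, fn in impls.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

a = [0.5] * 4096
b = [2.0] * 4096
winner = profile({"generic": dot_generic, "builtin": dot_builtin}, a, b)
```

Which variant wins depends on the machine the profile runs on, which is precisely why the results must be collected per host type.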

This shows that testing optimized code on various machines is a necessity. gr-benchmark is capable of uploading VOLK performance data; however, this helps less during kernel development, and will only point at the need for improvement once kernels have been widely deployed.

As a result, I specified that the Measurement Toolbox will be able to distribute the same benchmark to a list of hosts, each sporting a different machine type. Results must be gathered centrally.

Also, the performance of vector kernels might be quite sensitive to things like cache performance and cache usage (possibly leading to cache misses and the like). Getting a little more statistical background (primarily the standard deviation) on performance measurements might prove handy when optimizing systems.
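Collecting that statistical background is cheap once the raw samples are kept around; a sketch using only the standard library, with hypothetical names, might look like this:

```python
import statistics
import time

def timing_stats(fn, repeats=30):
    """Collect repeated timings and report mean and standard deviation.
    A large deviation hints at cache effects or scheduling noise rather
    than a genuinely slow kernel."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

mean_s, stdev_s = timing_stats(lambda: sum(i * i for i in range(5_000)))
```

Reporting the deviation alongside the mean makes it much easier to decide whether two kernel variants actually differ or whether the gap is within measurement noise.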
