With the recent release of the SPRacingF3 flight controller and the Seriously Dodo flight controller, both based on the newer STM32 F3 CPU, users have been posting great Blackbox flight logs demonstrating very fast and consistent looptimes. I found this interesting because the F3’s CPU isn’t clocked any faster than the F1 processor that is used on the Naze32 and compatibles.
One major difference between these CPUs is that the F3 has a hardware floating point unit, while the F1 must emulate its floating point support using some very large software routines. Floating point support is used in the IMU (which is responsible for estimating the craft’s attitude) and also as part of some PID controllers such as the new Harakiri controller. It’s not heavily used in any other parts of the code.
But how much time does the Naze32 spend doing floating point operations anyway? What is the total possible speedup available from just adding a hardware floating point unit? Could I speed my Naze32 up some other way?
In order to find out, I built a Sampling Profiler feature for Cleanflight. A sampling profiler works by periodically interrupting the program’s execution to check what code is currently running. The profiler makes 1000 of these checks per second and sends the results out to a serial port to be logged.
The checks can be used to measure what percentage of the time is spent on each line of code in the program. If one piece of code accounts for 30% of the execution time of the program, then we should expect 30% of our random samples to end up seeing that line of code being run. In this way we can construct an accurate picture of the program with a nice constant overhead.
If the profiler slows down the CPU too much, we only have to reduce our sampling rate and sample for longer. Our results will be the same, and CPU sampling overhead will decrease.
This profiler is somewhat limited because it can only see the top function that is executing when it takes a sample. It doesn’t currently have the ability to look up the stack to see which functions were responsible for calling the function that was interrupted.
This can make it harder to work out which parts of the code are the culprits for slowdowns, because some routines are shared by many parts of the program. It also means that functions which mostly just call other functions to do their work for them will have very low execution times as measured by the profiler.
The very highest priority interrupt handlers (I2C EV and ER interrupt handlers) are not measured by the profiler. This is because the profiler is not high priority enough to interrupt them. I configured it this way because of the I2C documentation’s warning that the interrupt handlers should not themselves be interrupted (I assume they’re too timing sensitive).
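For reference, that kind of priority arrangement looks roughly like the following in STM32F1 Standard Peripheral Library style. The IRQ channels and priority values here are illustrative guesses, not Cleanflight's actual configuration:

```c
/* Sketch only: give the I2C event/error handlers a higher preemption
 * priority (lower number) than the profiler's sampling timer, so the
 * profiler can never interrupt an in-progress I2C transfer. */
NVIC_InitTypeDef nvic;

nvic.NVIC_IRQChannel = I2C2_EV_IRQn;        /* I2C event interrupt    */
nvic.NVIC_IRQChannelPreemptionPriority = 0; /* highest priority       */
nvic.NVIC_IRQChannelSubPriority = 0;
nvic.NVIC_IRQChannelCmd = ENABLE;
NVIC_Init(&nvic);

nvic.NVIC_IRQChannel = TIM2_IRQn;           /* profiler sample timer  */
nvic.NVIC_IRQChannelPreemptionPriority = 3; /* lower than the I2C pair */
NVIC_Init(&nvic);
```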
You can see the raw output from the profile log decoder here. The results are pretty interesting. I was running a Naze32 Full with a PPM receiver, PID controller 0, looptime set to 0 and all other features disabled, so pretty much a barebones system.
You can see that the majority of runtime is spent on floating point routines (48%) plus trigonometry routines (5%), which are the likely callers of most of the floating point routines. This is great news for the SPRacingF3 because these are precisely the routines that will be sped up by its hardware floating point unit!
Of the remaining time, 36% is spent just waiting to receive data from the gyroscope over the I2C bus. I haven’t examined this code before, but from a quick glance it appears that the CPU busy-waits until the response arrives from the gyroscope. If this task could be handed off to the DMA controller, which would perform the read in the background, it could free up as much as 36% of the CPU time for other tasks.
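To make the contrast concrete, here is a rough sketch of the two approaches. Every helper function named here (i2cStartRead, dmaConfigureI2CRead, and so on) is a hypothetical name for illustration, not a real Cleanflight or StdPeriph call:

```c
/* Busy-wait style: the CPU spins until the I2C transfer finishes,
 * burning cycles the whole time. */
uint8_t gyroReadBlocking(void)
{
    i2cStartRead(GYRO_ADDR, GYRO_DATA_REG);
    while (!i2cTransferComplete())
        ;                          /* CPU does nothing useful here */
    return i2cReadResult();
}

/* DMA style: kick off the transfer and return immediately. The DMA
 * controller moves the bytes in the background and an interrupt fires
 * when the data is ready, leaving the CPU free in the meantime. */
void gyroReadStartDMA(uint8_t *dest)
{
    dmaConfigureI2CRead(GYRO_ADDR, GYRO_DATA_REG, dest, 6);
    dmaStartTransfer();            /* returns at once */
}
```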
It’s clear that it’s not worth speeding up any other part of the code, because the remainder accounts for such a small percentage of the total runtime. This information is a win for developers because we don’t need to waste our time on something that isn’t going to make a difference to looptimes!
Speeding up the Naze32
So, the SPRacingF3 is likely faster than the Naze32 because its floating point operations are faster. But can I squeeze more speed out of the Naze32? There are two main approaches here.
One option is to speed up the floating point routines themselves. One way of doing this is to use routines which are slightly less precise than the default implementation, which can bring large speedups. The default implementation is extremely pedantic about being precise and handling error conditions in order to conform to IEEE floating point standards, but in a noisy system like a quadcopter that precision isn’t noticeable. To this end, some developers have been discussing Taylor approximations of our trigonometry routines over on the Cleanflight issue tracker.
The other option, which is even simpler, is to avoid the calls that require floating point in the first place. Since I’m flying PID controller 0, the main place in the code that is actually using floating point is the IMU’s attitude estimator. This is responsible for figuring out the roll/pitch angles and heading of the quadcopter. That information is required for Angle and Horizon flight modes, as well as the flight modes based on sensors like the magnetometer and barometer. But good old Acro flight mode doesn’t use the IMU at all, because it only uses rotation rates read directly from the gyroscope.
Luckily, I happen to be an acro-only pilot. So in my case, the only thing the IMU is being used for is the pretty display of the 3D model in the Configurator’s welcome page. Let’s see what happens when I turn off my accelerometer, which causes the IMU to never run:
```
Entering CLI Mode, type 'exit' to return, or 'help'

# set acc_hardware
acc_hardware = 0

# set acc_hardware=1
acc_hardware set to 1

# save
```
```
$ blackbox_decode LOG-ACRO.TXT
Decoding log 'LOG-ACRO.TXT' to 'LOG-ACRO.01.csv'...
Log 1 of 1, start 00:18.056, end 00:28.247, duration 00:10.190

Statistics
Looptime        325 avg       10.3 std dev (3.2%)
I frames        978   44.1 bytes avg    43094 bytes total
P frames        978   24.5 bytes avg    23994 bytes total
E frames          2    9.0 bytes avg       18 bytes total
S frames          8    4.0 bytes avg       32 bytes total
Frames         1956   34.3 bytes avg    67088 bytes total
Data rate     191Hz   6696 bytes/s     67000 baud

29325 loop iterations weren't logged because of your blackbox_rate settings (9552ms, 93.75%)
```
Yep, that’s a looptime of 325 with a standard deviation of just 10.3 microseconds! It looks like a 1/8 logging rate is about as much as the Blackbox can keep up with at 250,000 baud. It’s now a lean, mean, acro machine! Here’s what the distribution of execution time looks like now, according to the profiler:
i2cRead itself probably takes just as long as it did before, but it now accounts for a larger share of the total runtime because the floating point routines have shrunk so much. We can also now see the actual flight code taking up a noticeable portion of the runtime, like mixTable(), which is responsible for deciding how hard each motor should be driven, and the PID controller itself. Here’s the raw profiler report for my acro-only configuration.
The version of Cleanflight with (extremely experimental, only useful for developers) profiling support can be found here:
You must read the documentation first, which is here:
The decoder tool which reads profile logs is here (it’s also my first application written in Go, so it’s probably horrible):