Articles

Lluvia port to Rust!
Color mapping for data visualization
Raspberry Pi 4 build
Android integration using mediapipe
Mediapipe integration
Camera undistort
Implementing the Horn and Schunck optical flow algorithm
Working with floating point precision

Lluvia port to Rust!

By Juan Adarve | Wednesday, June 17, 2026

It has been almost 3 years since my last blog post for the project. A lot has happened since then, both in my personal and professional life 😃 🐣. As for the project itself, it reached a stall point. The C++ core library is stable and the node library provides several complex algorithms (particularly in optical flow) that showcase the engine's capabilities. Integration with Mediapipe proved the idea of using Lluvia on mobile Android devices, and compilation on the Raspberry Pi showed it can be used on embedded devices as well.

All that said, I lost rhythm on the project and could not find a way to use it in a productive environment. It was a good way to keep my C++ skills sharp, and it certainly helped a lot. Now, I have moved my professional work to Rust 🦀. Nice language, once you get used to the borrow checker.

Given that, and as an incentive to improve my Rust, I decided to rewrite the project from scratch in Rust. The new lluvia-rs repository on GitHub is organized as a Cargo workspace that will combine several crates:

lluvia_vk: The same engine as Lluvia C++, using the Vulkan API via the vulkano crate. Here I want to get access to all Vulkan capabilities, including the video encode/decode extensions for creating high-performance native apps.
lluvia_webgpu: A port using the WebGPU API via the wgpu crate. This will allow running in Web browsers, something new for me.
lluvia_media: A new idea for creating parsers for several media formats (e.g. MPEG Transport Stream, MP4, etc.), something that might come in handy at work.
bindings/python/pyo3-lluvia-vk: PyO3 wrappers to use the crate from Python. This is a key component to improve the development experience.

Currently I’m focused on the first one, lluvia_vk. These are some technical choices:

Use Roblox’s Luau as the scripting language to describe the compute nodes. It’s like Lua, but with types! That helps a lot when developing new nodes and allows IDEs to provide better autocompletion.
I’m exploring using Slang shading language to code the GPU shaders. This way, I can reuse the same shader code in lluvia_vk and in lluvia_webgpu, in theory.

In addition to supporting compute shaders and their composition in complex pipelines, I also want to support rendering pipelines. Here I have two motivations. First, an old dream of mine about developing my own video game, and second, I want to learn about Gaussian Splatting techniques, from scratch, the hard way 😒.

C++ support for the project:

For now, I have no plans to extend the current C++ implementation of the project. If you are using it in an actual production environment, I would love to know more about it, and maybe we can look into adding new functionality.

I still need to stabilize the first version of the Rust API and port the nodes from the C++ project to Slang and Luau before I feel comfortable releasing it on crates.io.

Color mapping for data visualization

New colormap nodes for data visualization.

By Juan Adarve | Friday, July 07, 2023

Jupyter notebook:

A Jupyter notebook with the code in this article is available in Google Colab. Check it out!

Introduction

Data visualization is an important tool for presenting the results of new algorithms. For 2D data in particular, it is possible to visualize the data as colored images so that our brains can interpret the data in a visual way. This article presents new color mapping nodes available in Lluvia to transform 2D data into color images. These nodes use several color maps available in the Matplotlib project to accelerate data to color conversion using the GPU.

Color mapping

Let $F : \mathbb{Z}^{2}_{\ge 0} \rightarrow \mathbb{R}$ be a scalar field defined for $(x, y)$ in the set of integer numbers greater than or equal to zero (standard image coordinates). The value $F(x, y)$ at each point is an element of the real numbers $\mathbb{R}$.

Next, let $c(z) : \mathbb{R} \rightarrow \text{RGB}$ be a color mapping function converting from a real value $z$ to RGB color space. In practice, the color output of $c(z)$ must be in a closed range that enables visualization on a computer screen. Constraining each color component to lie in the range $[0, 255]$ is common. This can be achieved by normalizing the input $z$ to the interval $[0, 1]$, that is:

$$ \bar{z} = \frac{z - z_\text{min}}{z_\text{max} - z_\text{min}} $$

where $z_\text{min}$ and $z_\text{max}$ are known values.

The color field $C : \mathbb{Z}^{2}_{\ge 0} \rightarrow \text{RGB}$ is the result of applying the color mapping function to all values of field $F$ as:

$$ C(x, y) := c\left( \frac{F(x, y) - z_\text{min}}{z_\text{max} - z_\text{min}} \right) $$

Gray color mapping

A simple color mapping function is the one mapping to gray scale values. For a normalized input $\bar{z}$, $c_\text{gray}(\bar{z})$ is defined as:

$$ c_\text{gray}(\bar{z}) = 255 (\bar{z}, \bar{z}, \bar{z}) $$

That is, creating a 3-vector of the normalized input value repeated in each color component and multiplying it by 255 to obtain an RGB color.
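The gray mapping above can be sketched on the CPU with NumPy. This is only an illustration, not the GPU implementation: the function name `gray_colormap`, the clamping step, and the rounding to the nearest integer are assumptions made for the example.

```python
import numpy as np


def gray_colormap(z_bar: np.ndarray) -> np.ndarray:
    """Map normalized values in [0, 1] to gray RGB colors in [0, 255]."""
    # Assumption: clamp so out-of-range inputs cannot overflow the 8-bit output.
    z_bar = np.clip(z_bar, 0.0, 1.0)
    # Repeat the value in the three color components and scale by 255.
    rgb = 255.0 * np.stack([z_bar, z_bar, z_bar], axis=-1)
    # Assumption: round to the nearest integer before converting to uint8.
    return np.round(rgb).astype(np.uint8)


field = np.array([[0.0, 0.25], [1.0, 2.0]])  # already normalized, one value out of range
colors = gray_colormap(field)
print(colors.shape)  # (2, 2, 3)
```

Each pixel of the output holds the same value in its three components, so the result renders as a gray scale image.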

Complex color mappings

More complex color maps are backed by a whole body of research on color theory, human color perception, and physics. Designing new color maps exclusive to Lluvia is out of scope for the project and serves little purpose, as there are great color maps readily available from the open source community. In particular, since I work with Python a lot and use Matplotlib heavily, I decided to export several of the color maps available there into Lluvia. The reader is highly encouraged to watch the presentation by Stéfan van der Walt (@stefanv) and Nathaniel Smith (@njsmith) on a default perceptually uniform colormap for Matplotlib.

The following color maps are extracted from Matplotlib using code similar to that presented in the Appendix section:

Perceptually uniform maps:
- viridis.
- plasma.
- inferno.
- magma.
- cividis.
Sequential maps:
- gray.
- purples.
- blues.
- greens.
- oranges.
- reds.
Diverging maps:
- spectral.
- coolwarm.
- bwr.
- seismic.
Cyclic maps:
- twilight.
- hsv.

Colormap nodes in Lluvia

There are four new nodes available in Lluvia for color mapping:

lluvia/viz/colormap/ColorMap             : Container : Maps a scalar field to a color field using a color map.
lluvia/viz/colormap/ColorMap_float       : Compute   : Maps a floating point scalar field to a color field using a color map.
lluvia/viz/colormap/ColorMap_int         : Compute   : Maps an integer scalar field to a color field using a color map.
lluvia/viz/colormap/ColorMap_uint        : Compute   : Maps an unsigned integer scalar field to a color field using a color map.

ColorMap_float, ColorMap_int, and ColorMap_uint process input fields of float, int, and uint types respectively. They are aggregated by the ColorMap container node (no suffix), which instantiates one of the others according to the input image type.
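The selection made by the container node can be pictured with a small lookup. This is a hypothetical Python sketch: the real selection happens inside the ColorMap container node, and the helper name `select_colormap_node` is invented for illustration. The channel type strings are the ones accepted by the in_image input.

```python
# Prefix shared by the color map nodes listed above.
NODE_PREFIX = 'lluvia/viz/colormap/'

# Channel types accepted by in_image, grouped by the compute node handling them.
NODE_BY_CHANNEL_TYPE = {
    'r8ui': 'ColorMap_uint', 'r16ui': 'ColorMap_uint', 'r32ui': 'ColorMap_uint',
    'r8i': 'ColorMap_int', 'r16i': 'ColorMap_int', 'r32i': 'ColorMap_int',
    'r16f': 'ColorMap_float', 'r32f': 'ColorMap_float',
}


def select_colormap_node(channel_type: str) -> str:
    """Return the compute node name for a given input channel type."""
    try:
        return NODE_PREFIX + NODE_BY_CHANNEL_TYPE[channel_type]
    except KeyError:
        raise ValueError(f'unsupported channel type: {channel_type}') from None


print(select_colormap_node('r8ui'))  # lluvia/viz/colormap/ColorMap_uint
```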

The interface of ColorMap is:

Parameters:
- color_map: string. Defaults to “viridis”.
- min_value: float. Defaults to 0.0. Minimum input value.
- max_value: float. Defaults to 1.0. Maximum input value.
- alpha : float. Defaults to 0.0. The alpha value of the output image in range [0, 1].
- reverse: float. Defaults to 0.0. If 1.0, the color map is reversed.
Inputs:
- in_image : ImageView. {r8ui, r16ui, r32ui, r8i, r16i, r32i, r16f, r32f} image. Input image.
Outputs:
- out_image : ImageView. rgba8ui image. The color-encoded input image.

The code below shows how to instantiate, configure, and run the ColorMap node:

import lluvia as ll
import lluvia.util as ll_util
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

session = ll.createSession(device=ll.getAvailableDevices()[0])

memory = session.createMemory(ll.MemoryPropertyFlagBits.DeviceLocal)
host_memory = session.createMemory([ll.MemoryPropertyFlagBits.DeviceLocal, ll.MemoryPropertyFlagBits.HostVisible, ll.MemoryPropertyFlagBits.HostCoherent])

colormap_names = ['viridis', 'plasma', 'inferno', 'magma', 'cividis', 'gray', 'purples', 'blues', 'greens', 'oranges', 'reds', 'spectral', 'coolwarm', 'bwr', 'seismic', 'twilight', 'hsv']

RGBA = ll_util.readSampleImage('mouse')

in_rgba = memory.createImageViewFromHost(RGBA, filterMode=ll.ImageFilterMode.Nearest, addressMode=ll.ImageAddressMode.Repeat, normalizedCoordinates=False, sampled=False)

RGBA2Gray = session.createComputeNode('lluvia/color/RGBA2Gray')
RGBA2Gray.bind('in_rgba', in_rgba)
RGBA2Gray.init()

RGBA2Gray.run()

# Keep the gray image as 8-bit unsigned values, matching the [0, 255] range used below.
img_gray = RGBA2Gray.getPort('out_gray').toHost().astype(np.uint8)
        
for cmap_name in colormap_names:
    
    fig = plt.figure(figsize=(20, 4)); fig.set_tight_layout(True)
    
    plt.subplot2grid((1,3), (0,0)); plt.imshow(img_gray, cmap='gray')
    plt.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, labelleft=False)
    plt.title('original')
    
    for i, reverse in enumerate([0, 1]):
        

        in_image = memory.createImageViewFromHost(img_gray)

        ColorMap = session.createContainerNode('lluvia/viz/colormap/ColorMap')
        ColorMap.bind('in_image', in_image)
        ColorMap.setParameter('colormap', ll.Parameter(cmap_name))
        ColorMap.setParameter('min_value', ll.Parameter(0))
        ColorMap.setParameter('max_value', ll.Parameter(255))
        ColorMap.setParameter('alpha', ll.Parameter(1.0))
        ColorMap.setParameter('reverse', ll.Parameter(reverse))
        ColorMap.init()

        ColorMap.run()

        out_rgba = ColorMap.getPort('out_rgba').toHost()
        
        plt.subplot2grid((1,3), (0, i+1)); plt.imshow(out_rgba)
        plt.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, labelleft=False)
        plt.title('{0}{1}'.format(cmap_name, ' - reversed' if bool(reverse) else ''))
    
    plt.show()

Perceptually uniform maps

viridis

plasma

inferno

magma

cividis

Sequential maps

gray

purples

blues

greens

oranges

reds

Diverging maps

spectral

coolwarm

bwr

seismic

Cyclic maps

twilight

hsv

Appendix

Color map extraction from matplotlib

import numpy as np
import matplotlib as mpl
import base64

colormap_names = ['viridis', 'plasma', 'inferno', 'magma', 'cividis', 'gray', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds', 'Spectral', 'coolwarm', 'bwr', 'seismic', 'twilight', 'hsv']

x = np.linspace(0, 1, 256)

for name in colormap_names:
    
    cmap = mpl.colormaps[name]
    RGB = cmap(x)
    
    RGBA = [[c[0], c[1], c[2], 0] for c in RGB]
    RGBA = np.ceil(np.array(RGBA) * 255).astype(np.uint8)
    
    data = RGBA.data.tobytes()
    s = str(base64.b64encode(data), 'utf-8')
    
    lua_str = "builder.colorMaps['{0}'] = '{1}'".format(name.lower(), s)
    print(lua_str)

which produces an output similar to

builder.colorMaps['viridis']  = 'RQJVAEUDVgBFBFgARgZZAEYHWwBGCVwARwpdAEcMXwBHDWAARw9iAEgQYwBI...'
builder.colorMaps['plasma']   = 'DQiHABEIiAAUB4oAFgeLABkHjAAcB40AHgeOACAGjwAiBpAAJAaRACYGkgAo...'
builder.colorMaps['inferno']  = 'AQEEAAEBBQABAQcAAgEIAAICCgACAgwAAwIPAAMDEQAEAxMABQQVAAUEFwAG...'
builder.colorMaps['magma']    = 'AQEEAAEBBQABAQcAAgEIAAICCgACAgwAAwMOAAMDEAAEBBIABQQUAAUFFgAG...'
builder.colorMaps['cividis']  = 'ACNOAAAkUAAAJFEAACVTAAAmVQAAJ1YAACdYAAAoWgAAKVwAACldAAAqXwAA...'
builder.colorMaps['gray']     = 'AAAAAAEBAQACAgIAAwMDAAQEBAAFBQUABgYGAAcHBwAICAgACQkJAAoKCgAL...'
builder.colorMaps['purples']  = '/Pv9APz7/QD8+/0A+/r9APv6/AD6+fwA+vn8APr4/AD5+PsA+fj7APj3+wD4...'
builder.colorMaps['blues']    = '9/v/APf7/wD2+v8A9fr/APT5/gD0+f4A8/j+APL4/gDx9/0A8Pf9APD2/QDv...'
builder.colorMaps['greens']   = '9/z1APf89QD2/PQA9vz0APX88wD1+/IA9PvyAPT78QDz+/AA8vvwAPL67wDx...'
builder.colorMaps['oranges']  = '//XrAP/16wD/9eoA//TpAP/06AD/8+cA//PmAP/y5QD/8uQA//HjAP/x4gD/...'
builder.colorMaps['reds']     = '//XwAP/18AD/9O8A//TuAP/z7QD/8uwA//LrAP/x6gD/8OkA//DoAP/v5wD/...'
builder.colorMaps['spectral'] = 'ngFCAKEEQwCjBkQApQlEAKcLRQCpDUUAqxBGAK4SRgCwFUcAshdHALQZSAC2...'
builder.colorMaps['coolwarm'] = 'O03BADxOwgA9UMQAP1LFAEBUxwBBVcgAQlfKAENZywBEW80ARlzOAEde0ABI...'
builder.colorMaps['bwr']      = 'AAD/AAIC/wAEBP8ABgb/AAgI/wAKCv8ADAz/AA4O/wAQEP8AEhL/ABQU/wAW...'
builder.colorMaps['seismic']  = 'AABNAAAAUAAAAFMAAABVAAAAWAAAAFsAAABeAAAAYQAAAGMAAABmAAAAaQAA...'
builder.colorMaps['twilight'] = '4tnjAOHa4wDg2uIA39rhAN7a4QDc2eAA2tnfANnY3gDX190A1dfcANPW2wDQ...'
builder.colorMaps['hsv']      = '/wAAAP8GAAD/DAAA/xIAAP8YAAD/HgAA/yQAAP8qAAD/MAAA/zYAAP88AAD/...'
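Going the other way, each string can be decoded back into a table of RGBA entries. The sketch below simply inverts the encoding steps of the script above; the helper name `decode_colormap` is made up for the example, and the input is the first 16 characters of the 'gray' string shown in the output.

```python
import base64

import numpy as np


def decode_colormap(encoded: str) -> np.ndarray:
    """Decode a base64 color map string into an (N, 4) uint8 RGBA table."""
    raw = base64.b64decode(encoded)
    # Four bytes per entry: red, green, blue, and the zero alpha written by the script.
    return np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4)


# First three entries of the 'gray' color map.
table = decode_colormap('AAAAAAEBAQACAgIA')
print(table.tolist())  # [[0, 0, 0, 0], [1, 1, 1, 0], [2, 2, 2, 0]]
```

A full string decodes to 256 rows, one per sample of `np.linspace(0, 1, 256)`.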

Raspberry Pi 4 build

Configuration of the Raspberry Pi 4 with the Vulkan SDK and Lluvia to run GPU compute pipelines.

By Juan Adarve | Monday, November 28, 2022

Introduction

The Raspberry Pi 4 project announced back in November 2020 that the Vulkan 1.0 conformance tests successfully passed for its GPU driver. More recently in August 2022, Vulkan 1.2 conformance testing has been completed.

The conformance tests are a large set of tests run against a driver implementation to see if it conforms with the Vulkan specification. This is essential to maintain the Vulkan API portable across platforms and GPU vendors.

In addition, LunarG announced support of the Vulkan SDK on the Raspberry Pi 4. With this, the two most important requirements to build Lluvia on the RPi4 became available.

Build instructions

The build instructions are available in the Getting started page. They can be summarized as:

Prepare the operating system.
Build the Vulkan SDK following the official documentation.
Build and install OpenCV (for running demos).
Install Bazel.
Build and install Lluvia.

Optical flow demo

There is a new demo shipped with the Lluvia source code to run pipelines with images captured from a camera. Currently, the demo uses OpenCV's VideoCapture class to capture images from the Raspberry Pi camera module.

The demo app, which can be run from the repository root folder as

./samples/webcam/webcam.py --width=320 --height=240 \
  ./samples/webcam/scripts/horn_schunck.lua webcam/HornSchunck

configures the camera to capture images at 320x240 resolution and runs the webcam/HornSchunck container node defined in the horn_schunck.lua script. The container node creates the pipeline illustrated below:

@startuml
skinparam linetype ortho

state BGRA2Gray
state HS as "HornSchunck"
state Flow2RGBA
state RGBA2BGRA

BGRA2Gray -down-> HS: in_gray
HS -down-> Flow2RGBA: in_flow
Flow2RGBA -down-> RGBA2BGRA: out_rgba
@enduml

with the HornSchunck node containing the algorithm implementation as discussed in a previous article.

OpenCV BGR color ordering

By default, OpenCV uses BGRA channel ordering for color images. Lluvia, on the other hand, uses RGBA ordering to store color images. The last node in the demo pipeline converts to the color order OpenCV expects in order to render to the screen.
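For reference, the same channel reordering can be written on the CPU with NumPy indexing. This sketch only illustrates what the conversion does; in the demo the work is done on the GPU by the RGBA2BGRA node, and the function name here is chosen for the example.

```python
import numpy as np


def rgba_to_bgra(rgba: np.ndarray) -> np.ndarray:
    """Swap the red and blue channels of an (H, W, 4) image, keeping alpha in place."""
    return rgba[..., [2, 1, 0, 3]]


pixel = np.array([[[10, 20, 30, 255]]], dtype=np.uint8)  # a single RGBA pixel
print(rgba_to_bgra(pixel)[0, 0].tolist())  # [30, 20, 10, 255]
```

Applying the function twice returns the original image, so the same index pattern also converts BGRA back to RGBA.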

Discussion

This post introduced the instructions for building Lluvia on the Raspberry Pi 4. A new demo application for running pipelines with images captured from the Pi’s camera module is also presented.

As of now, Lluvia is supported on four platforms:

Linux x86_64.
Windows.
Android through the mediapipe integration.
Raspberry Pi 4.

Future work can include support for other platforms such as:

Nvidia Jetson hardware.
MacOS and iOS.

With this many platforms, an interesting topic for future work is running benchmarks across all of them to assess the runtime performance of different computer vision algorithms.

Android integration using mediapipe

Running Lluvia pipelines on Android using Mediapipe

By Juan Adarve | Tuesday, November 08, 2022

Introduction

The previous post on mediapipe integration explored how to integrate Lluvia with Mediapipe to create complex Computer Vision pipelines leveraging mediapipe’s integration with other frameworks such as OpenCV and Tensorflow. This post expands this integration to run Lluvia on Android systems.

The video below shows the Optical Flow Filter algorithm running on a Samsung Galaxy S22+ phone.

Graph for mobile applications

The figure below illustrates the pipeline used on Android for running the LluviaCalculator. First, a FlowLimiter calculator receives the images from the input_stream. This calculator controls the rate at which packets are sent downstream; it receives an additional input from the last calculator, ImageFrameToGpuBuffer, that indicates processing has been completed and that a new packet can be received.

The GpuBufferToImageFrame and ImageFrameToGpuBuffer calculators convert packets from GpuBuffer to ImageFrame and vice-versa. They are needed for two reasons:

On Android, the input_stream and output_stream ports of the pipeline expect GpuBuffer types.
Currently, the Lluvia calculator cannot handle GpuBuffer packets.

Finally, the Lluvia calculator runs the configured GPU pipeline.
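The role of the FlowLimiter and its feedback input can be pictured with a toy model in Python. This is a simplified sketch and not Mediapipe code: the class, its method names, and the policy of dropping frames that arrive while a previous frame is still being processed are all assumptions made for illustration.

```python
class FlowLimiter:
    """Toy model: forward a frame only when the previous one has finished."""

    def __init__(self, max_in_flight: int = 1):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.dropped = 0

    def on_frame(self, frame):
        """Return the frame to send downstream, or None if it is dropped."""
        if self.in_flight >= self.max_in_flight:
            self.dropped += 1
            return None
        self.in_flight += 1
        return frame

    def on_finished(self):
        """Feedback signal: the last calculator has produced its output."""
        self.in_flight = max(0, self.in_flight - 1)


limiter = FlowLimiter()
sent = []
for frame_id in range(5):
    out = limiter.on_frame(frame_id)
    if out is not None:
        sent.append(out)
    # Pretend the pipeline finishes a frame only after every second camera frame.
    if frame_id % 2 == 1:
        limiter.on_finished()

print(sent, limiter.dropped)  # [0, 2, 4] 2
```

In this model the camera runs twice as fast as the pipeline, so every other frame is discarded instead of piling up in a queue.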

GpuBuffer support

Support for transferring data to and from GpuBuffer packets in the LluviaCalculator is planned. This will avoid the use of GpuBufferToImageFrame and ImageFrameToGpuBuffer calculators, thus reducing the memory copy overhead.

Android application

The mediapipe repository provides examples on how to build Android applications that run the framework. The dataflow is as follows:

In Java/Kotlin, configure the app to open the camera. Camera2 or CameraX APIs can be used.
Camera frames are received on Surface objects, opaque objects that hold reference to the image pixel data.
The surface objects are transferred to the mediapipe graph, entering through the input_stream. In there, the packets are sent as GpuBuffer to be consumed by the calculators.
The graph execution takes place, and the output packets are received by the application through the output_stream. The GpuBuffer packets are transformed to Android surface objects.
The surface objects are rendered on the screen.

Mediapipe Android archive

Build instructions

The mediapipe integration guide includes new instructions on how to configure the project to support Android builds.

Apps can consume mediapipe libraries through Android Archive files (AAR). Archives are a special type of library containing JVM classes, assets, and native libraries (compiled for x86, arm64, or other architectures). Archives can be imported into the app either by placing them within the app source tree, or by declaring a dependency on a remote repository (e.g. Maven).

Mediapipe bazel rules include targets to build AAR files that can be used on Android. The lluvia-mediapipe repo uses these rules to build an archive that compiles lluvia along with mediapipe, and exports all the required assets (node library, scripts and graphs). The AAR is created by running:

bazel build \
    -c opt \
    --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
    --fat_apk_cpu=arm64-v8a \
    //mediapipe/lluvia-mediapipe/java/ai/lluvia:lluvia_aar

where --fat_apk_cpu=arm64-v8a defines the CPU architectures that the native code will be compiled to. The archive is available at:

bazel-bin/mediapipe/lluvia-mediapipe/java/ai/lluvia/lluvia_aar.aar

which can directly be copied to the App’s libraries, or exported to a remote repository.

Discussion

This post explained how to use Mediapipe to run Lluvia compute pipelines on Android systems. The mobile graph uses the FlowLimiter calculator to control the rate at which image packets are consumed by the pipeline. Future work includes:

Support GpuBuffer input and output packets in the Lluvia calculator.
Fully working Android code example.
Use the Android GPU Inspector to profile the performance of the app.

Mediapipe integration

Integration of Lluvia into Mediapipe to create complex Computer Vision pipelines.

By Juan Adarve | Saturday, October 08, 2022

Introduction

Mediapipe is a cross-platform framework to create complex Computer Vision pipelines both for offline and real-time applications. It leverages popular frameworks such as OpenCV and Tensorflow to process audio, video, and run deep learning models. By integrating Lluvia into mediapipe, it is possible to speed up some of those computations by creating a GPU compute pipeline.

Difference 1: project scope

Mediapipe is a more general framework than Lluvia. Mediapipe, at its core, is a compute graph scheduler, where each node can contain any arbitrary processing logic. The integration of third-party frameworks (e.g. OpenCV, Tensorflow, Lluvia) gives the framework its power for developing complex Computer Vision pipelines.

Lluvia, on the other hand, is specialized in creating compute pipelines running efficiently on GPU. Bringing the project to Mediapipe will enable easier integration with other frameworks and increase runtime performance of Computer Vision applications.

On Graphs, Calculators and Packets

Mediapipe uses Directed Acyclic Graphs to describe the compute pipeline to be run by the framework. Each node in the graph is called a Calculator. Each calculator declares its input and output contract, establishing the type of packet it can handle, and defines a function to process those packets.

Graphs are described as Protobuffers, with the configuration for each calculator. Mediapipe takes this data at runtime, instantiates each calculator, and connects it to its upstream and downstream neighbors according to the supplied contracts.

Packets enter the graph through input streams and leave it through output streams. When a new packet arrives, mediapipe schedules the processing of that packet to the corresponding calculator, or enqueues it if it is busy.

The figure below illustrates a mediapipe graph for performing edge detection on the GPU. Each calculator receives GPU image packets and schedules execution on the available device.

Difference 2: packets and graph scheduling

A packet in Mediapipe is an independent piece of data that travels through the calculator graphs. This enables Mediapipe to schedule running several calculators concurrently, thus potentially increasing performance.

In Lluvia, nodes connected through inputs and outputs do not allocate new memory on each run of the node. Instead, all the memory is allocated at node initialization time, and exposed through the node’s ports. Then, the whole graph is scheduled to run on the GPU device in one go. This reduces the delay in computations, as it avoids cross-talk between the host CPU and the GPU to synchronize individual node execution.

Lluvia as a mediapipe dependency

Mediapipe and Lluvia are both built using Bazel. As a consequence, the integration of Lluvia can be done by including the project as a Bazel dependency in the Mediapipe repository. The current approach to achieve this is through the use of an auxiliary repository, lluvia-mediapipe, that contains the LluviaCalculator node to run GPU compute pipelines as a Mediapipe calculator. The build instructions are available in the mediapipe integration guide. The process is as follows:

Clone the Mediapipe repository alongside Lluvia.
Configure Mediapipe’s Bazel workspace to build on your host machine.
Include Lluvia as a dependency of Mediapipe.
Clone the lluvia-mediapipe repository inside Mediapipe to enable building its targets.
Run the tests included in the repository to validate the build.

The directory structure of the three projects should look like this:

lluvia                          <-- lluvia repository
mediapipe                       <-- mediapipe repository
├── BUILD.bazel
├── LICENSE
├── ...
├── mediapipe                   <--
│   ├── BUILD
│   ├── calculators
│   ├── examples
│   ├── framework
│   ├── gpu
│   ├── ...
│   ├── lluvia-mediapipe        <-- lluvia-mediapipe repository
├── ...
├── .bazelrc
└── WORKSPACE

Once Mediapipe builds correctly, it is possible to create graphs that include the LluviaCalculator.

The `LluviaCalculator`

The LluviaCalculator is in charge of initializing Lluvia, binding input and output streams from mediapipe to lluvia ports, and running a given compute pipeline. The figure below illustrates a basic mediapipe graph utilizing lluvia, while the code below shows the graph description using Protobuffer text syntax:

input_stream: "input_stream"
output_stream: "output_stream"

node: {
  calculator: "LluviaCalculator"
  input_stream: "IN_0:input_stream"
  output_stream: "OUT_0:output_stream"
  node_options {
      [type.googleapis.com/lluvia.LluviaCalculatorOptions]: {
          enable_debug: true
          library_path: "path to .zip node library file"
          script_path: "path to .lua script defining the main container node"
          container_node: "mediapipe/examples/Passthrough"
          input_port_binding:  {
              mediapipe_tag: "IN_0"
              lluvia_port: "in_image"
          }
      }
  }
}

where:

The enable_debug flag tells whether or not the Vulkan debug extensions used by Lluvia should be loaded during session creation. This flag might be set to false in production applications to improve runtime performance.
The library_path declares the path to a node library (a .zip file) containing Lluvia nodes (Container and Compute). This attribute can be repeated several times.
The script_path is the path to a lua script declaring a ContainerNode that Lluvia will instantiate as the “main” node to run inside the calculator.
The input_port_binding maps mediapipe input tags to the main ContainerNode ports. In the example above, mediapipe’s input tag IN_0 is mapped to lluvia’s in_image port.

Examples

lluvia-mediapipe includes two applications, single_image and webcam to run on the host system. The single_image app, as the name suggests, reads the content of a single image and feeds it to a Mediapipe graph.

The command below executes the binary with a graph configured to run the lluvia/color/BGRA2Gray compute node to convert from the BGRA input to gray scale:

bazel run --copt -DMESA_EGL_NO_X11_HEADERS --copt -DEGL_NO_X11 \
    //mediapipe/lluvia-mediapipe/examples/desktop/single_image:single_image -- \
    --input_image=${HOME}/git/lluvia/lluvia/resources/mouse.jpg \
    --script_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/BGRA2Gray/script.lua \
    --graph_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/BGRA2Gray/graph.pbtxt

where ${HOME}/git is the base folder where Lluvia and Mediapipe are cloned. Change this according to your setup.

A more sophisticated example is running the Horn and Schunck optical flow algorithm inside of Mediapipe. The webcam binary opens the default capture device using OpenCV and transfers the captured frames to the compute graph. The graph is a single LluviaCalculator running several nodes:

bazel run --copt -DMESA_EGL_NO_X11_HEADERS --copt -DEGL_NO_X11 \
    //mediapipe/lluvia-mediapipe/examples/desktop/webcam:webcam -- \
    --script_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/HornSchunck/script.lua \
    --graph_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/HornSchunck/graph.pbtxt

where --graph_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/HornSchunck/graph.pbtxt is the path to Mediapipe’s graph to be run by the app, and --script_file=${HOME}/git/mediapipe/mediapipe/lluvia-mediapipe/examples/desktop/graphs/HornSchunck/script.lua points to a Lua script defining the Container node to run inside of the LluviaCalculator.

@startuml
skinparam linetype ortho

state LluviaCalculator as "LluviaCalculator" {

    state input_stream as "IN_0:input_stream" <<inputPin>>
    state output_stream as "OUT_0:output_stream" <<outputPin>>

    state ContainerNode as "mediapipe/examples/HornSchunck" {
        
        state in_image <<inputPin>>

        state BGRA2Gray
        state HS as "HornSchunck"
        state Flow2RGBA
        state RGBA2BGRA

        input_stream -down-> in_image

        in_image -down-> BGRA2Gray
        BGRA2Gray -down-> HS: in_gray
        HS -down-> Flow2RGBA: in_flow
        Flow2RGBA -down-> RGBA2BGRA: in_rgba

        RGBA2BGRA -down-> out_image <<outputPin>>
    }
    
  
  out_image -down-> output_stream <<outputPin>>
}

@enduml

First, the input image is transformed from BGRA color space to gray scale. Next, the images are fed to the HornSchunck container node to compute optical flow. The estimated flow is then converted to color using the Flow2RGBA compute node, and finally, the RGBA output is converted to BGRA for proper rendering in the window opened by OpenCV.

Difference 3: calculators as code vs. nodes as data

In Mediapipe, every Calculator must be compiled and integrated into the binary at build time, thus requiring rebuilding every time a new Calculator must be added or modified.

Lluvia describes nodes as a pair of Lua and GLSL (for ComputeNode) files that are compiled and packaged into a node library as a .zip file. Once packaged, the library can be imported on any runtime where Lluvia runs. This eases the developer experience as one can develop nodes in a higher-level environment, using Python in a Jupyter notebook for instance, package the nodes in a node library and then use them in any environment (Mediapipe for instance).

Discussion

This article presented the integration of Lluvia into the Mediapipe project. By adding the project to Mediapipe, it is possible to leverage the GPU compute-pipeline capabilities of Lluvia to speed up parts of complex Computer Vision applications.

The integration between the two projects is achieved through the LluviaCalculator, which runs any arbitrary ContainerNode. This calculator is in the early stages of development, and feedback is very welcome. Some immediate improvements include:

Support GPUImageFrame input and output packets. Currently, the calculator only accepts CPU ImageFrame packets, thus introducing some latency while copying data from CPU memory space to the GPU.
Support Mediapipe side packets to send configuration updates to the calculator.
Include more configuration attributes (e.g. node parameters) in the Protobuffer type.

And finally, testing the integration in other platforms such as Android.

Camera undistort

Presents new nodes for undistorting images given a camera calibration model with radial and tangential distortion.

By Juan Adarve | Sunday, September 18, 2022

Jupyter notebook:

A Jupyter notebook with the code in this article is available in Google Colab. Check it out!

Background

Camera undistort is the process by which distortions generated by the optics used in the camera during the capture process are corrected in software. The process requires a mathematical model of the distortion, and a calibration procedure to estimate the parameters of such model given actual images.

An overview of camera modeling is presented in the Computer Vision book by Szeliski and the Multiple View Geometry book by Hartley and Zisserman, as well as the articles by Zhang, Wei and Ma.

There are several calibration toolboxes available for estimating the camera model from a series of images:

Any of these frameworks can be used to estimate the camera model parameters. Those parameters are the input to the undistort method presented in this article to rectify raw captured images.

Camera model

The figure below illustrates the camera model.

The 3D point $\mathbf{x} \in \mathbb{R}^3$ is expressed relative to the camera body fixed frame. It projects onto the camera image plane as pixel $\mathbf{p} := (u, v)^\top \in \mathbb{R}^2$ as

$$ \begin{equation} \begin{pmatrix} \mathbf{p} \\ 1 \end{pmatrix} := \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{\mathbf{K} \mathbf{x}}{ \left< e_3, \mathbf{x} \right>} \end{equation} $$

where $\mathbf{K} \in \mathbb{R}^{3\times3}$ is the camera intrinsics matrix, $e_3 := (0, 0, 1)^\top$, and $\left< e_3, \mathbf{x} \right>$ is the dot product between the two vectors. The units of $\mathbf{p}$ are actual pixel coordinates in the ranges $u \in [0, W)$ and $v \in [0, H)$, with $W$ and $H$ denoting the image width and height respectively.

Given a pixel point, the corresponding 3D coordinate $\bar{\mathbf{x}}$ in the image plane is defined as:

$$ \begin{equation} \bar{\mathbf{x}} := \begin{pmatrix} \bar{x} \\\ \bar{y} \\\ \bar{z} \\\ \end{pmatrix} = \mathbf{K}^{-1} \begin{pmatrix} \mathbf{p} \\\ 1 \end{pmatrix} \end{equation} $$

Notice that this projection returns $\bar{\mathbf{x}}$ and not the original 3D point $\mathbf{x}$. To recover the actual 3D position in the world, depth information is needed to project $\bar{\mathbf{x}}$ from the image plane out into the world.
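Equations (1) and (2) can be sketched in a few lines of plain Python. The snippet below is only an illustration: the intrinsics values and the function names `project` and `back_project` are made up for this example, and `back_project` assumes a zero-skew $\mathbf{K}$ so that its inverse can be written in closed form.

```python
def project(K, x):
    """Project a 3D point x (camera frame) to pixel coordinates, as in equation (1)."""
    X, Y, Z = x
    # <e_3, x> is simply the depth Z of the point
    u = (K[0][0] * X + K[0][1] * Y + K[0][2] * Z) / Z
    v = (K[1][0] * X + K[1][1] * Y + K[1][2] * Z) / Z
    return (u, v)


def back_project(K, p):
    """Map a pixel to image plane coordinates, as in equation (2).

    Assumes K has zero skew, so K^-1 has a simple closed form.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    u, v = p
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


# Hypothetical intrinsics for a 640x480 camera
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]

x = (0.2, -0.1, 2.0)          # 3D point, 2 units in front of the camera
p = project(K, x)             # pixel coordinates
x_bar = back_project(K, p)    # equals x divided by its depth, not x itself

print(p, x_bar)
```

The back-projected point has unit depth, which is why depth information is required to recover the original 3D point.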

Standard distortion model

The standard distortion model is formed by two components:

A radial component parameterized by three coefficients: $k_1$, $k_2$, and $k_3$.
A tangential component with two parameters: $p_1$ and $p_2$.

The radial distortion component for a given pixel $\mathbf{p}$ is computed as

$$ \begin{equation} \bar{\mathbf{x}}_r := R \begin{pmatrix} \bar{x} \\\ \bar{y} \\\ 0 \end{pmatrix} \end{equation} $$

where $R \in \mathbb{R}$ is

$$ \begin{equation} R = k_1 r^2 + k_2 r^4 + k_3 r^6 \end{equation} $$

with

$$ \begin{equation} r^2 = \bar{x}^2 + \bar{y}^2 \end{equation} $$

and $\bar{x}, \bar{y}$ are the $x$ and $y$ coordinates of the projection of pixel $\mathbf{p}$ using equation (2).

The tangential distortion is computed as:

$$ \begin{equation} \bar{\mathbf{x}}_p := \begin{pmatrix} 2 p_1 \bar{x}\bar{y} + p_2(r^2 + 2\bar{x}^2) \\\ p_1(r^2 + 2 \bar{y}^2) + 2 p_2 \bar{x}\bar{y} \\\ 0 \end{pmatrix} \end{equation} $$

Finally, the undistorted image plane coordinates $\bar{\mathbf{x}}_u$ are computed as:

$$ \begin{equation} \bar{\mathbf{x}}_u = \bar{\mathbf{x}} + \bar{\mathbf{x}}_r + \bar{\mathbf{x}}_p \end{equation} $$

Given $\bar{\mathbf{x}}_u$, the corresponding undistorted pixel coordinate is:

$$ \begin{equation} \begin{pmatrix} \mathbf{p}_u \\\ 1 \end{pmatrix} := \begin{pmatrix} u_u \\\ v_u \\\ 1 \end{pmatrix} = \frac{\mathbf{K} \bar{\mathbf{x}}_u}{ \left< e_3, \bar{\mathbf{x}}_u \right>} \end{equation} $$

The convention for the tangential parameters $p_1$ and $p_2$ is the same as that of OpenCV. However, this convention is flipped with respect to the one presented in the article of Wei and Ma.

The figures below illustrate the effects of the radial and tangential distortion. A positive value of $k_1$ creates a barrel effect, while a negative value generates a pincushion effect. For the tangential parameters, $p_1$ models misalignment between the image sensor and the image plane in the $y$ axis, while $p_2$ models such misalignment in the $x$ axis.
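The radial and tangential terms above translate almost directly into code. The following is a minimal CPU sketch of equations (3) to (7) for a single image plane coordinate; the function name `apply_standard_model` is hypothetical and this is not the GLSL code used by the node.

```python
def apply_standard_model(x_bar, y_bar, radial, tangential):
    """Evaluate the standard distortion model at image plane point (x_bar, y_bar).

    radial     : (k1, k2, k3)
    tangential : (p1, p2), following the OpenCV convention
    """
    k1, k2, k3 = radial
    p1, p2 = tangential

    r2 = x_bar * x_bar + y_bar * y_bar

    # radial component: R = k1 r^2 + k2 r^4 + k3 r^6
    R = k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_r, y_r = R * x_bar, R * y_bar

    # tangential component
    x_p = 2.0 * p1 * x_bar * y_bar + p2 * (r2 + 2.0 * x_bar * x_bar)
    y_p = p1 * (r2 + 2.0 * y_bar * y_bar) + 2.0 * p2 * x_bar * y_bar

    # sum of the original point and both components
    return (x_bar + x_r + x_p, y_bar + y_r + y_p)


radial_only = apply_standard_model(0.1, 0.0, (0.5, 0.0, 0.0), (0.0, 0.0))
tangential_only = apply_standard_model(0.1, 0.2, (0.0, 0.0, 0.0), (0.1, 0.0))
print(radial_only, tangential_only)
```

With all coefficients set to zero the function returns its input unchanged, which is a quick sanity check for any implementation of the model.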

Implementation

The camera undistort procedure is implemented as a single ComputeNode with the following interface:

@startuml
skinparam linetype ortho

state CameraUndistort as "CameraUndistort_rgba8ui" {

    state in_rgba <<inputPin>>
    state in_camera <<inputPin>>
    state out_rgba <<outputPin>>

}

note top of CameraUndistort
Parameters
----------
camera_model : int. Defaults to 0.
    The camera model used for rectifying the image. Possible values are:

    * 0: standard model
end note

@enduml

The node explicitly requires rgba8ui images to be bound to the node. The output out_rgba is allocated by the node. The in_camera is a UniformBuffer storing the camera model. This model is defined by the ll_camera struct in GLSL as:

struct ll_camera {

    // The camera intrinsic matrix. Used to project 3D points expressed in the camera coordinate frame
    // to the image plane and convert to pixel coordinates.
    mat3 K;

    // The inverse camera intrinsic matrix. Used to convert from pixel to image plane coordinates.
    mat3 Kinv;

    // Radial distortion coefficients. For the standard camera model,
    // only the first 3 coefficients are used (XYZ).
    vec4 radialDistortion;

    // Tangential distortion coefficients. Only the first 2 coefficients are used (XY).
    vec4 tangentialDistortion;
};

Uniform buffers are a special type of buffer used to store small data structures used in graphics and compute pipelines. The Vulkan tutorial on Uniform Buffers is a good read on how they are used in general. Notice that the ll_camera struct uses GLSL types such as mat3 and vec4. On the host CPU, one must use corresponding types and follow the byte alignment rules to make the buffer usable on the GPU. The alignment rules are defined by the std140 layout rules. For the ll_camera struct, the mat3 attributes must be transferred as a matrix of 4 rows and 3 columns in order to meet the alignment requirements.

Matrix storage in GLSL

In GLSL, matrices are stored in column-major order. For a given matrix M indexed as M[i, j], where i and j are the row and column indexes, respectively, the elements M[i, j] and M[i+1, j] are stored contiguously in memory. This differs from, for instance, numpy's default row-major ordering.
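To make the padding and ordering concrete, here is a small sketch that packs the ll_camera fields by hand using only Python's struct module. The helper names `pack_mat3_std140` and `pack_camera` and the matrix values are made up for illustration, and native byte order is assumed.

```python
import struct


def pack_mat3_std140(M):
    """Pack a 3x3 matrix (list of rows) as a std140 mat3.

    Columns are emitted one after another (column-major), each padded
    with a trailing zero to fill a 4-float (16 byte) slot.
    """
    data = b''
    for j in range(3):
        column = [M[i][j] for i in range(3)]
        data += struct.pack('=4f', column[0], column[1], column[2], 0.0)
    return data


def pack_camera(K, Kinv, radial, tangential):
    """Concatenate the fields in the same order as the ll_camera struct."""
    return (pack_mat3_std140(K) + pack_mat3_std140(Kinv)
            + struct.pack('=4f', *radial) + struct.pack('=4f', *tangential))


# Hypothetical intrinsics and their inverse
K = [[2.0, 0.0, 1.0],
     [0.0, 4.0, 3.0],
     [0.0, 0.0, 1.0]]
Kinv = [[0.5, 0.0, -0.5],
        [0.0, 0.25, -0.75],
        [0.0, 0.0, 1.0]]

buf = pack_camera(K, Kinv, (0.5, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0))
print(len(buf))                          # 48 + 48 + 16 + 16 bytes
print(struct.unpack('=12f', buf[:48]))   # columns of K, each followed by padding
```

Each mat3 occupies 48 bytes instead of 36, which is the reason the host-side matrices need 4 rows and 3 columns.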

The code block below shows a complete example on how to run the lluvia/camera/CameraUndistort_rgba8ui node using a dummy camera model with radial and tangential distortion:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
import lluvia as ll
import lluvia.util as ll_util
import numpy as np
import matplotlib.pyplot as plt

session = ll.createSession()

# memory to store the input and output images
memory = session.createMemory(ll.MemoryPropertyFlagBits.DeviceLocal)

# memory to store the uniform buffer with the camera parameters
host_memory = session.createMemory([ll.MemoryPropertyFlagBits.DeviceLocal,
                                    ll.MemoryPropertyFlagBits.HostVisible,
                                    ll.MemoryPropertyFlagBits.HostCoherent])

# read a sample image
sampleImage = ll_util.readSampleImage('koala')

# draw a grid on top of the sample image
Yrange = np.arange(0, sampleImage.shape[0], 128)
Ylines = np.concatenate([n + Yrange for n in range(4)])

Xrange = np.arange(0, sampleImage.shape[1], 128)
Xlines = np.concatenate([n + Xrange for n in range(4)])

sampleImage[Ylines, ...] = 0
sampleImage[:, Xlines, ...] = 0

# the input image view must be sampled. This example uses nearest neighbor interpolation
in_rgba = memory.createImageViewFromHost(sampleImage,
                                         filterMode=ll.ImageFilterMode.Nearest,
                                         addressMode=ll.ImageAddressMode.Repeat,
                                         normalizedCoordinates=False,
                                         sampled=True)

###################################################
# Camera parameters
W = float(in_rgba.width)
H = float(in_rgba.height)

# Dummy camera matrix
K = np.array([[W, 0, 0.5*(W -1)],
              [0, H, 0.5*(H -1)],
              [0, 0, 1] ], dtype=np.float32, order='F')
Kinv = np.linalg.inv(K)
radialDistortion = np.array([0.5, 0, 0, 0,], dtype=np.float32)
tangentialDistortion = np.array([0.1, 0, 0, 0], dtype=np.float32)

# align the matrices according to the STD140 rules (column major, 4-component vectors)
K_aligned = np.zeros((4,3), dtype=np.float32, order='F'); K_aligned[:3, :3] = K
Kinv_aligned = np.zeros((4,3), dtype=np.float32, order='F'); Kinv_aligned[:3, :3] = Kinv

# create bytes buffer from matrices
buf = K_aligned.tobytes(order='F') + Kinv_aligned.tobytes(order='F') + radialDistortion.tobytes() + tangentialDistortion.tobytes()
npBuf = np.frombuffer(buf, dtype=np.uint8)

# in_camera uniform buffer
in_camera = host_memory.createBufferFromHost(npBuf, usageFlags=[ll.BufferUsageFlagBits.TransferSrc,
                                                                ll.BufferUsageFlagBits.TransferDst,
                                                                ll.BufferUsageFlagBits.UniformBuffer])

###################################################
# Compute node
CameraUndistort = session.createComputeNode('lluvia/camera/CameraUndistort_rgba8ui')
CameraUndistort.setParameter('camera_model', ll.Parameter(0)) # 0: standard model, as listed in the node parameters
CameraUndistort.bind('in_rgba', in_rgba)
CameraUndistort.bind('in_camera', in_camera)
CameraUndistort.init()

CameraUndistort.run()

out_rgba = CameraUndistort.getPort('out_rgba')

###################################################
# Plotting
fig = plt.figure(figsize=(15, 8)); fig.set_tight_layout(True)
plt.subplot2grid((1,2), (0,0)); plt.imshow(in_rgba.toHost()[..., :3]); plt.title('in_rgba')
plt.subplot2grid((1,2), (0,1)); plt.imshow(out_rgba.toHost()[..., :3]); plt.title('out_rgba')
plt.show()

Lines 36 to 60 create the uniform buffer containing the camera model. Lines 42 and 45 create the camera intrinsics matrix K and its inverse Kinv. Then, in lines 50-51, those matrices are aligned to meet the std140 requirements; in this case, each matrix is stored in a 4x3 matrix in column-major ordering (using order='F' in numpy). Finally, lines 54-55 concatenate all camera parameters into a single numpy array npBuf, which is then used to create the in_camera uniform buffer in lluvia.

Runtime performance

A Razer Blade laptop running Ubuntu 22.04LTS was used for the runtime analysis. The laptop is equipped with an Intel i7-11800H processor, and the following Vulkan devices as reported by the code block below:

import lluvia as ll
for dev in ll.getAvailableDevices():
    print(dev)

NVIDIA GeForce RTX 3070 Laptop GPU.
Intel(R) UHD Graphics (TGL GT1).
llvmpipe (LLVM 13.0.1, 256 bits). This is a CPU implementation of the Vulkan API shipped with the Mesa drivers.

In addition, the cv2.undistort() function from OpenCV is considered for reference. Five resolutions are used in the evaluation: VGA 640x480, HD 1280x720, FHD 1920x1080, WQHD 2560x1440, and UHD 3840x2160. For each resolution, the algorithm is run for 1000 iterations and the median runtime is extracted. The figure and table below show the runtime for each device and resolution combination.

Resolution	Device name	Runtime median (ms)
VGA 640x480	Intel UHD Graphics	0.00235
	RTX 3070	0.013888
	llvmpipe	0.604263
	OpenCV	2.04252
HD 1280x720	Intel UHD Graphics	0.007734
	RTX 3070	0.03728
	llvmpipe	1.50221
	OpenCV	6.38165
FHD 1920x1080	Intel UHD Graphics	0.0151045
	RTX 3070	0.07456
	llvmpipe	3.17916
	OpenCV	17.1453
WQHD 2560x1440	Intel UHD Graphics	0.0262145
	RTX 3070	0.109344
	llvmpipe	5.97528
	OpenCV	22.8469
UHD 3840x2160	Intel UHD Graphics	0.058583
	RTX 3070	0.242688
	llvmpipe	17.6178
	OpenCV	49.477

Integrated vs Discrete GPU performance

Notice that the Intel UHD Graphics device reports lower runtime than the discrete Nvidia RTX 3070 GPU. It is not clear why this is the case, as the Nvidia GPU has more compute resources than the Intel integrated graphics.

Also, notice that the llvmpipe CPU device is roughly three to five times faster than the OpenCV function. However, both CPU implementations are between one and three orders of magnitude slower than the Nvidia and Intel GPU devices.
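The relative speedups can be recomputed directly from the table. The short script below copies the median runtimes from the table and divides the OpenCV runtime by that of each device; the dictionary layout and the `speedup` helper are just one way to organize the numbers.

```python
# Median runtimes in milliseconds, copied from the table above
runtime_ms = {
    '640x480':   {'Intel UHD Graphics': 0.00235,   'RTX 3070': 0.013888, 'llvmpipe': 0.604263, 'OpenCV': 2.04252},
    '1280x720':  {'Intel UHD Graphics': 0.007734,  'RTX 3070': 0.03728,  'llvmpipe': 1.50221,  'OpenCV': 6.38165},
    '1920x1080': {'Intel UHD Graphics': 0.0151045, 'RTX 3070': 0.07456,  'llvmpipe': 3.17916,  'OpenCV': 17.1453},
    '2560x1440': {'Intel UHD Graphics': 0.0262145, 'RTX 3070': 0.109344, 'llvmpipe': 5.97528,  'OpenCV': 22.8469},
    '3840x2160': {'Intel UHD Graphics': 0.058583,  'RTX 3070': 0.242688, 'llvmpipe': 17.6178,  'OpenCV': 49.477},
}


def speedup(resolution, device, reference='OpenCV'):
    """How many times faster `device` is than `reference` at a given resolution."""
    row = runtime_ms[resolution]
    return row[reference] / row[device]


for resolution in runtime_ms:
    ratios = ['{0}: {1:.1f}x'.format(device, speedup(resolution, device))
              for device in ('llvmpipe', 'RTX 3070', 'Intel UHD Graphics')]
    print(resolution, ', '.join(ratios))
```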

Discussion

This post showed how to run the camera undistort node in Lluvia. The node takes as input an RGBA image and a camera model stored in a uniform buffer in the GPU, and produces an RGBA output image. The camera model stored in the uniform buffer must follow the GLSL std140 layout rules. In terms of runtime performance, the GPU implementation is two to three orders of magnitude faster than the OpenCV default implementation.

Future work includes:

Expose the interpolation coordinates for undistorting the images as a new compute node. These coordinates could be cached in order to save computations on every node invocation.
Clip the undistorted image to a given area according to the camera model. This will be useful to avoid wasted pixels in the output, as shown in the examples.
Support for more image formats, such as r8ui and floating point channel types.

References

OpenCV camera calibration routines.
Matlab calibration app.
Vulkan tutorial on Uniform Buffers.
GLSL STD140 memory layout.
Mesa llvmpipe.
Zhang, Z., 2000. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11), pp.1330-1334. Microsoft Technical Report.
Wei, G.Q. and De Ma, S., 1994. Implicit and explicit camera calibration: Theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), pp.469-480. DOI.
Szeliski, R., 2010. Computer vision: algorithms and applications. Springer Science & Business Media. Book.
Hartley, R. and Zisserman, A., 2003. Multiple view geometry in computer vision. Cambridge University Press. Book.

Implementing the Horn and Schunck optical flow algorithm

GPU implementation of Horn and Schunck’s variational method for estimating optical flow.

By Juan Adarve | Sunday, August 07, 2022

Jupyter notebook:

A Jupyter notebook with the code in this article is available in Google Colab. Check it out!

Background

The Horn and Schunck variational method for computing optical flow is one of the seminal works in the field. It introduces the idea of using a global smoothness constraint on the estimated optical flow. This constraint helps the numerical solution to find a good flow estimate even in image regions with poor texture.

Let $\mathbf{E}(x, y, t)$ be the image brightness at point $(x, y)$ and time $t$. Under the constant brightness assumption, the change in brightness over time is zero, that is,

$$ \frac{d \mathbf{E}}{d t} = 0 $$

Expanding this total derivative with the chain rule over $(x, y, t)$, one has:

$$ \frac{\partial \mathbf{E}}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial \mathbf{E}}{\partial y} \frac{\partial y}{\partial t} + \frac{\partial \mathbf{E}}{\partial t} = 0 $$

For convenience, let:

$$ \begin{align*} \mathbf{E}_x &= \frac{\partial \mathbf{E}}{\partial x} \\\\ \mathbf{E}_y &= \frac{\partial \mathbf{E}}{\partial y} \\\\ \mathbf{E}_t &= \frac{\partial \mathbf{E}}{\partial t} \end{align*} $$

be the image gradient in the $x$ and $y$ directions, and the partial derivative in time, respectively, and

$$ \begin{align} u &= \frac{\partial x}{\partial t} \\\\ v &= \frac{\partial y}{\partial t} \end{align} $$

be the $x$ and $y$ components of the optical flow, respectively. The constant brightness equation is then

$$ \mathbf{E}_x u + \mathbf{E}_y v + \mathbf{E}_t = 0 $$

which is the basis for the differential methods for computing optical flow (e.g. Lucas-Kanade).

Minimization

Differential methods for estimating optical flow try to minimize the cost function

$$ \epsilon_b = \mathbf{E}_x u + \mathbf{E}_y v + \mathbf{E}_t $$

that is, they try to find values $(u, v)$ of the optical flow such that the constant brightness constraint is maintained. Notice that there is a single cost function and two unknowns $(u, v)$, so the problem is underdetermined. To solve this, the Horn and Schunck algorithm adds a smoothness constraint based on the average value of the flow in a neighborhood, as

$$ \epsilon_c^2 = (\bar{u} - u )^2 + (\bar{v} - v)^2 $$

Combining both cost functions, one has

$$ \epsilon^2 = \epsilon_b^2 + \alpha^2 \epsilon_c^2 $$

From these equations, a numerical solution is derived. The reader is encouraged to go to the paper for more details. The iterative solution for $(u, v)$ is

$$ \begin{align*} u^{n+1} &= \bar{u}^n - \mathbf{E}_x \frac{\mathbf{E}_x \bar{u}^n + \mathbf{E}_y \bar{v}^n + \mathbf{E}_t}{\alpha^2 + \mathbf{E}_x^2 + \mathbf{E}_y^2} \\\\ v^{n+1} &= \bar{v}^n - \mathbf{E}_y \frac{\mathbf{E}_x \bar{u}^n + \mathbf{E}_y \bar{v}^n + \mathbf{E}_t}{\alpha^2 + \mathbf{E}_x^2 + \mathbf{E}_y^2} \end{align*} $$

where $(u^{n+1}, v^{n+1})$ is the estimated optical flow at iteration $n + 1$, using the estimated flow at previous iterations and image parameters computed from an image pair.
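For a single pixel, the update rule above can be written as a small function. This is a plain Python sketch of the formula only; the name `hs_update` and the sample values are chosen for illustration.

```python
def hs_update(u_bar, v_bar, Ex, Ey, Et, alpha):
    """One Horn and Schunck update for a single pixel.

    u_bar, v_bar : neighborhood average of the flow at iteration n
    Ex, Ey, Et   : spatial and temporal image derivatives at the pixel
    alpha        : regularization gain
    """
    # term shared by the u and v updates
    common = (Ex * u_bar + Ey * v_bar + Et) / (alpha ** 2 + Ex ** 2 + Ey ** 2)
    return (u_bar - Ex * common, v_bar - Ey * common)


# A pixel with a purely horizontal gradient and zero initial flow
u, v = hs_update(0.0, 0.0, Ex=1.0, Ey=0.0, Et=-0.5, alpha=0.05)
print(u, v)

# residual of the constant brightness equation after the update
print(1.0 * u + 0.0 * v - 0.5)
```

For small values of alpha the updated flow nearly satisfies the constant brightness equation, while larger values keep the estimate closer to the neighborhood average.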

Implementation

The figure below illustrates the pipeline implementing the algorithm:

@startuml
skinparam linetype ortho

state HS as "HornSchunck" {

  state in_gray <<inputPin>>

  state ImageProcessor
  state ImageNormalize_uint_C1

  state NI_1 as "NumericIteration 1"
  state NI_2 as "NumericIteration 2"
  state NI_3 as "NumericIteration 3"
  state NI_N as "NumericIteration N"
  
  in_gray -down-> ImageProcessor
  in_gray -down-> ImageNormalize_uint_C1
  
  
  ImageNormalize_uint_C1 -down-> ImageProcessor: in_gray_old
  
  ImageProcessor -down-> NI_1: in_image_params
  ImageProcessor -down-> NI_2
  ImageProcessor -down-> NI_3
  ImageProcessor -down-> NI_N: in_image_params
  
  NI_1 -> NI_2
  NI_2 -> NI_3
  NI_3 -> NI_N: ...
  NI_N -> NI_1: in_flow, used for next image iteration
  
  NI_N -down-> out_flow <<outputPin>>
  ImageNormalize_uint_C1 -down-> out_gray <<outputPin>>
}

note top of HS
Parameters
----------
alpha : float. Defaults to 0.05.
    Regularization gain.

iterations : int. Defaults to 1.
    Number of iterations run to compute the optical flow.

float_precision : int. Defaults to ll.FloatPrecision.FP32.
    Floating point precision used across the algorithm. The outputs out_gray
    and out_flow will be of this floating point precision.
end note
@enduml

The HornSchunck node is a ContainerNode that instantiates several ComputeNode objects implementing the algorithm. In particular, the ImageProcessor node computes image parameters from the pair of images in_gray and in_gray_old. Those parameters are transferred to the instances of NumericIteration through in_image_params, organized as follows:

in_image_params.x: X component of the image gradient
in_image_params.y: Y component of the image gradient
in_image_params.z: temporal derivative between in_gray and in_gray_old.
in_image_params.w: gain for this pixel computed from image gradient and alpha parameter.

This packaging of the image parameters is convenient as all values are packed together in a single RGBA pixel. The floating point precision of these parameters and of the estimated optical flow is controlled by the float_precision parameter.

The NumericIteration node takes the image parameters and a prior estimation of the optical flow, in_flow, and computes the next iteration of the flow field. The algorithm requires several iterations for the estimated flow to be of acceptable quality. In the figure above, the last iteration is denoted as NumericIteration N; it feeds its output back as input to the first iteration, and also provides the out_flow output of the HornSchunck node. The number of iterations is controlled by the iterations parameter.
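The chaining of iterations can be mimicked on the CPU with a few nested loops. The sketch below is not the GPU implementation: it assumes a simple 4-neighbor average with clamped borders for $(\bar{u}, \bar{v})$, keeps only the image derivatives in the per-pixel parameters, and all function names are hypothetical.

```python
def neighbor_average(field, i, j):
    """Average of the 4-connected neighbors, clamping at the image borders."""
    h, w = len(field), len(field[0])
    up = field[max(i - 1, 0)][j]
    down = field[min(i + 1, h - 1)][j]
    left = field[i][max(j - 1, 0)]
    right = field[i][min(j + 1, w - 1)]
    return 0.25 * (up + down + left + right)


def numeric_iteration(params, flow, alpha):
    """Compute the next flow field from per-pixel (Ex, Ey, Et) and the prior flow."""
    h, w = len(flow), len(flow[0])
    u = [[f[0] for f in row] for row in flow]
    v = [[f[1] for f in row] for row in flow]

    out = []
    for i in range(h):
        row = []
        for j in range(w):
            Ex, Ey, Et = params[i][j]
            u_bar = neighbor_average(u, i, j)
            v_bar = neighbor_average(v, i, j)
            common = (Ex * u_bar + Ey * v_bar + Et) / (alpha ** 2 + Ex ** 2 + Ey ** 2)
            row.append((u_bar - Ex * common, v_bar - Ey * common))
        out.append(row)
    return out


def horn_schunck(params, iterations, alpha):
    """Chain several iterations, starting from a zero flow field."""
    h, w = len(params), len(params[0])
    flow = [[(0.0, 0.0)] * w for _ in range(h)]
    for _ in range(iterations):
        flow = numeric_iteration(params, flow, alpha)
    return flow


# Synthetic 3x4 image parameters: uniform horizontal gradient, Et = -0.5
params = [[(1.0, 0.0, -0.5)] * 4 for _ in range(3)]
flow = horn_schunck(params, iterations=10, alpha=0.05)
print(flow[1][1])
```

In this synthetic case every pixel converges to a horizontal flow close to 0.5, the value that satisfies the constant brightness equation.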

The code block below shows how to run a simple pipeline:

@startuml
skinparam linetype ortho

state RGBA2Gray
state HS as "HornSchunck"
state Flow2RGBA

RGBA2Gray -down-> HS: in_gray
HS -down-> Flow2RGBA: in_flow
@enduml

where RGBA2Gray converts an input RGBA image to gray scale, HornSchunck computes the optical flow, and Flow2RGBA converts the optical flow to color representation.

import lluvia as ll
import lluvia.util as ll_util
import matplotlib.pyplot as plt

# read two images as numpy arrays
frame_0 = ll_util.readRGBA('path to first image...')
frame_1 = ll_util.readRGBA('path to second image...')

# global session and memory objects
session = ll.createSession()
memory = session.createMemory(ll.MemoryPropertyFlagBits.DeviceLocal)

# this is the input of the complete pipeline
in_rgba = memory.createImageViewFromHost(frame_0)

RGBA2Gray = session.createComputeNode('lluvia/color/RGBA2Gray')
RGBA2Gray.bind('in_rgba', in_rgba)
RGBA2Gray.init()
RGBA2Gray.run() # run the node immediately in order to populate out_gray with valid values

in_gray = RGBA2Gray.getPort('out_gray')

HornSchunck = session.createContainerNode('lluvia/opticalflow/HornSchunck/HornSchunck')
HornSchunck.setParameter('alpha', ll.Parameter(0.05))
HornSchunck.setParameter('iterations', ll.Parameter(1000))
HornSchunck.setParameter('float_precision', ll.Parameter(ll.FloatPrecision.FP32.value))
HornSchunck.bind('in_gray', in_gray)

# when the node is initialized, it transfers the content of in_gray to out_gray.
HornSchunck.init()

out_gray = HornSchunck.getPort('out_gray')
out_flow = HornSchunck.getPort('out_flow')

# Convert the optical flow field to color images
flow2RGBA = session.createComputeNode('lluvia/viz/Flow2RGBA')
flow2RGBA.setParameter('max_flow', ll.Parameter(float(2)))
flow2RGBA.bind('in_flow', out_flow)
flow2RGBA.init()

out_flow_rgba = flow2RGBA.getPort('out_rgba')

duration = session.createDuration()

# Record the command buffer to run the pipeline in one go
cmdBuffer = session.createCommandBuffer()
cmdBuffer.begin()
cmdBuffer.run(RGBA2Gray)
cmdBuffer.memoryBarrier()
cmdBuffer.durationStart(duration) # start recording the duration to measure runtime
cmdBuffer.run(HornSchunck)
cmdBuffer.memoryBarrier()
cmdBuffer.durationEnd(duration)   # stop recording duration
cmdBuffer.run(flow2RGBA)
cmdBuffer.end()

# copy the content of the second frame to the in_rgba image before running the whole pipeline
in_rgba.fromHost(frame_1)

# run the pipeline
session.run(cmdBuffer)

# print runtime in milliseconds
print('{0:.02f} ms'.format(duration.nanoseconds / 1e6))

fig = plt.figure(figsize=(10, 6)); fig.set_tight_layout(True)
plt.subplot2grid((1,2), (0, 0)); plt.imshow(out_gray.toHost(), vmin=0, vmax=1, cmap='gray')
plt.subplot2grid((1,2), (0, 1)); plt.imshow(out_flow_rgba.toHost())
plt.show()

Evaluation on the Middlebury dataset

The Middlebury optical flow dataset from Baker et al. provides several real-life and synthetic image sequences with ground truth optical flow. The figures below show the estimated optical flow for the test sequences whose ground truth is available.

The Horn and Schunck algorithm is not well suited for large pixel displacements. Considering this, the input images are scaled to half resolution before entering the compute pipeline. The ground truth flow is scaled accordingly in order to be compared with the estimated flow. The Endpoint Error measures the magnitude of the difference between the ground truth and the estimated flow vectors. It is computed as:

$$ EE = \sqrt{(u - u_\text{gt})^2 + (v - v_\text{gt})^2} $$
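As a small illustration, the endpoint error and its average over a set of pixels can be computed as follows. The function names are hypothetical and flows are represented as plain `(u, v)` tuples.

```python
import math


def endpoint_error(flow, flow_gt):
    """Endpoint error between an estimated flow vector and its ground truth."""
    u, v = flow
    u_gt, v_gt = flow_gt
    return math.hypot(u - u_gt, v - v_gt)


def average_endpoint_error(flows, flows_gt):
    """Mean endpoint error over paired lists of flow vectors."""
    errors = [endpoint_error(f, g) for f, g in zip(flows, flows_gt)]
    return sum(errors) / len(errors)


estimated = [(3.0, 4.0), (1.0, 1.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0)]

print(endpoint_error(estimated[0], ground_truth[0]))
print(average_endpoint_error(estimated, ground_truth))
```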

The algorithm is configured as follows:

alpha: 15.0/255
iterations: 2000
float_precision: FP32

In general, the estimated optical flow yields acceptable results in image regions with small displacements (e.g. Dimetrodon, Grove2, Hydrangea, and RubberWhale). In image regions with large displacements, the method is not able to compute good results, as can be seen in the Urban2 and Urban3 sequences.

The results reported in this post were run on a Razer Blade 2021 Laptop equipped with an Nvidia RTX 3070 GPU. The runtime is reported in the title of each figure, and is in the order of 20 milliseconds for most of the image sequences. The Runtime performance section below evaluates the performance of the algorithm on different devices, resolutions, and floating point precisions.

Runtime performance

For the runtime analysis of the algorithm, two GPU devices were used:

An Nvidia GTX 1080 Desktop GPU.
An Nvidia RTX 3070 Laptop GPU running on a Razer Blade 2021.

The Horn and Schunck pipeline is configured using the same number of iterations used for the Middlebury evaluation, that is, iterations = 2000. The pipeline is configured for 5 different image resolutions (VGA 640x480, HD 1280x720, FHD 1920x1080, WQHD 2560x1440, UHD 3840x2160). For each resolution, the pipeline is run using both FP16 and FP32 floating point precision. The table and figure below show the runtime performance for each configuration.

Resolution	Float precision	Device	Runtime median (ms)
VGA 640x480	FP16	GTX 1080	68.8196
		RTX 3070	39.4354
	FP32	GTX 1080	97.5005
		RTX 3070	63.6458
HD 1280x720	FP16	GTX 1080	193.977
		RTX 3070	115.626
	FP32	GTX 1080	279.538
		RTX 3070	175.635
FHD 1920x1080	FP16	GTX 1080	429.256
		RTX 3070	257.624
	FP32	GTX 1080	623.555
		RTX 3070	386.718
WQHD 2560x1440	FP16	GTX 1080	757.101
		RTX 3070	449.536
	FP32	GTX 1080	1099.35
		RTX 3070	682.558
UHD 3840x2160	FP16	GTX 1080	1694.16
		RTX 3070	1010.16
	FP32	GTX 1080	2453.45
		RTX 3070	1551.34

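As expected, the RTX 3070 is faster than the GTX 1080 in every configuration, by a factor of roughly 1.5 to 1.75, as the former is of a newer generation than the latter. On both devices, FP16 is also consistently faster than FP32.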
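The table can be summarized with two ratios: the fraction of runtime saved by FP16 with respect to FP32, and the speedup of the RTX 3070 over the GTX 1080. The script below copies the median runtimes from the table; the helper names `fp16_saving` and `gpu_speedup` are made up for this summary.

```python
# Median runtimes in milliseconds, copied from the table above
runtime_ms = {
    '640x480':   {'FP16': {'GTX 1080': 68.8196, 'RTX 3070': 39.4354},
                  'FP32': {'GTX 1080': 97.5005, 'RTX 3070': 63.6458}},
    '1280x720':  {'FP16': {'GTX 1080': 193.977, 'RTX 3070': 115.626},
                  'FP32': {'GTX 1080': 279.538, 'RTX 3070': 175.635}},
    '1920x1080': {'FP16': {'GTX 1080': 429.256, 'RTX 3070': 257.624},
                  'FP32': {'GTX 1080': 623.555, 'RTX 3070': 386.718}},
    '2560x1440': {'FP16': {'GTX 1080': 757.101, 'RTX 3070': 449.536},
                  'FP32': {'GTX 1080': 1099.35, 'RTX 3070': 682.558}},
    '3840x2160': {'FP16': {'GTX 1080': 1694.16, 'RTX 3070': 1010.16},
                  'FP32': {'GTX 1080': 2453.45, 'RTX 3070': 1551.34}},
}


def fp16_saving(resolution, device):
    """Fraction of the FP32 runtime saved by switching to FP16."""
    row = runtime_ms[resolution]
    return 1.0 - row['FP16'][device] / row['FP32'][device]


def gpu_speedup(resolution, precision):
    """How many times faster the RTX 3070 is than the GTX 1080."""
    row = runtime_ms[resolution][precision]
    return row['GTX 1080'] / row['RTX 3070']


for resolution in runtime_ms:
    print(resolution,
          'FP16 saving on GTX 1080: {0:.0%}'.format(fp16_saving(resolution, 'GTX 1080')),
          'RTX 3070 speedup at FP32: {0:.2f}x'.format(gpu_speedup(resolution, 'FP32')))
```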

Discussion

This post presented a GPU implementation of the Horn and Schunck optical flow algorithm. Evaluation on the Middlebury test sequences shows the validity of the implementation. A runtime performance analysis was conducted on two GPUs using several image resolutions and floating point precisions.

Future work includes:

Implementing a pyramidal scheme, for instance that of Meinhardt-Llopis et al., to improve the accuracy of the algorithm in the presence of large displacements.
Using the smoothness constraint and numerical scheme of the FlowFilter algorithm to improve the accuracy.

References

Horn, Berthold KP, and Brian G. Schunck. “Determining optical flow.” Artificial intelligence 17.1-3 (1981): 185-203. Google Scholar.
Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J. and Szeliski, R., 2011. A database and evaluation methodology for optical flow. International journal of computer vision, 92(1), pp.1-31. Google Scholar.
Meinhardt-Llopis, E. and Sánchez, J., 2013. Horn-Schunck optical flow with a multi-scale strategy. Image Processing On Line. Google Scholar.
Adarve, Juan David, and Robert Mahony. “A filter formulation for computing real time optical flow.” IEEE Robotics and Automation Letters 1.2 (2016): 1192-1199. Google Scholar.

Working with floating point precision

Discusses how to use different floating point precisions available in the GPU, and how to take advantage of smaller representations to improve runtime performance.

By Juan Adarve | Saturday, February 12, 2022

Jupyter notebook:

A Jupyter notebook with the code in this article is available in Google Colab. Check it out!

GPU devices support several floating point number precisions, where precision refers to the number of bits used for representing a given number. Typical representations are:

FP16: or half precision. Numbers are represented in 16 bits.
FP32: or single precision. It uses 32 bits for representing a number.
FP64: or double precision. 64 bits are used for representing a number.

FP64 is used when numerical precision is required, while FP16 is suitable for fast, less exact calculations, and FP32 sits in the middle. The IEEE 754 standard defines the specification of floating point numbers used in modern computers. It defines the rules for interpreting the bit fields that form a number, as well as the arithmetic rules to process them.
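The effect of each representation can be observed on the CPU with Python's struct module, whose `e`, `f` and `d` format characters correspond to IEEE 754 half, single and double precision. The snippet below stores the value 0.1 in each format and reads it back; it is only meant to illustrate the loss of precision, not how the GPU handles these types.

```python
import struct

value = 0.1
results = {}

for name, fmt in (('FP16', 'e'), ('FP32', 'f'), ('FP64', 'd')):
    packed = struct.pack('<' + fmt, value)          # encode with the given precision
    stored = struct.unpack('<' + fmt, packed)[0]    # decode back to a Python float
    results[name] = (len(packed), stored)
    print('{0}: {1} bytes, stored value {2!r}, error {3:.3e}'.format(
        name, len(packed), stored, abs(stored - value)))
```

The error grows as the number of bits shrinks, which is the trade-off to keep in mind when selecting FP16 for performance.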

The Vulkan API offers support for the three floating point precisions. However, not all GPUs support every format. The Vulkan GPU Info page is a great tool to check support for a given feature.

Improvements in runtime performance

Smaller bit representations of floating point numbers have an advantage in terms of runtime performance. Consider the case of an RGBA image. If the image channel type is ll.ChannelType.Float16, the four channel values of a pixel fit in 8 bytes, compared to the 16 bytes needed if ll.ChannelType.Float32 is used. This reduction in memory footprint increases the pixel transfer rate from memory to the compute device.
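The memory footprint argument is simple arithmetic. As a rough sketch, ignoring any padding or tiling the driver may add, the size of a tightly packed RGBA image for each channel type is:

```python
def image_size_bytes(width, height, channels, bytes_per_channel):
    """Size of a tightly packed image, without any driver-specific padding."""
    return width * height * channels * bytes_per_channel


# A Full HD RGBA image with 16-bit and 32-bit floating point channels
fp16_bytes = image_size_bytes(1920, 1080, 4, 2)
fp32_bytes = image_size_bytes(1920, 1080, 4, 4)

print('FP16: {0:.1f} MiB'.format(fp16_bytes / 2 ** 20))
print('FP32: {0:.1f} MiB'.format(fp32_bytes / 2 ** 20))
```

Half the bytes per pixel means half the data to move for every read and write of the image.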

To illustrate this, let’s consider the optical flow filter node. The code below configures the flowfilter algorithm with both ll.FloatPrecision.FP16 and ll.FloatPrecision.FP32. It runs each node for N = 10000 iterations and collects its runtime using the duration probe.

import lluvia as ll
import numpy as np

session = ll.createSession()
memory = session.createMemory([ll.MemoryPropertyFlagBits.DeviceLocal])

host_rgba = np.zeros((1016, 544, 4), dtype=np.uint8)
in_rgba = memory.createImageViewFromHost(host_rgba)

RGBA2Gray = session.createComputeNode('lluvia/color/RGBA2Gray')
RGBA2Gray.bind('in_rgba', in_rgba)
RGBA2Gray.init()

N = 10000
runtimeMilliseconds = {
    ll.FloatPrecision.FP16 : np.zeros((N), dtype=np.float32),
    ll.FloatPrecision.FP32 : np.zeros((N), dtype=np.float32)
}

for precision in [ll.FloatPrecision.FP32, ll.FloatPrecision.FP16]:

    flowFilter = session.createContainerNode('lluvia/opticalflow/flowfilter/FlowFilter')
    flowFilter.setParameter('levels',            ll.Parameter(2))
    flowFilter.setParameter('max_flow',          ll.Parameter(2))
    flowFilter.setParameter('smooth_iterations', ll.Parameter(2))
    flowFilter.setParameter('gamma',             ll.Parameter(0.0005))
    flowFilter.setParameter('gamma_low',         ll.Parameter(0.0005))
    
    # use selected floating point precision
    flowFilter.setParameter('float_precision',   ll.Parameter(precision.value))
    
    flowFilter.bind('in_gray', RGBA2Gray.getPort('out_gray'))
    flowFilter.init()

    duration = session.createDuration()

    cmdBuffer = session.createCommandBuffer()
    cmdBuffer.begin()
    cmdBuffer.run(RGBA2Gray)
    cmdBuffer.memoryBarrier()

    # probe the runtime of the flowfilter node
    cmdBuffer.durationStart(duration)
    cmdBuffer.run(flowFilter)
    cmdBuffer.memoryBarrier()
    cmdBuffer.durationEnd(duration)

    cmdBuffer.end()
    
    # run the command buffer N times and collect the runtime of the flow algorithm
    for n in range(N):
        session.run(cmdBuffer)
        runtimeMilliseconds[precision][n] = duration.nanoseconds / 1e6

Here, ll.FloatPrecision.FP16 and ll.FloatPrecision.FP32 are new enum values for representing 16-bit and 32-bit floating point precision, respectively. The line flowFilter.setParameter('float_precision', ll.Parameter(precision.value)) configures the node with the given precision. Internally, float_precision is used to instantiate any floating point image with the requested precision.

Note:

By convention, any node that allows selecting floating point precision will define the float_precision parameter and will expect one of the ll.FloatPrecision enum values.

The figure below shows the collected runtime for both floating point precisions. The median runtime for FP16 is 0.501 ms, while for FP32 it is 0.770 ms. That is, FP16 reduces the runtime by about 35% compared to FP32.

Optical flow filter runtime using FP16 and FP32 floating point precision. Results collected on an Nvidia GTX-1080 (driver 460.91.03) running Ubuntu 20.04.

Modifications to GLSL shader code

In terms of GLSL shader code, there are no changes to support FP16 or FP32 images. However, it is important to understand the underlying functioning. For instance, consider the GLSL implementation of the RGBA2HSVA compute node. Notice that the out_hsva port is bound to the shader as a rgba32f image:

#version 450

#include <lluvia/core.glsl>
#include <lluvia/core/color.glsl>

layout(binding = 0, rgba8ui) uniform uimage2D in_rgba;
layout(binding = 1, rgba32f) uniform writeonly image2D  out_hsva;

layout(push_constant) uniform const_0 {
    float min_chroma;
} params;

void main() {

    const float min_chroma = params.min_chroma;

    const ivec2 coords  = LL_GLOBAL_COORDS_2D;
    const ivec2 imgSize = imageSize(out_hsva);

    if (coords.x >= imgSize.x || coords.y >= imgSize.y) {
        return;
    }

    const uvec4 RGBA = imageLoad(in_rgba, coords);
    const vec4  HSVA = color_rgba2hsva(RGBA, min_chroma);

    imageStore(out_hsva, coords, HSVA);
}

Images compatible with the rgba32f image format can be bound as output. The shader image load store extension defines the compatibility rules for binding images to shaders. For this case in particular, it is possible to bind either an rgba16f or an rgba32f image to the output. The shader will execute all arithmetic operations using 32-bit floating point precision. When storing an image texel using imageStore(out_hsva, coords, HSVA), the shader will reinterpret the vec4 HSVA either as a 16-bit or a 32-bit floating point vector, according to the image bound to out_hsva.

The shader image load store extension describes how texels are reinterpreted during load/store operations.

In terms of Lua code to build the node, these are the considerations to support different precisions:

Define the float_precision parameter with a default value of ll.FloatPrecision.FP32.
Allocate the node objects according to the selected precision.

In the code below, the line local outImageChannelType = ll.floatPrecisionToImageChannelType(float_precision) transforms the received ll.FloatPrecision value into the corresponding ll.ChannelType. Then, out_hsva is created and bound to the node.

local builder = ll.class(ll.ComputeNodeBuilder)

builder.name = 'lluvia/color/RGBA2HSVA'

function builder.newDescriptor()

    local desc = ll.ComputeNodeDescriptor.new()
    desc:init(builder.name, ll.ComputeDimension.D2)

    -- define the float_precision parameter with default value
    desc:setParameter('float_precision', ll.FloatPrecision.FP32)

    local in_rgba = ll.PortDescriptor.new(0, 'in_rgba', ll.PortDirection.In, ll.PortType.ImageView)
    in_rgba:checkImageChannelCountIs(ll.ChannelCount.C4)
    in_rgba:checkImageChannelTypeIs(ll.ChannelType.Uint8)

    desc:addPort(in_rgba)
    desc:addPort(ll.PortDescriptor.new(1, 'out_hsva', ll.PortDirection.Out, ll.PortType.ImageView))

    return desc
end

function builder.onNodeInit(node)

    local in_rgba = node:getPort('in_rgba')

    -- receive the selected float_precision
    local float_precision = node:getParameter('float_precision')

    -- transform float precision to a suitable image channel type
    local outImageChannelType = ll.floatPrecisionToImageChannelType(float_precision)

    -------------------------------------------------------
    -- allocate out_hsva
    -------------------------------------------------------
    local imgDesc = ll.ImageDescriptor.new()
    imgDesc.width = in_rgba.width
    imgDesc.height = in_rgba.height
    imgDesc.depth = in_rgba.depth
    imgDesc.channelCount = ll.ChannelCount.C4
    imgDesc.channelType = outImageChannelType

    local imgViewDesc = ll.ImageViewDescriptor.new()
    imgViewDesc.filterMode = ll.ImageFilterMode.Nearest
    imgViewDesc.normalizedCoordinates = false
    imgViewDesc.isSampled = false
    imgViewDesc:setAddressMode(ll.ImageAddressMode.Repeat)

    -- ll::Memory where out_hsva will be allocated
    local memory = in_rgba.memory
    local out_hsva = memory:createImageView(imgDesc, imgViewDesc)

    -- need to change image layout before binding
    out_hsva:changeImageLayout(ll.ImageLayout.General)

    node:bind('out_hsva', out_hsva)
    node:configureGridShape(ll.vec3ui.new(out_hsva.width, out_hsva.height, 1))
end

-- register builder in the system
ll.registerNodeBuilder(builder)
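
The precision handling in the builder above boils down to two steps: read a parameter that defaults to FP32, then map it to a channel type for the output image. The sketch below mirrors that logic in plain Python. The FloatPrecision and ChannelType enums, the member names Float16 and Float32, and both functions are hypothetical stand-ins written for this illustration; they are not Lluvia's Python API.

```python
from enum import Enum


class FloatPrecision(Enum):
    # Hypothetical stand-in for ll.FloatPrecision.
    FP16 = "FP16"
    FP32 = "FP32"


class ChannelType(Enum):
    # Hypothetical stand-in for the floating point members of ll.ChannelType.
    Float16 = "Float16"
    Float32 = "Float32"


# Assumed mapping, playing the role of ll.floatPrecisionToImageChannelType.
_PRECISION_TO_CHANNEL_TYPE = {
    FloatPrecision.FP16: ChannelType.Float16,
    FloatPrecision.FP32: ChannelType.Float32,
}


def float_precision_to_channel_type(precision: FloatPrecision) -> ChannelType:
    """Return the channel type used to allocate images of a given precision."""
    return _PRECISION_TO_CHANNEL_TYPE[precision]


def output_channel_type(parameters: dict) -> ChannelType:
    """Pick the output channel type, falling back to FP32 when unset."""
    precision = parameters.get("float_precision", FloatPrecision.FP32)
    return float_precision_to_channel_type(precision)


print(output_channel_type({}))
print(output_channel_type({"float_precision": FloatPrecision.FP16}))
```

Keeping the default at FP32 means existing pipelines behave as before, and callers opt in to FP16 explicitly.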

Discussion

There are several floating point precisions available in compute shaders: FP16, FP32, and FP64 are the ones most commonly available in commodity GPU hardware. Controlling the floating point precision used in compute pipelines can improve runtime performance, since smaller values increase the effective transfer rate of data to and from memory. The choice of precision must be made carefully, as it may affect the accuracy of the algorithm.
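
As a rough illustration of the memory argument, the sketch below computes the size of a four-channel image at each precision. The 1920x1080 resolution is an arbitrary example, not the one used in the benchmark, and the calculation ignores any padding or tiling the driver may add.

```python
# Bytes needed to store one channel value at each precision.
BYTES_PER_CHANNEL = {"FP16": 2, "FP32": 4, "FP64": 8}


def image_size_bytes(width: int, height: int, channels: int, precision: str) -> int:
    """Size of a tightly packed image, ignoring driver padding or tiling."""
    return width * height * channels * BYTES_PER_CHANNEL[precision]


# Arbitrary example resolution with four channels (e.g. HSVA).
width, height, channels = 1920, 1080, 4

for precision in ("FP16", "FP32"):
    size = image_size_bytes(width, height, channels, precision)
    print(f"{precision}: {size} bytes ({size / 2**20:.2f} MiB)")
# FP16: 16588800 bytes (15.82 MiB)
# FP32: 33177600 bytes (31.64 MiB)
```

Each read or write of the FP16 image moves half as many bytes, which is consistent with a shorter runtime for memory-bound nodes, although the measured gain also depends on the arithmetic performed per texel.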
