AI at the Edge: Model Optimizer, Inference Engine, and MQTT

Edge means local (or near local) processing, as opposed to just anywhere in the cloud. This can be an actual local device like a smart refrigerator, or servers located as close as possible to the source (i.e. servers located in a nearby area instead of on the other side of the world).

The edge is useful where low latency is necessary, or where the network itself may not always be available. Its use often comes from a need for real-time decision-making in certain applications.

Many cloud applications collect data locally, send it to the cloud, process it there, and send the result back. Processing at the edge removes that round trip; it can often be more secure (depending on edge device security) and has less impact on the network. Edge AI algorithms can still be trained in the cloud, but get run at the edge. Not every app needs the edge: you can likely wait a second while your voice app asks the server a question, and NASA engineers processing the latest black hole data can similarly afford to wait.

Model Optimizer

The Model Optimizer helps convert models from multiple frameworks into an intermediate representation for the Inference Engine.

It can be used to reduce latency and inference costs on cloud and edge devices (e.g. mobile, IoT). It enables execution on, and optimizes for, existing hardware or new special-purpose accelerators, and it makes it possible to deploy models to edge devices with restrictions on processing, memory, power consumption, network usage, and model storage space. In a nutshell, it reduces model size and improves latency. Latency is the delay between a request and its response, for example the time it takes to send information from one point to the next; it is usually measured in milliseconds (ms). Model optimization involves a trade-off: some precision of the model is given up in exchange for lower latency or a smaller size.
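
Since latency is the quantity being traded against precision, it helps to be able to measure it. The following is a minimal sketch, assuming a stand-in function `toy_model` (hypothetical, not a real network) in place of an actual inference call:

```python
import time


def measure_latency_ms(fn, *args, repeats=100):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


def toy_model(x):
    # Hypothetical workload standing in for a real inference call.
    return sum(v * v for v in x)


latency = measure_latency_ms(toy_model, list(range(1000)))
print(f"mean latency: {latency:.3f} ms")
```

Running the same measurement before and after optimizing a model is one simple way to see how much latency an optimization actually buys.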

Techniques in Model Optimizer

  • Quantization is the process of mapping values from a larger set to a smaller set. In machine learning, it reduces the precision of a model by representing its weights and biases with fewer bits, for example going from 32-bit floating point to 16-bit floating point or 8-bit integers. Precision can often be reduced this way without substantial loss of accuracy.
  • Freezing models will remove certain operations and metadata only needed for training, such as those related to backpropagation.
  • Fusion combines certain operations into a single operation that needs less computational overhead. This can be particularly useful for GPU inference, where the separate operations may run as separate GPU kernels, while a fused operation runs as one kernel, thereby incurring less overhead in switching from one kernel to the next.
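
As an illustration of the quantization idea from the list above, here is a minimal sketch of symmetric 8-bit quantization in plain Python. The function names and the sample weights are made up for this example; real toolkits use more elaborate schemes (per-channel scales, calibration data, and so on).

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.5, -0.31]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q_weights)
print(f"scale={scale:.4f}, max rounding error={max_error:.4f}")
```

Each weight now fits in one byte instead of four, and the rounding error per weight is bounded by half the scale factor. That bound is the precision given up in exchange for the smaller model.
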

Figure: Model optimization block diagram
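
The fusion technique can be sketched in the same spirit. The example below assumes a linear layer followed by a per-output scale-and-shift step (the form batch normalization takes at inference time, once its statistics are fixed) and folds the two into a single linear layer. All names and numbers are illustrative, not taken from any particular toolkit.

```python
def linear(x, weights, bias):
    """Compute y[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def scale_shift(y, gamma, beta):
    """Per-output affine step: gamma[j] * y[j] + beta[j]."""
    return [g * yi + be for g, yi, be in zip(gamma, y, beta)]


def fuse(weights, bias, gamma, beta):
    """Fold the scale-and-shift step into the linear layer's parameters."""
    fused_w = [[g * w for w in row] for row, g in zip(weights, gamma)]
    fused_b = [g * b + be for g, b, be in zip(gamma, bias, beta)]
    return fused_w, fused_b


# Hypothetical parameters and input.
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
gamma = [1.5, 0.5]
beta = [0.0, 1.0]
x = [3.0, 4.0]

two_step = scale_shift(linear(x, W, b), gamma, beta)  # two operations
fused_W, fused_b = fuse(W, b, gamma, beta)
one_step = linear(x, fused_W, fused_b)                # one operation

print(two_step)
print(one_step)
```

Both paths produce the same output up to floating-point rounding, but the fused version needs only one pass over the data at inference time.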

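An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. Here, the IR plays the same role for a trained model: it is the common format that the Model Optimizer produces and the Inference Engine consumes.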

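In rule-based systems, an inference engine is the component that applies logical rules to a knowledge base to deduce new information. (In deep learning toolkits, the term instead refers to the runtime that executes a trained, optimized model on the target hardware.) The logic a rule-based inference engine uses is IF-THEN: IF logic expression THEN logic expression.

For example, “If it rains, then it is wet” can be written in pseudocode as: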

Rule1: rain(a) => wet(b)

A popular algorithm for implementing rule-based inference engines is the Rete algorithm.
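
To make the IF-THEN idea concrete, here is a minimal sketch of naive forward chaining in Python. It is not the Rete algorithm (which avoids re-checking every rule on every pass); it simply re-applies all rules until nothing new can be derived. The second rule and the fact names are invented for illustration.

```python
def forward_chain(facts, rules):
    """Apply IF-THEN rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Each rule is (premises, conclusion): IF all premises hold THEN conclusion.
rules = [
    (("rain",), "wet"),         # Rule1 from the text
    (("wet", "cold"), "icy"),   # hypothetical extra rule
]

derived = forward_chain({"rain", "cold"}, rules)
print(sorted(derived))
```

Starting from the facts "rain" and "cold", the engine first derives "wet" and then, because "wet" is now known, derives "icy" as well.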

MQ Telemetry Transport (MQTT)

MQTT is a lightweight messaging protocol; a protocol is a system of rules that allows two or more entities in a communication system to transmit information. MQTT is optimized for high-latency or unreliable networks. It uses a publish/subscribe architecture designed for resource-constrained devices and low-bandwidth setups, and it is used a lot for Internet of Things devices and other machine-to-machine communication. Here, MQTT is used to publish data from your edge models to the web.


In the publish/subscribe architecture, there is a broker, or hub, that receives messages published to it by different clients. The broker then routes the messages to any clients subscribing to those particular messages.

This is managed through the use of what are called “topics”. One client publishes to a topic, while another client subscribes to the topic. The broker handles passing the message from the publishing client on that topic to any subscribers. These clients, therefore, don’t need to know anything about each other, just the topic they want to publish or subscribe to.

A layman’s example: a customer who wants a burger places the order with a waiter, who passes it on to a cook the customer never meets. The customer, the waiter, and the cook correspond to the MQTT publisher, the broker, and the MQTT subscriber, respectively.
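
The routing role of the broker can be sketched with a toy, in-memory version in Python. This is not the MQTT protocol itself (there is no network, no quality-of-service level, no topic wildcards); the class name `ToyBroker` and the topic `camera/people` are made up for this example.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish/subscribe broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to be called for every message on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Deliver `payload` to all subscribers of `topic`; return how many."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


broker = ToyBroker()
received = []

# The subscriber only knows the topic, not who publishes to it.
broker.subscribe("camera/people", lambda topic, payload: received.append((topic, payload)))

# The publisher only knows the topic, not who is listening.
delivered = broker.publish("camera/people", {"count": 2})
ignored = broker.publish("camera/cars", {"count": 5})  # no subscribers

print(received, delivered, ignored)
```

The publisher and the subscriber never reference each other; the topic string is the only thing they share, which is the decoupling described above.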


Sometimes, you may still want a video feed to be streamed to a server. A security camera that detects a person where they shouldn’t be and sends an alert is useful, but you likely want to then view the footage. Since MQTT is not designed for streaming images or video, we have to look elsewhere.

Network communications can be expensive in cost, bandwidth, and power consumption. Video streaming consumes a lot of network resources, as it requires a large amount of data to be sent over the network. Even with high-speed internet, multiple users streaming video can cause things to slow down. As such, it’s important to first consider whether you even need to stream video to a server, or whether you can stream it only in certain situations, such as when your edge AI algorithm has detected a particular event.

The FFmpeg library is one way to do this streaming. The name comes from “fast forward” MPEG, meaning it’s supposed to be a fast way of handling the MPEG video standard (among others). Similar to the MQTT setup, there is an intermediate FFmpeg server that video frames are sent to; the final Node server that displays a webpage gets the video from that FFmpeg server.
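
The idea of streaming only when an event has been detected can be sketched as a small filter in Python. This is an assumption-laden illustration: the per-frame detection flags would come from your edge model, and the `cooldown` parameter (how many frames to keep sending after a detection) is a hypothetical design choice, not something FFmpeg provides.

```python
def frames_to_stream(detections, cooldown=2):
    """Yield indices of frames worth sending to the server.

    A frame is sent if the model detected an event in it, or if it falls
    within `cooldown` frames after the most recent detection.
    """
    remaining = 0
    for index, detected in enumerate(detections):
        if detected:
            remaining = cooldown + 1  # this frame plus `cooldown` more
        if remaining > 0:
            yield index
            remaining -= 1


# Hypothetical per-frame detection results from an edge model.
detections = [False, False, True, False, False, False, False, True, False]
selected = list(frames_to_stream(detections, cooldown=2))
print(selected)
```

In this example only five of the nine frames would be sent over the network; the rest never leave the device.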

By moving certain workloads to the edge of the network, your devices spend less time communicating with the cloud, react more quickly to local changes, and operate reliably even in extended offline periods.

Thanks for reading!!
