Use device reliability engineering to eliminate risk from IoT products


Developer teams know this cycle well. They finally get a product to launch and spend the next 48 hours celebrating while everything looks amazing. But soon the cold reality and hard work of having a product in the field sets in: there is no such thing as a problem-free launch.

It is not a failure of talent or technology. Firmware author and lecturer Jack Ganssle claims that every 1,000 lines of code has between 10 and 100 flaws. Since most IoT devices have thousands of lines of code, Ganssle’s logic suggests that hundreds, if not thousands, of faults exist in each one. While many are insignificant, some flaws are serious and cannot be ignored.

No matter how good an organization’s QA processes are, some issues only show up in production. Why? If a bug occurs once every 10,000 hours, that bug is extremely hard to find in a handful of devices in QA. But once there are 10,000 units in the field, that same bug surfaces roughly once an hour.

Bugs, security issues, missing features – they all lead to unhappy customers and, often, a deluge of customer complaints. Planning for these post-launch issues should be part of the product life cycle. This approach, called device reliability engineering (DRE), encompasses the engineering practices, frameworks, and tools that can be used to manage reliability at scale, post-launch.

With the inevitability of bugs and issues in the field, adopting three key DRE techniques can help teams everywhere reduce the risks of a product launch.

1. Full OTA

Think of over-the-air (OTA) updates – the wireless delivery of new software, firmware or other data to connected devices – as an insurance policy: without OTA, the only remedy for a serious defect is a product recall. With OTA, developers can release fixes and keep devices operational.

A well-architected OTA system is far more likely to perform reliably, and that architecture should include thorough test coverage. Successful systems support cohorts, staged deployments, and firmware signing.

With cohorts, developers group their devices and update each group separately. Cohorts are a simple way to test releases, enabling A/B testing and other types of experimentation. They are also useful when working with multiple industrial customers who each want updates on their own schedule.
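A minimal sketch of deterministic cohort assignment, assuming devices are identified by a serial string: hashing the serial with a stable hash (FNV-1a here) means a device always lands in the same cohort, with no server-side state to keep in sync. The function names and cohort count are illustrative, not from any particular OTA product.

```c
#include <stdint.h>

#define NUM_COHORTS 4

/* FNV-1a hash: stable across boots, so a device always
 * maps to the same cohort. */
static uint32_t fnv1a_hash(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Map a device serial to one of NUM_COHORTS update groups. */
int cohort_for_device(const char *serial) {
    return (int)(fnv1a_hash(serial) % NUM_COHORTS);
}
```

Because assignment is a pure function of the serial, the device and the backend can compute the same cohort independently, and A/B experiments stay consistent across releases.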

Staged deployments provide the ability to gradually push new updates to the device fleet. Because each release poses risk, rolling out updates gradually limits the reach of any new issues and can prevent one issue from affecting all customers at once. Ideally, developers can set up a system to direct reported issues to the OTA system; if no issues are reported, they can automatically increase deployment increments until they reach the entire fleet.
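The staged-rollout gate described above can be sketched as a percentage check against a stable per-device bucket. This is a hypothetical illustration: the hash and function names are assumptions, but the key property – raising the rollout percentage only ever adds devices, never reshuffles them – is what makes staged deployments safe to widen incrementally.

```c
#include <stdint.h>

/* Stable hash of the device serial, so a device's rollout
 * bucket never changes between checks. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* A device takes the update once the rollout percentage reaches
 * its bucket (0-99). Raising the percentage is monotonic: devices
 * already updated stay included as the rollout widens. */
int should_update(const char *serial, int rollout_percent) {
    return (int)(fnv1a(serial) % 100) < rollout_percent;
}
```

An automated rollout loop would call this with 1%, then 10%, then 50%, pausing or halting if the issue-reporting pipeline flags a regression.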

Firmware signing is a method that proves a file was created by a trusted source and has not been tampered with. It does this by creating a verifiable signature for the file. By implementing signature verification in a bootloader, developers can verify the authenticity of a given firmware update, and the bootloader can decide whether to warn the user, void the warranty of the device, or simply refuse to run an unauthenticated binary.
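A structural sketch of that bootloader check follows. This is not a real signature scheme: a production bootloader would verify an Ed25519 or RSA signature against a public key stored on the device (via a library such as Mbed TLS), while here a simple keyed checksum stands in so the control flow is runnable. The image layout (payload followed by a 4-byte tag) and all names are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Placeholder for a real signature primitive: in production this
 * would be an Ed25519/RSA verification against a burned-in public
 * key, NOT a checksum. */
static uint32_t keyed_checksum(const uint8_t *data, size_t len,
                               uint32_t key) {
    uint32_t sum = key;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + data[i];
    return sum;
}

/* Assumed image layout: payload bytes followed by a 4-byte tag. */
int image_is_authentic(const uint8_t *image, size_t len, uint32_t key) {
    if (len < 4) return 0;
    size_t payload_len = len - 4;
    uint32_t stored;
    memcpy(&stored, image + payload_len, 4);
    return keyed_checksum(image, payload_len, key) == stored;
}

/* Bootloader policy: refuse to run an unauthenticated binary. */
int bootloader_accept(const uint8_t *image, size_t len, uint32_t key) {
    return image_is_authentic(image, len, key);
}
```

The important design point is that the decision lives in the bootloader, below the application: even a compromised or corrupted application image cannot skip the check on the next boot.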

2. Performance Metrics

After launch, IoT developers must have access to hardware data, which is essential to help them monitor the state of the device fleet.

The five most useful metrics are connectivity, battery life, memory usage, sensor performance, and system responsiveness. The system collecting these metrics must have three essential characteristics:

  1. Low overhead. Collecting metrics should not impact device performance.
  2. Easy to extend. When teams decide to add a metric, it can’t require the collaboration of three different teams.
  3. Preservation of privacy. Given the regulatory landscape in California, Europe, and other places where a device can be used, privacy protections need to be built in from the start.
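The "low overhead" requirement above can be made concrete: if each metric is a field in a fixed struct, recording one is a single increment or store, cheap enough to call from timing-sensitive firmware paths. This is a hypothetical sketch; the struct fields and function names are invented for illustration, not from any particular metrics SDK.

```c
#include <stdint.h>
#include <string.h>

/* Fixed-size metric block: recording a metric is one add or store,
 * with no allocation and no I/O on the hot path. */
typedef struct {
    uint32_t connect_attempts;
    uint32_t connect_failures;
    uint32_t flash_writes;
    uint32_t battery_mv;   /* last sampled battery voltage, mV */
} device_metrics_t;

static device_metrics_t g_metrics;

void metrics_reset(void) { memset(&g_metrics, 0, sizeof(g_metrics)); }

void metrics_connect_attempt(int ok) {
    g_metrics.connect_attempts++;
    if (!ok) g_metrics.connect_failures++;
}

void metrics_flash_write(void) { g_metrics.flash_writes++; }

void metrics_sample_battery(uint32_t mv) { g_metrics.battery_mv = mv; }

const device_metrics_t *metrics_get(void) { return &g_metrics; }
```

Extending the system then means adding a field and a one-line setter – no cross-team coordination required, which addresses the second characteristic as well.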

Generally, there are two primary use cases for these metrics. The first is device metrics. By collecting data points for individual devices, developers can investigate specific reports of misbehaving devices, whether those reports arrive through customer support or engineering teams. Organizations need to be able to capture per-device data so that when a customer calls with a battery life complaint, the customer support team or engineers can quickly see operational correlations, such as between battery usage and writing to flash. A solid metrics system makes this possible.

A key thing to remember about capturing performance metrics is that capture can happen asynchronously, a particularly critical feature for devices with limited connectivity. Beyond individual metrics, there should be some level of aggregation and dashboards to give an indication of overall fleet performance and a way to quickly identify data trends.
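One common way to decouple capture from connectivity – sketched here as an assumption, since the article doesn't prescribe a mechanism – is a small ring buffer: samples accumulate while the device is offline and are drained when a connection is available, so recording a metric never blocks on the network. When the buffer is full, new samples are dropped rather than stalling the caller.

```c
#include <stdint.h>

#define QUEUE_LEN 8

typedef struct {
    uint32_t timestamp;
    uint32_t value;
} sample_t;

static sample_t queue[QUEUE_LEN];
static unsigned head, tail;   /* head = next write, tail = next read */

/* Called from anywhere in the firmware; never blocks. */
int sample_push(sample_t s) {
    unsigned next = (head + 1) % QUEUE_LEN;
    if (next == tail) return 0;   /* full: drop rather than stall */
    queue[head] = s;
    head = next;
    return 1;
}

/* Called by the uplink task when connectivity returns. */
int sample_pop(sample_t *out) {
    if (head == tail) return 0;   /* empty */
    *out = queue[tail];
    tail = (tail + 1) % QUEUE_LEN;
    return 1;
}
```

A production version would persist the buffer across reboots and guard it against concurrent access, but the shape is the same: capture and upload run on independent schedules.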

The other main use case is setting up alerts. A metrics system should let teams configure alerts that are sent via email, instant messaging or incident management platforms when certain conditions are met. Rather than waiting for someone to look at charts, alerts bring issues to the team’s attention immediately.
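At its core, an alert rule is just a named condition evaluated over aggregated metrics. The sketch below – with invented names, as the article doesn't specify an alerting API – fires when connection failures exceed a configured percentage of attempts in a reporting window; integer arithmetic avoids floating point, which matters if rules are ever evaluated on-device.

```c
#include <stdint.h>

typedef struct {
    const char *name;         /* e.g. "connectivity" */
    uint32_t threshold_pct;   /* alert when failures exceed this % */
} alert_rule_t;

/* Fires when failures/attempts strictly exceeds the threshold.
 * Integer cross-multiplication avoids floating point. */
int alert_should_fire(const alert_rule_t *rule,
                      uint32_t failures, uint32_t attempts) {
    if (attempts == 0) return 0;   /* no data: nothing to alert on */
    return failures * 100 > attempts * rule->threshold_pct;
}
```

The delivery side – email, chat, or an incident management platform – would hang off the return value of this check in the backend.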

3. Remote Debugging

Consider the steps involved in traditional debugging. Typically, it starts with several reports of different customer issues; all may stem from the same underlying problem, but the descriptions vary too much for engineering teams to tell. The support team answers the phone or responds to individual emails. Eventually, the organization distills that feedback into a few different logs that customers must collect manually. With this data, teams reproduce the problem on devices in the laboratory, then hand those devices to the engineers. It is time-consuming and expensive.

Remote debugging converts this time-consuming and expensive process into something automated that happens much faster. It creates a way for devices to report issues automatically, feeding them into a cloud pipeline that analyzes the data, deduplicates and aggregates the reports into error instances, and then shares those instances with engineering.
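The deduplication step is typically done by fingerprinting each crash – for example, hashing the backtrace addresses so that reports from thousands of devices with the same call stack collapse into a single error instance. The function below is a minimal sketch of that idea under the assumption that backtraces arrive as arrays of return addresses; real pipelines usually also normalize addresses against symbol information first.

```c
#include <stdint.h>
#include <stddef.h>

/* Fingerprint a crash by hashing its backtrace (FNV-1a, 64-bit).
 * Identical call stacks from different devices produce the same
 * fingerprint, so reports can be grouped into error instances. */
uint64_t crash_fingerprint(const uint32_t *backtrace, size_t depth) {
    uint64_t h = 1469598103934665603ull;   /* FNV offset basis */
    for (size_t i = 0; i < depth; i++) {
        h ^= backtrace[i];
        h *= 1099511628211ull;             /* FNV prime */
    }
    return h;
}
```

The backend then keys a counter and a sample report off each fingerprint, which is what turns "500 vague support tickets" into "one bug, seen 500 times."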

Core dumps are a standard debugging technique. These are automatic, detailed diagnostics captured whenever issues arise. They include logs, backtraces, and memory contents, giving engineers the information they need to troubleshoot the issue. Developers need to collect this diagnostic data, upload it, and create a way to examine it.
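A common on-device pattern – sketched here with invented names and a simplified record, since the article doesn't specify an implementation – is for the fault handler to write a dump into memory that survives reset (noinit RAM or flash), marked with a magic value; on the next boot, the application notices the pending dump, uploads it, and clears the slot.

```c
#include <stdint.h>
#include <string.h>

/* Simplified core-dump record: a real one would also snapshot the
 * full register file, key RAM regions, and recent log lines. */
typedef struct {
    uint32_t magic;        /* marks a valid, unconsumed dump */
    uint32_t pc, lr, sp;   /* program counter, link reg, stack ptr */
    char     reason[32];   /* short fault description */
} core_dump_t;

#define DUMP_MAGIC 0xC0DEFA17u

/* Stands in for a noinit-RAM or flash region that survives reset. */
static core_dump_t dump_slot;

/* Called from the fault handler with the faulting context. */
void core_dump_capture(uint32_t pc, uint32_t lr, uint32_t sp,
                       const char *reason) {
    dump_slot.magic = DUMP_MAGIC;
    dump_slot.pc = pc;
    dump_slot.lr = lr;
    dump_slot.sp = sp;
    strncpy(dump_slot.reason, reason, sizeof(dump_slot.reason) - 1);
    dump_slot.reason[sizeof(dump_slot.reason) - 1] = '\0';
}

/* Called on boot: returns 1 and copies the dump out if one is
 * pending, then clears the slot so it uploads exactly once. */
int core_dump_read_and_clear(core_dump_t *out) {
    if (dump_slot.magic != DUMP_MAGIC) return 0;
    *out = dump_slot;
    dump_slot.magic = 0;
    return 1;
}
```

The uploaded record feeds the same cloud pipeline described above, where the backtrace is symbolized and fingerprinted for deduplication.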

Connectivity has transformed device development by extending product life cycles. Previously, shipping a product usually meant there was no further interaction with it. Now, the product life cycle continues long after a product has been sent into the hands of customers.

By adopting DRE techniques, developers can reduce product launch risk, prepare for the inevitability of post-launch issues, and deliver a higher quality and ever-improving overall product.

About the Author
François Baldassari is founder and CEO of Memfault, a connected device observability platform provider. An embedded software engineer by trade, Baldassari’s passion for tooling and automation in software engineering led him to launch Memfault. Prior to Memfault, Baldassari led the firmware team at Oculus and built the operating system at Pebble. Baldassari holds a BS in Electrical Engineering from Brown University.
