IoT Memory Management: A Case Study

By 01 Staff, Mar 17, 2016

Abstract

The explosive growth of the Internet of Things (IoT) has returned the niche art of writing embedded software to the forefront of software development. This growth is nurtured by the introduction of IoT frameworks and the availability of powerful new constrained-resource processors. A major resource constraint in some IoT processors is the amount of available RAM. Writing applications and libraries with limited RAM is one of the unique skills of embedded software developers. This paper illustrates some techniques used to exploit limited RAM, using a case study based on the author's fork of the open source project IoTivity [1]. Our primary concern is eliminating the possibility of unexpected malloc failures resulting from heap fragmentation, although related issues are also considered. Malloc failures can result in unstable behavior and even permanent client or server node termination, which can degrade IoT systems. The goal is to allow long-lived IoT nodes to be designed and deployed by developers who might not have a background in embedded programming.

Introduction

A related paper, Heap Allocation in the Internet of Things [2], explains the severity and implications of the memory issue and surveys possible solutions. The current paper assumes the reader is familiar with the heap fragmentation issue and the reasons to care about solutions.

This is a case study. It is a personal account, drawing on the author's hard-won experience and opinions.

IoTivity [1] is an implementation of the Open Interconnect Consortium (OIC) [3] protocol connecting clients and servers in the Internet of Things. In the OIC protocol, a server typically represents one or more sensors or actuators located in the real world. Such devices might be quite small physically, and might run on batteries or scavenged energy. Ultra-low power nodes running IoTivity might depend on a relatively new class of processors with minimal computing resources, such as the Atmel* SAM3X and Intel® Curie™, both of which have less than 100 KB of RAM.

Keep in mind that the issue is not shoehorning the OIC protocol onto a resource-constrained processor with a simple application. The goal is to allow significant OIC applications, which might have significant RAM requirements of their own, to run indefinitely in secure, highly constrained processing environments.

The paper recounts the relevant steps needed to turn an unconstrained IoT library into one that can support OIC protocol in a constrained-resource processor. First, the IoTivity starting point is described. Second, the necessary architectural changes are explained. Third, specific examples of common design changes are described.

IoTivity: the starting point

IoTivity was originally written rapidly to fill a specific need. Because of this, the initial release of IoTivity has some limitations and can exhibit suboptimal behavior in resource-constrained environments. In this article, we examine the causes of some of these limitations and illustrate how to convert conventional heap usage into heap usage more appropriate to a highly-constrained, embedded environment.

To do the work needed to provide an OIC server stack that is useful in a constrained environment, we created a fork of IoTivity on a local repository.

IoTivity provides both client and server capability, and its original release required both to be built at the same time. Our first task was to allow the server capability to be built independently. This allowed heap improvements to be performed on a slightly smaller code base and reduced the minimum code size of a server library.

As an OIC server, IoTivity acts as a transaction processor. When a request arrives, it is processed according to its details, touching server resources as needed, and a result is returned to end the transaction. Some requests, such as observe requests, leave persistent state in the server that contributes to memory fragmentation. Also, when two requests arrive in rapid succession, parts of their handling can occur at the same time, further exacerbating heap issues. Fortunately, the first release of IoTivity serialized much of the request handling, minimizing this issue, although delays were introduced as a side effect.

We used static analysis as a proxy for dynamic analysis of malloc usage. The OIC protocol stack of IoTivity made the following calls (discovered using grep):

  • OICMalloc: 50

  • OICCalloc: 58

  • OICRealloc: 2

  • OICStrdup: 87

  • Total mallocs: 197

This does not include the 362 malloc calls made by the security and connectivity code. An estimated 90% of the OIC stack malloc calls were required by the combined client and server capability. The OIC protocol stack mallocs are most relevant because that's the part of IoTivity the author reworked for constrained processors. The security code was not included because it was under significant rework at the time. We also didn't include the connectivity adapter code because it is likely to be significantly different in a highly constrained environment.

A related issue is that, at the time of the initial IoTivity release, limited Valgrind* tests suggested five memory leaks, and further testing would likely reveal significantly more. The sheer number of malloc calls (>500) makes memory leaks inevitable and hard to resolve.

Fortunately, after the refactoring, the forked OIC server stack allocates only seven structures, has no memory leaks, and offers provable resistance to heap fragmentation.

A strategy for handling heap memory

The following steps can be applied when writing embedded software or when converting working embedded software to work in a constrained memory environment. They work best when performed in the order shown.

  1. Minimize data copying, especially where no new information is added to the data.

  2. Reduce the number of allocated structures and buffers.

  3. Parameterize the structure and buffer definitions.

  4. Provide appropriate allocators for the remaining allocation types.

We can demonstrate these steps with examples from IoTivity.

Minimize data copying

Copying data has many downsides; here we concentrate on the memory issues it presents. Data copying is not bad per se, but every copy should be justified in terms of the overall goals of the project. Specifically, every copy should be part of a step that adds value. A common step that often provides sufficient value to justify copying is an API that supports modular construction of applications. Other places might not provide sufficient value.

Phony APIs

When a problem and its associated solution are written up as an architectural design, various steps are presented as code modules that achieve some localized function and call each other. Sometimes related modules are assigned to different individuals or teams. An interface is defined to facilitate joint development, and each module is expected to conform to that interface definition. Unfortunately, teams sometimes refer to this interface definition as an API, and sometimes they treat it as one. Typically that interface definition adds no value for applications or application developers; it is simply a useful tool for team implementation of a library or system. Thinking of these internal boundaries as APIs often leads to data copying.

IoTivity was built quickly by several international teams working together, and several internal boundaries came to be thought of as APIs even though an application or application developer would never see them. As a result, the bulk of an incoming transaction was copied several times (you don't need to know the structure definitions):

  1. coap_pdu_t (in heap), copied to

  2. CARequestInfo_t (in heap), copied to

  3. OCServerProtocolRequest (on stack), copied to

  4. OCServerRequest (in heap), copied to

  5. OCEntityHandlerRequest (on stack)

Four copies. The first copy can be justified because it occurs at a thread hand-off and transforms some of the content into a more usable form. The remaining copies were there because different teams wanted the data in a specific form. Note that some of the allocations were on the stack and some were in the heap; each has its problems. All copies increase the memory footprint of the transaction, since copied buffers can't be freed until the copying is done, at the earliest. In a constrained environment, both the stack and the heap are limited resources.

In the reworked OIC server, only the first copy is retained; the rest of the code was reworked to use the second structure only. This reduced memory usage, copying overhead, and code size. These types of optimizations can be made only when the larger data path is considered. Thinking of design boundaries as APIs makes such optimization much more difficult.
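To make the idea concrete, here is a minimal C sketch of the reworked pattern: one structure allocated at the thread hand-off, with every later stage receiving a pointer to it instead of its own copy. All names are illustrative, not the actual IoTivity definitions.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative request structure: allocated once when the request
     * arrives, then shared by every processing stage. */
    typedef struct {
        uint16_t messageId;
        char     resourceUri[64];
        uint8_t  payload[256];
        size_t   payloadLength;
    } RequestInfo;

    static void authorize(RequestInfo *req) { (void)req; /* inspect in place */ }
    static void dispatch(RequestInfo *req)  { (void)req; /* handle in place */ }

    void handleRequest(RequestInfo *req)
    {
        authorize(req);   /* internal boundary crossed without a copy */
        dispatch(req);    /* internal boundary crossed without a copy */
    }

Each internal boundary is still a clean function interface; it simply no longer implies ownership of a private copy of the data.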

Reduce memory allocations

Object nesting

Often a transaction or other process is represented by several objects, implemented as data structures related by pointers among them. For instance, a transaction might be represented (as in IoTivity) by a request body, the return address, and various metadata gleaned from decoding the request. It is reasonable to define separate structures for each of these items to designate functional separation.

Unfortunately, this separation of function is often echoed by separate memory allocations, with multiple allocations being linked by pointers between the structures. The costs of this approach include:

  • Multiple memory allocations where one might suffice.

  • Object lifetime management confusion.

  • Pointer reference complexity and overhead.

This disjointed style is often adopted because allocating and freeing the objects independently seems to promote memory efficiency. For example, one object might not be needed for every transaction, so a null pointer can hold its place.

We submit that it is better to allocate all resources needed to support a transaction at the start of transaction processing. The allocations can be made clearer by creating a single nested structure that includes instances of all the needed structures rather than pointers to separate instances. The separate structures are still provided for functional separation, but the pointers are gone. The advantages of this approach include predictable buffer usage, elimination of unwinding a transaction after a later allocation failure, reduced opportunities for memory leaks, and reduced pointer manipulation.
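A minimal sketch of the two styles, using hypothetical structure names, shows the difference:

    #include <stdlib.h>

    typedef struct { char uri[64]; }  Request;
    typedef struct { char addr[40]; } ReturnAddress;
    typedef struct { int observe; }   Metadata;

    /* Disjointed style: three heap allocations linked by pointers,
     * three frees, and an unwind path if a later allocation fails. */
    typedef struct {
        Request       *request;
        ReturnAddress *returnAddr;
        Metadata      *meta;
    } TransactionLinked;

    /* Nested style: one allocation covers the whole transaction. */
    typedef struct {
        Request       request;      /* instance, not pointer */
        ReturnAddress returnAddr;   /* instance, not pointer */
        Metadata      meta;         /* instance, not pointer */
    } TransactionNested;

    TransactionNested *beginTransaction(void)
    {
        /* Either the transaction gets everything it needs up front,
         * or it is rejected; there is no mid-transaction unwinding. */
        return calloc(1, sizeof(TransactionNested));
    }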

Object nesting appears in the reworked OIC server in the definition of CAMessageInfo_t (previously CARequestInfo_t above). IoTivity needed to keep track of the return address as well as the transaction body, and their lifetimes were identical, so the return address (OCDevAddr) was moved into the CAMessageInfo_t structure rather than being maintained as a separate structure. This simplified allocation, and it also meant that several call argument lists could be simplified.

String allocation

The most common nested object in IoTivity is the string. The fundamental design issue with a string is its variable length. Because a string might be large or small, common usage is to place the string in its own buffer that is allocated to be just the right length for the current usage of the string. This is done in the often-mistaken belief that optimizing the string allocation saves memory and eliminates specification uncertainty.

Allocating a buffer on a heap requires additional control structures whose size depends on the implementation. Let's assume our allocator requires 16 bytes in addition to the size of the buffer. The extra 16 bytes amounts to 100% overhead (wastage) for a 16-character string (200% for 8 characters, 50% for 32, and so on). Plus, the pointer to the string in the controlling structure is another 4 bytes, making the real overhead 20 bytes. So memory efficiency isn't a good reason to individually allocate strings, as we demonstrate below.

Specification uncertainty means that you aren’t sure how long a string can get. This is often a result of lazy analysis, because strings must have a maximum length. If there’s no maximum at the place you are programming now, there is likely a maximum downstream. In a constrained environment it is critical to know how large your strings can be, and to place arbitrary limits when no inherent limits are apparent. At some point you are going to test or assume the length of the string.

When you know how long a string can reasonably be, you can allocate space for it where previously you would have instead placed a pointer to it. This accrues several advantages:

  • One less malloc/free to program and maintain.

  • One less opportunity for a memory leak now and in the future.

  • Less runtime processing of malloc/free.

  • No chance of later transaction failure due to an allocation error.

  • Simpler memory usage and less fragmentation, by reducing the total number of allocations.
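As a sketch, assuming a hypothetical DevAddr structure and the 40-byte address maximum discussed below, fixed string allocation looks like this:

    #include <stdint.h>
    #include <string.h>

    #define ADDR_STRING_LENGTH 40   /* the 40-byte figure from the table below */

    /* The address string lives inside the owning structure instead of
     * behind a separately malloc'd pointer. */
    typedef struct {
        char     addr[ADDR_STRING_LENGTH];
        uint16_t port;
    } DevAddr;

    /* Copying with an explicit bound: in a constrained node, an over-long
     * address is a protocol error, not a reason to grow a buffer. */
    int setAddr(DevAddr *da, const char *s)
    {
        if (strlen(s) >= sizeof da->addr)
            return -1;               /* reject rather than overflow */
        strcpy(da->addr, s);
        return 0;
    }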

A common objection to fixed string allocation is wasted memory. Refer to the following table, which compares memory usage and wastage for string sizes varying from 1 byte to 128 bytes. The individual heap allocation includes the 20 bytes of overhead described above, and each wastage column shows how many more bytes the worse of the two options uses. Note that because of the heap allocation overhead, the wastage depends on the details of the allocation. There is no clear winner, or single best option.

| maximum string size (bytes) | actual string size (bytes) | individual heap allocation (bytes) | wastage, individual (bytes) | fixed allocation in another structure (bytes) | wastage, fixed (bytes) |
|----:|----:|----:|----:|----:|----:|
|   1 |   1 |  21 |  20 |   1 |     |
|  16 |   1 |  21 |   5 |  16 |     |
|  16 |   8 |  28 |  12 |  16 |     |
|  16 |  16 |  36 |  20 |  16 |     |
|  32 |   1 |  21 |     |  32 |  11 |
|  32 |  16 |  36 |   4 |  32 |     |
|  32 |  32 |  52 |  20 |  32 |     |
|  64 |   1 |  21 |     |  64 |  43 |
|  64 |  32 |  52 |     |  64 |  12 |
|  64 |  64 |  84 |  20 |  64 |     |
| 128 |   1 |  21 |     | 128 | 107 |
| 128 |  64 |  84 |     | 128 |  44 |
| 128 | 128 | 148 |  20 | 128 |     |
|  40 |  16 |  36 |     |  40 |   4 |
|  40 |  25 |  45 |   5 |  40 |     |

The last two rows in the table represent a choice made in IoTivity to allocate the IP address string. At one point it was individually allocated, but now it is a fixed, 40-byte allocation in an addressing structure. When most addressing is IPv4 (~16 bytes), about four bytes of memory are wasted by this arrangement. As usage transitions to IPv6 addresses (~25 bytes typical, 40 bytes maximum), the wastage reverses. In neither case is the wastage large. When memory is short, predictability trumps efficiency.

From the table, note that the better you can constrain the maximum size of a string, the better fixed allocation behaves. Good analysis pays off. The discussion above also pertains to any variable-size allocation, not just strings.

Allocation for decoding

IoTivity decodes payloads that are encoded in CBOR. For this discussion, think of CBOR as a compact binary form of JSON. We take a single buffer (equivalent to a JSON string) and decode it into a large number of individual components (arrays, names, numbers, strings, etc.). Originally IoTivity allocated each of these components individually on the heap. A single CBOR payload might include dozens of such components.

The reworked IoTivity decodes using a single allocation instead of dozens. A decoded payload is represented as a hierarchy of structures, representing either arrays or compound objects, with leaf nodes that represent values such as names, numbers, and strings. The top-level structure is predefined to match the definition of each payload type, and additional space is allocated in the same buffer, following the structure. The top-level structure might be 24 bytes, while the total allocation might be 1500 bytes.

As the decoding takes place, components are allocated from the 1500 bytes rather than from the heap. The allocation consists of keeping track of the end of the used space within the buffer, starting right after the top-level structure. Each component is given the next free space, and the end of that component becomes the start of the next space available for allocation.

[ top-level structure ] [ string ] [ array structure ] [ string ] [ number ] [ string ] [ string ] …

Of course, you have to keep track of the next allocation space and make sure it doesn’t exceed the space that was really allocated from the heap.
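A minimal bump-allocator sketch (illustrative, not the exact IoTivity code) shows how little machinery this bookkeeping requires:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint8_t *buffer;   /* the single heap allocation */
        size_t   size;     /* total bytes available */
        size_t   used;     /* bytes handed out so far */
    } DecodeArena;

    static void *arenaAlloc(DecodeArena *a, size_t n)
    {
        n = (n + 7u) & ~(size_t)7u;    /* keep components aligned */
        if (a->size - a->used < n)
            return NULL;               /* would overrun the buffer */
        void *p = a->buffer + a->used;
        a->used += n;
        return p;
    }

    void decodeExample(void)
    {
        DecodeArena a = { malloc(1500), 1500, 0 };  /* e.g., the decode buffer */
        if (!a.buffer)
            return;
        char *name = arenaAlloc(&a, 24);   /* one decoded component */
        (void)name;                        /* ... decode the rest ... */
        free(a.buffer);                    /* one free releases everything */
    }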

This form of allocation requires only one heap allocation rather than dozens of heap allocations, and only one free rather than dozens. The advantages of this include:

  • Greatly reduced heap usage, since each allocation has zero overhead (rather than 20 bytes of overhead).

  • Greatly reduced allocation/free CPU times.

  • No chance of a memory leak.

  • Lower chance of memory fragmentation.

Of course, this approach also requires careful analysis to minimize the risk of running out of memory in the single allocation. The key observation is that the binary payload has a defined maximum size. Analysis of decoded payloads showed that a 50% larger buffer is plenty to hold the decoded payload. Because most elements were quite small, the savings relative to malloc/free easily justified the decode buffer size.

This type of allocation is used by getifaddrs(3) in Linux. The returned ifaddrs structure is a rich structured list of network interface descriptions, all allocated at the end of the single returned buffer. A call to freeifaddrs(3) simply frees the one buffer. Through the use of intermediate buffers, getifaddrs() guarantees that the one buffer is large enough to hold all elements.
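For reference, the familiar usage pattern is:

    #include <ifaddrs.h>
    #include <stdio.h>

    void listInterfaces(void)
    {
        struct ifaddrs *list, *cur;
        if (getifaddrs(&list) == -1)
            return;
        for (cur = list; cur != NULL; cur = cur->ifa_next)
            printf("%s\n", cur->ifa_name);  /* nodes live inside the one buffer */
        freeifaddrs(list);                  /* a single call frees the whole list */
    }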

Linked lists

IoTivity must represent lists in some of its data structures. Originally, these lists were implemented as linked lists, with the attendant CPU and memory overhead. For example, a resource structure has lists of types and interfaces. Linked lists can handle very large lists, but consideration of typical resources shows that few have more than three resource types or resource interfaces.

The reworked IoTivity code declares an array of entries for each of these lists instead of using a linked list for each. The array length is parameterized at build time (more on that later) so longer lists can be handled if the need is recognized. Incidentally, each array entry is primarily a fixed string allocation. This eliminates the overhead and uncertainty of a linked list.
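A sketch of the resulting structure, using the RT_MAX, RI_MAX, and MAX_NAME_LENGTH parameters described later (the values here are illustrative):

    #define RT_MAX 3            /* resource types per resource */
    #define RI_MAX 3            /* resource interfaces per resource */
    #define MAX_NAME_LENGTH 32  /* fixed string size for each entry */

    typedef struct {
        char resourceTypes[RT_MAX][MAX_NAME_LENGTH];  /* fixed string entries */
        char interfaces[RI_MAX][MAX_NAME_LENGTH];     /* fixed string entries */
        int  typeCount;
        int  interfaceCount;
    } Resource;

The whole list costs a known, fixed number of bytes inside the resource structure, with no pointers to manage.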

The reworked IoTivity code base provides both a server and a client, and recognizes the inherent difference between the two. Specifically, a server can be built to handle the application it is supporting. By considering the nature of the resource it is serving, the developer can safely make assumptions about parameters, such as the number of interfaces that need to be supported.

On the other hand, an IoTivity client must be able to handle all servers it is asked to manage. Thus, it is appropriate for an IoTivity client to use a linked list, with its attendant heap ramifications, where a server might not. Recognizing this difference, the reworked client code handles most lists using linked lists, while the server code handles them as fixed arrays. The right answer depends on the circumstances.

Parameterize the allocations

Many aspects of allocation and behavior are parameterized in the reworked code. All adjustable parameters are available in a single include file (OCConfig.h) so the developer and maintainers don’t have to hunt for them (as they did in the original IoTivity). All adjustable parameters have reasonable defaults, so they don’t get in the way of learning how to use the code base. The defaults are chosen to handle the most common usage of the code base, so adjustments typically tighten RAM usage rather than loosen it.

Included in OCConfig.h are parameters that determine what capabilities are included. Choosing or disallowing a capability almost always affects code size, but it might also affect structures and buffers associated with using that capability.

Most of the parameters in OCConfig.h adjust the sizes of various buffers and lists. Many parameters are carried over from the original IoTivity, where they were buried in a variety of include files and were difficult to find and keep track of. Examples of parameterized fields are:

  • COAP_MAX_PDU_SIZE. Maximum CoAP message size.

  • RESOURCE_URI_LENGTH. Maximum URL string size.

  • RT_MAX. Maximum number of resource types in a resource.

  • RI_MAX. Maximum number of resource interfaces in a resource.

  • MAX_NAME_LENGTH. Maximum length of resource type or interface.

  • URI_QUERY_LENGTH. Maximum OIC query string size.

  • PAYLOAD_STRUCT_SIZE. Maximum size of a decoded payload.

  • SIMULTANEOUS_MESSAGES. Number of preallocated transaction buffers.
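A sketch of the overridable-default pattern such a configuration file might use (the values shown are illustrative, not IoTivity's actual defaults):

    /* OCConfig.h (excerpt): every parameter has a reasonable default
     * that the developer can override at build time. */
    #ifndef COAP_MAX_PDU_SIZE
    #define COAP_MAX_PDU_SIZE 320       /* maximum CoAP message size */
    #endif

    #ifndef RESOURCE_URI_LENGTH
    #define RESOURCE_URI_LENGTH 64      /* maximum URL string size */
    #endif

    #ifndef RT_MAX
    #define RT_MAX 3                    /* resource types per resource */
    #endif

    #ifndef SIMULTANEOUS_MESSAGES
    #define SIMULTANEOUS_MESSAGES 2     /* preallocated transaction buffers */
    #endif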

A counterpart to parameterization is knowing the sizes of the allocated data structures. The reworked server prints the sizes as part of the debug logging, making the sizes readily available to the developer. Being explicitly aware of all the sizes contributes to careful memory design.
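One hypothetical way to surface those sizes:

    #include <stdio.h>

    /* Print a structure size during startup logging so the developer
     * sees exactly what each configured structure costs in RAM. */
    #define LOG_SIZE(type) printf("sizeof(" #type ") = %zu\n", sizeof(type))

    /* e.g., LOG_SIZE(CAMessageInfo_t); LOG_SIZE(OCDevAddr); */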

Parameterization should be a first-class design product.

Allocating memory without a heap

Heap fragmentation can be a source of application failure, and the best way to avoid it is to avoid using the heap. A common way to avoid fragmentation is to use the heap as a source of buffers that are never freed. Typically such allocations are done once at application startup based on parameters the user provides.

A common mechanism for allocating without a heap is often called Slab (or Slice) Allocation [4]. Slab allocation is used in Linux and some RTOSes for exactly the purpose described here. It eliminates fragmentation and reduces allocation overhead.

The reworked IoTivity uses a generalization of Slab Allocation, which we refer to as a buffer pool. Each of the allocated structures/buffers has a dedicated pool size and a parameterized number of buffers of that size. The number of buffers of each size is a user-configurable parameter in OCConfig.h. At startup time, the configured number of each pool size is allocated from the heap. When a buffer is needed, the appropriate size pool buffer is provided. The memory pool mechanism keeps track of how many buffers of each size remain available.

Allocating a buffer at runtime consists of grabbing a buffer of the desired size from a pool allocator list. Returning it just puts it back in the list. The CPU overhead of allocate/free is minuscule.
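A minimal free-list pool sketch (illustrative, not the exact IoTivity implementation) shows why the overhead is so small; free buffers are chained through their own first bytes, so the free list itself costs no extra RAM:

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        void  *freeList;   /* head of the chain of free buffers */
        size_t bufSize;    /* buffer size; must be >= sizeof(void *) */
        int    available;  /* buffers currently free, for monitoring */
    } Pool;

    int poolInit(Pool *p, size_t bufSize, int count)
    {
        p->freeList = NULL;
        p->bufSize = bufSize;
        p->available = 0;
        for (int i = 0; i < count; i++) {
            void *buf = malloc(bufSize);   /* one-time startup allocation */
            if (!buf)
                return -1;                 /* startup failure: node can't run */
            *(void **)buf = p->freeList;   /* push onto the free list */
            p->freeList = buf;
            p->available++;
        }
        return 0;
    }

    void *poolAlloc(Pool *p)
    {
        void *buf = p->freeList;
        if (buf) {
            p->freeList = *(void **)buf;   /* pop the head buffer */
            p->available--;
        }
        return buf;                        /* NULL if the pool is exhausted */
    }

    void poolFree(Pool *p, void *buf)
    {
        *(void **)buf = p->freeList;       /* push the buffer back */
        p->freeList = buf;
        p->available++;
    }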

The explicit nature of this memory pool means that the number of available or allocated buffers can be tracked as the application runs. The buffer counts can be examined as part of a debug process or monitored remotely over a network management channel. This provides detailed resource usage data that is not available when using a heap directly. For example, it makes it possible to know that a particular buffer has little or no margin, which provides an objective basis for reconfiguring the memory pool.

The memory pool mechanism in the reworked IoTivity takes a few hundred bytes of code and a few dozen bytes of RAM. The pool implementation includes a few extra useful features:

  • Table-driven configuration of any number of pool sizes.

  • Separate configuration of initial and maximum pool count for each size.

  • Dynamic addition of pool buffers up to a maximum count while running.

  • Sharing of pool buffers among similarly sized allocations.

  • Emergency failover to a larger buffer when a critical pool size is exhausted.

For more generalized allocation, memory pools can be organized as a collection of powers-of-two buffers. While more general, this arrangement is less effective than identifying the actual needed sizes.
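As a sketch, with illustrative sizes, such a table-driven configuration might look like:

    #include <stddef.h>

    /* Each row configures one pool: the buffer size, the count allocated
     * at startup, and the ceiling for dynamic additions. */
    static const struct {
        size_t size;
        int    initial;
        int    max;
    } poolConfig[] = {
        {   40, 4, 8 },   /* e.g., device address strings */
        {  320, 2, 4 },   /* e.g., CoAP PDUs */
        { 1500, 1, 2 },   /* e.g., decoded payloads */
    };

Identifying the handful of sizes the stack actually uses keeps every pool tight; the powers-of-two ladder is a fallback for when the sizes can't be predicted.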

Allocation is a system issue

RAM allocation is really a system issue. It requires coordination between the application, the operating system, and frameworks such as IoTivity. All three system aspects need to work together to use RAM effectively. In an unconstrained environment, the standard heap hides the need for cooperation in most cases. In a constrained environment, RAM management must be coordinated among the three. This means that a mechanism used in only one aspect can’t solve the problem. Even when the library and OS use memory perfectly, an injudicious application may still render the constrained node useless.

The memory pools used in the reworked IoTivity help, but they do no good unless the application is equally careful. Most systems depend on the operating system for a more general solution, and most RTOSes provide one. The memory pools described above are thus only part of a system-wide solution to using RAM.

Summary

We described a four-step strategy to deal with RAM usage in a constrained environment, illustrating each step with examples from reworking the IoTivity code base for constrained environments.

The overriding message of this article is that the developer of software in a resource-constrained environment must pay close attention to how RAM is used.

It might be acceptable to simply call malloc() and free() while establishing functionality, but long-term reliability requires care in every use of RAM.

The author hopes the examples here demonstrate that effective use of RAM is possible and worthwhile.

References

  1. http://www.iotivity.org/

  2. https://01.org/blogs/2016/heap-allocation-iot

  3. http://openinterconnect.org/

  4. http://en.wikipedia.org/wiki/Slab_allocation