Tools for exploiting warm/cold boot in iCE40 FPGAs

This project is based on YosysHQ/icestorm, which was published and announced in 2015:

Project IceStorm aims at documenting the bitstream format of Lattice iCE40 FPGAs and providing simple tools for analyzing and creating bitstream files. (…) The focus of the project is on the iCE40 LP/HX 1K/4K/8K chips. Most of the work was done on HX1K-TQ144 and HX8K-CT256 parts.

— IceStorm
bygone.clairexen.net/IceStorm

Later, support for devices of Lattice’s iCE40 Ultra Plus family was added. Moreover, Lattice embraced the open source community by providing a list of Community Sourced development boards. Today, the list of available boards with iCE40 FPGAs supported by open source synthesis tools comprises, but is not limited to: iCEstick Evaluation Kit, iCE40-HX8K Breakout Board, iCEBreaker FPGA, ICESugar, icoBOARD 1.0, Kéfir I iCE40-HX4K, Nandland Go board, myStorm, BlackIce, eCow-Logic, TinyFPGA… See hdl.github.io/awesome/boards.

This site documents existing examples, bootloaders and references about dynamic reconfiguration through the cold/warm boot feature available in iCE40 FPGAs.

Context

This article, examples and tool enhancements are the joint effort of several people.

In 2015, Claire (@clairexen) released Icestorm, including icemulti.

In March 2017, Juanma (@juanmard) and Unai (@umarcor) met and the latter introduced the former to the concept of (partial) dynamic reconfiguration in FPGAs (using high-end devices as a market reference). Juanma was eager to know more about it, so Unai analysed the configuration options in the iCE40 family manuals and datasheets, and then explained the possibilities that warm/cold boot provided. Over a couple of months, they exchanged feedback continuously. Juanma modified IceStorm’s icemulti and iceprog for prototyping and proving the features, while Unai proposed solutions for extending the scope beyond four addressable images (see Hypothesis). Juanma published a demo, and enhancements to icemulti and iceprog were made available in juanmard/icestorm.

In January 2018, independently, Luke (@tinyfpga) developed the TinyFPGA-Bootloader. It implemented a USB to SPI core for using the reset image as a passthrough for programming the flash memory. Therefore, although warm/cold boot is not explicitly mentioned in the description, it was, as far as we are aware, the first open source and documented practical use of the feature.

In December 2018, Luke helped Tim (@mithro) and Sean (@xobs) implement im-tomu/foboot. It is also based on loading a bootloader upon reset, which allows programming a user image in one of the cold/warm boot addresses. However, due to size constraints, it is a completely different implementation based on a soft core.

In May 2021, Sylvain (@sylefeb) and Bruno (@brunolevy01) were discussing SPI-flash difficulties on Twitter, when Unai jumped in and let them know about the existing work done together with Juanma, as well as the similarities with both the TinyFPGA and FOMU bootloaders. As had happened four years earlier, Sylvain got excited about the feature and a fruitful exchange developed between him and Juanma. Sylvain implemented a demo and a tutorial about warmboot. Within a couple of days, he implemented the first actual demo using more than four images (eight, precisely): Dynamic warm boot on the ice40, proof of concept.

Overall, this document is an attempt at gathering the information that all those projects have in common (from a theoretical/technical point of view) and for linking to all the specific implementations and examples.

Introduction to iCE40 configuration modes

Since iCE40 FPGAs are SRAM-based, thus volatile, it is common practice to include a flash memory in any board design. That memory is typically used for automatically loading a configuration on power-up through SPI. As a result, it is common in FPGA devices to find hard IP cores implementing SPI controllers. Furthermore, according to TN1248: iCE40 Programming and Configuration, the hard SPI cores in the iCE40 devices support not only the master mode required for loading from flash, but also a slave mode in which "the iCE40 configuration data can be downloaded from an external processor, microcontroller, or DSP processor using the SPI interface".

On top of that, some devices support so-called Cold Boot and/or Warm Boot configuration options. That allows writing up to four addressable images/bitstreams to the flash memory, so that any of them can be loaded afterwards, without requiring any additional external communication. That is known as Dynamic Reconfiguration in the FPGA community.

There is another configuration mode: the one-time programmable NVCM (Non-Volatile Configuration Memory). That is, naturally, out of the scope of this project.

Introduction to cold/warm boot

To avoid mixing terms, image is used for referring to the bitstream corresponding to a single design, and pack relates to multiple images packed in a single bitstream.

Since most of the boards are based on the iCEstick, its design is the reference for the tests explained below. As shown in the iCEstick User Manual, the FPGA, the flash memory and the FTDI chip (which plays the role of the external processor) are all connected to the same SPI bus:

FTDI is always a master. Writes/reads to/from the FPGA or the Flash.
Flash memory is always a slave. It is read from the FTDI or the FPGA.
FPGA is master/slave depending on the configuration mode.
- In slave mode, it is written by the FTDI.
- In master mode, data is read from the Flash.

On top of that, the FTDI chip controls the programming reset signal of the FPGA.

TBW

Single bitstream

After exiting the Power-On Reset (POR) state or when CRESET_B returns High after being held Low, the iCE40 device samples the logical value on its SPI_SS_B pin.

(…)

If the SPI_SS_B pin is sampled as a logic ‘1’ (High), then …

If enabled to configure from NVCM, the device configures itself using the Nonvolatile Configuration Memory (NVCM).

If not enabled to configure from NVCM, then the device configures using the SPI Master Configuration Interface.

If the SPI_SS_B pin is sampled as a logic ‘0’ (Low), then the device waits to be configured from an external controller or from another device in SPI Master Configuration Mode using an SPI-like interface.

— Lattice
TN1248, pp 3-4

Therefore a single bitstream can be directly loaded to the FPGA with:

TBW

That is, in the iCEstick and similar boards, the FTDI resets the FPGA by asserting CRESET_B and lets it power up in slave SPI mode by keeping SPI_SS_B low. Then, the image is written to the SRAM directly. The flash memory ignores any command, because asserting the communication to the FPGA disables the memory’s chip select. This is explained in detail in [TN1248, pp 17-20].

However, with the option above, the FPGA will lose its functionality as soon as it is powered off. To avoid that, the following command can be used instead:

TBW

This time, the FTDI explicitly holds the FPGA in reset state and asserts the chip select signal of the flash memory. Then, the image is written to the flash memory. When the transfer is complete, the reset state is released and the FPGA is powered up in master SPI mode. Therefore, the image written just before is loaded from the flash memory. For more information check [TN1248, pp 10-13].

TODO: to pos 0? Is an applet added?

Coarse understanding of the bitstream format

Instead of thoroughly analyzing the details of the format, which is explained at bygone.clairexen.net/icestorm/format, a naive approach was followed. Four bitstreams generated with Yosys and nextpnr were analyzed:

Checking the size reveals that all of the images require the same number of bytes: 31.4 KB (32220 bytes), although 32KB are required on disk.
A hexadecimal dump of the images reveals that, as expected, the first eight bytes are the same:

$ hexdump -C img01_counter8.bin > img01.dump
$ hexdump -C img02_blink.bin > img02.dump
$ hexdump -C img03_led_on.bin > img03.dump
$ hexdump -C img04_pushbutton_and.bin > img04.dump

00000000  ff 00 00 ff 7e aa 99 7e  51 00 01 05 92 00 20 62
00000010  01 4b 72 00 90 82 00 00  11 00 01 01 00 00 00 00
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

Actually, in the example images at least the first 32 bytes are the same. However, that may change if more heterogeneous sets of designs are used.

As explained below, that is the applet, a table/index of addressable images.

Packing up to four images

In [TN1248, pp 14-15] the Cold Boot Configuration Option is explained. The procedure is roughly the same as the second one explained in the single bitstream example above, but up to four images can be written at the same time. The advantage is that this allows the user to later change from one image to another without requiring an external processor for transferring it.

To support such a feature, an applet is written to the first addresses of the flash. Then, when the cold boot option is enabled:

(…) the iCE40 FPGA boots normally from power-on or a master reset (CRESET_B = Low pulse), but monitors the value on two PIO pins that are borrowed during configuration (…). These pins, labeled PIO2/CBSEL0 and PIO2/CBSEL1, tell the FPGA which of the four possible SPI configurations to load into the device.

(…) If the applet is written, but the cold boot option is disabled:

(…) the FPGA configuration starts from the default location (image 0) defined in the Cold/Warm Boot applet.

— Lattice
TN1248, pp 3-4

Actually, five images can be addressed, since there is a fifth one identified as the power-on reset image.

Packing images is achieved with a tool named icemulti from the IceStorm toolchain.

Usage: icemulti [options] input-files

 -c
 coldboot mode, power on reset image is selected by CBSEL0/CBSEL1

 -p0, -p1, -p2, -p3
 select power on reset image when not using coldboot mode

 -a<n>, -A<n>
 align images at 2^<n> bytes. -A also aligns image 0.

 -o filename
 write output image to file instead of stdout

 -v
 verbose (repeat to increase verbosity)

For example, to program four images at a time, by setting the first one as the default and not enabling cold boot:

 $ icemulti -p0 -o pack_cp0.bin img01_counter8.bin img02_blink.bin img03_pushbutton_and.bin img04_led_on.bin
 $ iceprog pack_cp0.bin
 init..
 cdone: high
 reset..
 cdone: low
 flash ID: 0x20 0xBA 0x16 0x10 0x00 0x00 0x23 0x54 0x82 0x46 0x06 0x00 0x56 0x00 0x29 0x19 0x01 0x16 0xA4 0xB5
 file size: 130524
 erase 64kB sector at 0x000000..
 erase 64kB sector at 0x010000..
 programming..
 reading..
 VERIFY OK
 cdone: high
 Bye.

The same pack can be generated with the second image as the default option by changing -p0 to -p1. When programming any of these packs, the transfer will take longer than in the single image example, because four full images are being written. However, there will be no functional difference, since only the default image will be used by the FPGA. This is a good starting point for understanding how packs are generated.

The size of both packs is the same: 127 KB (130524 bytes), 128KB on disk. As done previously, a hexdump of one of the packs was generated. If we compare it with the hexdump of a single image, the starting point of each image is easily found. Indeed, looking for ff 7e aa 99 7e is enough. In the following block only the most meaningful parts are shown:

00000000  7e aa 99 7e 92 00 00 44  03 00 01 00 82 00 00 01
00000010  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00000020  7e aa 99 7e 92 00 00 44  03 00 01 00 82 00 00 01
00000030  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00000040  7e aa 99 7e 92 00 00 44  03 00 80 00 82 00 00 01
00000050  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00000060  7e aa 99 7e 92 00 00 44  03 01 00 00 82 00 00 01
00000070  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00000080  7e aa 99 7e 92 00 00 44  03 01 80 00 82 00 00 01
00000090  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
000000a0  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
*
00000100  ff 00 00 ff 7e aa 99 7e  51 00 01 05 92 00 20 62
00000110  01 4b 72 00 90 82 00 00  11 00 01 01 00 00 00 00
*
00008000  ff 00 00 ff 7e aa 99 7e  51 00 01 05 92 00 20 62
00008010  01 4b 72 00 90 82 00 00  11 00 01 01 00 00 00 00
*
00010000  ff 00 00 ff 7e aa 99 7e  51 00 01 05 92 00 20 62
00010010  01 4b 72 00 90 82 00 00  11 00 01 01 00 00 00 00
*
00018000  ff 00 00 ff 7e aa 99 7e  51 00 01 05 92 00 20 62
00018010  01 4b 72 00 90 82 00 00  11 00 01 01 00 00 00 00
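The search for image starting points described above can be sketched in a few lines of Python. This is only an illustration: the function name find_images is hypothetical, the eight-byte pattern is copied from the dumps shown here rather than taken from the format specification, and the buffer is a synthetic stand-in for a real pack (a real bitstream could contain the same byte sequence by coincidence).

```python
# Byte sequence that opens every image in the hexdumps above
# (copied from the dumps, not from the format specification).
IMAGE_PREAMBLE = bytes.fromhex("ff0000ff7eaa997e")


def find_images(pack: bytes) -> list[int]:
    """Return the offset of every occurrence of IMAGE_PREAMBLE in pack."""
    offsets = []
    pos = pack.find(IMAGE_PREAMBLE)
    while pos != -1:
        offsets.append(pos)
        pos = pack.find(IMAGE_PREAMBLE, pos + 1)
    return offsets


# Synthetic stand-in for a pack: 0xff filler with the preamble placed
# at the four offsets seen in the dump. Not a real bitstream.
pack = bytearray(b"\xff" * 0x18010)
for start in (0x100, 0x8000, 0x10000, 0x18000):
    pack[start:start + len(IMAGE_PREAMBLE)] = IMAGE_PREAMBLE

print([hex(offset) for offset in find_images(bytes(pack))])
# ['0x100', '0x8000', '0x10000', '0x18000']
```

On a real pack, the buffer would be read from the file produced by icemulti instead of being built by hand.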

Then, we can derive the following memory map:

Region  Start address  Size (bytes)
applet  0x00000000     256
img01   0x00000100     32512
img02   0x00008000     32768
img03   0x00010000     32768
img04   0x00018000     32768

These addresses match columns 10-12 in lines 00000020, 00000040, 00000060 and 00000080. Hence, those three bytes tell the warm/cold boot feature where to load the bitstream from. Furthermore, the address of img01 is also present in columns 10-12 at address 0x00000000. That is the power-on reset image.
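As a minimal sketch of that observation, the pointers can be read back from the applet with a few lines of Python. The layout used here (one 32-byte entry per pointer, address in bytes 10-12, big-endian) is inferred from the dump above only; the names read_pointers and make_entry are hypothetical, and the applet is rebuilt from the bytes shown in the dump instead of being read from a file.

```python
ENTRY_SIZE = 32      # one applet entry every 0x20 bytes, as in the dump
ADDRESS_OFFSET = 9   # zero-based offset of bytes 10-12 within an entry


def read_pointers(applet: bytes) -> list[int]:
    """Return the five 24-bit addresses stored in the applet.

    Index 0 is the power-on reset entry, indices 1 to 4 are the four
    addressable images.
    """
    pointers = []
    for entry in range(5):
        start = entry * ENTRY_SIZE + ADDRESS_OFFSET
        pointers.append(int.from_bytes(applet[start:start + 3], "big"))
    return pointers


def make_entry(address_hex: str) -> bytes:
    """Rebuild one 32-byte entry as printed in the dump of pack_cp0.bin."""
    head = bytes.fromhex("7eaa997e9200004403" + address_hex + "8200000108")
    return head + bytes(ENTRY_SIZE - len(head))


applet = b"".join(
    make_entry(addr)
    for addr in ("000100", "000100", "008000", "010000", "018000")
)

print([hex(p) for p in read_pointers(applet)])
# ['0x100', '0x100', '0x8000', '0x10000', '0x18000']
```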

Every image, except img01, is placed in a 32KB section, which makes sense if no compression is used at all. The space for img01 is smaller, because of the applet. However, since images require 32220 bytes, there are still 292 free bytes. Indeed, there are 3*548+292=1936 free bytes between images in the space 0x00000000-0x0001FFFF, which can be used for user applications, even if cold boot is active.

Moreover, the hexdump of the second pack is equal to the previous one, except for the power-on reset image, which is set to img02 instead of img01. Actually, they differ in a single byte:

00000000  7e aa 99 7e 92 00 00 44  03 00 01 00 82 00 00 01 | pack_cp0.bin
00000000  7e aa 99 7e 92 00 00 44  03 00 80 00 82 00 00 01 | pack_cp1.bin

Therefore, even though four vector addresses are mentioned in [TN1248, Fig. 11], a single one is used when the -c option is not passed to icemulti.
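Since the two packs differ only in that pointer, switching the default image amounts to rewriting three bytes of the applet. The following Python sketch illustrates the idea on the first 16 bytes of each pack, copied from the dump above. The function name set_boot_pointer is hypothetical, and the byte offsets are inferred from the dump, not from the icemulti sources.

```python
def set_boot_pointer(applet: bytes, address: int, entry: int = 0) -> bytes:
    """Return a copy of applet with the 24-bit address of one entry replaced.

    entry 0 is the power-on reset pointer. Each entry is assumed to be
    32 bytes long with the address at bytes 10-12, as seen in the dump.
    """
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("address does not fit in 24 bits")
    start = entry * 32 + 9
    patched = bytearray(applet)
    patched[start:start + 3] = address.to_bytes(3, "big")
    return bytes(patched)


# First line of each pack, copied from the comparison above.
cp0 = bytes.fromhex("7e aa 99 7e 92 00 00 44 03 00 01 00 82 00 00 01")
cp1 = bytes.fromhex("7e aa 99 7e 92 00 00 44 03 00 80 00 82 00 00 01")

print(set_boot_pointer(cp0, 0x008000) == cp1)
# True
```

This only patches a buffer in memory. On the actual flash, bits can generally not be set back to 1 without an erase, so the sector holding the applet would presumably have to be erased and rewritten.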

TODO: what’s the byte-difference between cold-boot active/inactive? See icemulti sources.
TODO: option "-c" to activate cold boot and select with CBSELx

Warm boot demo

The warm boot feature is functionally the same as the cold boot. The same external memory layout is used. The only difference is that warm boot is triggered from inside the FPGA. That is, a hard module/component named SB_WARMBOOT needs to be instantiated in each of the designs which should change to some other under certain conditions. It has two bits for selecting one of the four images, along with an additional bit for triggering the reboot/reload. That replaces the external pins used for cold-boot.

TBW

Beyond four images/bitstreams

After diving into the existing documentation, and having performed some experiments, the contributors to this project realized that the cold/warm boot feature can be extended far beyond the limit of four (in)directly addressable images/bitstreams. For instance, ~128 images can fit in the 4MB (32Mb) flash included in the iCEstick.

Hypothesis

The configuration defaults to reading pointers in fixed positions and directly jumping to them. Three bytes (24 bits) are reserved for each pointer, so 0xFFFFFF is the largest value they can take. As a result, up to floor((2^24-1)/2^15)=511 images can be addressed, if a memory of at least 16MB (128Mb) is used and 32KB are used for each image. The size of the flash memory in the iCEstick is 4MB (32Mb), so up to floor((2^22-1)/2^15)=127 images can be addressed. The expression for computing the address corresponding to an image in position x, where x = 0,…,$number_of_images-1, is x==0 ? 0x000100 : x*0x8000, which places img02 (x=1) at 0x008000 as in the memory map above.

If images are appended without free space between them, slightly larger packs can fit:

floor((2^24-1)/32220)=520
floor((2^22-1)/32220)=130
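The figures above can be reproduced with a short Python sketch. The names image_address and max_images are hypothetical. The addresses follow the memory map derived earlier (img01 right after the 256-byte applet, every later image at a multiple of 32KB), and the counts use the same floor expression as the text, which does not subtract the space taken by the applet.

```python
APPLET_SIZE = 0x100   # 256 bytes reserved at the start of the flash
SLOT_SIZE = 0x8000    # 32 KB per image, as in the memory map
IMAGE_SIZE = 32220    # bytes actually used by each example image


def image_address(x: int) -> int:
    """Start address of image x (0-based) when each image gets a 32 KB slot."""
    return APPLET_SIZE if x == 0 else x * SLOT_SIZE


def max_images(flash_bytes: int, bytes_per_image: int = SLOT_SIZE) -> int:
    """Number of images addressable below the last byte of the flash."""
    return (flash_bytes - 1) // bytes_per_image


print(max_images(2**24), max_images(2**22))
# 511 127
print(max_images(2**24, IMAGE_SIZE), max_images(2**22, IMAGE_SIZE))
# 520 130
print([hex(image_address(x)) for x in range(4)])
# ['0x100', '0x8000', '0x10000', '0x18000']
```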

Apparently, this extended memory map cannot be addressed through CBSEL. However, either the processor or a component in the FPGA can be used for updating just the pointers (applet), which allows changing between groups of four images in the extended pack.

Scope

The cold/warm boot features of Lattice’s iCE40 FPGAs allow mimicking high-performance SoC designs which include programmable logic, such as Xilinx’s Zynq or Intel/Altera’s Arria/Cyclone. The main orchestrator in those systems is expected to be a CPU (either a PC or a microcontroller), which is already true for most of the available open source boards. Furthermore, embedded CPUs can be synthesized. Actually, that’s the case of FOMU, which loads a RISC-V based design as the demo design. That gives a quite large list of devices to choose from. Although not exclusively, examples here are focused on the following:

USB-TTL adapter: FTDI, CH340, PL2303…
- USB-SPI
- USB-UART
External uC: AVR, ARM, FTDI, ESP32…
Embedded uC: VexRiscv (RISC-V), Lattuino (AVR)…

Depending on the design of the boards, multiple connection schemes might be possible in order to achieve the same functional result. See the specific documentation of each of the examples.

Software

The upstream icemulti allows packing up to four images for using the default cold/warm boot features. It also allows selecting the power-on reset image. However, it does not currently support features beyond the default usage.

@juanmard extended both icemulti and iceprog for allowing packing any number of images, up to the size of the target memory. See juanmard/icestorm. His changes also allow modifying entries in the header/applet for switching the addressable images efficiently. Moreover, he used some spare bytes at the beginning of each image for writing a string identifier of the bitstream. That allows listing the content of the memory (through iceprog) and getting a human readable output.

@sylefeb complemented @juanmard’s solution by writing a hot-swap HDL core that can manipulate the header/applet in the external memory, so that an external CPU or PC is not required for switching addresses/pointers. Furthermore, he cleverly implemented it by employing an unused region of the external memory as a scratchpad and modifying the header on the fly (while passing through a reduced footprint HDL). Chunks of 256 bytes are used. The actual HDL is a RISC-V soft core requiring ~2K LUTs. Yet, as he explains in sylefeb/Silice: draft/projects/ice40-dynboot, there is room for improvement there!

When using development boards with iCE40 devices which don’t use FTDI for programming, iceprog cannot be used. That is the case of e.g. FOMU, which uses dfu-util. However, @sylefeb found that dfu-util is happy to upload binary files larger than the default bitstream size. Hence, data can be concatenated and it is then available at address 262144 (warmboot slot) + 104106 (bitstream size) (see im-tomu/foboot: doc/FLASHLAYOUT.md).

Compression

From the potentially hundreds of images available in external memory, only five of them can be directly loaded by the FPGA. Therefore, all others can be stored in a compressed format. From a software point of view, there are many compression algorithms adapted for being executed on low power/performance embedded devices. It might be more challenging to achieve it with a pure HDL solution. Still, dictionary or block based compressions such as LZO might be efficient enough. Some quick experiments show that the size of each bitstream can be reduced to 3-5% (from 32KB to 1-1.6KB) by using lzop.
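A quick way to repeat that kind of measurement is sketched below in Python. Note the assumptions: lzop is not part of the Python standard library, so zlib (DEFLATE) is swapped in as a stand-in, and the buffer is an all-zero placeholder of the same size as the example images, so the ratio it prints says nothing about real bitstreams. The function name compression_ratio is hypothetical.

```python
import zlib


def compression_ratio(image: bytes, level: int = 9) -> float:
    """Compressed size divided by original size, using zlib as a stand-in."""
    return len(zlib.compress(image, level)) / len(image)


# Placeholder with the size of the example images. A real measurement
# would read the bytes of an actual bitstream file instead.
placeholder = bytes(32220)

print(f"{compression_ratio(placeholder):.2%}")
```

To measure an actual image, the placeholder would be replaced with the content of the file, for example the bytes read from img01_counter8.bin.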

Preserving BRAM data

It would be interesting to know whether BRAM data is necessarily overwritten when a new bitstream is loaded. If that is the case, it might be possible to use the pipeline approach from Sylvain for hot-replacing the content of the BRAMs when an image is changed. That would allow the implementation of complex algorithms on the same data. The advantage would be that freshly loaded images could start computation straight away after load. However, depending on the use case, it might be more efficient to have some custom save/load mechanism.

Nevertheless, nextpnr allows specifying absolute placement constraints in the HDL sources. See YosysHQ/nextpnr: master/docs/constraints.md. Hence, that is worth exploring before considering it a dead-end. Sylvain did some preliminary tests, without success: gitter.im/im-tomu/warmboot?at=60919f992cc8c84d850db0dd.

Tests

Write pack_cp0.bin to flash.
- Change default pointer only, through FTDI.
- Rearrange the pointers, through FTDI.
Pack more than four images and write the binary to flash.
- Set a fifth image as default (which is not referred by any of the four pointers).
- Rearrange the pointers, through FTDI.

TODO: CLI to rearrange pointers. measure and compare reconfiguration time.

Hardware

iCE40 FPGAs do have hard SPI modules, which can be instantiated for user applications (see iCE40™ LP/HX/LM Family Handbook, page 62). Hence, it might be possible to prototype a module/component in HDL for overwriting the pointers in the applet without requiring an external CPU. A look-up table and some FSM would be required, apart from enough BRAM for holding the minimal amount of data that needs to be read from the flash.

Cold/warm boot allows dynamic reconfiguration, but partial dynamic reconfiguration is not supported. Therefore, the warm boot module/controller needs to be implemented in each of the images which need to be capable of dynamically changing to another one. That would provide the illusion of partial reconfiguration with a stop-the-world approach.

It would also be possible to decompress some image and overwrite one of the existing addressable locations, instead of modifying the pointers. However, dealing with decompression algorithms in HDL might be non-trivial.

TBW
