It’s been over a year since I created the first version of my Ambilight clone, but I finally found the time to return to it and add some new features. I was originally planning on adding colour correction (to correct for the purple wall behind my TV) and output delay (to get perfect synchronisation with the TV), but I got a bit carried away and have added a whole raft of new stuff.
The specification now looks like this:
- Direct HDMI input at up to 1920×1080 60Hz
- Drive up to 8 strips of LEDs at full frame rate
- Up to 512 LEDs per strip for a total of 4096 LEDs
- Each LED can take its colour from any one of 256 arbitrary screen areas
- Each LED can use one of 16 colour correction matrices
- Each LED can use one of 8 sets of gamma correction tables (separate tables for each R, G, B channel)
- Output can be delayed up to 8 frames with microsecond steps for precise synchronisation with TV
- Configurable temporal smoothing
- Up to 64 configurations can be stored in flash memory
- Automatic letterbox/pillarbox detection which can trigger loading of different configurations from flash memory
- Configurable via serial console with a greatly extended command set
The rest of this post describes the implementation details, but with all the changes it’s got quite long, so alternatively you can go straight to the demo videos at the bottom of the page, or the code over on github.
Hardware
The hardware for this version is almost the same as for the previous one, consisting of my own HDMI receiver board and a Papilio FPGA development board.
The most notable change is a move from the Papilio One 250K to the Papilio Pro. I’ve had to make this move as the previous version wasn’t far from filling the Spartan-3E FPGA on the old board. That combined with the Spartan-3E’s lack of any DSP slices (dedicated multiplier / accumulator hardware) and the fact that I’d used almost all of the block RAMs meant I had very little chance of being able to squeeze in the colour correction, let alone any of the other new features.
The Papilio Pro has a Spartan-6 LX9 FPGA, which among other things increases the number of logic cells from 5,500 to 9,100, increases the number of block RAMs from 12 (24KB) to 32 (64KB), and adds 16 DSP48 slices.
The HDMI receiver board remains essentially the same. I still haven’t fixed the output level shifter or the power jack bugs from the previous version; they don’t stop anything working, so they’re low priority. The Papilio Pro board also uses a switching regulator that can deliver more power than the linear regulator used on the Papilio One, so the power jack on the receiver board is even less necessary than it was on the previous version.
The move to the Papilio Pro has also fixed the clock net bug. By luck the move means that the LLC clock output of the ADV7611 HDMI Receiver IC is now connected to an FPGA pin that’s connected to a clock net, which is nice!
The only actual change to the receiver board is a change to the resistor values for the DDC and CEC pull-ups. The ones I was using previously were 1,000 times too strong and prevented the EDID information from being retrieved by the HDMI source.
FPGA Design Changes
It’s inside the FPGA that most of the changes can be found. The diagram above gives an overview of the components that make up the ambilight and the interconnections between them. If you compare it to the previous version, you’ll see that the components in the top left (hscale4, scaler, light averager, line buffer and result ram) are essentially unchanged. The rest, however, is all new.
The following sections give more detail about the implementation of the new features.
Colour Correction
First up is colour correction, which is one of the main reasons that I’ve returned to this project. The walls to the side of my TV are an off-white colour, which allows the LEDs to light them up with fairly accurate colours, but the wall above my TV is dark purple, and that really messes with the perceived colours.
In theory, correcting the colour should be fairly straightforward. The purple paint is absorbing more of some of the colour components of the light that’s shone on it than others, stopping those components from being reflected and seen. If the colour sent to the LEDs either boosts these components by an amount proportional to the extra absorption, or attenuates the other components by the opposite amount, then the perceived colour should be corrected.
Of course my dark purple wall is a fairly extreme case and no matter what I do I’m never going to make it look bright white, but as the images above show, I can at least make white lights appear to be more grey than pink. In case it’s not obvious, in each of the images the LEDs on the left are colour corrected and the LEDs on the right are not. There’s a piece of cardboard stopping the light from each side interfering with the other.
This correction could be done by simply multiplying each of the colour channels with a constant correction factor to scale the output of each channel. However, by expanding from a set of three constants to a matrix of 16, it’s possible to go from simple scaling to a whole world of possible transformations.
An RGB colour can be considered to be a point in a 3D space, where instead of having the X, Y and Z axes representing physical space, we instead have R, G and B. By doing this, any transformation that could be applied to a 3D vector can also be applied to a colour, simply by multiplying the vector that represents the colour by a transformation matrix.
Possible transformations include:
- Brightness: implemented using a scaling transformation
- Contrast: implemented using a combination of scaling and translation
- Hue: implemented by applying several rotations and a shear
- Saturation: implemented using a shear
It’s also possible for a single matrix to do all of these at once: multiplying together the matrices for each individual transformation results in a single matrix that performs all of the transformations.
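As a rough illustration of combining transformations, the sketch below multiplies a brightness matrix and a saturation matrix into a single 4×4 matrix and applies it to a colour. The helper names, the grey-level weights used for saturation and the use of floating point are all assumptions made for this example; they are not taken from the project’s source. The gains (0.7, 1.1, 0.6) and the saturation of 0.7 are the settings quoted later in this post.

```python
# Illustrative only: helper names and the grey-level weights are assumptions,
# not taken from the project's source.

def identity():
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

def brightness(r, g, b):
    """Scaling transformation: one gain per channel."""
    m = identity()
    m[0][0], m[1][1], m[2][2] = r, g, b
    return m

def saturation(s, weights=(0.299, 0.587, 0.114)):
    """Blend each channel with a weighted grey level (s=1 leaves colours alone)."""
    m = identity()
    for row in range(3):
        for col in range(3):
            m[row][col] = (1.0 - s) * weights[col] + (s if row == col else 0.0)
    return m

def matmul(a, b):
    """4x4 matrix product; applying the result equals applying b, then a."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def apply(m, rgb):
    """Multiply the matrix by the column vector (R, G, B, 1) and clamp to 0-255."""
    v = list(rgb) + [1.0]
    out = [sum(m[r][k] * v[k] for k in range(4)) for r in range(3)]
    return tuple(max(0, min(255, round(x))) for x in out)

# One matrix that reduces saturation first and then applies per-channel gains.
combined = matmul(brightness(0.7, 1.1, 0.6), saturation(0.7))
```

The fourth row and column are unused by these two transformations, but they are what lets a translation (as needed for contrast) be folded into the same matrix.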
To implement this I’ve added a new colourTransformer component, which is used by the resultDistributor to modify the colours before they’re sent to the outputs. The transformation matrices are stored in a single block RAM and the resultDistributor uses the output mapping table to set the upper address bits of the RAM to select the appropriate matrix for each LED.
The colour transformation is performed over six clock cycles using three multipliers and three accumulators, plus a counter and a few multiplexers to control the inputs to each stage of the calculation. The six cycles perform the following steps:
- The first cycle just selects the first row of the matrix; nothing else can be done until the coefficients are available on the output of the RAM.
- The second cycle multiplies the R, G and B values by the first set of coefficients and requests the second row from the RAM.
- The third cycle multiplies the R, G and B values by the second set of coefficients and adds the result to the previous result.
- The fourth cycle does the same with the third set of coefficients.
- The fifth cycle does the same with the fourth and final set of coefficients.
- The sixth cycle clamps the accumulated result to the range 0-255 and signals completion.
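The multiply-accumulate sequence above can be modelled in software roughly as follows. This is only one plausible arrangement: the coefficient precision (`FRAC_BITS`), the idea that each set holds the three coefficients one input term contributes to the R, G and B accumulators, and the constant 1 used as the fourth input are all assumptions for illustration, not details confirmed by the design.

```python
# Software model of a four-step multiply-accumulate followed by a clamp.
# FRAC_BITS and the layout of the coefficient sets are hypothetical.

FRAC_BITS = 8  # assumed number of fractional bits in each coefficient

def to_fixed(x):
    """Convert a floating point coefficient to the assumed fixed point format."""
    return round(x * (1 << FRAC_BITS))

def transform(rgb, coeff_sets):
    """coeff_sets: four sets of three fixed point coefficients."""
    inputs = list(rgb) + [1]        # the constant 1 feeds the translation set
    acc = [0, 0, 0]                 # one accumulator per output channel
    for value, coeffs in zip(inputs, coeff_sets):   # cycles two to five
        for ch in range(3):         # three multipliers working in parallel
            acc[ch] += value * coeffs[ch]
    # Cycle six: drop the fractional bits and clamp to 0-255.
    return tuple(max(0, min(255, a >> FRAC_BITS)) for a in acc)

half = to_fixed(0.5)
half_brightness = [[half, 0, 0], [0, half, 0], [0, 0, half], [0, 0, 0]]
```

With these definitions, `transform((200, 100, 50), half_brightness)` halves each channel, and any accumulated value outside 0-255 is clamped rather than wrapped.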
Gamma Correction
While the colour transformation matrices can modify the output in many different ways, they can’t do non-linear transformations, and there’s one particularly useful non-linear transformation: gamma correction.
The perceived brightness of LEDs is not linear with the RGB values that are sent to them. They tend to get bright very quickly through the lower values, but then there’s very little noticeable change at the higher end. For example the difference between 0 and 15 is quite noticeable, but the difference between 240 and 255 is practically invisible.
Gamma correction fixes this by defining a curve that is used to modify the output so that a linear change in the value calculated for an LED also appears to be a linear change to our perception.
I’ve implemented gamma correction using lookup tables, which made it the simplest of the new features to implement. There are three 2KB block RAMs, one for each of the R, G and B channels. Each of these block RAMs contains 8 separate tables, each of which contains 256 8-bit values.
To perform the gamma correction the incoming R, G and B values are used to set the lower 8 address bits on each of the block RAMs (the upper address bits are used to select which of the 8 tables is used). Then on the next rising edge of the clock the resulting R, G and B values are available on the data out of the RAMs.
The tables can be populated using the equation:
table[i] = ((i / 255) ** gamma) * 255
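A minimal sketch of generating one such table from the equation above. The function name is made up for this example, and rounding to the nearest integer is an assumption, since the equation does not say how the result is converted to an 8-bit value.

```python
def gamma_table(gamma):
    """Build a 256-entry lookup table from the equation above."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return [round(((i / 255) ** gamma) * 255) for i in range(256)]

# 1.6 is the gamma setting used for the demo videos in this post.
table = gamma_table(1.6)
```

The same table could be loaded for all three channels, or a different gamma could be used for each of R, G and B.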
Output Delay
In the previous version, and by default in this version too, the LEDs start to get their new colour data almost immediately after the incoming frame of video has ended. However, many modern TVs do a lot of processing on the incoming video stream, which can delay the picture by several frames, so sending the data to the LEDs immediately results in the LEDs not being synchronised with what’s on the screen.
To allow for this the output can now be delayed by up to 8 frames, and there is further fine tuning that can adjust the output delay in steps of 1 microsecond.
The two images above show the output from a pair of photo-transistors, the yellow trace is the LEDs and the blue trace is the TV. The first image shows how the LEDs are switching on too early when there is no delay and the second image shows that the synchronisation is almost perfect with a delay of two frames (video during tests is 50Hz).
The delay is implemented by a new resultDelay component that is inserted between the lightAverager and the resultDistributor. This component can delay the trigger signal that tells the resultDistributor to start sending data out to the LEDs, and it can also present different results to the resultDistributor than those from the current frame.
The whole-frame delay is achieved by remembering the last 8 sets of results from the screen area averaging. Each set of results from the averaging consists of 256 24-bit colours, so a 32-bit wide by 2048 row block RAM is used (consuming four 2KB block RAM primitives). This RAM is used as a ring buffer: the first set of results goes into the first 256 rows, the second set of results into the second 256 rows, and so on, wrapping around to the beginning again after the eighth.
With the last eight sets of results available in the ring buffer it’s then simply a matter of using the frame delay count to offset the read location, so that the resultDistributor is reading results from the RAM at write_pointer minus frame_delay.
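The read address calculation can be sketched as below. The names are hypothetical, and the sketch assumes that `write_slot` is the slot holding the most recent set of results and that `frame_delay` is in the range 0 to 7.

```python
SLOTS = 8      # sets of results kept in the ring buffer
AREAS = 256    # results (screen areas) per set

def read_address(write_slot, frame_delay, area):
    """Row in the delay RAM holding `area` from `frame_delay` frames ago."""
    slot = (write_slot - frame_delay) % SLOTS   # wraps like a 3-bit subtraction
    return slot * AREAS + area
```

For example, with the newest results in slot 0 and a delay of two frames, reads come from slot 6, so area 5 is found at row 6 × 256 + 5.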
One important detail is that when the frame delay is set to zero, the signal that starts the resultDistributor can’t be set until the new results have been copied into the delay RAM.
The microsecond fine tuning is implemented with a simple counter. When the copying of the latest results into the delay RAM has been completed the counter starts counting down, and when it reaches zero the start signal is sent to the resultDistributor.
It turns out that there’s really no benefit to such a fine level of adjustment, but now that I’ve done it there’s not much point ripping it out again. As long as the output is synchronised within less than one frame then it appears as good as perfect, which is lucky really if you consider how long it can take to update the LEDs. If one of the strips has a full complement of 512 LEDs then it takes over half a frame’s worth of time to clock out the serial data to the last LED.
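A back-of-the-envelope check of that last figure, assuming the WS2811 is driven in its 800 kbit/s mode (1.25 µs per bit). The bit rate is an assumption of this sketch; it isn’t stated in this post.

```python
BIT_TIME_US = 1.25    # assumed: WS2811 running at 800 kbit/s
BITS_PER_LED = 24

def strip_time_ms(leds):
    """Time to clock out the serial data for `leds` LEDs, in milliseconds."""
    return leds * BITS_PER_LED * BIT_TIME_US / 1000.0

def fraction_of_frame(leds, frame_rate_hz):
    """Strip update time expressed as a fraction of one video frame."""
    return strip_time_ms(leds) / (1000.0 / frame_rate_hz)
```

Under that assumption a 512 LED strip takes about 15.4 ms to update, which is roughly three quarters of a 20 ms frame at 50Hz.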
Temporal Smoothing
After I’d implemented the output delay I realised that I could use the same memory that’s used for the delay to implement temporal smoothing.
This smoothing applies a sort of rolling average to the calculated colour of each screen area. It implements the following equations:
R = (Rprevious * X) + (Rcurrent * (1 - X))
G = (Gprevious * X) + (Gcurrent * (1 - X))
B = (Bprevious * X) + (Bcurrent * (1 - X))
Where X is the desired level of smoothing, between 0.000 and 1.000. This means that a value of 0.000 gives no smoothing, with the output being 100% from the current frame. A value of 0.500 has the output set from 50% of the previous colour and 50% of the current.
The calculation is performed with three multipliers and three accumulators (one for each colour channel) over two clock cycles. During the first cycle the address is set on the delay RAM to look up the previous result and the current result’s R, G and B values are multiplied by (1 - X). During the second cycle the previous result is available from the delay RAM and its R, G and B values are multiplied by X and added to the results of the first cycle.
The calculations are performed using 9.9 unsigned fixed point numbers, which means that the smoothing can be configured in steps of 0.002.
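The two-cycle calculation can be modelled with integer arithmetic along these lines. The function name is made up, and truncating (rather than rounding) the fractional bits is an assumption of this sketch.

```python
FRAC_BITS = 9              # fractional bits in the 9.9 format
ONE = 1 << FRAC_BITS       # 1.000 in fixed point (512)

def smooth(previous, current, x_fixed):
    """Blend two RGB tuples; x_fixed is the smoothing level X scaled by 512."""
    out = []
    for prev, cur in zip(previous, current):
        acc = cur * (ONE - x_fixed)     # first cycle: current * (1 - X)
        acc += prev * x_fixed           # second cycle: add previous * X
        out.append(acc >> FRAC_BITS)    # drop the fractional bits (assumed)
    return tuple(out)
```

With 9 fractional bits the smallest step in X is 1/512, which is where the step size of roughly 0.002 comes from.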
To access the delay RAM the temporal smoothing is borrowing the read port that’s normally used by the resultDistributor, but that’s OK as the resultDistributor can’t get the start signal until it’s done.
Configuration In Flash
The previous version was configured either from a table compiled into the firmware or via commands given to the serial interface. The former required a rebuild of the firmware and a reflash of the FPGA to change, and the latter was incredibly tedious.
In this version it’s possible to use the spare capacity in the flash memory on the Papilio board to hold 64 separate configurations and to switch configuration with a single serial command.
Initial population of these configurations in the flash memory happens at build time. A set of text config files (which reside in the config directory) are parsed and rendered into a binary form and appended to the FPGA bit file. The bit file can then be written to flash as normal.
Transferring data between flash and the CPU’s RAM and between flash and the ambilight configuration RAMs and registers is handled by a new flashDmaController component. The transfers are initiated by the CPU by setting a flash address, a RAM address, a length and a direction. The CPU is then halted while the flashDmaController copies the data, after which the CPU continues.
Halting the CPU is kind of cheating, but it’s a lot easier than double clocking the RAM and providing simultaneous access for both the CPU and DMA controller.
Format Detection
One of the most common configurations is to take the colour for the LEDs from a relatively narrow band around the outside of the screen. This leads to a problem when the picture on the screen becomes letter-boxed or pillar-boxed, for example when showing a 2.40:1 film or some old 1.33:1 TV content on a 16:9 screen. When this happens the LEDs at the top or the sides either go out or become very dim.
Now that it’s possible to store multiple configurations in flash memory, it’s possible to have one configuration for full screen content, one for letter-boxed content and one for pillar-boxed content. All that’s needed is a way to switch between them.
I’ve added a new formatDetector component that examines the incoming video data to find the active area. Like the scaler it takes its input from the hscale4 component rather than the raw incoming video so that it’s got four clock cycles available per pixel.
It’s essentially just a collection of counters:
- A counter that measures the number of pixels in a line.
- A counter that measures the number of lines.
- A counter that measures the number of lines at the top of the screen before the first line that has an average brightness above a threshold (height of top black bar).
- A counter that measures the number of lines at the bottom of the screen after the last line that has an average brightness above a threshold (height of bottom black bar).
- A counter that measures the minimum width from the start of a line before the brightness goes above a threshold (width of left black bar).
- A counter that measures the minimum width after the last pixel on a line that has a brightness above a threshold (width of right black bar).
At the end of each frame, if any of the counters has a different value when compared to the last frame then a signal is raised which causes the CPU running the firmware to receive an interrupt.
The remainder of the format detection is then done in the firmware. Given the size of the screen and the size of the black bars around the active area of the screen, the firmware works as follows:
- If the active area has invaded the black bars of the current format or the resolution has changed then reset to the default 16:9 ratio.
- If the active area of the picture fills the screen horizontally and the active area is centered vertically and it approximately matches a 2.40:1 ratio then switch to 2.40:1 format.
- If the active area of the picture fills the screen vertically and the active area is centered horizontally and it approximately matches a 1.33:1 ratio then switch to 1.33:1 format.
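The two switching rules can be sketched as a simple classifier. Everything here is illustrative: the function name, the tolerance values and the assumption of square pixels with all measurements in the same pixel units are not taken from the firmware, and the first rule (resetting when the picture invades the current bars) is left out because it depends on the previously selected format.

```python
def classify(width, height, left, right, top, bottom, tolerance=0.05):
    """Return '2.40:1', '1.33:1' or '16:9' for the measured active area."""
    active_w = width - left - right
    active_h = height - top - bottom
    if active_w <= 0 or active_h <= 0:
        return "16:9"                   # nothing usable, e.g. a black frame
    ratio = active_w / active_h
    fills_h = left == 0 and right == 0
    fills_v = top == 0 and bottom == 0
    centred_v = abs(top - bottom) <= 2  # allow a couple of lines of slack
    centred_h = abs(left - right) <= 2
    if fills_h and centred_v and abs(ratio - 2.40) <= 2.40 * tolerance:
        return "2.40:1"
    if fills_v and centred_h and abs(ratio - 4 / 3) <= (4 / 3) * tolerance:
        return "1.33:1"
    return "16:9"
```

For a 1920×1080 input, black bars of 140 lines at the top and bottom leave a 1920×800 active area, which this sketch classifies as 2.40:1.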
There is a format table stored in flash memory and whenever the format changes the firmware walks the table until it finds an entry that matches the resolution and aspect ratio of the active area. If it finds a matching entry then it loads the configuration specified by that entry from flash.
The final change within the FPGA is a completely rewritten resultDistributor component, which is needed to tie together all of the new features.
The previous version was configured by a single table with 256 entries. That table contained one entry for each of the 256 LEDs that could be driven, defining the rectangular area of the screen that should be averaged to get the LED’s colour and the index of the output that the result should be sent to.
The output index was relayed through together with the colour resulting from averaging the screen area, so that the result distributor component could iterate through the 256 results and control a demultiplexer to direct each result to the appropriate output.
The new version decouples the LEDs from the screen areas, which allows for there to be more LEDs than screen areas, and also allows the LEDs to have individually configurable colour correction.
The decoupling is achieved by having a mapping table associated with each of the 8 outputs. Each of these tables contains 512 entries, and each entry defines the screen area to take the LED’s colour from, which colour correction matrix to use and which set of gamma correction tables to use.
The new result distributor component iterates through each of the 512 output mapping entries from each of the 8 output tables, interleaving the tables so that it first does row 0 from table 0, then row 0 from table 1, etc. It uses each entry in the output mapping tables to set the addresses for the result RAM, colour correction matrix RAM and gamma correction table RAM. It then takes the colour from the result RAM and passes it to the colour transformer component and waits for the result to be available. It then uses the transformed colour to address the gamma correction table before finally taking the output of the gamma correction table and passing it to the WS2811 driver for the relevant output.
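The interleaved iteration can be modelled in software roughly like this. The data layout (tuples of area, matrix and gamma indices) and the `transform` callback are hypothetical stand-ins for the mapping table entries and the colour transformer component.

```python
OUTPUTS = 8
LEDS_PER_OUTPUT = 512

def distribute(mapping, results, transform, gamma_tables):
    """Yield (output, led, colour) in the interleaved order described above.

    mapping[output][led] -> (area, matrix, gamma) indices (assumed layout)
    results[area]        -> averaged (R, G, B) for that screen area
    transform(rgb, m)    -> colour corrected (R, G, B) using matrix index m
    gamma_tables[gamma]  -> three 256-entry tables, one each for R, G and B
    """
    for led in range(LEDS_PER_OUTPUT):
        for output in range(OUTPUTS):       # row `led` of table 0, then table 1...
            area, matrix, gamma = mapping[output][led]
            r, g, b = transform(results[area], matrix)
            table_r, table_g, table_b = gamma_tables[gamma]
            yield output, led, (table_r[r], table_g[g], table_b[b])
```

Because the mapping entry carries only indices, two LEDs on different outputs can share a screen area while using different colour correction and gamma settings.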
There are now 8 separate WS2811 drivers, and unlike the previous version they’re all driven in parallel. Clocking out the 24 bits of serial data for each LED takes a long time, so the result distributor starts the driver for output 0 and then uses the time it takes to clock out the data to perform the lookups and transformations for output 1, then output 2, etc. Even by the time it’s started the last output driver, the first one still hasn’t finished outputting the first LED’s data, so it does the lookups for the second LED on channel 0 and then waits for the driver to be idle, which allows it to start clocking out the next LED’s data immediately after the first one finishes, with no gaps.
Firmware Changes
Just like the previous version, the firmware runs on an AVR compatible CPU that’s implemented within the FPGA, and its primary purpose is to provide a serial interface that can be used to control the configuration.
Supporting all of the new features has required an almost complete rewrite and it’s also meant that I’ve had to quadruple the size of the memory used to store the code, from the 4KB used in the previous version to 16KB in this version.
I’ve also changed the way that the firmware accesses the configuration RAMs and registers. In the previous version the access was via a set of 8 bit ports, and every byte written to the configuration required writing to one port to set the high bits of the config address, writing to a second port to set the low bits of the address, and then finally writing the data to a third port. In this new version the configuration RAMs and registers are mapped directly into the upper 32KB of the AVR’s memory map, so writing a byte to the configuration is as simple as writing to RAM. As you can probably imagine, this greatly simplifies a lot of the code.
I won’t cover the code here; it can be found in the firmware directory in the github repository and should be fairly self-explanatory.
The command set is also documented in the repository in the file docs/serial-commands.md
The set up for the following videos has 20 LEDs between each of the three shelves on the left (at the back near the wall), another 20 between each of the three shelves on the right, 39 LEDs on each side of the TV illuminating the front of the shelves, and 2 rows of 66 LEDs on top of the TV for a grand total of 330 LEDs. They draw almost 10 Amps for full white and light up the whole room.
All LEDs are using a gamma setting of 1.6, they all have colour saturation reduced to 0.7, and the LEDs on the top of the TV are combating the purple wall with brightness set to 0.7 for red, 1.1 for green and 0.6 for blue.
It’s been incredibly difficult getting a camera to produce a representative image. The colours in these videos aren’t very accurate and the bright bands to the sides of the TV aren’t anywhere near as visible in real life; instead there’s a much smoother gradient.
Everything needed to recreate this can be found in the github repository: