# Advantages Of LPDDR5: A New Clocking Scheme

Innovative new clocking schemes in the latest LPDDR standard enable easier implementation of controllers and PHYs at maximum data rate as well as new options for power consumption.

Earlier this year, JEDEC released the new standard, JESD209-5, Low Power Double Data Rate 5 (LPDDR5). Those who contributed to the development of the standard come from diverse technology backgrounds and represent both manufacturers and consumers of SDRAM memories. The result is a new memory standard to help enable a future that requires more compute power, higher reliability, and lower power.

This first in a series of articles highlighting the new LPDDR5 standard compares the signal clocking architecture of LPDDR5 with that of its predecessor (JESD209-4, LPDDR4).

The LPDDR5 standard offers several feature enhancements compared with the existing LPDDR4/4X standard, including support for larger densities, higher speed operation, a flexible bank architecture, enhanced Reliability, Availability, Serviceability (RAS) capabilities, new low-power features as well as a new clocking architecture. LPDDR5 memories will soon be found in applications such as smartphones, automotive, artificial intelligence (AI), embedded applications, SSDs, and various consumer applications.

**High-speed external clocking**

One of the key aspects of LPDDR5 is the introduction of a new clocking scheme. In all previous generations of LPDDR (and DDR for that matter), a single clock from the host to the device essentially synchronized the interface between the host and device. This clock signal (CK) was used to set the transfer rate of the command and address (CA) signals passing from the host to device. In addition, it fixed the rate at which data (DQ) and the data strobes (DQS) were transferred between the host and device (writes) or the device and host (reads). See Figure 1.

**Figure 1: Synchronous CK and bidirectional DQS in pre-LPDDR5 (LP)DDR system**

When considering LPDDR4, both the clock signal and the data strobes operate at a maximum rate of 2133 MHz. In LPDDR4, the CA bus is a single data rate (SDR) bus, meaning with every clock cycle one packet of information is transferred from the host to the device. Since the LPDDR4 CA bus is SDR, the maximum effective rate of information transfer on the CA interface is 2133 Mbps. In LPDDR4, the data bus is, as the name implies, double data rate (DDR). Since the data bus is DDR, with every clock two packets of information are transferred, making the maximum effective rate on the data bus 4266 Mbps. See Figure 2.

**Figure 2: Waveform showing SDR CA bus and DDR DQ bus as specified for LPDDR4-4266 (Only one of two differential signals shown for CK and DQS)**

It should be noted, in LPDDR4 the data strobes are implemented as a differential pair and are bi-directional. The LPDDR5 standard evolved to implement two different pairs of differential signals – both effectively unidirectional signals with one going from host to device and one going from device to host. The signal going from host to device is called the write clock (WCK) and the signal going from device to host is called the read data strobe (RDQS).

This change in clocking between the host and device is indicative of a change in the fundamental way the device itself works. An LPDDR5 device relies on WCK to not only capture the write data from the host, but it uses WCK to generate RDQS and push out DQ on reads from the device. This change brings about both opportunities and challenges. See Figure 3.

**Figure 3: CK, WCK and RDQS in an LPDDR5 system (there are some special cases where RDQS is bidirectional)**

The new clocking architecture allows the traditional clock signal from the host to the device to be decoupled from the data strobe signals. In fact, while the new maximum rate of WCK and RDQS in LPDDR5 is 3200 MHz to enable a data transfer rate of up to 6400 Mbps, the fastest rate the CK will run from the host to the device is only 800 MHz (even when the data channels are operating at 6400 Mbps).

Decoupling the clock signal from the strobes, and thus allowing the clock signal to run significantly slower than the data strobes, allows the CA bus to evolve from an SDR bus in LPDDR4 to a DDR bus in LPDDR5. Even though the CA bus has been changed from SDR to DDR, since the CA clock has been capped at a maximum rate of 800 MHz the maximum transfer rate of information on the CA bus is now 1600 Mbps. While LPDDR4-4266 requires a CA transfer rate of 2133 Mbps, LPDDR5-6400 only requires a CA transfer rate of 1600 Mbps, as seen in Figure 4.
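The arithmetic behind these figures can be checked directly. The following is a minimal sketch using a hypothetical helper, `transfer_rate_mbps`, which is not part of any standard or tool; it simply multiplies a clock frequency by the number of transfers per clock cycle, using the frequencies quoted above.

```python
SDR = 1  # one transfer per clock cycle
DDR = 2  # two transfers per clock cycle


def transfer_rate_mbps(clock_mhz, transfers_per_clock):
    """Per-pin transfer rate in Mbps for a bus clocked at clock_mhz."""
    return clock_mhz * transfers_per_clock


# LPDDR4-4266: CK and DQS both run at 2133 MHz.
lpddr4_ca = transfer_rate_mbps(2133, SDR)  # CA bus is SDR
lpddr4_dq = transfer_rate_mbps(2133, DDR)  # DQ bus is DDR

# LPDDR5-6400: CK is capped at 800 MHz, WCK and RDQS run at 3200 MHz.
lpddr5_ca = transfer_rate_mbps(800, DDR)   # CA bus is now DDR
lpddr5_dq = transfer_rate_mbps(3200, DDR)  # DQ bus is DDR

print("LPDDR4-4266: CA", lpddr4_ca, "Mbps, DQ", lpddr4_dq, "Mbps")
print("LPDDR5-6400: CA", lpddr5_ca, "Mbps, DQ", lpddr5_dq, "Mbps")
```

Although the LPDDR5 CA bus transfers on both clock edges, its rate is lower than the LPDDR4 CA rate because CK itself runs much slower.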

**Figure 4: Waveform showing DDR CA bus and DDR DQ bus as specified for LPDDR5-6400 (Only one of two differential signals shown for CK, WCK and RDQS)**

Decoupling the CK and WCK is challenging because the LPDDR5 SDRAM requires internal synchronization of these signals in order to process any data transfer to or from the device. The synchronization of CK to WCK takes several CK cycles, meaning there is a real penalty involved when performing the synchronization operation, so it will be advantageous to avoid this whenever possible. Additionally, there is a specific sequence for how the WCK must behave for synchronization to occur, starting with static assertions for at least one CK, followed by one CK of half rate activity, followed by a variable number of CKs of full rate activity based on the operating frequency. An example of the synchronization procedure is outlined in Figure 5.

**Figure 5: Simple illustration of clock and WCK synchronization (Only one of two differential signals shown on CK)**

There are two options regarding synchronization of CK and WCK. The easy option is simply to synchronize the signals once and then keep WCK running constantly to maintain synchronization (this is known as free running mode). While this option requires little ingenuity, it does come at the expense of system power. Given the most prolific use of LPDDR5 devices will be in the mobile market, the desire to save power will be strong, which means the system must turn off WCK whenever it isn’t absolutely required. Turning off WCK requires a resynchronization of WCK to CK before any data transfer can occur. In order to manage this efficiently the LPDDR5 memory controller will need to be very clever in how it schedules commands, so the synchronization operation does not add unnecessary latency.

**High speed internal clocking**

The decision to decouple the CA clock and the data strobes affects not only the interface between the host and the device – it also affects the interface of the LPDDR5 controller and LPDDR5 PHY inside the host.

Inside a typical host, a controller and a PHY communicate with the external memory. The interface between the controller and PHY is commonly implemented with a specification known as the DDR PHY Interface (DFI). The DFI specification allows SoC designers to separate the design of the (LP)DDR controller, which typically converts system commands into (LP)DDR commands, and the (LP)DDR PHY, which typically converts the digital domain on the SoC to the analog domain of the host to device interface. Having a defined interface between the (LP)DDR controller and (LP)DDR PHY provides SoC designers a large amount of flexibility when selecting the (LP)DDR controller and (LP)DDR PHY solution.

If we examine an LPDDR4-4266 solution from an internal LPDDR4 controller and LPDDR4 PHY perspective, it is notable that while the PHY will typically run at the same speed as the memory, or a maximum of 2133 MHz, the interface between the LPDDR4 controller and PHY (e.g., the DFI interface) will typically run at half that speed, or 1066 MHz. This is commonly referred to as a DFI 1:2 frequency ratio solution since a single LPDDR controller clock covers two memory clocks. This approach is used to achieve a reasonable maximum clock frequency to close timing within the ASIC design flow for the digital logic of the controller.

The internal LPDDR5 controller and LPDDR5 PHY have a different clocking relationship when used in an LPDDR5-6400 solution. The data interface between the host and device is running at a maximum rate of 3200 MHz. Mimicking the LPDDR4-4266 internal DFI 1:2 frequency ratio would mean that the interface between the LPDDR5 controller and LPDDR5 PHY would be running at 1600 MHz, which is not a reasonable expectation for an LPDDR5 controller of any significant complexity. Instead, it is ideal to transition from a DFI 1:2 frequency ratio to a DFI 1:4 frequency ratio which allows for four clocks on the memory for every single LPDDR5 controller clock. This will allow the interface between the LPDDR5 controller and LPDDR5 PHY to run at 800 MHz, even while the LPDDR5 PHY runs the data interface to the memory at 3200 MHz.

However, remember that the CK driving the CA interface between the host and device runs at a maximum of 800 MHz, which should not be stepped down to 200 MHz at the DFI simply because the data transfer rate requires a DFI 1:4 frequency ratio. The LPDDR5 PHY must already manage multiple clock rates to interface to the memory, so it is ideal to contain the clocking complexity within the LPDDR5 PHY. By doing this one maintains a DFI 1:1 frequency ratio for the LPDDR5 commands while moving to a DFI 1:4 frequency ratio for LPDDR5 data and keeping the LPDDR5 controller and the entire DFI running at 800 MHz. This new mode of LPDDR5 controller and LPDDR5 PHY interoperation is known as a DFI 1:1:4 frequency ratio – DFI 1:1 for commands and DFI 1:4 for data. See Figure 6.

**Figure 6: Illustration of clock domains for an LPDDR5-6400 Solution using DFI 1:1:4 frequency ratio**

**Lower speed clocking options**

The above sections discuss the external and internal clocking when running at the maximum data rate, 6400Mbps, as defined by the new LPDDR5 standard. However, there are use cases when it is advantageous to run the interface slower, for example to conserve power when maximum bandwidth to the memory is not required. In such use cases, the LPDDR5 standard offers options to maximize lower speed performance while minimizing power consumption.

The first option is the ability for the CA clock rate to adjust when lowering the data strobe and data transfer rates. Once the data transfer rate drops to 3200 Mbps or slower, it is possible to change the CK to WCK ratio from 1:4 to 1:2, allowing the user to keep the CA transfer rate at 1600 Mbps while the data transfer rate is slowed to 3200 Mbps. See Figure 7.

**Figure 7: Waveform showing DDR CA bus and DDR DQ bus as specified for LPDDR5-3200 with CK:WCK ratio of 1:2. Only one of two differential signals shown for CK, WCK and RDQS.**

By providing an option to slow down the data bus while keeping the CA bus running at the same data rate, the system has the option to adjust internally as well.

When the CK to WCK ratio is 1:4, the DFI interface operates internally at a 1:1:4 ratio. When the CK to WCK ratio is operating in a 1:2 mode, the DFI operation is updated to work in a 1:1:2 mode. In each case the LPDDR5 controller, DFI, PHY core and CK run at the same speed. However, the DFI frequency ratio for the data operations changes to either 1:4 in the case where the LPDDR5 SDRAM data transfer rate is greater than 3200 Mbps and the CK to WCK ratio is 1:4, or 1:2 in the case where the LPDDR5 SDRAM data transfer rate is 3200 Mbps or slower and the CK to WCK ratio is 1:2. This adjustment of the DFI operating frequency ratio allows the LPDDR5 controller and DFI domain portion of the LPDDR5 PHY to run at up to 800 MHz for any speed of operation, keeping the latency through the internal LPDDR5 controller and LPDDR5 PHY as low as possible for all speeds of operation.
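The relationship between data rate, CK:WCK ratio and DFI frequency ratio can be sketched as a small calculation. The function name `lpddr5_clock_plan` and its return format are hypothetical, chosen only for illustration. The sketch also assumes that the 1:2 ratio is always selected at 3200 Mbps or slower, whereas the text above only says this is possible.

```python
def lpddr5_clock_plan(data_rate_mbps):
    """Derive clock frequencies and ratios for a given DQ data rate.

    Assumption: 1:2 is chosen whenever the data rate is 3200 Mbps or slower.
    """
    wck_mhz = data_rate_mbps / 2               # DQ is DDR relative to WCK
    ratio = 4 if data_rate_mbps > 3200 else 2  # WCK cycles per CK cycle
    ck_mhz = wck_mhz / ratio
    return {
        "wck_mhz": wck_mhz,
        "ck_to_wck": "1:%d" % ratio,
        "ck_mhz": ck_mhz,
        "ca_rate_mbps": ck_mhz * 2,            # CA is DDR relative to CK
        "dfi_clock_mhz": ck_mhz,               # controller and DFI run at CK speed
        "dfi_ratio": "1:1:%d" % ratio,         # 1:1 for commands, 1:N for data
    }


for rate in (6400, 3200):
    print(rate, lpddr5_clock_plan(rate))
```

Under these assumptions both 6400 Mbps and 3200 Mbps keep CK and the DFI at 800 MHz; only the data portion of the DFI ratio changes.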

**Differential, single ended, and strobeless operation**

During high-speed operations (the assumed majority mode of operation when not in a low-power state), the LPDDR5 device will use CK, WCK, and RDQS in differential mode to provide maximum performance. However, there are use cases for running the interface slower. The LPDDR5 specification has some built-in power saving capabilities for these use cases.

One power saving option provided by the LPDDR5 specification offers the ability to change the three differential signals CK, WCK, and RDQS into single-ended signals when running at data rates at or below 1600 Mbps. If we take the assumption of running the CK to WCK ratio as 1:2, then CK will be running at 400 MHz and WCK (and RDQS) at 800 MHz when CK, WCK, and RDQS are placed into single-ended mode operation.

The user also has the option to place CK and WCK in single-ended mode of operation and turn off RDQS entirely. Intended for low-speed operation, this is known as strobeless mode and requires the LPDDR5 PHY to generate an internal strobe to capture read DQ from the device.

When switching CK and WCK from differential to single-ended mode of operation and changing RDQS from differential to either single-ended operation or strobeless mode, it is required to disable device termination for CK, WCK and RDQS as well as for the CA signals, the DQ signals, and the data mask inversion (DMI) signal. Moving signals from differential mode to either single-ended mode or turning them off entirely saves power, and not terminating most of the signals of the LPDDR5 interface saves additional power.

There are choices and restrictions to consider when setting CK, WCK, and RDQS into a single-ended mode. WCK and RDQS may only be configured for single-ended mode when CK is also configured for single-ended mode. It is also possible to enable single-ended mode for CK while keeping both WCK and RDQS in differential mode. If WCK is put into single-ended mode, then RDQS must also be placed into single-ended mode (with the same polarity chosen for the active signal for both WCK and RDQS) or placed in strobeless mode. Table 1 lists all the valid combinations for CK, WCK, and RDQS.

**Table 1: Allowed combinations of CK, WCK and RDQS**
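The restrictions stated in the paragraph above can be expressed as a small checker. This is a sketch only: the `Mode` enum and `violated_rules` function are hypothetical names, and the code encodes just the constraints written out in the text, not the full contents of Table 1. An empty result means no stated rule is broken, which is not by itself proof that the specification allows the combination.

```python
from enum import Enum


class Mode(Enum):
    DIFFERENTIAL = "differential"
    SINGLE_ENDED = "single-ended"
    OFF = "off"  # used here only for RDQS, to model strobeless mode


def violated_rules(ck, wck, rdqs):
    """Return the constraints from the text that a configuration breaks."""
    problems = []
    if wck is Mode.SINGLE_ENDED and ck is not Mode.SINGLE_ENDED:
        problems.append("single-ended WCK requires single-ended CK")
    if rdqs is Mode.SINGLE_ENDED and ck is not Mode.SINGLE_ENDED:
        problems.append("single-ended RDQS requires single-ended CK")
    if wck is Mode.SINGLE_ENDED and rdqs is Mode.DIFFERENTIAL:
        problems.append(
            "single-ended WCK requires single-ended or strobeless RDQS"
        )
    return problems


# CK single-ended with WCK and RDQS still differential is explicitly allowed.
print(violated_rules(Mode.SINGLE_ENDED, Mode.DIFFERENTIAL, Mode.DIFFERENTIAL))
# WCK single-ended while CK is differential breaks the first rule.
print(violated_rules(Mode.DIFFERENTIAL, Mode.SINGLE_ENDED, Mode.SINGLE_ENDED))
```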

**Summary**

The introduction of the LPDDR5 specification not only enables the implementation of a new low-power SDRAM standard, promising larger density devices and faster data rates, it also outlines some innovative new clocking schemes which allow for easier implementation of LPDDR5 controllers and LPDDR5 PHYs when running at the maximum data rates allowed by the specification. Additionally, the specification offers a few options for power savings with the clock and data strobes when the memory cannot be placed in a low-power state but does not need to run at higher data rates.

Synopsys, the memory interface IP leader, offers a complete LPDDR5 IP interface solution including a configurable LPDDR5 controller, LPDDR5 PHYs available in a wide variety of technology nodes, and LPDDR5 Verification IP. Synopsys is an active member of JEDEC helping to drive development and adoption of the newest memory standards. Synopsys’ configurable memory interface IP solutions can be tailored to meet the exact requirements of SoCs for applications such as AI, automotive, mobile and cloud computing.

## Abstract

Transport based distances, such as the Wasserstein distance and earth mover's distance, have been shown to be an effective tool in signal and image analysis. The success of transport based distances is in part due to their Lagrangian nature which allows them to capture the important variations in many signal classes. However, these distances require the signal to be nonnegative and normalized. Furthermore, the signals are considered as measures and compared by redistributing (transporting) them, which does not directly take into account the signal intensity. Here we study a transport-based distance, called the TLp distance, that combines Lagrangian and intensity modelling and is directly applicable to general, non-positive and multi-channelled signals. The distance can be computed by existing numerical methods. We give an overview of the basic properties of this distance and applications to classification, with multi-channelled non-positive one-dimensional signals and two-dimensional images, and color transfer.

## 1 Introduction

Enabled by advances in numerical implementation [4, 5, 16, 53, 71, 74], and their Lagrangian nature, transportation based distances for signal analysis are becoming increasingly popular in a large range of applications. Recent applications include astronomy [9, 18, 19], biomedical sciences [3, 25–27, 77, 81, 82, 88, 89], colour transfer [14, 17, 49, 62, 63], computer vision and graphics [7, 44, 60, 65, 68, 74, 75], imaging [36, 40, 64], information theory [78], machine learning [1, 15, 20, 34, 37, 48, 76], operational research [69] and signal processing [54, 58].

The success of transport based distances is due to the large number of applications that consider signals that are Lagrangian in nature (spatial rearrangements, i.e. transport, are a key factor when considering image differences). Many signals contain similar features for which transport based distances will outperform distances that only consider differences in intensity, such as the Lp distance. Optimal transport (OT) distances, for example the earth mover's distance or Wasserstein distance, are examples of transport distances. However, these distances do not directly account for signal intensity. The Lp distance is the other extreme; it is based on intensity and does not take into account Lagrangian properties.

In this paper we develop the TLp distance introduced in [21] which combines both Lagrangian and intensity based modeling. Our aim is to show that by including both transport and intensity within the distance we can better represent the similarities between classes of data in many problems. For example, if a distance can naturally differentiate between classes, that is the within class distance is small compared to the between class separation, then the classification problem is made easier. This requires designing distances that can faithfully represent the structure within a given data set.

In the majority of the literature optimal transport distances interpret signals as either probability measures or as densities of probability measures. This places restrictions on the type of signals one can consider. Probability measures must be non-negative, integrate to unity and be real valued (i.e. cannot be applied to multi-channelled signals). In order to apply OT distances to a wider class of signals one has to use ad-hoc methods, which do not necessarily preserve metric properties, to transform the signal into a probability measure. This can often dampen the features, for example renormalization may reduce the intensity range of a signal. We do however note the works of Liero, Mielke and Savaré [42, 43], Chizat, Peyré, Schmitzer and Vialard [13, 14] and Kondratyev, Monsaingeon and Vorotnikov [38] who develop an optimal transport metric that is applicable to un-normalised positive measures. Similarly Pele and Werman propose a variant of the earth mover's distance that is applicable to un-normalised positive measures [59]. Whilst these are also promising avenues of research, restrictive assumptions, such as that signals must be non-negative and real valued, are still required in order to apply these distances.

Extensions to matrix valued optimal transport have been made in [11, 12, 52]. In [52] the authors propose a method for defining an optimal transport distance between matrix valued densities. As in the scalar valued case for a suitable class of matrix valued densities there exists a non-empty set of couplings. A matrix valued optimal transport distance is defined by minimising over the set of couplings a cost function that penalises both the transport of mass and rotations of the coupling. Similar ideas in [11, 12] use an analogue of the Brenier and Benamou formulation of optimal transport [4] to define a matrix valued optimal transport distance. Whilst these distances are applicable to matrix valued signals they still require the assumption of positivity (in this case positive definite). These distances are also specific to n × n matrices which does not include vectors.

The TLp distance does not need the signal to be a probability measure and therefore the above restrictions do not apply. Rather, the TLp distance models the intensity directly. The applicability of the distance is sufficiently general as to include non-positive, multi-channelled and un-normalised signals on discrete or continuous domains.

Another property of OT distances, due to the lack of intensity modeling, is their insensitivity to high frequency perturbations. This is due to transport being on the order of the wavelength of the perturbation. Depending upon the application this can be an advantage or a disadvantage. For example in texture modeling one would want to be able to discriminate between a highly oscillating image and a constant image. On the other hand, the lack of sensitivity to high frequency noise makes the OT distance stable under such perturbations. Since the TLp distance directly models intensity, it inherits sensitivity to high frequency noise.

The aim of this paper is to develop the TLp distance and demonstrate its applicability in a range of applications. We consider classification problems on data sets where we show that the TLp distance better represents the underlying geometry, i.e. achieves a better between class to within class distance, than popular alternative distances.

We also consider the colour transfer problem in a context where spatial information, as well as intensity, is important. To apply standardised tests in applications such as medical imaging it is often necessary to normalise colour variation [33, 45, 73]. One solution is to match the means and variance of each colour channel (in some colour space e.g. RGB or LAB). However, by transferring the colour of one image onto the other it is possible to recolour an image with exactly the same colour profile.

A popular method is to use the OT distance on the histogram of images [14, 17, 49, 62, 63]. This allows one to take into account the intensity of pixels but includes no spatial information. The TLp distance is able to include both spatial and intensity information.

Our methodology, therefore, has more in common with registration methods that aim to find a transformation that maximizes the similarity between two images where our measure of similarity includes both spatial and intensity information. One should compare our approach to [27] where the authors develop a numerical method for the Monge formulation of OT with the addition of an intensity term for image warping and registration. However, unlike the method presented in [27], the formulation presented here defines a metric.

### Paper Overview

The outline for this paper is the following. In the next section we review OT and give a formal definition of the TLp distance followed by examples to illustrate its features and to compare with the OT and Lp distances. In Section 3 we give a more general definition and explain some of its key properties. In Section 4 we include applications of the TLp distances. We first consider classification on synthetic and real-world signals and images. The data sets contain non-positive and un-normalised signals in either one or two dimensions. In addition one of the data sets is multi-channelled. A further application to the colour transfer problem is then given. Conclusions are given in Section 5.

## 2 Formal Definitions and Examples

### 2.1 Review of Optimal Transport and the TLp Distance

We begin by reviewing optimal transport in first the Kantorovich formulation and then the Monge formulation.

#### The Kantorovich Formulation of Optimal Transport

For measures μ and ν on Ω ⊂ ℝ^d with the same mass and a continuous cost function c: Ω × Ω → [0, ∞) the Kantorovich formulation of OT is given by

$$\mathrm{OT}(\mu,\nu) = \min_{\pi} \int_{\Omega\times\Omega} c(x,y) \, \mathrm{d}\pi(x,y) \tag{1}$$

where the minimum is taken over probability measures π on Ω × Ω such that the first marginal is μ and the second marginal is ν, i.e. π(A × Ω) = μ(A) and π(Ω × B) = ν(B) for all open sets A and B. We denote the set of such π by Π(μ, ν). We call measures π ∈ Π(μ, ν) transport plans since π(A × B) is the amount of mass in A that is transferred to B. Minimizers π* of OT(μ, ν), which we call optimal plans, exist when c is lower semi-continuous [84, Theorem 4.1]. When c is a metric OT(μ, ν) is also known as the earth mover's distance.

A common choice is $c(x,y) = |x-y|_p^p = \sum_{i=1}^{d} |x_i - y_i|^p$, in which case we define $d_{\mathrm{OT}}(\mu,\nu) = \sqrt[p]{\mathrm{OT}(\mu,\nu)}$. When p = 2 this is known as the Wasserstein distance. We will call dOT the OT distance. With an abuse of notation we will sometimes write dOT(f, g) when μ and ν have densities f and g respectively.
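As an illustration of the OT distance in the simplest discrete setting, the sketch below computes it between two one-dimensional point clouds with equal numbers of equally weighted points. It relies on the standard fact that, on the real line with cost |x − y|^p and p ≥ 1, matching the sorted points is an optimal plan. The function name `ot_distance_1d` is hypothetical and this is not the numerical method used in the paper.

```python
def ot_distance_1d(xs, ys, p=2):
    """OT distance between two equal-size, equal-weight 1D point clouds.

    Each point carries mass 1/n. For p >= 1 the sorted (monotone) matching
    is optimal in one dimension, so no optimisation is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("point clouds must have the same number of points")
    n = len(xs)
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / n) ** (1.0 / p)


# Translating a point cloud by 0.5 gives an OT distance of 0.5.
print(ot_distance_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5], p=2))
```

The distance grows with the size of a translation, which is the Lagrangian behaviour discussed in the introduction.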

When μ has a continuous density then the support of any optimal plan π* is contained on the graph of a function T*. In particular this implies π*(A × B) = μ ({x : x ∈ A, T*(x) ∈ B}) and furthermore that the optimal plan defines a mapping between μ and ν, see for example . This leads us to the Monge formulation of OT.

#### The Monge Formulation of Optimal Transport

An appealing property of optimal transport distances is their formulation in a Lagrangian setting. One can rewrite the optimal transport problem in the Monge formulation as

$$\mathrm{OT}_M(\mu,\nu) = \inf_{T} \int_{\Omega} c(x,T(x)) \, \mathrm{d}\mu(x) \tag{2}$$

where the infimum is taken over transport maps T : Ω → Ω that rearrange μ into ν, i.e. ν = T#μ where we define the pushforward of μ onto the range of T by T#μ(A) = μ(T⁻¹(A)), see . Historically the Monge formulation comes before the Kantorovich formulation; Monge formulated OT for the cost function c(x, y) = |x − y| in 1781 [47] and Kantorovich formulated OT (whilst being unaware of Monge's work) in 1942 [31]. In 1948 Kantorovich made the connection between his work and Monge's [32].

The Monge formulation is a non-convex optimization problem with nonlinear constraints. However when, for example, μ and ν have densities, then optimal transport maps T* exist and give a natural interpolation between two measures. In particular when c(x, y) = |x − y|^p the map T_t(x) = (1 − t)x + tT*(x) describes the path of particle x and furthermore the measure of μ pushed forward by T_t is the geodesic (shortest path) between μ and ν. This property has had many uses in transport based morphometry applications such as biomedical [3, 55, 80, 85] and super-resolution [36], and has much in common with large deformation diffeomorphism techniques in shape analysis [23, 29].

#### Optimal Transport in Signal and Image Processing

To further motivate our development of the TLp distance we point out some features of optimal transport important to signal and image processing. We refer to [35] and references therein for more details and a review of the subject.

Key to the success of OT is the ability to provide generative models which accurately represent various families of data distributions. The success and appeal of OT are due to (1) its ability to capture signal variations due to spatial rearrangements (shifts, translations, transport), (2) the fact that OT distances are theoretically well understood and have appealing features (for example the Wasserstein distance has a Riemannian structure and geodesics can be characterized), (3) the efficiency and accuracy of numerical methods, and (4) its simplicity compared to other Lagrangian methods such as large deformation diffeomorphic metric mapping.

The Monge formulation of OT defines a mapping between images which has been used in, for example, image registration [24–27, 50, 82, 89] where one wishes to find a common geometric reference frame between two or more images. In addition to the properties listed above the success of OT is due to the fact that (5) the Monge problem is symmetric (i.e. if T is the optimal map from the first image to the second, then T⁻¹ is the optimal map from the second image to the first) and (6) OT provides a landmark-free and parameter-free registration scheme.

We now introduce the TLp distance in the simplest setting.

#### The Transportation Lp Distance

In this paper we use the TLp distance (given in more generality in the next section), for functions f, g : Ω → ℝ^m defined by

$$d_{TL^p}^p(f,g) = \min_{\pi} \int_{\Omega\times\Omega} |x-y|_p^p + |f(x)-g(y)|_p^p \, \mathrm{d}\pi(x,y)$$

where the minimum is taken over all probability measures π on Ω × Ω such that both the marginals are the Lebesgue measure ℒ on Ω, i.e. π ∈ Π(ℒ, ℒ). This can be understood in the following two ways.

The first is as an optimal transport distance of the Lebesgue measure with itself and cost $c(x,y) = |x-y|_p^p + |f(x)-g(y)|_p^p$. This observation allows one to apply existing numerical methods for OT where the effective dimension is d (recall that Ω ⊆ ℝ^d). For example, the Sinkhorn framework can be adapted to compute an entropy regularised approximation of the TLp distance, see Appendix B.2 for more details.

The second is as an OT distance between the Lebesgue measure raised onto the graphs of f and g. That is, given f, g : Ω → ℝ^m then we define the measures μ̃, ν̃ on the graphs of f and g by

$$\tilde{\mu}(A\times B) = \mathcal{L}(\{x : x\in A,\ f(x)\in B\})$$

and

$$\tilde{\nu}(A\times B) = \mathcal{L}(\{y : y\in A,\ g(y)\in B\})$$

for any open sets A ⊆ Ω, B ⊆ ℝ^m. For example, given the function f : [0,1] → [0,1] defined by f(x) = x and the Lebesgue measure on [0,1], the pushforward of the measure μ onto the graph of f is the measure $\tilde{\mu}(C) = \frac{1}{\sqrt{2}}\,\mathrm{Length}(\{x : (x,x)\in C\})$ for any (measurable) C ⊂ [0,1]²; it is intuitive that μ̃(C) should be proportional to Length({x : (x, x) ∈ C}), the constant of proportionality comes from $\tilde{\mu}([0,1]^2) = \mu([0,1]) = 1$. See also for an example where the measure μ is Gaussian. The TLp distance between f and g is then the OT distance between μ̃ and ν̃.

Transport (i.e. matching) with respect to the TLp distance is of the form (x, f (x)) ↦ (y, g(y)) and therefore has two components. We refer to horizontal transport as the transport x ↦ y in Ω, and vertical transport as the transport f (x) ↦ g(y).

Although the TLp distance is a special case of OT we will, in order to make a clearer distinction between classical OT distances and the TLp distance, assume that $c(x,y) = |x-y|_p^p$ in (1).
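To make the definition concrete, the following sketch evaluates a discretised TLp distance for scalar signals sampled on a small one-dimensional grid. It assumes each grid point carries mass 1/n, in which case the minimum over plans is attained at a permutation (permutation matrices are the extreme points of the set of doubly stochastic matrices), so a brute-force search over permutations suffices for tiny n. The names `tlp_distance` and `lp_distance` are hypothetical, and this is not the Sinkhorn-based method referred to above.

```python
from itertools import permutations


def tlp_distance(xs, f, g, p=2):
    """Brute-force discretised TLp distance between scalar signals f and g.

    xs are grid points, each with mass 1/n. The cost of matching point i to
    point j combines horizontal transport |x_i - x_j|^p and vertical
    transport |f_i - g_j|^p. Only practical for very small n.
    """
    n = len(xs)
    if not (len(f) == len(g) == n):
        raise ValueError("xs, f and g must have the same length")
    if n > 8:
        raise ValueError("brute force is only intended for n <= 8")

    def cost(i, j):
        return abs(xs[i] - xs[j]) ** p + abs(f[i] - g[j]) ** p

    best = min(
        sum(cost(i, sigma[i]) for i in range(n))
        for sigma in permutations(range(n))
    )
    return (best / n) ** (1.0 / p)


def lp_distance(f, g, p=2):
    """Discretised Lp distance with the same 1/n weights."""
    n = len(f)
    return (sum(abs(a - b) ** p for a, b in zip(f, g)) / n) ** (1.0 / p)


xs = [i / 6 for i in range(6)]
f = [0, 3, 0, 0, 0, 0]   # a narrow hump
g = [0, 0, 0, 0, 3, 0]   # the same hump translated by three grid steps
print("TLp:", tlp_distance(xs, f, g), "Lp:", lp_distance(f, g))
```

Because the identity permutation corresponds to purely vertical transport, the discretised TLp distance never exceeds the Lp distance; in this example horizontal transport is cheaper, so it is strictly smaller. For larger grids an assignment solver such as `scipy.optimize.linear_sum_assignment` could replace the brute-force search.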

In the next section we discuss the behaviour of the TLp distance through three examples.

### 2.2 Examples Illustrating the Behaviour of TLp

#### No mass renormalization

Unlike for OT, in the TLp distance there is no need to assume that f and g are non-negative or that they have the same mass. If a signal is negative then a typical (ad-hoc) fix in OT is to add a constant to make the signal non-negative before computing the distance. How to choose this constant is often unclear unless a lower bound is known a-priori. Furthermore this may damage sensitivity to translations as the defining features of the signal become compressed. For example, considering the functions in , let g = f(· − ℓ) be the translation of f. OT will lose sensitivity when comparing $\hat{f} = \frac{f+\alpha}{\int (f+\alpha)}$ and $\hat{g} = \frac{g+\alpha}{\int (g+\alpha)}$. In particular $d_{\mathrm{OT}}(\hat{f},\hat{g})$ scales with the height of the renormalised function, which is of the order of $1/\alpha$, and the size of the shift: $d_{\mathrm{OT}}(\hat{f},\hat{g}) \propto \frac{h_0 \ell}{\alpha}$ where $h_0$ is the height of f. To ensure positivity one must choose α large but this also implies a small OT distance. Note also that both Lp and TLp distances are invariant under adding a constant whereas OT is not.

Another approach to apply the OT distance to non-positive signals is to decompose each signal into the positive and negative components f = f⁺ − f⁻, where f⁺ = max{0, f} and f⁻ = max{0, −f}, and compute the OT distance between each component. Whilst this method may be reasonable depending on the application it is not invariant under addition which could produce some unnatural properties. For example, consider two signals, one slightly negative and one slightly positive. Then applying an (unbalanced) OT distance on the positive and negative components is equivalent to matching both signals to zero. Adding a small constant to the negative signal so that both signals are positive produces a qualitatively different result. Since both the TLp and Lp distances are invariant under addition neither has this property. To mitigate this issue Bonneel, van de Panne, and Heidrich [8] consider signals decomposed into frequency bands. This also allows them to directly take into account the signal frequency. In some sense our approach is complementary, as we seek a way to take into account the signal intensity.

#### Sensitivity to High Frequency Perturbations

The TLp distance inherits sensitivity to high frequency perturbations from the Lp distance. For example, let g = f + Aξ where ξ is a high frequency perturbation with amplitude A and wavelength ω. Suppose for simplicity that f is constant and ξ is a sinusoid (and that both signals are positive). The function g is the landscape consisting of piles of earth and trenches and f is the flat landscape. The OT distance between f and g measures the cost of moving piles of earth into trenches (in the most efficient manner). The two factors which determine the OT distance are the total amount of earth to be moved (which we assume fixed) and how far we move each piece of soil, which is determined by the wavelength. Hence the OT distance between f and g is on the order of the wavelength ω of ξ, which is small, and independent of the amplitude A. On the other hand both the TLp distance and the Lp distance are independent of the wavelength but scale linearly with amplitude (see the accompanying figure). In particular OT is insensitive to high frequency noise regardless of the size of the amplitude whereas both TLp and Lp distances scale linearly with the amplitude.

#### Ability of the TLp Distance to Track Translations

Another desirable property of both TLp and OT distances is their ability to keep track of translations for longer than the Lp distance. Let $f = A\chi_{[0,1]}$ be the indicator function of the set [0,1] on ℝ scaled by A > 1 and $g(x) = f(x - \ell)$ the translation of f by ℓ. Once ℓ > 1 the Lp distance can no longer tell how far apart the two humps are. On the other hand OT distances can track the hump indefinitely. In this example the TLp distance couples the graphs of f and g in one of three ways (see the accompanying figure). The first is when the transport is horizontal only in the graph (top left). In the second (top right) there is a mixture of horizontal and vertical transport. And in the third there is only vertical transport (bottom left), in which case the TLp distance coincides with the Lp distance. One can calculate the range of the TLp distance, which is on the order of A.

## 3 Definitions and Basic Properties of the TLp Distance

In the previous section we defined the TLp distance for signals defined with respect to the Lebesgue measure. In this section we generalise to signals defined on a general class of measures. We let $L^p(\mu)$ be the space of functions f such that $\int_\Omega |f(x)|^p \,\mathrm{d}\mu(x) < \infty$. This is a Banach space with the usual norm.

We treat a signal as a pair (f, μ) where $\mu \in \mathcal{P}_p(\Omega)$ (the set of probability measures with finite pth moment) and $f : \Omega \to \mathbb{R}^m$ with $f \in L^p(\mu)$. The generality considered here allows us to treat continuous and discrete signals simultaneously, and it allows one to design the underlying measure in order to emphasise certain parts of the signal. We are also able to compare signals with different discretisations. However, unless otherwise stated, μ = ν is the Lebesgue measure. There is no assumption on the dimension m of the codomain. This allows us to consider multi-channelled signals.

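The $TL^p_\lambda$ distance for pairs $(f, \mu) \in TL^p$, where

$$TL^p := \{(f,\mu) : f \in L^p(\mu),\ \mu \in \mathcal{P}_p(\Omega)\},$$

is defined by

$$d^p_{TL^p_\lambda}((f,\mu),(g,\nu)) = \min_{\pi \in \Pi(\mu,\nu)} \int_{\Omega\times\Omega} c_\lambda(x,y;f,g) \,\mathrm{d}\pi \tag{3}$$

where

$$c_\lambda(x,y;f,g) = \frac{1}{\lambda}|x-y|^p + |f(x)-g(y)|^p \tag{4}$$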

and Π(μ, ν) is the space of measures on Ω × Ω such that the first marginal is μ and the second marginal is ν. Note that if f = g is constant then we recover the OT distance between the measures μ and ν. In the special case when μ = ν = ℒ is the Lebesgue measure, we write $d_{TL^p_\lambda}(f,g) := d_{TL^p_\lambda}((f,\mathcal{L}),(g,\mathcal{L}))$ and, when λ = 1, $d_{TL^p}(f,g) := d_{TL^p_1}(f,g)$. The result of [21, Proposition 3.3] implies that $d_{TL^p_\lambda}$ is a metric on $TL^p$.

**Proposition 3.1.** [21] For any p ∈ [1, ∞] and λ > 0, $(TL^p, d_{TL^p_\lambda})$ is a metric space.

When μ = ν = ℒ is the Lebesgue measure then an admissible plan is the identity plan: π (A × B) = ℒ (A ∩ B). This implies that the TLλp distance is bounded above by the Lp distance (for any λ).

In fact the parameter λ controls how close the distance is to an Lp distance. As λ → 0 the cost of horizontal transport, $\frac{1}{\lambda}\int_{\Omega\times\Omega} |x-y|^p \,\mathrm{d}\pi(x,y)$, becomes very expensive, which favours transport plans that are approximately the identity mapping. Hence $d_{TL^p_0}(f,g) := \lim_{\lambda\to 0} d_{TL^p_\lambda}(f,g) = \|f-g\|_{L^p}$. The following result, and the remainder of the results in this section, can be found in [79].

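**Proposition 3.2.** [79] Let f, g ∈ Lp (with respect to the Lebesgue measure). The $TL^p_\lambda$ distance is decreasing as a function of λ and

$$\lim_{\lambda\to 0} d_{TL^p_\lambda}(f,g) = \|f-g\|_{L^p}.$$

Moreover, if either f or g is Lipschitz then

$$d^p_{TL^p_\lambda}(f,g) \ge \begin{cases} \varepsilon^{p-1}(\lambda)\,\|f-g\|^p_{L^p} & \text{if } p > 1 \\ \|f-g\|^p_{L^p} & \text{if } p = 1 \text{ and } \lambda < \frac{1}{\kappa} \end{cases}$$

where $\varepsilon(\lambda) = \frac{1}{1+(\lambda\kappa)^{\frac{1}{p-1}}}$ and $\kappa = (\min\{\mathrm{Lip}(f), \mathrm{Lip}(g)\})^p$.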

The above proposition implies that, when p = 1, if $\frac{1}{\lambda}$ is chosen larger than the length scale given by the derivative then the $TL^1_\lambda$ distance is exactly the L1 distance.

Recall that we can consider the TLλp distance as an OT distance on the graphs of f and g:

$$d_{TL^p_\lambda}((f,\mu),(g,\nu)) = d_{OT}\big((\mathrm{Id}\times f)_\#\mu, (\mathrm{Id}\times g)_\#\nu\big). \tag{5}$$

When there exists a map $T : \Omega\times\mathbb{R}^m \to \Omega\times\mathbb{R}^m$ realising the minimum of the Monge formulation of the RHS then we can understand the transport as a map $(x, f(x)) \mapsto (y, g(y))$. We recall that we refer to the transport $x \mapsto y$ in the domain Ω as horizontal transport and the transport $f(x) \mapsto g(y)$ in the codomain of f and g as vertical transport. We see that horizontal transport is cheap as λ → ∞ and we only pay the cost of vertical transport. For example, if we consider $f = \chi_{[0,1]}$ and $g = \chi_{[1,2]}$ defined on the interval [0, 2] then the mapping $T(x,y) = (T_1(x,y), T_2(x,y))$ where

$$T_1(x,y) = \begin{cases} x+1 & \text{if } y = 1 \text{ and } x \in [0,1] \\ x-1 & \text{if } y = 0 \text{ and } x \in [1,2] \end{cases}$$

and

$$T_2(x,y) = \begin{cases} 1 & \text{if } y = 1 \text{ and } x \in [0,1] \\ 0 & \text{if } y = 0 \text{ and } x \in [1,2] \end{cases}$$

defines a transport map on the support of (Id × f )#ℒ. Furthermore, this implies

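$$d^p_{TL^p_\lambda}(f,g) \le \int_0^1 \left(\frac{|x-T_1(x,1)|^p}{\lambda} + |f(x)-g(T_1(x,1))|^p\right)\mathrm{d}x + \int_1^2 \left(\frac{|x-T_1(x,0)|^p}{\lambda} + |f(x)-g(T_1(x,0))|^p\right)\mathrm{d}x = \frac{2}{\lambda} \to 0 \quad \text{as } \lambda\to\infty.$$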
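As a check on the value $\frac{2}{\lambda}$, the two integrals can be evaluated term by term. The sketch below assumes the map moves the hump one unit to the right, $T_1(x,1) = x+1$ on $[0,1]$, and the flat part one unit to the left, $T_1(x,0) = x-1$ on $[1,2]$, so that the horizontal displacement is always 1 and the vertical mismatch vanishes:

$$\int_0^1 \left(\frac{|x-(x+1)|^p}{\lambda} + |1-1|^p\right)\mathrm{d}x = \frac{1}{\lambda}, \qquad \int_1^2 \left(\frac{|x-(x-1)|^p}{\lambda} + |0-0|^p\right)\mathrm{d}x = \frac{1}{\lambda}.$$

Summing the two contributions gives $\frac{2}{\lambda}$, which is independent of p because every point moves a horizontal distance of exactly 1.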

In this example $d_{TL^p_\infty}(f,g) := \lim_{\lambda\to\infty} d_{TL^p_\lambda}(f,g) = 0$. More generally the $TL^p_\infty$ distance is an OT distance between the measures $f_\#\mu$ and $g_\#\nu$.

**Proposition 3.3.** [79] Let Ω ⊆ ℝd, f, g : Ω → ℝm measurable functions and μ, ν ∈ 𝒫p(Ω) where p ≥ 1, then

$$\lim_{\lambda\to\infty} d_{TL^p_\lambda}((f,\mu),(g,\nu)) = d_{OT}(f_\#\mu, g_\#\nu)$$

where dOT is the OT distance (on 𝒫(ℝm)) with cost c(x, y) = |x − y|p.

As the example before the proposition showed, $d_{TL^p_\infty}$ is not a metric; however it is non-negative, symmetric and satisfies the triangle inequality.

In [4, Section 2] the authors, using the fluid mechanics formulation of optimal transport, interpolate the optimal transport distance with quadratic cost with the L2 distance. The resulting interpolated distance can still be written in the fluid mechanics formulation which naturally gives rise to geodesics. By contrast the TLp distance interpolates between Lp and the optimal transport distance of the push forward measures. This is well defined for any p ≥ 1 (unlike the previous method which requires p = 2) however geodesics do not exist in the TLp space. One must also treat the signals as probability measures in the approach of [4].

We observe that when μ is a uniform measure (either in the discrete or continuous sense) the measure f#μ is the histogram of f. The OT distance between histograms is a popular tool in histogram specification. Minimizers to the Monge formulation of dOT(f#μ, g#ν) define a mapping between the histograms f#μ and g#ν [49, 62, 63]. However this mapping contains no spatial information. If instead one uses minimizers to the Monge formulation of the TLλp distance (6) (λ < ∞) then one can include spatial information in the histogram specification. We explore this further in Section 4.4 and apply the method to the colour transfer problem.

It is well known that there exists a minimizer (when c is lower semi-continuous) for OT. Since the TLλp distance is closely related to an OT distance between measures in ℝd+m (i.e. measures supported on graphs) then there exists a minimizer to (3-4).

**Proposition 3.4.** [79] Let $\Omega \subseteq \mathbb{R}^d$ be open and bounded, $f \in L^p(\mu)$, $g \in L^p(\nu)$ where $\mu, \nu \in \mathcal{P}(\Omega)$, $\lambda \in [0, +\infty]$ and p ≥ 1. Under these conditions there exists an optimal plan $\pi \in \Pi(\mu,\nu)$ realising the minimum in $d_{TL^p_\lambda}((f,\mu),(g,\nu))$.

As in the OT case it is natural to set the TLλp distance in the Monge formulation (2). We can write

$$d^p_{TL^p_\lambda}((f,\mu),(g,\nu)) = \inf_{T \,:\, T_\#\mu = \nu} \int_\Omega c_\lambda(x, T(x); f, g) \,\mathrm{d}\mu(x). \tag{6}$$

Minimizers to the above will not always exist. For example, when f = g the $TL^p_\lambda$ distance is the OT distance between μ and ν. If one chooses $\mu = \frac{1}{3}\delta_{x_1} + \frac{1}{3}\delta_{x_2} + \frac{1}{3}\delta_{x_3}$ and $\nu = \frac{1}{2}\delta_{y_1} + \frac{1}{2}\delta_{y_2}$ where all of $x_i$, $y_j$ are distinct then there are no maps $T : \{x_1, x_2, x_3\} \to \{y_1, y_2\}$ that push forward μ to ν.

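However, in terms of numerical implementation, an interesting and important case is when μ and ν are discrete measures (see also [83, pg 5, 14-15] for the following argument with the Monge OT problem). Let $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{y_i}$; then $\pi = (\pi_{ij})_{i,j=1}^n \in \Pi(\mu,\nu)$ is a doubly stochastic matrix up to a factor of $\frac{1}{n}$, that is

$$\pi_{ij} \ge 0 \ \ \forall i,j, \qquad \sum_{i=1}^n \pi_{ij} = \frac{1}{n} \ \ \forall j \qquad \text{and} \qquad \sum_{j=1}^n \pi_{ij} = \frac{1}{n} \ \ \forall i, \tag{7}$$

and the $TL^p_\lambda$ distance can be written

$$d^p_{TL^p_\lambda}((f,\mu),(g,\nu)) = \min_\pi \sum_{i=1}^n \sum_{j=1}^n c_\lambda(x_i, y_j; f, g)\,\pi_{ij} \tag{8}$$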

where the minimum is taken over π satisfying (7). It is known (by Choquet's Theorem, e.g. [67, Theorem 32.3]) that the solution to this minimisation problem is an extremal point in the matrix set Π(μ, ν). It is also known (by application of Birkhoff's Theorem, e.g. [6]) that extremal points in Π(μ, ν) are permutation matrices. This implies that there exists an optimal plan $\pi^*$ that can be written as $\pi^*_{ij} = \frac{1}{n}\delta_{j-\sigma(i)}$ for a permutation $\sigma : \{1,\dots,n\} \to \{1,\dots,n\}$. Hence there exists an optimal plan to the Monge formulation of the $TL^p_\lambda$ distance.

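**Proposition 3.5.** For any $f \in L^p(\mu)$ and $g \in L^p(\nu)$ where $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n}\sum_{j=1}^n \delta_{y_j}$ there exists a permutation $\sigma : \{1, 2, \dots, n\} \to \{1, 2, \dots, n\}$ such that

$$d^p_{TL^p_\lambda}((f,\mu),(g,\nu)) = \frac{1}{n}\sum_{i=1}^n c_\lambda(x_i, y_{\sigma(i)}; f, g).$$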

The above proposition implies that in the uniform discrete case there exist optimal plans (which are matrices) that are sparse. In particular, π* is an n × n matrix with only n non-zero entries. This motivates the use of numerical methods that can take advantage of expected sparsity in the solution (e.g. iterative linear programming methods such as [53]).
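In the uniform discrete case the minimisation therefore reduces to a search over permutations. The following Python sketch illustrates this for scalar signals on a one dimensional grid. The function names `tl_cost`, `tlp_distance` and `lp_distance` are illustrative and not taken from the paper, and the brute-force enumeration of all n! permutations is a stand-in for a proper linear programming or assignment solver, so it is only usable for very small n.

```python
from itertools import permutations


def tl_cost(x, y, fx, gy, lam, p):
    """Cost (4): |x - y|^p / lam + |f(x) - g(y)|^p for scalar points and values."""
    return abs(x - y) ** p / lam + abs(fx - gy) ** p


def tlp_distance(xs, f, ys, g, lam=1.0, p=2):
    """TL^p_lam distance between (f, mu) and (g, nu), mu and nu uniform on n points.

    Brute force over all permutations (illustration only).
    Returns the distance and the optimal permutation as a tuple.
    """
    n = len(xs)
    assert len(ys) == len(f) == len(g) == n
    best_total, best_sigma = float("inf"), None
    for sigma in permutations(range(n)):
        total = sum(
            tl_cost(xs[i], ys[sigma[i]], f[i], g[sigma[i]], lam, p)
            for i in range(n)
        )
        if total < best_total:
            best_total, best_sigma = total, sigma
    # Each matched pair carries mass 1/n; take the p-th root to get the distance.
    return (best_total / n) ** (1.0 / p), best_sigma


def lp_distance(f, g, p=2):
    """Discrete L^p distance with respect to the uniform measure on n points."""
    n = len(f)
    return (sum(abs(a - b) ** p for a, b in zip(f, g)) / n) ** (1.0 / p)


# A hump and its translate on a grid of 6 points in [0, 1].
xs = [i / 5 for i in range(6)]
f = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
g = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
d_tl, sigma = tlp_distance(xs, f, xs, g, lam=1.0, p=2)
d_l2 = lp_distance(f, g, p=2)
```

In this toy example the optimal permutation shifts the hump horizontally, so `d_tl` is smaller than `d_l2`, consistent with the identity plan giving the Lp distance as an upper bound. For very small `lam` the identity permutation becomes optimal and the two values coincide.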

## 4 TLp in Multivariate Signal and Image Processing

Written in the form (3) the TLλp distance is an OT distance between the measures μ and ν with the cost function c given by (4) and which depends upon f and g. Hence, to compute TLλp distances there are many algorithms for OT distances that we may apply, for example the multi-scale approaches of Schmitzer [71] and Oberman and Ruan [53], or the entropy regularized approaches of Cuturi [16] and Benamou, Carlier, Cuturi, Nenna and Peyre [5]. Our choice was the iterative linear programming method of Oberman and Ruan [53] for the multivariate signals which we find works well both in terms of accuracy and computation time. Our choice for the images was the entropy regularized solution due to Cuturi [5, 16]. Whilst this only produces an approximation of the TLp distance we find it computationally efficient for 2D images. In particular this method regularizes the OT distance with εH(π) where H is entropy. We choose ε as small as possible whilst avoiding numerical instability. In practice this corresponds to a choice of ε ≈ 0.005. For convenience we include a review of the numerical methods in Appendix B.

With respect to choosing λ there are two approaches we could take. The first is to compute the TLp distance for a range of λ and then use cross-validation. There are two disadvantages to this approach: we would still have to know the range of λ, and computing the $TL^p_\lambda$ distance for multiple choices of λ would considerably increase computation time. The second approach, and the one we use for each example in this section, is to estimate λ by comparing length scales and desired behaviour. In particular we choose λ so that both horizontal and vertical transport make a contribution. For the applications in this section we want to stay away from the asymptotic regimes λ ≈ 0 and λ ≫ 1. By balancing the vertical and horizontal length scales we can formally find an approximation of λ which works well in our results below. For example, if we expect a set of real valued time series to have range $[f_{\min}, f_{\max}]$ and domain $[t_{\min}, t_{\max}]$ then to balance the vertical and horizontal length scales we choose λ so that

$$\frac{|t_{\max} - t_{\min}|^p}{\lambda} \approx |f_{\max} - f_{\min}|^p.$$
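The balancing rule can be solved for λ directly. The helper below is a small sketch of that rearrangement; the name `balanced_lambda` and its argument names are hypothetical and introduced only for illustration.

```python
def balanced_lambda(t_min, t_max, f_min, f_max, p=2):
    """Return lam such that |t_max - t_min|^p / lam equals |f_max - f_min|^p."""
    vertical = abs(f_max - f_min) ** p
    if vertical == 0:
        raise ValueError("signal range is zero; no finite lambda balances the scales")
    return abs(t_max - t_min) ** p / vertical


# Time series on [0, 10] with values in [-1, 1] and p = 2.
lam = balanced_lambda(0.0, 10.0, -1.0, 1.0, p=2)
```

With these illustrative numbers the horizontal scale is 10 and the vertical scale is 2, so the balanced value is (10 / 2)² = 25.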

We first consider two synthetic examples. Considering synthetic examples allows us to better demonstrate where the TLp distance will be successful. In particular synthetic examples can simplify the analysis and allow us to draw attention to features that may be obscured in real world applications.

The first synthetic example considers three classes where we can analytically compute the within class distances and between class separation. This allows us to compare how well we expect TLp distances to perform in a classification problem.

The second synthetic example uses simulated 2D data from one-hump and two-hump functions. We test how well the TLp distance recovers the classes and compare with OT and Lp distances.

Our first real world application is in classifying multivariate time series and 2D images. We choose a multivariate time series data set where we expect transport based methods to be successful but OT cannot be immediately applied. That is, OT must be applied to measures that are real valued so cannot be directly applied to multi-channelled signals. As a benchmark we find the OT distance on each channel separately and take the average over all channels. This would seem reasonable when channels are independent, but independence is not a good assumption in the AUSLAN dataset where we expect temporally correlated signals.

Our chosen data set consists of sequences of sign language data (we define the data set in more detail shortly) which contain the position of both hands (parametrised by 22 variables) at each time. The $TL^p_\lambda$ distance can treat these signals as functions $f : [0,1] \to \mathbb{R}^{22}$. We expect to see certain features in the signals; however these may be shifted based on the speed of the speaker. The second data set contains 2D images that must be normalised in order to apply the OT distance. This normalisation distorts some of the features, leading to poor performance.

We repeat the classification experiment on the AT&T Database of Faces. This is a database of ten 2D greyscale images of forty subjects. Note that if the images were in colour then one cannot immediately apply the OT distance.

The second real world application is histogram specification and colour transfer. Histogram specification or matching, where one defines a map T that matches one histogram with another, is widely used to define a colour transfer scheme. In particular let $f : (x_i)_{i=1}^N \to \mathbb{R}^3$ represent a colour image by mapping pixels $x_i$ to a colour $f(x_i)$ (for example in RGB space). One defines a multidimensional histogram of colours on an image by $\phi(c) = \frac{1}{N}\#\{x_i : f(x_i) = c\}$. For colour images the histogram ϕ is a measure on ℝ3. For notational clarity we will call ϕ the colour histogram. One can equivalently define a histogram for greyscale images as a measure on ℝ.

Let ϕ and ψ be two colour histograms for images f and g respectively. The OT map T defines a rearrangement of ϕ onto ψ, that is $\psi = T_\#\phi$. In colour transfer the map T is used to colour the image f using the palette of g by $\hat{f}(x) = g(T(x))$.

The histogram contains only intensity information and in particular there is no spatial dependence. Using the $TL^p_\lambda$-optimal map we define spatially correlated histogram specification and explain how this can be applied to the colour transfer problem. We demonstrate that the $TL^p_\lambda$ distance produces a visually more appealing solution than the OT solution when spatial information is important. We also observe that, for colour images, computing OT maps is a 3D problem (the domain and range of the transport maps are in colour space), whereas computing $TL^p_\lambda$ maps is a 2D problem (the transport maps pixels to pixels). By Proposition 3.3 when λ is large we can approximate the OT distance between histograms by the $TL^p_\lambda$ distance. For images that use the full spectrum of colours, i.e. the colour histograms are 256 × 256 × 256, the size of the discretisation is $256^3 \approx 16.8 \times 10^6$. Hence the spatially correlated histogram specification method allows for a numerically efficient approximation of the OT induced histogram specification method when the size (in terms of number of pixels) of the images is less than 4096 × 4096.
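The 4096 × 4096 threshold follows from a short calculation, spelled out here for convenience:

$$256^3 = (2^8)^3 = 2^{24} = (2^{12})^2 = 4096^2 = 16\,777\,216 \approx 16.8 \times 10^6.$$

An image with fewer than 4096 × 4096 pixels therefore has fewer points in its pixel grid than there are bins in a full 8 bit RGB histogram, so the 2D pixel-to-pixel problem is the smaller of the two discretisations.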

### 4.1 1D Class Separation for Synthetic Data

#### Objective

We compare the expected classification power of TLp, Lp and OT distances with three classes of 1D signals that differ by position (translations), shape (1 hump versus 2 hump) and frequency (hump versus chirp).

#### Data Sets

We consider data from three classes defined in the accompanying figure. The first class contains single hump functions and the second class contains two hump functions. The third class consists of functions with one hump and one chirp, defined to be a high frequency perturbation of a hump. The classes are chosen to compare the performance of the TL2 distance with that of the L2 and OT distances with regard to identifying translations (where we expect the L2 distance to do poorly) with a class containing high frequency perturbations (where we expect the OT distance to do poorly).

#### Methods

For a distance to have good performance in classification and clustering problems it should be able to separate classes. To be able to quantify this we use the ratio of ‘between class separation’ to ‘class coverage radius’ that we define now.

Let $C_i^N = \{f_j^i\}_{j=1}^N$ be a sample of N functions from class $\mathcal{C}_i$. For a given radius r we let $G_i(r)$ be the graph defined by connecting any two points in $C_i^N$ with distance less than r. The distance will be defined using the $TL^p_\lambda$, L2 and OT distances. Let $R_{TL^p_\lambda}(C_i^N)$ be the smallest r such that $G_i(r)$ is a connected graph using the $TL^p_\lambda$ distance. Analogously we can define $R_{L^p}$ and $R_{OT}$.

We define ‘between class separation’ as the Hausdorff distance between classes:

$$d_{H,\rho}(C_i^N, C_j^N) = \max\left\{\sup_{f \in C_i^N} \inf_{g \in C_j^N} \rho(f,g),\ \sup_{g \in C_j^N} \inf_{f \in C_i^N} \rho(f,g)\right\}$$

where we will consider ρ to be one of the TL2, L2 or OT distances. Large values of $d_{H,\rho}(C_i^N, C_j^N)$ imply that the classes $C_i^N$ and $C_j^N$ are well separated.

When $R_\rho(C_i^N) \le d_{H,\rho}(C_i^N, C_j^N)$ then we say that the class $C_i^N$ is separable from class $C_j^N$ since for any $f \in C_i^N$ the nearest neighbour in $(C_i^N \cup C_j^N) \setminus \{f\}$ is also in class $C_i^N$. We define the pairwise property

$$\kappa_{ij}(\rho; N) = \frac{\mathbb{E}\, d_{H,\rho}(C_i^N, C_j^N)}{\max\{\mathbb{E} R_\rho(C_i^N),\ \mathbb{E} R_\rho(C_j^N)\}}$$

where we take the expectation over sample classes $C_i^N$. We will assume that the distribution over each class is uniform in the parameter ℓ. When $\kappa_{ij}(\rho; N) > 1$ then we expect classes $C_i^N$ and $C_j^N$ to be separable from each other.

As a performance metric we use the smallest value of N such that $\kappa_{ij}(\rho; N) \ge 1$. We let

$$N^*_{ij}(\rho) = \min\{N : \kappa_{ij}(\rho; N) \ge 1\}.$$

This measures how many data points we need in order to expect a good classification accuracy.
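For a single pair of samples the quantities above can be computed as in the following Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, `rho` is any metric supplied as a Python callable, the coverage radius is computed as the longest edge of a minimum spanning tree (the threshold at which the neighbourhood graph becomes connected), and the expectation over samples in the definition of κ is not taken.

```python
def hausdorff(A, B, rho):
    """Hausdorff distance between finite sets A and B under the metric rho."""
    a_to_b = max(min(rho(a, b) for b in B) for a in A)
    b_to_a = max(min(rho(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)


def coverage_radius(A, rho):
    """Threshold at which the r-neighbourhood graph on A becomes connected.

    Computed as the longest edge of a minimum spanning tree (Prim's algorithm).
    """
    n = len(A)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest connection of each point to the tree
    best[0] = 0.0
    radius = 0.0
    for _ in range(n):
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        radius = max(radius, best[i])
        for k in range(n):
            if not in_tree[k]:
                best[k] = min(best[k], rho(A[i], A[k]))
    return radius


def separation_ratio(A, B, rho):
    """Single-sample analogue of kappa_ij: class separation over coverage radius."""
    return hausdorff(A, B, rho) / max(coverage_radius(A, rho), coverage_radius(B, rho))
```

For example, with points on the real line and `rho = lambda a, b: abs(a - b)`, the samples `[0, 1, 3]` and `[10, 12]` have Hausdorff distance 10 and coverage radius 2 each, giving a ratio of 5, so the two samples would be considered well separated.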

#### Results

We leave the calculation to the appendix but the conclusion is

$$\begin{aligned} N^*_{12}(TL^2) &< N^*_{12}(OT) < N^*_{12}(L^2) \\ N^*_{13}(TL^2) &< N^*_{13}(OT) < N^*_{13}(L^2) \\ N^*_{23}(TL^2) &< N^*_{23}(OT) < N^*_{23}(L^2). \end{aligned}$$

In each case the TL2 distance outperforms the L2 and OT distances.

In each class the L2 distance has a larger value of R. This implies a larger data set is needed to accurately cover each class. This is due to the Lagrangian nature of signals within each class (translations) that is poorly represented by the L2 distance. The OT distance has the lowest (and therefore best) value of R in each class. Since each class is Lagrangian then the OT distance is very small between functions of the same class.

When considering between class separation the TL2 and L2 distances coincide and give a bigger (and better) between class distance than the OT distance. Since the class 𝒞3 can be written as a high frequency perturbation of functions in the class 𝒞2 then, in the OT distance, the functions from class 𝒞3 approximate functions from the class 𝒞2. The distance dH,OT(C2N,C3N) is therefore small so that one needs more data points in order to fully resolve these classes. We see a similar effect when considering dH,OT for the other classes.

### 4.2 2D Classification for Synthetic Data

#### Objective

We use simulated data to illustrate better separation of the TLp distance compared to Lp and OT distances for 2D data from two classes of 1-hump and 2-hump functions.

#### Data Sets

The data set consists of two dimensional images simulated from the following classes

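$$\mathbb{P} = \left\{ p|_{[0,1]^2} : p(x) = \alpha\phi(x|\gamma,\sigma),\ \gamma \sim \mathrm{unif}([0,1]^2),\ \alpha \sim \mathrm{unif}([0.5,1]) \right\}$$

$$\mathbb{Q} = \left\{ q|_{[0,1]^2} : q(x) = \alpha\phi(x|\gamma_1,\sigma) - \alpha\phi(x|\gamma_2,\sigma),\ \gamma_1,\gamma_2 \overset{\mathrm{iid}}{\sim} \mathrm{unif}([0,1]^2),\ \alpha \sim \mathrm{unif}([0.5,1]) \right\}$$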

where $\phi(\cdot|\gamma,\sigma)$ is the multivariate normal pdf with mean $\gamma \in \mathbb{R}^2$ and covariance $\sigma \in \mathbb{R}^{2\times 2}$. We choose σ = 0.01 × Id where Id is the 2 × 2 identity matrix. The first class, ℙ, is the set of multivariate Gaussians restricted to $[0,1]^2$ with mean uniformly sampled in $[0,1]^2$ and weighted by α uniformly sampled in [0.5,1]. The second class, ℚ, is the set of weighted differences between two Gaussian pdfs restricted to $[0,1]^2$ with means $\gamma_1, \gamma_2$ sampled uniformly in $[0,1]^2$. Note that the second class contains non-positive functions. See the accompanying figure for examples from each class.

We simulate 25 functions from each set and denote the resulting set of functions by $\mathcal{F} = \{f_i\}_{i=1}^N$ where N = 50.

#### Methods

Let $(\{f_i\}_{i=1}^N, D_{TL^2_\lambda})$ be a finite dimensional metric space where $D_{TL^2_\lambda}$ is the N × N matrix containing all pairwise $TL^2_\lambda$ distances, i.e. $D_{TL^2_\lambda}(i,j) = d_{TL^2_\lambda}(f_i, f_j)$. Similarly for $(\{f_i\}_{i=1}^N, D_{L^2})$ and $(\{f_i\}_{i=1}^N, D_{OT})$ where the optimal transport distance is defined by $d_{OT}(f,g) = OT(f,g)$ and OT is given by (1) for $c(x,y) = |x-y|^2$.

To apply the optimal transport distance we need to renormalise so that signals are all non-negative and integrate to the same value. We do this by applying the nonlinear transform $N(f) = \frac{f-\beta}{\int (f-\beta)}$ where $\beta = \min_{f \in \mathcal{F}} \min_{x \in [0,1]^2} f(x)$. Neither the L2 nor TL2 distances require normalisation.
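A minimal sketch of this renormalisation for sampled signals is given below. The function name `normalise_for_ot` is hypothetical, and the integral is replaced by a plain sum over samples, which assumes a uniform grid with unit cell size.

```python
def normalise_for_ot(signals):
    """Apply N(f) = (f - beta) / sum(f - beta), beta = minimum over all signals.

    `signals` is a list of equally sampled signals (lists of floats). The
    integral is approximated by a plain sum (uniform grid, unit cell size).
    """
    beta = min(min(f) for f in signals)
    normalised = []
    for f in signals:
        shifted = [v - beta for v in f]
        mass = sum(shifted)
        if mass == 0:
            raise ValueError("signal is constant at the global minimum")
        normalised.append([v / mass for v in shifted])
    return normalised


signals = [[0.0, 1.0, 0.0], [0.5, -0.5, 0.0]]
out = normalise_for_ot(signals)
```

Here the global minimum is −0.5, so the first signal becomes proportional to (0.5, 1.5, 0.5). Its hump is still present but is flattened relative to the background, which is the compression effect that shifting by a constant introduces.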

We use non-metric multidimensional scaling (MDS) [39] to represent the graph in k dimensions. More precisely the aim is to approximate $(\{f_i\}_{i=1}^N, D_\cdot)$ by a metric space $(\{x_i\}_{i=1}^N, D_{|\cdot|_2})$ embedded in $\mathbb{R}^k$ ($D_{|\cdot|_2}$ is the matrix of pairwise distances using the Euclidean distance, i.e. $D_{|\cdot|_2}(i,j) = |x_i - x_j|_2$). This is done by minimising the stress S defined by

$$S_{TL^2_\lambda}(k) = \frac{\sum_{i,j=1}^N \left(|x_i - x_j|_2^2 - F(D_{TL^2_\lambda}(i,j))\right)^2}{\sum_{i,j=1}^N |x_i - x_j|_2^2}$$

over $\{x_i\}_{i=1}^N \subset \mathbb{R}^k$ and monotonic transformations $F : [0,\infty) \to [0,\infty)$, with $S_{L^2}$ and $S_{OT}$ defined analogously. The classical solution to finding the MDS projection (for Euclidean distances) is to use the k dominant eigenvectors of the matrix of squared distances, after double centring, as coordinates weighted by the square root of the eigenvalue. More precisely, define $D^{(2)} = -\frac{1}{2} J \left[|f_i - f_j|^2\right]_{ij} J$ where $J = \mathrm{Id} - \frac{1}{N}\mathbb{I}$ and 𝕀 is the N × N matrix of ones. Let $\Lambda_k$ be the matrix with the k largest eigenvalues of $D^{(2)}$ on the diagonal and $E_k$ the corresponding matrix of eigenvectors. Then $X = E_k \Lambda_k^{1/2}$ is the MDS projection. Increasing the dimension k of the projected space leads to a better approximation. In the accompanying figure we show the projection in the L2, TL2 and OT distances for k = 2 as well as the dependence of S on k for each choice of distance.

#### Results

Our results in the accompanying figure show that the TL2 distance is the better distance for this problem. There is no separation in either the L2 or OT distances whereas the TL2 distance completely separates the data. It should not therefore be surprising that the 1NN classifier with the TL2 distance outperforms the others. In fact, using 5-fold cross-validation (CV) we get 100% accuracy with the TL2 distance, compared to 72% in the L2 distance and 86% in the OT distance. In addition we see that the stress S is much smaller and converges quickly to zero for the TL2 distance, which indicates that the TL2 distance is, in this problem, more amenable to a low dimensional representation than either the OT or L2 distances.

### 4.3 Classification with Real World Data Sets

#### Objective

We evaluate $TL^2_\lambda$ as a distance to classify real world data sets where spatial and intensity information is expected to be important and compare with popular alternative distances. We choose one data set of multivariate time series and a second data set consisting of images.

#### Data Sets

We use two data sets. The first is the AUSLAN [30, 41] data set which contains 95 classes (corresponding to different words) from a native AUSLAN (Australian Sign Language) speaker using 22 sensors on a CyberGlove (recording position of x, y, z axis, roll, yaw, pitch for left and right hand). Therefore signals are considered as functions from $\{t_1, t_2, \dots, t_N\}$ to $\mathbb{R}^{22}$. There are 27 signals in each class, giving a total of 2565 signals.

We make two pre-processing steps. The first is to truncate each signal so it is 44 frames in length. Empirically we find that the signal is constant after the 44th frame and therefore there is no loss of information in truncating the signal. The second pre-processing step is to normalise each channel independently. This is because some channels are orders of magnitude greater than others and would otherwise dominate each choice of distance.

The second data set we use is the AT&T Database of Faces [70]. The dataset consists of ten greyscale facial images from each of forty subjects (see the accompanying figure for examples), giving 400 images in total. In order to reduce the computation time we reduced the size of the images from 92 × 112 pixels to 50 × 50 pixels.

#### Methods

For the multivariate time series we compare the performance of a 1NN classifier using the L2 and $TL^2_\lambda$ distances as well as the state-of-the-art method dynamic time warping [22] and the OT distance averaged over each channel:

$$d_{MOT}(f,g) = \frac{1}{22}\sum_{i=1}^{22} d_{OT}(\hat{f}_i, \hat{g}_i)$$

where $\hat{f}_i$ is the ith channel of f after normalisation, given by $\hat{f}_i = \frac{f_i + c}{\int (f_i + c)}$, and where c is chosen so that each signal is non-negative. We use the OT distance with cost $c(x,y) = |x-y|^2$.
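The channel-wise construction can be sketched in Python as follows. This is an illustration under several assumptions that are not taken from the paper: the function names are hypothetical, all channels are sampled on a common grid `xs`, the shift is chosen per channel as `c = max(0, -min f_i)`, and `ot_1d` returns the transport cost itself (take a square root if the Wasserstein convention is wanted). It relies on the standard fact that in one dimension the monotone coupling is optimal for convex costs.

```python
def ot_1d(a, b, xs, p=2):
    """Transport cost sum |x - y|^p pi(x, y) between histograms a, b on grid xs.

    Both a and b must be non-negative with equal total mass. The monotone
    (north-west corner) coupling is used, which is optimal in 1D for convex costs.
    """
    a, b = list(a), list(b)
    i = j = 0
    cost = 0.0
    while i < len(a) and j < len(b):
        moved = min(a[i], b[j])
        cost += moved * abs(xs[i] - xs[j]) ** p
        a[i] -= moved
        b[j] -= moved
        if a[i] <= 1e-15:
            i += 1
        if j < len(b) and b[j] <= 1e-15:
            j += 1
    return cost


def normalise_channel(f):
    """hat f_i = (f_i + c) / sum(f_i + c) with the assumed shift c = max(0, -min f_i)."""
    c = max(0.0, -min(f))
    shifted = [v + c for v in f]
    mass = sum(shifted)
    if mass == 0:
        raise ValueError("channel has no mass after shifting")
    return [v / mass for v in shifted]


def mot_distance(f, g, xs):
    """Average of the channel-wise OT costs; f and g are lists of channels."""
    assert len(f) == len(g)
    total = sum(
        ot_1d(normalise_channel(fi), normalise_channel(gi), xs)
        for fi, gi in zip(f, g)
    )
    return total / len(f)


xs = [0, 1, 2, 3]
f = [[1, 0, 0, 0], [0, 1, 0, 0]]
g = [[0, 0, 1, 0], [0, 1, 0, 0]]
d = mot_distance(f, g, xs)
```

In this two channel example the first channel moves a unit mass a distance of 2 (cost 4) and the second channel is unchanged, so the average is 2. Because each channel is matched independently, nothing forces the channels to be transported consistently in time.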

There are three common variations of dynamic time warping. One can apply dynamic time warping directly to the signals f and g (denoted by DTW), to the derivative f′ of the signals (denoted by DDTW) and to a weighted average of DTW and DDTW (denoted by WDTW). We define

$$\begin{aligned} d_{DDTW}(f,g) &= d_{DTW}(f',g') \\ d_{WDTW}(f,g) &= \alpha\, d_{DTW}(f,g) + (1-\alpha)\, d_{DDTW}(f,g). \end{aligned}$$

The parameter α is chosen by 5-fold 2nd depth cross-validation. To be more precise, the data set is split into five partitions. One forms the testing data set (accounting for 20% of the data) and the other four form the training data set. To choose α we further divide the training data set into five partitions (each accounting for 16% of the data set). For each $\alpha = \frac{i}{100}$, where i = 0, 1, …, 100, we compute how accurately one partition of the training set is classified using the remaining four parts. We then choose the value of α which produces the best classification accuracy on the training data. This value of α is then used to classify the testing data set.
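The nested procedure can be sketched as follows, assuming precomputed N × N matrices `D` (distances between signals) and `DD` (distances between derivatives). The function names, the deterministic interleaved folds, and the averaging of the inner accuracy over all five inner partitions are assumptions made for illustration; the text does not specify how the partitions are drawn.

```python
def one_nn_accuracy(dist, labels, train, test):
    """Fraction of `test` indices whose nearest `train` index has the same label."""
    correct = 0
    for t in test:
        nearest = min(train, key=lambda i: dist(t, i))
        correct += labels[nearest] == labels[t]
    return correct / len(test)


def folds(indices, k=5):
    """Split indices into k interleaved folds (deterministic, no shuffling)."""
    return [indices[r::k] for r in range(k)]


def choose_alpha(D, DD, labels, train, k=5, grid=101):
    """Inner cross-validation over alpha = i / 100 using training indices only."""
    best_alpha, best_acc = 0.0, -1.0
    for step in range(grid):
        alpha = step / (grid - 1)
        dist = lambda a, b: alpha * D[a][b] + (1 - alpha) * DD[a][b]
        accs = []
        for held_out in folds(train, k):
            rest = [i for i in train if i not in held_out]
            accs.append(one_nn_accuracy(dist, labels, rest, held_out))
        acc = sum(accs) / k
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha


def nested_cv_error(D, DD, labels, k=5):
    """Outer k-fold error rate of the weighted 1NN classifier, alpha tuned per fold."""
    indices = list(range(len(labels)))
    errors = []
    for test in folds(indices, k):
        train = [i for i in indices if i not in test]
        alpha = choose_alpha(D, DD, labels, train, k)
        dist = lambda a, b: alpha * D[a][b] + (1 - alpha) * DD[a][b]
        errors.append(1 - one_nn_accuracy(dist, labels, train, test))
    return sum(errors) / k
```

The key point is that the test partition is never seen while α is being selected, so the reported error is not biased by the tuning. The same routine applies to any pair of signal and derivative distances (DTW, L2, TL2 or MOT) once the two matrices have been computed.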

The analogous distances for L2, TLλ2 and multi-channelled OT are defined by

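$$\begin{aligned} d_{DL^2}(f,g) &= d_{L^2}(f',g') \\ d_{DTL^2_\lambda}(f,g) &= d_{TL^2_\lambda}(f',g') \\ d_{DMOT}(f,g) &= d_{MOT}(f',g') \\ d_{WL^2}(f,g) &= \alpha\, d_{L^2}(f,g) + (1-\alpha)\, d_{DL^2}(f,g) \\ d_{WTL^2_\lambda}(f,g) &= \alpha\, d_{TL^2_\lambda}(f,g) + (1-\alpha)\, d_{DTL^2_\lambda}(f,g) \\ d_{WMOT}(f,g) &= \alpha\, d_{MOT}(f,g) + (1-\alpha)\, d_{DMOT}(f,g). \end{aligned}$$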

We do not have to choose the same value of λ in the $TL^2_\lambda$ and $DTL^2_\lambda$ distances; however, considering that signals are normalised, we will use the same value. Note that DL2, DTW, DDTW, WDTW, $DTL^2_\lambda$ and DMOT are not metrics.

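We remark that an alternative method for including derivatives in the $TL^p_\lambda$ distance would be to extend the signal to include the derivative. We briefly assume that f is defined over a continuous domain. Let $f : \mathbb{R} \to \mathbb{R}$ and $\tilde{f} = \left(f, \frac{\mathrm{d}f}{\mathrm{d}x}\right)$, then we define

$$d_{TW^{1,p}_\lambda}(f,g) = d_{TL^p_\lambda}(\tilde{f}, \tilde{g}).$$

We take our notation $TW^{k,p}_\lambda$ from the Sobolev space notation where $W^{k,p}$ is the Sobolev space with k weak derivatives integrable in Lp. There is no reason to limit this to one derivative, and we may define $\tilde{f} = \left(f, \frac{\mathrm{d}f}{\mathrm{d}x}, \dots, \frac{\mathrm{d}^k f}{\mathrm{d}x^k}\right)$ and

$$d_{TW^{k,p}_\lambda}(f,g) = d_{TL^p_\lambda}(\tilde{f}, \tilde{g}).$$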

When the signals are discrete one should use a discrete approximation of the derivative. In order to be consistent with previous extensions of dynamic time warping we do not develop this approach here.

Dynamic time warping is only defined on time series so we are not able to apply it to the AT&T Database of Faces. We apply the optimal transport distance by normalising each image $f : \mathbb{R}^2 \to \{0, 1\}$ via $\hat{f}(x) = \frac{f(x)}{\int_{[0,1]^2} f(y)\,\mathrm{d}y}$. There is no normalisation for either the L2 or TL2 distances. We find the 1NN classifier using the TL2, L2 and OT distances.

We will use λ =1 in AUSLAN and λ = 0.1 in the AT&T Database of Faces for the TLλ2 based distances. The underlying measure μ is chosen to be the uniform measure defined on [0,1] or [0,1]2.

#### Results

We considered two methods for comparing the performance of each distance. The first is the 1NN classification accuracy in each distance. We use the 1NN classification accuracy as a measure as to how well each distance captures the underlying geometry. A higher accuracy implies closest neighbours are more likely to belong to the same class.

The results are given in Table 1, where we report error rates using 5-fold cross-validation. In terms of the 1NN classifier for the AUSLAN data set we see that TL2 is better than L2 and is a modest improvement over dynamic time warping.

### Table 1

| | L2 | DTW | $TL^2_\lambda$ | MOT |
|---|---|---|---|---|
| Signal | 15.39% | 11.45% | 12.12% | 61.71% |
| Derivative | 22.15% | 19.77% | 12.63% | 10.41% |
| Weighted Average | 8.06% | 7.33% | 6.70% | 10.41% |

A rather surprising result is the difference between the MOT distance between the signals and the MOT distance between the derivatives of the signals. We believe this is most likely due to the length of the word being a good indicator of the word (this would also explain why the L2 distance has reasonable performance). We can see from the example signal in the accompanying figure that only the first part of the signal contains information (the word being spoken); the remainder of the signal is noise. Because we need to renormalise in order to apply the MOT distance then, similar to the earlier example, the difference between the first part of the signal (containing information) and the latter part of the signal (containing noise) is reduced.

On the other hand the derivative of the signal will place a lot of mass at the end of the word, with smaller masses in other places where the signal is changing. In particular MOT is now able to identify the length of the signal, leading to a big improvement in performance. Furthermore, some channels are likely to contain more information than others. The decoupling of channels in the MOT distance could be an advantage as simultaneously matching across all channels, as in the TLλp distance, can mean the latter distance is corrupted by low quality channels. This artefact could be removed by weighting channels (this would require training the distance). Since we expect a temporally correlated distance to be a better model then, when weighting channels, we would expect to see an improved performance of the TLp distance over the MOT distance.

Our results indicate that the $TL^p_\lambda$ distance better represents the geometry of the dataset than any of the other (pseudo) distances. However a 1NN classifier should not be expected to achieve the best classification results. We refer to [2] for a state-of-the-art neural network which produces a much better classification error than the 1NN method considered here (the smallest error rate in [2] for AUSLAN is 2.53%, using a training data set equal to $\frac{4}{9}$ of the total data set). We stress that the aim of this paper is to introduce a distance that better models signals where both spatial and intensity information are important, not to define a new classification method.

For the AT&T Database of Faces the 1NN classifier using the OT distance performs worst, with an error of 3.3%. The L2 distance is second best with 2.5% and the TL2 distance is the best with 2%.

In the same spirit as Section 4.1 we define the performance metric $\kappa_{ij}(\rho)$ as the ratio of the distance between class i and class j to the maximum class coverage radius of class i and class j. For the distance between classes we use the Hausdorff distance (see Section 4.1) and for the class coverage radius we use the minimum radius r such that connecting any two data points in class i closer than r defines a connected graph. We plot the results in the accompanying figure. The x axis represents pairs of classes where for visual clarity we have ordered the pairs so that κ(L2) is increasing. A large value of $\kappa_{ij}$ indicates that it is easier to identify class i from class j whereas a small value indicates that identifying the two classes is a difficult problem.

For AUSLAN we see that the TL2 distance has, for the majority of pairs of classes, a larger value of κij than the L2 distance and DTW and therefore better represents the class structure. The MOT distance does poorly, except in a few cases. We notice that all distances follow the trend that class separation is increasing with κ(L2).

For the AT&T Database of Faces the L2 and TL2 distances perform very similarly. However, both the TL2 and L2 distances are much more consistent than the OT distance. Although between some classes the OT distance achieves the best results, with other classes the OT distance does extremely poorly (there are many more classes with a class separation close to 1).

### 4.4 Histogram Specification and Colour Transfer with the TLp Distance

#### Histogram specification and colour transfer

Histogram specification concerns the problem of matching one histogram onto another. For a function f on a discrete domain X the histogram is given by f#μ where μ is the uniform discrete measure supported on N points. We do not make any assumption on the dimension of the codomain of f (so that f may be multivalued and the histogram may be multidimensional). This coincides with the definition given in the introduction to the section, that is

$$
f_{\#}\mu(y) = \frac{1}{N}\,\#\{x \in X : f(x) = y\}.
$$

Given two functions f : X → ℝm and g : Y → ℝm, with histograms ϕ and ψ respectively, histogram specification is the problem of finding a map T : X → Y such that ψ = T# ϕ.

The colour transfer problem is the problem of colouring one image f with the palette of an exemplar image g. A common method used to solve this problem is to use histogram specification where T is the minimizer to Monge's optimal transport problem (2) between ϕ and ψ [14, 17, 49, 62, 63]. Let our colour space be denoted by 𝒞 where, for example, if the colour space is 8 bit RGB then 𝒞 = {0,1,…, 255}³. The colour histogram then defines a measure over 𝒞. If we consider two such histograms ϕ and ψ corresponding to images f : X → 𝒞 and g : Y → 𝒞 respectively then a histogram specification is a map T : 𝒞 → 𝒞 that satisfies ψ = T#ϕ. The recoloured image f̂ = T∘f has the same colour histogram as g. The solution f̂ is a recolouring of f using the palette of g.

If we consider grayscale images then 𝒞 = [0,1] and the optimal transport map (assuming it exists) is a monotonically increasing function. In particular this implies that if pixel x is lighter than pixel y (i.e. f(x) > f(y)) then in the recoloured image f̂ = T∘f pixel x is still lighter than pixel y. In this sense the OT solution preserves intensity ordering. But note that no spatial information is used to define T; only the difference in intensity between pixels is used and not the distance between pixels.
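For grayscale images the monotone map can be illustrated by sorting. The sketch below assumes, for simplicity, that both images are flattened lists with the same number of pixels; ties in intensity are broken by pixel index, which is a choice made here for illustration (with ties a map between histograms need not exist). The function name is hypothetical.

```python
def specify_histogram_gray(f, g):
    """Recolour the flat grayscale image f with the intensities of g."""
    if len(f) != len(g):
        raise ValueError("this sketch assumes the same number of pixels")
    order = sorted(range(len(f)), key=lambda i: f[i])  # pixels of f, darkest first
    palette = sorted(g)                                # intensities of g, darkest first
    f_hat = [0.0] * len(f)
    for rank, pixel in enumerate(order):
        # The k-th darkest pixel of f receives the k-th darkest intensity of g.
        f_hat[pixel] = palette[rank]
    return f_hat


f = [0.9, 0.1, 0.5, 0.3]
g = [0.2, 0.4, 0.6, 0.8]
f_hat = specify_histogram_gray(f, g)
```

The output has exactly the intensities of g, and the ordering of pixels by brightness is the same as in f. Pixel positions never enter the computation.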

#### Spatially correlated histogram specification

Let ϕ and ψ be the histograms corresponding to images f : X → ℝm and g : Y → ℝm respectively. If we recall Proposition 3.3 then $\lim_{\lambda\to\infty} d_{TL_\lambda^p}((f,\mu),(g,\nu)) = d_{OT}(f_{\#}\mu, g_{\#}\nu)$ (where μ and ν are the discrete uniform measures over the sets X and Y). For λ < ∞ the TLλp distance includes spatial and intensity information. Hence the TLλp distance provides a generalization of OT induced histogram specification.

Analogously to the OT induced histogram specification method we define the spatially correlated histogram specification to be histogram specification using the map T : X → Y which is a minimizer to Monge's formulation of the TLλp distance (6). When the images are of the same size then, by Proposition 3.5 such a map exists. The recoloured image f̂ of f is given by f̂ = g∘T. Furthermore when the images are of the same size the map T is a rearrangement of the pixels in X and therefore the histograms are invariant under T. In particular the histogram of f̂ is the same as the histogram of g.
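A brute-force sketch of this construction for very small images of equal size follows. The cost (6) is not restated in this section, so the form used below, a squared spatial distance scaled by 1/λ plus a squared intensity distance, is an assumption chosen to agree with the limit λ → ∞ recovering OT. The function names are hypothetical, and enumerating permutations is only feasible for a handful of pixels; a practical implementation would use an assignment or linear programming solver.

```python
from itertools import permutations


def tl2_cost(x, y, fx, gy, lam):
    # Assumed cost: |x - y|^2 / lam + |f(x) - g(y)|^2 (p = 2).
    spatial = sum((a - b) ** 2 for a, b in zip(x, y))
    intensity = sum((a - b) ** 2 for a, b in zip(fx, gy))
    return spatial / lam + intensity


def spatially_correlated_specification(points, f, g, lam):
    """Return (f_hat, T), where T[i] is the pixel of g matched to pixel i of f."""
    n = len(points)
    best_T, best_cost = None, float("inf")
    for T in permutations(range(n)):  # exhaustive search over rearrangements
        cost = sum(tl2_cost(points[i], points[T[i]], f[i], g[T[i]], lam)
                   for i in range(n))
        if cost < best_cost:
            best_T, best_cost = T, cost
    f_hat = [g[j] for j in best_T]  # f_hat = g o T
    return f_hat, best_T


# Four pixels on a line; intensities are 1-tuples so colours would also work.
points = [(0,), (1,), (2,), (3,)]
f = [(0.0,), (0.0,), (1.0,), (1.0,)]
g = [(0.9,), (0.1,), (0.8,), (0.2,)]
_, T_small = spatially_correlated_specification(points, f, g, lam=0.01)
f_hat_large, T_large = spatially_correlated_specification(points, f, g, lam=1e6)
```

With small λ moving pixels is expensive and the matching is the identity. With large λ the matching is driven by intensity, so the dark pixels of f receive the dark values of g regardless of position.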

Although we propose the spatially correlated histogram specification as a method to incorporate spatial structure, we recall from the discussion at the start of Section 4 the value of the method as a numerically efficient approximation to OT induced histogram specification for colour images that are not too large. Motivated by Proposition 3.3 one expects that for large λ the TLλp map is approximately the OT map between colour histograms. The OT problem is posed in the colour space 𝒞 which, for colour images, is 3 dimensional. However, the TLλp problem is posed in the domain of the images Ω, which is typically 2 dimensional. Hence one can use the TLλp distance to approximate OT induced histogram specification in a lower dimensional space when $O(n_c^3) = |\mathcal{C}| > |\Omega| = O(n_s^2)$, where $n_c$ is the size of the discretisation in each colour channel and $n_s$ is the size of the discretisation in each spatial dimension.
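As a rough worked instance, assuming 8 bit RGB colours (so that each channel has 256 levels) and a 128 × 128 image domain, the two problem sizes compare as

$$
|\mathcal{C}| = 256^3 = 16\,777\,216 \quad > \quad |\Omega| = 128^2 = 16\,384 .
$$

Under these assumptions the spatial problem is about a thousand times smaller than the colour-space problem.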

We briefly remark that histogram specification methods often include additional regularization terms. Such choices of regularization on the transport map include penalizing the gradients [17, 62, 63], sparsity [63], average transport [56] and rigidity [28]. One could apply any of the above regularizations to spatially correlated histogram specification.

#### Examples

First, let us consider the 128 × 128 grayscale images in the accompanying figure. The objective is to combine the shading of the first image with the geometry of the second image. We are motivated by the scenario where one wishes to combine information about a scene obtained by two different measurements: one where intensities (dynamical range) are well resolved, but the spatial resolution (geometry) is not well captured, and another where dynamical range is poorly captured, but the geometry is well resolved. We furthermore allow that the scenes captured may be somewhat different. The desire is to combine the images to obtain a single image with both good geometry and intensity. The solution we propose is to use spatially correlated histogram specification to re-shade the image with low quality intensity.

The result, shown in the accompanying figure, produces what we consider to be the desired output. The shading has been transferred and the geometry has not been lost. One is not able to apply histogram specification (induced by the OT map) due to the lack of existence of an optimal transport map from the histogram of the original image ϕ to the histogram of the exemplar image ψ. This is because the histogram of the original image is a sum of two delta masses.

As a more challenging example we consider real world colour images. Images are 128 × 128. We compare our method with histogram specification using the OT mappings and the following state of the art methods for which code is freely available. Reinhard, Ashikhmin, Gooch and Shirley's renormalisation method (RAGS) [66] rescales the image so that the mean and standard deviation of the LAB channels match the exemplar image. Pitié and Kokaram (PK) [61] approximate colour histograms with a Gaussian and look for the best linear map between the two colour histograms. Essentially they look for couplings, as in the Monge formulation of optimal transport but they restrict to linear mappings. In general there may not be any linear mappings between two histograms, however it can be shown that when the histograms are Gaussians the set of mappings is not empty. The final method we compare to is the regularized transportation method due to Ferradans, Papadakis, Rabin, Peyré and Aujol (FPRPA) [17].

It is difficult to quantitatively assess the performance of each method objectively. Whether the output is satisfactory depends on content and artistic preferences. We refer to [86] for an objective quantitative measure of colour transfer results; however, this explicitly marks against introducing colour artefacts. Whilst some colour artefacts are clearly undesirable, introducing others, such as the northern lights in the example below, was the objective. Hence we know of no way to quantify the performance of our method and instead rely on qualitative assessments.

In the first pair of images the exemplar image contains a few trees with the northern lights in the background, whilst the other image has a few trees with a mostly clear sky in the background. The challenge is to recreate the northern lights in the second image.

As one would expect, we see that the histogram specification induced by OT loses the spatial structure. Indeed, it is hard to recognise the northern lights. The same holds for each competing method: none of them manages to reproduce the northern lights, and the palm trees all pick up an unnatural reddish shade. The spatially correlated histogram specification solution does a much better job at preserving the ordering locally. As λ increases it becomes cheaper to match pixels that are further apart and therefore, for large λ, the matching does not preserve the local structure in the exemplar image.

In the second real world example we consider colour transfer between two images from Masson's trichrome staining procedure. We manipulate the luminosity of the second image. The objective is to colour the second image using the palette of the first. We compare the spatially correlated histogram specification method of TLλp to the other methods.

Since we know the true image we may compare the colourisation with the true image. We report the L2 error computed by

$$
\mathrm{err}(f-g) = \frac{1}{N}\sum_{i=1}^{3}\sum_{j}\left|f_i(x_j) - g_i(x_j)\right|^2
$$

where f = (f1, f2, f3) and g = (g1, g2, g3) are images in RGB space and N = 128² is the number of pixels. The TLp method (error 0.2885) gives a more accurate colourisation compared to the RAGS method (error 0.4040), the PK method (error 0.3568), the FPRPA method (error 0.4817), and the OT induced histogram specification method (error 0.4030). One can also see that the TL12 based method does not have the same artefacts as the other methods. In particular, (a) the darker band is still evident in RAGS and PK, (b) FPRPA fails to accurately recolour the white band on the right hand side, and (c) OT places too much white on the left hand side and not enough on the right hand side.
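The error formula can be written directly in code. The sketch below assumes each image is a flat list of RGB triples in the same pixel order; the scaling of the channel values used for the reported numbers is not stated here, so the toy values are purely illustrative and the function name is hypothetical.

```python
def colour_error(f, g):
    """(1/N) times the sum over pixels and RGB channels of squared differences."""
    n_pixels = len(f)
    total = sum((fc - gc) ** 2
                for f_pixel, g_pixel in zip(f, g)
                for fc, gc in zip(f_pixel, g_pixel))
    return total / n_pixels


f_img = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
g_img = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
err = colour_error(f_img, g_img)
```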

## 5 Conclusions

In this paper we have developed and applied a distance that directly accounts for the intensity of the signal within a Lagrangian framework. This differs from OT distances that do not directly measure intensity and the Lp distance which measures intensity only. Through applications we have shown the potential of this distance in signal analysis.

The distance is widely applicable. Unlike classical OT distances, such as the Wasserstein distance or the earth mover distance, the TLλp distance does not require treating signals as measures. Treating a signal as a measure implies the following constraints: non-negativity, normalised mass and a single channel. None of these assumptions are necessary for the TLλp distance. Furthermore, the distance is applicable to both discrete and continuous signals and allows practitioners to emphasise features, which in many cases should allow for a better representation of data sets; for example, one could include derivatives.

Efficient existing methods, such as entropy regularized or multi-scale linear programming, for optimal transport are applicable to the TLλp distance. In fact any numerical method for optimal transport that can cope with arbitrary cost functions is immediately available. This includes the entropy regularised approach of Cuturi [16]. However, there are more efficient methods that are specific to the OT distance with quadratic cost that are unavailable here, e.g. [74].

Via the representation as an OT distance between measures supported on graphs we expect many other results for OT distances to carry through to TLλp distances. For example, one could extend the LOT method [85] for signal representation and analysis to the TLλp distance. This would allow pairwise distances of a data set to be computed with numerical cost that is linear in the number of images. We leave the development for future work.

We considered a few examples where we expected (and then showed) that TLλp would outperform OT and other distances. We expect the TLλp distance to give a better performance than OT distances when intensity information is important. On the other hand, we do not expect the TLλp distance to be robust to high frequency noise. In this case an OT distance would probably have superior performance.

The applications we considered were classification and histogram specification in the context of colour transfer. For classification we chose data sets that have a Lagrangian nature but are either multi-channelled or non-positive (so that in both cases one must apply ad-hoc methods in order to apply the OT distance). We showed that the TLλp distance better represented the underlying geometry. The 1NN classifier is a very simple method and we expect our results here could be significantly improved by, for example, replacing the L2 distance in the MDS projection approach of Weinberger and Chapelle [87] with the TLλp distance. For the colour transfer problem we defined a spatially correlated histogram specification method which produced more visually appealing results when combining the colour of one image with the geometry of another.

Although the main motivation was to develop a distance which better represents Lagrangian data sets, we also note that the TLλp distance provides a numerically efficient (for images that are not too large) approximation for the OT induced histogram specification method by, for 2-dimensional colour images, reducing the effective dimension of the problem from three for OT distances to two for the TLλp distance. We also observe that the effective dimension of multi-channelled time signals is one. In particular the effective dimension is independent of the number of channels.

The applications we have considered serve as a demonstration of the performance of the TLλp distance. A next step would be to consider a more detailed study of a specific problem. For example, in the colour transfer application we could have considered regularization terms/constraints which would have improved the performance, e.g. [17, 28, 51, 56, 62, 63]. It was not the aim to propose a state-of-the-art method for each application; indeed each application would constitute a paper in its own right.

## Acknowledgments

The authors gratefully acknowledge funding from the NSF (CCF 1421502) and the NIH (GM090033, CA188938) in contributing to a portion of this work. DS also acknowledges funding by NSF (DMS-1516677). The authors are grateful to the Center for Nonlinear Analysis at CMU for its support. In addition the authors would like to thank the referees for their valuable comments that led to significant improvements in the paper.

## A Performance of TLλp in Classification Problems with Simple and Oscillatory Signals

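We compare the performance of the TLλ2, L2 and OT distances with respect to classification/clustering for the three classes $\{\mathcal{C}_i\}_{i=1,2,3}$ of signals considered in Section 4.1. We test how each distance performs by finding the smallest number of data points such that the classes $\mathcal{C}_i^N = \{f_k\}_{k=1}^{N} \subset \mathcal{C}_i$ are separable. For sufficiently large N the approximation $d_{H,\rho}(\mathcal{C}_i^N, \mathcal{C}_j^N) \approx d_{H,\rho}(\mathcal{C}_i, \mathcal{C}_j)$ is used to simplify the computation. Similarly, as a proxy for $ER_\rho(\mathcal{C}_i^N)$ we use $R_\rho(\hat{\mathcal{C}}_i^N)$ where

$$
\hat{\mathcal{C}}_i^N = \left\{ f_\ell \,:\, \ell = \ell_{\min}^i + \frac{n-1}{N-1}\left(\ell_{\max}^i - \ell_{\min}^i\right),\ n \in \{1,2,\dots,N\} \right\}
$$

is the uniform sample from class $\mathcal{C}_i$ (recall that class $\mathcal{C}_i$ is parameterized by $\ell \in [\ell_{\min}^i, \ell_{\max}^i]$ and, with an abuse of notation, we use the subscript of $f_\ell$ to denote the dependence on ℓ).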

It follows that the class separation distances and class coverage radius are approximated by

dH,L22(C1N,C2N)≈α2RL22(C1N)≈2NdH,L22(C1N,C3N)≈3α4RL22(C2N)≈1NdH,L22(C2N,C3N)≈α4RL22(C3N)≈2αNγdH,OT2(C1N,C2N)≈β2α4ROT2(C1N)≈αN2dH,OT2(C1N,C2N)≈β2α4ROT2(C1N)≈αN2dH,OT2(C1N,C3N)≈β2α4ROT2(C2N)≈αN2dH,OT2(C2N,C3N)≈αγ28ROT2(C3N)≈αN2dH,TLλ22(C1N,C2N)≈α2RTLλ22(C1N)≈α2NdH,TLλ22(C1N,C3N)≈3α4RTLλ22(C2N)≈4α2NdH,TLλ22(C2N,C3N)≈α4RTLλ22(C3N)≈α2N.

We have

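$$
\begin{aligned}
\kappa_{12}^2(L^2;N) &\approx \frac{\alpha N}{4}, & \kappa_{13}^2(L^2;N) &\approx \frac{3\gamma N}{8}, & \kappa_{23}^2(L^2;N) &\approx \frac{\gamma N}{8},\\
\kappa_{12}^2(OT;N) &\approx \frac{\beta^2 N^2}{4}, & \kappa_{13}^2(OT;N) &\approx \frac{\beta^2 N^2}{4}, & \kappa_{23}^2(OT;N) &\approx \frac{\gamma^2 N^2}{8},\\
\kappa_{12}^2(TL_\lambda^2;N) &\approx \frac{N}{8\alpha}, & \kappa_{13}^2(TL_\lambda^2;N) &\approx \frac{3N}{4\alpha}, & \kappa_{23}^2(TL_\lambda^2;N) &\approx \frac{N}{16\alpha}.
\end{aligned}
$$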

Finally we can compute N*,

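$$
\begin{aligned}
N_{12}^*(L^2) &\approx \frac{4}{\alpha}, & N_{13}^*(L^2) &\approx \frac{8}{3\gamma}, & N_{23}^*(L^2) &\approx \frac{8}{\gamma},\\
N_{12}^*(OT) &\approx \frac{2}{\beta}, & N_{13}^*(OT) &\approx \frac{2}{\beta}, & N_{23}^*(OT) &\approx \frac{\sqrt{8}}{\gamma},\\
N_{12}^*(TL^2) &\approx 8\alpha, & N_{13}^*(TL^2) &\approx \frac{4\alpha}{3}, & N_{23}^*(TL^2) &\approx 16\alpha
\end{aligned}
$$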

which for $\beta > \frac{\alpha}{2}$, $\beta > \frac{3\gamma}{4}$ and $\gamma < \frac{2\alpha}{8}$ implies the ordering given in Section 4.1.

## B Numerical Methods

In principle any numerical method for computing OT distances capable of dealing with an arbitrary cost function can be adapted to compute the TLλp distance. Here we describe two numerical methods we used in Section 4.

### B.1 Iterative Linear Programming

Here we describe the iterative linear programming method of Oberman and Ruan [53] which we abbreviate OR. Although this method is not guaranteed to find the minimum in (3) we find it works well in practice and is easier to implement than, for example, methods due to Schmitzer [71] that provably minimize (3) but require a more advanced refinement procedure. See also [46] and references therein for a multiscale descent approach.

The linear programming problem restricted to a subset $\mathcal{M} \subseteq \Omega_h^2$ is

$$
\begin{aligned}
\text{minimize:}\quad & \sum_{(i,j)\in\mathcal{M}} c_\lambda(x_i,x_j;f_h,g_h)\,\pi_{ij} \quad \text{over } \pi\\
\text{subject to}\quad & \sum_{i:(i,j)\in\mathcal{M}} \pi_{ij} = q_j, \qquad \sum_{j:(i,j)\in\mathcal{M}} \pi_{ij} = p_i,
\end{aligned}
\tag{LP$_h$}
$$

where cλ is given by (4). When $\mathcal{M} = \Omega_h^2$ the TLλp distance between (fh, μh) and (gh, νh) is the minimum of the above linear programme. Furthermore, if πh is the minimizer in the TLλp distance then it is also the solution to the linear programme (LPh) for any ℳ containing the support of πh. That is, if one already knows (or can reasonably estimate) the set of nodes ℳ for which the optimal plan is non-zero then one need only consider the linear programme on ℳ. This is advantageous when ℳ is a much smaller set. Motivated by Proposition 3.5 we expect to be able to write the optimal plan as a map. This implies that whilst πh has n² unknowns we only expect n of them to be non-zero.

The method proposed by OR is given in Algorithm 1. An initial discretisation scale h0 is given and an estimate $\pi_{h_0}$ is found for the linear programme (LPh) with $\mathcal{M} = \Omega_{h_0}^2$. One then iteratively finds $\mathcal{M}_r \subseteq \Omega_{h_r}^2$, where $h_r = h_{r-1}/2$, to be the set of nodes defined by the following refinement procedure. Find the set of nodes for which $\pi_{h_{r-1}}$ is non-zero, add the neighbouring nodes and then project onto the refined grid $\Omega_{h_r}^2$. The optimal plan $\pi_{h_r}$ on $\Omega_{h_r}^2$ is then estimated by solving the linear programme (LPh) with ℳ = ℳr.

The grid $\Omega_{h_r}$ will scale as $(2^{rd}h_0^{-1})^2$. If the linear programme is run N times then at the rth step the linear programme has on the order of $2^{rd}h_0^{-1}$ variables. In particular on the last (and most expensive) step the number of variables is $O(2^{Nd}h_0^{-1})$. This compares to size $(2^{Nd}h_0^{-1})^2$ if the linear programme was run on the final grid without this refinement procedure.

**Algorithm 1** An Iterative Linear Programming Approach [53]

**Input:** functions f, g ∈ Lp(Ω), measures μ, ν ∈ 𝒫(Ω) and parameters h0, N.

1. Set r = 0.
2. **repeat**
3. Define $\mathcal{S}_r = \Omega_{h_r}^2$, where $\Omega_{h_r}$ is the square grid lattice with distance $h_r$ between neighbouring points, and discretise the functions f, g and measures μ, ν on $\Omega_{h_r}$.
4. **if** r = 0 **then**
5. Solve (LPh) on $\mathcal{S}_0$ and call the output $\pi_{h_0}$.
6. **else**
7. Find the set of nodes on $\mathcal{S}_{r-1}$ for which $\pi_{h_{r-1}}$ is non-zero and call the set $\mathcal{K}_{r-1}$.
8. To $\mathcal{K}_{r-1}$ add all neighbouring nodes and call this set $\mathcal{N}_{r-1}$.
9. Define $\mathcal{M}_r$ to be the set of nodes on $\mathcal{S}_r$ that are children of nodes in $\mathcal{N}_{r-1}$.
10. Solve (LPh) restricted to $\mathcal{M}_r$ and call the optimal plan $\pi_{h_r}$.
11. **end if**
12. Set $h_{r+1} = h_r/2$ and r ↦ r + 1.
13. **until** r = N

**Output:** The optimal plan $\pi_{h_{N-1}}$ for (LPh).
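The refinement step (finding the support, adding neighbours, passing to children) can be sketched on a one-dimensional grid as follows. This is an illustration under stated assumptions rather than the implementation of [53]: nodes are indexed 0, …, n − 1, a coarse node i has children 2i and 2i + 1, neighbours are taken in the 3 × 3 index neighbourhood on the product grid, and the function name and tolerance are hypothetical. Solving the restricted linear programme itself is left to an LP solver and is not shown.

```python
def refine_support(plan, n_coarse, tol=0.0):
    """Return the index pairs on the refined product grid where the LP is posed.

    plan: dict {(i, j): mass} on the coarse product grid (n_coarse nodes per axis).
    """
    # Nodes where the coarse plan is non-zero.
    support = {ij for ij, mass in plan.items() if mass > tol}

    # Add all neighbouring nodes that lie inside the grid.
    neighbours = set()
    for i, j in support:
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                a, b = i + di, j + dj
                if 0 <= a < n_coarse and 0 <= b < n_coarse:
                    neighbours.add((a, b))

    # Children of those nodes on the grid with half the spacing.
    children = set()
    for i, j in neighbours:
        for ci in (2 * i, 2 * i + 1):
            for cj in (2 * j, 2 * j + 1):
                children.add((ci, cj))
    return children


# Coarse plan concentrated on the diagonal of an 8 x 8 product grid.
coarse_plan = {(i, i): 1.0 / 8 for i in range(8)}
M = refine_support(coarse_plan, n_coarse=8)
```

In this toy case the restricted problem has 136 unknowns instead of the 256 on the full 16 × 16 product grid, and the saving grows with each refinement.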

### B.2 Entropic Regularisation

Cuturi, in the context of computing OT distances, proposed regularizing the minimization in (3) with entropy [16]. This was further developed by Benamou, Carlier, Cuturi, Nenna and Peyré [5], abbreviated to BCCNP, which is the method we describe here. Instead of considering the distance TLλp we consider

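$$
S_\varepsilon = \inf_{\pi\in\Pi(\mu,\nu)} \left\{ \sum_{i=1}^{n}\sum_{j=1}^{n} c_\lambda(x_i,x_j;f,g)\,\pi_{ij} - \varepsilon H(\pi) \right\}
$$

where $H(\pi) = -\sum_{i=1}^{n}\sum_{j=1}^{n} \pi_{ij}\log\pi_{ij}$ is the entropy. In the OT case the distance Sε is also known as the Sinkhorn distance. It is a short calculation to show

$$
S_\varepsilon = \varepsilon \inf_{\pi\in\Pi(\mu,\nu)} \left\{ \mathrm{KL}(\pi|K) \right\}
$$

where $K_{ij} = \exp\left(-\frac{c_\lambda(x_i,x_j;f,g)}{\varepsilon}\right)$ (the exponential is taken pointwise) and KL is the Kullback-Leibler divergence defined by

$$
\mathrm{KL}(\pi|K) = \sum_{i=1}^{n}\sum_{j=1}^{n} \pi_{ij}\log\left(\frac{\pi_{ij}}{K_{ij}}\right).
$$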

It can be shown that the optimal choice of π for Sε can be written in the form π* = diag(u) K diag(v) where u, v ∈ ℝn are limits, as r → ∞, of the sequence

$$
v^{(0)} = \mathbf{1}, \qquad u^{(r)} = \frac{p}{K v^{(r)}}, \qquad v^{(r+1)} = \frac{q}{K^{\top} u^{(r)}}
$$

and p = (p1,…,pn), q = (q1,…,qn) (here $\mathbf{1}$ is the vector of ones, multiplication is the usual matrix-vector multiplication, division is pointwise and ⊤ denotes the matrix transpose). The algorithm given in Algorithm 2 is a special case of iterative Bregman projections and is also known as the Sinkhorn algorithm.

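The stopping condition proposed in [16] is to let $\pi^{(r)} = \mathrm{diag}(u^{(r)})\,K\,\mathrm{diag}(v^{(r)})$ and then stop when

$$
\left| \frac{\sum_{i,j=1}^{n} K_{ij}\pi_{ij}^{(r)} - \varepsilon H(\pi^{(r)})}{\sum_{i,j=1}^{n} K_{ij}\pi_{ij}^{(r-1)} - \varepsilon H(\pi^{(r-1)})} - 1 \right| < 10^{-4}.
$$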

Note that although as ε →0 we will recover the unregularised TLλp distance we also suffer numerical instability as 𝒦 → 0 exponentially in ε. These instabilities have been addressed in, for example, [14, 72].

For optimal transport with quadratic cost c(x, y) = |x − y|2 the Sinkhorn algorithm can be more efficiently implemented using Gaussian convolutions [74]. The two numerical methods described so far use the formulation of TLλ2 given by (3-4) which interprets the TLλ2 as an OT distance between measures μ and ν for a (non-quadratic) cost function cλ(·,·; f, g), hence one cannot make use of previous OT methods such as [74].

However, we also recall that we can define the TLλ2 distance as the optimal transport distance between measures (f × Id)#μ and (g × Id)#ν, see (5), in which case the entropy regularized approach can be implemented using Gaussian convolutions in dimension d + m (when p = 2), where f : Ω ⊆ ℝd → ℝm. Although this means that the numerical method is based in a higher dimension we note the success of the bilateral grid method for bilateral filters that are also based on computing a Gaussian filter in a higher dimension [10, 57]. For colour images, where m = 3 this approach may not be efficient however for m = 1 these ideas have the potential for an improved algorithm.

**Algorithm 2** An Entropy Regularised Approach [5, 16]

**Input:** discrete functions f = (f1,…,fn), g = (g1,…,gn), discrete measures $\mu = \sum_{i=1}^{n} p_i\delta_{x_i}$, $\nu = \sum_{j=1}^{n} q_j\delta_{x_j}$, the parameter ε and a stopping condition.

1. Set r = 0, $K = \left(\exp\left(-\frac{c_\lambda(x_i,x_j;f,g)}{\varepsilon}\right)\right)_{ij}$ and $u^{(0)} = \mathbf{1} \in \mathbb{R}^n$.
2. **repeat**
3. Let r ↦ r + 1, $v^{(r)} = \frac{q}{K^{\top}u^{(r-1)}}$ and $u^{(r)} = \frac{p}{K v^{(r)}}$, where p = (p1,…,pn) and q = (q1,…,qn).
4. **until** the stopping condition has been reached
5. Set $\pi = \mathrm{diag}(u^{(r)})\,K\,\mathrm{diag}(v^{(r)})$.

**Output:** An estimate π of the optimal plan for Sε where the accuracy is determined by the stopping condition.
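The scaling iterations of the entropy regularised approach can be sketched in plain Python as follows. The cost matrix is passed in precomputed, with `cost[i][j]` standing for cλ(xi, xj; f, g), and a fixed number of iterations replaces the stopping condition; both are simplifications made here for illustration, and the function name and example values are hypothetical. No stabilisation for small ε is attempted.

```python
import math


def sinkhorn_plan(cost, p, q, eps, iters=500):
    """Estimate the regularised plan pi = diag(u) K diag(v)."""
    n, m = len(p), len(q)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # v = q / (K^T u), pointwise division
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # u = p / (K v), pointwise division
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0],
        [1.0, 0.0]]
p = [0.3, 0.7]
q = [0.6, 0.4]
pi = sinkhorn_plan(cost, p, q, eps=0.5)
row_sums = [sum(row) for row in pi]
col_sums = [sum(pi[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations the row sums of the plan match p and the column sums match q to within floating-point accuracy.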