## Part 2:

Codes for distributed linear data processing in presence of straggling/faults/errors

Motivation: nonideal computing systems

## Motivation: nonideal computing systems

$\mathrm{M} \times \mathrm{V}$ for 4 processors on AmazonEC2 cloud system


## Motivation: nonideal computing systems



## Motivation: nonideal computing systems



Practitioners are already using redundancy to address straggling

## Organization: How to perform these computations?

1. 


2.

efficiently, fast, in presence of faults/straggling/errors
Motivation: The critical steps for many compute applications (Machine learning: neural nets, LDA, PCA, Regression, Projections.
Scientific computing and physics simulations)

## Rest of the tutorial is divided into two parts:

I. Big processors [Huang, Abraham '84]
II. Small processors [von Neumann '56]

# Part I: Big processors <br> Processor memory scales with problem size 



## System metrics



## System metrics



1. Per-processor computation costs:

- \# operations/processor

2. Straggler tolerance (directly related to "recovery threshold")

- max \# processors that can be ignored by fusion node

3. Communication costs

- number of bits exchanged between all processors
- can use more sophisticated metrics. See [Bruck et al.'97]



## Parallelization for speeding up matrix-vector products


$P$ processors (master node aggregates outputs)
Operations/processor: $M N / P$ (e.g. $P=3$, each does $1 / 3$ rd computations)

## Parallelization for speeding up matrix-vector products


$P$ processors (master node aggregates outputs)
Operations/processor: $M N / P$ (e.g. $P=3$, each does $1 / 3$ rd computations)
In practice, processors can be delayed ("stragglers") or faulty
Recovery threshold $=P$ i.e., Straggler tolerance $=0$

## Parallelization for speeding up matrix-vector products


$P$ processors (master node aggregates outputs)
Operations/processor: $M N / P$ (e.g. $P=3$, each does $1 / 3$ rd computations)
In practice, processors can be delayed ("stragglers") or faulty
Recovery threshold $=P$ i.e., Straggler tolerance $=0$
Note: can parallelize by dividing the matrix horizontally as well

## Parallelization for speeding up matrix-vector products


$P$ processors (master node aggregates outputs)
Operations/processor: $M N / P$ (e.g. $P=3$, each does $1 / 3$ rd computations)
In practice, processors can be delayed ("stragglers") or faulty
Recovery threshold $=P$ i.e., Straggler tolerance $=0$
Note: can parallelize by dividing the matrix horizontally as well

Replication: repeat Job r times


Replication: repeat Job r times


## Replication: repeat Job r times


$P$ processors
\# operations/processor: $r M N / P$ 个
Straggler tolerance: $r-1$ Recovery threshold: $P-r+1$

## Replication: repeat Job r times


$P$ processors
\# operations/processor: $r M N / P$ 个
Straggler tolerance: $r$ - 1 Recovery threshold: $P-r+1$

Also see: recent works of [Joshi, Soljanin, Wornell]

A coding alternative to replication: MDS compute codes ("ABFT")
Algorithm-Based Fault Tolerance
[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

A coding alternative to replication: MDS compute codes ("ABFT")
Algorithm-Based Fault Tolerance
[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos,
Ramchandran '16]

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$

A coding alternative to replication: MDS compute codes ("ABFT")


Algorithm-Based Fault Tolerance
[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos,
Ramchandran '16]

Example: $P=3, K=2$

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$
Assumption: A known in advance

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$
Assumption: A known in advance
Can tolerate 1 straggler \# operations per processor = MN/2

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$
Assumption: A known in advance
Can tolerate 1 straggler \# operations per processor = MN/2

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$
Assumption: A known in advance
Can tolerate 1 straggler \# operations per processor = MN/2

A coding alternative to replication: MDS compute codes ("ABFT")


## Algorithm-Based Fault Tolerance

[Huang, Abraham '84]
[Lee, Lam, Pedarsani, Papailopoulos, Ramchandran '16]

Example: $P=3, K=2$
Assumption: A known in advance
Can tolerate 1 straggler \# operations per processor = MN/2
$P$ processors
In general, use a $(P, K)$-MDS code $(K<M)$ :
Recovery Threshold $=K$, i.e., Straggler tolerance $=P-K$
\# operations/processor $=M N / K \quad(>M N / P$ in uncoded)

## MDS coded computing of $\mathrm{M} \times \mathrm{V}$ outperforms replication

## MDS coded computing of $\mathrm{M} \times \mathrm{V}$ outperforms replication

[Lee et al]: MDS beats replication in expected time (exponential tail models)

## MDS coded computing of $\mathrm{M} \times \mathrm{V}$ outperforms replication

[Lee et al]: MDS beats replication in expected time (exponential tail models)

Experiments on AmazonEC2: [Lee at al]


## MDS coded computing of $\mathrm{M} \times \mathrm{V}$ outperforms replication

[Lee et al]: MDS beats replication in expected time (exponential tail models)

Experiments on AmazonEC2: [Lee at al]


Can tradeoff \# operations/processor for straggler tolerance Codes for \# operations/processor < N ?

## Short-Dot codes

[Dutta, Cadambe, Grover '16]
[Tandon, Lei, Dimakis, Karampatziakis ‘16]

THE MATRIX-VECTOR PRODUCT


## Short-Dot codes

[Dutta, Cadambe, Grover '16]
[Tandon, Lei, Dimakis, Karampatziakis ‘16]


ILLUSTRATION OF SHORT-DOT IMPLEMENTATION


## Short-Dot codes

[Dutta, Cadambe, Grover '16]
[Tandon, Lei, Dimakis, Karampatziakis ‘16]


## Short-Dot codes

[Dutta, Cadambe, Grover '16]
[Tandon, Lei, Dimakis, Karampatziakis '16]


Sparsity
(i) allows tradeoff between computation per-processor and straggler tolerance;
(ii) reduces communication to each processor

## Short-Dot codes

[Dutta, Cadambe, Grover '16]
[Tandon, Lei, Dimakis, Karampatziakis ‘16]


Sparsity
(i) allows tradeoff between computation per-processor and straggler tolerance;
(ii) reduces communication to each processor
\# operations/processor $=s<N$ ฟ
Recovery threshold $=K=P(1-s / N)+M \uparrow$

## Short-Dot codes: the construction

Given A, an $M \times N$ matrix, $M<P$, and a parameter $K, M<K<P$, an $(s, K)$ Short-Dot code consists of a $P \times N$ matrix $\mathbf{B}$ satisfying:

1) $\mathbf{A}$ is contained in span of any $K$ rows of $\mathbf{B}$
2) Every row of $\mathbf{B}$ is $s$-sparse

$\mathbf{B}_{P \times N}$

"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products [Dutta, Cadambe, Grover, NIPS 2016]

## Achievability and outer bound

Achievability: For any $M \times N$ matrix $\mathbf{A}$, an $(s, K)$ Short-Dot code exists s.t.:

$$
s \leq \frac{N}{P}(P-K+M)
$$

...and outputs of any $K$ processors suffice, i.e., Straggler tolerance $=P-K$

## Achievability and outer bound

Achievability: For any $M \times N$ matrix $\mathbf{A}$, an $(s, K)$ Short-Dot code exists s.t.:

$$
s \leq \frac{N}{P}(P-K+M)
$$

...and outputs of any $K$ processors suffice, i.e., Straggler tolerance $=P-K$

Outer bound: Any Short-Dot code satisfies:

$$
\bar{s} \geq \frac{N}{P}(P-K+M)-\frac{M^{2}}{P}\binom{P}{K-M+1}
$$

... for "sufficiently dense" A

## Achievability and outer bound

Achievability: For any $M \times N$ matrix $\mathbf{A}$, an $(s, K)$ Short-Dot code exists s.t.:

$$
s \leq \frac{N}{P}(P-K+M)
$$

...and outputs of any $K$ processors suffice, i.e., Straggler tolerance $=P-K$

Outer bound: Any Short-Dot code satisfies:

$$
\bar{s} \geq \frac{N}{P}(P-K+M)-o(N)
$$

... for "sufficiently dense" A

## Short-Dot strictly and significantly outperforms Uncoded/Replication/ABFT (MDS)



Paper contains expected completion time analysis for exponential service time model, and experimental results.
For $N \gg M$, decoding complexity negligible compared to per-processor computation

## Related result: Gradient coding



What if some gradient-computing workers straggle?

# Related result: Gradient coding 


[Tandon, Lei, Dimakis, Karampatziakis'17]
Want to compute:
$\sum_{i} g_{i}$

What if some gradient-computing workers straggle?

## Related result: Gradient coding

model $\beta$

[Figure courtesy A Dimakis]
A
model $\beta$


model $\beta$

[Tandon, Lei, Dimakis, Karampatziakis'17]
Want to compute:

vector computed distributedly

What if some gradient-computing workers straggle?

# Related result: Gradient coding 


add gradients and update
model $\beta$
worker 3 model
[Tandon, Lei, Dimakis, Karampatziakis'17]
Want to compute:
$\sum_{i} g_{i}=[1,1, \ldots, 1]\left[\begin{array}{c}g_{1} \\ g_{2} \\ \cdot \\ \cdot \\ g_{N}\end{array}\right]$
vector computed "matrix"
distributedly

What if some gradient-computing workers straggle?
Solution: code "matrix" A (i.e., [1 1 ... 1]) using a Short-Dot code

- introduce redundancy in datasets consistent with the Short-Dot pattern
- computes the correct (redundant) gradients at each processor

Can also be viewed as a novel "distributed storage code for computation"

# Related result: Gradient coding 



ma
add gradients and update model
model $\beta$ worker 3
[Tandon, Lei, Dimakis, Karampatziakis'17]

What if some gradient-computing workers straggle?
Solution: code "matrix" A (i.e., [1 1 ... 1]) using a Short-Dot code

- introduce redundancy in datasets consistent with the Short-Dot pattern
- computes the correct (redundant) gradients at each processor

Can also be viewed as a novel "distributed storage code for computation"
For $\mathbf{V}^{\mathbf{T}} \mathbf{V}$, coding can beat replication only due to integer effects. No scaling-sense gain, at least in this coarse model, over replication. (See also [Halbawi, Azizan-Ruhi, Salehi, Hassibi '17])

## Trend:

- $\quad \mathrm{V} \times \mathrm{V}$ : offers some advantage over replication
- $\mathrm{M} \times \mathrm{V}$ : arbitrary gains over replication, MDS coding


## Trend:

- $\quad \mathrm{V} \times \mathrm{V}$ : offers some advantage over replication
- $\mathrm{M} \times \mathrm{V}$ : arbitrary gains over replication, MDS coding
- Next: M x M: ?

Trend:

- $\quad \mathrm{V} \times \mathrm{V}$ : offers some advantage over replication
- $\mathrm{M} \times \mathrm{V}$ : arbitrary gains over replication, MDS coding
- Next: M $\times$ M: ?

Answer: arbitrarily large gains over M x V-type coding!

Trend:

- $\quad \mathrm{V} \times \mathrm{V}$ : offers some advantage over replication
- $\mathrm{M} \times \mathrm{V}$ : arbitrary gains over replication, MDS coding
- Next: M x M: ?

Answer: arbitrarily large gains over M x V-type coding!

break!



## Uncoded parallelization

## Let's assume that each processor can store $1 / \mathrm{m}$ of $\mathbf{A}$ and $1 / \mathrm{n}$ of $\mathbf{B}$



Total mn processors
(i,j)-th Processor receives $\mathbf{A}_{\mathbf{i}}, \mathbf{B}_{\mathbf{j}}$, computes $\mathbf{A}_{\mathbf{i}} \mathbf{x} \mathbf{B}_{\mathbf{j}}$, sends them to fusion center
\# operations/processor $=N^{3} / m n$ (we'll keep this constant across strategies) Recovery Threshold $=P$; Straggler tolerance $=0$

## Strategy I: M x V $\rightarrow \mathrm{M} \times \mathrm{M}$



Each processor computes a product $\mathbf{A}_{\mathbf{i}} \mathbf{B}_{\mathbf{j}}$ Recovery threshold $=P-P / n+m=\Theta(P)$ \# operations/processor: $N^{3} / m n$

## Algorithm-based Fault Tolerance (ABFT)



Fig. 1. A checksum matrix multiplication.
[Huang, Abraham'84]
[Lee, Suh, Ramchandran'17]

## Algorithm-based Fault Tolerance (ABFT)

| A1 |
| :---: |
| A2 |
| A3 |
| A4 |



Fig. 1. A checksum matrix multiplication.
[Huang, Abraham'84]
[Lee, Suh, Ramchandran'17]

## Algorithm-based Fault Tolerance (ABFT)



Fig. 1. A checksum matrix multiplication.
[Huang, Abraham'84]
[Lee, Suh, Ramchandran'17]
Recovery threshold: $K=2(m-1) \sqrt{P}-(m-1)^{2}+1=\Theta(\sqrt{P})$ Straggler resilience: $P-K \quad$ [Lee, Sun, Ramchandran'17] \# operations/processor: $N^{3} / m n$

## Algorithm-based Fault Tolerance (ABFT)

| A1 |
| :---: |
| A2 |
| A3 |
| A4 |



Fig. 1. A checksum matrix multiplication.
[Huang, Abraham'84]
[Lee, Suh, Ramchandran'17]
Recovery threshold: $K=2(m-1) \sqrt{P}-(m-1)^{2}+1=\Theta(\sqrt{P})$ Straggler resilience: $P-K \quad$ [Lee, Sun, Ramchandran'17] \# operations/processor: $N^{3} / m n$

Next: Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Recovery threshold: $K=m n$
\# operations/processor: $N^{3} / m n$

Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]
Intuition: forget matrices for this slide

## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide
$\left\{A_{i}\right\}\left\{B_{j}\right\}$

## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide


## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide
GOAL: Compute all products of the form $A_{i} B_{j}$


## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide
GOAL: Compute all products of the form $A_{i} B_{j}$


## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide


## Constraints:

1) Can only send information of size of one $A_{\mathrm{i}}$ and one $B_{\mathrm{j}}$
2) Processor can only compute a product of its inputs

## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide


## Constraints:

1) Can only send information of size of one $A_{\mathrm{i}}$ and one $B_{\mathrm{j}}$
2) Processor can only compute a product of its inputs

## Solution:

Send $\sum_{i} \gamma_{i} A_{i}$ and $\sum_{i} \delta_{i} B_{i}$

## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide


## Constraints:

1) Can only send information of size of one $A_{\mathrm{i}}$ and one $B_{\mathrm{j}}$
2) Processor can only compute a product of its inputs

## Solution:

Send $\sum_{i} \gamma_{i p} A_{i}$ and $\sum_{i} \delta_{i p} B_{i}$

## Polynomial codes [Yu, Maddah-Ali, Avestimehr '17]

Intuition: forget matrices for this slide


## Constraints:

1) Can only send information of size of one $A_{\mathrm{i}}$ and one $B_{\mathrm{j}}$ 2) Processor can only compute a product of its inputs

## Solution:

Send $\sum_{i} \gamma_{i p} A_{i}$ and $\sum_{i} \delta_{i p} B_{i}$ $\left\{A_{i}\right\}_{i=1}^{m}\left\{B_{j}\right\}_{i=1}^{n}$

## Achievability

You can use random codes.
But "polynomial codes" get you there with lower enc/dec complexity
Example: $\mathrm{m}=2, \mathrm{n}=2$


Proc $i$ computes $\tilde{\mathbf{C}}_{i}=\tilde{\mathbf{A}}_{i} \tilde{\mathbf{B}}_{i}=\mathbf{A}_{1} \mathbf{B}_{1}+i \mathbf{A}_{2} \mathbf{B}_{1}+i^{2} \mathbf{A}_{1} \mathbf{B}_{2}+i^{3} \mathbf{A}_{2} \mathbf{B}_{2}$

## Achievability

You can use random codes.
But "polynomial codes" get you there with lower enc/dec complexity
Example: $m=2, n=2$


Proc $i$ computes $\tilde{\mathbf{C}}_{i}=\tilde{\mathbf{A}}_{i} \tilde{\mathbf{B}}_{i}=\mathbf{A}_{1} \mathbf{B}_{1}+i \mathbf{A}_{2} \mathbf{B}_{1}+i^{2} \mathbf{A}_{1} \mathbf{B}_{2}+i^{3} \mathbf{A}_{2} \mathbf{B}_{2}$
Fusion center needs outputs from only 4 such processors! e.g. from 1,2,3,4:

$$
\left[\begin{array}{c}
\tilde{\mathbf{C}}_{1} \\
\tilde{\tilde{\tilde{C}}}_{2} \\
\tilde{\widetilde{C}}_{3} \\
\widetilde{\mathbb{C}}_{4}
\end{array}\right]=\left[\begin{array}{llll}
1^{0} & 1^{1} & 1^{2} & 1^{3} \\
2^{0} & 2^{1} & 2^{2} & 2^{3} \\
3^{0} & 3^{1} & 3^{2} & 3^{3} \\
4^{0} & 4^{1} & 4^{2} & 4^{3}
\end{array}\right]\left[\begin{array}{l}
\mathbf{A}_{1} \mathbf{B}_{1} \\
\mathbf{A}_{2} \mathbf{B}_{1} \\
\mathbf{A}_{1} \mathbf{B}_{2} \\
\mathbf{A}_{2} \mathbf{B}_{2}
\end{array}\right] \text { Invert a Vandermonde matrix }
$$

## Achievability

You can use random codes.
But "polynomial codes" get you there with lower enc/dec complexity
Example: $\mathrm{m}=2, \mathrm{n}=2$


Proc $i$ computes $\tilde{\mathbf{C}}_{i}=\tilde{\mathbf{A}}_{i} \tilde{\mathbf{B}}_{i}=\mathbf{A}_{1} \mathbf{B}_{1}+i \mathbf{A}_{2} \mathbf{B}_{1}+i^{2} \mathbf{A}_{1} \mathbf{B}_{2}+i^{3} \mathbf{A}_{2} \mathbf{B}_{2}$
Fusion center needs outputs from only 4 such processors! e.g. from 1,2,3,4:

$$
\left[\begin{array}{c}
\tilde{\mathbf{C}}_{1} \\
\tilde{\widetilde{C}}_{2} \\
\tilde{\widetilde{C}}_{3} \\
\tilde{\mathbb{C}}_{4}
\end{array}\right]=\left[\begin{array}{llll}
1^{0} & 1^{1} & 1^{2} & 1^{3} \\
2^{0} & 2^{1} & 2^{2} & 2^{3} \\
3^{0} & 3^{1} & 3^{2} & 3^{3} \\
4^{0} & 4^{1} & 4^{2} & 4^{3}
\end{array}\right]\left[\begin{array}{l}
\mathbf{A}_{1} \mathbf{B}_{1} \\
\mathbf{A}_{2} \mathbf{B}_{1} \\
\mathbf{A}_{1} \mathbf{B}_{2} \\
\mathbf{A}_{2} \mathbf{B}_{2}
\end{array}\right] \text { Invert a Vandermonde matrix }
$$

In general, Recovery Threshold = mn (attained using RS-code-type construction) 22

## Summary so far...

- $\quad \mathrm{V} \times \mathrm{V}$ : Coding offers little advantage over replication
- M $\times$ V: Short-Dot codes provide arbitrary gains over replication, MDS coding,
- $\quad \mathrm{M} \times \mathrm{M}$ : polynomial coding provides arbitrary gains over $\mathrm{M} \times \mathrm{V}$ codes

What additional costs come with coding?

- encoding and decoding complexity (skipped here for simplicity)
- Next: degradation is not graceful as you pull deadline earlier

To see this, let's look a problem with repeated $\mathrm{M} \times \mathrm{V}$, and slow convergence to solution

## Understanding a limitation of coding: Coding for linear iterative solutions

$$
\mathbf{x}^{(l+1)}=(1-d) \mathbf{M \times V} \quad \text { computation input }
$$

Converges to $\mathbf{x}^{*}$ satisfying $\mathbf{x}^{*}=(1-d) \mathbf{A} \mathbf{x}^{*}+d \mathbf{r}$.
Subtracting, $\mathbf{e}^{(l+1)}=(1-d) \mathbf{A} \mathbf{e}^{(l)}$, where $\mathbf{e}^{(l)}=\mathbf{x}^{(l)}-\mathbf{x}^{*}$.

## Understanding a limitation of coding: Coding for linear iterative solutions

$$
\mathbf{x}^{(l+1)}=(1-d) \stackrel{\mathrm{MxV}}{\mathbf{A} \mathbf{x}^{(l)}}+d \mathbf{c o m p u t a t i o n ~ i n p u t} .
$$

Converges to $\mathbf{x}^{*}$ satisfying $\mathbf{x}^{*}=(1-d) \mathbf{A} \mathbf{x}^{*}+d \mathbf{r}$.
Subtracting, $\mathbf{e}^{(l+1)}=(1-d) \mathbf{A} \mathbf{e}^{(l)}$, where $\mathbf{e}^{(l)}=\mathbf{x}^{(l)}-\mathbf{x}^{*}$.


## Understanding a limitation of coding: Coding for linear iterative solutions

$$
\stackrel{\mathrm{MxV}}{\mathbf{x}^{(l+1)}=(1-d) \mathbf{A x}^{(l)}+d \mathbf{c o m p u t a t i o n ~ i n p u t .}}
$$

Converges to $\mathbf{x}^{*}$ satisfying $\mathbf{x}^{*}=(1-d) \mathbf{A} \mathbf{x}^{*}+d \mathbf{r}$.
Subtracting, $\mathbf{e}^{(l+1)}=(1-d) \mathbf{A} \mathbf{e}^{(l)}$, where $\mathbf{e}^{(l)}=\mathbf{x}^{(l)}-\mathbf{x}^{*}$.


Next: how to code multiple linear iterative problems in parallel

## Understanding a limitation of coding: Coding for linear iterative solutions

$$
\underset{\mathbf{x}^{(l+1)}}{ }=(1-d) \stackrel{\mathrm{MxV}}{\mathbf{A} \mathbf{x}^{(l)}}+d \mathbf{d} \text {. } \quad \text { computation input }
$$

Converges to $\mathbf{x}^{*}$ satisfying $\mathbf{x}^{*}=(1-d) \mathbf{A} \mathbf{x}^{*}+d \mathbf{r}$.
Subtracting, $\mathbf{e}^{(l+1)}=(1-d) \mathbf{A} \mathbf{e}^{(l)}$, where $\mathbf{e}^{(l)}=\mathbf{x}^{(l)}-\mathbf{x}^{*}$.


Next: how to code multiple linear iterative problems in parallel

## Solving multiple iterative problems in parallel

Classical coded computation applied to linear iterative problems

- Initialize (Encoding)

$$
\left[\mathbf{s}_{1}, \ldots, \mathbf{s}_{P}\right]=\left[\mathbf{r}_{1}, \ldots, \mathbf{r}_{k}\right] \cdot \mathbf{G}_{k \times P}
$$

- Parallel Computing:
$l_{i}$ power iterations at the $i$-th worker with input $\mathbf{s}_{i}$

$$
\mathbf{Y}_{N \times P}^{\left(T_{\mathrm{d})}\right)}=\left[\mathbf{y}_{1}^{\left(l_{1}\right)}, \ldots, \mathbf{y}_{P}^{\left(l_{P}\right)}\right]
$$

- Post Processing (Decoding) Matrix inversion on fastest $k$ processors.

$$
\widehat{\mathbf{X}}^{\top}=\tilde{\mathbf{G}}^{-1}\left(\mathbf{Y}^{\left(T_{\mathrm{dl}}\right)}\right)^{\top} .
$$

## Solving multiple iterative problems in parallel

Classical coded computation applied to linear iterative problems

- Initialize (Encoding)

$$
\left[\mathbf{s}_{1}, \ldots, \mathbf{s}_{P}\right]=\left[\mathbf{r}_{1}, \ldots, \mathbf{r}_{k}\right] \cdot \mathbf{G}_{k \times P}
$$

- Parallel Computing:
$l_{i}$ power iterations at the $i$-th worker with input $\mathbf{s}_{i}$

$$
\mathbf{Y}_{N \times P}^{\left(T_{\mathrm{d}}\right)}=\left[\mathbf{y}_{1}^{\left(l_{1}\right)}, \ldots, \mathbf{y}_{P}^{\left(l_{P}\right)}\right]
$$

- Post Processing (Decoding) Matrix inversion on fastest $k$ processors.

$$
\widehat{\mathbf{X}}^{\top}=\tilde{\mathbf{G}}^{-1}\left(\mathbf{Y}^{\left(T_{\mathrm{dl}}\right)}\right)^{\top} .
$$

Is this invertible?
Is this well conditioned?

## Solving multiple iterative problems in parallel

Classical coded computation applied to linear iterative problems

- Initialize (Encoding)

$$
\left[\mathbf{s}_{1}, \ldots, \mathbf{s}_{P}\right]=\left[\mathbf{r}_{1}, \ldots, \mathbf{r}_{k}\right] \cdot \mathbf{G}_{k \times P}
$$

- Parallel Computing:
$l_{i}$ power iterations at the $i$-th worker with input $\mathbf{s}_{i}$

$$
\mathbf{Y}_{N \times P}^{\left(T_{\mathrm{d}}\right)}=\left[\mathbf{y}_{1}^{\left(l_{1}\right)}, \ldots, \mathbf{y}_{P}^{\left(l_{P}\right)}\right]
$$

- Post Processing (Decoding) Matrix inversion on fastest $k$ processors.

$$
\widehat{\mathbf{X}}^{\top}=\tilde{\mathbf{G}}^{-1}\left(\mathbf{Y}^{\left(T_{\mathrm{dl}}\right)}\right)^{\top} .
$$

Is this invertible? Yes!
Is this well conditioned?

## Solving multiple iterative problems in parallel

Classical coded computation applied to linear iterative problems

- Initialize (Encoding)

$$
\left[\mathbf{s}_{1}, \ldots, \mathbf{s}_{P}\right]=\left[\mathbf{r}_{1}, \ldots, \mathbf{r}_{k}\right] \cdot \mathbf{G}_{k \times P}
$$

- Parallel Computing:
$l_{i}$ power iterations at the $i$-th worker with input $\mathbf{s}_{i}$

$$
\mathbf{Y}_{N \times P}^{\left(T_{\mathrm{d}}\right)}=\left[\mathbf{y}_{1}^{\left(l_{1}\right)}, \ldots, \mathbf{y}_{P}^{\left(l_{P}\right)}\right]
$$

- Post Processing (Decoding) Matrix inversion on fastest $k$ processors.

$$
\widehat{\mathbf{X}}^{\top}=\tilde{\mathbf{G}}^{-1}\left(\mathbf{Y}^{\left(T_{\mathrm{d}}\right)}\right)^{\top} .
$$

Is this invertible? Yes!
Is this well conditioned? No!

What is the effect of a poor conditioning number? Error blows up!


What is the effect of a poor conditioning number? Error blows up!


What is the effect of a poor conditioning number?

## Error blows up!



Similar issues arise in designing good "analog coding with erasures"
[Haikin, Zamir ISIT'16][Haikin, Zamir, Gavish '17] 26

A graceful degradation with time:
Coded computing with weighted least squares

- Initialize (Encoding)


$$
\left[\mathbf{s}_{1}, \ldots, \mathbf{s}_{P}\right]=\left[\mathbf{r}_{1}, \ldots, \mathbf{r}_{k}\right] \cdot \mathbf{G} .
$$

- Parallel Computing:
$l_{i}$ power iterations at the $i$-th worker with input $\mathrm{s}_{i}$

$$
\mathbf{Y}_{N \times P}^{\left(T_{d l}\right)}=\left[\mathbf{y}_{1}^{\left(l_{1}\right)}, \ldots, \mathbf{y}_{P}^{\left(l_{P}\right)}\right] .
$$

- Post Processing (Decoding)
$\widehat{\mathbf{X}}^{\top}=\left(\mathbf{G} \boldsymbol{\Lambda}^{-1} \mathbf{G}^{\top}\right)^{-1} \mathbf{G} \boldsymbol{\Lambda}^{-1}\left(\mathbf{Y}^{\left(T_{\mathrm{dl}}\right)}\right)^{\top}$
Similar to the "weighted least-square" solution.


# Weighted least squares outperforms competition; Degrades gracefully with early deadline 



## Summary thus far...

## ABFT $\underset{\neq}{ }$ Coded computation

New codes, new problems, new analyses, converses

But, we need to be careful in lit-searching ABFT literature

Next: small processors

## Break!

## Questions/comments? <br> Your favorite computation problem?

## Preview of Part II: Small Processors

Controlling error propagation with small processors/gates

- No central processor to distribute/aggregate

Encoding/decoding also have errors

# Part II: "Small processors" 

has so far received relatively less attention

## What are small processors?

1) Logic gates

e.g. Dot product "nanofunction" in graphene [Pop, Shanbhag, Blaauw labs '15-'16]
2) Analog "Nanofunctions" and beyond CMOS devices
3) Processors with limited memory (i.e., ALL processors are small!)

- can't assume that processor memory increases with problem size

Synthesize large reliable computations using small processors?

What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits
built with noisy gates


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits
$\underset{\begin{array}{c}\text { binary } \\ \text { inputs }\end{array}}{\sim} \Longrightarrow \underbrace{\epsilon}_{\begin{array}{c}\text { error probability } \\ \text { of binary output }\end{array}}$


What is fundamentally new in small processor computing?

1) Errors accumulate; information dissipates


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits


Classical Data-Processing Inequality

$$
\frac{I(X ; Z)}{I(X ; Y)} \leq 1
$$

"Strong" Data-Processing Inequality

What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits built with noisy gates


$$
X \quad-\quad Y^{B S C}(\epsilon)_{Z}
$$

Classical Data-Processing Inequality

$$
\frac{I(X ; Z)}{I(X ; Y)} \leq 1
$$

$$
\begin{aligned}
& \text { "Strong" Data-Processing Inequality }
\end{aligned}
$$

b) Distortion accumulation with quantization noise
(e.g. in "data summarization", consensus, etc.)


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits
 built with noisy gates


$$
X \quad-\quad Y^{B S C}(\epsilon)_{Z}
$$

Classical Data-Processing Inequality

> "Strong" Data-Processing Inequality

$$
\frac{I(X ; Z)}{I(X ; Y)} \leq 1
$$

b) Distortion accumulation with quantization noise
(e.g. in "data summarization", consensus, etc.)


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

a) Info-dissipation in noisy circuits:

Noisy circuits
 built with noisy gates


$$
X \quad-\quad Y^{B S C}(\epsilon)_{Z}
$$

Classical Data-Processing Inequality

> "Strong" Data-Processing Inequality

$$
\frac{I(X ; Z)}{I(X ; Y)} \leq 1
$$

b) Distortion accumulation with quantization noise
(e.g. in "data summarization", consensus, etc.)


What is fundamentally new in small processor computing?

## 1) Errors accumulate; information dissipates

## 2) Decoding, and possibly encoding, also error prone

Essential to analyze decoding/encoding costs in noisy computation:
there may be no conceptual analog of Shannon capacity in computing problems [Grover et al.'07-'15][Grover ISIT'14][Blake, Kschischang '15,'16]

Error-prone decoding (often message-passing for LDPCs)
[Taylor '67][Hadjicostis, Verghese '05][Vasic et al. '07-'13][Varshney '11][Grover, Palaiyanur, Sahai '10] [Huang, Yao, Dolecek '14][Gross et al. '13][Vasic et al.'16]

Error-prone encoding [Yang, Grover, Kar '14][Dupraz et al. '15]

- see also erasure version [Hachem, Wang, Fragouli, Diggavi ‘13]

Can we compute $\mathrm{M} \times \mathrm{V}$ reliably using error-prone gates? Is it even possible?
We'll next discuss this for 1) Gates; 2) Processors

## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics

$$
\underset{\text { Output }}{\left[r_{1}, r_{2}, \ldots, r_{K}\right]}=\underset{\text { Input }}{\left[s_{1}, s_{2}, \ldots, s_{L}\right]}\left[\begin{array}{c}
A \\
\text { Linear transform }
\end{array}\right]_{L \times K}
$$

## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics



## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics



## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics

$\underset{\text { Output }}{\left[r_{1}, r_{2}, \ldots, r_{K}\right]}=\underset{\text { Input }}{\left[s_{1}, s_{2}, \ldots, s_{L}\right]}\left[\begin{array}{c}A \\ \text { Linear transform }\end{array}\right]_{L \times K}$

$\widetilde{\mathbb{G}}:$ coded generator matrix
Note: rows of $\widetilde{\mathbb{G}}$ are also codewords of $\mathbb{G}$ !

## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics



Encoded computation: multiply $s$ with $\widetilde{\mathbb{G}}$
Decoding: use parity-check matrix H for $\mathbb{G}$

## $\mathrm{M} \times \mathrm{V}$ on noisy gates: the basics



## A difficulty with this approach: error propagation

Naive computation of $\mathbf{x}=\mathbf{s} \widetilde{\mathbf{G}}$ requires computing $x_{i}=\sum_{j} s_{j} g_{j i}$

## A difficulty with this approach: error propagation

Naive computation of $\mathbf{x}=\mathbf{s} \widetilde{\mathbf{G}}$ requires computing $x_{i}=\sum_{j} s_{j} g_{j i}$


## A difficulty with this approach: error propagation

Naive computation of $\mathbf{x}=\mathbf{s} \widetilde{\mathbf{G}}$ requires computing $x_{i}=\sum_{j} s_{j} g_{j i}$


Requiring $L$ AND gates, $L-1$ XOR gates
Error accumulates! As $L \rightarrow \infty$, each $x_{i}$ approaches a random coin flip

## Addressing error accumulation: a simple observation



## Addressing error accumulation: a simple observation



## Addressing error accumulation: a simple observation



## Addressing error accumulation: a simple observation



Any correctly computed partial sum is a valid codeword

## Addressing error accumulation: a simple observation



Any correctly computed partial sum is a valid codeword

- possibly correct compute errors by embedding decoders inside encoder
- Use LDPC codes: utilize results on noisy decoding (we used (Tabatabaei, Cho, Dolecek '14])
"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)


Better yet: ENCODED-Tree
"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)


Moral: can overcome info loss on each link by collecting info over many links
"ENCODED": ENcoded COmputation with Decoders EmbeddeD (with decoding also being noisy)

COMPUTE \& CORRECT

$s_{1} \mathbf{g}_{1}+s_{2} \mathbf{g}_{2}$
Better yet: ENCODED-Tree


Moral: can overcome info loss on each link by collecting info over many links Reflections of a converse [Evans, Schulman '99] in our achievability

## ENCODED vs Uncoded and Repetition

## Theorem Error correction with ENCODED-Tree Yang, Grover, Kar Allerton '14]

LDPC codes of sufficiently large girth can keep errors contained through repeated error suppression



ENCODED provably requires fewer gates, and lessenergy)than repetition in scaling sense [Yang, Grover, Kar IEEE Trans. Info Theory '17]

Using general device
models, focusing
specifically on spintronics
Moral: repeated error-correction can fight information dissipation
Next: How do these insights apply to processors of limited memory (but > 1 gate)?

## $\mathrm{M} \times \mathrm{V}$ on small (but reliable) processors

Let's first understand $\mathrm{M} \times \mathrm{V}$ on reliable processors
"SUMMA": Scalable Universal Matrix Multiplication Algorithm

- a widely used algorithm [van de Geijn and Watts '95]


## $\mathrm{M} \times \mathrm{V}$ on small (but reliable) processors

Let's first understand $\mathrm{M} \times \mathrm{V}$ on reliable processors
"SUMMA": Scalable Universal Matrix Multiplication Algorithm

- a widely used algorithm [van de Geijn and Watts '95]

Naive M x V computation (Ax)


## $\mathrm{M} \times \mathrm{V}$ on small (but reliable) processors

Let's first understand $\mathrm{M} \times \mathrm{V}$ on reliable processors
"SUMMA": Scalable Universal Matrix Multiplication Algorithm

- a widely used algorithm [van de Geijn and Watts '95]

Naive M x V computation (Ax)


SUMMA $\quad \mathbf{x}=\left[\mathrm{x}_{1}, \mathrm{x}_{2}, \ldots, \mathrm{x}_{\mathrm{c}}\right]$


## $\mathrm{M} \times \mathrm{V}$ on small (but reliable) processors

Let's first understand $\mathrm{M} \times \mathrm{V}$ on reliable processors
"SUMMA": Scalable Universal Matrix Multiplication Algorithm - a widely used algorithm [van de Geijn and Watts '95]

Naive M x V computation (Ax)


## Coded SUMMA for M x V on error-prone processors

 ABFT/MDS coding
[in prep.]

## Coded SUMMA for $\mathrm{M} \times \mathrm{V}$ on error-prone processors

ABFT/MDS coding ENCODED (using LDPC)

[in prep.]

## Summary of Part II. 2

## What is fundamentally new in small vs large processors?

0) Memory limitations: necessitate algorithms like SUMMA
1) Errors accumulate; information dissipates
2) Decoding also error prone

Embed (noisy) decoders to repeatedly suppress errors, limiting info dissipation

# Coded Map-reduce <br> Not covered in detail here, but belongs thematically 

[Li-Avestimehr-Maddah-Ali 2015]
Map-reduce: A widely used framework for parallelizing a variety of tasks

- Simple to learn, very scalable


## Coded Map-reduce

Not covered in detail here, but belongs thematically
[Li-Avestimehr-Maddah-Ali 2015]
Map-reduce: A widely used framework for parallelizing a variety of tasks

- Simple to learn, very scalable

Three phases

```
Map()
```

First phase

## Data exchange

Second phase (usually called shuffle)

```
Reduce()
```

Third phase

# Coded Map-reduce 

Not covered in detail here, but belongs thematically
[Li-Avestimehr-Maddah-Ali 2015]
Map-reduce: A widely used framework for parallelizing a variety of tasks

- Simple to learn, very scalable

Three phases

```
Map()
```

First phase


Second phase (usually called shuffle)

## Reduce( )

Third phase

Idea of coded map reduce

- Introduce redundancy in the map phase
- Exploit information theory ideas (a la coded caching) to minimize communication cost in data exchange
- Save on overall time-to-completion by tuning correctly

Lots of follow up work, exciting area of research!

## Broader view of coded distributed computing

Conventional "division of labor" approach:

- design a "good" algorithm with low Turing complexity
- engineer deals with real world costs and imperfections

This tutorial: an information-theoretic approach:

- model system costs and imperfections and,
- derive fundamental information-theoretic limits,
- obtain optimal strategies for these models


## Our thanks to...

## Collaborators:

- Soummya Kar
- Kishori Konwar
- Nancy Lynch
- Muriel Medard
- Prakash N Moorthy
- Peter Musial
- Zhiying Wang

Student collaborators:

- Rami Ali
- Jeremy Bai
- Malhar Chaudhari
- Sanghamitra Dutta
- Mohammad Fahim
- Farzin Haddadpour
- Haewon Jeong
- Yaoqing Yang


## Help with talk and slides:

- Mohammad Ali Maddah Ali
- Salman Avestimehr
- Alex Dimakis
- Gauri Joshi
- Kangwook Lee
- Ramtin Pedarsani


## Funding sources:

National Science Foundation (NSF)
SDNIC center of the Semiconductor Research Corporation

Appendices/Backup slides

Weak scaling:
Number of processors scales with problem size

- constant computational workload per processor

Strong scaling:
Problem size fixed!

- finding the "sweet-spot" in number of processors
- too many processors => high comm overhead
- too few => not enough parallelization

Related: gate-level errors

- error/fault-tolerant computing

Related problem:
Minimizing total power in communication systems


New goal: Design a $P_{\text {total }}$-efficient code
(errors only in the channel;

$$
P_{t o t a l}=P_{T}+P_{e n c}+P_{d e c} \quad \text { encoding/decoding noiseless) }
$$

## Related problem:

## Minimizing total power in communication systems



New goal: Design a $P_{\text {total }}$-efficient code
(errors only in the channel;

$$
P_{\text {total }}=P_{T}+P_{e n c}+P_{d e c} \quad \text { encoding/decoding noiseless) }
$$

Circuit implementation model:
Channel model:


## Related problem:

## Minimizing total power in communication systems



New goal: Design a $P_{\text {total }}$-efficient code
(errors only in the channel;

$$
P_{\text {total }}=P_{T}+P_{e n c}+P_{d e c} \quad \text { encoding/decoding noiseless) }
$$

Circuit implementation model:
Channel model:


Circuit energy model: "Information-Friction" [Grover, IEEE Trans IT 2015]
[Blake, Ph.D. thesis UToronto, 2017]


$$
E_{\mathrm{info}-\mathrm{friction}}=\mu B d
$$

## Fundamental limits on total communication energy

| Theorem [Grover, IEEE Trans. Info Theory '15] |  |
| :---: | :---: |
| $E_{\text {enc,dec per-bit }} \geq \Omega\left(\sqrt{\frac{\log \frac{1}{P_{e}}}{P_{T}}}\right) \begin{aligned} & \text { for any code, and any encoding \& } \\ & \text { decoding algorithm implemented } \\ & \text { in the circuit model }\end{aligned}$ | [El Gamal, Greene, Peng '84] [Grover, Woyach, Sahai '11] Grover, Goldsmith, Sahai '12] [Grover et al. '07-15] [Thompson '80] |

## Fundamental limits on total communication energy

## Theorem [Grover, IEEE Trans. Info Theory '15] <br> $$
E_{\text {enc,dec per-bit }} \geq \Omega\left(\sqrt{\frac{\log \frac{1}{P_{e}}}{P_{T}}}\right) \begin{aligned} & \text { for any code, and any encoding \& } \\ & \text { decoding algorithm implemented } \\ & \text { in the circuit model } \end{aligned}
$$

builds on
[El Gamal, Greene, Peng '84]
[Grover, Woyach, Sahai '11]
[Grover, Goldsmith, Sahai '12]
[Grover et al. '07-15]
[Thompson '80]


## Fundamental limits on total communication energy

## Theorem [Grover, IEEE Trans. Info Theory '15]


builds on
[El Gamal, Greene, Peng '84]
[Grover, Woyach, Sahai '11]
[Grover, Goldsmith, Sahai '12]
[Grover et al. '07-15]
[Thompson '80]



## Fundamental limits on total communication energy


builds on
[El Gamal, Greene, Peng '84]
[Grover, Woyach, Sahai '11]
[Grover, Goldsmith, Sahai '12] [Grover et al. '07-15]
[Thompson '80]



Straightforward extension to noisy computing of invertible linear transforms [Grover, ISIT'14]: don't aim for "Shannon capacity of noisy computing"!

## Short Dot Achievability

$$
\square=\square \square
$$

Rows of $\mathbf{A}$ lie in the span of any $K$ rows of $\mathbf{B}$
$i$-th column of $\mathbf{Z}$ chosen to set zeroes in the $i$-th column of $\mathbf{B}$
Equation/variable counting gives $s \leq \frac{N}{P}(P-K+M)$

## Short Dot outer bound intuition

Intuition: no column can be too sparse: can't have > K zeros

- since $\mathbf{A}$ has to be recoverable from any $K$ rows


This argument yields a looser converse:
Converse: Any Short-Dot code satisfies:

$$
s \geq \frac{N}{P}(P-K+1)
$$

Tighten by rank arguments (messy; happy to discuss offline)

