#![cfg_attr(
    doc,
    feature(prelude_import, custom_inner_attributes, proc_macro_hygiene)
)]
#![cfg_attr(doc, katexit::katexit)]
//! Helper crate for einsum algorithm
//!
//! Introduction to einsum
//! -----------------------
//! The Einstein summation rule, used in theoretical physics and related fields
//! including machine learning, is a rule for abbreviating tensor operations.
//! For example, one of the most basic tensor operations is the inner product of
//! two vectors in $n$-dimensional Euclidean space, $x, y \in \mathbb{R}^n$:
//! $$
//! (x, y) = \sum_{i \in I} x_i y_i
//! $$
//! where $I$ denotes a set of indices, i.e. $I = \\{0, 1, \ldots, n-1 \\}$.
//! Another example is matrix multiplication.
//! A product of three square matrices $A, B, C \in M_n(\mathbb{R})$
//! can be written elementwise as:
//! $$
//! (ABC)_{il} = \sum_{j \in J} \sum_{k \in K} a_{ij} b_{jk} c_{kl}
//! $$
//!
//! Many such tensor operations appear in various fields,
//! and we usually define a separate function for each operation.
//! For the inner product of vectors, we may define a function like
//! ```ignore
//! fn inner(a: Array1D<R>, b: Array1D<R>) -> R;
//! ```
//! for matrix multiplication:
//! ```ignore
//! fn matmul(a: Array2D<R>, b: Array2D<R>) -> Array2D<R>;
//! ```
//! or taking three matrices:
//! ```ignore
//! fn matmul3(a: Array2D<R>, b: Array2D<R>, c: Array2D<R>) -> Array2D<R>;
//! ```
//! and so on.
//!
//! These definitions are very similar, and in fact,
//! they can all be represented in a single manner.
//! Each of these computations multiplies elements of the tensors indexed by some indices,
//! and sums the products over some of those indices.
//! So we have to determine
//!
//! - which indices are used for each tensor in the multiplication
//! - which indices are summed over
//!
//! This can be done by listing the indices of the input tensors
//! and applying the Einstein summation rule, i.e. summing over indices which appear more than once.
//! For example, `inner` is represented by `i,i->`, `matmul` by `ij,jk->ik`,
//! `matmul3` by `ij,jk,kl->il`, and so on,
//! where `,` separates the indices of each input tensor
//! and each index is represented by a single char like `i` or `j`.
//! `->` separates the indices of the input tensors from the indices of the output tensor.
//! If no output indices are given, as in `i,i->`, the output tensor is rank-0, i.e. a scalar value.
//! "einsum" is an algorithm or runtime that expands such a string
//! into actual tensor operations.
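//!
//! As a concrete illustration (not part of this crate's API), the subscript `i,i->`
//! expands into the following computation; the slice-based signature is chosen only
//! to keep the sketch self-contained:
//!
//! ```
//! // Naive expansion of `i,i->`: multiply elements sharing the index `i`
//! // and sum over `i`, since `i` appears more than once.
//! fn inner(a: &[f64], b: &[f64]) -> f64 {
//!     assert_eq!(a.len(), b.len());
//!     a.iter().zip(b).map(|(x, y)| x * y).sum()
//! }
//!
//! assert_eq!(inner(&[1.0, 2.0], &[3.0, 4.0]), 11.0);
//! ```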
//!
//! einsum algorithm
//! -----------------
//! We give an overview of the einsum algorithm to help understand the structure of this crate.
//!
//! ### Factorize and Memorize partial summation
//! Partial summation and memorization of its result reduce the number of floating point operations.
//! For simplicity, both addition `+` and multiplication `*` are counted as one operation,
//! and fused multiply-add (FMA) is not considered.
//! In the above `matmul3` example, there are about $\\#K \times \\#J$ additions
//! and $2 \times \\#K \times \\#J$ multiplications for each pair of indices $(i, l)$,
//! where $\\#$ denotes the number of elements in an index set.
//! Assuming all index sets have the same size $N$,
//! there are $O(N^4)$ floating point operations in total.
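//!
//! For a concrete, hypothetical scale: with $N = 100$ for every index set,
//! the naive evaluation of `ij,jk,kl->il` performs roughly
//! $$
//! N^2 \times 3N^2 = 3N^4 = 3 \times 10^8
//! $$
//! operations, since there are $N^2$ output elements $(i, l)$
//! and each needs about $3 \times \\#J \times \\#K = 3N^2$ additions and multiplications.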
//!
//! When we first sum partially over `j`:
//! $$
//! \sum_{k \in K} c_{kl} \left( \sum_{j \in J} a_{ij} b_{jk} \right),
//! $$
//! and memorize its result as $d_{ik}$:
//! $$
//! \sum_{k \in K} c_{kl} d_{ik},
//! \text{where} \space d_{ik} = \sum_{j \in J} a_{ij} b_{jk},
//! $$
//! there are $O(N^3)$ operations for both computing $d_{ik}$ and the final summation,
//! at the cost of $O(N^2)$ storage for the intermediate tensor.
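//!
//! A minimal sketch of this factorized evaluation (using a plain `Vec<Vec<f64>>`
//! matrix representation for illustration only; this is not this crate's API):
//!
//! ```
//! // Factorized `ij,jk,kl->il`: first store d_{ik} = \sum_j a_{ij} b_{jk},
//! // then contract it with c. This needs O(N^3) operations and O(N^2) extra storage.
//! fn matmul3_factorized(a: &[Vec<f64>], b: &[Vec<f64>], c: &[Vec<f64>]) -> Vec<Vec<f64>> {
//!     let (n_i, n_j) = (a.len(), b.len());
//!     let (n_k, n_l) = (c.len(), c[0].len());
//!     // d_{ik} = \sum_j a_{ij} b_{jk}
//!     let mut d = vec![vec![0.0; n_k]; n_i];
//!     for i in 0..n_i {
//!         for k in 0..n_k {
//!             for j in 0..n_j {
//!                 d[i][k] += a[i][j] * b[j][k];
//!             }
//!         }
//!     }
//!     // out_{il} = \sum_k d_{ik} c_{kl}
//!     let mut out = vec![vec![0.0; n_l]; n_i];
//!     for i in 0..n_i {
//!         for l in 0..n_l {
//!             for k in 0..n_k {
//!                 out[i][l] += d[i][k] * c[k][l];
//!             }
//!         }
//!     }
//!     out
//! }
//! ```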
//!
//! When is this factorization possible? We know that the above `matmul3` example
//! can also be written as the associative matrix product $ABC = A(BC) = (AB)C$,
//! and the partial summation over $j$ corresponds to storing $D = AB$.
//! This is not always possible. Consider the trace of a product of two matrices:
//! $$
//! \text{Tr} (AB) = \sum_{i \in I} \sum_{j \in J} a_{ij} b_{ji}
//! $$
//! This is written as `ij,ji->` in einsum subscript form.
//! We cannot factor either $a_{ij}$ or $b_{ji}$ out of the summation
//! since each of them contains both indices.
//! Whether this factorization is possible can be determined
//! from the einsum subscript alone, and we call a subscript "reducible"
//! if factorization is possible and "irreducible" if it is not,
//! e.g. `ij,jk,kl->il` is reducible and `ij,ji->` is irreducible.
//!
//! ### Subscript representation
//!
//! To discuss subscript factorization, we have to track which tensor is
//! used as each input.
//! In the above `matmul3` example, `ij,jk,kl->il` is factorized into the sub-subscripts
//! `ij,jk->ik` and `ik,kl->il`, where `ik` in the second subscript refers to
//! the output of the first one. The tensor names
//! have been dropped from the sub-subscripts,
//! so we need a mechanism for managing them.
//!
//! We introduce a subscript representation of the `matmul3` case with tensor names:
//!
//! ```text
//! ij,jk,kl->il | a b c -> out
//! ```
//!
//! In this form, the factorization can be described as:
//!
//! ```text
//! ij,jk->ik | a b -> d
//! ik,kl->il | d c -> out
//! ```
//!
//! To clarify whether a tensor is given by the user or created during factorization,
//! we use `arg{N}` and `out{N}` identifiers:
//!
//! ```text
//! ij,jk->ik | arg0 arg1 -> out0
//! ik,kl->il | out0 arg2 -> out1
//! ```
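//!
//! A minimal sketch of how such identifiers could be modeled is shown below;
//! the types are hypothetical and do not necessarily match this crate's actual
//! `namespace` implementation:
//!
//! ```
//! // Hypothetical identifier for a tensor position in a factorized einsum.
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! enum TensorId {
//!     // `arg{N}`: the N-th tensor supplied by the user
//!     Arg(usize),
//!     // `out{N}`: the N-th intermediate tensor created during factorization
//!     Out(usize),
//! }
//!
//! // The first step `ij,jk->ik | arg0 arg1 -> out0` then refers to its tensors as:
//! let _inputs = [TensorId::Arg(0), TensorId::Arg(1)];
//! let _output = TensorId::Out(0);
//! ```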
//!
//! ### Summation order
//!
//! This factorization is not unique.
//! For the `matmul3` case there are clearly two ways, one corresponding to $(AB)C$:
//!
//! ```text
//! ij,jk->ik | arg0 arg1 -> out0
//! ik,kl->il | out0 arg2 -> out1
//! ```
//!
//! and the other corresponding to $A(BC)$:
//!
//! ```text
//! jk,kl->jl | arg1 arg2 -> out0
//! jl,ij->il | out0 arg0 -> out1
//! ```
//!
//! These are different procedures, i.e. the number of floating point operations
//! and the required intermediate memory are different,
//! but they return the same output tensor
//! (we ignore the non-associativity of floating point numbers in this document).
//! This becomes a complicated combinatorial optimization problem
//! when there are many contraction indices,
//! and the objective of this crate is to (heuristically) solve this problem.
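//!
//! To illustrate why the order matters, the following sketch counts multiplications
//! for both orders under a rough cost model; the sizes are hypothetical and chosen
//! only to make the gap visible:
//!
//! ```
//! // Rough cost model: a pairwise contraction `pq,qr->pr` over sizes (p, q, r)
//! // costs about p * q * r multiplications.
//! fn contraction_cost(p: usize, q: usize, r: usize) -> usize {
//!     p * q * r
//! }
//!
//! // Hypothetical sizes: #I = 10, #J = 100, #K = 5, #L = 50.
//! let (n_i, n_j, n_k, n_l) = (10, 100, 5, 50);
//!
//! // (AB)C: `ij,jk->ik` then `ik,kl->il`
//! let ab_first = contraction_cost(n_i, n_j, n_k) + contraction_cost(n_i, n_k, n_l);
//! // A(BC): `jk,kl->jl` then `jl,ij->il`
//! let bc_first = contraction_cost(n_j, n_k, n_l) + contraction_cost(n_j, n_l, n_i);
//!
//! assert_eq!(ab_first, 7_500);
//! assert_eq!(bc_first, 75_000); // ten times more work for the same result
//! ```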
//!

pub mod codegen;
pub mod parser;

mod namespace;
mod path;
mod subscripts;

pub use namespace::*;
pub use path::*;
pub use subscripts::*;