Constructing Multilayer Perceptrons as Piecewise Low-Order Polynomial Approximators:
A Signal Processing Approach
The construction of a multilayer perceptron (MLP) as a piecewise
low-order polynomial approximator using a signal processing approach is
presented in this work. The constructed MLP contains one input layer,
one intermediate layer, and one output layer. Its construction includes
the specification of neuron numbers and all filter weights. Through the construction, a one-to-one correspondence
between the approximation of an MLP and that of a piecewise low-order
polynomial is established. Comparison between piecewise polynomial and
MLP approximations is made. Since the approximation capability of
piecewise low-order polynomials is well understood, our findings shed
light on the universal approximation capability of an MLP.
Piecewise low-order polynomial approximation (or regression) is a
classic topic in numerical analysis. It fits multiple low-order
polynomials to a function in partitioned intervals. The approximation
capability of piecewise low-order polynomials is well understood. In
contrast, the representation power of neural networks is not as obvious.
The representation power of neural networks has been studied for three
decades, and universal approximation theorems in various forms have been
proved [3, 4, 2, 1, 11]. These
proofs indicate that neural networks with appropriate activations can
represent a wide range of functions of interest. The goal of these works
was to make the theorems as general as possible. The proofs were based on
tools from functional analysis (e.g., the Hahn-Banach theorem) and real
analysis (e.g., the Sprecher-Kolmogorov theorem [12, 5]), which are less
well known in the signal processing community.
Furthermore, although they show the existence of a solution, most of
them do not offer a specific MLP design. Lack of MLP design examples
hinders our understanding of the behavior of neural networks.
A feedforward construction of an MLP with two intermediate layers was
presented to solve a classification problem in . As a
sequel to , this work addresses the regression problem.
For ease of presentation, our treatment will focus on the approximation
of a univariate function. We adopt a signal processing (or numerical
analysis) approach to tackle it. Through the design, we can explain how
MLPs offer universal approximations intuitively. The construction
includes the choice of activation functions, neuron numbers, and filter
weights between layers. A one-to-one correspondence between an MLP and
a piecewise low-order polynomial will be established. Except for the
trivial case of the unit step activation function, results on MLP
construction and its equivalence to piecewise low-order polynomial
approximators are new. Another interesting observation is the critical
role played by activation functions. That is, activation functions
actually define kernels of piecewise low-order polynomials. Since the
approximation error of piecewise low-order polynomials is well
understood, our findings shed light on the universal approximation
capability of an MLP as well.
The rest of this paper is organized as follows. The construction of
MLPs corresponding to piecewise constant, linear and cubic polynomial
approximations is detailed in Sec. II. The
generalization from the 1-D case to multivariate vector functions is
sketched in Sec. III. Concluding remarks and
future research directions are given in Sec. IV.
II Proposed MLP Constructions
Without loss of generality, we consider a continuous function f(x)
defined on an interval [a, b], which is uniformly divided into N
sub-intervals. Let Δ = (b − a)/N be the length of each sub-interval and
x_n = a + nΔ, where n = 0, 1, …, N. We show how to construct an MLP so
that it serves as a piecewise constant or a piecewise linear
approximation to f(x) in this section.
II-A Piecewise Constant Approximation
A piecewise constant approximation of f(x) can be written as

f_0(x) = Σ_{n=0}^{N−1} f(x_n) B((x − x_n)/Δ),

where B(x) is the unit box function. We can rewrite the unit box function
as the difference of two unit step functions:

B(x) = u(x) − u(x − 1).
Our designed MLP consists of one input node denoting x, one output node
denoting y, and a set of intermediate nodes (or neurons). For each
intermediate neuron, one weight and bias connect it to the input node,
and another weight and bias connect it to the output node. The response
at an intermediate neuron before activation and the response at the
output node can then be written in terms of these weights and biases,
where Act denotes the activation function.
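This construction can be sketched numerically. The sketch below is a minimal illustration under assumed conventions (grid points x_0, …, x_N and box B(x) = u(x) − u(x − 1)); the symbols and weight layout are illustrative, not the paper's exact notation:

```python
import numpy as np

def step(x):
    """Unit step activation u(x)."""
    return (x >= 0).astype(float)

def piecewise_constant_mlp(f, a, b, n):
    """One-hidden-layer MLP with unit-step activations that reproduces a
    piecewise constant approximation of f on [a, b]. Each sub-interval is
    covered by a pair of step neurons whose difference forms a unit box."""
    delta = (b - a) / n
    edges = a + delta * np.arange(n + 1)       # grid points x_0, ..., x_N
    # Hidden layer: neuron j computes u(x - x_j), i.e., weight 1, bias -x_j.
    w1 = np.ones(n + 1)
    b1 = -edges
    # Output layer: the box over [x_j, x_{j+1}) is u(x - x_j) - u(x - x_{j+1});
    # weighting each box by f(x_j) telescopes into the weights below.
    vals = f(edges[:-1])
    w2 = np.zeros(n + 1)
    w2[:-1] += vals
    w2[1:] -= vals
    def mlp(x):
        h = step(np.outer(x, w1) + b1)         # hidden-layer responses
        return h @ w2                          # output node
    return mlp

mlp = piecewise_constant_mlp(np.sin, 0.0, np.pi, 200)
x = np.linspace(0.0, np.pi - 1e-9, 1000)
err = np.max(np.abs(mlp(x) - np.sin(x)))       # O(delta) approximation error
```

The maximum error decays linearly in the sub-interval length, as expected for a piecewise constant approximation.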
II-B Piecewise Linear Approximation
We will construct an MLP using ReLU as the nonlinear activation
function and show its equivalence to a piecewise linear
approximation to f(x). The latter can be expressed as

f_1(x) = Σ_{n=0}^{N} f(x_n) T((x − x_n)/Δ),

where T(x) is the unit triangle function of the form

T(x) = max(0, 1 − |x|).
With one input node denoting x and one output node denoting y, we
construct an MLP that has 4 neurons in two pairs in the intermediate
layer to yield the same effect as the shifted triangle kernel. A weight
and a bias connect the input node to each of the four neurons. In our
design, they are specified so that the four ReLU responses combine into
the desired triangle.
The responses of the two pairs of neurons after ReLU nonlinear
activation are shown in Figs. 1 (a) and (b), respectively, where the
dotted lines indicate the responses of individual neurons and the solid
lines show the combined responses. Finally, the combination of the
responses of all 4 neurons at the output node is shown in Fig. 1 (c).
It is a unit triangle plus a vertical shift.
[Fig. 1: (a) responses of the first pair of neurons, individually
(dotted lines) and jointly (solid lines); (b) responses of the second
pair; (c) all four neurons combined; (d) the approximation of f(x) with
a sequence of weighted triangles.]
It is easy to verify from Fig. 1 (d) that these four neurons produce
nothing but a weighted triangle centered at a grid point, with its base
spanning the two adjacent grid points and its height given by the
function value at the center; f(x) can thus be approximated by a
sequence of such weighted triangles.
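A minimal numerical sketch of this equivalence, assuming the unit triangle T(x) = max(0, 1 − |x|) and one common four-ReLU realization of it (the paper's exact weights are not reproduced in this text):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def unit_triangle(x):
    """Unit triangle T(x) = max(0, 1 - |x|) realized with four ReLU
    neurons (two pairs); one of several equivalent weight choices."""
    return relu(x + 1.0) - relu(x) - relu(x) + relu(x - 1.0)

def piecewise_linear_approx(f, a, b, n):
    """Piecewise linear interpolant of f as a sum of weighted, shifted
    triangles, which is exactly what the constructed MLP computes."""
    delta = (b - a) / n
    xk = a + delta * np.arange(n + 1)
    fk = f(xk)
    def approx(x):
        # Each grid point contributes f(x_k) * T((x - x_k) / delta).
        return sum(fk[k] * unit_triangle((x - xk[k]) / delta)
                   for k in range(n + 1))
    return approx

g = piecewise_linear_approx(np.sin, 0.0, np.pi, 50)
x = np.linspace(0.0, np.pi, 400)
err = np.max(np.abs(g(x) - np.sin(x)))   # O(delta^2) approximation error
```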
II-C Piecewise Cubic Approximation
The cubic kernel satisfies two constraints: i) prescribed values at the
two ends and ii) anti-symmetry with respect to the midpoint. These two
constraints fix several of the cubic coefficients. By substituting them
into Eq. (15), we can solve for the remaining coefficients, with one
degree of freedom left. To complete the design, one choice is to
specify the first-order derivative at the inflection point. As
shown in Fig. 2(a), the inflection point has the maximum
first-order derivative and its second-order derivative is zero.
[Fig. 2: responses of the associated neurons, individually (dotted
lines) and jointly (solid lines); (c) and (d) two designs for the
approximation to f(x) with a sequence of weighted third-order bumps.]
The associated intermediate neurons are activated by the unit cubic
function. The weights and biases from these neurons to the output node
are obtained as the solution to a linear system of equations.
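The exact constraints and system are stated in the paper's Eq. (15) and are not reproduced in this text. As a hypothetical stand-in, the sketch below solves for a cubic s(x) = a + b x + c x^2 + d x^3 on [−1, 1] with fixed end values, an inflection at the midpoint, and a freely chosen slope m there:

```python
import numpy as np

# Hypothetical normalization: s(-1) = 0, s(1) = 1, s''(0) = 0 (inflection
# at the midpoint), and s'(0) = m as the remaining degree of freedom.
m = 0.75  # chosen slope at the inflection point (an illustrative value)
A = np.array([
    [1.0, -1.0, 1.0, -1.0],  # s(-1) = 0
    [1.0,  1.0, 1.0,  1.0],  # s(1)  = 1
    [0.0,  0.0, 2.0,  0.0],  # s''(0) = 0
    [0.0,  1.0, 0.0,  0.0],  # s'(0) = m
])
rhs = np.array([0.0, 1.0, 0.0, m])
a, b, c, d = np.linalg.solve(A, rhs)
s = lambda x: a + b * x + c * x**2 + d * x**3
```

The solved cubic automatically satisfies the anti-symmetry s(x) + s(−x) = 1 about its inflection point, mirroring constraint ii).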
There is an alternative design as shown in Fig. 2(d), where we increase
the spacing between two adjacent bumps. The weight and bias from the
input node to each neuron still follow Eq. (17), while Eq. (18) should
be changed accordingly. This is fine since there is no interference
between adjacent bumps. Furthermore, using the Taylor series expansion
at the meeting point of two associated cubic activation functions, we
can derive the corresponding approximation error.
This section builds a bridge between piecewise low-order polynomial and
MLP approximations. The approximation errors of piecewise low-order
polynomials can be derived using Taylor's series expansion. As the
number of sub-intervals N goes to infinity, the errors of piecewise
constant, linear and cubic polynomials converge to zero at rates of
O(N^{-1}), O(N^{-2}) and O(N^{-4}), respectively. The same conclusion
applies to the corresponding designed MLPs.
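The quadratic rate of the piecewise linear case can be checked empirically; the helper below is an illustration (not the paper's code), using a uniform-grid interpolant of sin:

```python
import numpy as np

def max_err_linear(f, a, b, n):
    """Maximum error of the piecewise linear interpolant of f on a
    uniform grid with n sub-intervals."""
    xk = np.linspace(a, b, n + 1)
    x = np.linspace(a, b, 20001)
    return np.max(np.abs(np.interp(x, xk, f(xk)) - f(x)))

e1 = max_err_linear(np.sin, 0.0, np.pi, 32)
e2 = max_err_linear(np.sin, 0.0, np.pi, 64)
ratio = e1 / e2   # doubling n should shrink the error by about 2^2 = 4
```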
The approximations above share a common form: the unknown function at a
given point can be expressed as a summation of weighted kernels, where
the weights can be solved by a regression method (say, linear
regression) from given samples. The box, triangle, and third-order unit
bump functions are kernels that smooth statistical variations in local
intervals. A pre-selected kernel could be too local and rigid to be
efficient in handling long- and mid-range correlations in the data. In
contrast, the parameters in the first stage of an MLP allow a richer
choice of kernels. Our MLP design does not exploit this flexibility fully. For
example, we may adopt more different values (see Fig.
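The two-stage view above, fixed local kernels followed by a linear regression on their weights, can be sketched as follows; the triangle kernels, grid, and noise level are illustrative assumptions:

```python
import numpy as np

def triangle(x):
    """Unit triangle kernel."""
    return np.maximum(0.0, 1.0 - np.abs(x))

rng = np.random.default_rng(0)
a, b, n = 0.0, np.pi, 20
delta = (b - a) / n
centers = a + delta * np.arange(n + 1)

# Noisy samples of the unknown function.
xs = rng.uniform(a, b, 400)
ys = np.sin(xs) + 0.05 * rng.normal(size=xs.size)

# Design matrix: one column per shifted kernel, evaluated at each sample.
Phi = triangle((xs[:, None] - centers[None, :]) / delta)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)   # second-stage regression

xg = np.linspace(a, b, 200)
fit = triangle((xg[:, None] - centers[None, :]) / delta) @ w
err = np.max(np.abs(fit - np.sin(xg)))
```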
One focal point of neural network research is to understand the role of
activation in a neuron. Historically, the earliest activation was the
simple unit step function. It offers two states, "on" and "off", with an
abrupt jump. Such an activation has zero derivative everywhere except at
the jump and, as a result, backpropagation is not able to update the
parameters. Several variants, such as the sigmoid, ReLU, and leaky ReLU
activations, were developed later. Paired activations are used to build
kernels of low-order polynomials here. As presented earlier,
higher-order nonlinear activations are not fundamental to the universal
approximation capability of neural networks; they affect the kernel
shape only. Instead, the role of nonlinear activation is interpreted as
a tool to resolve the sign confusion problem caused by the cascade of
multiple layers in  and
III Generalization to Multivariate Vector Functions
Our presentation has focused on the case of 1-D input/output so far. In
general, the regression problem can involve a multivariate vector
function. We provide a sketch of the generalization from the univariate
scalar function to the multivariate vector function in this section.
Generalizing our proposed design to a vector function is
straightforward, since we can design one MLP for each output dimension.
To generalize from the univariate case to the multivariate case using
1-D piecewise low-order polynomials, we can proceed as follows:
Partition the input domain into multi-dimensional cells of equal volume;
Conduct the 1-D approximation along each dimension separately;
Perform the tensor product of the 1-D kernels to form multi-dimensional kernels.
For the MLP design, we cannot perform the tensor product in the third
step directly. Instead, we can use the tensor product result to find the
weighted contributions of neighboring grid points to the center of each
cell and construct an MLP accordingly.
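A 2-D illustration of the tensor-product step, with assumed triangle kernels on the unit square (a sketch, not the paper's exact multi-dimensional construction):

```python
import numpy as np

def triangle(x):
    return np.maximum(0.0, 1.0 - np.abs(x))

def triangle_2d(x, y):
    """2-D kernel formed as the tensor product of two 1-D triangles."""
    return triangle(x) * triangle(y)

def approx_2d(f, n, x, y):
    """Bilinear interpolant of f on [0, 1]^2 as a sum of weighted
    tensor-product kernels on a uniform n-by-n grid."""
    delta = 1.0 / n
    g = delta * np.arange(n + 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(n + 1):
        for j in range(n + 1):
            out += f(g[i], g[j]) * triangle_2d((x - g[i]) / delta,
                                               (y - g[j]) / delta)
    return out

f = lambda x, y: np.sin(x + 2.0 * y)
xs = np.linspace(0.0, 1.0, 40)
X, Y = np.meshgrid(xs, xs)
err = np.max(np.abs(approx_2d(f, 16, X, Y) - f(X, Y)))
```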
In most practical machine learning applications, the multivariate
vector function is assumed to be sufficiently smooth; that is, its
low-order partial derivatives exist. Consequently, the Taylor series
expansion for multivariable functions can be adopted to estimate the
convergence rate to the target regression function.
IV Conclusion and Future Work
A bridge between piecewise low-order polynomials and three-layer MLP
approximators was built in this work. Instead of following the
traditional path of universal approximation theorem of neural networks
based on functional or real analysis, we proposed a signal processing
approach to construct MLPs. The construction includes the choice of
activation functions, neuron numbers, and filter weights between
adjacent layers. Our design decomposes an MLP into two stages: 1) kernel
construction in the first stage and 2) weight regression in the second
stage. We can explain how an MLP offers a universal approximation to a
univariate function intuitively by this approach. Kernel design plays
an important role in many machine learning and statistical inference
problems. It was shown to be related to the shape of the activation
function.
Given an MLP architecture, backpropagation is used to solve kernel
construction and weight regression problems jointly via end-to-end
optimization. It is however an open problem to find the optimal number
of neurons in the intermediate layer. On the other hand, the
traditional approach decouples the whole problem into two subproblems
and solves them one by one in a feedforward manner. It is interesting to
study the relationship between the kernel design and the first-stage
filter weight/bias determination of MLPs. It is also desirable to
exploit prior knowledge about data and applications for flexible kernel
design in feedforward learning systems.
- under the assumption that the low-order derivatives of the target
function are bounded.
Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems.
IEEE Transactions on Neural Networks 6 (4), pp. 911–917.
Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems 2 (4), pp. 303–314.
Multilayer feedforward networks are universal approximators.
Neural Networks 2 (5), pp. 359–366.
On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.
In Doklady Akademii Nauk,
Vol. 114, pp. 953–956.
A logical calculus of the ideas immanent in nervous activity.
The Bulletin of Mathematical Biophysics 5 (4), pp. 115–133.
On estimation of a probability density function and mode.
The annals of mathematical statistics 33 (3), pp. 1065–1076.
McGraw-Hill, New York.
On the structure of continuous functions of several variables.
Transactions of the American Mathematical Society 115, pp. 340–355.