Constructing Multilayer Perceptrons as Piecewise LowOrder Polynomial Approximators:
A Signal Processing Approach
Abstract
The construction of a multilayer perceptron (MLP) as a piecewise
loworder polynomial approximator using a signal processing approach is
presented in this work. The constructed MLP contains one input, one
intermediate and one output layers. Its construction includes the
specification of neuron numbers and all filter weights. Through the construction, a onetoone correspondence
between the approximation of an MLP and that of a piecewise loworder
polynomial is established. Comparison between piecewise polynomial and
MLP approximations is made. Since the approximation capability of
piecewise loworder polynomials is well understood, our findings shed
light on the universal approximation capability of an MLP.
piecewise polynomial approximation.
I Introduction
Piecewise loworder polynomial approximation (or regression) is a
classic topic in numerical analysis. It fits multiple loworder
polynomials to a function in partitioned intervals. The approximation
capability of piecewise loworder polynomials is well understood. In
contrast, the representation power of neural networks is not as obvious.
The representation power of neural networks has been studied for three
decades, and universal approximation theorems in various forms have been
proved [3, 4, 2, 1, 11]. These
proofs indicate that neural networks with appropriate activations can
represent a wide range of functions of interest. Their goals were to
make their theorems as general as possible. The proofs were based on
tools from functional analysis (e.g., the HahnBanach theorem
[10]) and real analysis (e.g., SprecherKolmogorov
theorem [12, 5]),
which is less well known in the signal processing community.
Furthermore, although they show the existence of a solution, most of
them do not offer a specific MLP design. Lack of MLP design examples
hinders our understanding of the behavior of neural networks.
A feedforward construction of an MLP with two intermediate layers was
presented to solve a classification problem in [7]. As a
sequel to [7], this work addresses the regression problem.
For ease of presentation, our treatment will focus on the approximation
of a univariate function. We adopt a signal processing (or numerical
analysis) approach to tackle it. Through the design, we can explain how
MLPs offer universal approximations intuitively. The construction
includes the choice of activation functions, neuron numbers, and filter
weights between layers. A onetoone correspondence between an MLP and
a piecewise loworder polynomial will be established. Except for the
trivial case of the unit step activation function, results on MLP
construction and its equivalence to piecewise loworder polynomial
approximators are new. Another interesting observation is the critical
role played by activation functions. That is, activation functions
actually define kernels of piecewise loworder polynomials. Since the
approximation error of piecewise loworder polynomials is well
understood, our findings shed light on the universal approximation
capability of an MLP as well.
The rest of this paper is organized as follows. The construction of
MLPs corresponding to pieceswise constant, linear and cubic polynomial
approximations is detailed in Sec. II. The
generalization from the 1D case to multivariate vector functions is
sketched in Sec. III. Concluding remarks and
future research directions are given in Sec. IV.
Ii Proposed MLP Constructions
Without loss of generality, we consider a continuous function
defined on interval , which is uniformly divided into
subintervals. Let be the length of the subinterval and
, where . We show how to construct an MLP so
that it serves as a piecewise constant or a piecewise linear
approximation to in this section.
Iia Piecewise Constant Approximation
A piecewise constant approximation of can be written as
(1) 
where
(2) 
is the unit box function. We can rewrite the unit box function
as the difference of two unit step functions
(3) 
where
(4) 
is the unit step function. Then, we can rewrite Eq. (1)
as
(5) 
Our designed MLP consists of one input node denoting , one
output node denoting , and intermediate nodes (or neurons),
denoted by , where . We use and
to denote the weight and bias between the input node,
and and and to denote the
weight and bias between and the output node ,
respectively. The response at before activation, denoted by , and
the response at , denoted by , can be written as
(6)  
(7) 
where Act is the activation function, and are weights
and and are biases.
IiB Piecewise Linear Approximation
A commonly used nonlinear activation function is the rectified linear
unit (ReLU), which can be written as
(9) 
We will construct an MLP using as the nonlinear activation
function and show its equivalence to a piecewise linear
approximation to . The latter can be expressed as
(10) 
where
(11) 
and where is the unit triangle function in form of
(12) 
With one input node denoting and one output node denoting
, we construct an MLP that has 4 neurons in two pairs, denoted by
, , in the intermediate layer to yield the same
effect of . We use and to denote the
weight and bias from the input node to , respectively.
In our design, they are specified as:
(13) 
The responses of neurons , and after ReLU
nonlinear activation are shown in Figs. 1 (a) and (b),
respectively, where the dotted lines indicate those of individual
neurons and the solid lines show those of combined responses. Finally,
the combination of all responses of all 4 neurons at the output node is
shown in Fig. 1 (c). It is a unit triangle plus a vertical
shift.
individually (dotted lines) and jointly (solid lines), (b)
neurons and , (c) all four neurons combined, and
(d) the approximation of with a sequence of weighted triangle
functions.
Similarly, we use and to
denote the weight and bias from to the output node ,
respectively. They are specified below:
(14) 
It is easy to verify from the Fig. 1 (d) that these four neurons are nothing
but a weighted triangle centered at with its base between
and and height and can be
approximated by a sequence of such weighted triangles.
IiC Piecewise Cubic Approximation
Next, we design an MLP that offers a piecewise cubic approximation to
. Besides incoming and outgoing weights/biases, we will design
the unit cubic activation (see Fig. 2(a)) in form of:
(15) 
where , , satisfies two constraints: i)
and, ii) is antisymmetric with respect to ; i.e.,
(16) 
Note that these two constraints imply and . By
substituting , , into Eq. (15),
we get , , and . There is one degree of
freedom left. To complete the design, one choice is to specify the
firstorder derivative at the inflection point: . As
shown in Fig. 2(a), the inflection point has the maximum
firstorder derivative and its secondorder derivative is zero.
and individually (dotted lines) and jointly (solid
lines), (c) and (d) two designs for the approximation to with a
sequence of weighted thirdorder bumps.
For interval centered at , we can use two
neurons to build a thirdorder unit bump function plus a vertical shift
as shown in Fig. 2(b). The weight and bias from the input
node to are specified as:
(17) 
where and are activated by the unit cubic function,
. The weights and biases from , , to output node
are:
(18) 
where , , is the solution to the following linear
system of equations:
(19) 
with boundary conditions . Given the target
regression function at the righthandside of (19),
one can solve the system for , , uniquely.
If is smooth in a local region,
as shown in Fig. 2(c).
There is an alternative design as shown in Fig. 2(d), where
we increase the spacing between two adjacent bumps from to .
The weight and bias from the input node to still
follow Eq. (17) with , where ,
while Eq. (18) should be changed to
(20) 
This is fine since there is no interference between ,
. Furthermore, using the Taylor series
expansion
point of two associated cubic activation functions, we can derive
(21) 
The design in Fig. 2(c) demands neurons while that in
Fig. 2(d) demands neurons only in total. The latter design
appears to be simpler.
IiD Discussion
This section builds a bridge between Piecewise loworder polynomial and
MLP approximations. The approximation errors of piecewise loworder
polynomials can be derived using Taylor’s series expansion. As
goes to infinity, the errors of piecewise constant, linear
and cubic polynomials converge to zero at a rate of , and
. The same conclusion applies to the corresponding designed MLPs.
There is however difference between them. To give an example, consider
the kernel density estimation problem [9]:
(22) 
where is a kernel. The unknown function at given point
can be expressed as the summation of weighted kernels, where weights
can be solved by a regression method (say, linear regression)
for given . The box, triangle, and thirdorder unit bump
functions are kernels to smooth statistical variations in local
intervals. The preselected kernel could be too local and rigid to be
efficient in handling long and midrange correlations in . In
contrast, parameters in the first stage of an MLP allow a richer choice
of kernels. Our MLP design does not exploit this flexibility fully. For
example, we may adopt more different values (see Fig.
3).
One focal point of neural network research is to understand the role of
activation in a neuron. Historically, earliest activation was the simple
unit step function [8]. It offers “on” and
“off” two states with an abrupt jump. Such activation has zero
derivatives except at and, as a result, backpropagation is not
able to update parameters. Several variants such as the sigmoid, ReLU,
leaky ReLU activations have been developed later. Paired activations are
used to build kernels of loworder polynomials here. As presented
earlier, higherorder nonlinear activations are not fundamental in the
universal approximation capability of neural networks. They affect the
kernel shape only. Instead, the role of nonlinear activation is
interpreted as a tool to resolve the sign confusion problem caused by
the cascade of multiple layers in [6] and
[7].
Iii Generalization to Multivariate Vector Functions
Our presentation has focused on the case of 1D input/output so far. In
general, the regression problem can be a dimensional vector function
of multivariables:
(23) 
where . We will provide a sketch to the generalization from
the univariate scalar function to the multivariate vector function in
this section. Generalization of our proposed design to a vector function
is straightforward since we can design one MLP for each output dimention.
The generalization from the univariate case to the multivariate case
using 1D piecewise loworder polynomials, we can proceed as follows:

Partition into D cells of volume size ;

Conduct the 1D approximation separately;

Perform the tensor product of 1D kernels to form D kernels.
For the MLP design, we cannot perform the tensor product in the third
step directly. Instead, we can use the tensor product result to find the
weighted contributions of neighboring grid points to the center of each
cell and construct an MLP accordingly.
In most practical machine learning applications, the multivariate vector
function, , it is assumed to be sufficiently smooth. That
is, its lowerorder partial derivatives exist. Consequently, Talyor
series expansion for multivariable functions can be adopted to estimate
the convergence rate to the target regression function.
Iv Conclusion and Future Work
A bridge between piecewise loworder polynomials and threelayer MLP
approximators was built in this work. Instead of following the
traditional path of universal approximation theorem of neural networks
based on functional or real analysis, we proposed a signal processing
approach to construct MLPs. The construction includes the choice of
activation functions, neuron numbers, and filter weights between
adjacent layers. Our design decomposes an MLP into two stages: 1) kernel
construction in the first stage and 2) weight regression in the second
stage. We can explain how an MLP offers an universal approximation to a
univariate function intuitively by this approach. Kernel design plays
an important role in many machine learning and statistical inference
problems. It was shown to be related to the shape of the activation
function here.
Given an MLP architecture, backprogagation is used to solve kernel
construction and weight regression problems jointly via endtoend
optimization. It is however an open problem to find the optimal number
of neurons in the intermediate layer. On the other hand, the
traditional approach decouples the whole problem into two subproblems
and solve them one by one in a feedforward manner. It is interesting to
study the relationship between the kernel design and the firststage
filter weight/bias determination of MLPs. It is also desired to exploit
prior knowledge about data and applications for flexible kernel design
in feedforward learning systems.
Footnotes
 under the assumption that loworder derivatives
of are bounded
References

(1995)
Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems.
IEEE Transactions on Neural Networks 6 (4), pp. 911–917.
Cited by: §I.

(1992)
Approximation by ridge functions and neural networks with one hidden layer.
Journal of Approximation Theory 70 (2), pp. 131–141.
Cited by: §I.

(1989)
Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems 2 (4), pp. 303–314.
Cited by: §I.

(1989)
Multilayer feedforward networks are universal approximators.
Neural Networks 2 (5), pp. 359–366.
Cited by: §I.

(1957)
On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.
In Doklady Akademii Nauk,
Vol. 114, pp. 953–956.
Cited by: §I.

(2016)
Understanding convolutional neural networks with a mathematical model.
Journal of Visual Communication and Image Representation 41, pp. 406–413.
Cited by: §IID.

(2020)
From twoclass linear discriminant analysis to interpretable multilayer perceptron design.
arXiv preprint arXiv:2009.04442.
Cited by: §I,
§IID.

(1943)
A logical calculus of the ideas immanent in nervous activity.
The Bulletin of Mathematical Biophysics 5 (4), pp. 115–133.
Cited by: §IID.

(1962)
On estimation of a probability density function and mode.
The annals of mathematical statistics 33 (3), pp. 1065–1076.
Cited by: §IID.

(1973)
Functional analysis.
McGrawHill, New York.
Cited by: §I.

(1998)
Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results.
Neural networks 11 (1), pp. 15–37.
Cited by: §I,
§IIA.

(1965)
On the structure of continuous functions of several variables.
Transactions of the American Mathematical Society 115, pp. 340–355.
Cited by: §I.