Constructing Multilayer Perceptrons as Piecewise Low-Order Polynomial Approximators: A Signal Processing Approach



Abstract

The construction of a multilayer perceptron (MLP) as a piecewise
low-order polynomial approximator using a signal processing approach is
presented in this work. The constructed MLP contains one input layer, one
intermediate layer, and one output layer. Its construction includes the
specification of the neuron numbers and all filter weights. Through the construction, a one-to-one correspondence
between the approximation of an MLP and that of a piecewise low-order
polynomial is established. Comparison between piecewise polynomial and
MLP approximations is made. Since the approximation capability of
piecewise low-order polynomials is well understood, our findings shed
light on the universal approximation capability of an MLP.

multilayer perceptron, feedforward neural network, approximation theory,
piecewise polynomial approximation.

© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Piecewise low-order polynomial approximation (or regression) is a
classic topic in numerical analysis. It fits multiple low-order
polynomials to a function in partitioned intervals. The approximation
capability of piecewise low-order polynomials is well understood. In
contrast, the representation power of neural networks is not as obvious.
The representation power of neural networks has been studied for three
decades, and universal approximation theorems in various forms have been
proved [3, 4, 2, 1, 11]. These
proofs indicate that neural networks with appropriate activations can
represent a wide range of functions of interest. The goal of these works was to
make the theorems as general as possible. The proofs were based on
tools from functional analysis (e.g., the Hahn-Banach theorem
[10]) and real analysis (e.g., the Sprecher-Kolmogorov
theorem [12, 5]),
which are less well known in the signal processing community.
Furthermore, although they show the existence of a solution, most of
them do not offer a specific MLP design. Lack of MLP design examples
hinders our understanding of the behavior of neural networks.

A feedforward construction of an MLP with two intermediate layers was
presented to solve a classification problem in [7]. As a
sequel to [7], this work addresses the regression problem.
For ease of presentation, our treatment will focus on the approximation
of a univariate function. We adopt a signal processing (or numerical
analysis) approach to tackle it. Through the design, we can explain how
MLPs offer universal approximations intuitively. The construction
includes the choice of activation functions, neuron numbers, and filter
weights between layers. A one-to-one correspondence between an MLP and
a piecewise low-order polynomial will be established. Except for the
trivial case of the unit step activation function, results on MLP
construction and its equivalence to piecewise low-order polynomial
approximators are new. Another interesting observation is the critical
role played by activation functions. That is, activation functions
actually define kernels of piecewise low-order polynomials. Since the
approximation error of piecewise low-order polynomials is well
understood, our findings shed light on the universal approximation
capability of an MLP as well.

The rest of this paper is organized as follows. The construction of
MLPs corresponding to piecewise constant, linear and cubic polynomial
approximations is detailed in Sec. II. The
generalization from the 1-D case to multivariate vector functions is
sketched in Sec. III. Concluding remarks and
future research directions are given in Sec. IV.

II Proposed MLP Constructions

Without loss of generality, we consider a continuous function $f(x)$
defined on interval $[a,b]$, which is uniformly divided into $N$
sub-intervals. Let $\Delta=(b-a)/N$ be the length of each sub-interval and
$x_n=a+n\Delta$ be the grid points, where $n=0,1,\ldots,N$. We show how to construct an MLP so
that it serves as a piecewise constant or a piecewise linear
approximation to $f(x)$ in this section.

II-A Piecewise Constant Approximation

A piecewise constant approximation of $f(x)$ can be written as

$\tilde{f}_c(x)=\sum_{n=0}^{N-1} f(x_n)\, b\!\left(\frac{x-x_n}{\Delta}\right),$   (1)

where

$b(x)=\begin{cases}1, & 0\le x<1,\\ 0, & \text{otherwise},\end{cases}$   (2)

is the unit box function. We can rewrite the unit box function
as the difference of two unit step functions

$b(x)=u(x)-u(x-1),$   (3)

where

$u(x)=\begin{cases}1, & x\ge 0,\\ 0, & x<0,\end{cases}$   (4)

is the unit step function. Then, we can rewrite Eq. (1)
as

$\tilde{f}_c(x)=\sum_{n=0}^{N-1} f(x_n)\left[u(x-x_n)-u(x-x_{n+1})\right].$   (5)

Our designed MLP consists of one input node denoting $x$, one
output node denoting $\tilde{f}_c(x)$, and $N$ intermediate nodes (or neurons),
denoted by $R_n$, where $n=0,1,\ldots,N-1$. We use $a_n$ and
$b_n$ to denote the weight and bias between the input node and $R_n$,
and $c_n$ and $d$ to denote the
weight and bias between $R_n$ and the output node,
respectively. The response at $R_n$ before activation, denoted by $y_n$, and
the response at the output node, denoted by $z$, can be written as

$y_n=a_n x+b_n,$   (6)
$z=\sum_{n=0}^{N-1} c_n\,\mathrm{Act}(y_n)+d,$   (7)

where Act is the activation function, $a_n$ and $c_n$ are weights,
and $b_n$ and $d$ are biases.

With the unit step function as activation and Eq.
(5), it is easy to verify that we can choose
the following weights and biases for the MLP:

$a_n=1,\quad b_n=-x_n,\quad c_0=f(x_0),\quad c_n=f(x_n)-f(x_{n-1})\ (n\ge 1),\quad d=0,$   (8)

where $n=0,1,\ldots,N-1$. The above derivation was given in
[11], and it is repeated here for the sake of
completeness.
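
For readers who prefer a concrete realization, the following is a minimal
NumPy sketch of this step-activation construction; the function names and the
exact weight layout are illustrative choices consistent with Eqs. (5)-(8), not
taken verbatim from the paper.

```python
import numpy as np

def step(x):
    # Unit step activation: 1 for x >= 0, 0 otherwise.
    return (x >= 0).astype(float)

def build_piecewise_constant_mlp(f, a, b, N):
    # One-hidden-layer MLP realizing a piecewise constant approximation
    # of f on [a, b] with N sub-intervals (hypothetical helper).
    delta = (b - a) / N
    x_grid = a + delta * np.arange(N + 1)   # grid points x_0, ..., x_N
    w_in = np.ones(N)                       # input-to-hidden weights a_n = 1
    b_in = -x_grid[:N]                      # biases b_n = -x_n
    f_vals = f(x_grid[:N])
    c = np.empty(N)                         # hidden-to-output weights
    c[0] = f_vals[0]                        # c_0 = f(x_0)
    c[1:] = f_vals[1:] - f_vals[:-1]        # c_n = f(x_n) - f(x_{n-1})
    def mlp(x):
        x = np.atleast_1d(x)[:, None]       # shape (batch, 1)
        hidden = step(w_in * x + b_in)      # shape (batch, N)
        return hidden @ c                   # piecewise constant output
    return mlp

# Usage: approximate sin(x) on [0, pi] with 32 sub-intervals.
mlp = build_piecewise_constant_mlp(np.sin, 0.0, np.pi, 32)
xs = np.linspace(0.0, np.pi, 7)
print(np.max(np.abs(mlp(xs) - np.sin(xs))))  # bounded by max|f'| * delta
```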

II-B Piecewise Linear Approximation

A commonly used nonlinear activation function is the rectified linear
unit (ReLU), which can be written as

$\mathrm{ReLU}(x)=\max(0,x).$   (9)

We will construct an MLP using ReLU as the nonlinear activation
function and show its equivalence to a piecewise linear
approximation to $f(x)$. The latter can be expressed as

$\tilde{f}_l(x)=\sum_{n=0}^{N} w_n\, T\!\left(\frac{x-x_n}{\Delta}\right),$   (10)

where

$w_n=f(x_n),$   (11)

and where $T(x)$ is the unit triangle function in form of

$T(x)=\begin{cases}1-|x|, & |x|\le 1,\\ 0, & \text{otherwise}.\end{cases}$   (12)

With one input node denoting $x$ and one output node denoting
$\tilde{f}_l(x)$, we construct an MLP that has 4 neurons in two pairs, denoted by
$R_{n,j}$, $j=1,2,3,4$, in the intermediate layer to yield the same
effect as $T\!\left(\frac{x-x_n}{\Delta}\right)$. We use $a_{n,j}$ and $b_{n,j}$ to denote the
weight and bias from the input node to $R_{n,j}$, respectively.
In our design, they are specified as:

$a_{n,1}=a_{n,2}=\frac{1}{\Delta},\quad b_{n,1}=-\frac{x_{n-1}}{\Delta},\quad b_{n,2}=-\frac{x_n}{\Delta},\qquad a_{n,3}=a_{n,4}=-\frac{1}{\Delta},\quad b_{n,3}=\frac{x_{n+1}}{\Delta},\quad b_{n,4}=\frac{x_n}{\Delta}.$   (13)

The responses of the neuron pairs $(R_{n,1}, R_{n,2})$ and $(R_{n,3}, R_{n,4})$ after ReLU
nonlinear activation are shown in Figs. 1 (a) and (b),
respectively, where the dotted lines indicate those of individual
neurons and the solid lines show the combined responses. Finally,
the combination of the responses of all 4 neurons at the output node is
shown in Fig. 1 (c). It is a unit triangle plus a vertical
shift.

Fig. 1: Responses of (a) neurons $R_{n,1}$ and $R_{n,2}$
individually (dotted lines) and jointly (solid lines), (b)
neurons $R_{n,3}$ and $R_{n,4}$, (c) all four neurons combined, and
(d) the approximation of $f(x)$ with a sequence of weighted triangle
functions.

Similarly, we use $c_{n,j}$ to denote the weight from $R_{n,j}$ to the output node
and $d$ to denote the output bias.
They are specified below:

$c_{n,1}=c_{n,3}=f(x_n),\quad c_{n,2}=c_{n,4}=-f(x_n),\quad d=-\sum_{n=0}^{N} f(x_n).$   (14)

It is easy to verify from Fig. 1 (d) that these four neurons contribute nothing
but a weighted triangle centered at $x_n$ with its base between $x_{n-1}$
and $x_{n+1}$ and height $f(x_n)$, and that $f(x)$ can be
approximated by a sequence of such weighted triangles.
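
As a complement, here is a minimal NumPy sketch of a one-hidden-layer ReLU
network that realizes the same piecewise linear approximant. It uses the
standard three-ReLU hat identity
$\mathrm{ReLU}(x-x_{n-1})-2\,\mathrm{ReLU}(x-x_n)+\mathrm{ReLU}(x-x_{n+1})=\Delta\,T\!\left(\frac{x-x_n}{\Delta}\right)$
rather than the four-neuron pairing of Eqs. (13)-(14); the helper names are
illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def build_piecewise_linear_mlp(f, a, b, N):
    # One-hidden-layer ReLU network realizing the piecewise linear
    # interpolant of f on [a, b]: three ReLUs per hat kernel,
    # 3*(N+1) hidden units in total (illustrative sketch).
    delta = (b - a) / N
    grid = a + delta * np.arange(N + 1)      # grid points x_0, ..., x_N
    hidden_bias, out_weight = [], []
    for n in range(N + 1):
        for shift, coeff in ((-1, 1.0), (0, -2.0), (1, 1.0)):
            hidden_bias.append(-(grid[n] + shift * delta))   # kink locations
            out_weight.append(coeff * f(grid[n]) / delta)    # hat weights
    hidden_bias = np.array(hidden_bias)
    out_weight = np.array(out_weight)
    def mlp(x):
        x = np.atleast_1d(x)[:, None]
        return relu(x + hidden_bias) @ out_weight            # input weights are 1
    return mlp

# Usage: the max error decays like O(delta^2) for a smooth f.
mlp = build_piecewise_linear_mlp(np.sin, 0.0, np.pi, 32)
xs = np.linspace(0.0, np.pi, 200)
print(np.max(np.abs(mlp(xs) - np.sin(xs))))
```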

II-C Piecewise Cubic Approximation

Next, we design an MLP that offers a piecewise cubic approximation to
$f(x)$. Besides incoming and outgoing weights/biases, we will design
the unit cubic activation (see Fig. 2(a)) in form of:

$\psi(x)=c_3x^3+c_2x^2+c_1x+c_0,$   (15)

where $\psi(x)$, $0\le x\le 1$, satisfies two constraints: i) $\psi(0)=0$ and $\psi(1)=1$,
and ii) $\psi(x)$ is anti-symmetric with respect to the midpoint $(1/2,1/2)$; i.e.,

$\psi(x)+\psi(1-x)=1,\quad 0\le x\le 1.$   (16)

Note that these two constraints imply $\psi(1/2)=1/2$ and $\psi''(1/2)=0$. By
substituting $x=0$, $1/2$, $1$ into Eq. (15),
we get $c_0=0$, $\tfrac{1}{8}c_3+\tfrac{1}{4}c_2+\tfrac{1}{2}c_1+c_0=\tfrac{1}{2}$, and
$c_3+c_2+c_1+c_0=1$. There is one degree of
freedom left. To complete the design, one choice is to specify the
first-order derivative at the inflection point, $\psi'(1/2)$. As
shown in Fig. 2(a), the inflection point has the maximum
first-order derivative and its second-order derivative is zero.

Fig. 2: (a) The unit cubic activation function, (b) responses of
the two paired neurons individually (dotted lines) and jointly (solid
lines), (c) and (d) two designs for the approximation to $f(x)$ with a
sequence of weighted third-order bumps.
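
The coefficient computation above can be sketched in a few lines of NumPy.
The endpoint values $\psi(0)=0$, $\psi(1)=1$, the midpoint condition
$\psi(1/2)=1/2$, and the free slope $s=\psi'(1/2)$ follow the constraints
stated above; the specific value of $s$ is left to the designer, and the
function name is illustrative.

```python
import numpy as np

def unit_cubic_coefficients(s):
    # Solve for [c3, c2, c1, c0] from four linear conditions:
    #   psi(0) = 0, psi(1) = 1, psi(1/2) = 1/2, psi'(1/2) = s.
    A = np.array([
        [0.0,   0.0,  0.0, 1.0],   # psi(0)    = c0
        [1.0,   1.0,  1.0, 1.0],   # psi(1)    = c3 + c2 + c1 + c0
        [0.125, 0.25, 0.5, 1.0],   # psi(1/2)  = c3/8 + c2/4 + c1/2 + c0
        [0.75,  1.0,  1.0, 0.0],   # psi'(1/2) = 3*c3/4 + c2 + c1
    ])
    rhs = np.array([0.0, 1.0, 0.5, s])
    return np.linalg.solve(A, rhs)

# Example: s = 3/2 yields the smoothstep-like cubic 3x^2 - 2x^3.
print(unit_cubic_coefficients(1.5))   # approx. [-2., 3., 0., 0.]
```

For any $s>1$, the solved cubic is anti-symmetric about $(1/2,1/2)$ with its
steepest slope at the inflection point, matching the shape in Fig. 2(a).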

For the interval $[x_{n-1},x_{n+1}]$ centered at $x_n$, we can use two
neurons to build a third-order unit bump function plus a vertical shift
as shown in Fig. 2(b). The weight and bias from the input
node to each of the two neurons are specified as:

(17)

where the two neurons are activated by the unit cubic function,
$\psi$. The weights and biases from the two neurons to the output node
are:

(18)

where the bump weight $w_n$, $n=0,1,\ldots,N$, is the solution to the following linear
system of equations:

(19)

with appropriate boundary conditions. Given the target
regression function $f(x)$ at the right-hand side of (19),
one can solve the system for $w_n$, $n=0,1,\ldots,N$, uniquely.
If $f(x)$ is smooth in a local region, the resulting approximation is accurate,
as shown in Fig. 2(c).

There is an alternative design as shown in Fig. 2(d), where
we increase the spacing between two adjacent bumps from $\Delta$ to $2\Delta$.
The weight and bias from the input node to the paired neurons still
follow Eq. (17) with the bump centers placed at every other grid point,
while Eq. (18) should be changed to

(20)

This is fine since there is no interference between adjacent bumps.
Furthermore, using the Taylor series
expansion (see Footnote 1) and based on the fact that the grid point between two
adjacent bump centers is the inflection
point of the two associated cubic activation functions, we can derive

(21)

The design in Fig. 2(c) demands roughly twice as many neurons in total as that in
Fig. 2(d). The latter design
appears to be simpler.

II-D Discussion

This section builds a bridge between piecewise low-order polynomial and
MLP approximations. The approximation errors of piecewise low-order
polynomials can be derived using Taylor's series expansion. As $N$
goes to infinity (i.e., $\Delta\to 0$), the errors of piecewise constant, linear
and cubic polynomials converge to zero at rates of $O(\Delta)$, $O(\Delta^2)$, and
$O(\Delta^4)$, respectively. The same conclusion applies to the corresponding designed MLPs.
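
For a sufficiently smooth $f$, these rates follow from the classical local
Taylor/interpolation remainders on a sub-interval of length $\Delta$; the
bounds below are the standard ones, stated here only as a sketch, with $p_1$
and $p_3$ denoting the local linear and cubic interpolants:

$|f(x)-f(x_n)|=|f'(\xi)|\,|x-x_n|\le \max_{\xi}|f'(\xi)|\,\Delta=O(\Delta),$

$|f(x)-p_1(x)|=\left|\tfrac{1}{2}f''(\xi)(x-x_n)(x-x_{n+1})\right|\le \tfrac{1}{8}\max_{\xi}|f''(\xi)|\,\Delta^2=O(\Delta^2),$

$|f(x)-p_3(x)|=\left|\tfrac{1}{4!}f^{(4)}(\xi)\prod_{i=0}^{3}(x-x_i)\right|=O(\Delta^4),$

where $\xi$ lies in the corresponding sub-interval and $x_0,\ldots,x_3$ denote four local interpolation nodes.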

There is, however, a difference between them. To give an example, consider
the kernel density estimation problem [9]:

$f(x)\approx\sum_{n=0}^{N} w_n K\!\left(\frac{x-x_n}{\Delta}\right),$   (22)

where $K(\cdot)$ is a kernel. The unknown function $f(x)$ at a given point
can be expressed as the summation of weighted kernels, where the weights $w_n$
can be solved by a regression method (say, linear regression)
for given samples of $f$. The box, triangle, and third-order unit bump
functions are kernels that smooth statistical variations in local
intervals. The pre-selected kernel could be too local and rigid to be
efficient in handling long- and mid-range correlations in $f(x)$. In
contrast, parameters in the first stage of an MLP allow a richer choice
of kernels. Our MLP design does not exploit this flexibility fully. For
example, we may adopt kernels with several different support sizes (see Fig.
3).

Fig. 3: Two kernels of different supports centered at the same grid point, which can be
achieved by adding more neurons to the intermediate layer of an MLP.
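
As a small illustration of this weight-regression view, the sketch below fits
the kernel weights $w_n$ in Eq. (22) by ordinary least squares using a hat
(triangle) kernel; the function names, the synthetic data, and the grid are
illustrative choices rather than the paper's setup.

```python
import numpy as np

def triangle_kernel(x):
    # Unit triangle (hat) kernel supported on [-1, 1].
    return np.maximum(1.0 - np.abs(x), 0.0)

def fit_kernel_weights(x_samples, y_samples, centers, width):
    # Fit w_n in y(x) ~ sum_n w_n * K((x - x_n) / width) by least squares.
    design = triangle_kernel((x_samples[:, None] - centers[None, :]) / width)
    w, *_ = np.linalg.lstsq(design, y_samples, rcond=None)
    return w

# Usage: noisy samples of sin(x) on [0, pi], hat kernels on a uniform grid.
rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, np.pi, 200))
ys = np.sin(xs) + 0.05 * rng.standard_normal(xs.size)
centers = np.linspace(0.0, np.pi, 17)
w = fit_kernel_weights(xs, ys, centers, width=centers[1] - centers[0])
print(w[:4])   # fitted weights roughly track sin at the grid points
```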

One focal point of neural network research is to understand the role of
activation in a neuron. Historically, the earliest activation was the simple
unit step function [8]. It offers two states, “on” and
“off”, with an abrupt jump. Such an activation has zero
derivative everywhere except at the jump and, as a result, backpropagation is not
able to update parameters. Several variants, such as the sigmoid, ReLU, and
leaky ReLU activations, were developed later. Paired activations are
used to build kernels of low-order polynomials here. As presented
earlier, higher-order nonlinear activations are not fundamental in the
universal approximation capability of neural networks. They affect the
kernel shape only. Instead, the role of nonlinear activation is
interpreted as a tool to resolve the sign confusion problem caused by
the cascade of multiple layers in [6] and
[7].

III Generalization to Multivariate Vector Functions

Our presentation has focused on the case of 1-D input/output so far. In
general, the regression problem can involve a $q$-dimensional vector function
of multiple variables:

$\mathbf{f}(\mathbf{x})=\left(f_1(\mathbf{x}),\ldots,f_q(\mathbf{x})\right),$   (23)

where $\mathbf{x}\in\mathbb{R}^p$. We will provide a sketch of the generalization from
the univariate scalar function to the multivariate vector function in
this section. Generalization of our proposed design to a vector function
is straightforward since we can design one MLP for each output dimension.

To generalize from the univariate case to the multivariate case
using 1-D piecewise low-order polynomials, we can proceed as follows:

  1. Partition the input domain into $p$-D cells of volume $\Delta^p$;

  2. Conduct the 1-D approximation along each dimension separately;

  3. Perform the tensor product of the 1-D kernels to form $p$-D kernels.

For the MLP design, we cannot perform the tensor product in the third
step directly. Instead, we can use the tensor product result to find the
weighted contributions of neighboring grid points to the center of each
cell and construct an MLP accordingly.
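
The third step can be illustrated with a small NumPy sketch for $p=2$, where a
2-D kernel is formed as the product of two 1-D hat kernels and the function is
approximated by a weighted sum of such tensor-product kernels at the grid
points; the helper names and the test function are illustrative.

```python
import numpy as np

def hat(x):
    # 1-D unit triangle (hat) kernel supported on [-1, 1].
    return np.maximum(1.0 - np.abs(x), 0.0)

def tensor_product_kernel_2d(x, y, center, delta):
    # 2-D kernel at a grid point: product of 1-D hat kernels along each axis.
    cx, cy = center
    return hat((x - cx) / delta) * hat((y - cy) / delta)

def approx_2d(f, grid_1d, delta):
    # Piecewise bilinear approximation of f(x, y) as a weighted sum of
    # tensor-product hat kernels anchored at the grid points.
    def f_hat(x, y):
        total = np.zeros(np.broadcast(x, y).shape)
        for cx in grid_1d:
            for cy in grid_1d:
                total = total + f(cx, cy) * tensor_product_kernel_2d(x, y, (cx, cy), delta)
        return total
    return f_hat

# Usage: f(x, y) = sin(x) * cos(y) on [0, 1]^2 with a 9 x 9 grid.
grid = np.linspace(0.0, 1.0, 9)
f_hat = approx_2d(lambda x, y: np.sin(x) * np.cos(y), grid, grid[1] - grid[0])
print(abs(f_hat(0.37, 0.62) - np.sin(0.37) * np.cos(0.62)))  # O(delta^2) error
```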

In most practical machine learning applications, the multivariate vector
function $\mathbf{f}(\mathbf{x})$ is assumed to be sufficiently smooth. That
is, its low-order partial derivatives exist. Consequently, the Taylor
series expansion for multivariable functions can be adopted to estimate
the convergence rate to the target regression function.

IV Conclusion and Future Work

A bridge between piecewise low-order polynomials and three-layer MLP
approximators was built in this work. Instead of following the
traditional path of universal approximation theorem of neural networks
based on functional or real analysis, we proposed a signal processing
approach to construct MLPs. The construction includes the choice of
activation functions, neuron numbers, and filter weights between
adjacent layers. Our design decomposes an MLP into two stages: 1) kernel
construction in the first stage and 2) weight regression in the second
stage. With this approach, we can intuitively explain how an MLP offers a universal
approximation to a univariate function. Kernel design plays
an important role in many machine learning and statistical inference
problems. It was shown to be related to the shape of the activation
function here.

Given an MLP architecture, backpropagation is used to solve the kernel
construction and weight regression problems jointly via end-to-end
optimization. It is, however, an open problem to find the optimal number
of neurons in the intermediate layer. On the other hand, the
traditional approach decouples the whole problem into two subproblems
and solves them one by one in a feedforward manner. It is interesting to
study the relationship between the kernel design and the first-stage
filter weight/bias determination of MLPs. It is also desirable to exploit
prior knowledge about data and applications for flexible kernel design
in feedforward learning systems.

Footnotes

  1. Under the assumption that the low-order derivatives
    of $f(x)$ are bounded.

References

  1. T. Chen and H. Chen (1995)

    Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems.

    IEEE Transactions on Neural Networks 6 (4), pp. 911–917.

    Cited by: §I.

  2. C. K. Chui and X. Li (1992)

    Approximation by ridge functions and neural networks with one hidden layer.

    Journal of Approximation Theory 70 (2), pp. 131–141.

    Cited by: §I.

  3. G. Cybenko (1989)

    Approximation by superpositions of a sigmoidal function.

    Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.

    Cited by: §I.

  4. K. Hornik, M. Stinchcombe and H. White (1989)

    Multilayer feedforward networks are universal approximators.

    Neural Networks 2 (5), pp. 359–366.

    Cited by: §I.

  5. A. N. Kolmogorov (1957)

    On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.

    In Doklady Akademii Nauk,

    Vol. 114, pp. 953–956.

    Cited by: §I.

  6. C. J. Kuo (2016)

    Understanding convolutional neural networks with a mathematical model.

    Journal of Visual Communication and Image Representation 41, pp. 406–413.

    Cited by: §II-D.

  7. R. Lin, Z. Zhou, S. You, R. Rao and C. J. Kuo (2020)

    From two-class linear discriminant analysis to interpretable multilayer perceptron design.

    arXiv preprint arXiv:2009.04442.

    Cited by: §I,
    §II-D.

  8. W. S. McCulloch and W. Pitts (1943)

    A logical calculus of the ideas immanent in nervous activity.

    The Bulletin of Mathematical Biophysics 5 (4), pp. 115–133.

    Cited by: §II-D.

  9. E. Parzen (1962)

    On estimation of a probability density function and mode.

    The Annals of Mathematical Statistics 33 (3), pp. 1065–1076.

    Cited by: §II-D.

  10. W. Rudin (1973)

    Functional analysis.

    McGraw-Hill, New York.

    Cited by: §I.

  11. F. Scarselli and A. C. Tsoi (1998)

    Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results.

    Neural Networks 11 (1), pp. 15–37.

    Cited by: §I,
    §II-A.

  12. D. A. Sprecher (1965)

    On the structure of continuous functions of several variables.

    Transactions of the American Mathematical Society 115, pp. 340–355.

    Cited by: §I.
