Towards Generalising Neural Implicit Representations
Neural implicit representations have shown substantial improvements in efficiently storing 3D data, when compared to conventional formats.
However, the focus of existing work has mainly been on storage and subsequent reconstruction.
In this work, we argue that training neural representations for reconstruction alongside conventional tasks can produce more general
encodings that admit reconstructions of equal quality to single-task training, whilst providing improved results on conventional tasks when compared to single-task encodings.
Through multi-task experiments on reconstruction, classification, and segmentation, we show that our approach learns feature-rich encodings that produce high
quality results for each task.
We also reformulate the segmentation task, creating a more representative challenge for implicit representation contexts.
Implicit neural representations have garnered significant interest recently for their ability to reconstruct complex 3D structures and
The appeal of these methods stems from a number of useful properties they possess for both reconstructing 3D shapes, as well
as storing them efficiently.
By learning the reconstruction of the shapes, networks are able to encode and use a rich set of priors over the 3D domain to improve
the quality of the reconstructions over what can be achieved with classical methods[mescheder2019occupancy, park2019deepsdf].
Efficiency in storage is achieved by decoupling the encoding from the input and output modality so that, unlike voxel-based representations,
the storage requirements do not grow cubically with the output resolution.
Further, implicit representations do not suffer from the limitations of mesh and point-cloud based representations, where the quality
of the reconstruction is typically limited by the output size constraints of a single feed forward pass[mescheder2019occupancy].
Conditioning a “decoder” network on an encoded representation of the input data, neural representations query the network at sample
point locations for occupancy or distance function information.
This approach allows for reconstructions to be generated with arbitrary resolutions at run-time[mescheder2019occupancy].
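The run-time choice of resolution described above can be illustrated with a minimal sketch. The function `sphere_occupancy` is a hypothetical stand-in for a trained, conditioned decoder (it is not part of the original work), and `query_grid` and the bounding cube of half-width 0.5 are likewise assumptions made for illustration.

```python
import numpy as np


def query_grid(occupancy_fn, resolution, bound=0.5):
    """Evaluate an occupancy function on a regular resolution^3 grid."""
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    probs = occupancy_fn(points)  # one occupancy probability per query point
    return probs.reshape(resolution, resolution, resolution)


def sphere_occupancy(points, radius=0.4):
    """Hypothetical stand-in for a trained decoder: a soft-edged sphere."""
    dist = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp((dist - radius) / 0.01))


# The same function is queried at two resolutions; nothing is re-encoded.
coarse = query_grid(sphere_occupancy, 16)
fine = query_grid(sphere_occupancy, 64)
```

Only the number of query points changes between the two calls, which is why the stored encoding does not need to grow with the output resolution.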
Whilst these properties are impressive, we argue that further useful properties have been left on the table.
In many of the works making use of implicit methods, training is performed with the loss function targeted only at reconstruction accuracy.
This approach, whilst clearly effective, misses a significant potential benefit.
We argue that using a multi task loss, including loss terms related to common tasks such as classification, produces encodings that are
equally effective for reconstruction, but that still provide a richer set of features for use in other downstream tasks.
We suggest that in applications such as augmented reality, where efficient representations are very useful, the ability to encode more
than just shape information into the representation is likely to be useful.
Whilst a number of works have produced impressive, high quality reconstructions using a number of different approaches, there is a
general aim in the design of neural network encoders to produce features that have meaningful uses beyond a single task.
However, we observe that this aim has not yet translated to implicit representation works.
Many practical applications of implicit representations to real world problems would benefit from the ability to perform multiple
tasks using the same stored data, rather than having to render and re-encode before performing other downstream tasks.
In this paper, we examine the generation of more descriptive neural representation encodings.
Through experiments, we show that encodings generated purely for reconstruction can produce poorer results on other tasks.
Further, we show clearly that the encodings used in neural representations can be trained to develop properties useful for other
tasks common in computer vision, without any appreciable reduction in reconstruction performance.
Whilst this should be possible for any number of tasks or applications, such as texture generation or material property estimation, we examine
two common tasks in 3D, namely classification and segmentation.
We also argue that the conventional 3D semantic segmentation task does not translate well to implicit representations, where the object being
reconstructed is not, or cannot be, operated on directly.
To address this we propose, among other experiments, a re-formulation of the semantic segmentation task that is more representative
of the task when applied to implicit representations.
In summary, the key contributions of this paper are
A re-formulation of the semantic segmentation task, that is more representative of a real world task in the context of implicit representations.
II Related Work
Implicit, or neural, representations have been the subject of much recent work.
Early works primarily focused on single objects[mescheder2019occupancy, sitzmann2019scene, sitzmann2020metasdf, park2019deepsdf, michalkiewicz2019deep, genova2019learning, atzmon2020sal, gropp2020implicit, poursaeed2020coupling, xu2019disn],
encoding some input, often either image or point-cloud, and producing a feature vector which is used to condition the output network.
There have been different approaches to conditioning the network on this feature vector; however, they mainly fall into two categories:
i) concatenation/biasing (dumoulin2018feature-wise argue that biasing and concatenation are analogous) and ii) hyper-networks.
Further, the implementation of the implicit representation can be divided into two categories, namely occupancy generating functions or
signed distance function (SDF) generating functions.
In concatenation based conditioning, the encoding is concatenated with the point being queried and then passed through the network.
In hyper-networks, the encoding is passed through a small network, whose outputs are the weights used in the network that actually
predicts the value for a given query point.
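The two conditioning styles can be contrasted with a minimal NumPy sketch. All layer sizes, parameter names (`params`, `hyper`) and the random weights are hypothetical and chosen only for illustration; the decoders in the cited works are deeper and trained.

```python
import numpy as np

rng = np.random.default_rng(0)
CODE_DIM, HIDDEN = 8, 16  # hypothetical sizes


def concat_decoder(points, code, params):
    """Concatenation: the encoding is appended to every query point."""
    tiled = np.broadcast_to(code, (points.shape[0], code.shape[0]))
    x = np.concatenate([points, tiled], axis=1)  # (N, 3 + CODE_DIM)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return (h @ params["w2"] + params["b2"]).squeeze(-1)


def hyper_decoder(points, code, hyper):
    """Hyper-network: the encoding is mapped to the decoder's own weights."""
    w1 = (code @ hyper["to_w1"]).reshape(3, HIDDEN)
    b1 = code @ hyper["to_b1"]
    w2 = (code @ hyper["to_w2"]).reshape(HIDDEN, 1)
    h = np.maximum(points @ w1 + b1, 0.0)  # the decoder only sees the points
    return (h @ w2).squeeze(-1)


params = {
    "w1": rng.normal(size=(3 + CODE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "w2": rng.normal(size=(HIDDEN, 1)),
    "b2": np.zeros(1),
}
hyper = {
    "to_w1": rng.normal(size=(CODE_DIM, 3 * HIDDEN)),
    "to_b1": rng.normal(size=(CODE_DIM, HIDDEN)),
    "to_w2": rng.normal(size=(CODE_DIM, HIDDEN)),
}

points = rng.uniform(-0.5, 0.5, size=(5, 3))
code = rng.normal(size=CODE_DIM)
logits_concat = concat_decoder(points, code, params)
logits_hyper = hyper_decoder(points, code, hyper)
```

In the first function the encoding enters as extra input features; in the second it never reaches the decoder as an input and instead determines the weights applied to the query points.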
Early works such as [sitzmann2019scene, park2019deepsdf, mescheder2019occupancy] showed that simple MLP networks were capable of
representing complex distance functions and occupancy functions.
park2019deepsdf also detailed the use of auto-decoders to estimate optimal encodings for a given input, using a fixed decoder and simple backpropagation.
mescheder2019occupancy first demonstrated the alternative occupancy paradigm for implicit representations, as well as proposing
a procedure to extract high quality meshes in an efficient manner from the implicit representation, using an octree like approach.
michalkiewicz2019deep learn level sets to represent shapes.
We note that the level sets are equivalent to learning an unsigned distance function like atzmon2020sal.
poursaeed2020coupling combines both explicit atlas based reconstruction and implicit neural reconstruction, enforcing consistency
between the two methods.
Later works investigated larger scenes[chabra2020deep, Peng2020convoccupancy], however many of these methods did not
expand the size of the area described by a given embedding, instead proposing methods to recover encodings for a small
local region where a neural representation can then extract shape information.
Peng2020convoccupancy interpolated between encoded points in a volume or plane, to generate the conditioning vector for
an occupancy network in a region around the encoded point.
chibane2020implicit took a similar approach, adding also multiple resolutions of encoded volumes.
A separate group of works[jiang2020local, chabra2020deep, tretschk2020patchnets] all took a slightly different approach from above
and divided the scene into regions generating a small encoding for each region.
Both [jiang2020local] and [chabra2020deep] make use of a grid of small local encodings, whereas [tretschk2020patchnets]
make use of a number of oriented spherical patches of differing radii each with an encoding.
Further improvements to implicit representations in general were proposed by sitzmann2020implicit and tancik2020fourier
showing that adding higher frequency information to simple networks, drastically improved their ability to generate high quality
duan2020curriculum proposed a curriculum based learning approach for implicit representations, improving reconstruction quality
of complex local details.
Other works have used implicit representations for a number of other tasks, most notably novel view synthesis[martin2020nerf, mildenhall2020nerf, sitzmann2019deepvoxels].
There have also been a number of works considering the application of multi-task approaches to 3D problems[pham2019jsis3d, lahoud20193d, hassani2019unsupervised, liang2019multi].
Multi-task learning has enabled improvements where tasks are related or closely coupled, such as semantic and instance segmentation[lahoud20193d, pham2019jsis3d].
In addition, hassani2019unsupervised made use of a multi-task setup in unsupervised training to learn an embedding space
over 3D point-cloud inputs.
To our knowledge, only one other paper has investigated using implicit representations for tasks other than
However, their work focused only on the 2D domain (albeit using an internal 3D modality), prohibiting any meaningful comparison
with our method.
In this section, we first cover the principles of implicit representations. We then describe the other tasks we consider, including our variation to
the normal task of segmentation that we use in our experiments for a fairer representation of the segmentation task in an implicit context.
Finally we discuss our network architecture and training procedure.
III-A Implicit Representations
Neural implicit representations attempt to estimate the function describing the surface of a given object.
A common formulation is to map from a point, $\mathbf{p} \in \mathbb{R}^3$, in space to the smallest signed distance between the point and
the outer face of a surface, i.e. a SDF.
This gives rise to an expression[park2019deepsdf, xu2019disn] of the form
Another common formulation is to estimate the probability that a given point lies within the object (i.e. probability of occupancy), rather
than regressing the SDF directly.
This gives a function[mescheder2019occupancy] of the form
However, we note the following relationship
where $\tau$ is a threshold parameter.
This relationship suggests that the occupancy function is a simplified version of the SDF, moving the problem from a regression context
and into a classification context.
Further, this reformulation suggests that where the extra information provided by the SDF (but not the occupancy function) such as the
surface normal (given as $\nabla_{\mathbf{p}}\,\mathrm{SDF}(\mathbf{p})$, the spatial gradient of the SDF[park2019deepsdf]) is not needed,
the occupancy function is likely to be easier to learn.
If this is true, we suggest this is the result of the occupancy network only needing to learn a decision boundary over $\mathbb{R}^3$ rather than
having to learn both the boundary and then regress a point's distance from it.
Further, we note that papers using the SDF formulation do not actually measure the accuracy of the overall learnt SDF; instead, they make
use of metrics that compare rendered meshes with ground truth
[xu2019disn, park2019deepsdf, michalkiewicz2019deep, sitzmann2020metasdf].
The absence of metrics comparing the overall accuracy of the SDFs leaves open the possibility that the values of the implicit SDF are accurate
only near the boundary region.
Again, if this assertion is correct it would suggest that any difference in quality or output between SDF and occupancy formulations is minimal.
Given this and the simplicity of the method, we chose to use the formulation of mescheder2019occupancy for our experiments.
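For reference, the two formulations discussed in this section and the relationship between them can be sketched as follows. The notation is an assumption made for illustration and may differ from the authors' original symbols: $f_\theta$ denotes the conditioned network, $\mathbf{p}$ a query point, $\mathbf{z}$ the encoding, $s$ a signed distance, $o$ an occupancy probability and $\tau$ a threshold. The sign convention (non-positive distance inside the shape) is also assumed.

```latex
\begin{align}
  f_\theta^{\mathrm{SDF}}(\mathbf{p}, \mathbf{z}) &= s, & s &\in \mathbb{R}, \\
  f_\theta^{\mathrm{occ}}(\mathbf{p}, \mathbf{z}) &= o, & o &\in [0, 1], \\
  \mathrm{occ}(\mathbf{p}) &= \mathbb{1}\left[\, \mathrm{SDF}(\mathbf{p}) \le \tau \,\right].
\end{align}
```

Under these assumptions the occupancy target is a thresholded SDF, which is the sense in which the regression problem becomes a classification problem.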
III-B Other Tasks
The conventional 3D segmentation task as explored in a number of papers[qi2017pointnet, xie2020review] typically involves predicting a
semantic label for each point in a given input point-cloud.
However, in the context of implicit representations, this task loses much of its meaning,
particularly when the input is a degraded and noisy point-cloud (see Sec. IV).
As we are considering the occupancy of a given spatial location, it makes more sense to consider the task as determining the semantic
label of regions within the shape.
This scheme has the same effect as producing a Voronoi partitioning of the space inside the mesh.
Points that lie outside the mesh are considered to be background and therefore have no valid semantic class.
The segmentation task then becomes predicting the label of a point inside the shape according to the nearest neighbour assignment.
During both training and inference, we evaluate the semantic label task at the same locations as the reconstruction task.
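A minimal sketch of the nearest neighbour assignment described above is given below. The function name `assign_labels`, the `BACKGROUND` value and the brute-force distance computation are assumptions for illustration; a spatial index would be preferable for large point sets.

```python
import numpy as np

BACKGROUND = -1  # hypothetical marker for points outside the shape


def assign_labels(query_points, occupancy, labelled_points, labels):
    """Give each occupied query point the label of its nearest labelled point."""
    # Pairwise squared distances, shape (num_queries, num_labelled_points).
    diff = query_points[:, None, :] - labelled_points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    nearest = np.argmin(sq_dist, axis=1)
    # Points outside the shape have no valid semantic class.
    return np.where(occupancy, labels[nearest], BACKGROUND)


labelled_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
labels = np.array([0, 1])
query_points = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0], [5.0, 5.0, 5.0]])
occupancy = np.array([True, True, False])
query_labels = assign_labels(query_points, occupancy, labelled_points, labels)
```

Because every interior location inherits the label of its closest labelled point, the interior is partitioned into Voronoi-like regions, one set of regions per part.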
As well as segmentation, we also investigate the performance of our approach to the task of classification.
Unlike the segmentation task which requires the implicit code to encode information about the properties of spatial regions (similarly also with the
reconstruction task), classification requires that the encodings allow simple classification networks to discriminate between them.
Later experiments (see Sec. V-B and Sec. V-C) show that the requirements classification has for the encodings are
noticeably different to segmentation and reconstruction.
Our results show that implicit representations can be encouraged to be more representative of objects, rather than merely encoding their shape.
We focus on two common tasks, but expect that generalising the encodings over further tasks is also likely to be possible.
An overview of our network architecture is shown in Fig. 2.
The encoder of our network takes either point-clouds or images as input.
Throughout all the following experiments, we use the same two encoders: one for point-cloud input, and another for image-based input.
We use the same variation on the original network from [qi2017pointnet] as mescheder2019occupancy.
In this formulation, the fully connected (FC) layers normally present in the original network are replaced by residual FC
During training the network samples 300 points from the input point cloud and applies Gaussian noise ($\sigma = 0.05$) before passing these into the encoder (identically to [mescheder2019occupancy]).
We use a pre-trained ResNet-18[he2016deep], followed by a linear layer to reduce the output dimension following
The encoded features are then passed to a decoder.
For decoding point locations into either occupancy values or semantic labels we use one or more of the following, depending on
For classification, the encoding is passed directly to the classifier.
The occupancy decoder is the same decoder used in [mescheder2019occupancy]. The network takes a number of points
as input and uses conditional batchnorms[de2017modulating], which take the encoding as their
input, to condition the network.
The segmentation decoder is the same network as the occupancy decoder, but with a larger output channel dimension.
Joint Segmentation and Occupancy Decoder
This is also the same network as the occupancy decoder; however, rather than using a separate network for each task, a single network performs both tasks.
The output is then sliced along the channel dimension to yield two tensors, one containing the occupancy probability, and another
containing the semantic label probabilities.
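The slicing step of the joint decoder can be sketched as follows. The channel layout (occupancy in channel 0, part classes in the remaining channels) and the use of a sigmoid and a softmax to obtain probabilities are assumptions for illustration, as is the function name `split_joint_output`.

```python
import numpy as np


def split_joint_output(logits):
    """Slice (N, 1 + num_parts) logits into occupancy and part probabilities."""
    occ_logits = logits[:, 0]   # assumed: first channel is occupancy
    seg_logits = logits[:, 1:]  # assumed: remaining channels are part classes
    occ_prob = 1.0 / (1.0 + np.exp(-occ_logits))  # sigmoid
    # Numerically stable softmax over the part channels.
    shifted = seg_logits - seg_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    seg_prob = exp / exp.sum(axis=1, keepdims=True)
    return occ_prob, seg_prob


joint_logits = np.array([[0.0, 1.0, 1.0], [2.0, 0.0, 3.0]])
occ_prob, seg_prob = split_joint_output(joint_logits)
```

A single forward pass therefore yields both outputs for every query point, so the two tasks share all decoder computation up to the final layer.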
The loss functions used depend on the task.
For the reconstruction loss, $\mathcal{L}_{\mathrm{rec}}$, we use binary cross entropy as in [mescheder2019occupancy].
Both classification, $\mathcal{L}_{\mathrm{cls}}$, and segmentation, $\mathcal{L}_{\mathrm{seg}}$, use the cross entropy loss.
In the multi task settings, the losses were combined in a weighted linear fashion as
for all experiments .
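A weighted linear combination of task losses can be sketched as below. The helper names and the unit weights are hypothetical and used only to make the example runnable; the weights used in the experiments are not reproduced here.

```python
import numpy as np


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross entropy between occupancy probabilities and targets."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1 - target) * np.log(1 - prob))))


def cross_entropy(probs, labels, eps=1e-7):
    """Mean cross entropy given class probabilities and integer labels."""
    picked = probs[np.arange(labels.shape[0]), labels]
    return float(np.mean(-np.log(np.clip(picked, eps, 1.0))))


def multi_task_loss(losses, weights):
    """Weighted linear combination of the per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())


losses = {
    "reconstruction": binary_cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0])),
    "classification": cross_entropy(np.array([[0.7, 0.3]]), np.array([0])),
}
weights = {"reconstruction": 1.0, "classification": 1.0}  # hypothetical values
total = multi_task_loss(losses, weights)
```

Dropping a task from the `losses` dictionary recovers the corresponding single-task or two-task setting without changing the rest of the training loop.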
We use an ADAM optimiser with learning rate of .
Training for joint tasks takes approximately 4 days on an NVIDIA GeForce GTX 1080Ti.
Experiments are divided into three parts.
First we consider the original dataset from [mescheder2019occupancy], establishing a baseline and some preliminary experiments involving
reconstruction and classification.
Next we examine a more challenging classification task, before turning finally to a semantic segmentation task.
We perform our experiments on a number of datasets. The original dataset from mescheder2019occupancy is the subset of
ShapeNetCore[shapenet2015] from choy20163d. We also make use of ModelNet40[wu2015modelnet] for further classification
experiments and ShapeNetPart[yi2016scalable] for our segmentation experiments. Data pre-processing pipelines were accelerated with
We limit our experiments to datasets with similar properties to those used in [mescheder2019occupancy], as we are not seeking to validate
the specific implicit representation format we are using, rather the benefits of more feature rich encodings.
This means that we do not consider larger scale datasets such as Stanford3D[armeni20163d] that our chosen method might struggle with.
We leave this to future experiments with other methods such as [Peng2020convoccupancy] or [chabra2020deep] that are better able to
reconstruct larger scenes.
For all experiments, the properties of the inputs remain constant.
For point-clouds we sample 300 points from the ground truth point-cloud, and apply noise using a Gaussian distribution with zero mean
and standard deviation 0.05 to the sampled point clouds, identically to [mescheder2019occupancy].
For images we crop and resize the images identically to [mescheder2019occupancy].
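The point-cloud degradation described above can be sketched as follows. The function name and the choice to sample without replacement are assumptions; the point count and noise level are those stated in the text.

```python
import numpy as np


def sample_noisy_points(point_cloud, num_points=300, noise_std=0.05, rng=None):
    """Subsample a point cloud and perturb it with zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumed: sampling without replacement, so the cloud needs >= num_points points.
    idx = rng.choice(point_cloud.shape[0], size=num_points, replace=False)
    sampled = point_cloud[idx]
    return sampled + rng.normal(0.0, noise_std, size=sampled.shape)
```

For objects normalised to roughly unit size, a standard deviation of 0.05 is a substantial perturbation, which is why the input is described as degraded and noisy.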
Choy / ShapeNetCore
The dataset used in mescheder2019occupancy, on which our work builds, uses the renderings and voxelisations[choy20163d]
of a subset of the ShapeNetCore[shapenet2015] dataset. We use the rendered images to train the image based encoder in later
The fully processed dataset was provided by [mescheder2019occupancy] as part of their publication.
Briefly, meshes are loaded and a large number of depth images are rendered.
These depth images are fused to form a watertight mesh from which points and their corresponding occupancy value can be sampled.
Although the occupancy samples are not provided as part of the dataset in[choy20163d], to reduce ambiguity we will refer to the dataset
from [mescheder2019occupancy] as the Choy dataset throughout this paper.
The dataset consists of 30,648 training meshes, 4,358 validation meshes and 3,738 test meshes across 13 object categories.
We use the Choy dataset both for our baseline experiments, as well as some preliminary classification experiments.
Whilst this dataset only contains 13 separate classes, it is sufficient for preliminary experiments to validate our hypothesis.
Our experiments with this dataset are outlined in Sec. V-A.
For further classification experiments, we make use of the popular ModelNet40[wu2015modelnet] dataset.
As rendered images were not readily available, we rendered images using Pyrender[matl2020pyrender] in the same fashion as [choy20163d],
choosing 24 viewpoints with constant radius and altitude, but random azimuth.
The occupancy samples are generated with the code provided by [mescheder2019occupancy].
The dataset consists of 9,843 training meshes and 2,468 testing meshes across 40 object categories.
Our experiments with this dataset are outlined in Sec. V-B.
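The viewpoint selection used for the ModelNet40 renderings (24 views, constant radius and altitude, random azimuth) can be sketched as below. The radius and altitude values, the z-up convention and the function name are hypothetical; the sketch only produces camera positions and does not call any renderer.

```python
import numpy as np


def sample_camera_positions(num_views=24, radius=2.0, altitude_deg=30.0, rng=None):
    """Camera positions on a circle of fixed radius and altitude, random azimuth."""
    rng = np.random.default_rng() if rng is None else rng
    azimuth = rng.uniform(0.0, 2.0 * np.pi, size=num_views)
    altitude = np.deg2rad(altitude_deg)
    x = radius * np.cos(altitude) * np.cos(azimuth)
    y = radius * np.cos(altitude) * np.sin(azimuth)
    z = np.full(num_views, radius * np.sin(altitude))  # assumed z-up
    return np.stack([x, y, z], axis=1)
```

Each position would then be turned into a look-at camera pose aimed at the object centre before rendering.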
For our semantic segmentation experiments, we make use of the dataset from yi2016scalable, which we refer to as ShapeNetPart.
Again the occupancy samples were generated using the code from [mescheder2019occupancy].
Semantic labels were assigned to the occupancy samples using a simple nearest neighbour assignment from the ground truth semantic labels
The dataset consists of 12,121 training, 1,854 validation, and 2,858 testing meshes following the corresponding splits from ShapeNetCore.
Our experiments with this dataset are outlined in Sec. V-C.
V-A Choy Experiments
We begin with the dataset from the original paper.
Our experiments with point-cloud input are shown in Table I.
Given the small number of classes and fairly unique visual properties of the classes in this dataset, the high accuracy in classification is
not unexpected, even with the reduced quality of the input point-clouds.
To evaluate the performance of the baseline encoder, we fix the encoder and train a simple classifier on the output.
This classifier shows a substantial reduction in accuracy, compared to the jointly trained classification and reconstruction results, where the
full accuracy on both tasks was recovered.
For the jointly trained experiment, the encoder was not fixed.
|Classification w/ ONet encoder||—||—||0.80|
|Joint Classification & ONet||0.77||0.0084||0.92|
ONet encoder, the encoder was fixed to allow for the classification performance of the encodings themselves to be evaluated.
Our experiments with image input are shown in Table II.
The results are similar to the point-cloud experiments. As discussed in [mescheder2019occupancy], the lower performance in reconstruction for
the ONet can potentially be attributed to occlusion.
We do not include the “Classification with ONet encoder” experiment, as the encodings from the pre-trained ResNet are likely already effective for
classification, meaning this experiment is not likely to provide any new insight.
The joint training result shows that the encoding is capable of performing both tasks without loss of accuracy.
V-B ModelNet40 Experiments
To better evaluate the classification performance, as well as the shortcomings of the reconstruction encodings in classification, we run the same
experiments as in Sec. V-A on ModelNet40, a more conventional 3D classification benchmark.
Our experiments with point-cloud input are shown in Table III.
The results follow a similar pattern to the point-cloud results from the Choy dataset.
As we expected, when we train the classifier using the fixed encoder from the reconstruction task, the classification performance is poor.
This reduction in performance is much more severe than on the Choy dataset, but is consistent with the increased difficulty shown by the lower
accuracy figure on the classification baseline.
However, this performance loss is completely recovered in the joint training, with only a minor decrease in reconstruction performance.
|Classification w/ ONet encoder||—||—||0.57|
|Joint Classification & ONet||0.70||0.012||0.82|
encoder for the classification with ONet encoder was fixed.
Our experiments with image input are shown in Table IV.
Here we see that the joint training is able to recover much of the performance on either of the single tasks.
Again, because of the nature of the pre-trained ResNet, we do not include the fixed encoder task.
V-C ShapeNetPart Experiments
Our metric for the segmentation task is mean Intersection over Union (mIOU).
Points are sampled within the shape and assigned semantic labels by the decoder.
The same sample points are used for both segmentation and reconstruction.
Whilst in a real world scenario points would be sampled both inside and outside the shape, we wish to assess the performance of the segmentation
decoder independently of the reconstruction performance, and so only consider points inside the shape.
The IOU is computed for each part in each shape, and averaged to give a shape IOU.
If there are no ground truth points for a given part (e.g. whilst armrest is a part of the chair class, many of the chair instances do not have arms),
then the part is automatically assigned an IOU of 1.
We can then compute mIOU as the average of the shape IOUs.
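The metric described above can be sketched as follows. The function names and the toy labels are assumptions for illustration; the rule that a part with no ground truth points receives an IOU of 1 follows the text.

```python
import numpy as np


def shape_iou(pred, target, part_ids):
    """Average the per-part IOU over all parts of the shape's category."""
    ious = []
    for part in part_ids:
        gt_mask = target == part
        if not gt_mask.any():
            ious.append(1.0)  # part absent from the ground truth
            continue
        pred_mask = pred == part
        intersection = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))


def mean_iou(preds, targets, part_ids_per_shape):
    """mIOU: the average of the shape IOUs."""
    shape_ious = [shape_iou(p, t, ids) for p, t, ids in zip(preds, targets, part_ids_per_shape)]
    return float(np.mean(shape_ious))


# Toy example: two shapes from a category with three possible parts.
targets = [np.array([0, 0, 1, 1]), np.array([0, 1, 2, 2])]
preds = [np.array([0, 1, 1, 1]), np.array([0, 1, 2, 2])]
miou = mean_iou(preds, targets, [[0, 1, 2], [0, 1, 2]])
```

Note that assigning an IOU of 1 to absent parts raises the score of shapes that lack optional parts, so per-category results are easier to interpret than a single aggregate.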
At inference time, points are sampled randomly from a padded bounding box of the ground truth object, as in [mescheder2019occupancy].
Table V shows the reconstruction accuracy, mIOU and classification accuracy of our different experiments on the ShapeNetPart dataset.
The results show little to no accuracy being lost in any of the tasks for the jointly trained settings.
Unlike in Table III with the fixed encoder, segmentation with a fixed ONet encoder does not show significantly worse performance than
the baseline task.
We suggest that this might be due to similarities between the reconstruction task and our modified segmentation task.
In the reconstruction task, the network is attempting to learn an encoding that represents the shape properties of a given region of space, such as
the curvature and boundaries.
These properties are likely also useful for the task of segmentation, i.e. the semantic class probabilities are potentially dependent on properties like
Table V shows the per-class segmentation results for the baseline, fixed encoder and joint training as well as the reconstruction IOU
for the baseline.
The poor performance on some of the classes such as rocket and headphones may be explained by the thin sections in parts of those objects.
Because the network samples points within the shapes randomly, thin sections like the fins (rocket), cable (earphones), or handlebar (motorbike) are
likely to be undersampled and therefore have poor performance at inference time (see Fig. 3).
As well as this imbalance, there is also significant imbalance in the number of models in certain categories which can negatively affect accuracy at
This is reflected in the higher mIOU scores across all the experiments, for the classes with more shapes.
Fig. 4 shows some selected qualitative segmentation results.
The segmentation decoders show good results for bulk areas like the wings and body on the aeroplane (1st row) and simple objects like the
However, areas like the chair arms (3rd row) present more of a challenge.
We can also see another example of the low performance of thin sections on the roofs of the cars (2nd row), particularly so for the right hand car.
Fig. 3 shows some of the failure cases of the segmentation decoder.
A particularly extreme case (2nd row) is shown where the correct semantic labels are completely inverted.
be under-sampled and therefore confuse the network. The cable on the earphones may also suffer from this problem. The earphones present a complete
failure with semantic classes completely inverted.
In this paper we have discussed the potential to generalise the encodings used by implicit representations to a broader range of tasks.
We discuss the current narrow focus of implicit representations, and the potential issues this raises for applications of implicit representations
in the real world.
We also introduce a modified formulation of the conventional segmentation task that is more applicable to implicit contexts, and detail an appropriate
network to use for this new formulation.
We choose two common computer vision tasks and demonstrate that, through multi-task training, we can enrich the encodings, achieving strong performance
across the tasks without any appreciable loss in reconstruction accuracy, also showing how certain encodings can struggle with some tasks but not others.
- Arguably occupancy networks are simply SDF networks with the sign function applied to their output, however this ignores
the increased complexity in regressing SDF values rather than simply their sign. We argue this point in more detail in Sec. III-A.