
Towards Generalising Neural Implicit Representations


Abstract

Neural implicit representations have shown substantial improvements in efficiently storing 3D data when compared to conventional formats.
However, the focus of existing work has mainly been on storage and subsequent reconstruction.
In this work, we argue that training neural representations for reconstruction alongside conventional tasks can produce more general
encodings that admit reconstructions of equal quality to single-task training, whilst providing improved results on conventional tasks when compared to single-task encodings.
Through multi-task experiments on reconstruction, classification, and segmentation, our approach learns feature-rich encodings that produce high
quality results for each task.
We also reformulate the segmentation task, creating a more representative challenge for implicit representation contexts.

I Introduction

Implicit neural representations have garnered significant interest recently for their ability to reconstruct complex 3D structures and
shapes.
The appeal of these methods stems from a number of useful properties they possess for both reconstructing 3D shapes and storing them efficiently.
By learning the reconstruction of the shapes, networks are able to encode and use a rich set of priors over the 3D domain to improve
the quality of the reconstructions over what can be achieved with classical methods[mescheder2019occupancy, park2019deepsdf].
Efficiency in storage is achieved by decoupling the encoding from the input and output modality so, unlike voxel-based representations,
the storage requirements do not grow cubically with the output resolution.
Further, implicit representations do not suffer from the limitations of mesh and point-cloud based representations, where the quality
of the reconstruction is typically limited by the output size constraints of a single feed forward pass[mescheder2019occupancy].

Conditioning a “decoder” network on an encoded representation of the input data, neural representations query the network at sample
point locations for occupancy or distance function information.
This approach allows for reconstructions to be generated with arbitrary resolutions at run-time[mescheder2019occupancy].

Fig. 1: Through multi-task training, implicit representations can be enriched, creating a more general representation of a shape or object and allowing their use in a number of tasks rather than simply reconstruction.

Whilst these properties are impressive, we argue that further useful properties have been left on the table.
In many of the works making use of implicit methods, training is performed with the loss function targeted only at reconstruction accuracy.
This approach, whilst clearly effective, misses a significant potential benefit.
We argue that using a multi-task loss, including loss terms related to common tasks such as classification, produces encodings that are
equally effective for reconstruction, but that still provide a richer set of features for use in other downstream tasks.
We suggest that in applications such as augmented reality, where efficient representations are very useful, the ability to encode more
than just shape information into the representation is likely to be valuable.

Whilst a number of works have produced impressive, high quality reconstructions using a number of different approaches, there is a
general aim in the design of neural network encoders to produce features that have meaningful uses beyond a single task.
However, we observe that this aim has not yet translated to implicit representation works.
Many practical applications of implicit representations to real world problems would benefit from the ability to perform multiple
tasks using the same stored data, rather than having to render and re-encode before performing other downstream tasks.

In this paper, we examine the generation of more descriptive neural representation encodings.
Through experiments, we show that encodings generated purely for reconstruction can produce poorer results on other tasks.
Further, we show clearly that the encodings used in neural representations can be trained to develop properties useful for other
tasks common in computer vision, without any appreciable reduction in reconstruction performance.
Whilst this should be possible for any number of tasks or applications, such as texture generation or material property estimation, we examine
two common tasks in 3D, namely classification and segmentation.
We also argue that the conventional 3D semantic segmentation task does not translate well to implicit representations, where the object being
reconstructed is not, or cannot be, operated on directly.
To address this we propose, among other experiments, a re-formulation of the semantic segmentation task that is more representative of
segmentation when applied to implicit representations.

In summary, the key contributions of this paper are:

  • A simple extension to existing implicit representation approaches allowing the simultaneous training of reconstruction, segmentation and
    classification in a multi-task fashion.

  • Experiments showing implicit encodings can admit substantially improved performance on common computer vision tasks, without any
    compromise to reconstruction accuracy.

  • A re-formulation of the semantic segmentation task that is more representative of a real world task in the context of implicit representations.

II Related Work

Implicit or Neural representations have been the subject of much recent work.
Early works primarily focused on single objects[mescheder2019occupancy, sitzmann2019scene, sitzmann2020metasdf, park2019deepsdf, michalkiewicz2019deep, genova2019learning, atzmon2020sal, gropp2020implicit, poursaeed2020coupling, xu2019disn],
encoding some input, often either image or point-cloud, and producing a feature vector which is used to condition the output network.
There have been different approaches to conditioning the network on the encoding; however, they mainly fall into two categories:
i) concatenation/biasing (dumoulin2018feature-wise argue that biasing and concatenation are analogous) and ii) hyper-networks.
Further, the implementation of the implicit representation can be divided into two categories, namely occupancy-generating functions or
signed distance function (SDF) generating functions¹.

In concatenation-based conditioning, the encoding is concatenated with the point being queried and then passed through the network.
In hyper-networks, the encoding is passed through a small network, whose outputs are the weights used in the network that actually
predicts the value for a given query point.
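To make the two conditioning styles concrete, the following is a minimal PyTorch sketch; the module names, layer sizes, and the single-layer hyper-network are illustrative assumptions, not details taken from any of the cited implementations.

```python
import torch
import torch.nn as nn

class ConcatDecoder(nn.Module):
    """Concatenation-based conditioning: the encoding z is appended to each
    query point p and the joint vector is fed through an MLP."""
    def __init__(self, z_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p, z):            # p: (B, N, 3), z: (B, z_dim)
        z = z.unsqueeze(1).expand(-1, p.shape[1], -1)
        return self.net(torch.cat([p, z], dim=-1)).squeeze(-1)

class HyperDecoder(nn.Module):
    """Hyper-network conditioning: a small network maps z to the weights of
    the (here, single) layer that actually evaluates the query point."""
    def __init__(self, z_dim=256, hidden=64):
        super().__init__()
        self.hyper = nn.Linear(z_dim, hidden * 3 + hidden)  # weights + biases
        self.head = nn.Linear(hidden, 1)
        self.hidden = hidden

    def forward(self, p, z):            # p: (B, N, 3), z: (B, z_dim)
        params = self.hyper(z)
        W = params[:, : self.hidden * 3].view(-1, self.hidden, 3)
        b = params[:, self.hidden * 3:]
        # per-sample linear layer generated from the encoding
        h = torch.relu(torch.einsum('bhc,bnc->bnh', W, p) + b.unsqueeze(1))
        return self.head(h).squeeze(-1)
```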

Early works such as [sitzmann2019scene, park2019deepsdf, mescheder2019occupancy] showed that simple MLP networks were capable of
representing complex distance functions and occupancy functions.
park2019deepsdf also detailed the use of auto-decoders to estimate optimal encodings for a given input, using a fixed decoder and simple backpropagation.
mescheder2019occupancy first demonstrated the alternative occupancy paradigm for implicit representations, as well as proposing
a procedure to extract high quality meshes in an efficient manner from the implicit representation, using an octree like approach.
michalkiewicz2019deep learn level sets to represent shapes.
We note that the level sets are equivalent to learning an unsigned distance function like atzmon2020sal.
poursaeed2020coupling combines explicit atlas-based reconstruction with implicit neural reconstruction, enforcing consistency
between the two methods.

Later works investigated larger scenes[chabra2020deep, Peng2020convoccupancy], however many of these methods did not
expand the size of the area described by a given embedding, instead proposing methods to recover encodings for a small
local region where a neural representation can then extract shape information.
Peng2020convoccupancy interpolated between encoded points in a volume or plane, to generate the conditioning vector for
an occupancy network in a region around the encoded point.
chibane2020implicit took a similar approach, also adding multiple resolutions of encoded volumes.
A separate group of works[jiang2020local, chabra2020deep, tretschk2020patchnets] all took a slightly different approach from above
and divided the scene into regions generating a small encoding for each region.
Both [jiang2020local] and [chabra2020deep] make use of a grid of small local encodings, whereas [tretschk2020patchnets]
make use of a number of oriented spherical patches of differing radii each with an encoding.

Further improvements to implicit representations in general were proposed by sitzmann2020implicit and tancik2020fourier,
showing that adding higher frequency information to simple networks drastically improved their ability to generate high quality
reconstructions.
duan2020curriculum proposed a curriculum based learning approach for implicit representations, improving reconstruction quality
of complex local details.

Other works have used implicit representations for a number of other tasks, most notably novel view synthesis[martin2020nerf, mildenhall2020nerf, sitzmann2019deepvoxels].

There have also been a number of works considering the application of multi-task approaches to 3D problems[pham2019jsis3d, lahoud20193d, hassani2019unsupervised, liang2019multi].
Multi-task learning has enabled improvements where tasks are related or closely coupled, such as semantic and instance segmentation[lahoud20193d, pham2019jsis3d].
Similarly, hassani2019unsupervised made use of a multi-task setup in unsupervised training to learn an embedding space
over 3D point-cloud inputs.

To our knowledge, only one other paper has investigated using implicit representations for tasks other than
reconstruction[kohli2020inferring].
However, their work focused only on the 2D domain (albeit using an internal 3D modality), prohibiting any meaningful comparison
with our method.

III Method

In this section, we first cover the principles of implicit representations. We then describe the other tasks we consider, including the
variation of the conventional segmentation task that we use in our experiments to give a fairer representation of segmentation in an implicit context.
Finally we discuss our network architecture and training procedure.

Fig. 2: An overview of our network architecture. The network takes as input either images or point-clouds, generating an encoding from them. This
encoding can then be used in a number of ways. For classification, the encoding is passed directly into a simple classifier. For segmentation and
reconstruction, the encoding is used to condition the decoder networks. The decoder networks take a number of points as input and return, for each point,
either the probability that the point lies inside the encoded shape or semantic label probabilities.

III-A Implicit Representations

Neural implicit representations attempt to estimate the function describing the surface of a given object.
A common formulation is to map from a point, $\mathbf{x} \in \mathbb{R}^3$, in space to the smallest signed distance between the point and
the outer face of a surface, i.e. a SDF.
This gives rise to an expression[park2019deepsdf, xu2019disn] of the form

$f_\theta(\mathbf{x}) = s, \quad \mathbf{x} \in \mathbb{R}^3, \; s \in \mathbb{R}.$

Another common formulation is to estimate the probability that a given point lies within the object (i.e. probability of occupancy), rather
than regressing the SDF directly.
This gives a function[mescheder2019occupancy] of the form

$o_\theta(\mathbf{x}) = p, \quad \mathbf{x} \in \mathbb{R}^3, \; p \in [0, 1].$

However, we note the following relationship

$o_\theta(\mathbf{x}) = \mathbb{1}\!\left[ f_\theta(\mathbf{x}) \leq \tau \right], \qquad (1)$

where $\tau$ is a threshold parameter.
This relationship suggests that the occupancy function is a simplified version of the SDF, moving the problem from a regression context
and into a classification context.
Further, this reformulation suggests that where the extra information provided by the SDF (but not the occupancy function), such as the
surface normal (given as $\nabla_{\mathbf{x}} f_\theta(\mathbf{x})$, the spatial gradient of the SDF[park2019deepsdf]), is not needed,
the occupancy function is likely to be easier to learn.
If this is true, we suggest this is the result of the occupancy network only needing to learn a decision boundary over the input space, rather than
having to learn both the boundary and then regress a point's distance from it.
Further, we note that papers using the SDF formulation do not actually measure the accuracy of the overall learnt SDF; instead, they make
use of metrics that compare rendered meshes with ground truth
[xu2019disn, park2019deepsdf, michalkiewicz2019deep, sitzmann2020metasdf].
The absence of metrics comparing the overall accuracy of the SDFs leaves open the possibility that the values of the implicit SDF are accurate
only near the boundary region.
Again, if this assertion is correct it would suggest that any difference in quality or output between SDF and occupancy formulations is minimal.
Given this and the simplicity of the method, we chose to use the formulation of mescheder2019occupancy for our experiments.
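As a concrete illustration of Eq. (1), the following minimal sketch converts signed-distance samples into binary occupancy labels; the sign convention (negative inside the surface) and the threshold value are assumptions made for illustration.

```python
import numpy as np

def occupancy_from_sdf(sdf_values, tau=0.0):
    """Collapse signed-distance samples into binary occupancy labels.

    Illustrative sketch of Eq. (1): a point is treated as occupied when its
    signed distance is at or below the threshold tau, under the convention
    that the SDF is negative inside the surface.
    """
    return (sdf_values <= tau).astype(np.float32)

# Example: points well inside, near, and outside the surface.
sdf = np.array([-0.4, -0.01, 0.02, 0.5])
print(occupancy_from_sdf(sdf))  # -> [1. 1. 0. 0.]
```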

III-B Other Tasks

The conventional 3D segmentation task as explored in a number of papers[qi2017pointnet, xie2020review] typically involves predicting a
semantic label for each point in a given input point-cloud.
However, in the context of implicit representations, this task loses much of its meaning, particularly when the input is a degraded and
noisy point-cloud (see Sec. IV).
As we are considering the occupancy of a given spatial location, it makes more sense to consider the task as determining the semantic
label of regions within the shape.

Hence, given a mesh $\mathcal{M}$, in which each vertex $v_i$ carries a semantic label $l_i$, the semantic class of any location
$\mathbf{x}$ lying inside the mesh is the semantic class of the nearest vertex of $\mathcal{M}$.

This scheme has the same effect as producing a Voronoi partitioning of the space inside the mesh.
Points that lie outside the mesh are considered to be background and therefore have no valid semantic class.
The segmentation task then becomes predicting the label of a point inside the shape according to the nearest neighbour assignment.
During both training and inference, we evaluate the semantic label task at the same locations as the reconstruction task.
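A minimal sketch of this labelling scheme is given below; the use of scipy's k-d tree and the helper name are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def label_interior_points(points, inside_mask, vertices, vertex_labels,
                          background_label=-1):
    """Assign semantic labels to occupancy sample points.

    Points inside the mesh take the label of the nearest mesh vertex
    (a Voronoi partition of the interior); points outside are background.
    """
    labels = np.full(len(points), background_label, dtype=np.int64)
    tree = cKDTree(vertices)                      # nearest-vertex queries
    _, nn_idx = tree.query(points[inside_mask])   # index of closest vertex
    labels[inside_mask] = vertex_labels[nn_idx]
    return labels
```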

As well as segmentation, we also investigate the performance of our approach to the task of classification.
Unlike the segmentation task, which requires the implicit code to encode information about the properties of spatial regions (as does the
reconstruction task), classification requires that the encodings allow a simple classification network to discriminate between object classes.
Later experiments (see Sec. V-B and Sec. V-C) show that the requirements classification places on the encodings are
noticeably different from those of segmentation and reconstruction.

Our results show that implicit representations can be encouraged to be more representative of objects, rather than merely encoding their shape.
We focus on two common tasks, but expect that generalising the encodings over further tasks is likely to also be possible.

III-C Architecture

An overview of our network architecture is shown in Fig. 2.
Our network takes either point-clouds or images as input to the encoder.
Throughout all the following experiments, we use the same two encoders: one for point-cloud input and another for image-based input.


Point-cloud input
We use the same variation on the original network from [qi2017pointnet] as mescheder2019occupancy.
In this formulation, the fully connected (FC) layers normally present in the original network are replaced by residual FC
blocks[he2016deep].
During training, the network samples 300 points from the input point-cloud and applies Gaussian noise (zero mean, standard deviation 0.05; see Sec. IV-A) before passing these into the encoder, identically to [mescheder2019occupancy] (a sketch follows the encoder descriptions below).

Image input
We use a pre-trained ResNet-18[he2016deep], followed by a linear layer to reduce the output dimension following
[mescheder2019occupancy].
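The point-cloud input pipeline described above can be sketched in a few lines; the function name is hypothetical and the noise level is the one given in Sec. IV-A.

```python
import numpy as np

def make_pointcloud_input(points, n_samples=300, noise_std=0.05, rng=None):
    """Sub-sample the ground-truth point-cloud and corrupt it with
    zero-mean Gaussian noise before it is fed to the point-cloud encoder."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(points), size=n_samples, replace=False)
    sampled = points[idx]
    return sampled + rng.normal(0.0, noise_std, size=sampled.shape)
```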


The encoded features are then passed to a decoder.
For decoding point locations into either occupancy values or semantic labels we use one or more of the following, depending on
the task(s).
For classification, the encoding is passed directly to the classifier.


Occupancy Decoder
This is the same decoder used in [mescheder2019occupancy]. The network takes a number of points
as input and uses conditional batchnorms[de2017modulating], which take the encoding as their
input, to condition the network.

Classifier
A simple 2-layer MLP that takes the encoding directly as input and returns class probabilities.

Segmentation Decoder
The same network as the occupancy decoder but with a larger output channel dimension.

Joint Segmentation and Occupancy Decoder
Also the same network as the occupancy decoder; however, rather than two separate networks for each task, the same network performs both tasks
simultaneously.
The output is then sliced along the channel dimension to yield two tensors, one containing the occupancy probability and another
containing the semantic label probabilities (see the sketch below).
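A minimal sketch of the channel-wise split performed by the joint decoder; the channel layout (one occupancy channel followed by the class channels) and the use of logits are assumptions for illustration.

```python
import torch

def split_joint_output(decoder_out, n_classes):
    """Split the joint decoder output along the channel dimension.

    decoder_out: (B, N, 1 + n_classes) logits for N query points.
    Returns occupancy probabilities (B, N) and per-point semantic
    label probabilities (B, N, n_classes).
    """
    occ_logits = decoder_out[..., 0]
    seg_logits = decoder_out[..., 1:1 + n_classes]
    occupancy_prob = torch.sigmoid(occ_logits)
    label_prob = torch.softmax(seg_logits, dim=-1)
    return occupancy_prob, label_prob
```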


The loss functions used depend on the task.
For the reconstruction loss, $\mathcal{L}_{rec}$, we use binary cross entropy as in [mescheder2019occupancy].
Both classification, $\mathcal{L}_{cls}$, and segmentation, $\mathcal{L}_{seg}$, use the cross entropy loss.
In the multi-task settings, the losses were combined in a weighted linear fashion as

$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{seg}\,\mathcal{L}_{seg},$

with the same weights used for all experiments.
We use an ADAM optimiser.
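A minimal PyTorch sketch of such a weighted linear combination of losses; the function signature and the default weight values are placeholders, not the settings used in the experiments.

```python
import torch.nn.functional as F

def multitask_loss(occ_logits, occ_target,
                   cls_logits=None, cls_target=None,
                   seg_logits=None, seg_target=None,
                   w_rec=1.0, w_cls=1.0, w_seg=1.0):
    """Weighted linear combination of the task losses.

    Reconstruction uses binary cross entropy on occupancy predictions;
    classification and segmentation use cross entropy. The weights are
    placeholders, not the values used in the paper's experiments.
    """
    loss = w_rec * F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    if cls_logits is not None:
        loss = loss + w_cls * F.cross_entropy(cls_logits, cls_target)
    if seg_logits is not None:
        # per-point labels: flatten (B, N, C) logits and (B, N) targets
        loss = loss + w_seg * F.cross_entropy(
            seg_logits.reshape(-1, seg_logits.shape[-1]),
            seg_target.reshape(-1))
    return loss
```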

Training for joint tasks takes approximately 4 days on an NVIDIA GeForce GTX 1080Ti.

IV Experiments

Experiments are divided into three parts.
First we consider the original dataset from [mescheder2019occupancy], establishing a baseline and performing some preliminary experiments involving
reconstruction and classification.
Next we examine a more challenging classification task, before turning finally to a semantic segmentation task.

IV-A Datasets

We perform our experiments on a number of datasets. The original dataset from mescheder2019occupancy is the subset of
ShapeNetCore[shapenet2015] from choy20163d. We also make use of ModelNet40[wu2015modelnet] for further classification
experiments and ShapeNetPart[yi2016scalable] for our segmentation experiments. Data pre-processing pipelines were accelerated with
GNU Parallel[tange2011parallel].

We limit our experiments to datasets with similar properties to those used in [mescheder2019occupancy], as we are not seeking to validate
the specific implicit representation format we are using, rather the benefits of more feature rich encodings.
This means that we do not consider larger scale datasets such as Stanford3D[armeni20163d] that our chosen method might struggle with.
We leave this to future experiments with other methods such as [Peng2020convoccupancy] or [chabra2020deep] that are better able to
reconstruct larger scenes.

For all experiments, the properties of the inputs remain constant.
For point-clouds we sample 300 points from the ground truth point-cloud, and apply noise using a Gaussian distribution with zero mean
and standard deviation 0.05 to the sampled point clouds, identically to [mescheder2019occupancy].
For images we crop and resize the images identically to [mescheder2019occupancy].

Choy / ShapeNetCore

The dataset used in mescheder2019occupancy, on which our work builds, uses the renderings and voxelisations[choy20163d]
of a subset of the ShapeNetCore[shapenet2015] dataset. We use the rendered images to train the image based encoder in later
experiments.
The fully processed dataset was provided by [mescheder2019occupancy] as part of their publication.
Briefly, meshes are loaded and a large number of depth images are rendered.
These depth images are fused to form a watertight mesh from which points and their corresponding occupancy value can be sampled.
Although the occupancy samples are not provided as part of the dataset in[choy20163d], to reduce ambiguity we will refer to the dataset
from [mescheder2019occupancy] as the Choy dataset throughout this paper.
The dataset consists of 30,648 training meshes, 4,358 validation meshes and 3,738 test meshes across 13 object categories.

We use the Choy dataset both for our baseline experiments, as well as some preliminary classification experiments.
Whilst this dataset only contains 13 separate classes, it is sufficient for preliminary experiments to validate our hypothesis.
Our experiments with this dataset are outlined in Sec. V-A.

ModelNet40

For further classification experiments, we make use of the popular ModelNet40[wu2015modelnet] dataset.
As rendered images were not readily available, we rendered images using Pyrender[matl2020pyrender] in the same fashion as [choy20163d],
choosing 24 viewpoints with constant radius and altitude, but random azimuth.
The occupancy samples are generated with the code provided by [mescheder2019occupancy].
The dataset consists of 9,843 training meshes and 2,468 testing meshes across 40 object categories.
Our experiments with this dataset are outlined in Sec. V-B.
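The viewpoint sampling used for these renderings (constant radius and altitude, random azimuth) can be sketched as follows; the radius and elevation values are illustrative placeholders, and the actual Pyrender rendering step is omitted.

```python
import numpy as np

def camera_positions(n_views=24, radius=1.5, elevation_deg=30.0, rng=None):
    """Camera centres on a circle of fixed radius and altitude, with
    uniformly random azimuth angles (radius and elevation are illustrative)."""
    rng = rng if rng is not None else np.random.default_rng()
    azimuth = rng.uniform(0.0, 2.0 * np.pi, size=n_views)
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuth)
    y = radius * np.cos(elev) * np.sin(azimuth)
    z = radius * np.sin(elev) * np.ones_like(azimuth)
    return np.stack([x, y, z], axis=-1)   # (n_views, 3) camera centres
```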

ShapeNetPart

For our semantic segmentation experiments, we make use of the dataset from yi2016scalable, which we refer to as ShapeNetPart.
Again the occupancy samples were generated using the code from [mescheder2019occupancy].
Semantic labels were assigned to the occupancy samples using a simple nearest neighbour assignment from the ground truth semantic labels
in[yi2016scalable].
The dataset consists of 12,121 training, 1,854 validation, and 2,858 testing meshes following the corresponding splits from ShapeNetCore.
Our experiments with this dataset are outlined in Sec. V-C.

V Results

V-A Choy Experiments

We begin with the dataset from the original paper.
Our experiments with point-cloud input are shown in Table I.
Given the small number of classes and fairly unique visual properties of the classes in this dataset, the high accuracy in classification is
not unexpected, even with the reduced quality of the input point-clouds.
To evaluate the performance of the baseline encoder, we fix the encoder and train a simple classifier on its output.
This classifier shows a substantial reduction in accuracy compared to the jointly trained classification and reconstruction results, where the
full accuracy on both tasks was recovered.
For the jointly trained experiment, the encoder was not fixed.

                                 IOU    Chamfer L1   Accuracy
ONet baseline                    0.78   0.0081       –
Classification baseline          –      –            0.92
Classification w/ ONet encoder   –      –            0.80
Joint Classification & ONet      0.77   0.0084       0.92
TABLE I: Experiments on the Choy dataset with point-cloud input, showing shape IOU and classification accuracy. For the classification task with the
ONet encoder, the encoder was fixed to allow the classification performance of the encodings themselves to be evaluated.
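The "Classification w/ ONet encoder" setting is essentially a probe on frozen encodings; below is a minimal sketch of one training step, where `encoder` and `classifier` are hypothetical stand-ins for the point-cloud encoder and the 2-layer MLP classifier.

```python
import torch
import torch.nn as nn

def train_probe_step(encoder, classifier, optimiser, pointclouds, labels):
    """One step of the fixed-encoder evaluation: the encoder's weights stay
    frozen and only the small classifier is updated. `encoder` and
    `classifier` are hypothetical modules used for illustration."""
    encoder.eval()
    with torch.no_grad():               # no gradients flow into the encoder
        z = encoder(pointclouds)        # frozen reconstruction encodings
    logits = classifier(z)
    loss = nn.functional.cross_entropy(logits, labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```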

Our experiments with image input are shown in Table II.
The results are similar to the point-cloud experiments. As discussed in [mescheder2019occupancy], the lower performance in reconstruction for
the ONet can potentially be attributed to occlusion.
We do not include the “Classification with ONet encoder” experiment, as the encodings from the pre-trained ResNet are likely already effective for
classification, meaning this experiment is not likely to provide any new insight.
The joint training result shows that the encoding is capable of performing both tasks without loss of accuracy.

                              IOU    Chamfer L1   Accuracy
ONet baseline                 0.58   0.021        –
Classification baseline       –      –            0.92
Joint Classification & ONet   0.59   0.020        0.92
TABLE II: Experiments on the Choy dataset with image input, showing shape IOU and classification accuracy.

V-B ModelNet40 Experiments

To better evaluate the classification performance, as well as the shortcomings of the reconstruction encodings in classification, we run the same
experiments as in Sec. V-A on ModelNet40, a more conventional 3D classification benchmark.

Our experiments with point-cloud input are shown in Table III.
The results follow a similar pattern to the point-cloud results from the Choy dataset.
As we expected, when we train the classifier using the fixed encoder from the reconstruction task, the classification performance is poor.
This reduction in performance is much more severe than on the Choy dataset, but is consistent with the increased difficulty shown by the lower
accuracy figure on the classification baseline.
However, this performance loss is completely recovered in the joint training, with only a minor decrease in reconstruction performance.

                                 IOU    Chamfer L1   Accuracy
ONet baseline                    0.73   0.011        –
Classification baseline          –      –            0.82
Classification w/ ONet encoder   –      –            0.57
Joint Classification & ONet      0.70   0.012        0.82
TABLE III: Experiments on the ModelNet40 dataset with point-cloud input, showing shape IOU and classification accuracy. As in Table I, the
encoder for the classification with ONet encoder was fixed.

Our experiments with image input are shown in Table IV.
Here we see that the joint training is able to recover much of the performance of either single task.
Again, because of the nature of the pre-trained ResNet, we do not include the fixed encoder task.

                              IOU    Chamfer L1   Accuracy
ONet baseline                 0.54   0.034        –
Classification baseline       –      –            0.85
Joint Classification & ONet   0.51   0.036        0.84
TABLE IV: Experiments on the ModelNet40 dataset with image input, showing shape IOU and classification accuracy.

V-C ShapeNetPart Experiments

Our metric for the segmentation task is mean Intersection over Union (mIOU).
Points are sampled within the shape and assigned semantic labels by the decoder.
The same sample points are used for both segmentation and reconstruction.
Whilst in a real world scenario points would be sampled both inside and outside the shape, we wish to assess the performance of the segmentation
decoder independently of the reconstruction performance, and so only consider points inside the shape.
The IOU is computed for each part in each shape and averaged to give a shape IOU.
If there are no ground truth points for a given part (e.g. whilst armrest is a part of the chair class, many of the chair instances do not have arms),
then the part is automatically assigned an IOU of 1.
We can then compute mIOU as the average of the shape IOUs.
At inference time, points are sampled randomly from a padded bounding box of the ground truth object, as in [mescheder2019occupancy].
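A sketch of the metric as just described: per-part IOUs averaged into a shape IOU, with parts absent from the ground truth scored as 1; the function and variable names are illustrative.

```python
import numpy as np

def shape_iou(pred_labels, gt_labels, part_ids):
    """Mean IOU over the parts of a single shape.

    Parts with no ground-truth points (e.g. armless chairs) score 1,
    as described in the text.
    """
    ious = []
    for part in part_ids:
        pred = pred_labels == part
        gt = gt_labels == part
        if gt.sum() == 0:
            ious.append(1.0)
            continue
        union = np.logical_or(pred, gt).sum()
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious))

# Dataset mIOU: the average of the per-shape IOUs, e.g.
# miou = np.mean([shape_iou(p, g, parts_of[cls]) for p, g, cls in shapes])
```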

                                             IOU    Chamfer L1   mIOU   Accuracy
ONet baseline                                0.69   0.010        –      –
Classification baseline                      –      –            –      0.95
Segmentation baseline                        –      –            0.53   –
Segmentation w/ ONet encoder                 –      –            0.49   –
Joint Segmentation & ONet                    0.70   0.0098       0.50   –
Parallel Segmentation & ONet                 0.68   0.011        0.53   –
Joint Segmentation & Classification & ONet   0.72   0.0086       0.50   0.95
TABLE V: Experiments on the ShapeNetPart dataset with point-cloud input, showing shape IOU, segmentation mIOU, and classification accuracy.

Table V shows the reconstruction accuracy, mIOU and classification accuracy of our different experiments on the ShapeNetPart dataset.
The results show little to no accuracy being lost in any of the tasks for the jointly trained settings.
Unlike in Table III with the fixed encoder, segmentation with a fixed ONet encoder does not show significantly worse performance than
the baseline task.
We suggest that this might be due to similarities between the reconstruction task and our modified segmentation task.
In the reconstruction task, the network is attempting to learn an encoding that represents the shape properties of a given region of space, such as
the curvature and boundaries.
These properties are likely also useful for the task of segmentation, i.e. the semantic class probabilities are potentially dependent on properties like
local curvature.

Class        ONet (rec. IOU)   Segmentation   Seg. w/ ONet enc.   Joint Seg. & ONet   Parallel Seg. & ONet   Joint Seg. & Cls. & ONet
Airplane     0.75              0.586          0.539               0.552               0.588                  0.55
Bag          0.708             0.464          0.445               0.455               0.496                  0.42
Cap          0.555             0.446          0.379               0.403               0.45                   0.381
Car          0.798             0.523          0.5                 0.505               0.528                  0.494
Chair        0.7               0.628          0.585               0.619               0.632                  0.62
Earphone     0.559             0.314          0.276               0.305               0.332                  0.322
Guitar       0.751             0.679          0.657               0.662               0.684                  0.659
Knife        0.705             0.549          0.5                 0.533               0.557                  0.536
Lamp         0.542             0.519          0.466               0.486               0.524                  0.49
Laptop       0.81              0.589          0.594               0.588               0.595                  0.591
Motorbike    0.527             0.53           0.431               0.468               0.528                  0.446
Mug          0.758             0.576          0.554               0.533               0.575                  0.545
Pistol       0.754             0.588          0.573               0.562               0.598                  0.558
Rocket       0.726             0.385          0.365               0.295               0.392                  0.299
Skateboard   0.684             0.508          0.479               0.469               0.504                  0.458
Table        0.703             0.567          0.535               0.564               0.568                  0.56
Mean         0.689             0.528          0.493               0.5                 0.535                  0.496
TABLE VI: Experiments on the ShapeNetPart dataset with point-cloud input, detailing per class results, showing segmentation mIOU.

Table VI shows the per-class segmentation results for the baseline, fixed encoder and joint training, as well as the reconstruction IOU
for the baseline.
The poor performance on some of the classes such as rocket and earphone may be explained by the thin sections in parts of those objects.
Because the network samples points within the shapes randomly, thin sections like the fins (rocket), cable (earphones), or handlebars (motorbike) are
likely to be undersampled and therefore perform poorly at inference time (see Fig. 3).
As well as this imbalance, there is also significant imbalance in the number of models in certain categories, which can negatively affect accuracy at
inference time.
This is reflected in the higher mIOU scores across all the experiments for the classes with more shapes.

Fig. 4 shows some selected qualitative segmentation results.
The segmentation decoders show good results for bulk areas like the wings and body of the aeroplane (1st row) and for simple objects like the
guitar (4th row).
However, areas like the chair arms (3rd row) present more of a challenge.
We can also see another example of the low performance on thin sections on the roofs of the cars (2nd row), particularly so for the right hand car.

Fig. 3 shows some of the failure cases of the segmentation decoder.
A particularly extreme case (2nd row) is shown where the correct semantic labels are completely inverted.

Fig. 3: Segmentation failure cases. Ground truth on the left, predicted segmentation on the right. For the rocket, thin features like the fins can
be under-sampled and therefore confuse the network. The cable on the earphones may also suffer from this problem. The earphones present a complete
failure with semantic classes completely inverted.
Fig. 4: Qualitative segmentation results on the ShapeNetPart[yi2016scalable] dataset. From left to right: Ground truth, Segmentation
baseline, Joint segmentation and classification and ONet.

VI Conclusion

In this paper we have discussed the potential to generalise the encodings used by implicit representations to a broader range of tasks.
We discuss the current narrow focus of implicit representations, and the potential issues this raises for applications of implicit representations
in the real world.
We also introduce a modified formulation of the conventional segmentation task that is more applicable to implicit contexts, and detail an appropriate
network to use for this new formulation.
We choose two common computer vision tasks and demonstrate that, through multi-task training, we can enrich the encodings, achieving strong performance
across the tasks without any loss in reconstruction accuracy, and also show how certain encodings can struggle with some tasks but not others.

References

    Footnotes

    1. Arguably, occupancy networks are simply SDF networks with the sign function applied to their output; however, this ignores
      the increased complexity in regressing SDF values rather than simply their sign. We argue this point in more detail in Sec. III-A.


