DSLib: An open source library for the dominant set clustering method

Basic Theroy Deep Talk 2周前 (10-16) 11次浏览 已收录 0个评论 扫描二维码

DSLib: An open source library for the dominant set clustering method

Abstract

DSLib is an open source implementation of the Dominant Set (DS) clustering algorithm written entirely in Matlab® . The DS method is a graph-based clustering technique rooted in the evolutionary game theory that starts gaining lots of interests in the computer science community. It has been originally introduced in [4, 5] and, thanks to its duality with game theory and its strict relation to the notion of maximal clique, has been explored in several directions not only related to clustering problems. Applications in graph matching, segmentation, classification and medical imaging are common in literature.
This package provides an implementation of the original DS clustering algorithm, since no code has been officially released yet, together with a still growing collection of methods and variants related to it.
Our library is integrable into a Matlab®pipeline without dependencies, it is simple to use and easily extendable for upcoming works.
The latest source code, the documentation and some examples can be downloaded from https://xwasco.github.io/DominantSetLibrary

Keywords: Dominant set, clustering, graph, maximal clique, game theory

/SetBgContents

1 Introduction

Clustering is a relevant task in the domain of pattern recognition.

It requires to group
elements into meaningful subsets (e.g., points which are close in a Euclidean space) exploiting the underlying hidden structure of the data.

It is in general an ill-posed problem
since different partitioning of the same dataset could end up into a meaningful subdivision of the data.
Two properties characterize, in general, every clustering algorithms: each subset (cluster) should have a high internal coherence and a high external in-coherence this means that items belonging to the same cluster should have a high similarity while items of different clusters should be more dissimilar as possible. These properties are essential to achieving a good separation of the subsets and represent the foundation of many clustering algorithms

including the dominant set clustering method.
The dominant set (DS) algorithm has been formalized in the domain of graph theory, with a strict relationship to the notion of maximal clique 1 generalizing it to the edge-weighted case.

In the DS algorithm, a dataset is represented as an edge-weighted graph where nodes are data points, and edges encode the relationship between pairs of points in terms of a similarity weight.

The graph, encoded as an affinity matrix, is given in input to the algorithm.
The clusters (dominant set) are then iteratively extracted from the graph using a dynamical system that mimics a natural selection process. Three different dynamical systems are implemented into this library.

In particular, using this clustering technique is beneficial when i) the number of clusters to be found is totally unknown, ii) the pairwise similarity function is not symmetric, iii) a quantification of the quality of cluster is needed (i.e. to detect outliers) and iv) a notion of centrality of an element with respect to the cluster (centroid) is required. The aforementioned properties are hard to find in a traditional clustering algorithm (i.e. k-means, spectral clustering and so on). Furthermore, the DS algorithm proved to be particularly suitable when the datasets are noisy and corrupted by outliers thanks to its nature of finding highly compact and coherent structures in the graph.

 

The remaining of the paper is organized as follow: in Sec 1.1 a brief introduction to the dominant set method and evolutionary game theory are provided, in Sec 2 the software package is explained with a practical example, in Sec 3 some use-cases are given and, finally, in Sec 4 conclusions are drawn.

1.1 Dominant Sets

A dataset is represented as an edge-weighted graph with no self-loops, where is the set of nodes, is the set of edges connecting pairs of nodes and is a weighting function mapping edges to nonnegative weights. One of the possible way of defining is through the following kernel where is a generic distance function (typically the Euclidean distance) between elements and (with ) and is the decay of the negative exponential (or the scale of the kernel). Since the DS method requires a graph in terms of its pair-wise affinity matrix , other similarity functions can be plugged in by the user without any problem.

Finding a dominant set (a cluster) is achieved optimizing a quadratic problem over the standard simplex:

(1)
s.t.

where is the affinity matrix of size , the number of nodes in the graph (number of elements in the dataset) and is the standard simplex consisting of -dimensional probability vectors.
A local solution of problem (1) can be found by iterating the following dynamics until convergence:

(2)

Eq.(2) is a dynamical system, known as Replicator Dynamics(RD) [11], which mimics a Darwinian selection process over the nodes, letting the ones that mostly support each other (the ones that are more similar) to survive (retain a nonzero probability) to the detriment of the others. Two additional dynamical systems are proposed in this package, the Infection Immunization [7] and the Exponential Replicator Dynamics [11] which can be used for large graphs and for faster convergence, respectively.
At convergence of the chosen dynamical system, the elements that exceed a given probability threshold represent the dominant set. Those elements are removed from the original graph and the process is reiterated on the remaining nodes until all the nodes are grouped. This iterative peeling-off procedure to extract DSs is implemented in the library (see Sec 2.3).

2 Description of the package

In this section, we provide a short description of the library and how to integrate it into a Matlab®pipeline. The inline documentation of the package provides all the information for the parameters.

2.1 Requisites

No particular requirements are needed, the library runs on any Matlab environment (Windows, Linux, Mac) and it has no dependencies. It has been tested on Matlab 2014b, 2016b and 2018b in Windows/Linux and MacOSX. Some components are written in C-mex for optimization purpose but equivalent Matlab code is available without the need of compiling it.

2.2 What’s inside the package?

In this version of the library, we have implemented the following papers: [4, 5, 7, 8] and a still growing set of papers will be added in the future. For the curious reader, we refer to [1] for an outstanding review of the DS method.

2.3 Practical usage

In this section we show how to setup a basic running example to cluster a synthetic dataset.

Step 1: Generating data and build the affinity matrix :

rng(’default’);                                                         % For repeatability
cx = [1 1;5 5 ;8 8];                                    %center of the clouds of points
npts=100; pts= repmat(cx,npts,1) + randn(npts*size(cx,1),2);
A=pdist(pts);                                                   %pairwise Euclidean distances
s=3*var(A);             %a heuristic to find sigma
A=exp(-A./s);  A=squareform(A); %from distance to similarity
A=A.*not(eye(size(A))); %the graph should not have selfloops

Step 2: Choosing the evolutionary dynamic and the DS parameters:

dynType=1;              %0=Replicator Dynamics, 1=InfectionImmunization 2=Exponential replicator dynamics
precision=1e-6;  %the precision required from the dynamical system
maxIters=1000;  %number of maximum iteration of the dynamical system
x=ones(size(A,1))./size(A,1); %starting point of the dynamical system
theta=1e-5;             %threshold used to extract the support from x.

Step 3: Call the clustering method

[C]=dominantset(A,x,theta,precision,maxIters,dynType);

alternatively, the clustering method can be called providing only the similarity matrix and using the default parameter.

[C]=dominantset(A);

Step 4: Show the cluster results

gscatter(pts(:,1),pts(:,2),C);  %show the points colored by cluster

The package comes with more complex examples and a deep inline documentation.

3 Case studies

The core described in this manuscript has been used in several studies. Here we highlight three applications in radically different domains to prove the transversality of the method:

  • Protein Clustering: In [6] the DS has been used to study the interaction of proteins during the neuronal plasticity. The tool provides unprecedented insight into the problem. Number of nodes in the graph .

  • Conversational group detection: In [10, 9] the DS has been used to automatically detect conversational groups in real-time in video sequences, a task of utmost importance in video surveillance. Number of nodes in the graph .

  • Brain connectomics: In [2, 3] the DS has been used to study the neuronal connections between the hemisphere with the aims of automatically group fibers having the same shape. Furthermore, a matching procedure allows recovering the common structures across different subjects. Number of nodes in the graph .

4 Conclusion

The DS clustering method has proven its powerfulness in the last decade, leading to a variety of applications. Here we present a library implementing the original paper and some follow-up methods. The library is written in pure Matlab® with C-mex components, is cross-platform, open source and does not have any dependency. Full documentation, tutorials, download, and the updated source code can be found at https://xwasco.github.io/DominantSetLibrary/.

Acknowledgments

We would like to acknowledge Dr. Luca Dodero, prof. Diego Sona for testing the library and giving us their invaluable support. Furthermore, we want to thanks the entire Student Branch @ DAIS of Ca’ Foscari University of Venice for their suggestions.

Footnotes

  1. A clique is a set of mutually connected nodes

References


  1. Samuel Rota Bulò and Marcello Pelillo.


    Dominant-set clustering: A review.


    European Journal of Operational Research, 262(1):1 – 13, 2017.


  2. L. Dodero, S. Vascon, L. Giancardo, A Gozzi, D. Sona, and V. Murino.


    Automatic white matter fiber clustering using dominant sets.


    In Pattern Recognition in Neuroimaging (PRNI), 2013
    International Workshop on
    , pages 216–219, June 2013.


  3. Luca Dodero, Sebastiano Vascon, Vittorio Murino, Angelo Bifone, Alessandro
    Gozzi, and Diego Sona.


    Automated multi-subject fiber clustering of mouse brain using
    dominant sets.


    Frontiers in Neuroinformatics, 8:87, 2015.


  4. M. Pavan and M. Pelillo.


    A new graph-theoretic approach to clustering and segmentation.


    In Computer Vision and Pattern Recognition, 2003. Proceedings.
    2003 IEEE Computer Society Conference on
    , volume 1, pages I–145–I–152
    vol.1, June 2003.


  5. Massimiliano Pavan and Marcello Pelillo.


    Dominant sets and pairwise clustering.


    IEEE Trans. Pattern Anal. Mach. Intell., 29(1):167–172,
    January 2007.


  6. Francesca Pennacchietti, Sebastiano Vascon, Thierry Nieus, Christian Rosillo,
    Sabyasachi Das, Shiva Tyagarajan, Alberto Diaspro, Alessio del Bue, Enrica
    Maria Petrini, Andrea Barberis, and Francesca Cella Zanacchi.


    Nanoscale molecular reorganization of the inhibitory postsynaptic
    density is a determinant of gabaergic synaptic potentiation.


    Journal of Neuroscience, 2017.


  7. Samuel Rota Bulo and Immanuel M. Bomze.


    Infection and immunization: A new class of evolutionary game
    dynamics.


    Games and Economic Behavior, 71(1):193–211, January 2011.


  8. Sebastiano Vascon, Marco Cristani, Marcello Pelillo, and Vittorio Murino.


    Using dominant sets for k-nn prototype selection.


    In Alfredo Petrosino, editor, Image Analysis and Processing –
    ICIAP 2013
    , volume 8157 of Lecture Notes in Computer Science, pages
    131–140. Springer Berlin Heidelberg, 2013.


  9. Sebastiano Vascon, Eyasu Zemene Mequanint, Marco Cristani, Hayley Hung,
    Marcello Pelillo, and Vittorio Murino.


    Detecting conversational groups in images and sequences: A robust
    game-theoretic approach.


    Computer Vision and Image Understanding, 143:11–24, 2016.


  10. Sebastiano Vascon, Eyasu Zemene Mequanint, Marco Cristani, Hayley Hung,
    Marcello Pelillo, and Vittorio Murino.


    A game-theoretic probabilistic approach for detecting conversational
    groups.


    In Asian Conference in Computer Vision, Lecture Notes in
    Computer Science, 2014.


  11. Jörgen W Weibull.


    Evolutionary game theory.


    The MIT press, 1995.

https://www.groundai.com/project/dslib-an-open-source-library-for-the-dominant-set-clustering-method/


CSIT FUN , 版权所有丨如未注明 , 均为原创丨本网站采用BY-NC-SA协议进行授权
转载请注明原文链接:DSLib: An open source library for the dominant set clustering method
喜欢 (0)
[985016145@qq.com]
分享 (0)
发表我的评论
取消评论
表情 贴图 加粗 删除线 居中 斜体 签到

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址