Scientific context and challenges

We are confronted with the “Big Data” phenomenon, that is, data that are large-scale in terms of volume, dimension, and complexity. Such large-scale data are widely expected to transform business, scientific discovery, industry, and beyond. Yet, despite this enormous potential for benefiting today’s society, learning from such data by automatically extracting relevant information remains a challenge, and both theory and algorithms have to be pushed forward. Indeed, the problem of large-scale data analysis, while challenging, is still insufficiently investigated from a fundamental point of view; most progress has come from increasing computing resources, that is, from a practical point of view. We believe that addressing large-scale data analysis does not only mean addressing the computational issues of state-of-the-art models, but mainly requires the development of new methodologies. Large-scale modeling has mainly been considered in supervised learning, and it is only very recently that significant fundamental results have appeared in the statistical analysis of large-scale data; see for example Kleiner et al. (2014) on regression and supervised classification for massive data. State-of-the-art latent data models have mainly focused on the problem of high dimensionality (d >> n) by constructing sparse models, e.g., high-dimensional mixtures (Witten and Tibshirani, 2010; Ruan et al., 2011; Celeux et al., 2011; Azizyan et al., 2014). However, scaling up these models in both d and n is a challenge and opens new issues. Sparse models have focused on multivariate data with independent variables, while large-scale functional data, in particular, remain largely uninvestigated. On the other hand, inference for large-scale data usually entails distributed computing, which raises fundamental issues regarding how to distribute the data while obtaining theoretical guarantees for the aggregation of the resulting models.


Scientific objectives and hypotheses

SMILES aims at developing a new modeling and inference framework for learning from large-scale data by establishing new statistical methodologies for unsupervised data classification and representation. It opts for the general framework of generative statistical models as the best way to model the uncertainty in large-scale data, and as a well-principled framework to deal with unsupervised learning. Inference of the models from data is a central topic of the project, which considers the large-scale context as a whole, with all of its main issues related to inference from complex, heterogeneous, high-dimensional, dynamical, and massive data. SMILES mainly focuses on large-scale unsupervised learning and has the following objectives:
i. Unsupervised large-scale statistical modeling by latent data models: In this unsupervised context, we opt for the general framework of generative learning, which includes latent data models (LDM), a very successful framework for automatically acquiring knowledge from data. It notably includes mixture models, which are, on the one hand, universal data density approximators and, on the other hand, a relevant way to explicitly represent and retrieve the underlying process generating the data. LDM explicitly model the observed data x as the marginal output of more general observations including some hidden information z ~ p(z), i.e., p(x;θ) = ∫p(x,z;θ) dz. Then, z can be restored through p(z|x;θ), given an estimator of θ constructed in a parametric or nonparametric setting (a toy illustration is sketched after this list). Despite its appealing properties, the generative modeling framework, to the best of our knowledge, has not been successfully considered for large-scale analysis. In this proposal, we aim at addressing this challenge by developing new large-scale LDM for different data settings, including high-dimensional multivariate data (d >> n and large n) and functional data (structured families of p(x,z) and d → ∞), and at applying them in pilot applications.
ii. Efficient unsupervised large-scale inference for LDM: Inference of LDM generally requires the optimization of complex non-linear problems of the form max_θ log ∫p(x,z;θ) dz, which, in a large-scale setting, calls for new regularization strategies leading to effective algorithms. Regularization techniques are indeed the most likely to work at large scale, as they promote sparse models with respect to the huge number of model parameters in a very-high-dimensional setting. In our context, penalized log-likelihood criteria of the form PL(θ) = log ∫p(x,z;θ) dz + Pen(θ) will be proposed, where the penalty should perform feature selection despite the missing information z (an unsupervised feature selection). On the other hand, one may rely on the theory of sampling as a relevant framework to infer the model from a big volume of data, for which distributing the computations is a natural practical solution, especially for batch data. However, this raises the problem of guaranteeing the quality of estimators aggregated from locally computed estimators. In this unsupervised context, this may be handled by bootstrapping LDM, and the issues to address here are i) obtaining guarantees and new strategies for estimator aggregation, that is, how to efficiently construct an estimator θ* as an aggregation of (θ1*,…,θB*) obtained from B bootstrap samples (the second sketch after this list illustrates this aggregation step), and ii) solving the model selection problem, which also suggests introducing new selection criteria adapted to this new distributed setting.
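
To make the generative setting of objective i concrete, the following toy sketch (a minimal, hypothetical illustration, not part of the project’s methodology) fits a two-component Gaussian mixture, i.e., a simple LDM with p(x;θ) = Σ_k π_k N(x; μ_k, Σ_k), and restores the hidden label z through the posterior p(z|x;θ). The simulated data, the number of components, and the use of scikit-learn’s GaussianMixture are assumptions made purely for illustration.

```python
# Minimal sketch (illustrative assumptions only): a Gaussian mixture as a latent
# data model, with the hidden label z restored through the posterior p(z | x; theta).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulate data whose hidden component labels z are never observed.
n = 500
z_true = rng.integers(0, 2, size=n)
means_true = np.array([[0.0, 0.0], [4.0, 4.0]])
X = means_true[z_true] + rng.normal(size=(n, 2))

# Estimate theta = (pi_k, mu_k, Sigma_k) by maximum likelihood (EM algorithm).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Restore z through p(z = k | x; theta) = pi_k N(x; mu_k, Sigma_k) / p(x; theta).
densities = np.column_stack([
    multivariate_normal(mean=gmm.means_[k], cov=gmm.covariances_[k]).pdf(X)
    for k in range(gmm.n_components)
])
joint = gmm.weights_ * densities                      # pi_k * N(x; mu_k, Sigma_k)
posterior = joint / joint.sum(axis=1, keepdims=True)  # p(z | x; theta)

# Sanity check: matches the responsibilities returned by the fitted model.
assert np.allclose(posterior, gmm.predict_proba(X), atol=1e-6)
print("p(z | x; theta) for the first observation:", posterior[0])
```

The manually computed responsibilities coincide with those returned by the fitted model; this restoration mechanism is what the project intends to scale up to high-dimensional, functional, and massive data.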
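The second sketch below illustrates, under simplifying assumptions, the bootstrap-and-aggregate strategy of objective ii: the mixture is fitted independently on B bootstrap resamples (a step that could be distributed), the components of each local estimator θb* are aligned to a reference fit to avoid label switching, and the aggregated estimator θ* is taken here as a plain average of the aligned parameters. The averaging rule, the Hungarian-matching alignment, and all numerical choices are illustrative assumptions; the project precisely aims at studying which aggregation strategies come with theoretical guarantees.

```python
# Minimal sketch (illustrative assumptions only) of bootstrap-and-aggregate
# estimation of a latent data model: B local fits, component alignment, averaging.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy data: n observations from a two-component Gaussian mixture.
n, K = 1000, 2
z = rng.integers(0, K, size=n)
X = np.array([[0.0, 0.0], [5.0, 5.0]])[z] + rng.normal(size=(n, 2))

def fit_on_bootstrap(X, rng, K):
    """Fit the mixture model on one bootstrap resample of the data."""
    idx = rng.integers(0, len(X), size=len(X))  # draw n rows with replacement
    return GaussianMixture(n_components=K, random_state=0).fit(X[idx])

B = 20
fits = [fit_on_bootstrap(X, rng, K) for _ in range(B)]  # this loop could be distributed

# Align each local fit to a reference fit (the first one) by matching components
# through their means, so that averaging is not corrupted by label switching.
ref_means = fits[0].means_
aligned_means, aligned_weights = [], []
for f in fits:
    cost = ((f.means_[:, None, :] - ref_means[None, :, :]) ** 2).sum(axis=2)
    _, col = linear_sum_assignment(cost)  # col[i]: reference component matched to fit component i
    perm = np.argsort(col)                # reorder fit components into the reference order
    aligned_means.append(f.means_[perm])
    aligned_weights.append(f.weights_[perm])

# Aggregated estimator theta*: here simply the average of the aligned local estimates.
theta_star_means = np.mean(aligned_means, axis=0)
theta_star_weights = np.mean(aligned_weights, axis=0)
print("Aggregated component means:\n", theta_star_means)
print("Aggregated mixing weights:", theta_star_weights)
```

Plain averaging is only the simplest possible aggregation rule; its quality, the choice of B, and model selection across bootstrap samples are exactly the open issues raised in objective ii.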