Large-scale data analysis is an inherently multidisciplinary area of growing interest to today's society. The ANR project SMILES is a collaborative fundamental research project that aims to introduce an unsupervised statistical modeling framework and scalable inference algorithms for transforming large-scale data into knowledge. It considers the large-scale context as a whole, addressing the main issues of inference from large volumes of very high-dimensional data with complex underlying hidden structures. The key tenet of SMILES is to introduce large-scale latent data models for unsupervised data classification and large-scale regression-based sparse (non)parametric models for data representation. Knowledge extraction will consist in automatically retrieving hidden structures, summarizing prototypes, groups, and sparse representations. We consider different data settings, including functional data, multimodal bioacoustic data, and biological data.