Our research described in this thesis is about the learning of a motif-based representation from time series to perform automatic classification. Meaningful information in time series can be encoded across time through trends, shapes or subsequences usually with distortions. Approaches have been developed to overcome these issues often paying the price of high computational complexity. Among these techniques, it is worth pointing out distance measures and time series representations.
We focus on the representation of the information contained in the time series. We propose a framework to generate a new time series representation to perform classical feature-based classification based on the discovery of discriminant sets of time series subsequences (motifs). This framework proposes to transform a set of time series into a feature space, using subsequences enumerated from the time series, distance measures and aggregation functions. One particular instance of this framework is the well-known shapelet approach.
The potential drawback of such an approach is the large number of subsequences to enumerate, inducing a very large feature space and a very high computational complexity. We show that most subsequences in a time series dataset are redundant. Therefore, a random sampling can be used to generate a very small fraction of the exhaustive set of subsequences, preserving the necessary information for classification and thus generating a much smaller feature space compatible with common machine learning algorithms with tractable computations. We also demonstrate that the number of subsequences to draw is not linked to the number of instances in the training set, which guarantees the scalability of the approach.
The combination of the latter in the context of our framework enables us to take advantage of advanced techniques (such as multivariate feature selection techniques) to discover richer motif-based time series representations for classification, for example by taking into account the relationships between the subsequences.
These theoretical results have been extensively tested on more than one hundred classical benchmarks of the literature with univariate and multivariate time series. Moreover, since this research has been conducted in the context of an industrial research agreement (CIFRE) with Arcelormittal, our work has been applied to the detection of defective steel products based on production line's sensor measurements.