Symbolic Data Analysis: Parametric multivariate analysis of interval data

Professor Paula Brito, Universidade do Porto

  • Thu 25 Apr 19

    11:00 - 12:30

  • Colchester Campus


  • Event speaker

    Professor Paula Brito

  • Event type

    Lectures, talks and seminars
    Mathematical Sciences Departmental Seminar

  • Event organiser

    Mathematical Sciences, Department of

  • Contact details

    Andrew Harrison

Symbolic Data Analysis is concerned with analysing data with intrinsic variability, which must be taken into account. In Data Mining, Multivariate Data Analysis and classical Statistics, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals, described by age, salary, education level, etc.

However, when the elements of interest are classes or groups of some kind - the citizens living in given towns; car models, rather than specific vehicles - then there is variability inherent to the data.

Symbolic data goes beyond the usual data representation model, considering variables whose observed values for each element are no longer necessarily single real values or categories, but may assume the form of sets, intervals, or, more generally, distributions. In this talk we focus on the analysis of interval data, i.e., when the variables’ values are intervals of ℝ.

Parametric probabilistic models for interval-valued variables have been proposed and studied by Brito & Duarte Silva (2012). These models are based on the representation of each observed interval by its MidPoint and LogRange, and Multivariate Normal and Skew-Normal distributions are assumed for the whole set of 2p MidPoints and LogRanges of the original p interval-valued variables.
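As a rough illustration of the MidPoint and LogRange representation, the sketch below converts interval bounds into the 2p-dimensional vector and computes the Gaussian maximum likelihood estimates of its mean and covariance. It assumes the MidPoint is (lower + upper) / 2 and the LogRange is the natural log of (upper - lower). The function names and the toy data are hypothetical, it uses NumPy rather than the MAINT.Data package, and it covers only the Normal case, not the Skew-Normal one.

```python
import numpy as np


def midpoints_and_logranges(lower, upper):
    """Map n x p interval bounds to an n x 2p matrix [MidPoints | LogRanges]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(upper <= lower):
        # A degenerate interval has zero range, so its log-range is undefined.
        raise ValueError("every interval needs upper > lower")
    mid = (lower + upper) / 2.0
    log_range = np.log(upper - lower)
    return np.hstack([mid, log_range])


def normal_mle(z):
    """Gaussian MLE of mean and covariance (divides by n, not n - 1)."""
    mu = z.mean(axis=0)
    centered = z - mu
    sigma = centered.T @ centered / z.shape[0]
    return mu, sigma


# Hypothetical toy data: 4 units described by p = 2 interval-valued variables.
lower = np.array([[1.0, 10.0], [2.0, 12.0], [0.5, 9.0], [1.5, 11.0]])
upper = np.array([[3.0, 14.0], [5.0, 13.0], [2.5, 15.0], [2.0, 16.0]])

z = midpoints_and_logranges(lower, upper)  # shape (4, 4): 2 MidPoints, 2 LogRanges
mu, sigma = normal_mle(z)
```

Working with the log of the range keeps the transformed variable on the whole real line, which is what makes a Normal assumption on the 2p-vector plausible in the first place.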

The intrinsic nature of the interval-valued variables leads to different structures of the variance-covariance matrix, represented by different possible configurations. For all cases, maximum likelihood estimators of the corresponding parameters have been derived.
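To make the idea of a restricted covariance configuration concrete, here is a minimal sketch of one possible restriction, chosen for illustration and not taken from the talk: MidPoints are assumed uncorrelated with LogRanges, so the 2p x 2p covariance matrix is block diagonal. Under a Gaussian model the maximum likelihood estimate for such a structure keeps the two diagonal blocks of the unrestricted estimate and zeroes the cross block. The function names and the simulated data are hypothetical.

```python
import numpy as np


def full_and_block_diagonal_mle(z, p):
    """Return the unrestricted Gaussian MLE covariance and a block-diagonal
    version in which the first p columns (MidPoints) are uncorrelated with
    the last p columns (LogRanges)."""
    centered = z - z.mean(axis=0)
    full = centered.T @ centered / z.shape[0]
    restricted = np.zeros_like(full)
    restricted[:p, :p] = full[:p, :p]  # MidPoint block
    restricted[p:, p:] = full[p:, p:]  # LogRange block
    return full, restricted


def gaussian_loglik(z, mu, sigma):
    """Log-likelihood of the rows of z under N(mu, sigma)."""
    n, d = z.shape
    centered = z - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = np.einsum("ij,jk,ik->", centered, np.linalg.inv(sigma), centered)
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + quad)


# Hypothetical simulated data standing in for MidPoints and LogRanges (p = 2).
rng = np.random.default_rng(0)
z = rng.normal(size=(50, 4))
mu = z.mean(axis=0)

full, restricted = full_and_block_diagonal_mle(z, p=2)
ll_full = gaussian_loglik(z, mu, full)
ll_restricted = gaussian_loglik(z, mu, restricted)
```

Because the restricted model is nested in the unrestricted one, its maximised log-likelihood can never exceed the unrestricted value; comparing the two is one simple way to judge whether a sparser configuration is adequate.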

This framework may be applied to different multivariate statistical methodologies, thereby allowing for inference approaches for symbolic data; in particular, (M)ANOVA, discriminant analysis, model-based clustering, robust estimation and outlier detection are addressed. The models and methods discussed are implemented in the R package MAINT.Data, available on CRAN.