Room P3.10, Mathematics Building

M. Rosário Oliveira, CEMAT & Instituto Superior Técnico, Universidade de Lisboa
Symbolic Covariance Matrices and Principal Component Analysis for Interval Data

Recent years have witnessed major technological advances that enable the storage of massive amounts of information. The nature of the information collected is also changing: besides the traditional format of recording a single value for each observation, it is now possible to record lists, intervals, histograms, or even distributions to characterize an observation. However, conventional data analysis is not prepared for either of these challenges, and lacks the appropriate means to treat extremely large databases or data with a more complex structure. In response to these challenges, Symbolic Data Analysis, introduced in the late 1980s by Edwin Diday, extends classical data analysis to deal with more complex data by taking into account inner data variability and structure.

Principal component analysis is one of the most popular statistical methods to analyse real data. Therefore, there have been several proposals to extend this methodology to the symbolic data analysis framework, in particular to interval-valued data.

In this talk, we discuss the concepts and properties of symbolic variance and covariance for interval-valued variables. Based on these, we develop population formulations for four symbolic principal component estimation methods. These formulations simplify and unify the discussed methods and offer additional insight into them. We also present an explicit and straightforward formula that defines the scores of the symbolic principal components, equivalent to the representation by Maximum Covering Area Rectangle.
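
As a rough illustration of what a symbolic variance can look like, the sketch below computes one commonly used sample definition for an interval-valued variable. It assumes values are uniformly distributed within each observed interval, which is only one of several possible definitions and not necessarily the one adopted in the talk. The function name `symbolic_mean_var` and the sample data are hypothetical.

```python
def symbolic_mean_var(intervals):
    """Symbolic sample mean and variance of an interval-valued variable.

    Assumption (not from the abstract): each observation is an interval
    [a, b] with values uniformly distributed inside it.
    intervals: list of (a, b) pairs with a <= b.
    """
    n = len(intervals)
    # Mean of the interval midpoints.
    mean = sum(a + b for a, b in intervals) / (2 * n)
    # Average second moment of a uniform distribution on each [a, b].
    second_moment = sum(a * a + a * b + b * b for a, b in intervals) / (3 * n)
    return mean, second_moment - mean ** 2


# Hypothetical data: three observations recorded as intervals.
data = [(1.0, 3.0), (2.0, 6.0), (0.0, 2.0)]
mean, var = symbolic_mean_var(data)
print(mean, var)
```

When every interval degenerates to a single point (a = b), this expression reduces to the classical population variance, which is one way to see that the symbolic definition accounts for the variability inside each observation in addition to the variability between observations.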

Joint work with António Pacheco (CEMAT and DM-IST, Univ. de Lisboa), Paulo Salvador (IT, Univ. Aveiro), Rui Valadas (IT, Univ. de Lisboa), and Margarida Vilela (CEMAT).