Probability and Statistics Seminar

Past sessions

Symbolic Covariance Matrices and Principal Component Analysis for Interval Data

Recent years have witnessed huge technological breakthroughs that enable the storage of massive amounts of information. Additionally, the nature of the information collected is also changing. Besides the traditional format of recording a single value for each observation, we now have the possibility of recording lists, intervals, histograms or even distributions to characterize an observation. However, conventional data analysis is not prepared for either of these challenges, and does not have the necessary or appropriate means to treat extremely large databases or data with a more complex structure. As an answer to these challenges, Symbolic Data Analysis, introduced in the late 1980s by Edwin Diday, extends classical data analysis to deal with more complex data by taking into account inner data variability and structure.

Principal component analysis is one of the most popular statistical methods to analyse real data. Therefore, there have been several proposals to extend this methodology to the symbolic data analysis framework, in particular to interval-valued data.

In this talk, we discuss the concepts and properties of the symbolic variance and covariance of an interval-valued variable. Based on these, we develop population formulations for four symbolic principal component estimation methods. These formulations introduce simplifications, additional insight and a unification of the discussed methods. Additionally, an explicit and straightforward formula that defines the scores of the symbolic principal components, equivalent to the representation by Maximum Covering Area Rectangle, is also presented.
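To fix ideas, one common definition of the symbolic variance of an interval-valued variable (in the style of Bertrand and Goupil) treats each observed interval as a uniform distribution over it. The sketch below is purely illustrative and is not necessarily the exact definition used in the talk:

```python
def symbolic_mean(intervals):
    # symbolic sample mean: the average of the interval midpoints
    return sum((a + b) / 2 for a, b in intervals) / len(intervals)

def symbolic_variance(intervals):
    # symbolic sample variance in the style of Bertrand and Goupil:
    # each interval [a, b] contributes the second moment of a
    # Uniform(a, b) distribution, (a^2 + a*b + b^2) / 3
    n = len(intervals)
    second_moment = sum((a * a + a * b + b * b) / 3 for a, b in intervals) / n
    return second_moment - symbolic_mean(intervals) ** 2
```

Note that for degenerate intervals [x, x] this reduces to the classical sample variance, which is one way such definitions take inner data variability into account.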

Joint work with António Pacheco (CEMAT and DM-IST, Univ. de Lisboa), Paulo Salvador (IT, Univ. Aveiro), Rui Valadas (IT, Univ. de Lisboa), and Margarida Vilela (CEMAT).

On ARL-unbiased c-charts for i.i.d. and INAR(1) Poisson counts

In Statistical Process Control (SPC) it is usual to assume that counts have a Poisson distribution. The non-negative, discrete and asymmetrical character of a control statistic with such a distribution, and the value of its target mean, may prevent the quality control practitioner from dealing with a c-chart with:

1. a positive lower control limit and the ability to control not only increases but also decreases in the mean of those counts in a timely fashion;
2. a pre-specified in-control average run length (ARL).

Furthermore, as far as we have investigated, the c-charts proposed in the SPC literature tend not to be ARL-unbiased. (The term ARL-unbiased is used here to describe any control chart for which all out-of-control ARL values are smaller than the in-control ARL.)

In this talk, we explore the notions of unbiased, randomized and uniformly most powerful unbiased tests (resp. randomization of the emission of a signal and a nested secant rule search procedure) to:

1. eliminate the bias of the ARL function of the c-chart for the mean of i.i.d. (resp. first-order integer-valued autoregressive, INAR(1)) Poisson counts;
2. bring the in-control ARL exactly to a pre-specified and desired value.

We use the R statistical software to provide striking illustrations of the resulting ARL-unbiased c-charts.
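As background for the ARL notion used above: for i.i.d. counts and a c-chart with fixed integer control limits, the run length is geometric, so the ARL is the reciprocal of the per-sample signal probability. The sketch below illustrates only this baseline (in Python rather than R, with arbitrary limits); it does not implement the randomized, ARL-unbiased construction discussed in the talk:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def arl(lam, lcl, ucl):
    # a count signals when it falls outside [LCL, UCL]; for i.i.d.
    # counts the run length is geometric, so the ARL is the
    # reciprocal of the per-sample signal probability
    p_no_signal = sum(poisson_pmf(k, lam) for k in range(lcl, ucl + 1))
    return 1.0 / (1.0 - p_no_signal)
```

Evaluating `arl` over a grid of mean values traces out the ARL function, whose possible bias (out-of-control ARLs exceeding the in-control ARL) is what the proposed charts eliminate.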

Joint work with Sofia Paulino and Sven Knoth.

Transmission and Power Generation Investment under Uncertainty

The challenges of deregulated electricity markets and ambitious renewable energy targets have contributed to an increased need to understand how market participants will respond to a transmission planner's investment decision. We study the optimal transmission investment decision of a transmission system operator (TSO) that anticipates a power company's (PC) potential capacity expansion. The proposed model captures the investment decisions of both the TSO and the PC, and accounts for the conflicting objectives and game-theoretic interactions of the distinct agents. A real options approach allows us to study the effect of uncertainty on the investment decisions while taking into account both timing and sizing flexibility.

We find that disregarding the power company's optimal investment decision can have a large negative impact on social welfare for a TSO. The corresponding welfare loss increases with uncertainty. In most cases the TSO wants to invest in a higher capacity than is optimal for the power company. The exception is when the TSO has no timing flexibility and faces a relatively low demand level at investment. This implies that the TSO would overinvest if it disregarded the PC's optimal capacity decision. Conversely, we find that if the TSO considers only the power company's sizing flexibility, it risks installing too small a capacity. We furthermore conclude that a linear subsidy on the power company's investment cost could increase its optimal capacity and could thereby serve as an incentive for power companies to invest in larger capacities.

Joint work with Nora S. Midttun, Afzal S. Siddiqui, and Jannicke S. Sletten.

Optional-Contingent-Product Pricing in Marketing Channels

This paper studies the pricing strategies of firms belonging to a vertical channel structure where a base product and an optional contingent product are sold. Optional contingent products are characterized by unilateral demand interdependencies: the base product can be used independently of the contingent product, whereas the purchase of the contingent product is conditional on possession of the base product.

We find that the retailer decreases the price of the base product to stimulate demand on the contingent-product market. Even a loss-leader strategy can be optimal, which happens when reducing the base product's price has a large positive effect on its demand, and thus on the number of potential consumers of the contingent product. The price reduction of the base product either mitigates the double-marginalization problem, or leads to the opposite inefficiency, in the form of a price that is too low compared with the price that maximizes vertically integrated channel profits. The latter happens when the marginal impact of both products' demands on the base product's price is low, and almost equal in absolute terms.

Joint work with Sihem Taboubi and Georges Zaccour.

Immediately followed by another seminar session.

Comparison of Statistical and Deterministic Frameworks of Uncertainty Quantification

Two different approaches to the prediction problem are compared employing a realistic example, combustion of natural gas, with 102 uncertain parameters and 76 quantities of interest. One approach, termed Bound-to-Bound Data Collaboration (abbreviated to B2B), deploys semidefinite programming algorithms where the initial bounds on the unknowns are combined with the initial bounds on the experimental data to produce new uncertainty bounds for the unknowns that are consistent with the data and, finally, deterministic uncertainty bounds for prediction in new settings. The other approach is statistical and Bayesian, referred to as BCP (for Bayesian Calibration and Prediction). It places prior distributions on the unknown parameters and on the parameters of the measurement error distributions and produces posterior distributions for model parameters and posterior distributions for model predictions in new settings. The predictions from the two approaches are consistent: the B2B bounds and the support of the BCP predictive distribution overlap each other to a very large extent. The BCP predictive distribution is more nuanced than the B2B bounds but depends on stronger assumptions. Interpretation and comparison of the results is closely connected with the assumptions made about the model and experimental data and how they are used in both settings. The principal conclusion is that using both methods protects against possible violations of assumptions in the BCP approach and against overly conservative specifications and predictions in B2B.

Joint work with Michael Frenklach, Andrew Packard (UC Berkeley), Jerome Sacks (National Institute of Statistical Sciences) and Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha)

The importance of Statistics in Bioinformatics

Statistics plays a role in many areas of knowledge, Bioinformatics being one of its most recent fields of application. In reality, the role of Statistics in Bioinformatics goes beyond mere intervention: it is an integral pillar of Bioinformatics. Statistics has been gaining ground in this area, becoming an essential component of recognized merit. In this seminar the speaker intends to show the importance of Statistics in addressing systems as diverse as protein structure or microarray and NGS data. A set of specific studies in Molecular Biology will be the basis for the presentation of some of the most common statistical methodologies in Bioinformatics. The importance of the available software, including some R packages, is also shown.

Investment Decisions under Multi-uncertainty and Exogenous Shocks

In this presentation we study the investment problem when both the demand and the investment costs are stochastic. We assume that the processes are independent, and both are modeled by geometric Brownian motions with exogenous jumps driven by independent Poisson processes. We use a real options approach, leading to an optimal stopping problem. In view of the multiple sources of uncertainty, we propose a method to solve the problem explicitly, and we prove that this method leads exactly to the solution of the optimization problem.
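As a rough illustration of the uncertainty model described above (not of the authors' solution method), one can simulate a geometric Brownian motion whose level is hit by multiplicative jumps arriving according to an independent Poisson process; the parameter names below are illustrative:

```python
import random
from math import exp, sqrt

def gbm_with_jumps(x0, mu, sigma, jump_rate, jump_size, T, n, rng):
    # Euler-type simulation of a geometric Brownian motion whose level
    # is hit by multiplicative jumps from an independent Poisson process
    dt = T / n
    path = [x0]
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = path[-1] * exp((mu - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * z)
        if rng.random() < jump_rate * dt:  # P(jump in dt) ~ jump_rate * dt
            x *= jump_size
        path.append(x)
    return path
```

Averaging the payoff over many such paths, for each candidate stopping rule, is the brute-force counterpart of the explicit optimal stopping solution derived in the talk.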

Joint work with Rita Pimentel.

Statistical Learning for Natural Language Processing

The field of Natural Language Processing (NLP) deals with automatic processing of large corpora of text such as newswire articles (from online newspaper websites), social media (such as Facebook or Twitter) and user-created content (such as Wikipedia). It has experienced large growth in academia as well as in industry, with large corporations such as Microsoft, Google, Facebook, Apple, Twitter and Amazon, among others, investing strongly in these technologies.

One of the most successful approaches to NLP is statistical learning (also known as machine learning), which uses the statistical properties of corpora of text to infer new knowledge.

In this talk I will present multiple NLP problems and provide a brief overview of how they can be solved with statistical learning. I will also present one of these problems (language detection) in more detail to illustrate how basic properties of Probability Theory are at the core of these techniques.
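To hint at how basic probability underlies language detection, here is a hedged character-level sketch in the naive Bayes spirit, with add-one smoothing; it is an illustrative toy, not the exact technique of the talk:

```python
from collections import Counter
from math import log

def train(samples):
    # samples: dict mapping language -> training text; returns
    # per-language character log-probabilities with add-one smoothing
    models = {}
    for lang, text in samples.items():
        counts = Counter(text)
        total = sum(counts.values())
        vocab = set(text)
        models[lang] = {c: log((counts[c] + 1) / (total + len(vocab) + 1))
                        for c in vocab}
        models[lang]["<unk>"] = log(1 / (total + len(vocab) + 1))
    return models

def detect(models, text):
    # choose the language whose character model gives the text the
    # highest log-likelihood; unseen characters get the <unk> mass
    def score(lang):
        m = models[lang]
        return sum(m.get(c, m["<unk>"]) for c in text)
    return max(models, key=score)
```

Real detectors typically use character n-grams rather than single characters, but the underlying maximum-likelihood comparison is the same.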

To be or not to be Bayesian: That IS NOT the question

Frequentist and Bayesian approaches to statistical thinking differ in their foundations and have developed along somewhat separate routes. At a time when more and more powerful statistics is badly needed, special attention should be given to the existing connection between the two paradigms in order to bring together their formidable strengths.

Anticipative Transmission Planning under Uncertainty

Transmission system operators (TSOs) build transmission lines taking generation capacity into account. However, their decision is confounded by policies that promote renewable energy technologies. Thus, what should the size of the transmission line be in order to accommodate subsequent generation expansion? Taking the perspective of a TSO, we use a real options approach not only to determine the optimal timing and sizing of the transmission line but also to explore its effects on generation expansion.

A robust mixed linear model for heritability estimation in plant studies

Heritability ($H^2$) refers to the extent to which a certain phenotype is genetically determined. Knowledge of $H^2$ is crucial in plant studies to help perform effective selection. Once a trait is known to be highly heritable, association studies are performed so that the SNPs underlying that trait's variation may be found. Here, regression models are used to test for associations between the phenotype and candidate SNPs. SNP imputation ensures that marker information is complete, so the coefficient of determination ($R^2$) and $H^2$ are equivalent. One popular model used in these studies is the animal model, which is a linear mixed model (LMM) with a specific layout. However, when the normality assumption is violated, this model, like other likelihood-based models, may provide biased results in the association analysis and greatly affect the classical $R^2$. Therefore, a robust version of the REML estimates for LMMs to be used in this context is proposed, as well as a robust version of a recently proposed $R^2$. The performance of both the classical and robust approaches for the estimation of $H^2$ is then evaluated via simulation, and an example of application to a maize data set is presented.

Joint work with P.C. Rodrigues, M.S. Fonseca and A.M. Pires.

Statistical methods in cancer research

Understanding trends, particularly long-term trends, in the incidence of diseases, and of cancer in particular, is a major concern of epidemiologists. Several statistical methodologies are available to study cancer incidence rates. Age-Period-Cohort (APC) models may be used to study the variation of incidence rates through time. They analyse age-specific incidence according to three time scales: age at diagnosis (age), date of diagnosis (period) and date of birth (cohort). Classical and Bayesian APC models are available. Understanding geographical variations in health, particularly in small areas, has also become extremely important. Several types of spatial epidemiology studies are available, such as disease mapping, usually used in ecological studies. The geographic mapping of diseases is very important in the definition of policies in oncology, namely on the allocation of resources and on the identification of clusters with a high incidence of disease. Geographical association studies, which allow the identification of risk factors associated with the spatial variation of a disease, are also indispensable and deserve special attention in disease incidence studies. For this purpose, Bayesian hierarchical models are a common choice.

To quantify cancer survival in the absence of other causes of death, relative survival is also considered in cancer population-based studies. Several approaches to estimate regression models for relative survival using the method of maximum likelihood are available.

Finally, having an idea of the future burden of cancer is also of the utmost importance, namely for planning health services. This is why projections of cancer incidence are so important. Several projection models that differ according to cancer incidence trends are available.

The aim of this study is to investigate spatial and temporal trends in the incidence of colorectal cancer, to estimate relative survival and to make projections. It is a retrospective population-based study that considers data on all colorectal cancers registered by the Southern Portuguese Cancer Registry (ROR Sul) between 1998 and 2006.

Network Inference from Co-Occurrences

Inferring network structures is a central problem arising in many fields of science and technology, including communication systems, biology, sociology, and neuroscience. In this talk, after briefly reviewing several network inference problems, we will focus on that of inferring network structure from co-occurrence observations. These observations identify which network components (e.g., switches, routers, genes) co-occur in a path, but do not indicate the order in which they occur in that path. Without order information, the number of structures that are data-consistent grows exponentially with the network size. Yet, the basic engineering/evolutionary principles underlying most networks strongly suggest that not all data-consistent structures are equally likely. In particular, nodes that often co-occur are probably closer than nodes that rarely co-occur. This observation suggests modeling co-occurrence observations as independent realizations of a random walk on the network, subjected to random permutations. Treating these permutations as missing data allows deriving an expectation–maximization (EM) algorithm for estimating the random walk parameters. The model and EM algorithm significantly simplify the problem, but the computational complexity still grows exponentially in the length of each path. We thus propose a polynomial-time Monte Carlo EM algorithm based on importance sampling and derive conditions that ensure convergence of the algorithm with high probability. Finally, we report simulations and experiments with Internet measurements and inference of biological networks that demonstrate the performance of this approach.

The work reported in this talk was done in collaboration with Michael Rabbat (McGill University, Canada) and Robert D. Nowak (University of Wisconsin, USA).
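The generative model described above — a random walk on the network whose order is then hidden — can be sketched in a few lines. This is an illustrative toy with a hypothetical transition-matrix representation, not the authors' code; the EM machinery of the talk works backwards from such observations:

```python
import random

def sample_co_occurrence(P, start, length, rng):
    # simulate a random walk on the network (row-stochastic transition
    # matrix P, states 0..n-1), then discard the order of visits,
    # yielding a co-occurrence observation: the set of components seen
    path = [start]
    for _ in range(length - 1):
        state = path[-1]
        r, acc = rng.random(), 0.0
        for nxt, p in enumerate(P[state]):
            acc += p
            if r < acc:
                path.append(nxt)
                break
    return frozenset(path)  # order information is lost here
```

Returning the unordered set is equivalent to applying a random permutation to the path, which is exactly the missing data the EM algorithm marginalizes over.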

A Stochastic Model for Throughput in Wireless Data Networks with Single-Frequency Operation and Partial Connectivity under Contention-Based Multiple Access Modes

Wireless data (packet) networks operating in a common frequency channel, as happens for example with a Basic Service Set in the IEEE 802.11 (WiFi) system, are subject to intrinsic impairments. One such major impairment results from the broadcast nature of the radio channel: if two different transmissions arrive at a receiver with any time overlap, they interfere destructively and one or both of the corresponding packets will not be correctly received (a packet collision), thus wasting radio channel transmission time and possibly requiring a retransmission of the original packet(s).

In order to achieve a better utilization of the scarce radio channel resource, stations in wireless networks use multiple access algorithms to attempt to usefully coordinate their radio transmissions. One example is Carrier Sense Multiple Access (CSMA), used as a basis for the sharing of the radio channel in the WiFi system, which establishes that a station should not start a new packet transmission if it can hear any other station transmitting. In a network with radio connectivity between every pair of stations (full connectivity) and negligible propagation delays, such an algorithm succeeds in completely preventing collisions. That is, however, not the case if there exist pairs of stations that cannot directly hear each other (partial connectivity). Many other multiple access algorithms have been proposed and studied.

This talk presents a stochastic model for the study of throughput (i.e., the long-term fraction of time that the radio channel is occupied with successful packet transmissions) in the class of networks described above. The talk will start with a short description of the communication functions and structure of the system under study and of the class of multiple access algorithms considered. A Markovian model for the representation of the time evolution of the packet transmissions taking place in the network will then be presented, together with a result on the existence of a product form for its stationary probabilities. The next step will be to show how the desired throughputs can be obtained from the steady-state probabilities of this process and the average durations of the successful packet transmissions; the latter are obtained from the times to absorption in a set of auxiliary (absorbing) derived Markov chains. Finally, time permitting, reference will be made to results concerning the insensitivity of product-form steady-state solutions, when they exist, to the distribution of packet lengths and retransmission time intervals, by means of a Generalized Semi-Markov Process representation.

On the misleading signals in simultaneous schemes for the mean vector and covariance matrix of multivariate i.i.d. output

The performance of a product often depends on several quality characteristics. Simultaneous schemes for the process mean vector and the covariance matrix are essential to determine if unusual variation in the location and dispersion of a multivariate normal vector of quality characteristics has occurred.

Misleading signals (MS) are likely to happen while using such simultaneous schemes and correspond to valid signals that lead to a misinterpretation of a shift in mean vector (resp. covariance matrix) as a shift in covariance matrix (resp. mean vector).

This paper focuses on numerical illustrations showing that MS are fairly frequent, and on the use of stochastic ordering to qualitatively assess the impact of changes in the mean vector and covariance matrix on the probabilities of misleading signals in simultaneous schemes for these parameters when dealing with multivariate normal i.i.d. output.

(Joint work with: Manuel Cabral Morais, António Pacheco, CEMAT-IST; Wolfgang Schmid, European University Viadrina.)

On hitting times for Markov time series of counts with applications to quality control

Examples of time series of counts arise in several areas, for instance in epidemiology, industry, insurance and network analysis. Several time series models for these counts have been proposed and some are based on the binomial thinning operation, namely the integer-valued autoregressive (INAR) model, which mimics the structure and the autocorrelation function of the autoregressive (AR) model.

The detection of shifts in the mean of an INAR process is a recent research subject, and it can be done by using quality control charts. Underlying the performance analysis of these charts is an indisputably popular measure: the run length (RL), the number of samples until a signal is triggered by the chart. Since a signal is given as soon as the control statistic falls outside the control limits, the RL is nothing but a hitting time.

In this paper, we use stochastic ordering to assess: the ageing properties of the RL of charts for the process mean of Poisson INAR(1) output; the impact of shifts in model parameters on this RL. We also explore the implications of all these properties, thus casting interesting light on this hitting time for a Markov time series of counts.
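In general, when the control statistic is Markov on a finite state space, the expected RL from each in-control state solves a linear system: h_i = 1 + Σ_j Q[i][j] h_j, where Q collects the transitions among non-signalling states. A generic pure-Python sketch (not specific to the INAR(1) charts of the paper):

```python
def expected_run_length(Q):
    # Q: substochastic matrix of transitions among in-control
    # (non-signalling) states; the expected RL starting in state i
    # solves (I - Q) h = 1, i.e. h_i = 1 + sum_j Q[i][j] * h_j
    n = len(Q)
    # build the augmented system [I - Q | 1]
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]
```

For a single in-control state with stay probability q this recovers the familiar geometric ARL of 1/(1-q); the stochastic-ordering results of the paper describe how such hitting times respond to parameter shifts.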

(Joint work with António Pacheco, CEMAT-IST.)

Weather Forecasts — more than probabilities?

Weather forecasting depends on the availability of observations of physical variables (air temperature, relative humidity, wind, …) and on the ability to predict the temporal evolution of these variables. It also depends on other meteorological parameters (convergence, vorticity, stability indices, …) which, while not directly measurable but computed indirectly, allow the construction of the future scenario of the atmosphere over horizons of days to weeks.

The forecasting of meteorological parameters rests on running numerical prediction models on high-performance supercomputers, available both internationally and nationally. At the European level, these developments are carried out by the European Centre for Medium-Range Weather Forecasts (ECMWF), which makes available, twice a day and for the whole globe, deterministic results with a spatial resolution of 16 km up to 10 days ahead and probabilistic results with a spatial resolution of 64 km up to 15 days ahead. In Portugal, local-area numerical models are run with spatial resolutions of 9 km up to 72 hours (the ALADIN model) and 2.5 km up to 48 hours (the AROME model), the latter over 3 distinct domains – the Mainland, Madeira and the Azores.

While deterministic forecasts present a single solution for a future instant, probabilistic forecasts allow the consideration of ranges of variation of the various meteorological parameters, which is particularly important for parameters that are harder to predict, of which precipitation is an example. The variations in the meteorological parameters result from perturbations of the initial conditions of the numerical models, which seek to represent the influence of the errors present in the meteorological observations.

Thus, weather forecasts in probabilistic format make use of parameters and products available operationally, such as: probability of occurrence, ensemble mean, spread, shift of tail, meteograms, the Extreme Forecast Index, spaghetti plots, clusters, … For longer horizons (weeks to months), and partly on an experimental basis, anomalies (relative to past periods – climatology) of some parameters are also used, such as precipitation, air temperature, sea water temperature and mean sea level pressure.

Contributions to Variable Selection and Robust Anomaly Detection in Telecommunications

Over the years, we have witnessed an incredibly high level of technological development, in which the Internet plays the leading role. The Internet has brought not only benefits but also new threats, expressed as anomalies/outliers. Consequently, new and improved outlier detection methodologies need to be developed. Accordingly, we propose an anomaly detection method that combines a robust variable selection method and a robust outlier detection procedure based on Principal Component Analysis.

Our method was evaluated using a data set obtained from a network scenario capable of producing a perfect ground-truth under real (but controlled) traffic conditions. The robust variable selection step was essential to eliminate redundant and irrelevant variables that were deteriorating the performance of the anomaly detector. The variable selection methods we considered use a filter strategy based on Mutual Information and Entropy for which we have developed robust estimators. The filter methods incorporate a redundancy component which tries to capture overlaps among variables.

The performance of eight variable selection methods was studied under a theoretical framework that allows reliable comparisons among them by determining the true/theoretical variable ordering under specific evaluation scenarios; this study unveiled problems in the construction of the associated objective functions. Our proposal, maxMIFS, which is associated with a simple objective function, proved to be unaffected by these problems and achieved outstanding results. For these reasons, it was chosen to be applied in the preprocessing step. With this approach, the results improved substantially and the main objective of this work was fulfilled: improving the detection of anomalies in Internet traffic flows.
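To make the filter strategy concrete, here is a hedged sketch of a greedy MI-based selector in the MIFS family. The max-over-selected redundancy penalty is one illustrative reading of the "max" idea, not the definition of maxMIFS, and the plug-in MI estimator below is the classical (non-robust) one, unlike the robust estimators developed in this work:

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    # plug-in (empirical) MI estimate for two discrete sequences
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def greedy_select(features, target, k, beta=1.0):
    # rank features by relevance to the target minus a redundancy
    # penalty, here the maximum MI with any already-selected feature
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            redundancy = max((mutual_information(features[name], features[s])
                              for s in selected), default=0.0)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The redundancy term is what lets such filters discard variables that merely duplicate information already captured by the selected set.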

Seeing, hearing, doing multivariate statistics

Signal processing is an important task nowadays, arising in various areas such as engineering and applied mathematics. A signal represents time-varying or spatially varying physical quantities. Important signals include sound, electromagnetic radiation, images, telecommunication transmission signals, and many others. A signal carries information, and the objective of signal processing is to extract the useful information it carries. The received signal is usually disturbed by electrical, atmospheric or deliberate interferences. Due to the random nature of the signal, statistical techniques play an important role in its analysis.

There are many techniques used to analyse these types of data, depending on the focus or research question of the study. Among these techniques are Principal Component Analysis (PCA) and the Fourier transform, in particular the discrete Fourier transform (DFT). The main goal of this work is to explore the relations between PCA and other mathematical transforms, based on Toeplitz and circulant matrices. In this sense, the proposed method relates the theory behind the Fourier transform, through Toeplitz and circulant matrices, to PCA. To illustrate the methodology we will consider sounds and images.
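The key link between these transforms can be seen in a few lines: the eigenvectors of a circulant matrix are the discrete Fourier basis, so its eigenvalues are the DFT of its first column, and PCA under a circulant covariance therefore reduces to the DFT. A minimal numerical check (illustrative only, not the speakers' method):

```python
import numpy as np

def circulant(c):
    # build the circulant matrix whose first column is c: entry (i, j)
    # is c[(i - j) mod n], so each column is a cyclic shift of c
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

c = np.array([4.0, 1.0, 0.0, 1.0])  # symmetric pattern -> a valid covariance
C = circulant(c)

# eigenvalues of a circulant matrix equal the DFT of its first column
assert np.allclose(np.sort(np.linalg.eigvalsh(C)),
                   np.sort(np.fft.fft(c).real))
```

For Toeplitz covariance matrices the equality is only asymptotic, which is one reason the circulant approximation is the natural bridge between PCA and the DFT.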

Keywords: Circulant Matrix, Fourier Transform, Principal Component Analysis, Signal Processing, Toeplitz Matrix.
