# Probability and Statistics Seminar

## Past sessions

### 21/07/2010, 15:00 — 16:00 — Room P3.10, Mathematics Building

Graciela Boente, *Universidad de Buenos Aires and CONICET, Argentina*

### Robust inference in generalized linear models with missing responses

The generalized linear model (GLM) (McCullagh and Nelder, 1989) is a popular technique for modelling a wide variety of data. It assumes that the observations are independent and that the conditional distribution of $y|x$ belongs to the canonical exponential family. In this situation, the mean $E(y|x)$ is modelled linearly through a known link function. Robust procedures for generalized linear models have been considered by, among others, Stefanski et al. (1986), Künsch et al. (1989), Bianco and Yohai (1996), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2002) and Bianco et al. (2005). Recently, robust tests for the regression parameter under a logistic model were considered by Bianco and Martínez (2009).
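
In symbols, and under notation assumed here rather than taken from the abstract (link function $g$, regression parameter $\beta$), the modelling assumption on the mean can be written as:

$$g\big(E(y \mid x)\big) = x^{\top}\beta.$$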

In practice, some response variables may be missing, by design (as in two-stage studies) or by happenstance. As is well known, the methods described above are designed for complete data sets, and problems arise when responses may be missing while covariates are completely observed. Although there are many situations in which both the response and the explanatory variables are missing, we focus on the case in which missing data occur only in the responses. Missing responses are very common in opinion polls, market research surveys, mail enquiries, socio-economic investigations, medical studies and other scientific experiments in which the explanatory variables can be controlled. This pattern arises, for example, in the double sampling scheme proposed by Neyman (1938). Hence, we are interested in robust inference when the response variable may have missing observations but the covariate $x$ is totally observed.

In the regression setting with missing data, a common method is to impute the incomplete observations and then estimate the conditional or unconditional mean of the response variable with the completed sample. The methods considered include linear regression (Yates, 1933), kernel smoothing (Cheng, 1994; Chu and Cheng, 1995), nearest neighbor imputation (Chen and Shao, 2000), semiparametric estimation (Wang et al., 2004; Wang and Sun, 2007), nonparametric multiple imputation (Aerts et al., 2002; González-Manteiga and Pérez-Gonzalez, 2004) and empirical likelihood over the imputed values (Wang and Rao, 2002), among others. All these proposals are very sensitive to anomalous observations since they are based on least squares approaches.

In this talk, we introduce a robust procedure to estimate the regression parameter under a GLM, which includes, when there are no missing data, the family of estimators previously studied. It is shown that the robust estimates of the regression parameter are root-$n$ consistent and asymptotically normally distributed. A robust procedure to test simple hypotheses on the regression parameter is also considered. The finite sample properties of the proposed procedure are investigated through a Monte Carlo study in which the robust test is also compared with nonrobust alternatives.

### 01/06/2010, 16:00 — 17:00 — Room P4.35, Mathematics Building

Ana Pires, *Universidade Técnica de Lisboa - Instituto Superior Técnico and CEMAT*

### CSI: are Mendel's data "Too Good to be True?"

Gregor Mendel (1822-1884) is almost unanimously recognized as the founder of modern genetics. However, long ago, a shadow of doubt was cast on his integrity by another eminent scientist, the statistician and geneticist, Sir Ronald Fisher (1890-1962), who questioned the honesty of the data that form the core of Mendel's work. This issue, nowadays called "the Mendel-Fisher controversy", can be traced back to 1911, when Fisher first presented his doubts about Mendel's results, though he only published a paper with his analysis of Mendel's data in 1936.

A large number of papers have been published about this controversy, culminating with the publication in 2008 of a book (Franklin et al., "Ending the Mendel-Fisher controversy") that aimed at ending the issue and definitively rehabilitating Mendel's image. However, quoting from Franklin et al., "the issue of the 'too good to be true' aspect of Mendel's data found by Fisher still stands".

We have submitted Mendel's data and Fisher's statistical analysis to extensive computations and simulations, attempting to discover a hidden explanation or hint that could help answer the questions: is Fisher right or wrong and, if Fisher is right, is there any reasonable explanation for the "too good to be true" results other than deliberate fraud? In this talk some results of this investigation and the conclusions obtained will be presented.

### 18/05/2010, 16:00 — 17:00 — Room P4.35, Mathematics Building

Alex Trindade, *Texas Tech University*

### Fast and Accurate Inference for the Smoothing Parameter in Semiparametric Models

We adapt the method developed in Paige, Trindade, and Fernando (2009) in order to make approximate inference on optimal smoothing parameters for penalized spline and partially linear models. The method is akin to a parametric bootstrap where Monte Carlo simulation is replaced by saddlepoint approximation, and is applicable whenever the underlying estimator can be expressed as the root of an estimating equation that is a quadratic form in normal random variables. This is the case under a variety of common optimality criteria such as ML, REML, GCV, and AIC. We apply the method to some well-known datasets in the literature, and find that under the ML and REML criteria it delivers a performance that is nearly exact, with computational speeds that are at least an order of magnitude faster than exact methods. Perhaps most importantly, the proposed method also offers a computationally feasible alternative where no known exact methods exist, e.g. for GCV and AIC.

### 04/05/2010, 16:00 — 17:00 — Room P3.31, Mathematics Building

Rui Santos, *Instituto Politécnico de Leiria*

### Probability Calculus - the construction of Pacheco D’Amorim in 1914

At the end of the XIXth century, the classical definition of Probability and its extension to the continuous case were too restrictive, and some geometrical applications, based on ingenious interpretations of the Bernoulli-Laplace principle of insufficient reason, led to several paradoxes. David Hilbert, in his celebrated address at the International Congress of Mathematicians of 1900, included the axiomatization of Probability in his list of 23 important unsolved problems. Only in 1933 did Kolmogorov lay down a rigorous setup for Probability, inspired by Fréchet’s idea of using Measure Theory. But before this, some other efforts to build up a proper axiomatization of Probability deserve to be more widely credited. Among those, the construction of Diogo Pacheco d’Amorim, in his 1914 doctoral thesis, is one of the most interesting. His discussion of a standard model, based on the idea of random choice instead of the concept of probability itself, seems limited. However, his final discussion of how to use the law of large numbers and the central limit theorem to obtain an objective appraisal of whether sampling made by others, or even by a mechanical device, is indistinguishable from a random choice made by oneself is impressive, since it anticipates the ideas of Monte Carlo by almost 30 years.

### 29/03/2010, 16:00 — 17:00 — Room P4.35, Mathematics Building

Maria Eduarda Silva, *Universidade do Porto*

### Integer-valued AR models

During the last decades there has been considerable interest in integer-valued time series models, and a large volume of work is now available in specialized monographs. Motivation to study discrete data models comes from the need to account for the discrete nature of certain data sets, often counts of events, objects or individuals. Examples of applications can be found in the analysis of time series of count data in many areas. Among the most successful integer-valued time series models proposed in the literature is the INteger-valued AutoRegressive model of order 1 (INAR(1)). In this talk the statistical and probabilistic properties of INAR(1) models are reviewed.
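
As a minimal sketch of the kind of model reviewed here, the code below simulates a Poisson INAR(1) process in its usual textbook form, $X_t = \alpha \circ X_{t-1} + \varepsilon_t$, where $\circ$ denotes binomial thinning. The abstract does not state the model explicitly, so the Poisson innovations, the function names and the parameter values are all assumptions chosen for illustration.

```python
import math
import random


def binomial_thinning(alpha, count, rng):
    """Return alpha ∘ count: each of `count` units survives with probability alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def poisson_draw(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate n values of X_t = alpha ∘ X_{t-1} + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    # Start close to the stationary mean lam / (1 - alpha).
    x = round(lam / (1.0 - alpha))
    path = []
    for _ in range(n):
        x = binomial_thinning(alpha, x, rng) + poisson_draw(lam, rng)
        path.append(x)
    return path


# Hypothetical parameter values; the stationary mean is lam / (1 - alpha) = 4.
series = simulate_inar1(alpha=0.5, lam=2.0, n=5000)
sample_mean = sum(series) / len(series)
```

Under these assumptions every simulated value is a non-negative integer, and the sample mean should settle near the stationary mean $\lambda / (1 - \alpha)$.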

### 11/03/2010, 16:00 — 17:00 — Amphitheatre Pa2, Mathematics Building

Sujit Samanta, *Universidade Técnica de Lisboa - Instituto Superior Técnico and CEMAT*

### Analysis of stationary discrete-time GI/D-MSP/1 queue with finite and infinite buffers

This paper considers a single-server queueing model with finite and infinite buffers in which customers arrive according to a discrete-time renewal process. The customers are served one at a time under a discrete-time Markovian service process (D-MSP). This service process is similar to the discrete-time Markovian arrival process (D-MAP), with arrivals replaced by service completions. Using the imbedded Markov chain technique and the matrix-geometric method, we obtain the system-length distribution at a prearrival epoch. We also provide the steady-state system-length distribution at an arbitrary epoch by using the supplementary variable technique and the classical argument based on renewal theory. The distribution of the actual waiting time in the queue (measured in slots) is also analysed. Further, we derive the coefficient of correlation of the lagged interdeparture intervals. Finally, computational experience with a variety of numerical results, in the form of tables and graphs, is discussed.

### 11/02/2010, 14:00 — 15:00 — Conference Room, Instituto de Sistemas e Robótica, North Tower, 7th floor, IST

Paulo Rodrigues, *Banco de Portugal and Universidade Nova de Lisboa*

### Robust Inference in Predictive Regressions

In this paper we discuss new tests for predictability which are inspired by the work of Vogelsang (1998) on testing for trend. The proposed tests use, by design, the same critical values irrespective of whether the predictor is $I(0)$ or $I(1)$, and are therefore capable of detecting a more general set of alternatives than those presently handled by available procedures (exceptions being the tests of Deo and Chen, 2008, and Maynard and Shimotsu, 2009). Numerical evidence suggests that the proposed procedures have good finite-sample performance which, coupled with their simplicity of application, makes them appealing for empirical research and useful alternatives to available procedures.

### 21/01/2010, 14:00 — 15:00 — Room P3.10, Mathematics Building

Daniela Rodriguez, *Universidad de Buenos Aires*

### Nonparametric estimation on Riemannian manifolds

In many situations, the random variables take values in a Riemannian manifold $(M, g)$ instead of $\mathbb{R}^d$, and this structure needs to be taken into account when we generate estimation procedures. For the nonparametric regression model, we study two families of robust estimators for the regression function when the explanatory variables take values in a Riemannian manifold.

In this talk, we will give a brief introduction to the geometric objects needed to define nonparametric estimators adapted to a manifold. We discuss the classical proposals and introduce two families of robust estimators for the regression function. We show the asymptotic properties obtained for both proposals. Finally, through a simulation study, we compare the behavior of the robust estimators with that of their classical alternatives. This is joint work with Guillermo Henry.

### 15/01/2010, 14:00 — 15:00 — Room P12, Mathematics Building

Kuno Huisman, *Tilburg University*

### Strategic Capacity Investment Under Uncertainty

Contrary to most papers in the literature on investment under uncertainty, we study models that capture not only the timing but also the size of the investment. We consider a monopoly setting as well as a duopoly setting and compare the results with the standard models in which the firms do not have a capacity choice. Our main results are the following. First, for low uncertainty values the follower chooses a higher capacity than the leader, and for high uncertainty values the leader chooses a higher capacity. Second, compared to the model without capacity choice, the monopolist and the follower invest later in a higher capacity for higher values of uncertainty. However, the leader will invest earlier in a higher capacity for higher values of uncertainty. The reverse results apply for lower values of uncertainty.

### 16/12/2009, 10:00 — 11:00 — Room P3.10, Mathematics Building

Gonçalo dos Reis, *CMAP- École Polytechnique (Paris)*

### Backward stochastic differential equations and quasi-linear PDEs

In the spirit of a forthcoming research project within CEMAT, this talk aims at introducing a probabilistic approach to PDEs. This probabilistic interpretation for systems of second-order quasilinear parabolic PDEs is obtained by establishing a kind of backward stochastic differential equation. We look at several aspects of this link.

### 11/12/2009, 16:00 — 17:00 — Room P12, Mathematics Building

Maria da Graça Magalhães, Edviges Coelho, *Instituto Nacional de Estatística*

### Methods and techniques to construct projections of resident population: Portugal, 2008-2060

The purpose of this communication is to present the methodology adopted in the latest exercise of resident population projections for Portugal, carried out by Statistics Portugal.

These population projections are based on the concept of resident population and adopt the cohort-component method, where the initial population is grouped into cohorts defined by age and sex, and continuously updated, according to the assumptions of future development set for each of the components of population change - fertility, mortality and migration - that is, by adding the natural balance and net migration, in addition to the natural aging process. This method, widely used in the elaboration of population projections at national level, allows the development of different scenarios of demographic evolution based on different combinations of likely developments of the components.

The results are conditioned, on the one hand, by the structure and composition of the initial population and, on the other, by the different behaviour patterns of fertility, mortality and migration in each set of assumptions about their evolution over the projection period. The conditional nature of the results should therefore be emphasized: this is a scenario-based method of the "if ... then ..." type, in which each scenario combines the assumptions outlined for the components in a different way.

Given the importance of the projections of the individual components to the outcome of the exercise, we then present the methodologies used to project each of them. The components are projected using a set of statistical methods suited to the available background information and the proposed target. For fertility, we modelled the fertility rates using the method proposed by Schmertmann (2003); for mortality, we used the Poisson-Lee-Carter method with limit life table proposed by Bravo (2007); and for migration, given the greater fragility of the data and the consequent difficulties in applying statistical modelling methods in practice, we adopted as an initial reference the average of the estimated flows over the last 15 years. Finally, we present the main results of this exercise, for both the components and the future population.
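
The cohort-component update described in this abstract can be illustrated with a toy, single-sex sketch. It is not the Statistics Portugal implementation: the three age groups, the rates and the function name are hypothetical, and each step is assumed to be as long as one age group is wide.

```python
def project_one_step(population, survival, fertility, net_migration):
    """Advance an age-structured population by one cohort-component step.

    population[a]    : people in age group a at the start of the step
    survival[a]      : probability of surviving from age group a to a + 1
                       (the last entry is survival within the open-ended group)
    fertility[a]     : births per person in age group a during the step
    net_migration[a] : net migrants added to age group a at the end of the step
    """
    n = len(population)
    births = sum(f * p for f, p in zip(fertility, population))
    projected = [0.0] * n
    projected[0] = births
    for a in range(n - 1):
        # Natural ageing: survivors of group a move up to group a + 1.
        projected[a + 1] += population[a] * survival[a]
    # Survivors of the open-ended last group stay in that group.
    projected[-1] += population[-1] * survival[-1]
    # Add the net migration balance to the natural balance.
    return [p + m for p, m in zip(projected, net_migration)]


# Hypothetical inputs: two fertility assumptions give two "if ... then ..." scenarios.
start = [100.0, 120.0, 80.0]
survival = [0.99, 0.95, 0.5]
migration = [1.0, 2.0, 0.0]
scenarios = {
    "low fertility": [0.0, 0.6, 0.0],
    "high fertility": [0.0, 0.8, 0.0],
}
results = {
    name: project_one_step(start, survival, fertility, migration)
    for name, fertility in scenarios.items()
}
```

Only the youngest age group differs between the two scenarios after one step, which reflects the conditional nature of the results: each output follows from one particular combination of assumptions.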

### 27/11/2009, 16:00 — 17:00 — Room P12, Mathematics Building

Hannes Helgason, *Ecole Normale Superieure - Lyon*

### Nonparametric estimation of highly oscillatory signals

We will consider the problem of estimating highly oscillatory signals from noisy measurements. These signals are often referred to as chirps in the literature; they are found everywhere in nature, and frequently arise in scientific and engineering problems. Mathematically, they can be written in the general form $A(t)\exp(i\lambda\varphi(t))$, where $\lambda$ is a large constant base frequency, the phase $\varphi(t)$ is time-varying, and the envelope $A(t)$ is slowly varying. Given a sequence of noisy measurements, we study the problem of estimating this chirp from the data.

We introduce novel, flexible and practical strategies for addressing these important nonparametric statistical problems. The main idea is to calculate, in a first step, correlations of the data with a rich family of local templates, the multiscale chirplets, and, in a second step, to search for meaningful aggregations or chains of chirplets that provide a good global fit to the data. From a physical viewpoint, these chains correspond to realistic signals since they model arbitrary chirps. From an algorithmic viewpoint, these chains are identified as paths in a convenient graph. The key point is that this underlying graph structure allows us to unleash very effective algorithms, such as network flow algorithms, for finding the chains that achieve a near-optimal trade-off between goodness of fit and complexity.

Our estimation procedures are provably near-optimal over a wide range of chirps, and numerical experiments show that they perform exceptionally well over a broad class of chirps.

### 20/11/2009, 16:00 — 17:00 — Room P12, Mathematics Building

Rui Paulo, *ISEG and CEMAPRE, Technical University of Lisbon*

### Validation of Computer Models with Multivariate Output

We consider the problem of validating computer models that produce multivariate output, particularly when the model is computationally demanding. Our strategy builds on Gaussian process-based response-surface approximations to the output of the computer model, constructed independently for each of its components. These are then combined in a statistical model involving field observations to produce a predictor of the multivariate output at untested input vectors. We illustrate the methodology in a situation where the computer model produces a two-dimensional output of very irregular functions.

### 30/10/2009, 16:00 — 17:00 — Room P12, Mathematics Building

Maria Kulikova, *Universidade Técnica de Lisboa - Instituto Superior Técnico and CEMAT*

### Estimation of stochastic volatility models through adaptive Kalman filtering methods

Volatility is a central concept when dealing with financial applications. It is usually equated with risk and plays a central role in the pricing of derivative securities. It is also widely acknowledged nowadays that volatility is both time-varying and predictable, and stochastic volatility models are commonplace. The approach based on autoregressive conditional heteroscedasticity (ARCH) introduced by Engle, and later generalized to GARCH by Bollerslev, was the first attempt to take into account the changes in volatility over time. The class of stochastic volatility (SV) models is now recognized as a powerful alternative to the traditional and widely used ARCH/GARCH approach. We focus on the maximum likelihood estimation of the class of stochastic volatility models. The main technique is based on the Kalman filter (KF), which is known to be numerically unstable. Using the advanced array square-root form of the KF, we construct a new square-root algorithm for evaluating the log-likelihood gradient (score). This avoids the use of the conventional KF with its inherent numerical instabilities and improves the robustness of computations against roundoff errors. The proposed square-root adaptive KF scheme is well suited to simultaneous parameter estimation and extraction of the latent volatility series.
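
For orientation, the sketch below evaluates a Gaussian log-likelihood with the conventional scalar Kalman filter, through the prediction-error decomposition. It is not the array square-root algorithm of the talk, and the AR(1)-plus-noise state-space form, the function name and the parameter names are assumptions made for illustration.

```python
import math


def kalman_loglik(y, phi, q, r, x0=0.0, p0=1.0):
    """Gaussian log-likelihood of y under x_t = phi * x_{t-1} + w_t, y_t = x_t + v_t.

    w_t ~ N(0, q) and v_t ~ N(0, r); x0 and p0 are the mean and variance of the
    state one step before the first observation.
    """
    x, p = x0, p0
    loglik = 0.0
    for obs in y:
        # Time update (one-step prediction of the state).
        x_pred = phi * x
        p_pred = phi * phi * p + q
        # Innovation and its variance.
        innovation = obs - x_pred
        s = p_pred + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innovation * innovation / s)
        # Measurement update (filtered state and variance).
        gain = p_pred / s
        x = x_pred + gain * innovation
        p = (1.0 - gain) * p_pred
    return loglik


# Hypothetical data and parameter values.
observations = [0.3, -0.1, 0.4, 0.2]
value = kalman_loglik(observations, phi=0.9, q=0.1, r=0.5)
```

Maximising such a function over `phi`, `q` and `r` gives maximum likelihood estimates. The variance recursion in this conventional form is the step where roundoff can accumulate, which is the kind of instability square-root formulations are designed to avoid.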

### 21/10/2009, 10:00 — 11:00 — Room P3.10, Mathematics Building

Graciela Boente, *Universidad de Buenos Aires and CONICET, Argentina*

### Robust estimators in functional principal components

When dealing with multivariate data, robust PCA, like classical PCA, searches for directions along which the projected data have maximal dispersion. Instead of using the variance as a measure of dispersion, a robust scale estimator $s_n$ may be used in the maximization problem. This approach was first proposed in Li and Chen (1985), while a maximization algorithm was proposed in Croux and Ruiz-Gazen (1996) and the influence function of the resulting estimators was derived by Croux and Ruiz-Gazen (2005). Recently, their asymptotic distribution was studied in Cui et al. (2003).

Let $X(t)$ be a stochastic process with continuous trajectories and finite second moment, defined on a finite interval. We will denote by $\Gamma(t,s)=\operatorname{cov}(X(t),X(s))$ its covariance function and by $\varphi_{j}$ and $\lambda_{j}$ the eigenfunctions and the eigenvalues of the covariance operator, with $\lambda_{j}$ in decreasing order. Dauxois et al. (1982) derived the asymptotic properties of non-smooth principal components of functional data obtained by considering the eigenfunctions of the sample covariance operator. On the other hand, Silverman (1996) and Ramsay and Silverman (1997) introduced smooth principal components for functional data, based on roughness penalty methods, while Boente and Fraiman (2000) considered a kernel-based approach. More recent work dealing with estimation of the principal components of the covariance function includes Gervini (2006), Hall and Hosseini-Nasab (2006), Hall et al. (2006) and Yao and Lee (2006). To our knowledge, the first attempt to provide estimators of the principal components that are less sensitive to anomalous observations was made by Locantore et al. (1999), who considered the coefficients of a basis expansion. In addition, Gervini (2008) studied a fully functional approach to robust estimation of the principal components by considering a functional version of the spherical principal components defined in Locantore et al. (1999). Finally, Hyndman and Ullah (2007) provide a method combining a robust projection-pursuit approach and a smoothing and weighting step to forecast age-specific mortality and fertility rates observed over time.

In this talk, we introduce robust estimators of the principal components and we obtain their consistency under mild conditions. Our approach combines robust projection-pursuit with different smoothing methods.

### 09/10/2009, 16:00 — 17:00 — Room P1, Mathematics Building

Maria do Rosário Oliveira, *Departamento de Matemática - Instituto Superior Técnico and CEMAT*

### Diagnostic tests versus anomaly detection methods: statistics breaking down barriers

In the medical literature, the problems inherent in evaluating the performance of diagnostic tests have been widely studied. The merits and limitations of the various approaches are known and have been discussed in a variety of scenarios and contexts. The knowledge acquired in this area can be used to evaluate the performance of anomaly detection methods in the absence of a ground truth. In telecommunications, anomalies in data transmission are identified as unexpected events that do not fit the normal data flow. In practice, they may correspond to intrusions into other people's computers or to other disruptions with great impact on our lives.

In this talk we establish the parallel between the indicators frequently used by the medical community to evaluate the performance of laboratory techniques and the indicators used by engineering professionals to assess the quality of an anomaly detection method. The use of an imperfect or partial ground truth as a reference in the evaluation of anomaly detection methods is questioned, and the resulting bias is illustrated. Finally, the latent class model is put forward as the appropriate solution for comparing the performance of anomaly detection methods in the absence of a ground truth, just as it is used to evaluate the performance of diagnostic techniques in the absence of a gold standard.

### 28/09/2009, 10:00 — 11:00 — Room P3.10, Mathematics Building

Elena Almaraz Luengo, *Universidad Complutense de Madrid, Spain*

### Some Applications of Stochastic Dominance in Economy

There is a vast range of applications of Stochastic Dominance (SD) rules in different areas of knowledge, such as Mathematics, Statistics, Biology, Sociology and Economics. Currently, the main areas of application of SD in Economics and Finance are efficient portfolio selection, asset valuation, risk and insurance. In this talk we will show the utility of SD in Economics. To that end, we will start by explaining the classic concepts of SD and their economic interpretation, as well as other definitions used in this context (likelihood ratio order, hazard rate order, Lorenz’s order, level crossing order, etc.). One of the main topics we will treat is optimal portfolio selection and its relation with associated weighted random variables and utility functions. In particular, we will establish relations between the utilities of the weighted random variables, given the stochastic relations between the original random variables from which the weighted random variables were obtained. Another context in which SD rules are applied is that of ruin and risk problems; we will show a generalization of the classic ruin model and some SD relations between the ruin times of two (stochastic) risk processes. SD rules can also be used in the context of asset valuation; we will treat, as an example, Cox and Rubinstein’s model. Other applications of SD rules will also be discussed, including the Black-Scholes model, integral stochastic calculus, inventory theory, chains, etc.
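
As a small illustration of the most basic SD rule, the sketch below checks first-order stochastic dominance between two samples by comparing their empirical distribution functions: $X$ dominates $Y$ when $F_X(t) \le F_Y(t)$ for all $t$. The function names and the sample values are hypothetical, and the check is the weak version of the rule (identical samples dominate each other).

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def first_order_dominates(x, y):
    """True if the empirical distribution of x first-order dominates that of y."""
    cdf_x, cdf_y = empirical_cdf(x), empirical_cdf(y)
    # Both functions are step functions that only jump at observed values,
    # so comparing them on the pooled sample points is sufficient.
    grid = sorted(set(x) | set(y))
    return all(cdf_x(t) <= cdf_y(t) for t in grid)


# Hypothetical payoffs of three prospects.
shifted_up = [2, 3, 4]
baseline = [1, 2, 3]
spread_out = [0, 10]
dominates = first_order_dominates(shifted_up, baseline)
```

When the two distribution functions cross, as for `spread_out` against `baseline`, neither sample dominates the other at first order, which is where higher-order SD rules come in.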

### 28/07/2009, 11:00 — 12:00 — Room P3.10, Mathematics Building

Graciela Boente, *Universidad de Buenos Aires and CONICET*

### Robust methods in semiparametric estimation with missing responses

Most statistical methods in nonparametric regression are designed for complete data sets, and problems arise when missing observations are present, which is a common situation in biomedical or socioeconomic studies, for example. Classic examples are found in the social sciences, with the problem of non-response in sample surveys, in Physics and in Genetics (Meng, 2000), among others. We will consider inference with an incomplete data set where the responses satisfy a semiparametric partly linear regression model. We will introduce a family of robust procedures to estimate the regression parameter as well as the marginal location of the responses when there are missing observations in the response variable but the covariates are totally observed. In this context, it is necessary to impose some conditions on the loss of an observation. We model this loss by assuming that the data are missing at random, i.e., the probability that a response is missing does not depend on the response variable and depends only on the covariate. Our proposal is based on a robust profile likelihood approach adapted to the presence of missing data. The asymptotic behavior of the robust estimators of the regression parameter is derived. Several proposals for the marginal location are considered. A Monte Carlo study is carried out to compare the performance of the proposed robust estimators with one another and with the classical ones, in normal and contaminated samples, under different missing data models.
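
The missing-at-random condition can be written in symbols. Writing $\delta = 1$ when the response $y$ is observed and $\delta = 0$ otherwise (the indicator $\delta$ and the function $p$ are notation introduced here, not taken from the abstract), the assumption reads:

$$P(\delta = 1 \mid y, x) = P(\delta = 1 \mid x) = p(x).$$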

### 21/07/2009, 14:30 — 15:30 — Conference Room, Instituto de Sistemas e Robótica, North Tower, 7th floor, IST

Wolfgang Schmid, *Department of Statistics, European University Viadrina, Frankfurt, Germany*

### Local Approaches for Simultaneous Interpolating of Air Pollution Processes

In this paper, we derive a non-linear cokriging predictor for the spatial interpolation of multivariate environmental processes. The suggested predictor is based on the locally weighted scatterplot smoothing method of Cleveland (1979), applied simultaneously to several processes. This approach is more flexible than the linear cokriging predictor usually applied in multivariate environmental statistics, and it extends the LOESS predictor of Bodnar and Schmid (2009) to multivariate data. In an empirical study, we apply the suggested approach to interpolate the most significant air pollutants in the Berlin/Brandenburg region.

### 26/05/2009, 16:30 — 17:30 — Amphitheatre Pa2, Mathematics Building

Carlos Soares, *Faculdade de Economia, Universidade do Porto*

### Datasetoids: generating more data for empirical data analysis studies

With the increase in the number of models induced from data that are used by organizations for decision support, the problem of algorithm (and parameter) selection is becoming increasingly important. Two approaches to obtain empirical knowledge that is useful for that purpose are empirical studies and metalearning. However, most empirical (meta)knowledge is obtained from a relatively small set of datasets. In this paper, we propose a method to obtain a large number of datasets which is based on a simple transformation of existing datasets, referred to as datasetoids. We test our approach on the problem of using metalearning to predict when to prune decision trees. The results show significant improvement when using datasetoids. Additionally, we identify a number of potential anomalies in the generated datasetoids and propose methods to solve them.