# Probability and Statistics Seminar

## Past sessions

### Why we need non-linear time series models and why we are not using them so often

The Wold decomposition theorem says that, under fairly general conditions, a stationary time series $X_t$ has a unique linear causal representation in terms of uncorrelated random variables. However, the theorem gives us a representation, not a model for $X_t$: from this representation we can uniquely recover only the moments of $X_t$ up to second order, unless the input series is a Gaussian sequence. If we look for models for $X_t$, then we should look within the class of convergent Volterra series expansions. If we have to go beyond second-order properties, and many real data sets from the financial and environmental sciences indicate that we should, then linear models with iid Gaussian input are a tiny fraction of the possible models for a stationary time series, corresponding to the first term of the infinite-order Volterra expansion. On the other hand, Volterra series expansions are not particularly useful as a class of models: conditions for stationarity and invertibility are hard, if not impossible, to check, so they are of very limited use as time series models unless the input series is observable.

From a prediction point of view, the Projection Theorem for Hilbert spaces tells us how to obtain the best linear predictor of $X_{t+k}$ within the linear span of $\{X_t, X_{t-1}, \dots\}$. But when linear predictors are not sufficiently good, it is not straightforward to find the best predictor, if that is possible at all, within richer subspaces constructed over $\{X_t, X_{t-1}, \dots\}$. It is therefore important to look for classes of nonlinear models that improve upon the linear predictor and that are sufficiently general, yet sufficiently flexible to work with.

There are many ways in which a time series can be nonlinear. As a consequence, there are many classes of nonlinear models to explain such nonlinearities, but their probabilistic characteristics are difficult to study, not to mention the difficulties associated with modeling. Likelihood-based inference is particularly difficult because, for most nonlinear processes, we cannot even write down the likelihood. However, there have recently been very exciting advances in simulation-based inferential methods for generalized state space models, such as sequential Markov chain Monte Carlo, particle filters and Approximate Bayesian Computation, which we will mention briefly.
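
For orientation, the two representations contrasted in this abstract are usually written along the following lines. The notation ($\psi_j$, $Z_t$, $V_t$, $g_i$, $\varepsilon_t$) is an assumption made here for illustration and may differ from the speaker's. In the Wold form, $Z_t$ is an uncorrelated (white noise) sequence and $V_t$ is a deterministic component; in the Volterra form, $\varepsilon_t$ is an iid input sequence, and the linear model is what remains when all kernels of order two and higher vanish.

$$
X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \qquad \psi_0 = 1, \quad \sum_{j=0}^{\infty} \psi_j^2 < \infty
$$

$$
X_t = \mu + \sum_{i=0}^{\infty} g_i\,\varepsilon_{t-i} + \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} g_{ij}\,\varepsilon_{t-i}\varepsilon_{t-j} + \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{k=0}^{\infty} g_{ijk}\,\varepsilon_{t-i}\varepsilon_{t-j}\varepsilon_{t-k} + \cdots
$$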

### How far can M(m)an go?

This seminar addresses the question "What is the longest long jump within the reach of M(m)an, given the current state of the art?" To answer it we use the crème de la crème, i.e., the data are collected from the best Olympic athletes in the event, taken from the World Athletics Competitions - Long Jump Men Outdoors database. The approach to the problem is based on Extreme Value Theory and the corresponding statistical techniques. Only the best performances from the World top lists are used. The final estimate of the potential record, i.e., the upper bound for the long jump event, allows inference about the best possible individual mark under current conditions, both in terms of knowledge of the phenomenon and with respect to the conditions and rules for recording marks in the sport. The current record of 8.95 m is held by Mike Powell (USA), set in Tokyo on 30/08/1991. In Extreme Value terms, the problem is one of estimating the upper endpoint of the support of a distribution in the Gumbel max-domain of attraction.

Keywords: Extreme Values in Sport, Extreme Value Theory, Estimation of the Upper Endpoint of the Support in the Gumbel Domain, Semi-parametric Approach to Statistics of Extremes.

### Misleading signals in joint schemes for the process mean and variance

When the aim is to control the mean and the variance of a process simultaneously, it is common to use a joint scheme. This type of scheme consists of two control charts running simultaneously, one controlling the process mean and the other controlling the process variance. Using such schemes can give rise to misleading signals, associated, for example, with the following situations:

• the process mean is out of control, yet the chart for the variance signals before the chart used to control the mean;
• the process variance is out of control, but the chart for the mean is the first to signal.

Misleading signals are valid signals that can lead the quality control operator to take inappropriate actions to correct a nonexistent cause. It is therefore important to consider the frequency with which these signals occur as a performance measure for joint schemes. In this work we analyse the performance of joint schemes from the point of view of the probability of a misleading signal, with special emphasis on joint schemes for univariate i.i.d. and autocorrelated processes.
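
As a rough illustration of the first situation in the list above, the probability of a misleading signal can be estimated by simulation. The sketch below is not the scheme analysed in the talk: it assumes i.i.d. normal output, a Shewhart chart for the sample mean with 3-sigma limits, an upper one-sided chart for the sample variance, samples of size 5 and an in-control false alarm probability of 0.0027 per chart. All function names and parameter values are hypothetical.

```python
import math
import random


def chi2_4_upper_limit(alpha):
    """Solve exp(-c/2) * (1 + c/2) = alpha for c by bisection.

    The left-hand side is the survival function of a chi-square
    distribution with 4 degrees of freedom (sample size 5).
    """
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.exp(-mid / 2) * (1 + mid / 2) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def prob_misleading_signal(delta, n_runs=5000, seed=0):
    """Monte Carlo estimate of P(variance chart signals strictly before the mean chart)
    when only the mean has shifted by delta in-control standard deviations.

    Data are standardised: in-control mean 0 and in-control variance 1.
    """
    rng = random.Random(seed)
    n = 5                                  # sample size (assumption)
    mean_limit = 3 / math.sqrt(n)          # 3-sigma limits for the sample mean
    var_limit = chi2_4_upper_limit(0.0027)  # limit for (n-1)*S^2 (assumption)
    misleading = 0
    for _ in range(n_runs):
        while True:
            sample = [rng.gauss(delta, 1.0) for _ in range(n)]
            xbar = sum(sample) / n
            ss = sum((x - xbar) ** 2 for x in sample)  # (n-1)*S^2
            mean_signal = abs(xbar) > mean_limit
            var_signal = ss > var_limit
            if mean_signal or var_signal:
                # Misleading only if the variance chart signals alone
                if var_signal and not mean_signal:
                    misleading += 1
                break
    return misleading / n_runs


estimate = prob_misleading_signal(delta=1.0)
print(f"Estimated probability of a misleading signal: {estimate:.4f}")
```

Simultaneous signals from both charts are not counted as misleading here; that convention is also an assumption of the sketch.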

### Strategic Capacity Investment Under Uncertainty

In this talk we consider investment decisions within an uncertain, dynamic and competitive framework. Each investment decision involves determining the timing and the capacity level. In this way we extend the main bulk of real options theory, where the capacity level is given. We consider a monopoly setting as well as a duopoly setting. Our main results are the following. In the duopoly setting we provide a fully dynamic analysis of entry deterrence/accommodation strategies. Contrary to the seminal industrial organization analyses that are based on static models, we find that entry can only be deterred temporarily. To keep its monopoly position as long as possible, the first investor overinvests in capacity. In very uncertain economic environments the first investor eventually ends up being the largest firm in the market. If uncertainty is moderately present, a reduced value of waiting implies that the preemption mechanism forces the first investor to invest so soon that a large capacity cannot be afforded. It will then end up with a capacity level lower than that of the second investor.

### Production Flexibility and Capacity Investment under Demand Uncertainty

The paper takes a real option approach to consider optimal capacity investment decisions under uncertainty. Besides the timing of the investment, the firm also has to decide on the capacity level. Concerning the production decision, we study a flexible and an inflexible scenario. The flexible firm can costlessly adjust production over time with the capacity level as the upper bound, while the inflexible firm fixes production at capacity level from the moment of investment onwards. We find that the flexible firm invests in higher capacity than the inflexible firm, where the capacity difference increases with uncertainty. For the flexible firm the initial occupation rate can be quite low, especially when investment costs are concave and the economic environment is uncertain. As to the timing of the investment, there are two contrary effects. First, the flexible firm has an incentive to invest earlier, because flexibility raises the project value. Second, the flexible firm has an incentive to invest later, because costs are larger due to the higher capacity level. The latter effect dominates in highly uncertain economic environments.

### Performance of passive optical networks

We introduce PONs (Passive Optical Networks), which are designed to provide high-speed access to users via fiber links. The problem for the OLT (Optical Line Terminal) is to dynamically share the wavelength bandwidth among the ONUs (Optical Network Units). With an optimal algorithm, the system can be modeled as a relatively standard polling system. Due to technological constraints, the number of servers that can visit one queue at the same time in the polling system is limited. The performance of the system is directly related to the stability condition of the polling model, which is unknown in general. A mean field approach provides a limiting stability condition when the system gets large.

### Mathematics - A Catalyst for Innovation - Giving European Industry an Edge

We will discuss the role of Mathematics in Industry and in innovation processes. The focus will be European and we will look at good examples provided e.g. by the experiences of the network European Consortium for Mathematics in Industry (ECMI). I will also present the ongoing ESF Forward Look: "Mathematics and Industry" (see http://www.ceremade.dauphine.fr/FLMI/FLMI-frames-index.html) and discuss possible future developments on a European scale.

### Robust inference in generalized linear models with missing responses

The generalized linear model (GLM; McCullagh and Nelder, 1989) is a popular technique for modelling a wide variety of data. It assumes that the observations are independent and that the conditional distribution of $y|x$ belongs to the canonical exponential family. In this situation, the mean $E(y|x)$ is modelled linearly through a known link function. Robust procedures for generalized linear models have been considered, among others, by Stefanski et al. (1986), Künsch et al. (1989), Bianco and Yohai (1996), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2002) and Bianco et al. (2005). Recently, robust tests for the regression parameter under a logistic model were considered by Bianco and Martínez (2009).

In practice, some response variables may be missing, by design (as in two-stage studies) or by happenstance. As is well known, the methods described above are designed for complete data sets, and problems arise when responses may be missing while covariates are completely observed. Even though there are many situations in which both the response and the explanatory variables are missing, we will focus on the case where missing data occur only in the responses. Missingness of responses is very common in opinion polls, market research surveys, mail enquiries, socio-economic investigations, medical studies and other scientific experiments where the explanatory variables can be controlled. This pattern is common, for example, in the double sampling scheme proposed by Neyman (1938). Hence, we will be interested in robust inference when the response variable may have missing observations but the covariate $x$ is totally observed.

In the regression setting with missing data, a common method is to impute the incomplete observations and then estimate the conditional or unconditional mean of the response variable with the completed sample. The methods considered include linear regression (Yates, 1933), kernel smoothing (Cheng, 1994; Chu and Cheng, 1995), nearest neighbor imputation (Chen and Shao, 2000), semiparametric estimation (Wang et al., 2004; Wang and Sun, 2007), nonparametric multiple imputation (Aerts et al., 2002; González-Manteiga and Pérez-Gonzalez, 2004) and empirical likelihood over the imputed values (Wang and Rao, 2002), among others. All these proposals are very sensitive to anomalous observations, since they are based on least squares approaches.

In this talk, we introduce a robust procedure to estimate the regression parameter under a GLM, which includes, when there are no missing data, the family of estimators previously studied. It is shown that the robust estimates of the regression parameter are root-$n$ consistent and asymptotically normally distributed. A robust procedure to test simple hypotheses on the regression parameter is also considered. The finite sample properties of the proposed procedure are investigated through a Monte Carlo study, where the robust test is also compared with nonrobust alternatives.

### CSI: are Mendel's data "Too Good to be True?"

Gregor Mendel (1822-1884) is almost unanimously recognized as the founder of modern genetics. However, long ago, a shadow of doubt was cast on his integrity by another eminent scientist, the statistician and geneticist, Sir Ronald Fisher (1890-1962), who questioned the honesty of the data that form the core of Mendel's work. This issue, nowadays called "the Mendel-Fisher controversy", can be traced back to 1911, when Fisher first presented his doubts about Mendel's results, though he only published a paper with his analysis of Mendel's data in 1936.

A large number of papers have been published about this controversy, culminating with the publication in 2008 of a book (Franklin et al., "Ending the Mendel-Fisher controversy") aiming at ending the issue and definitively rehabilitating Mendel's image. However, quoting from Franklin et al., "the issue of the `too good to be true' aspect of Mendel's data found by Fisher still stands".

We have submitted Mendel's data and Fisher's statistical analysis to extensive computations and simulations, attempting to discover a hidden explanation or hint that could help answer the questions: is Fisher right or wrong, and, if Fisher is right, is there any reasonable explanation for the "too good to be true" other than deliberate fraud? In this talk some results of this investigation and the conclusions obtained will be presented.

### Fast and Accurate Inference for the Smoothing Parameter in Semiparametric Models

We adapt the method developed in Paige, Trindade, and Fernando (2009) in order to make approximate inference on optimal smoothing parameters for penalized spline and partially linear models. The method is akin to a parametric bootstrap where Monte Carlo simulation is replaced by saddlepoint approximation, and is applicable whenever the underlying estimator can be expressed as the root of an estimating equation that is a quadratic form in normal random variables. This is the case under a variety of common optimality criteria such as ML, REML, GCV, and AIC. We apply the method to some well-known datasets in the literature, and find that under the ML and REML criteria it delivers a performance that is nearly exact, with computational speeds that are at least an order of magnitude faster than exact methods. Perhaps most importantly, the proposed method also offers a computationally feasible alternative where no known exact methods exist, e.g. for GCV and AIC.

### Probability Calculus - the construction of Pacheco D’Amorim in 1914

At the end of the 19th century, the classical definition of Probability and its extension to the continuous case were too restrictive, and some geometrical applications, based on ingenious interpretations of the Bernoulli-Laplace principle of insufficient reason, led to several paradoxes. David Hilbert, in his celebrated address at the International Congress of Mathematicians of 1900, included the axiomatization of Probability in his list of 23 important unsolved problems. Only in 1933 did Kolmogorov lay down a rigorous setup for Probability, inspired by Fréchet's idea of using Measure Theory. But before this, some other efforts to build up a proper axiomatization of Probability deserve to be more widely credited. Among those, the construction of Diogo Pacheco d'Amorim, in his 1914 doctoral thesis, is one of the most interesting. His discussion of a standard model, based on the idea of random choice instead of the concept of probability itself, seems limited. His final discussion, however, is impressive: it shows how to use the law of large numbers and the central limit theorem to obtain an objective appraisal of whether sampling done by others, or even by a mechanical device, is indistinguishable from a random choice made by oneself, and so anticipates the ideas of Monte Carlo by almost 30 years.

### Integer-valued AR models

During the last decades there has been considerable interest in integer-valued time series models, and a large volume of work is now available in specialized monographs. Motivation to study discrete data models comes from the need to account for the discrete nature of certain data sets, often counts of events, objects or individuals. Examples of applications can be found in the analysis of time series of count data in many areas. Among the most successful integer-valued time series models proposed in the literature is the INteger-valued AutoRegressive model of order 1 (INAR(1)). In this talk the statistical and probabilistic properties of INAR(1) models are reviewed.
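
To make the model concrete, here is a minimal simulation sketch of an INAR(1) process in its usual binomial thinning form, $X_t = \alpha \circ X_{t-1} + \varepsilon_t$, where $\alpha \circ X$ counts how many of $X$ units survive independently with probability $\alpha$. The Poisson innovations, the parameter values and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """Return alpha ∘ x: the number of survivors among x units,
    each kept independently with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_inar1(alpha, lam, n, seed=0, burn_in=200):
    """Simulate n values of X_t = alpha ∘ X_{t-1} + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    x = 0
    path = []
    for t in range(n + burn_in):
        x = binomial_thinning(alpha, x, rng) + poisson(lam, rng)
        if t >= burn_in:  # discard the start-up transient
            path.append(x)
    return path


path = simulate_inar1(alpha=0.5, lam=2.0, n=5000)
sample_mean = sum(path) / len(path)
print(f"sample mean = {sample_mean:.3f}, stationary mean = {2.0 / (1 - 0.5):.3f}")
```

With Poisson innovations the stationary mean is $\lambda / (1 - \alpha)$, which gives a quick sanity check on the simulated path.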

### Analysis of stationary discrete-time GI/D-MSP/1 queue with finite and infinite buffers

This paper considers a single-server queueing model with finite and infinite buffers in which customers arrive according to a discrete-time renewal process. The customers are served one at a time under a discrete-time Markovian service process (D-MSP). This service process is similar to the discrete-time Markovian arrival process (D-MAP), with arrivals replaced by service completions. Using the imbedded Markov chain technique and the matrix-geometric method, we obtain the system-length distribution at a prearrival epoch. We also provide the steady-state system-length distribution at an arbitrary epoch by using the supplementary variable technique and the classical argument based on renewal theory. The distribution of the actual waiting time in the queue (measured in slots) is also investigated. Further, we derive the coefficient of correlation of the lagged interdeparture intervals. Moreover, computational experience with a variety of numerical results, in the form of tables and graphs, is discussed.

### Robust Inference in Predictive Regressions

In this paper we discuss new tests for predictability which are inspired by the work of Vogelsang (1998) on testing for trend. The proposed tests use, by design, the same critical values irrespective of whether the predictor is $I(0)$ or $I(1)$, and are therefore capable of detecting a more general set of alternatives than is presently possible with available procedures (exceptions being the tests of Deo and Chen, 2008 and Maynard and Shimotsu, 2009). Numerical evidence suggests that our proposed procedures have good finite sample performance which, coupled with their simplicity of application, makes them appealing approaches for empirical research and useful alternatives to available procedures.

### Nonparametric estimation on Riemannian manifolds

In many situations, the random variables take values in a Riemannian manifold $(M, g)$ instead of $\mathbb{R}^d$, and this structure needs to be taken into account when we generate estimation procedures. For the nonparametric regression model, we study two families of robust estimators for the regression function when the explanatory variables take values in a Riemannian manifold.

In this talk, we will give a brief introduction to the geometric objects needed to define nonparametric estimators adapted to a manifold. We discuss the classical proposals and introduce two families of robust estimators for the regression function. We show the asymptotic properties obtained for both proposals. Finally, through a simulation study, we compare the behavior of the robust estimators against the classical alternatives. This is joint work with Guillermo Henry.

### Strategic Capacity Investment Under Uncertainty

Contrary to most of the papers in the literature on investment under uncertainty, we study models that capture not only the timing but also the size of the investment. We consider a monopoly setting as well as a duopoly setting and compare the results with the standard models in which the firms do not have the capacity choice. Our main results are the following. First, for low uncertainty values the follower chooses a higher capacity than the leader, and for high uncertainty values the leader chooses a higher capacity. Second, compared to the model without capacity choice, the monopolist and the follower invest later in a higher capacity for higher values of uncertainty. However, the leader will invest earlier in a higher capacity for higher values of uncertainty. The reverse results apply for lower values of uncertainty.

### Backward stochastic differential equations and quasi-linear PDEs

In the spirit of a forthcoming research project within CEMAT, this talk aims at introducing a probabilistic approach to PDEs. This probabilistic interpretation for systems of second-order quasilinear parabolic PDEs is obtained by establishing a kind of backward stochastic differential equation. We look at several aspects of this link.

### Methods and techniques to construct projections of resident population: Portugal, 2008-2060

The purpose of this communication is to present the methodology adopted in the latest exercise of resident population projections for Portugal, carried out by Statistics Portugal.

These population projections are based on the concept of resident population and adopt the cohort-component method, where the initial population is grouped into cohorts defined by age and sex, and continuously updated, according to the assumptions of future development set for each of the components of population change - fertility, mortality and migration - that is, by adding the natural balance and net migration, in addition to the natural aging process. This method, widely used in the elaboration of population projections at national level, allows the development of different scenarios of demographic evolution based on different combinations of likely developments of the components.
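
A single projection step of the cohort-component method described above can be sketched as follows. This is a toy illustration, not the implementation used by Statistics Portugal: the function name, the data layout, the share of female births and all numbers are hypothetical, and the step length is assumed to equal the width of the age groups.

```python
def project_one_step(pop, survival, fertility, net_migration, female_birth_share=0.49):
    """Advance a population by one step of the cohort-component method.

    pop, survival, net_migration: dicts mapping "F"/"M" to lists indexed by age group.
    fertility: list of fertility rates per woman, indexed by female age group.
    The last age group is open-ended and accumulates its own survivors.
    """
    n_ages = len(pop["F"])
    # Births over the step come from the female population and the fertility rates
    births = sum(rate * women for rate, women in zip(fertility, pop["F"]))
    new_pop = {}
    for sex, share in (("F", female_birth_share), ("M", 1.0 - female_birth_share)):
        aged = [0.0] * n_ages
        aged[0] = births * share
        # Natural aging: survivors of each group move up one group
        for a in range(n_ages - 1):
            aged[a + 1] += pop[sex][a] * survival[sex][a]
        aged[-1] += pop[sex][-1] * survival[sex][-1]
        # Add net migration by age group
        new_pop[sex] = [x + m for x, m in zip(aged, net_migration[sex])]
    return new_pop


# Hypothetical three-group example (young, adult, old)
pop = {"F": [100.0, 100.0, 100.0], "M": [100.0, 100.0, 100.0]}
survival = {"F": [1.0, 1.0, 0.5], "M": [1.0, 1.0, 0.5]}
fertility = [0.0, 0.2, 0.0]
net_migration = {"F": [0.0, 0.0, 0.0], "M": [0.0, 0.0, 0.0]}

projected = project_one_step(pop, survival, fertility, net_migration, female_birth_share=0.5)
print(projected)
```

Different scenarios are then obtained by repeating this step over the projection horizon with different combinations of assumptions for fertility, survival and migration.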

The results are conditioned, on the one hand, by the structure and composition of the initial population and, on the other, by the different behaviour patterns of fertility, mortality and migration in each set of assumptions about their evolution over the projection period. The conditional nature of the results should therefore be emphasized: this is a scenario method of the "if ... then ..." type, in which each scenario combines the assumptions outlined for the components in a different way.

Given the importance of the projections of individual components to the outcome of the exercise, we proceed to present the methodologies used in the projection of each of these. The projection of components is carried out using a set of statistical methods suited to the background information and the proposed target. Thus, in the case of fertility, we have modelled the fertility rates using the method proposed by Schmertmann (2003); for mortality, we have used the Poisson-Lee-Carter method with limit life table proposed by Bravo (2007); and for migration, given the greater fragility of the data and the consequent difficulties in the practical application of statistical modelling methods, we adopted as an initial reference the average of the estimated flows in the last 15 years. Finally, we will present the main results of this exercise, with regard both to the components and to the future population.

### Nonparametric estimation of highly oscillatory signals

We will consider the problem of estimating highly oscillatory signals from noisy measurements. These signals are often referred to as chirps in the literature; they are found everywhere in nature, and frequently arise in scientific and engineering problems. Mathematically, they can be written in the general form $A(t)\exp(i\lambda\varphi(t))$, where $\lambda$ is a large constant base frequency, the phase $\varphi(t)$ is time-varying, and the envelope $A(t)$ is slowly varying. Given a sequence of noisy measurements, we study the problem of estimating this chirp from the data.
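
To fix ideas, the following sketch generates noisy samples of a chirp of the form described above. The particular envelope, phase, base frequency and noise level are hypothetical choices made for illustration only, and the code shows the data-generating side of the problem, not the estimation procedure of the talk.

```python
import cmath
import random


def envelope(t):
    """Slowly varying amplitude A(t) (hypothetical choice)."""
    return 1.0 + 0.5 * t


def phase(t):
    """Time-varying phase phi(t) (hypothetical choice: quadratic, i.e. a linear chirp)."""
    return t + 0.5 * t * t


def chirp(t, lam=200.0):
    """Evaluate A(t) * exp(i * lam * phi(t)) for a large base frequency lam."""
    return envelope(t) * cmath.exp(1j * lam * phase(t))


def noisy_samples(n, sigma=0.5, seed=0):
    """Sample the chirp at n equispaced points on [0, 1) and add complex Gaussian noise."""
    rng = random.Random(seed)
    ts = [k / n for k in range(n)]
    clean = [chirp(t) for t in ts]
    noisy = [z + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for z in clean]
    return ts, clean, noisy


ts, clean, noisy = noisy_samples(512)
print(len(noisy), abs(clean[0]))
```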

We introduce novel, flexible and practical strategies for addressing these important nonparametric statistical problems. The main idea is, in a first step, to calculate correlations of the data with a rich family of local templates, the multiscale chirplets, and, in a second step, to search for meaningful aggregations or chains of chirplets which provide a good global fit to the data. From a physical viewpoint, these chains correspond to realistic signals since they model arbitrary chirps. From an algorithmic viewpoint, these chains are identified as paths in a convenient graph. The key point is that this underlying graph structure allows us to unleash very effective algorithms, such as network flow algorithms, for finding the chains that achieve a near-optimal trade-off between goodness of fit and complexity.

Our estimation procedures provide provably near-optimal performance over a wide range of chirps, and numerical experiments show that they perform exceptionally well over a broad class of chirps.

### Validation of Computer Models with Multivariate Output

We consider the problem of validating computer models that produce multivariate output, particularly when the model is computationally demanding. Our strategy builds on Gaussian process-based response-surface approximations to the output of the computer model, constructed independently for each of its components. These are then combined in a statistical model involving field observations to produce a predictor of the multivariate output at untested input vectors. We illustrate the methodology in a situation where the output is two-dimensional and consists of very irregular functions.
