# Probability and Statistics Seminar

## Past sessions

### Robust methods in semiparametric estimation with missing responses

Most statistical methods in nonparametric regression are designed for complete data sets, and problems arise when observations are missing, a common situation in biomedical and socioeconomic studies. Classic examples include non-response in sample surveys in the social sciences, as well as problems in Physics and Genetics (Meng, 2000), among others. We will consider inference with an incomplete data set where the responses satisfy a semiparametric partly linear regression model. We will introduce a family of robust procedures to estimate the regression parameter as well as the marginal location of the responses when there are missing observations in the response variable but the covariates are fully observed. In this context, some conditions on the missingness mechanism are required. We assume that the data are missing at random, i.e., the probability that a response is missing does not depend on the response variable and depends only on the covariates. Our proposal is based on a robust profile likelihood approach adapted to the presence of missing data. The asymptotic behavior of the robust estimators of the regression parameter is derived. Several proposals for the marginal location are considered. A Monte Carlo study is carried out to compare the performance of the proposed robust estimators with one another and with the classical ones, in normal and contaminated samples, under different missing data models.
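
As a sketch of the setting described above, the model and the missing-at-random condition can be written as follows. The notation is an assumption made here for illustration and is not taken from the talk: $\delta$ equals $1$ when the response $y$ is observed and $0$ otherwise, $({\bf x},t)$ are the fully observed covariates, and $g$ is the unknown smooth component of the partly linear model.

$$y={\bf x}^T{\bf\beta}+g(t)+\varepsilon,\qquad P\left(\delta=1\mid y,{\bf x},t\right)=P\left(\delta=1\mid {\bf x},t\right).$$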

### Local Approaches for Simultaneous Interpolating of Air Pollution Processes

In this paper, we derive a non-linear cokriging predictor for the spatial interpolation of multivariate environmental processes. The suggested predictor is based on the locally weighted scatterplot smoothing method of Cleveland (1979) applied simultaneously to several processes. This approach is more flexible than the linear cokriging predictor usually applied in multivariate environmental statistics and extends the LOESS predictor of Bodnar and Schmid (2009) to multivariate data. In an empirical study, we apply the suggested approach to interpolate the most significant air pollutants in the Berlin/Brandenburg region.

### Datasetoids: generating more data for empirical data analysis studies

With the increase in the number of models induced from data that are used by organizations for decision support, the problem of algorithm (and parameter) selection is becoming increasingly important. Two approaches to obtain empirical knowledge that is useful for that purpose are empirical studies and metalearning. However, most empirical (meta)knowledge is obtained from a relatively small set of datasets. In this paper, we propose a method to obtain a large number of datasets which is based on a simple transformation of existing datasets, referred to as datasetoids. We test our approach on the problem of using metalearning to predict when to prune decision trees. The results show significant improvement when using datasetoids. Additionally, we identify a number of potential anomalies in the generated datasetoids and propose methods to solve them.

### Numerical Inversion of Transforms Occurring in Queueing and Other Stochastic Processes

We consider the numerical inversion of three classes of generating functions (GFs): classes of probability generating functions (PGFs) that are given in rational and non-rational forms, and a class of GFs that are not PGFs. Particular emphasis is on those PGFs that are not explicitly given but contain a number of unknowns. We show that the desired sequence can be obtained to any given accuracy, so long as enough numerical precision is used.
Attention: the room will be P1.
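
To illustrate what numerically inverting a generating function means, here is a minimal sketch in Python. It is not the speaker's method: it uses the standard idea of applying the trapezoidal rule to the Cauchy integral of the PGF over a circle, and the function names, the number of sample points and the geometric test case are all assumptions chosen for illustration.

```python
import cmath
import math


def invert_pgf(pgf, n, num_points=64, radius=1.0):
    """Approximate the coefficient p_n of z**n in pgf(z).

    Applies the trapezoidal rule to the Cauchy integral of pgf over the
    circle |z| = radius, which amounts to a discrete Fourier transform.
    """
    total = 0j
    for k in range(num_points):
        angle = 2.0 * math.pi * k / num_points
        z = radius * cmath.exp(1j * angle)
        total += pgf(z) * cmath.exp(-1j * n * angle)
    return (total / num_points).real / radius**n


def geometric_pgf(z, p=0.5):
    """Hypothetical test case: PGF of P(X = n) = p * (1 - p)**n, n >= 0."""
    return p / (1.0 - (1.0 - p) * z)


# Compare the recovered probabilities with the known closed form.
recovered = [invert_pgf(geometric_pgf, n) for n in range(5)]
exact = [0.5 * 0.5**n for n in range(5)]
```

In this sketch the discretization error is an aliasing term that shrinks as `num_points` grows, which is one simple way to see how accuracy can be traded against computational effort.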

### Performance analysis of joint control schemes for the process mean (vector) and (co)variance (matrix)

The presentation focuses on ongoing and future work on the performance analysis of joint control schemes for the process mean (vector) and (co)variance (matrix), when the usual assumptions of independence and normality are no longer valid. We shall give special attention to two performance measures: the probability of a misleading signal (PMS) and the run length to a misleading signal (RLMS). We use stochastic ordering to analyze their monotonicity properties in terms of shifts in the parameters being monitored, and of changes in the autocorrelation parameter.
This seminar is part of a CAT examination.

### Robust tests in generalized partial linear models

In this talk we will first review the existing robust procedures for estimating the regression parameter and the regression function under a generalized partial linear model. Based on them, we will describe how to construct a Wald-type statistic to test hypotheses on the regression parameter and a robust test to decide whether the regression function is linear. The asymptotic behavior of the test statistics is derived, and results from a Monte Carlo study will be presented.

### Partial Differential Equations and Stochastic Differential Equations Arising in Particle Systems

In this talk, I will introduce a classical example of a particle system: the Simple Exclusion Process. I will give the notion of hydrodynamic limit, which is a Law of Large Numbers for the empirical measure, and I will explain how to derive from the microscopic dynamics between particles a partial differential equation describing the evolution of the density profile. For the Simple Exclusion Process, in the symmetric case $\left(p=1/2\right)$ we obtain the heat equation, while in the asymmetric case $\left(p\ne 1/2\right)$ we obtain the Burgers equation. Finally, I will introduce the Central Limit Theorem for the empirical measure, whose limiting process turns out to be a solution of a stochastic differential equation.
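
For reference, the two limiting equations for the density profile $\rho(t,x)$ are usually written as below. This is a sketch under assumed conventions that are not stated in the abstract: particles jump to the right with probability $p$ and to the left with probability $1-p$, time is rescaled diffusively in the symmetric case and hyperbolically in the asymmetric case, and the constants depend on the chosen jump rates.

$$\partial_t\rho=\tfrac{1}{2}\,\partial_x^2\rho\quad\left(p=1/2\right),\qquad \partial_t\rho+(2p-1)\,\partial_x\left[\rho(1-\rho)\right]=0\quad\left(p\ne 1/2\right).$$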

### An Overview of Retrial Queues

The talk deals with a branch of queuing theory, retrial queuing systems, which is characterized by a basic assumption: a customer who cannot receive service (due to finite capacity of the system, balking, impatience, etc.) leaves the service area, but after some random delay returns to the system again to request service. As a result, repeated attempts for service from the pool of unsatisfied customers, called the orbit, are superimposed on the ordinary stream of arrivals of first attempts.

The talk is divided into three parts. Part I introduces the audience to the broad range of applications of retrial queues and compares retrial queues to standard queues with a waiting line and to queues with losses. We show that, though retrial queues are closely connected with standard queuing models, they possess unique and distinguishing characteristics. In Part II, we give a survey of the main results for the M/G/1 and M/M/c retrial queues. We also present the analysis of descriptors arising from the idiosyncrasies of the retrial feature. Part III uses the matrix-analytic formalism to analyze a selected number of retrial queues with underlying structured Markov chains.
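
The retrial mechanism described above can be sketched as a small simulation. This is only an illustration under assumed parameters, not material from the talk: it models an M/M/1 retrial queue as a continuous-time Markov chain in which blocked customers join the orbit and each orbiting customer retries at rate `retrial_rate`. All names and numerical values are hypothetical.

```python
import random


def simulate_retrial_queue(arrival_rate, service_rate, retrial_rate,
                           num_events=200_000, seed=1):
    """Simulate an M/M/1 retrial queue and return two time averages:
    the fraction of time the server is busy and the mean orbit size."""
    rng = random.Random(seed)
    busy, orbit = False, 0
    clock = busy_time = orbit_area = 0.0
    for _ in range(num_events):
        if busy:
            # Competing events: a blocked primary arrival or a service completion.
            rates = (arrival_rate, service_rate)
        else:
            # Competing events: a primary arrival or a retrial from the orbit.
            rates = (arrival_rate, orbit * retrial_rate)
        total = rates[0] + rates[1]
        dt = rng.expovariate(total)
        clock += dt
        orbit_area += orbit * dt
        if busy:
            busy_time += dt
        first_event = rng.random() * total < rates[0]
        if busy:
            if first_event:
                orbit += 1        # blocked customer joins the orbit
            else:
                busy = False      # service completion frees the server
        else:
            if not first_event:
                orbit -= 1        # an orbiting customer captures the server
            busy = True
    return busy_time / clock, orbit_area / clock


# Hypothetical parameters with arrival_rate / service_rate = 0.5.
busy_fraction, mean_orbit = simulate_retrial_queue(0.5, 1.0, 1.0)
```

Because no customer is lost in this model, the long-run busy fraction should be close to `arrival_rate / service_rate`, which gives a simple sanity check on the simulation.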

### Estimation of matrix rank: historical overview and more recent developments

Testing for the rank of a matrix is an important problem arising in Statistics, Econometrics and other areas. It is of interest, for example, in reduced rank regression, cointegration analysis and other applications. In this talk, we will review past efforts in addressing this problem and discuss more recent developments concerning testing for the rank in symmetric matrices. (The latter part is based on joint work with Stephen Donald and Natercia Fortuna.)
This talk is part of the UTL Probability and Statistics Seminar.

### Probabilistic Analysis of Real Options

Real options analysis, very much in vogue in today's financial circles, deals with the pressing need to predict future prices of options (call or put) in a random setting, so as to establish contract prices that are fair to both parties. In this context the stochastic component plays a relevant role, which will be explored in this presentation. We will see how Brownian motion, for example, is indispensable in the finite-horizon analysis of financial products, and how stochastic integrals appear in the day-to-day work of specialized traders.

### Bayesian analysis of deprivation among Portuguese households

This work aims at a multidimensional analysis of poverty among Portuguese households, considering four dimensions of well-being (Housing, Comfort Goods, Economic Capacity and Sociability Networks), based on Eurostat's European Community Household Panel. A multi-stage approach is proposed that allows both a partial and a global analysis of deprivation, relying on Bayesian latent class models fitted by Markov chain Monte Carlo methods. The results show a substantial improvement in household well-being between 1995 and 2001. The Economic Capacity and Sociability Networks dimensions are the ones that contribute most to household deprivation.

### Errors in the regressors: realism, discomfort and challenges

Regression models are among the tools most widely used by users of Statistics, who sometimes overlook the fact that the variables are measured with error. What should be done when a regressor contains errors? Errors-in-variables models provide the answer. They are an extension of regression models that offers a more realistic description of the phenomena under analysis. Despite the apparent advantages of this modelling approach, errors-in-variables models have been neither as sought after nor as widely accepted as their regression counterparts. Why is that? This presentation seeks to answer these questions by introducing this family of models, comparing them with regression models, and pointing out the methods to use, the advantages and the challenges associated with their application.

### Robust Regression

Since the end of the 19th century, regression has been a kind of daily bread for a vast population of users of statistics in the most varied fields. The method of least squares is, as it were, the butter that prepares and eases its ingestion, making the study of the model possible. This presentation shows that this bread and butter of Statistics does not always have the delicious taste that its voracious consumption seems to suggest. The recipe can fail and the taste can be bitter. To prevent accidents and discomfort, a very broad-ranging seasoning is proposed, capable of satisfying and reassuring the general user, even the most demanding one, who works in fields or cases where applying least squares leads to misleading solutions. This is robust regression, whose analysis is the central concern of the seminar: what it is, its advantages, its limitations, its methods and its popularity are some of the topics to be addressed.
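
A small sketch of how least squares can mislead, and how a robust alternative behaves on the same data. The robust estimator used here, the Theil-Sen slope (the median of all pairwise slopes), is chosen for illustration only and is not necessarily among the methods covered in the seminar. The data are made up.

```python
from itertools import combinations
from statistics import mean, median


def least_squares_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Median of the slopes over all pairs of points with distinct x."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    return median(slopes)


xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]   # points on an exact line with slope 2
ys[-1] = 100.0                     # one gross outlier replaces y = 19

ls_slope = least_squares_slope(xs, ys)
ts_slope = theil_sen_slope(xs, ys)
```

With this single contaminated point, the least squares slope moves far from 2 while the median of pairwise slopes stays at 2, because most pairs of points do not involve the outlier.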

### Biostatistics from a multidisciplinary perspective

Researchers in the biomedical fields are nowadays more aware of the role of Statistics in their research projects. Late involvement of the statistician (only at the data analysis stage) is still frequent. However, there is growing concern with involving the statistician in good time, at the planning and data collection stages. On the other hand, developments in the biomedical fields (e.g. Molecular Biology) have prompted the development of increasingly sophisticated statistical methodologies that require input from Probability Theory, Operations Research and Computer Science, which are usually not represented in biomedical institutions. This work addresses the issue of the dialogue, not always easy, with other fields of knowledge, with a view to strengthening the role of Biostatistics in some research projects. With reference to the project “Epidemiologia e Controlo da Leptospirose nos Açores” (Epidemiology and Control of Leptospirosis in the Azores), we illustrate the efforts made and the difficulties encountered in the field in carrying out random sampling and collecting the data. Another important area of Biostatistics concerns the study of the sensitivity and specificity of laboratory techniques in the absence of a reliable reference technique (gold standard). Latent Class Models, traditional in Psychology and Sociology, have been applied to estimate the specificities and sensitivities of laboratory techniques without fixing any one technique as the reference. A practical problem is used to exemplify the importance of these models in the diagnosis of some tropical diseases. In parallel, exploring this practical problem raised the need to revisit some old problems in Statistics related to confidence intervals for proportions (close to $0$ or $1$) and, consequently, to sample size calculation.
In short, multidisciplinarity is naturally present in Biostatistics, and its lines of research, in contrast to Theoretical Statistics, can be enriched by the diversity of practical problems and by the dialogue with other professionals.

### Robust estimators in Generalized Partially Linear Models

Semiparametric models contain both a parametric and a nonparametric component. Sometimes the nonparametric component plays the role of a nuisance parameter. The aim of this talk is to consider semiparametric versions of the generalized linear models where the response $y$ is to be predicted by covariates $({\bf x},t)$, where ${\bf x}\in\mathbb{R}^{p}$ and $t\in\mathbb{R}$. It will be assumed that the conditional distribution of $y|({\bf x},t)$ belongs to the canonical exponential family $\exp\left[y\theta({\bf x},t)-B\left(\theta({\bf x},t)\right)+C(y)\right]$, for known functions $B$ and $C$. The generalized linear model (McCullagh and Nelder, 1989), which is a popular technique for modelling a wide variety of data, assumes that the mean is modelled linearly through a known link function, $g$, i.e., $g(\mu\left({\bf x},t\right))=\theta({\bf x},t)=\beta_{0}+{\bf x}^T{\bf\beta}+\alpha t\;.$ In many situations, the linear model is insufficient to explain the relationship between the response variable and its associated covariates. A natural generalization, which suffers from the curse of dimensionality, is to model the mean nonparametrically in the covariates. An alternative strategy is to allow most predictors to be modeled linearly while one or a small number of predictors enter the model nonparametrically. This is the approach we will follow, so that the relationship will be given by the semiparametric generalized partially linear model $$\mu\left({\bf x},t\right)=E\left(y|({\bf x},t)\right)=H\left(\eta(t)+{\bf x}^T{\bf\beta}\right)\qquad(\text{GPLM})$$ where $H=g^{-1}$ is a known link function, ${\bf\beta}\in\mathbb{R}^{p}$ is an unknown parameter and $\eta$ is an unknown continuous function. Severini and Wong (1992) introduced the concept of generalized profile likelihood, which was later applied to this model by Severini and Staniswalis (1994). 
In this method, the nonparametric component is viewed as a function of the parametric component, and root-$n$ consistent estimates for the parametric component can be obtained when the usual optimal rate for the smoothing parameter is used. Such estimates fail to deal with outlying observations. In a semiparametric setting, outliers can have a devastating effect, since the extreme points can easily affect the scale and the shape of the function estimate of $\eta$, leading to possibly wrong conclusions on ${\bf\beta}$. Robust procedures for generalized linear models have been considered, among others, by Stefanski, Carroll and Ruppert (1986), Künsch, Stefanski and Carroll (1989), Bianco and Yohai (1995), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2002) and Bianco, García Ben and Yohai (2005). The basic ideas from robust smoothing and from robust regression estimation have been adapted to deal with the case of independent observations following a partly linear regression model with $g(t)=t$; we refer to Gao and Shi (1997), He, Zhu and Fung (2002) and Bianco and Boente (2004). In this talk, we will first review the classical approach to generalized partly linear models. The sensitivity to outliers of the classical estimates for these models is good evidence that robust methods are needed. The problem of obtaining a family of robust estimates was first considered by Boente, He and Zhou (2006). However, their procedure is computationally expensive. We will introduce a general three-step robust procedure to estimate the parameter ${\bf\beta}$ and the function $\eta$, under a generalized partly linear model (GPLM), that is easier to compute than the one introduced by Boente, He and Zhou (2006). It is shown that the estimates of ${\bf\beta}$ are root-$n$ consistent and asymptotically normal. Through a Monte Carlo study, we compare the performance of these estimators with that of the classical ones.
In addition, we study the sensitivity of the estimators through their empirical influence function. A robust procedure to choose the smoothing parameter is also discussed. We will briefly discuss the generalized partially linear single index model, which generalizes the previous one: the independent observations are such that $y_{i}|\left({{\bf x}_{i},{\bf t}_{i}}\right)\sim F\left(\cdot,\mu_{i}\right)$ with $\mu_{i}=H\left(\eta({\bf\alpha}^T{\bf t}_{i})+{\bf x}_{i}^T{\bf\beta}\right)$, where now ${\bf t}_{i}\in\mathbb{R}^{q}$, ${\bf x}_{i}\in\mathbb{R}^{p}$, and $\eta:\mathbb{R}\to\mathbb{R}$, ${\bf\beta}\in\mathbb{R}^{p}$ and ${\bf\alpha}\in\mathbb{R}^{q}$ ($\|{\bf\alpha}\|=1$) are the unknown parameters to be estimated. Two families of robust estimators are introduced, which turn out to be consistent and asymptotically normally distributed. Their empirical influence function is also computed. The robust proposals improve on the behavior of the classical ones when outliers are present.

Joint work with Daniela Rodriguez.

### Robust statistics: an overview

The basic ingredients of a statistical analysis are, in general, a data set, a model and a number of statistical procedures (estimation methods and tests). These procedures require that certain assumptions be fulfilled in order to function properly. Examples of such assumptions are: normality of the observations, their independence and identical distribution (i.i.d.), homogeneity of variances, linearity or stationarity. If one or several of these assumptions are not satisfied, the results of the statistical procedures may become completely aberrant. When this happens the procedure is called "non-robust". If, on the contrary, the results do not change much in the presence of small deviations from the assumptions, the procedure is called "robust". The importance of robust statistical procedures comes from the fact that the ideal assumptions are rarely or never met in practice. Several simple examples will be presented to illustrate the severe effects of the violation of the underlying assumptions on the results of statistical procedures. After this introduction the talk goes on with a brief presentation of the basic concepts of robust statistics and with the discussion of robust methods in two main areas of statistics: regression and multivariate analysis. These are precisely the areas where the need for robust methods is strongest and where the research effort has been most concentrated. The talk ends with some considerations on the future of robust statistics.
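
As one simple illustration of the "non-robust" versus "robust" distinction (a made-up example, not one taken from the talk), the sketch below replaces a single observation with a gross error and compares how far the sample mean and the sample median move.

```python
from statistics import mean, median

# Hypothetical measurements around 10, then the same data with one
# observation replaced by a gross recording error.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean[:-1] + [1000.0]

# How far each location estimate moves when one value is corrupted.
mean_shift = abs(mean(contaminated) - mean(clean))
median_shift = abs(median(contaminated) - median(clean))
```

Here the mean moves by a large amount while the median does not move at all, which is the kind of behavior the two labels are meant to capture.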

### Analysis of a Stochastic Model for Flash Crowd Scenarios

In this talk we investigate the performance of a file sharing principle similar to the one implemented by eMule and BitTorrent. For this purpose, we consider a system composed of N peers becoming active at exponential random times, thus modeling a "flash crowd" scenario where an initial burst of clients occurs. The system is initiated with only one server offering the desired file, and the other peers try to download it after becoming active. Once the file has been downloaded by a peer, that peer immediately becomes a server. While the system starts in a congested state where all available servers are saturated by incoming demands, it shifts to a state where a growing number of servers are idle. We are interested in the time needed for this shift to happen, which is closely related to the transient performance of this file sharing principle. In spite of its apparent simplicity, this queueing model (with a random number of servers) turns out to be quite difficult to analyze. A formulation in terms of an urn and ball model is proposed and corresponding scaling results are derived. These asymptotic results are then compared against simulations.

### Oscillating $M/G/1/n$ and $GI/M(m)//n$ queues with batch arrivals

The oscillating $M/G/1/n$ and $GI/M(m)//n$ queues with batch arrivals oscillate between two phases that affect the service characteristics. When the system is in phase 1 the number of customers in the system ranges from $0$ to $b-1$, and when it is in phase 2 the number of customers in the system ranges from $a+1$ to $n$, where $a$ and $b$ are two integers such that $a \lt b$. An oscillating system evolves as follows: if at a given instant the system is operating in phase 1, the number of customers in the system is smaller than $b$, and the system remains in this phase until the number of customers in the system is greater than or equal to $b$. At that instant the system switches to phase 2 and remains in this phase until the first instant at which the number of customers in the system is less than or equal to $a$. At that instant the system switches to phase 1, and so on. The study of oscillating $M/G/1/n$ queues with batch arrivals takes advantage of their Markov regenerative structure. We use embedded Markov chains and characterize the limiting distribution of the number of customers in the system. Two other important performance measures of the system are also studied, which are particularly relevant in the analysis of data and video transmission over the Internet: the probabilities of consecutive customer losses during busy periods, and the duration of busy periods. The study of oscillating $GI/M(m)//n$ queues with batch arrivals combines the embedded chain methodology with the use of uniformization. We characterize the limiting probabilities of the number of customers in the system and determine the probabilities of consecutive losses during busy periods.
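
The phase-switching rule described above can be sketched in a few lines of Python. The function names and the example trajectory are hypothetical. Only the thresholds $a \lt b$ and the two switching conditions come from the abstract.

```python
def next_phase(phase, customers, a, b):
    """Return the phase (1 or 2) once the number in the system is `customers`."""
    if phase == 1 and customers >= b:
        return 2          # phase 1 ends when the count reaches b or more
    if phase == 2 and customers <= a:
        return 1          # phase 2 ends when the count drops to a or less
    return phase


def phase_path(counts, a, b, start_phase=1):
    """Track the phase along a sequence of customer counts."""
    phase = start_phase
    path = []
    for customers in counts:
        phase = next_phase(phase, customers, a, b)
        path.append(phase)
    return path


# Hypothetical trajectory with thresholds a = 1 and b = 5.
path = phase_path([0, 2, 4, 5, 3, 2, 1, 3], a=1, b=5)
```

The example shows the hysteresis in the rule: a count of 3 is compatible with either phase, and which one applies depends on the threshold that was crossed last.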
