Probability and Statistics Seminar

Past sessions


18/04/2013, 11:00 — 12:00 — Room P3.10, Mathematics Building
Ana Subtil, CEMAT, Instituto Superior Técnico, Universidade Técnica de Lisboa; Faculdade de Ciências, Universidade de Lisboa

Using Latent Class Models to Evaluate the Performance of Diagnostic Tests in the Absence of a Gold Standard

Diagnostic tests are helpful tools for decision-making in a biomedical context. In order to determine the clinical relevance and practical utility of each test, it is critical to assess its ability to correctly distinguish diseased from non-diseased individuals. Statistical analysis has an essential role in the evaluation of diagnostic tests, since it is used to estimate performance measures of the tests, such as sensitivity and specificity. Ideally, these measures are determined by comparison with a gold standard, i.e., a reference test with perfect sensitivity and specificity.

When no gold standard is available, treating the supposedly best available test as a reference may cause misclassifications that lead to biased estimates. Alternatively, Latent Class Models (LCM) may be used to estimate the performance measures of diagnostic tests, as well as the disease prevalence, in the absence of a gold standard. The most common LCM estimation approaches are maximum likelihood estimation using the Expectation-Maximization algorithm and Bayesian inference using Markov Chain Monte Carlo methods, via Gibbs sampling.
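
As a rough illustration of the maximum likelihood route, the sketch below runs the Expectation-Maximization algorithm for a two-class latent class model with three binary tests. It assumes conditional independence of the tests given disease status. The function names, starting values and the "true" parameters used to build the counts are hypothetical choices for this example, not taken from the talk.

```python
from itertools import product

def pattern_prob(pattern, prev, sens, spec):
    """Return P(pattern, diseased) and P(pattern, non-diseased),
    assuming the tests are conditionally independent given disease status."""
    p_d, p_h = prev, 1.0 - prev
    for result, se, sp in zip(pattern, sens, spec):
        p_d *= se if result else 1.0 - se
        p_h *= 1.0 - sp if result else sp
    return p_d, p_h

def em_lcm(counts, n_tests, prev=0.5, sens=None, spec=None, n_iter=500):
    """EM for a two-class latent class model; counts maps 0/1 patterns to frequencies."""
    sens = list(sens or [0.8] * n_tests)  # starting values (assumed)
    spec = list(spec or [0.8] * n_tests)
    total = sum(counts.values())
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each response pattern
        post = {}
        for pat in counts:
            p_d, p_h = pattern_prob(pat, prev, sens, spec)
            post[pat] = p_d / (p_d + p_h)
        # M-step: update prevalence, sensitivities and specificities
        n_d = sum(counts[p] * post[p] for p in counts)
        n_h = total - n_d
        prev = n_d / total
        for j in range(n_tests):
            sens[j] = sum(counts[p] * post[p] for p in counts if p[j]) / n_d
            spec[j] = sum(counts[p] * (1.0 - post[p]) for p in counts if not p[j]) / n_h
    return prev, sens, spec

# Hypothetical parameters, used only to build expected pattern counts
true_prev, true_sens, true_spec = 0.3, [0.90, 0.80, 0.85], [0.95, 0.90, 0.85]
counts = {}
for pat in product((0, 1), repeat=3):
    p_d, p_h = pattern_prob(pat, true_prev, true_sens, true_spec)
    counts[pat] = 10000 * (p_d + p_h)

prev_hat, sens_hat, spec_hat = em_lcm(counts, n_tests=3)
print(round(prev_hat, 3), [round(s, 3) for s in sens_hat], [round(s, 3) for s in spec_hat])
```

With three tests and one population the model has as many parameters as free cell probabilities, so the starting values matter: starting with sensitivities and specificities above 0.5 steers EM away from the label-swapped solution.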

This talk illustrates the use of Bayesian Latent Class Models (BLCM) in the context of malaria and canine dirofilariosis. In each case, multiple diagnostic tests were applied to distinct subpopulations. To analyze the subpopulations simultaneously, a product multinomial distribution was considered, since the subpopulations were independent. By introducing constraints, it was possible to explore differences and similarities between subpopulations in terms of prevalence, sensitivities and specificities.

We also discuss statistical issues such as the assumption of conditional independence, model identifiability, sampling strategies and prior distribution elicitation.

05/04/2013, 14:30 — 15:30 — Room P3.10, Mathematics Building
, Faculdade de Economia / LIAAD - INESC TEC, Universidade do Porto

Taking Variability in Data into Account: Symbolic Data Analysis

Symbolic Data Analysis, introduced by E. Diday in the late 1980s, is concerned with analysing data presenting intrinsic variability, which is to be explicitly taken into account. In classical Statistics and Multivariate Data Analysis, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals described by their age, salary, education level, marital status, etc.; cars, each described by its weight, length, power, engine displacement, etc.; students, for each of whom the marks in different subjects were recorded. But when the elements of interest are classes or groups of some kind - the citizens living in given towns; teams, consisting of individual players; car models, rather than specific vehicles; classes and not individual students - then there is variability inherent to the data. Reducing this variability by taking central tendency measures - mean values, medians or modes - leads to an excessive loss of information.

Symbolic Data Analysis provides a framework for representing data with variability, using new variable types. Methods have also been developed that suitably take data variability into account. Symbolic data may be represented using the usual matrix-form data arrays, where each entity is represented in a row and each column corresponds to a different variable - but now the elements of each cell are generally not single real values or categories, as in the classical case, but rather finite sets of values, intervals or, more generally, distributions.
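
To make the idea of an interval-valued cell concrete, the following sketch aggregates individual records into one row per group, storing a (min, max) interval for each variable. The helper name and the car records are hypothetical, and min/max is only one possible way of building intervals.

```python
from collections import defaultdict

def to_interval_data(records, group_key, variables):
    """Aggregate individual records into one interval-valued row per group."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[group_key]].append(rec)
    table = {}
    for group, recs in grouped.items():
        # Each cell holds an interval (min, max) instead of a single value
        table[group] = {
            var: (min(r[var] for r in recs), max(r[var] for r in recs))
            for var in variables
        }
    return table

# Hypothetical individual vehicles, grouped by car model
cars = [
    {"model": "A", "weight": 1100, "power": 75},
    {"model": "A", "weight": 1180, "power": 90},
    {"model": "A", "weight": 1150, "power": 110},
    {"model": "B", "weight": 1450, "power": 140},
    {"model": "B", "weight": 1520, "power": 190},
]

symbolic_table = to_interval_data(cars, "model", ["weight", "power"])
for model, row in symbolic_table.items():
    print(model, row)
```

Replacing each interval by its midpoint would recover a classical data table, at the cost of discarding the within-group variability.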

In this talk we shall introduce and motivate the field of Symbolic Data Analysis and present in some detail the new variable types that have been introduced to represent variability, illustrating them with some examples. We shall furthermore discuss some issues that arise when analysing data that do not follow the usual classical model, and present data representation models for some variable types.

14/03/2013, 11:00 — 12:00 — Room P3.10, Mathematics Building
Iryna Okhrin & Ostap Okhrin, Faculty of Business Administration and Economics at the Europa-Universität Viadrina Frankfurt (Oder) & Wirtschaftswissenschaftliche Fakultät at the Humboldt-Universität zu Berlin

Forecasting The Temperature Data & Localising Temperature Risk

Forecasting The Temperature Data:

This paper aims at describing intraday temperature variations, which is a challenging task in modern econometrics and environmetrics. Given high-frequency data, we separate the dynamics within a day and over days. Three main models have been considered in our study. As the benchmark we employ a simple truncated Fourier series with autocorrelated residuals. The second model uses functional data analysis and is called the shape invariant model (SIM). The third one is the dynamic semiparametric factor model (DSFM). In this work we discuss the merits and pitfalls of all the methods and compare their in- and out-of-sample performances.
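
A minimal sketch of the deterministic part of such a benchmark: a truncated Fourier series fitted by ordinary least squares. The 24-hour period, the number of harmonics and the synthetic noise-free series are assumptions for illustration. Modelling the autocorrelation of the residuals (for example with an AR model) is left out.

```python
import numpy as np

def fourier_design(t, period, n_harmonics):
    """Design matrix with an intercept and cos/sin pairs up to n_harmonics."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k * t / period
        cols.append(np.cos(w))
        cols.append(np.sin(w))
    return np.column_stack(cols)

def fit_truncated_fourier(t, y, period, n_harmonics):
    """Least-squares fit; returns the coefficients and the fitted values."""
    X = fourier_design(t, period, n_harmonics)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

# Synthetic half-hourly "temperatures" over one week (hypothetical values)
t = np.arange(0, 24 * 7, 0.5)
y = 15.0 + 4.0 * np.cos(2 * np.pi * t / 24) - 2.0 * np.sin(2 * np.pi * t / 24)

coef, fitted = fit_truncated_fourier(t, y, period=24.0, n_harmonics=2)
residuals = y - fitted  # in practice these would be checked for autocorrelation
print(np.round(coef, 6))
```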


Localising Temperature Risk:

On the temperature derivative market, modelling temperature volatility is an important issue for pricing and hedging. In order to apply the pricing tools of financial mathematics, one needs to isolate a Gaussian risk factor. A conventional model for temperature dynamics is a stochastic model with seasonality and intertemporal autocorrelation. Empirical work based on seasonality and autocorrelation correction reveals that the obtained residuals are heteroscedastic with a periodic pattern. The object of this research is to estimate this heteroscedastic function so that, after scale normalisation, a pure standardised Gaussian variable appears. Earlier works investigated temperature risk in different locations and showed that neither parametric component functions nor a local linear smoother with constant smoothing parameter are flexible enough to generally describe the variance process well. Therefore, we consider a local adaptive modelling approach to find, at each time point, an optimal smoothing parameter to locally estimate the seasonality and volatility. Our approach provides a more flexible and accurate fitting procedure for localised temperature risk by achieving nearly normal risk factors. We also employ our model to forecast the temperature in different cities and compare it to the model developed in Campbell and Diebold (2005).

28/02/2013, 11:00 — 12:00 — Room P3.10, Mathematics Building
Graciela Boente, Universidad de Buenos Aires and CONICET, Argentina

S-estimators for functional principal component analysis

A well-known property of functional principal components is that they provide the best q-dimensional approximation to random elements over separable Hilbert spaces. Our approach to robust estimates of principal components for functional data is based on this property since we consider the problem of robustly estimating these finite-dimensional approximating linear spaces. We propose a new class of estimators for principal components based on robust scale functionals by finding the lower dimensional linear space that provides the best prediction for the data. In analogy to the linear regression case, we call this proposal S-estimators. This method can also be applied to sparse data sets when the underlying process satisfies a smoothness condition with respect to the functional associated with the scale defining the S-estimators. The motivation is a problem of outlier detection in atmospheric data collected by weather balloons launched into the atmosphere and stratosphere.

14/02/2013, 11:00 — 12:00 — Room P3.10, Mathematics Building
, Department of Operations, Faculty of Business and Economics, University of Lausanne

Technological Change: A Burden or a Chance

The photography industry underwent a disruptive change in technology during the 1990s, when traditional film was replaced by digital photography (see e.g. The Economist, January 14th 2012). Kodak in particular was strongly affected: by 1976 Kodak accounted for 90% of film and 85% of camera sales in America, making it a near-monopoly there. Kodak's revenues were nearly 16 billion in 1996, but the prediction is that they will decrease to 6.2 billion in 2011. Kodak tried to squeeze as much money out of the film business as possible while it prepared for the switch to digital photography. The result was that Kodak did eventually build a profitable business out of digital cameras, but it lasted only a few years before camera phones overtook it.

According to Mr Komori, the former CEO of Fujifilm (2000-2003), Kodak aimed to be a digital company, but that is a small business and not enough to support a big company. For Kodak it was like seeing a tsunami coming when there is nothing you can do about it, according to Mr Christensen in The Economist (January 14th 2012).

In this paper we study the problem of a firm that produces with a current technology for which it faces declining sales volume. It has two options: it can either exit the industry or invest in a new technology with which it can produce an innovative product. We distinguish between two scenarios: the resulting new market can be booming, or it can end up smaller than the old market used to be.

We derive the optimal strategy of a firm for each scenario and specify the probabilities with which a firm would decide to innovate or to exit. Furthermore, we assume that the firm can additionally choose to suspend production for some time when demand is too low, instead of immediately taking the irreversible decision to exit the market. We derive conditions under which such a suspension area exists and show how long a firm is expected to remain in this suspension area before resuming production, investing in the new technology or exiting the market.

04/06/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
Bruno de Sousa, Instituto de Higiene e Medicina Tropical, UNL, CMDT

Understanding the state of men's health in Europe through a life expectancy analysis

A common feature of the health of men across Europe is their higher rates of premature mortality and shorter life expectancy than women. Following the publication of the first State of Men's Health in Europe we sought to explore possible reasons.

We described trends in life expectancy in the European Union member States (EU27) between 1999 and 2008 using mortality data obtained from Eurostat. We then used Pollard's decomposition method to identify the contribution of deaths from different causes and at different age groups to differences in life expectancy. We first examined the change in life expectancy for men and for women between the beginning and end of this period. Second, we examined the gap in life expectancy between men and women at the beginning and end of this period.

Between 1999 and 2008 life expectancy in the EU27 increased by 2.77 years for men and by 2.12 years for women. Most of these improvements were due to reductions in mortality at ages over 60, with cardiovascular disease accounting for 1.40 years of the reduction in men. In 2008 life expectancy of men in the EU27 was 6.04 years lower than that of women. Deaths from all major groups of causes, and at all ages, contribute to this gap, with external causes contributing 1.00 year, cardiovascular disease 1.75 years and neoplasms 1.71 years.

Improvements in the life expectancy of men and women have mostly occurred at older ages. There has been little improvement in the high rate of premature death in younger men, which suggests a need for interventions to tackle it. The variations in premature death and life expectancy among men shown in the new European Commission report highlight the impact of poor socio-economic conditions. The more pronounced adverse effect on the health of men suggests that men suffer from 'heavy impact diseases' that are more quickly life-limiting, with women more likely to survive, but with poorer health.

16/05/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
Verena Hagspiel, CentER, Department of Econometrics and Operations Research, Tilburg University, The Netherlands

Optimal Technology Adoption when the Arrival Rate of New Technologies Changes

Our paper contributes to the literature on technology adoption. In most models in this literature it is assumed that, after the arrival of a new technology, the probability of the next arrival is constant. We extend this approach by assuming that after the last technology jump the probability of a new arrival can change. Right after the arrival of a new technology the intensity equals a specific value, which switches if no new technology has arrived within a certain period after the last arrival. We look at different scenarios, depending on whether the firm is threatened by a drop in the arrival rate after a certain time period or expects the rate of new arrivals to rise. We analyze the effect of the variance of the time between two consecutive arrivals on the optimal investment timing and show that larger variance accelerates investment in a new technology. We find that firms often adopt a new technology with a time lag after its introduction, a phenomenon frequently observed in practice. Regarding a firm's technology release strategy, we explain why clear signals set by the regular and steady release of new product generations stimulate customers' buying behavior. Depending on whether the arrival rate is assumed to change or to be constant over time, the optimal technology adoption timing changes significantly. In a further step we add an additional source of uncertainty to the problem and assume that the length of the time period after which the arrival intensity changes is not known to the firm in advance. Here, we find that increasing uncertainty accelerates investment, a result that is opposite to standard real options theory.
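
The arrival mechanism described above can be sketched as follows, assuming (as a simplification of ours) an exponential-type waiting time whose intensity is lam1 up to a deterministic switch time T after the last arrival and lam2 afterwards. The symbols lam1, lam2 and T are illustrative names, not the paper's notation. The closed forms follow from integrating the survival function.

```python
import math

def survival(t, lam1, lam2, T):
    """P(no new technology has arrived by time t after the last arrival)."""
    if t <= T:
        return math.exp(-lam1 * t)
    return math.exp(-lam1 * T - lam2 * (t - T))

def mean_interarrival(lam1, lam2, T):
    """E[X] = integral of the survival function over [0, infinity)."""
    q = math.exp(-lam1 * T)  # probability of no arrival before the switch
    return (1.0 - q) / lam1 + q / lam2

def var_interarrival(lam1, lam2, T):
    """Var[X] using E[X^2] = 2 * integral of t * S(t) dt."""
    q = math.exp(-lam1 * T)
    second_moment = 2.0 * (
        (1.0 - q * (1.0 + lam1 * T)) / lam1 ** 2
        + q * (T / lam2 + 1.0 / lam2 ** 2)
    )
    return second_moment - mean_interarrival(lam1, lam2, T) ** 2

# Hypothetical numbers: the arrival rate drops from 2.0 to 0.5 after T = 1.0
print(mean_interarrival(2.0, 0.5, 1.0), var_interarrival(2.0, 0.5, 1.0))
```

When lam1 equals lam2 the formulas reduce to the constant-rate case, with mean 1/lam and variance 1/lam^2, which is a convenient sanity check.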

02/05/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
, Departamento de Matemática - CEMAT - IST

On the Aging Properties of the Run Length of Markov-Type Control Charts

A change in a production process must be detected quickly so that a corrective action can be taken. Thus, it comes as no surprise that the run length (RL) is usually used to describe the performance of a quality control chart.

This popular performance measure has a phase-type distribution when dealing with Markov-type charts, namely, cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) charts, as opposed to a geometric distribution, when standard Shewhart charts are in use.
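
The contrast between the two distributions can be illustrated through the average run length (ARL). For a discrete phase-type run length with initial probability vector alpha and sub-stochastic matrix Q over the transient (non-signalling) states, the mean is alpha (I - Q)^{-1} 1. A geometric run length is the special case with a single transient state. The two-state matrix below is a hypothetical example, not a discretised CUSUM or EWMA chart.

```python
import numpy as np

def arl_phase_type(Q, alpha):
    """Mean of a discrete phase-type run length: alpha (I - Q)^{-1} 1."""
    Q = np.asarray(Q, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    ones = np.ones(Q.shape[0])
    expected_from_state = np.linalg.solve(np.eye(Q.shape[0]) - Q, ones)
    return float(alpha @ expected_from_state)

def arl_shewhart(p_signal):
    """Mean of a geometric run length with per-sample signal probability p_signal."""
    return 1.0 / p_signal

# Hypothetical two-state Markov-type chart, started in the first transient state
Q = [[0.90, 0.08],
     [0.10, 0.80]]
print(arl_phase_type(Q, alpha=[1.0, 0.0]))

# A single transient state reproduces the Shewhart (geometric) case
print(arl_phase_type([[1 - 0.0027]], alpha=[1.0]), arl_shewhart(0.0027))
```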

In this talk, we briefly discuss sufficient conditions on the associated probability transition matrix to deal with run lengths with aging properties such as new better than used in expectation, new better than used, and increasing hazard rate.

We also explore the implications of these aging properties of the run lengths, namely when comparing the in-control and out-of-control variances of the run lengths of matched in-control Shewhart and Markov-type control charts.


Keywords: Phase-type distributions; Run length; Statistical process control; Stochastic ordering.


Morais, M.C. and Pacheco, A. (2012). A note on the aging properties of the run length of Markov-type control charts. Sequential Analysis 31, 88-98.

16/04/2012, 15:00 — 16:00 — Room P3.10, Mathematics Building
, Matemática/DCEB, ISA/UTL and CEAUL/UL

Variable space: where statistics and geometry marry. The case of Mahalanobis distances.

The usual way of conceptualizing the graphical representation of a data matrix $X_{n\times p}$ of individuals $\times$ variables is to associate an axis with each variable and, in that Cartesian reference frame, to represent each individual by a point whose coordinates are given by the row of $X$ corresponding to that individual. The popularity of this representation in the space of individuals ($\mathbb{R}^p$) stems largely from the fact that it can be visualized for bivariate or trivariate data. However, for a larger number of variables ($p > 3$) this advantage no longer exists.

An alternative representation is important in data analysis and modelling. In variable space, each axis corresponds to an individual and each variable is represented by a vector from the origin, defined by the $n$ coordinates of the corresponding matrix column. This representation of the variables in $\mathbb{R}^n$ has the great advantage of marrying statistical and geometric concepts, allowing a better understanding of the former. It has solid roots in the French school of data analysis, but its potential is not always exploited.

This talk begins by recalling the geometric concepts corresponding to fundamental indicators of univariate and bivariate statistics (mean, standard deviation, coefficient of variation or correlation coefficient) or multivariate statistics (illustrated by the case of principal component analysis). The discussion is then deepened in the context of multiple linear regression, whose fundamental concepts (coefficient of determination, the three sums of squares and their fundamental relation) have a geometric interpretation in variable space.

Next, we discuss the usefulness of this geometric representation in the study of Mahalanobis distances, which play a leading role in multivariate statistics. We show how (squared) Mahalanobis distances measure the inclination of the subspace of $\mathbb{R}^n$ spanned by the columns of the centred data matrix, the subspace $\mathcal{C}(X_c)$, relative to the system of axes. In particular, we show that the Mahalanobis distances to the centre, \[D^2_{x_i,\overline{x}}=(x_i-\overline{x})^t S^{-1} (x_i-\overline{x}),\] are a function only of $n$ and of the angle $\theta_i$ between the axis corresponding to individual $i$ and $\mathcal{C}(X_c)$, while the (squared) Mahalanobis distance between two individuals, \[D^2_{x_i,x_j}=(x_i-x_j)^t S^{-1} (x_i-x_j),\] is likewise a function only of $n$ and of the angle between $\mathcal{C}(X_c)$ and the bisector generated by $e_i-e_j$, where $e_i$ and $e_j$ are the canonical vectors of $\mathbb{R}^n$ associated with the two individuals. Some recent upper bounds and other important properties of these distances (Gath & Hayes, 2006 and Branco & Pires, 2011) are a direct expression of these geometric relations. Although Mahalanobis distances concern the individuals, the geometric concepts associated with them in variable space can be exploited to deepen and extend these results.
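
A numerical check of this geometric relation, under the assumption that $S$ is the sample covariance matrix with divisor $n-1$ (the abstract does not state the divisor). With $H$ the orthogonal projector onto $\mathcal{C}(X_c)$, the squared distance to the centre equals $(n-1)\cos^2\theta_i$, and the squared distance between two individuals equals $2(n-1)\cos^2$ of the angle between $e_i-e_j$ and $\mathcal{C}(X_c)$. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 3
X = rng.normal(size=(n, p))  # synthetic data, for illustration only

Xc = X - X.mean(axis=0)                  # centred data matrix X_c
S = Xc.T @ Xc / (n - 1)                  # covariance with divisor n-1 (assumed)
S_inv = np.linalg.inv(S)

# Squared Mahalanobis distances of each individual to the centre
d2_centre = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)

# Orthogonal projector onto C(X_c); its diagonal gives cos^2(theta_i)
H = Xc @ np.linalg.pinv(Xc)
cos2_theta = np.diag(H)

# Squared Mahalanobis distance between individuals 0 and 1
diff = Xc[0] - Xc[1]
d2_pair = diff @ S_inv @ diff
v = np.eye(n)[0] - np.eye(n)[1]          # e_i - e_j in R^n
cos2_pair = (v @ H @ v) / (v @ v)

print(np.allclose(d2_centre, (n - 1) * cos2_theta))
print(np.isclose(d2_pair, 2 * (n - 1) * cos2_pair))
```

Since a squared cosine is at most 1, the first identity bounds every squared distance to the centre by $n-1$ under this choice of divisor.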

26/03/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
Russell Alpizar-Jara, Research Center in Mathematics and Applications (CIMA-U.E.), Department of Mathematics, University of Évora

An overview of capture-recapture models

Capture-recapture methods have been widely used in Biological Sciences to estimate population abundance and related demographic parameters (births, deaths, immigration, or emigration). More recently, these models have been used to estimate community dynamics parameters such as species richness, rates of extinction, colonization and turnover, and other metrics that require presence/absence data of species counts. In this presentation, we will use the latter application to illustrate some of the concepts and the underlying theory of capture-recapture models. In particular, we will review basic closed-population and open-population models, and combinations of the two. We will briefly mention other applications of these models in Medical, Social and Computer Sciences.
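
As a small illustration of the closed-population case, the sketch below computes the classical two-sample Lincoln-Petersen estimator of population size and Chapman's bias-corrected variant. These are standard textbook estimators and are not necessarily the models covered in the talk. The sample sizes are made up.

```python
def lincoln_petersen(n1, n2, m2):
    """Two-sample estimate of population size.

    n1: animals marked in the first sample
    n2: animals caught in the second sample
    m2: marked animals recaptured in the second sample
    """
    if m2 == 0:
        raise ValueError("no recaptures: the estimator is undefined")
    return n1 * n2 / m2

def chapman(n1, n2, m2):
    """Chapman's bias-corrected version, defined even when m2 == 0."""
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1

# Hypothetical survey: 200 marked, 150 caught later, 30 of them marked
print(lincoln_petersen(200, 150, 30))
print(chapman(200, 150, 30))
```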

Keywords: Capture-recapture experiments; multinomial and mixture distributions; non-parametric and maximum likelihood estimation; population size estimation.

07/03/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
, CEAUL - DEIO - FCUL - University of Lisbon

Why we need non-linear time series models and why we are not using them so often

The Wold Decomposition theorem says that, under fairly general conditions, a stationary time series $X_t$ has a unique linear causal representation in terms of uncorrelated random variables. However, the Wold Decomposition theorem gives us a representation, not a model for $X_t$, in the sense that we can only recover uniquely the moments of $X_t$ up to second order from this representation, unless the input series is a Gaussian sequence. If we look for models for $X_t$, then we should look for such a model within the class of convergent Volterra series expansions. If we have to go beyond second order properties, and many real data sets from financial and environmental sciences indicate that we should, then linear models with iid Gaussian input are a very tiny, insignificant fraction of the possible models for a stationary time series, corresponding to the first term of the infinite order Volterra expansion. On the other hand, Volterra series expansions are not particularly useful as a class of models: conditions for stationarity and invertibility are hard, if not impossible, to check, so they have very limited use as models for time series unless the input series is observable. From a prediction point of view, the Projection Theorem for Hilbert spaces tells us how to obtain the best linear predictor for $X_{t+k}$ within the linear span of $\{X_t, X_{t-1}, \ldots\}$, but when linear predictors are not sufficiently good, it is not straightforward to find, if possible at all, the best predictor within richer subspaces constructed over $\{X_t, X_{t-1}, \ldots\}$. It is therefore important to look for classes of nonlinear models that improve upon the linear predictor and are sufficiently general, but at the same time sufficiently flexible to work with. There are many ways a time series can be nonlinear. As a consequence, there are many classes of nonlinear models to explain such nonlinearities, but their probabilistic characteristics are difficult to study, not to mention the difficulties associated with modeling issues. Likelihood based inference is a particularly difficult issue since, for most nonlinear processes, we cannot even write down the likelihood. However, there have recently been very exciting advances in simulation based inferential methods, such as sequential Markov Chain Monte Carlo, particle filters and Approximate Bayesian Computation methods for generalized state space models, which we will mention briefly.
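
A small simulation can illustrate why second-order properties are not enough. The process $X_t = e_t e_{t-1}$ with iid standard normal $e_t$ (an example of our choosing, not from the talk) is uncorrelated, so the best linear predictor is trivial, yet its squares are correlated at lag 1 with theoretical correlation 0.25. The seed and sample size are arbitrary.

```python
import numpy as np

def lag1_corr(x):
    """Sample lag-1 autocorrelation."""
    x = x - x.mean()
    return float((x[1:] * x[:-1]).sum() / (x * x).sum())

rng = np.random.default_rng(42)
e = rng.standard_normal(200_001)   # iid Gaussian input
x = e[1:] * e[:-1]                 # X_t = e_t * e_{t-1}: white noise, but not independent

r_x = lag1_corr(x)                 # close to 0: looks unpredictable to linear methods
r_x2 = lag1_corr(x ** 2)           # clearly positive: dependence shows up in the squares

print(round(r_x, 3), round(r_x2, 3))
```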

22/02/2012, 14:30 — 15:30 — Room P3.10, Mathematics Building
, CEAUL-DEIO- FC - Universidade de Lisboa

How far can M(m)an go?

This seminar addresses the question “What is the longest long jump within M(m)an's reach, given the current state of the art?” To answer it we use the crème de la crème, i.e., the data are collected from the best Olympic athletes in the event, taken from the World Athletics Competitions - Long Jump Men Outdoors database. The approach to the problem is based on Extreme Value Theory and its statistical techniques. Only the best performances from the World top lists are used. The final estimate of the potential record, i.e., the upper bound of the long jump event, allows inference about the best possible individual mark under current conditions, both in terms of knowledge of the phenomenon and with respect to the conditions and rules for recording marks in this sport. The current record of 8.95 m was set by Mike Powell (USA) in Tokyo on 30/08/1991. In Extreme Value terms, the problem amounts to estimating the upper endpoint of the support of a distribution in the max-domain of attraction of the Gumbel distribution.

Keywords: Extreme Values in Sport, Extreme Value Theory, Estimation of the Upper Endpoint of the Support in the Gumbel Domain, Semi-parametric Approach to Statistics of Extremes.

10/02/2012, 11:00 — 12:00 — Room P3.10, Mathematics Building
Patrícia Ferreira, CEMAT - Departamento de Matemática - IST

Misleading signals in joint schemes for the mean and the variance of processes

When the aim is to control the mean and the variance of a process simultaneously, it is common to use a joint scheme. Such a scheme consists of two control charts that run simultaneously, one monitoring the process mean and the other monitoring the process variance. The use of this type of scheme can lead to misleading signals, associated, for example, with the following situations:

  • the process mean is out of control, yet the chart for the variance signals before the chart used to monitor the mean;
  • the process variance is out of control, but the chart for the mean is the first to signal.

Misleading signals are valid signals that can lead the quality control operator to trigger inappropriate actions to correct a nonexistent cause. It is therefore important to consider the frequency with which these signals occur as a performance measure of joint schemes. In this work we analyse the performance of joint schemes from the point of view of the probability of a misleading signal, with special emphasis on joint schemes for i.i.d. and autocorrelated univariate processes.
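
For the simplest case, the probability of a misleading signal has a closed form. The sketch assumes a joint Shewhart-type scheme for i.i.d. output in which the two charts signal independently at each sample, so both run lengths are geometric. The per-sample signal probabilities below are hypothetical, and the function name is ours.

```python
def prob_misleading_signal(p_target, p_other):
    """P(the 'other' chart signals strictly before the chart for the shifted parameter).

    p_target: per-sample signal probability of the chart that monitors the
              parameter which is actually out of control
    p_other:  per-sample signal probability of the other chart
    Assumes independent geometric run lengths for the two charts.
    """
    return p_other * (1.0 - p_target) / (1.0 - (1.0 - p_target) * (1.0 - p_other))

# Hypothetical example: the mean has shifted (chart for the mean signals w.p. 0.2
# per sample) while the variance chart only produces false alarms (w.p. 0.0027)
pms = prob_misleading_signal(p_target=0.2, p_other=0.0027)
print(pms)
```

In this example about 1% of the first signals come from the variance chart, which would point the operator towards the wrong parameter.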

19/01/2012, 11:00 — 12:00 — Room P3.10, Mathematics Building
Peter Kort, Tilburg University

Strategic Capacity Investment Under Uncertainty

In this talk we consider investment decisions within an uncertain dynamic and competitive framework. Each investment decision involves determining the timing and the capacity level. In this way we extend the main bulk of real options theory, where the capacity level is given. We consider a monopoly setting as well as a duopoly setting. Our main results are the following. In the duopoly setting we provide a fully dynamic analysis of entry deterrence/accommodation strategies. Contrary to the seminal industrial organization analyses that are based on static models, we find that entry can only be deterred temporarily. To keep its monopoly position as long as possible, the first investor overinvests in capacity. In very uncertain economic environments the first investor eventually ends up being the largest firm in the market. If uncertainty is moderate, a reduced value of waiting implies that the preemption mechanism forces the first investor to invest so soon that a large capacity cannot be afforded. It will then end up with a capacity level lower than that of the second investor.

04/05/2011, 14:00 — 15:00 — Room P4.35, Mathematics Building
Verena Hagspiel, Tilburg University, Netherlands

Production Flexibility and Capacity Investment under Demand Uncertainty

The paper takes a real option approach to consider optimal capacity investment decisions under uncertainty. Besides the timing of the investment, the firm also has to decide on the capacity level. Concerning the production decision, we study a flexible and an inflexible scenario. The flexible firm can costlessly adjust production over time with the capacity level as the upper bound, while the inflexible firm fixes production at capacity level from the moment of investment onwards. We find that the flexible firm invests in higher capacity than the inflexible firm, where the capacity difference increases with uncertainty. For the flexible firm the initial occupation rate can be quite low, especially when investment costs are concave and the economic environment is uncertain. As to the timing of the investment there are two contrary effects. First, the flexible firm has an incentive to invest earlier, because flexibility raises the project value. Second, the flexible firm has an incentive to invest later, because costs are larger due to the higher capacity level. The latter effect dominates in highly uncertain economic environments.

01/03/2011, 11:00 — 12:00 — Room P3.10, Mathematics Building
Christine Fricker, INRIA, France

Performance of passive optical networks

We introduce PONs (Passive Optical Networks), which are designed to provide high speed access to users via fiber links. The problem for the OLT (Optical Line Terminal) is to dynamically share the wavelength bandwidth among the ONUs (Optical Network Units). With an optimal algorithm, the system can be modeled as a relatively standard polling system. Due to technological constraints, the number of servers that can visit one queue at the same time is limited in this polling system. The performance of the system is directly related to the stability condition of the polling model, which is unknown in general. A mean field approach provides a limit stability condition when the system gets large.

07/10/2010, 16:30 — 17:30 — Room P3.10, Mathematics Building
Magnus Fontes, Lund University

Mathematics - A Catalyst for Innovation - Giving European Industry an Edge

We will discuss the role of Mathematics in Industry and in innovation processes. The focus will be European and we will look at good examples provided e.g. by the experiences of the network European Consortium for Mathematics in Industry (ECMI). I will also present the ongoing ESF Forward Look "Mathematics and Industry" and discuss possible future developments on a European scale.

21/07/2010, 15:00 — 16:00 — Room P3.10, Mathematics Building
Graciela Boente, Universidad de Buenos Aires and CONICET, Argentina

Robust inference in generalized linear models with missing responses

The generalized linear model (GLM; McCullagh and Nelder, 1989) is a popular technique for modelling a wide variety of data. It assumes that the observations are independent and that the conditional distribution of $y|x$ belongs to the canonical exponential family. In this situation, the mean $E(y|x)$ is modelled linearly through a known link function. Robust procedures for generalized linear models have been considered, among others, by Stefanski et al. (1986), Künsch et al. (1989), Bianco and Yohai (1996), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2002) and Bianco et al. (2005). Recently, robust tests for the regression parameter under a logistic model were considered by Bianco and Martínez (2009).

In practice, some response variables may be missing, by design (as in two-stage studies) or by happenstance. As is well known, the methods described above are designed for complete data sets, and problems arise when responses may be missing while covariates are completely observed. Even if there are many situations in which both the response and the explanatory variables are missing, we will focus only on the case where missing data occur in the responses. Missingness of responses is very common in opinion polls, market research surveys, mail enquiries, socio-economic investigations, medical studies and other scientific experiments where the explanatory variables can be controlled. This pattern is common, for example, in the double sampling scheme proposed by Neyman (1938). Hence, we will be interested in robust inference when the response variable may have missing observations but the covariate $x$ is totally observed.

In the regression setting with missing data, a common method is to impute the incomplete observations and then estimate the conditional or unconditional mean of the response variable with the completed sample. The methods considered include linear regression (Yates, 1933), kernel smoothing (Cheng, 1994; Chu and Cheng, 1995), nearest neighbor imputation (Chen and Shao, 2000), semiparametric estimation (Wang et al., 2004; Wang and Sun, 2007), nonparametric multiple imputation (Aerts et al., 2002; González-Manteiga and Pérez-Gonzalez, 2004) and empirical likelihood over the imputed values (Wang and Rao, 2002), among others. All these proposals are very sensitive to anomalous observations since they are based on least squares approaches.

In this talk, we introduce a robust procedure to estimate the regression parameter under a GLM which includes, when there are no missing data, the family of estimators previously studied. It is shown that the robust estimates of the regression parameter are root-$n$ consistent and asymptotically normally distributed. A robust procedure to test simple hypotheses on the regression parameter is also considered. The finite sample properties of the proposed procedure are investigated through a Monte Carlo study where the robust test is also compared with nonrobust alternatives.

01/06/2010, 16:00 — 17:00 — Room P4.35, Mathematics Building
Ana Pires, Universidade Técnica de Lisboa - Instituto Superior Técnico and CEMAT

CSI: are Mendel's data "Too Good to be True?"

Gregor Mendel (1822-1884) is almost unanimously recognized as the founder of modern genetics. However, long ago, a shadow of doubt was cast on his integrity by another eminent scientist, the statistician and geneticist, Sir Ronald Fisher (1890-1962), who questioned the honesty of the data that form the core of Mendel's work. This issue, nowadays called "the Mendel-Fisher controversy", can be traced back to 1911, when Fisher first presented his doubts about Mendel's results, though he only published a paper with his analysis of Mendel's data in 1936.

A large number of papers have been published about this controversy culminating with the publication in 2008 of a book (Franklin et al., "Ending the Mendel-Fisher controversy"), aiming at ending the issue, definitely rehabilitating Mendel's image. However, quoting from Franklin et al., "the issue of the `too good to be true' aspect of Mendel's data found by Fisher still stands".

We have submitted Mendel's data and Fisher's statistical analysis to extensive computations and simulations, attempting to discover a hidden explanation or hint that could help answer the questions: is Fisher right or wrong, and, if Fisher is right, is there any reasonable explanation for the "too good to be true" results other than deliberate fraud? In this talk some results of this investigation and the conclusions obtained will be presented.
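
To show what "too good to be true" means in practice, the sketch below runs a chi-square goodness-of-fit test for a 3:1 ratio on one of Mendel's experiments. The counts used (5474 round versus 1850 wrinkled seeds) are the commonly quoted figures for that experiment and do not come from this abstract. Fisher's argument concerned the aggregate over many experiments, and that aggregation is not reproduced here. A lower-tail probability near 0 would indicate a fit closer than chance alone usually produces.

```python
import math

def chi_square_gof(observed, expected_ratios):
    """Pearson chi-square statistic against the given expected ratios."""
    total = sum(observed)
    ratio_sum = sum(expected_ratios)
    expected = [total * r / ratio_sum for r in expected_ratios]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

def chi2_cdf_1df(x):
    """CDF of the chi-square distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))

# Commonly quoted counts for Mendel's seed-shape experiment (round, wrinkled)
observed = [5474, 1850]
stat, expected = chi_square_gof(observed, [3, 1])
lower_tail = chi2_cdf_1df(stat)  # P(chi-square <= stat): small means "too good"

print(expected, round(stat, 4), round(lower_tail, 3))
```

A single experiment like this one is unremarkable on its own; the controversy arises only when the statistics of all experiments are combined.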

18/05/2010, 16:00 — 17:00 — Room P4.35, Mathematics Building
Alex Trindade, Texas Tech University

Fast and Accurate Inference for the Smoothing Parameter in Semiparametric Models

We adapt the method developed in Paige, Trindade, and Fernando (2009) in order to make approximate inference on optimal smoothing parameters for penalized spline and partially linear models. The method is akin to a parametric bootstrap where Monte Carlo simulation is replaced by saddlepoint approximation, and is applicable whenever the underlying estimator can be expressed as the root of an estimating equation that is a quadratic form in normal random variables. This is the case under a variety of common optimality criteria such as ML, REML, GCV, and AIC. We apply the method to some well-known datasets in the literature, and find that under the ML and REML criteria it delivers a performance that is nearly exact, with computational speeds that are at least an order of magnitude faster than exact methods. Perhaps most importantly, the proposed method also offers a computationally feasible alternative where no known exact methods exist, e.g. GCV and AIC.
