# Probability and Statistics Seminar

## Past sessions

### Strategies to reduce the probability of a misleading signal

Standard practice in statistical process control is to run two individual charts, one for the process mean and another one for the process variance. The resulting scheme is known as a simultaneous scheme and it provides a way to satisfy Shewhart's dictum that proper process control implies monitoring both location and dispersion.

When we use a simultaneous scheme, the quality characteristic is deemed to be out-of-control whenever a signal is triggered by either individual chart. As a consequence, the misidentification of the parameter that has changed can occur, meaning that a shift in the process mean can be misinterpreted as a shift in the process variance and vice-versa. These two events are known as misleading signals (MS) and can occur quite frequently.

We discuss (necessary and) sufficient conditions for the probability of a misleading signal (PMS) to be smaller than or equal to $$0.5$$. We also explore alternative simultaneous Shewhart-type schemes and check whether they lead to PMS values smaller than those of the popular $$(\bar{X}, S^2)$$ simultaneous scheme.
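As a rough illustration of how a PMS can be computed, the sketch below assumes that both charts are Shewhart-type, that the two chart statistics are independent (as holds for $$\bar{X}$$ and $$S^2$$ under normal data), and that a misleading signal means the variance chart signals strictly before the mean chart after a pure shift in the mean. Under these assumptions the two run lengths are independent geometric variables. The function names, the 3-sigma limits and the false-alarm rate are illustrative choices, not taken from the talk.

```python
from statistics import NormalDist

def mean_chart_signal_prob(delta, k=3.0):
    """Per-sample signal probability of an X-bar chart with +/- k limits
    when the mean has shifted by delta standard errors (assumed setup)."""
    z = NormalDist()
    return 1.0 - (z.cdf(k - delta) - z.cdf(-k - delta))

def pms_mean_shift(p_mean, p_var):
    """P(variance chart signals strictly before the mean chart) for
    independent geometric run lengths with per-sample signal
    probabilities p_mean (mean chart) and p_var (variance chart)."""
    return p_var * (1.0 - p_mean) / (1.0 - (1.0 - p_mean) * (1.0 - p_var))

p_var = 0.0027  # assumed false-alarm rate of the variance chart
for delta in (0.5, 1.0, 2.0):
    p_mean = mean_chart_signal_prob(delta)
    print(delta, round(pms_mean_shift(p_mean, p_var), 4))
```

In this toy setup, small shifts leave the two signal probabilities at a similar size, so the PMS approaches 0.5; larger shifts make the mean chart react faster and the PMS drops.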

### Price Modelling in Carbon Emission and Electricity Markets

We present a model to explain the joint dynamics of the prices of electricity and carbon emission allowance certificates as a function of exogenously given fuel prices and power demand. The model for the electricity price consists of an explicit construction of the electricity supply curve; the model for the allowance price takes the form of a coupled forward-backward stochastic differential equation (FBSDE) with random coefficients. Reflecting typical properties of emissions trading schemes, the terminal condition of this FBSDE exhibits a gradient singularity. Appealing to compactness arguments, we prove the existence of a unique solution to this equation. We illustrate the relevance of the model using the example of pricing clean spread options, contracts that are frequently used to value power plants in the spirit of real option theory.

### Incorporating parameter uncertainty into the setup of EWMA control charts monitoring normal variance

Most of the literature concerned with the design of control charts relies on perfect knowledge of the distribution for at least the good (so-called in-control) process. Some papers have treated EWMA charts monitoring the normal mean in the case of unknown parameters; refer to Jones, Champ and Rigdon (2001) for a good introduction. A nice overview is given in Jensen, Jones-Farmer, Champ, and Woodall (2006), “Effects of Parameter Estimation on Control Chart Properties: A Literature Review”. That review also mentions that it would be interesting and useful to evaluate and take into account these effects for variance control charts as well. Here, we consider EWMA charts for monitoring the normal variance. Given a sequence of batches of size $n$, $\{X_{i j}\}$, $i=1,2,\ldots$ and $j=1,2,\ldots,n$, we utilize the following EWMA control chart:

\begin{align*} Z_0 & = z_0 = \sigma_0^2 = 1 \,, \\ Z_i & = (1-\lambda) Z_{i-1} + \lambda S_i^2 \,,\; i = 1,2,\ldots \,,\\ & \qquad\qquad S_i^2 = \frac{1}{n-1} \sum_{j=1}^n (X_{ij} - \bar X_i)^2 \,,\; \bar X_i = \frac{1}{n} \sum_{j=1}^n X_{ij} \,, \\ L & = \inf \left\{ i \in \mathbb{N}: Z_i > c_u \sigma_0^2 \right\} \,. \end{align*}

The parameters $\lambda \in (0,1]$ and $c_u \gt 0$ are chosen to achieve a useful detection performance (not too many false alarms and quick detection of changes). The most popular performance measure is the so-called Average Run Length (ARL), that is, $E_{\sigma}(L)$ for the true standard deviation $\sigma$. If $\sigma_0$ has to be estimated by sampling data during a pre-run phase, then this uncertain parameter affects, of course, the behavior of the applied control chart. Typically the ARL is increased. Most of the papers about characterizing the uncertainty impact deal with the changed ARL patterns and possible adjustments. Here, a different way of designing the chart is treated: set up the chart by specifying a certain false alarm probability such as $P_{\sigma_0}(L\le 1000) \le \alpha$. This results in a specific $c_u$. We describe a feasible way to determine this value $c_u$ in the case of unknown parameters as well, for a pre-run series of given size (and structure). A two-sided version of the introduced EWMA scheme is analyzed as well.
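The chart recursion and the false alarm probability $P_{\sigma_0}(L\le 1000)$ can be explored by plain Monte Carlo. The sketch below covers only the known-parameter case ($\sigma_0$ given); it is not the method of the talk, which is not described in the abstract, and the function names, defaults and replication counts are assumptions made for illustration.

```python
import random
from statistics import variance

def ewma_s2_run_length(lam, c_u, n, sigma=1.0, sigma0=1.0,
                       max_steps=1000, rng=random):
    """Simulate one run length L of the upper EWMA-S^2 chart.
    Returns None if no signal occurs within max_steps batches."""
    z = sigma0 ** 2                          # Z_0 = sigma_0^2
    for i in range(1, max_steps + 1):
        batch = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = variance(batch)                 # sample variance, divisor n - 1
        z = (1.0 - lam) * z + lam * s2       # EWMA recursion
        if z > c_u * sigma0 ** 2:
            return i
    return None

def false_alarm_prob(lam, c_u, n, horizon=1000, reps=200, seed=1):
    """Monte Carlo estimate of P_{sigma_0}(L <= horizon), sigma_0 known."""
    rng = random.Random(seed)
    hits = sum(
        ewma_s2_run_length(lam, c_u, n, max_steps=horizon, rng=rng) is not None
        for _ in range(reps)
    )
    return hits / reps
```

A value of $c_u$ meeting a target $\alpha$ could then be searched for, for example by bisection on `false_alarm_prob`. Handling an estimated $\sigma_0$ would require an additional outer simulation over pre-run samples, which this sketch omits.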

### Reaching the best possible rate of convergence to equilibrium of Boltzmann-equation solutions

This talk concerns a definitive answer to the problem of quantifying the relaxation to equilibrium of the solutions to the spatially homogeneous Boltzmann equation for Maxwellian molecules. Under very mild conditions on the initial datum - close to necessity - and a weak, physically consistent, angular cutoff hypothesis, the main result states that the total variation distance (i.e. the ${L}^{1}$-norm in the absolutely continuous case) between the solution and the limiting Maxwellian distribution admits an upper bound of the form $C\exp\left(-{\Lambda }_{b}^{*}t\right)$, ${\Lambda }_{b}^{*}$ being the spectral gap of the linearized collision operator and $C$ a constant depending only on the initial datum. Hilbert hinted at the validity of this quantification in 1912, and it was explicitly formulated as a conjecture by McKean in 1966. The main line of the new proof is based on an analogy between the problem of convergence to equilibrium and the central limit theorem of probability theory, as suggested by McKean.

### Robust Procedures for Nonlinear Models for Full and Incomplete Data

Linear models are among the most popular models in Statistics. However, in many situations the nature of the phenomenon is intrinsically nonlinear, so linear approximations are not valid and the data must be fitted using a nonlinear model. Moreover, on some occasions the responses are incomplete and some of them are missing at random.

It is well known that, in this setting, the classical estimator of the regression parameter based on least squares is very sensitive to outliers. A family of general M-estimators is proposed to estimate the regression parameter in a nonlinear model. We give a unified approach to treat full data or data with missing responses. Under mild conditions, the proposed estimators are Fisher-consistent, consistent and asymptotically normal. To study local robustness, their influence function is also derived.

A family of robust tests based on a Wald-type statistic is introduced in order to check hypotheses that involve the regression parameter. Monte Carlo simulations illustrate the finite sample behaviour of the proposed procedures in different settings in contaminated and uncontaminated samples.

### An INteger AutoRegressive afternoon - Statistical analysis of discrete valued time series

Part I: Univariate and multivariate models based on thinning

Part II: Modelling and forecasting time series of counts

Time series of counts arise when the interest lies in the number of certain events occurring during a specified time interval. Many of these data sets are characterized by low counts, asymmetric distributions, excess zeros, overdispersion, etc., ruling out normal approximations. Thus, during the last decades there has been considerable interest in models for integer-valued time series and a large volume of work is now available in specialized monographs. Among the most successful models for integer-valued time series are the INteger-valued AutoRegressive Moving Average (INARMA) models based on the thinning operation. These models are attractive since they are linear-like models for discrete time series which exhibit recognizable correlation structures. Furthermore, in many situations the collected time series are multivariate in the sense that there are counts of several events observed over time and the counts at each time point are correlated. The first talk introduces univariate and multivariate models for time series of counts based on the thinning operator and discusses their statistical and probabilistic properties. The second talk addresses estimation and diagnostic issues and illustrates the inference procedures with simulated and observed data.
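To make the thinning operation concrete, here is a minimal sketch of the simplest member of this family, an INAR(1) process $X_t = \alpha \circ X_{t-1} + \varepsilon_t$ with binomial thinning. Poisson innovations are an assumption made for the example (the abstract does not fix an innovation distribution), and all function names are illustrative.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def thin(x, alpha, rng):
    """Binomial thinning of the count x: each of the x units
    survives independently with probability alpha."""
    return sum(rng.random() < alpha for _ in range(x))

def simulate_inar1(alpha, lam, length, x0=0, seed=0):
    """Simulate X_t = thin(X_{t-1}) + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    path, x = [], x0
    for _ in range(length):
        x = thin(x, alpha, rng) + poisson(lam, rng)
        path.append(x)
    return path

path = simulate_inar1(alpha=0.5, lam=1.0, length=2000)
print(sum(path) / len(path))  # should be near lam / (1 - alpha) = 2
```

The recursion mirrors a Gaussian AR(1), which is the sense in which these models are "linear-like", while every simulated value stays a non-negative integer.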

### Mathematical Finance in South Africa

I have been involved in Math Finance university education in South Africa since 1996. During this time I have produced numerous graduates and grown an extensive network of industry and academic partners. I'll talk about these experiences and take questions.

### Aggregational Gaussianity Using Sobol Sequencing In the South African Equity Markets: Implications for the Pricing of Risk

Stylized facts of asset returns in the South African market have received extensive attention, with multiple studies published on non-normality of returns, heavy-tailed distributions, gain-loss asymmetry and, particularly, volatility clustering. One such fact that has received only cursory attention worldwide is Aggregational Gaussianity - the widely accepted stylized fact that empirical asset returns tend to normality as the period over which the return is computed increases. The aggregational aspect arises from the $$n$$-day log-return being the simple sum of $$n$$ one-day log-returns. This fact is usually established using Q-Q plots over longer and longer intervals, and can be qualitatively confirmed. However, this methodology inevitably uses overlapping data series, especially for longer-period returns. When an alternative resampling methodology for dealing with time-overlapping returns data is used, a different picture emerges. Here we describe evidence from the South African market for a discernible absence of Aggregational Gaussianity and briefly discuss the implications of these findings for the quantification of risk and for the pricing and hedging of derivative securities.
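The difference between overlapping and non-overlapping aggregation can be sketched in a few lines. This only illustrates the summation of one-day log-returns and the overlap issue; it is not the resampling methodology of the talk, which the abstract does not describe, and the function name is hypothetical.

```python
def aggregate_returns(daily, n, overlapping=False):
    """n-day log-returns as sums of n consecutive one-day log-returns.
    Overlapping windows advance by one day, non-overlapping ones by n days."""
    step = 1 if overlapping else n
    return [sum(daily[i:i + n]) for i in range(0, len(daily) - n + 1, step)]

daily = [0.01, -0.02, 0.03, 0.00, 0.01, 0.02]
print(len(aggregate_returns(daily, 3, overlapping=True)))   # 4 windows
print(len(aggregate_returns(daily, 3, overlapping=False)))  # 2 windows
```

Overlapping windows share $$n-1$$ of their $$n$$ daily returns with their neighbours, so the resulting series is strongly autocorrelated by construction, while non-overlapping windows leave far fewer observations at long horizons.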

### Real Time Statistical Process Control of the Quantity of Product in Prepackages

In this presentation we describe how we developed a methodology for statistical quantity control in the processes of prepackagers, and present a number of case studies that differ in the type of product, packaging, production, filling line and system of data acquisition. With the aim of establishing a global strategy to control the quantity of product in prepackages, an integrated planning model based on statistical tools was developed. This model is able to manage the production functions concerning the legal metrological requirements. These requirements are similar all around the world because they are based on the recommendation R-87: 2004 (E) from the International Organization of Legal Metrology (OIML). Based on the principles of Statistical Process Control, a methodology to analyze in real time the quantity of product in prepackages was proposed; routine inspections, condition monitoring of the main components and easy comprehension of the outputs were taken into account. Subsequently, software for data acquisition, registration (to guarantee traceability) and data treatment for decision-making, configurable for any kind of filling process, was introduced. The impact of this system, named ACCEPT - Computer Based Help for the Statistic Control of the Filling Processes, on industry is demonstrated by the large number of companies that are using it to control their processes. In Portugal, more than 50 companies and thousands of operators with very low qualifications are working every day with SPC tools and capability analysis in order to minimize variability and waste (for example, overfilling), to ensure compliance and to guarantee consumers' rights.

### Corporate cash policy with liquidity and profitability risks

We develop a dynamic model of a firm facing both liquidity and profitability concerns. This leads us to study and to solve explicitly a bi-dimensional control problem where the two state variables are the controlled cash reserves process and the belief process about the firm’s profitability. Our model encompasses previous studies and provides new predictions for corporate cash policy. The model predicts a positive relationship between cash holdings and beliefs about the firm’s profitability, a non-monotonic relationship between cash holdings and the volatility of the cash flows as well as a non-monotonic relationship between cash holdings and the risk of profitability. This yields novel insights on the firm’s default policy and on the relationship between volatility of stock prices and the level of stock.

### Using Latent Class Models to Evaluate the Performance of Diagnostic Tests in the Absence of a Gold Standard

Diagnostic tests are helpful tools for decision-making in a biomedical context. In order to determine the clinical relevance and practical utility of each test, it is critical to assess its ability to correctly distinguish diseased from non-diseased individuals. Statistical analysis has an essential role in the evaluation of diagnostic tests, since it is used to estimate performance measures of the tests, such as sensitivity and specificity. Ideally, these measures are determined by comparison with a gold standard, i.e., a reference test with perfect sensitivity and specificity.

When no gold standard is available, taking the supposedly best available test as a reference may cause misclassifications leading to biased estimates. Alternatively, Latent Class Models (LCM) may be used to estimate diagnostic test performance measures, as well as the disease prevalence, in the absence of a gold standard. The most common LCM estimation approaches are maximum likelihood estimation using the Expectation-Maximization algorithm and Bayesian inference using Markov Chain Monte Carlo methods, via Gibbs sampling.

This talk illustrates the use of Bayesian Latent Class Models (BLCM) in the context of malaria and canine dirofilariosis. In each case, multiple diagnostic tests were applied to distinct subpopulations. To analyze the subpopulations simultaneously, a product multinomial distribution was considered, since the subpopulations were independent. By introducing constraints, it was possible to explore differences and similarities between subpopulations in terms of prevalence, sensitivities and specificities.

We also discuss statistical issues such as the assumption of conditional independence, model identifiability, sampling strategies and prior distribution elicitation.
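The role of the conditional independence assumption can be seen in the basic two-class latent class model for a single population: given true disease status, test results are independent, so each response pattern has probability $\pi \prod_k Se_k^{t_k}(1-Se_k)^{1-t_k} + (1-\pi)\prod_k (1-Sp_k)^{t_k} Sp_k^{1-t_k}$. The sketch below evaluates this formula; the parameter values are made up for illustration and the function name is hypothetical.

```python
from itertools import product

def pattern_probability(pattern, prevalence, sens, spec):
    """P(T_1=t_1,...,T_K=t_K) under a two-class latent class model with
    conditional independence of the tests given true disease status."""
    p_dis, p_non = prevalence, 1.0 - prevalence
    for t, se, sp in zip(pattern, sens, spec):
        p_dis *= se if t == 1 else 1.0 - se      # diseased class
        p_non *= 1.0 - sp if t == 1 else sp      # non-diseased class
    return p_dis + p_non

sens, spec = [0.90, 0.80, 0.95], [0.95, 0.90, 0.85]  # illustrative values
probs = {t: pattern_probability(t, 0.2, sens, spec)
         for t in product((0, 1), repeat=3)}
print(sum(probs.values()))  # the pattern probabilities form a distribution
```

With $K$ tests in one population there are $2^K - 1$ free cell probabilities but $2K + 1$ parameters, which is one way to see why identifiability becomes an issue for small $K$ and why several subpopulations or informative priors can help.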

### Taking Variability in Data into Account: Symbolic Data Analysis

Symbolic Data Analysis, introduced by E. Diday in the late 1980s, is concerned with analysing data presenting intrinsic variability, which is to be explicitly taken into account. In classical Statistics and Multivariate Data Analysis, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals, described by their age, salary, education level, marital status, etc.; cars, each described by its weight, length, power, engine displacement, etc.; students, for each of whom the marks in different subjects were recorded. But when the elements of interest are classes or groups of some kind - the citizens living in given towns; teams, consisting of individual players; car models, rather than specific vehicles; classes and not individual students - then there is variability inherent to the data. Reducing this variability by taking central tendency measures - mean values, medians or modes - obviously leads to too great a loss of information.

Symbolic Data Analysis provides a framework for representing data with variability, using new variable types. Methods have also been developed which suitably take data variability into account. Symbolic data may be represented using the usual matrix-form data arrays, where each entity is represented in a row and each column corresponds to a different variable - but now the elements of each cell are generally not single real values or categories, as in the classical case, but rather finite sets of values, intervals or, more generally, distributions.

In this talk we shall introduce and motivate the field of Symbolic Data Analysis and present in some detail the new variable types that have been introduced to represent variability, illustrating them with some examples. We shall furthermore discuss some issues that arise when analysing data that do not follow the usual classical model, and present data representation models for some variable types.

### Forecasting The Temperature Data & Localising Temperature Risk

Forecasting The Temperature Data:

This paper aims at describing intraday temperature variations, which is a challenging task in modern econometrics and environmetrics. Given high-frequency data, we separate the dynamics within a day and over days. Three main models have been considered in our study. As the benchmark we employ a simple truncated Fourier series with autocorrelated residuals. The second model uses functional data analysis and is called the shape invariant model (SIM). The third one is the dynamic semiparametric factor model (DSFM). In this work we discuss the merits and pitfalls of all the methods and compare their in- and out-of-sample performances.

&

Localising Temperature Risk:

On the temperature derivative market, modelling temperature volatility is an important issue for pricing and hedging. In order to apply the pricing tools of financial mathematics, one needs to isolate a Gaussian risk factor. A conventional model for temperature dynamics is a stochastic model with seasonality and intertemporal autocorrelation. Empirical work based on seasonality and autocorrelation correction reveals that the obtained residuals are heteroscedastic with a periodic pattern. The object of this research is to estimate this heteroscedastic function so that, after scale normalisation, a pure standardised Gaussian variable appears. Earlier works investigated temperature risk in different locations and showed that neither parametric component functions nor a local linear smoother with constant smoothing parameter are flexible enough to generally describe the variance process well. Therefore, we consider a local adaptive modelling approach to find, at each time point, an optimal smoothing parameter to locally estimate the seasonality and volatility. Our approach provides a more flexible and accurate fitting procedure for localised temperature risk by achieving nearly normal risk factors. We also employ our model to forecast the temperature in different cities and compare it to a model developed in Campbell and Diebold (2005).

### S-estimators for functional principal component analysis

A well-known property of functional principal components is that they provide the best q-dimensional approximation to random elements over separable Hilbert spaces. Our approach to robust estimates of principal components for functional data is based on this property since we consider the problem of robustly estimating these finite-dimensional approximating linear spaces. We propose a new class of estimators for principal components based on robust scale functionals by finding the lower dimensional linear space that provides the best prediction for the data. In analogy to the linear regression case, we call this proposal S-estimators. This method can also be applied to sparse data sets when the underlying process satisfies a smoothness condition with respect to the functional associated with the scale defining the S-estimators. The motivation is a problem of outlier detection in atmospheric data collected by weather balloons launched into the atmosphere and stratosphere.

### Technological Change: A Burden or a Chance

The photography industry underwent a disruptive change in technology during the 1990s when traditional film was replaced by digital photography (see e.g. The Economist, January 14th 2012). Kodak in particular was heavily affected: by 1976 it accounted for 90% of film and 85% of camera sales in America, making it a near-monopoly there. Kodak's revenues were nearly 16 billion in 1996, but the prediction is that they will decrease to 6.2 billion in 2011. Kodak tried to squeeze as much money out of the film business as possible while it prepared for the switch to digital photography. The result was that Kodak did eventually build a profitable business out of digital cameras, but it lasted only a few years before camera phones overtook it.

According to Mr Komori, the former CEO of Fujifilm (2000-2003), Kodak aimed to be a digital company, but that is a small business and not enough to support a big company. For Kodak it was like seeing a tsunami coming and there's nothing you can do about it, according to Mr Christensen in The Economist (January 14th 2012).

In this paper we study the problem of a firm that produces with a current technology for which it faces a declining sales volume. It has two options: it can either exit this industry or invest in a new technology with which it can produce an innovative product. We distinguish between two scenarios: the resulting new market can either be booming or end up being smaller than the old market used to be.

We derive the optimal strategy of a firm for each scenario and specify the probabilities with which a firm would decide to innovate or to exit. Furthermore, we assume that the firm can additionally choose to suspend production for some time in case demand is too low, instead of immediately taking the irreversible decision to exit the market. We derive conditions under which such a suspension area exists and show how long a firm is expected to remain in this suspension area before resuming production, investing in new technology or exiting the market.

### Understanding the state of men's health in Europe through a life expectancy analysis

A common feature of the health of men across Europe is their higher rates of premature mortality and shorter life expectancy than women. Following the publication of the first State of Men's Health in Europe we sought to explore possible reasons.

We described trends in life expectancy in the European Union Member States (EU27) between 1999 and 2008 using mortality data obtained from Eurostat. We then used Pollard's decomposition method to identify the contribution of deaths from different causes and in different age groups to differences in life expectancy. We first examined the change in life expectancy for men and for women between the beginning and end of this period. Second, we examined the gap in life expectancy between men and women at the beginning and end of this period.

Between 1999 and 2008 life expectancy in the EU27 increased by 2.77 years for men and by 2.12 years for women. Most of these improvements were due to reductions in mortality at ages over 60, with cardiovascular disease accounting for 1.40 years of the reduction in men. In 2008 life expectancy of men in the EU27 was 6.04 years lower than that of women. Deaths from all major groups of causes, and at all ages, contribute to this gap, with external causes contributing 1.00 year, cardiovascular disease 1.75 years and neoplasms 1.71 years.

Improvements in the life expectancy of men and women have mostly occurred at older ages. There has been little improvement in the high rate of premature death in younger men. This would suggest a need for interventions to tackle the high death rate in younger men. The variations in premature death and life expectancy seen in men within the new European Commission report highlight the impact of poor socio-economic conditions. The more pronounced adverse effect on the health of men suggests that men suffer from 'heavy impact diseases' that are more quickly life-limiting, with women more likely to survive, but with poorer health.

### Optimal Technology Adoption when the Arrival Rate of New Technologies Changes

Our paper contributes to the literature on technology adoption. In most of these models it is assumed that after the arrival of a new technology the probability of the next arrival is constant. We extend this approach by assuming that after the last technology jump the probability of a new arrival can change. Right after the arrival of a new technology the intensity equals a specific value that switches if no new technology arrival has taken place within a certain period after the last technology arrival. We look at different scenarios, depending on whether the firm is threatened by a drop in the arrival rate after a certain time period or expects the rate of new arrivals to rise. We analyze the effect of the variance of the time between two consecutive arrivals on the optimal investment timing and show that larger variance accelerates investment in a new technology. We find that firms often adopt a new technology with a time lag after its introduction, a phenomenon frequently observed in practice. Regarding a firm's technology releasing strategy, we explain why clear signals set by regular and steady release of new product generations stimulate customers' buying behavior. Depending on whether the arrival rate is assumed to change or be constant over time, the optimal technology adoption timing changes significantly. In a further step we add an additional source of uncertainty to the problem and assume that the length of the time period after which the arrival intensity changes is not known to the firm in advance. Here, we find that increasing uncertainty accelerates investment, a result that is opposite to the standard real options theory.

### On the Aging Properties of the Run Length of Markov-Type Control Charts

A change in a production process must be detected quickly so that a corrective action can be taken. Thus, it comes as no surprise that the run length (RL) is usually used to describe the performance of a quality control chart.

This popular performance measure has a phase-type distribution when dealing with Markov-type charts, namely, cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) charts, as opposed to a geometric distribution, when standard Shewhart charts are in use.

In this talk, we briefly discuss sufficient conditions on the associated probability transition matrix to deal with run lengths with aging properties such as new better than used in expectation, new better than used, and increasing hazard rate.

We also explore the implications of these aging properties of the run lengths, namely when we compare the in-control and out-of-control variances of the run lengths of matched in-control Shewhart and Markov-type control charts.
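For a chart whose statistic is governed by a finite Markov chain, the phase-type run length has survival function $P(RL > k) = \alpha Q^k \mathbf{1}$, where $Q$ is the sub-stochastic transition matrix over the non-signalling states and $\alpha$ the initial distribution. The sketch below computes this survival function and the corresponding discrete hazard rate; the two-state matrix is a made-up example rather than a calibrated CUSUM or EWMA approximation, and the function names are illustrative.

```python
def run_length_survival(alpha, Q, kmax):
    """P(RL > k) = alpha Q^k 1 for k = 0..kmax, with Q the sub-stochastic
    matrix over transient (non-signal) states and alpha a row vector."""
    surv, v = [], list(alpha)
    for _ in range(kmax + 1):
        surv.append(sum(v))                       # alpha Q^k 1
        v = [sum(v[i] * Q[i][j] for i in range(len(v)))
             for j in range(len(Q))]              # v <- v Q
    return surv

def hazard_rates(surv):
    """h(k) = P(RL = k | RL >= k) for k = 1..kmax."""
    return [(surv[k - 1] - surv[k]) / surv[k - 1]
            for k in range(1, len(surv))]

Q = [[0.90, 0.08],
     [0.10, 0.70]]          # hypothetical two-state chart
surv = run_length_survival([1.0, 0.0], Q, kmax=20)
print(hazard_rates(surv)[:5])
```

A Shewhart chart corresponds to a single transient state, which gives a geometric run length with constant hazard; a hazard sequence that increases in $k$ is the increasing hazard rate property mentioned above.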

#### Keywords

Phase-type distributions; Run length; Statistical process control; Stochastic ordering.

#### Bibliography

Morais, M.C. and Pacheco, A. (2012). A note on the aging properties of the run length of Markov-type control charts. Sequential Analysis 31, 88-98.

### The variable space: where statistics and geometry are married. The case of Mahalanobis distances.

The usual way of conceptualizing the graphical representation of a data matrix $X_{n\times p}$ of individuals $\times$ variables is to associate an axis with each variable and, in that Cartesian reference frame, to represent each individual by a point whose coordinates are given by the row of $X$ corresponding to that individual. The popularity of this representation in the space of individuals ($\mathbb{R}^p$) results, to a large extent, from the fact that it can be visualized for bivariate or trivariate data. However, for a larger number of variables ($p \gt 3$) that advantage ceases to exist.

An alternative representation is important in the analysis and modelling of data. In the variable space, each axis corresponds to an individual and each variable is represented by a vector from the origin, defined by the $n$ coordinates of the respective matrix column. This representation of the variables in $\mathbb{R}^n$ has the enormous advantage of marrying statistical and geometric concepts, allowing a better understanding of the former. It has solid roots in the French school of data analysis, but its potential is not always exploited.

This talk begins by recalling the geometric concepts corresponding to fundamental indicators of univariate and bivariate statistics (mean, standard deviation, coefficient of variation or correlation coefficient) or of multivariate statistics (illustrated with the case of principal component analysis). The discussion is developed further in the context of multiple linear regression, whose fundamental concepts (coefficient of determination, the three sums of squares and their fundamental relation) have a geometric interpretation in the variable space.

Next, we discuss the usefulness of this geometric representation in the study of Mahalanobis distances, which play a leading role in multivariate statistics. We show how (squared) Mahalanobis distances measure the inclination of the subspace of $\mathbb{R}^n$ spanned by the columns of the centred data matrix, the subspace $\mathcal{C}(X_c)$, relative to the system of axes. In particular, we show that the Mahalanobis distances to the centre, $D^2_{x_i,\overline{x}}=(x_i-\overline{x})^t S^{-1} (x_i-\overline{x})$, are a function only of $n$ and of the angle $\theta_i$ between the axis corresponding to individual $i$ and $\mathcal{C}(X_c)$, while the (squared) Mahalanobis distance between two individuals, $D^2_{x_i,x_j}=(x_i-x_j)^t S^{-1} (x_i-x_j)$, is likewise a function only of $n$ and of the angle between $\mathcal{C}(X_c)$ and the bisector generated by $e_i-e_j$, where $e_i$ and $e_j$ are the canonical vectors of $\mathbb{R}^n$ associated with the two individuals. Some recent upper bounds and other important properties of these distances (Gath & Hayes, 2006 and Branco & Pires, 2011) are a direct expression of these geometric relations. Although Mahalanobis distances concern the individuals, the geometric concepts associated with them in the variable space can be exploited to deepen and extend these results.
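One concrete form of the relation between the distance to the centre and the angle $\theta_i$ can be checked numerically. Assuming $S$ is the sample covariance matrix with divisor $n-1$, the squared length of the orthogonal projection of $e_i$ onto $\mathcal{C}(X_c)$ equals $\cos^2\theta_i$, and a short calculation gives $D^2_{x_i,\overline{x}} = (n-1)\cos^2\theta_i$. The abstract does not state the exact formula used in the talk, so this identity, the restriction to $p=2$ and the function name are assumptions of the sketch.

```python
import math

def mahalanobis_and_angles(rows):
    """For bivariate data (p = 2), return the squared Mahalanobis distances
    to the centre and cos^2 of the angle between each axis e_i and C(X_c)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    c1 = [r[0] - mx for r in rows]            # centred columns of X_c
    c2 = [r[1] - my for r in rows]

    # Sample covariance matrix S (divisor n - 1) and D_i^2 via its inverse.
    s11 = sum(a * a for a in c1) / (n - 1)
    s22 = sum(b * b for b in c2) / (n - 1)
    s12 = sum(a * b for a, b in zip(c1, c2)) / (n - 1)
    det = s11 * s22 - s12 ** 2
    d2 = [(s22 * a * a - 2 * s12 * a * b + s11 * b * b) / det
          for a, b in zip(c1, c2)]

    # Orthonormal basis (q1, q2) of C(X_c) by Gram-Schmidt.
    norm1 = math.sqrt(sum(a * a for a in c1))
    q1 = [a / norm1 for a in c1]
    dot = sum(u * b for u, b in zip(q1, c2))
    r2 = [b - dot * u for u, b in zip(q1, c2)]
    norm2 = math.sqrt(sum(v * v for v in r2))
    q2 = [v / norm2 for v in r2]

    # cos^2(theta_i) = squared length of the projection of e_i on C(X_c).
    cos2 = [q1[i] ** 2 + q2[i] ** 2 for i in range(n)]
    return d2, cos2

rows = [(1, 2), (2, 1), (3, 5), (4, 3), (6, 4)]   # made-up data
d2, cos2 = mahalanobis_and_angles(rows)
for d, c in zip(d2, cos2):
    print(round(d, 6), round((len(rows) - 1) * c, 6))  # the two columns agree
```

Since $\cos^2\theta_i \le 1$, this form immediately yields the upper bound $D^2_{x_i,\overline{x}} \le n-1$ (under the divisor $n-1$ convention assumed here), an example of a bound that follows directly from the geometry.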

### An overview of capture-recapture models

Capture-recapture methods have been widely used in the biological sciences to estimate population abundance and related demographic parameters (births, deaths, immigration, or emigration). More recently, these models have been used to estimate community dynamics parameters such as species richness, rates of extinction, colonization and turnover, and other metrics that require presence/absence data or species counts. In this presentation, we will use the latter application to illustrate some of the concepts and the underlying theory of capture-recapture models. In particular, we will review basic closed-population, open-population, and combined closed- and open-population models. We will briefly mention other applications of these models in the medical, social and computer sciences.
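The simplest closed-population case is the two-sample experiment. The sketch below uses the classical Lincoln-Petersen estimator and Chapman's bias-corrected variant; these are standard textbook estimators chosen for illustration, not necessarily the ones covered in the talk, and the argument names and numbers are made up.

```python
def lincoln_petersen(n1, n2, m2):
    """Population size estimate n1 * n2 / m2, where n1 animals are marked
    in the first sample, n2 are caught in the second sample and m2 of
    those are already marked."""
    if m2 == 0:
        raise ValueError("no recaptures: estimator undefined")
    return n1 * n2 / m2

def chapman(n1, n2, m2):
    """Chapman's bias-corrected variant, defined even when m2 = 0."""
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1

print(lincoln_petersen(100, 80, 20))  # 400.0
print(chapman(100, 80, 20))
```

The idea is that the marked fraction in the second sample, $m_2/n_2$, estimates the marked fraction in the whole population, $n_1/N$, which requires the population to be closed between the two sampling occasions.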

Keywords: Capture-recapture experiments; multinomial and mixture distributions; non-parametric and maximum likelihood estimation; population size estimation.
