# Probability and Statistics Seminar

## Past sessions

### An ARL-unbiased $np$-chart

We usually assume that counts of nonconforming items have a binomial distribution with parameters $(n,p)$, where $n$ and $p$ represent the sample size and the fraction nonconforming, respectively.

The non-negative, discrete and usually skewed character of this distribution, together with its target mean $(np_0)$, may prevent the quality control engineer from using a chart to monitor $p$ with: a pre-specified in-control average run length (ARL), say $1/\alpha$; a positive lower control limit; and the ability to detect not only increases but also decreases in $p$ in an expedient fashion. Furthermore, as far as we have investigated, the $np$- and $p$-charts proposed in the Statistical Process Control literature are ARL-biased, in the sense that they take longer, on average, to detect some shifts in the fraction nonconforming than to trigger a false alarm.

Having all this in mind, this paper explores the notions of uniformly most powerful unbiased tests with randomization probabilities to eliminate the bias of the ARL function of the $np$-chart and to bring its in-control ARL exactly to $1/\alpha$.
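The ARL bias described above can be illustrated numerically. The sketch below is not the method of the talk: it only computes the ARL of a conventional $np$-chart, assuming 3-sigma limits and hypothetical values $n = 100$ and $p_0 = 0.02$ chosen for illustration. With these values the lower control limit is negative, so a decrease in $p$ makes a signal less likely than in control.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def np_chart_arl(n, p, lcl, ucl):
    """ARL of an np-chart that signals when X < lcl or X > ucl."""
    signal_prob = sum(
        binom_pmf(k, n, p) for k in range(n + 1) if k < lcl or k > ucl
    )
    return 1.0 / signal_prob


# Hypothetical setting (an assumption, not taken from the talk).
n, p0 = 100, 0.02
center = n * p0
sigma = sqrt(n * p0 * (1 - p0))
lcl, ucl = center - 3 * sigma, center + 3 * sigma  # lcl is negative here

arl_in_control = np_chart_arl(n, p0, lcl, ucl)
arl_decrease = np_chart_arl(n, 0.01, lcl, ucl)  # p shifted downwards
arl_increase = np_chart_arl(n, 0.05, lcl, ucl)  # p shifted upwards

print(f"in-control ARL:        {arl_in_control:.1f}")
print(f"ARL after a decrease:  {arl_decrease:.1f}")
print(f"ARL after an increase: {arl_increase:.1f}")
```

In this setting the ARL after a decrease exceeds the in-control ARL, which is exactly the kind of bias that the randomized, unbiased design aims to remove.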

### The Block Maxima and POT methods, and an extension of POT to integrated stochastic processes

We shall review the classical maximum domain of attraction condition underlying BM and POT, two fundamental methods in Extreme Value Theory. A theoretical comparison between the methods will be presented.
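As a reminder of how the two methods select their extremes, here is a minimal sketch: block maxima (BM) keeps the largest value of each block, while peaks over threshold (POT) keeps the excesses above a fixed level. The toy series, block size and threshold are hypothetical choices made only for illustration.

```python
def block_maxima(series, block_size):
    """Maximum of each complete block of consecutive observations."""
    n_blocks = len(series) // block_size
    return [
        max(series[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def threshold_excesses(series, threshold):
    """Excesses x - threshold for the observations above the threshold."""
    return [x - threshold for x in series if x > threshold]


# Toy data (an assumption for illustration only).
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

maxima = block_maxima(data, block_size=4)
excesses = threshold_excesses(data, threshold=4.5)

print(maxima)    # [4, 9, 8]
print(excesses)  # [0.5, 4.5, 1.5, 0.5, 0.5, 3.5]
```

BM uses one observation per block regardless of its size, whereas POT may use several large observations from the same block and none from another.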

Afterwards, the maximum domain of attraction condition in a spatial context will be discussed. Then a POT-type result for the integral of a stochastic process verifying the maximum domain of attraction condition will be obtained.

### On Eigenvalues of the Transition Matrix of some Count Data Markov Chains

A stationary Markov chain is uniquely determined by its transition matrix, the eigenvalues of which play an important role in characterizing the stochastic properties of a Markov chain. Here, we consider the case where the monitored observations are counts, i.e., having values in either the full set of non-negative integers, or in a finite set of the form $\{0,\ldots,n\}$ with a prespecified upper bound $n$. Examples of count data time series as well as a brief survey of some basic count data time series models are provided.

Then we analyze the eigenstructure of count data Markov chains. Our main focus is on so-called CLAR(1) models, which are characterized by having a linear conditional mean, and also on the case of a finite range, where the second largest eigenvalue determines the speed of convergence of the forecasting distributions. We derive a lower bound for the second largest eigenvalue, which often (but not always) even equals this eigenvalue. This becomes clear by deriving the complete set of eigenvalues for several specific cases of CLAR(1) models. Our method relies on the computation of appropriate conditional (factorial) moments.
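To see why conditional moments are informative about the eigenstructure, consider a finite-range model with linear conditional mean $E[X_t \mid X_{t-1}=k] = \rho k + c$. The centred vector $(k-\mu)_k$, with $\mu = c/(1-\rho)$, is then a right eigenvector of the transition matrix with eigenvalue $\rho$. The sketch below checks this numerically for a binomial AR(1) chain; the choice of this particular model and of the parameter values $n=5$, $\alpha=0.7$, $\beta=0.2$ is an assumption made for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_ar1_matrix(n, alpha, beta):
    """Transition matrix of X_t = Bin(X_{t-1}, alpha) + Bin(n - X_{t-1}, beta)."""
    matrix = [[0.0] * (n + 1) for _ in range(n + 1)]
    for k in range(n + 1):          # previous state
        for j in range(n + 1):      # next state
            matrix[k][j] = sum(
                binom_pmf(m, k, alpha) * binom_pmf(j - m, n - k, beta)
                for m in range(max(0, j - (n - k)), min(k, j) + 1)
            )
    return matrix


# Hypothetical parameter values.
n, alpha, beta = 5, 0.7, 0.2
rho = alpha - beta                  # slope of the linear conditional mean
mu = n * beta / (1 - rho)           # stationary mean

P = binomial_ar1_matrix(n, alpha, beta)
v = [k - mu for k in range(n + 1)]  # centred state vector
Pv = [sum(P[k][j] * v[j] for j in range(n + 1)) for k in range(n + 1)]

# P v should equal rho * v, i.e. rho is an eigenvalue of P.
max_error = max(abs(Pv[k] - rho * v[k]) for k in range(n + 1))
print(f"max |Pv - rho v| = {max_error:.2e}")
```

The check only uses the first conditional moment; higher-order (factorial) moments would be needed to recover further eigenvalues.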

### From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

The softmax transformation is a key component of several statistical learning models, encompassing multinomial logistic regression, action selection in reinforcement learning, and neural networks for multi-class classification. Recently, it has also been used to design attention mechanisms in neural networks, with important achievements in machine translation, image caption generation, speech recognition, and various tasks in natural language understanding and computation learning. In this talk, I will describe sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, I will show how its Jacobian can be efficiently computed, enabling its use in a neural network trained with backpropagation. Then, I will propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. An unexpected connection between this new loss and the Huber classification loss will be revealed. We obtained promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieved performance similar to that of the traditional softmax, but with a selective, more compact, attention focus.
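For readers unfamiliar with the transformation, the following is a minimal sketch of sparsemax as the Euclidean projection of a score vector onto the probability simplex, computed by sorting and thresholding. It is an illustrative pure-Python version, not the speaker's implementation, and the input scores are hypothetical.

```python
import math


def softmax(z):
    """Standard softmax, shifted by max(z) for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        # Coordinates satisfying this condition belong to the support;
        # the last k for which it holds determines the threshold tau.
        if 1.0 + k * z_k > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(z_i - tau, 0.0) for z_i in z]


scores = [1.0, 0.8, 0.1]  # hypothetical scores
print(softmax(scores))    # all entries strictly positive
print(sparsemax(scores))  # approximately [0.6, 0.4, 0.0]
```

Both outputs sum to one, but sparsemax assigns exactly zero probability to the lowest score, which is what allows a more compact attention focus.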

### Geostatistical History Matching with Ensemble Updating

In this work, a new history matching methodology is proposed, coupling within the same framework the advantages of using geostatistical sequential simulation and the principles of ensemble Kalman filters: history matching based on ensemble updating. The main idea of this procedure is to use simultaneously the relationship between the petrophysical properties of interest and the dynamical results to update the static properties at each iteration, and to define areas of influence for each well. This relation is established through the experimental non-stationary covariances, computed from the ensemble of realizations. A set of petrophysical properties of interest is generated through stochastic sequential simulation. For each simulated model, we obtain its dynamic responses at the well locations by running a fluid flow simulator over each single model. Considering the normalized absolute deviation between the dynamic responses and the real dynamic response in each well as state variables, we compute the correlation coefficients of the deviations with each grid cell through the ensemble of realizations. Areas of high correlation coefficients are those where the permeability is more likely to play a key role for the production of that given well. Using a local estimation of the response of the deviations, through a simple kriging process, we update the subsurface property of interest at a given location.

### Statistical Modeling of Integer-valued Time Series: An Introduction

Modeling and predicting the temporal dependence and evolution of low integer-valued time series have attracted a lot of attention over the last years. This is partially due to the increasing availability of relevant high-quality data sets in various fields of application ranging from finance and economics to medicine and ecology. It is important to stress, however, that there is no unifying approach applicable to modeling all integer-valued time series and, consequently, the analysis of such time series has to be restricted to special classes of integer-valued models. A useful division of these models can be made as being either observation-driven or parameter-driven models. A suitable class of observation-driven models is the one including models based on thinning operators. Models belonging to this class are obtained by replacing the multiplication in the conventional time series models by an appropriate thinning operator, along with considering a discrete distribution for the sequence of innovations in order to preserve the discreteness of the counts.
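The simplest instance of this construction is the first-order model $X_t = \alpha \circ X_{t-1} + \varepsilon_t$, where $\alpha \circ X$ denotes binomial thinning (each of the $X$ units survives independently with probability $\alpha$). The sketch below simulates such a process with Poisson innovations; the parameter values, the seed and the helper names are assumptions made for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Poisson variate via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def binomial_thinning(alpha, x, rng):
    """alpha o x: number of the x units that survive, each with prob. alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def simulate_inar1(alpha, lam, length, seed=0):
    """Simulate X_t = alpha o X_{t-1} + e_t with e_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    # Start from the stationary marginal, Poisson(lam / (1 - alpha)).
    x = poisson(lam / (1 - alpha), rng)
    path = []
    for _ in range(length):
        x = binomial_thinning(alpha, x, rng) + poisson(lam, rng)
        path.append(x)
    return path


path = simulate_inar1(alpha=0.5, lam=1.0, length=20000, seed=1)
sample_mean = sum(path) / len(path)
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {1.0 / (1 - 0.5):.3f}")
```

Replacing the usual product $\alpha X_{t-1}$ by the thinning $\alpha \circ X_{t-1}$ keeps every simulated value a non-negative integer while preserving the autoregressive mean structure.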

This talk aims at providing an overview of recent developments in thinning-based time series models paying particular attention to models obtained as discrete counterparts of conventional univariate and multivariate autoregressive moving average models, with either finite or infinite support. Finally, we also outline and discuss likely directions of future research.

### Robust heritability and predictive accuracy estimation in plant breeding

Genomic prediction is used in plant breeding to help find the best genotypes for selection. Here, the accurate estimation of predictive accuracy (PA) and heritability (H) is essential for genomic selection (GS). As in other applications, field data are analyzed via regression models, which are known to lead to biased estimation when the normality premise is violated, biases that may translate into inaccurate H and PA estimates and negatively impact GS. Therefore, a robust analogue of a method from the literature used for H and PA estimation is presented. Both techniques are then compared through simulation.

(Joint work with Hans-Peter Piepho & Joseph O. Ogutu, Bioinformatics Unit, Institute of Crop Science, University of Hohenheim, Stuttgart, Germany)

### On stochastic ordering and control charts for the traffic intensity

The traffic intensity is a crucial parameter of a queueing system since it is a measure of the average occupancy of a server. Expectedly, an increase in the traffic intensity must be detected quickly so that appropriate corrective actions are taken.

In this talk, we:

• briefly review existing procedures used to monitor the traffic intensity of M/G/1, GI/M/1 and a few other queues;
• focus on control charts to detect increases in traffic intensity whose control statistics are integer-valued and/or can be (approximately) modelled by discrete time Markov chains;
• investigate the stochastic monotonicity properties of the associated probability transition matrices;
• explore the implications of these properties to provide insights on the performance of such control charts.

(Joint work with António Pacheco)

### Behind the myth of Option Trading

Starting from the basics of derivative products, you will learn more about financial markets: how they are organized, what products are traded, by whom and why. But we won’t stop there: you will also have to put yourself in traders’ shoes and realize why derivatives lie where mathematics meets finance. We will go through some notions (Derivatives, tangible assets, financial contracts, Call/Put options, European/American options, option pricing, maxima…), focusing on the practical application of pricing models such as the B&S formula (limits and reality), new statistical models studied, etc.

### Symbolic Covariance Matrices and Principal Component Analysis for Interval Data

Recent years witnessed a huge breakthrough of technology which enables the storage of massive amounts of information. Additionally, the nature of the information collected is also changing. Besides the traditional format of recording single values for each observation, we now have the possibility to record lists, intervals, histograms or even distributions to characterize an observation. However, conventional data analysis is not prepared for either of these challenges, and does not have the necessary or appropriate means to treat extremely large databases or data with a more complex structure. As an answer to these challenges, Symbolic Data Analysis, introduced in the late 1980s by Edwin Diday, extends classical data analysis to deal with more complex data by taking into account inner data variability and structure.

Principal component analysis is one of the most popular statistical methods to analyse real data. Therefore, there have been several proposals to extend this methodology to the symbolic data analysis framework, in particular to interval-valued data.

In this talk, we discuss the concepts and properties of symbolic variance and covariance of an interval-valued variable. Based on these, we develop population formulations for four symbolic principal component estimation methods. This formulation introduces simplifications, additional insight and unification of the discussed methods. Additionally, an explicit and straightforward formula that defines the scores of the symbolic principal components, equivalent to the representation by Maximum Covering Area Rectangle, is also presented.

Joint work with António Pacheco (CEMAT and DM-IST, Univ. de Lisboa), Paulo Salvador (IT, Univ. Aveiro),  Rui Valadas (IT, Univ. de Lisboa), and Margarida Vilela (CEMAT).

### On ARL-unbiased c-charts for i.i.d. and INAR(1) Poisson counts

In Statistical Process Control (SPC) it is usual to assume that counts have a Poisson distribution. The non-negative, discrete and asymmetrical character of a control statistic with such a distribution and the value of its target mean may prevent the quality control practitioner from using a c-chart with:

1. a positive lower control limit and the ability to control not only increases but also decreases in the mean of those counts in a timely fashion;
2. a pre-specified in-control average run length (ARL).

Furthermore, as far as we have investigated, the c-charts proposed in the SPC literature tend not to be ARL-unbiased. (The term ARL-unbiased is used here to designate any control chart for which all out-of-control ARL values are smaller than the in-control ARL.)

In this talk, we explore the notions of unbiased, randomized and uniformly most powerful unbiased tests (resp. randomization of the emission of a signal and a nested secant rule search procedure) to:

1. eliminate the bias of the ARL function of the c-chart for the mean of i.i.d. (resp. first-order integer-valued autoregressive, INAR(1)) Poisson counts;
2. bring the in-control ARL exactly to a pre-specified and desired value.

We use the R statistical software to provide striking illustrations of the resulting ARL-unbiased c-charts.

Joint work with Sofia Paulino and Sven Knoth

### Transmission and Power Generation Investment under Uncertainty

The challenges of deregulated electricity markets and ambitious renewable energy targets have contributed to an increased need to understand how market participants will respond to a transmission planner’s investment decision. We study the optimal transmission investment decision of a transmission system operator (TSO) that anticipates a power company’s (PC) potential capacity expansion. The proposed model captures the investment decisions of both the TSO and the PC and accounts for the conflicting objectives and game-theoretic interactions of the distinct agents. Taking a real options approach allows us to study the effect of uncertainty on the investment decisions and to take into account timing as well as sizing flexibility.

We find that disregarding the power company’s optimal investment decision can have a large negative impact on social welfare for a TSO. The corresponding welfare loss increases with uncertainty. The TSO in most cases wants to invest in a higher capacity than is optimal for the power company. The exception is the case where the TSO has no timing flexibility and faces a relatively low demand level at investment. This implies that the TSO would overinvest if it disregarded the PC’s optimal capacity decision. On the contrary, we find that if the TSO only considers the power company’s sizing flexibility, it risks installing too small a capacity. We furthermore conclude that a linear subsidy in the power company’s investment cost could increase its optimal capacity and could thereby serve as an incentive for power companies to invest in larger capacities.

Joint work with Nora S. Midttun, Afzal S. Siddiqui, and Jannicke S. Sletten.

### Optional-Contingent-Product Pricing in Marketing Channels

This paper studies the pricing strategies of firms belonging to a vertical channel structure where a base product and an optional contingent product are sold. Optional contingent products are characterized by unilateral demand interdependencies. That is, the base product can be used independently of a contingent product. On the other hand, the contingent product’s purchase is conditional on the possession of the base product.

We find that the retailer decreases the price of the base product to stimulate demand on the contingent-product market. Even a loss-leader strategy could be optimal, which happens when reducing the base product’s price has a large positive effect on its demand, and thus on the number of potential consumers of the contingent product. The price reduction of the base product either mitigates the double-marginalization problem, or leads to an opposite inefficiency in the form of too low a price compared to the price that maximizes vertically integrated channel profits. The latter happens when the marginal impact of both products’ demands on the base product’s price is low, and almost equal in absolute terms.

Joint work with Sihem Taboubi and Georges Zaccour.

Immediately followed by another seminar session.

### Comparison of Statistical and Deterministic Frameworks of Uncertainty Quantification

Two different approaches to the prediction problem are compared employing a realistic example, combustion of natural gas, with 102 uncertain parameters and 76 quantities of interest. One approach, termed Bound-to-Bound Data Collaboration (abbreviated to B2B), deploys semi-definite programming algorithms where the initial bounds on unknowns are combined with the initial bounds on experimental data to produce new uncertainty bounds for the unknowns that are consistent with the data and, finally, deterministic uncertainty bounds for prediction in new settings. The other approach is statistical and Bayesian, referred to as BCP (for Bayesian Calibration and Prediction). It places prior distributions on the unknown parameters and on the parameters of the measurement error distributions and produces posterior distributions for model parameters and posterior distributions for model predictions in new settings. The predictions from the two approaches are consistent: B2B bounds and the support of the BCP predictive distribution overlap a very large part of each other. The BCP predictive distribution is more nuanced than the B2B bounds but depends on stronger assumptions. Interpretation and comparison of the results are closely connected with assumptions made about the model and experimental data and how they are used in both settings. The principal conclusion is that use of both methods protects against possible violations of assumptions in the BCP approach and conservative specifications and predictions using B2B.

Joint work with Michael Frenklach, Andrew Packard (UC Berkeley), Jerome Sacks (National Institute of Statistical Sciences) and Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha)

### The importance of Statistics in Bioinformatics

Statistics acts in several areas of knowledge, with Bioinformatics being one of the most recent application fields. In reality, the role of Statistics in Bioinformatics goes beyond a mere intervention. It is an integral pillar of Bioinformatics. Statistics has been gaining its space in this area, becoming an essential component of recognized merit. In this seminar the speaker intends to show the importance of Statistics in addressing systems as diverse as protein structure or microarray and NGS data. A set of specific studies in Molecular Biology will be the basis for the presentation of some of the most common statistical methodologies in Bioinformatics. The importance of the available software, including some R packages, is also shown.

### Investment Decisions under Multi-uncertainty and Exogenous Shocks

In this presentation we study the investment problem when both the demand and the investment costs are stochastic. We assume that the processes are independent, and both are modeled using geometric Brownian motion with exogenous jumps driven by independent Poisson processes. We use a real options approach, leading to an optimal stopping problem. Due to the multi-uncertainty, we propose a method to solve the problem explicitly, and we prove that this method leads exactly to the solution of the optimization problem.

Joint work with Rita Pimentel.

### Statistical Learning for Natural Language Processing

The field of Natural Language Processing (NLP) deals with automatic processing of large corpora of text such as newswire articles (from online newspaper websites), social media (such as Facebook or Twitter) and user-created content (such as Wikipedia). It has experienced large growth in academia as well as in industry, with large corporations such as Microsoft, Google, Facebook, Apple, Twitter, Amazon, among others, investing strongly in these technologies.

One of the most successful approaches to NLP is statistical learning (also known as machine learning), which uses the statistical properties of corpora of text to infer new knowledge.

In this talk I will present multiple NLP problems and provide a brief overview of how they can be solved with statistical learning. I will also present one of these problems (language detection) in more detail to illustrate how basic properties of Probability Theory are at the core of these techniques.

### To be or not to be Bayesian: That IS NOT the question

Frequentist and Bayesian approaches to statistical thinking are different in their foundations and have been developed along somewhat separate routes. At a time when more and more powerful statistics is badly needed, special attention should be given to the existing connections between the two paradigms in order to bring together their formidable strengths.

### Anticipative Transmission Planning under Uncertainty

Transmission system operators (TSOs) build transmission lines to take generation capacity into account. However, their decision is confounded by policies that promote renewable energy technologies. Thus, what should be the size of the transmission line to accommodate subsequent generation expansion? Taking the perspective of a TSO, we use a real options approach not only to determine the optimal timing and sizing of the transmission line but also to explore its effects on generation expansion.

### A robust mixed linear model for heritability estimation in plant studies

Heritability ($H^2$) refers to the extent to which a certain phenotype is genetically determined. Knowledge of $H^2$ is crucial in plant studies to help perform effective selection. Once a trait is known to be highly heritable, association studies are performed so that the SNPs underlying that trait’s variation may be found. Here, regression models are used to test for associations between phenotype and candidate SNPs. SNP imputation ensures that marker information is complete, so the coefficient of determination ($R^2$) and $H^2$ are equivalent. One popular model used in these studies is the animal model, which is a linear mixed model (LMM) with a specific layout. However, when the normality assumption is violated, this model, like other likelihood-based models, may provide biased results in the association analysis and greatly affect the classical $R^2$. Therefore, a robust version of the REML estimates for LMMs to be used in this context is proposed, as well as a robust version of a recently proposed $R^2$. The performance of both classical and robust approaches for the estimation of $H^2$ is thus evaluated via simulation, and an example of application with a maize data set is presented.

Joint work with P.C. Rodrigues, M.S. Fonseca and A.M. Pires
