# Probability and Statistics Seminar

## Past sessions

### Distributed Learning Algorithms for Big Data

Modern datasets are increasingly collected by teams of agents that are spatially distributed: sensor networks, networks of cameras, and teams of robots. To extract information in a scalable manner from those distributed datasets, we need distributed learning. In the vision of distributed learning, no central node exists; the spatially distributed agents are linked by a sparse communication network and exchange short messages among themselves to directly solve the learning problem. To work in the real world, a distributed learning algorithm must cope with several challenges, e.g., correlated data, failures in the communication network, and minimal knowledge of the network topology. In this talk, we present some recent distributed learning algorithms that can cope with such challenges. Although our algorithms are simple extensions of known ones, these extensions require new mathematical proofs that elicit interesting applications of probability theory tools, namely, ergodic theory.
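As a rough illustration of learning with no central node, the sketch below runs a textbook consensus-averaging iteration in which each agent repeatedly nudges its estimate toward those of its neighbors. It is not one of the algorithms presented in the talk; the ring topology, the step size and the function names are all assumptions made for the example.

```python
def consensus_step(values, neighbors, step=0.3):
    """One synchronous round: each agent moves toward its neighbors' values."""
    return [
        x + step * sum(values[j] - x for j in neighbors[i])
        for i, x in enumerate(values)
    ]


def run_consensus(values, neighbors, rounds=200, step=0.3):
    """Repeat the local exchange; no agent ever sees the whole data set."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step)
    return values


# Hypothetical ring of 5 agents: agent i only talks to agents i-1 and i+1.
n_agents = 5
ring = {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}
local_measurements = [1.0, 4.0, 2.0, 8.0, 5.0]

estimates = run_consensus(local_measurements, ring)
print(estimates)  # every entry approaches the network-wide average
```

Because the exchange is symmetric, the sum of the estimates is preserved at every round, so the agents can only agree on the global average.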

### Contributions for the detection of multivariate outliers

The detection of outliers in multivariate models is always a difficult matter, but the subject is even more complex when dealing with dependent structures, as is the case with the Simultaneous Equation Model (SEM). Unlike other models defined by systems of equations, such as multivariate regression, the SEM assumes that the response variable in each equation can appear as an explanatory variable in the rest of the system, meaning that explanatory variables can be correlated with the error terms. We present a method of outlier detection that bypasses those difficulties using the asymptotic distribution of adequate robust Mahalanobis distances. The process identifies anomalous data points as outliers of the SEM in simple steps and provides a clear visualization. We illustrate this procedure with a real econometric data set.
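The general idea of flagging points by their Mahalanobis distance can be sketched as follows. This is a minimal two-dimensional example, assuming that the center and scatter matrix have already been produced by some robust estimator, and assuming the classical chi-square reference distribution; the asymptotic distribution used in the talk for the SEM may differ. All function names are hypothetical.

```python
import math


def mahalanobis_sq_2d(point, center, scatter):
    """Squared Mahalanobis distance of a 2-D point for a 2x2 scatter matrix."""
    (a, b), (c, d) = scatter
    det = a * d - b * c
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    # Quadratic form with the inverse of the 2x2 scatter matrix, written out.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_quantile_2df(prob):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - prob)


def flag_outliers(points, center, scatter, prob=0.975):
    """Flag points whose squared distance exceeds the chi-square cutoff."""
    cutoff = chi2_quantile_2df(prob)
    return [mahalanobis_sq_2d(p, center, scatter) > cutoff for p in points]


# Assumed robust estimates (placeholders, not computed from the data here).
center = (0.0, 0.0)
scatter = ((1.0, 0.0), (0.0, 1.0))
points = [(1.0, 0.0), (3.0, 3.0), (-0.5, 0.5)]
print(flag_outliers(points, center, scatter))
```

Plotting the squared distances against the observation index, with the cutoff as a horizontal line, gives the kind of simple visualization such methods usually rely on.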

### Robust logistic regression with sparse predictor variables

Nowadays, dealing with high-dimensional data is a recurrent problem that cuts across modern statistics. One main feature of high-dimensional data is that the dimension $p$, that is, the number of covariates, is high, while the sample size $n$ is relatively small. In this circumstance, the bet on sparsity principle suggests proceeding under the assumption that most of the effects are not significant. Sparse covariates are frequent in the classification problem, and in this situation the task of variable selection may also be of interest. We focus on the logistic regression model and our aim is to address robust and sparse estimators of the regression parameter in order to perform estimation and variable selection at the same time. For this purpose, we introduce a family of penalized M-type estimators for the logistic regression parameter that are stable against atypical data. We explore different penalization functions and we introduce the so-called sign penalization. This new penalty has the advantage that it does not shrink the estimated coefficients to $0$ and that it depends only on one parameter. We will discuss the variable selection capability of the proposal as well as its asymptotic behaviour. Through a numerical study, we compare the finite sample performance of the proposal with different penalized estimators, either robust or classical, under different scenarios.

### A Comprehensive Methodology to Analyse Topic Difficulties in Educational Programmes

We propose a comprehensive Learning Analytics methodology to investigate the level of understanding students achieve in the learning process. The goals of this methodology are:

1. To identify topics in which students experience difficulties;
2. To assess whether these difficulties recur across semesters;
3. To decide if there are conceptual associations between topics in which students experience difficulties; and, more generally,
4. To discover statistically significant groups of topics in which students show similar performance.

The proposed methodology uses statistics and data visualization techniques to address the first and second goals, frequent itemset mining to tackle the third goal, and biclustering to find relationships within educational data, revealing meaningful and statistically significant patterns of students’ performance.

We illustrate the application of the methodology to a Computer Science course.

### Working towards a typology of indices of agreement for clustering evaluation

Indices of agreement (IA) are commonly used to evaluate stability of a clustering solution or its agreement with ground truth – internal and external validation of the same solution, respectively.

IA provide different measures of the accordance between two partitions of the same data set and are based on contingency table data. Despite their frequent use in clustering evaluation, there are still open issues regarding the specific thresholds each index should use for drawing conclusions about the degree of agreement between the partitions.

To acquire new insights on the indices' behavior that may help improve clustering evaluation, 14 paired indices of agreement are analyzed within diverse experimental scenarios, with balanced or unbalanced clusters and poorly, moderately or well separated ones. The paired indices’ observed values are all based on a cross-classification table of counts of pairs of observations that both partitions agree to join and/or separate in the clusters. The IADJUST method is used to learn about the behavior of the indices under the hypothesis of agreement between partitions occurring by chance (H0). It relies on the generation of contingency tables under H0; being a simulation-based procedure, it makes it possible to correct any index of agreement by deducting agreement by chance, overcoming previous limitations of analytical or approximate approaches (Amorim and Cardoso, 2015).
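The cross-classification of pairs described above can be made concrete with a short sketch. It counts, for two partitions of the same observations, the pairs that both partitions join, that only one joins, and that both separate, and then computes the Rand index as one well-known example of a paired index. The function names are hypothetical, and the sketch does not implement the IADJUST correction.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Return (both_join, a_only, b_only, both_separate) over all pairs."""
    both_join = a_only = b_only = both_separate = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both_join += 1
        elif same_a:
            a_only += 1
        elif same_b:
            b_only += 1
        else:
            both_separate += 1
    return both_join, a_only, b_only, both_separate


def rand_index(labels_a, labels_b):
    """Share of pairs on which the two partitions agree (join or separate)."""
    jj, ja, jb, ss = pair_counts(labels_a, labels_b)
    return (jj + ss) / (jj + ja + jb + ss)


partition_a = [0, 0, 1, 1]
partition_b = [0, 1, 0, 1]
print(pair_counts(partition_a, partition_b))
print(rand_index(partition_a, partition_b))
```

Other paired indices are different functions of the same four counts, which is why their behavior under chance agreement can be studied on a common footing.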

The results suggest a preliminary typology of paired indices of agreement based on their distributional characteristics under H0. Inter-scenario symbolic data referring to location, dispersion and shape measures of IA distributions under H0 are used to build this typology.

### Reference

Amorim, M. J., & Cardoso, M. G. (2015). Comparing clustering solutions: The use of adjusted paired indices. Intelligent Data Analysis, 19(6), 1275-1296.

Joint work with Maria José Amorim (Department of Mathematics of ISEL, Lisbon, Portugal).

### Feed-in Tariff Contract Schemes and Regulatory Uncertainty

This paper presents a novel analysis of four finite feed-in tariff (FIT) schemes, namely fixed-price, fixed-premium, minimum price guarantee and sliding premium with a cap and a floor, under market and regulatory uncertainty. Using an analytical real options framework, we derive the project value, the optimal investment threshold and the value of the investment opportunity for the four FIT schemes. Regulatory uncertainty is modeled by allowing the tariff to be reduced before the signature of the contract. While market uncertainty defers investment, a higher and more likely tariff reduction accelerates investment. We also present several findings aimed at policymaking decisions, namely regarding the choice, level and duration of the FIT. For instance, the investment threshold of the sliding premium with a cap and a floor is lower than that of the minimum price guarantee, which suggests that the former regime is a better policy than the latter because it accelerates investment while avoiding overcompensation.

### Selecting differentially expressed genes in samples subgroups on microarray data

A common task in analysing microarray data is to determine which genes are differentially expressed under two (or more) kinds of tissue samples or samples submitted under different experimental conditions. It is well known that biological samples are heterogeneous due to factors such as molecular subtypes or genetic background, which are often unknown to the investigator. For instance, in experiments which involve molecular classification of tumours it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples.

Consequently, truly differentially expressed genes in sample subgroups may be lost if the usual statistical approaches are used. In this work we propose a graphical tool which identifies genes with up and down regulation, as well as genes whose differential expression reveals hidden subclasses that are usually missed if current statistical methods are used.

### Optimal investment decision under switching regimes of subsidy support

We address the problem of making a managerial decision when the investment project is subsidized, which leads to an infinite-horizon optimal stopping problem of a switching diffusion driven by either a homogeneous or an inhomogeneous continuous-time Markov chain. We provide a characterization of the value function (and optimal strategy) of the optimal stopping problem. On the one hand, broadly, we can prove that the value function is the unique viscosity solution to a system of HJB equations. On the other hand, when the Markov chain is homogeneous and the switching diffusion is one-dimensional, we obtain stronger results: the value function is the difference between two convex functions.

### Monitoring Non-Stationary Processes

In nearly all papers on statistical process control for time-dependent data it is assumed that the underlying process is stationary. However, in finance and economics we are often faced with situations where the process is close to non-stationarity or it is even non-stationary.

In this talk the target process is modeled by a multivariate state-space model which may be non-stationary. Our aim is to monitor its mean behavior. The likelihood ratio method, the sequential probability ratio test, and the Shiryaev-Roberts procedure are applied to derive control charts signaling a change from the supposed mean structure. These procedures depend on certain reference values which have to be chosen by the practitioner in advance. The corresponding generalized approaches are considered as well, and generalized control charts are determined for state-space processes. These schemes do not have further design parameters. In an extensive simulation study the introduced schemes are compared with each other using various performance criteria, such as the average run length, the average delay, the probability of a successful detection, and the probability of a false detection.

### Literature

• Lazariv T. and Schmid W. (2018). Surveillance of non-stationary processes. AStA - Advances in Statistical Analysis, https://doi.org/10.1007/s10182-018-00330-4 .
• Lazariv T. and Schmid W. (2018). Challenges in monitoring non-stationary time series. In Frontiers in Statistical Process Control, Vol. 12, pp. 257-275. Berlin: Springer.

Joint work with Taras Lazariv (European University Viadrina, Department of Statistics, Germany).

### A thinning-based EWMA chart to monitor counts: some preliminary results

Shewhart control charts are known to be somewhat insensitive to shifts of small and moderate size. Expectedly, alternative control schemes such as the cumulative sum (CUSUM) and the exponentially weighted moving average (EWMA) charts have been proposed to speed up the detection of such shifts.

The novel chart we propose relies on an EWMA control statistic where the usual scalar product is replaced by what we call a fractional binomial thinning, to avoid the typical oversmoothing ascribable to ceiling, rounding, and flooring operations. The properties of this discrete statistic are, to a moderate extent, similar to those of its continuous EWMA counterpart, and the run length (RL) performance of the associated chart can be computed exactly using the Markov chain approach for independent and identically distributed (i.i.d.) counts. Moreover, this chart is set in such a way that the average run length (ARL) curve attains a maximum in the in-control situation, i.e., the chart is ARL-unbiased, and the in-control ARL is equal to a pre-specified value.
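For orientation, the two standard ingredients the abstract builds on can be written down as follows. The notation ($Z_t$, $X_t$, $\lambda$, $\alpha$, $B_i$) is assumed here and is not taken from the talk, and the fractional variant of the thinning operation is not defined in the abstract, so it is not reproduced. The usual EWMA statistic for observations $X_t$ with smoothing constant $\lambda \in (0, 1]$ is

$$
Z_t = (1 - \lambda)\, Z_{t-1} + \lambda\, X_t , \qquad t = 1, 2, \dots
$$

which is generally not integer-valued even when the $X_t$ are counts. The classical binomial thinning of a count $X$ with $\alpha \in [0, 1]$ is

$$
\alpha \circ X = \sum_{i=1}^{X} B_i , \qquad B_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\alpha),
$$

which is integer-valued and satisfies $\mathrm{E}[\alpha \circ X \mid X] = \alpha X$, so it plays the role of the scalar product $\alpha X$ while keeping the statistic a count.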

We use the R statistical software to provide compelling illustrations of this unconventional EWMA chart and to compare its RL performance with the ones of a few competing control charts for the mean of i.i.d. Poisson counts.

### Keywords

Average run length; Exponentially weighted moving average; Fractional binomial thinning; Statistical process control.

### Distributed learning in large scale networks: from GPS-denied localization to MAP inference

Big Data can elicit greater insight, but storage or computational limitations, or even privacy concerns, challenge learning from massive data sets. The distributed paradigm fits such problems just right: such algorithms work on partial data and fuse intermediate results within local neighborhoods, over a distributed network of computing nodes. In this talk we will take a tour starting with GPS-denied localization and culminating in a general distributed MAP inference algorithm for graphical models.

### Estimation of the drift of a $2n$-dimension OU process

A $2n$-dimensional Ornstein-Uhlenbeck (OU) process for which the diffusion matrix is singular is considered. This process is used as a model for the dynamic behavior of vibrating engineering structures such as bridges, buildings and dams, among others. We study the problem of estimating the vibration frequencies of the structure or, equivalently, the parameters of the stochastic differential equation (SDE) that governs the OU process.

Firstly, we consider the case where the OU process is perturbed by an independent Wiener process. The maximum likelihood estimator of the drift matrix is obtained and the properties of the estimator are established. The local asymptotic normality of the estimator is analyzed in detail. Since general regularity conditions do not hold in this case (the diffusion matrix is singular), theoretical results from the classic literature on the subject do not immediately apply and an alternative approach based on the Laplace transform is used.

Secondly, we consider the case where the OU process is perturbed by two independent fractional Brownian motions. Models involving fractional noises have not been widely used in engineering. However, many problems in engineering involve processes exhibiting long memory. For this reason, the estimation of the parameters of multidimensional linear state-space models, described by SDEs and disturbed by fractional Brownian motion, has potential applications in different areas of engineering. We analyze the problem of estimating the drift parameters of a $2$-dimensional linear stochastic differential equation perturbed by two independent fractional Brownian motions with the same Hurst parameter belonging to $(1/2,1)$. The maximum likelihood estimator of the drift parameters is obtained after a transformation of the original model, making use of the so-called fundamental martingale.

In both cases, a simulation study is presented in the context of a real world situation that illustrates the asymptotic behavior of the maximum likelihood estimator of the drift matrix.

### Multiple-valued symbolic data clustering: heuristic and model-based approaches

Symbolic data analysis (SDA) has been developed as an extension of data analysis to handle more complex data structures. In this general framework the pair observation/variable is characterized by more than one value: from two (e.g., interval-valued data defined by minimum and maximum values) to multiple-valued variables (e.g., frequencies or proportions).

This research discusses the clustering of multiple-valued symbolic data. First, we discuss an extension of heuristic clustering based on the symmetric Kullback-Leibler distance combined with a complete-linkage rule within the hierarchical clustering framework. Then, we propose a new model-based clustering framework. This new family of models, based on the Dirichlet distribution, includes mixtures of regression/expert models. Results are illustrated with synthetic and demographic (population pyramids) data.
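The symmetric Kullback-Leibler distance mentioned above can be sketched for two multiple-valued observations, each represented as a vector of proportions. This is a minimal version that assumes strictly positive proportions in both vectors; how the talk handles zero cells is not stated, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Two hypothetical age-group proportion vectors (each sums to 1).
region_a = [0.30, 0.50, 0.20]
region_b = [0.20, 0.50, 0.30]
print(symmetric_kl(region_a, region_b))
```

Under a complete-linkage rule, the distance between two clusters would then be the largest such value over all pairs of observations drawn from the two clusters.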

### Market Risk Measurement — Theory and Practice

Topics that will be covered in this talk:

• Value-at-Risk (VaR)
• Expected Shortfall (ES)
• VaR/ES Measurement
• Historical Simulation
• Model Building Approach
• Monte Carlo Simulation Approach
• VaR Backtesting
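As a small illustration of the historical simulation approach listed above, the sketch below estimates VaR and ES from a sample of past returns. Several empirical-quantile conventions exist; the one used here (take the `floor(alpha * n)` worst losses, at least one) is an assumption of this example, as are the function name and the toy data.

```python
import math


def historical_var_es(returns, alpha=0.05):
    """Historical-simulation VaR and ES at tail probability alpha.

    Losses are negated returns. VaR is the smallest of the k worst losses
    and ES is their average, with k = max(1, floor(alpha * n)).
    """
    losses = sorted((-r for r in returns), reverse=True)  # largest loss first
    k = max(1, math.floor(alpha * len(losses) + 1e-9))
    tail = losses[:k]
    return tail[-1], sum(tail) / k


# Toy sample of ten daily returns (made up for the example).
daily_returns = [-0.10, -0.05, -0.02, 0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04]
var_20, es_20 = historical_var_es(daily_returns, alpha=0.20)
print(var_20, es_20)
```

By construction ES is never smaller than VaR at the same level, since it averages the losses at or beyond the VaR.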

### Evaluation of volatility models for forecasting Value-at-Risk and Expected Shortfall in the Portuguese Stock Market

The objective of this paper is to run a forecasting competition of different parametric volatility time series models to estimate Value-at-Risk (VaR) and Expected Shortfall (ES) within the Portuguese Stock Market. This work is also intended to bring new insights about the methods used throughout this exercise. Finally, we want to relate the timing of the exceptions (extreme losses surpassing the VaR) with events at the firm level and with national/international economic conditions.

For these purposes, a number of models from the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) class are used with different distribution functions for the innovations, in particular, Normal, Student-t and Generalized Error Distribution (GED) and corresponding skewed versions. The GARCH models are also used in conjunction with the Generalized Pareto Distribution through the use of extreme value theory.

The performance of these different models in forecasting 1% and 5% VaR and ES for 1-day, 5-day and 10-day horizons is analyzed for a set of companies traded on the Euronext Lisbon stock exchange. The results obtained for the VaRs and ESs are evaluated with backtesting procedures based on a number of statistical tests and compared using different loss functions.

The final results are analyzed in several dimensions. Preliminary analyses show that the use of extreme value theory generally leads to better results, especially for low values of alpha. This is more evident in the case of the statistical backtests dealing with ES. Moreover, skewed distributions generally do not seem to perform better than their centered counterparts.
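The abstract does not name the statistical tests used for backtesting. As one common example, the sketch below computes Kupiec's unconditional coverage likelihood-ratio statistic, which compares the observed number of exceptions with the number expected at the nominal VaR level. The function names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that the result is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_lr(n_obs, n_exceptions, p):
    """Kupiec unconditional coverage statistic for a VaR at tail level p.

    Under correct coverage it is asymptotically chi-square with 1 degree
    of freedom (5% critical value about 3.841).
    """
    x, n = n_exceptions, n_obs
    p_hat = x / n
    log_l0 = _xlogy(n - x, 1 - p) + _xlogy(x, p)          # nominal rate
    log_l1 = _xlogy(n - x, 1 - p_hat) + _xlogy(x, p_hat)  # observed rate
    return -2.0 * (log_l0 - log_l1)


# 15 exceptions in 100 days for a 5% VaR: far more than the 5 expected.
print(kupiec_lr(100, 15, 0.05))
```

A statistic above the critical value suggests that the VaR forecasts are miscalibrated, but the test says nothing about whether exceptions cluster in time.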

### Dynamic Capital Structure Choice and Investment Timing

The paper considers the problem of an investor that has the option to acquire a firm. Initially this firm is run so as to maximize shareholder value, where the shareholders are risk averse. To do so it has to decide at each point in time on investment and dividend levels. The firm's capital stock can be financed by equity and debt, where less solvent firms pay a higher interest rate on debt. Revenue is stochastic.

We find that the firm is run such that capital stock and dividends develop in a fixed proportion to the equity. In particular, it turns out that more dividends are paid if the economic environment is more uncertain. We also derive an explicit expression for the threshold value of the equity above which it is optimal for the investor to acquire the firm. This threshold increases in the level of uncertainty, reflecting the value of waiting that uncertainty generates.

Joint work with Engelbert J. Dockner (deceased) and Richard F. Hartl.

### Processes with jumps in Finance

We will address two particular investment problems that share the following feature: the processes that model the uncertainty exhibit discontinuities in their sample paths. These discontinuities, or jumps, are driven by jump processes, here modelled by Poisson processes. Both problems fall in the category of optimal stopping problems: choose a time to take a given action (here, the time to invest, as we consider investment problems) in order to maximize an expected payoff.

In the first problem, we assume that a firm is currently receiving a profit stream from an already operational project, and has the option to invest in a new project, with an impact on its profitability. Moreover, we assume that there are two sources of uncertainty that influence the firm’s decision about when to invest: the random fluctuations of the revenue (depending on the random demand) and the changing investment cost. As already mentioned, both processes exhibit discontinuities in their sample paths.

The second problem is developed in the scope of technology adoption. Technology innovation is a clear example of a discontinuous process: the technological level does not increase at a steady pace; instead, every now and then some improvement or breakthrough happens. Thus it is natural to assume that technology innovations are driven by jump processes. As such, in this problem we consider a firm that is producing in a declining market, but with the option to undertake an innovation investment and thereby replace the old product by a new one, paying a constant sunk cost. As the first product is a well-established one, its price is deterministic. Upon investment in the second product, the price may fluctuate according to a geometric Brownian motion. The decision is when to invest in the new product.

### Robust inference for ROC regression

The receiver operating characteristic (ROC) curve is the most popular tool for evaluating the diagnostic accuracy of continuous biomarkers. Often, covariate information that affects the biomarker performance is also available, and several regression methods have been proposed to incorporate covariates in the ROC framework. In this work, we propose robust inference methods for ROC regression, which can be used to safeguard against the presence of outlying biomarker values. Simulation results suggest that the methods perform well in recovering the true conditional ROC curve and corresponding area under the curve, under a variety of data contamination scenarios. Methods are illustrated using data on age-specific accuracy of glucose as a biomarker of diabetes.
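For readers unfamiliar with the area under the ROC curve, the sketch below computes its plain empirical version: the proportion of (diseased, healthy) pairs in which the diseased subject has the higher biomarker value, with ties counted as one half. It ignores covariates and is not robust to outliers, so it is only the unconditional baseline that the methods in the talk refine. The function name and data are made up for the example.

```python
def empirical_auc(healthy, diseased):
    """Empirical AUC as the Mann-Whitney proportion of correctly ordered pairs."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(healthy) * len(diseased))


# Hypothetical biomarker values for the two groups.
healthy_values = [1.0, 2.0, 3.0]
diseased_values = [2.0, 3.0, 4.0]
print(empirical_auc(healthy_values, diseased_values))
```

A single extreme value in either group can change many pairwise comparisons at once, which is one way to see why outlying biomarker values matter for ROC inference.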

(Joint work with: Vanda I. de Carvalho & Miguel de Carvalho, University of Edinburgh, UK)

### Challenges of Clustering

Grouping similar objects in order to produce a classification is one of the basic abilities of human beings. It is one of the primary milestones of a child's concrete operational stage and continues to be used throughout adult life, playing a very important role in how we analyse our world. Although grouping is a practical skill, clustering techniques are also commonly used in several application areas such as social sciences, medicine, biology, engineering and computer science. Despite their wide application, two issues remain open research questions: (i) how many clusters should be selected? and (ii) which are the relevant variables for clustering? These two questions are crucial in order to obtain the best solution. We will answer them using a model-based approach based on finite mixture distributions and information criteria: the Bayesian Information Criterion (BIC), Akaike's Information Criterion (AIC), the Integrated Completed Likelihood (ICL) and Minimum Message Length (MML).
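As a reminder of how two of these criteria trade off fit against complexity, their standard definitions are given below. The notation is assumed here and not taken from the talk: $\hat{L}$ is the maximized likelihood of a mixture model, $k$ its number of free parameters and $n$ the sample size.

$$
\mathrm{AIC} = -2 \log \hat{L} + 2k , \qquad \mathrm{BIC} = -2 \log \hat{L} + k \log n .
$$

In this sign convention the preferred model is the one with the smallest value, and BIC penalizes extra parameters more heavily than AIC as soon as $\log n > 2$, that is, for $n \geq 8$.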

### Accurate implementations of nonlinear Kalman-like filtering methods with application to chemical engineering

A goal in many practical applications is to combine a priori knowledge about a physical system with experimental data to provide on-line estimation of states and/or parameters of that system. The time evolution of the (hidden) state is modeled by a dynamic system which is perturbed by a certain process noise. This noise is used for modeling the uncertainties in the system dynamics. The term optimal filtering traditionally refers to a class of methods that can be used for estimating the state of a time-varying system which is indirectly observed through noisy measurements. In this talk, we discuss the development of advanced Kalman-like filtering methods for estimating continuous-time nonlinear stochastic systems with discrete measurements. We start with a brief overview of existing nonlinear Bayesian methods. Next, we focus on the numerical implementation of the Kalman-like filters (the Extended Kalman filter, the Unscented Kalman filter and the Cubature Kalman filter) for estimating the state of continuous-discrete models. The standard approach uses the Euler-Maruyama method to discretize the underlying (process) stochastic differential equation (SDE). To reduce the discretization error, some subdivisions might additionally be introduced in each sampling interval. Some modern continuous-time filtering methods are developed using higher-order methods, e.g., the cubature Kalman filter based on the Itô-Taylor expansion for discretizing the underlying SDE. However, all the resulting implementations are fixed-step-size methods, and they do not allow for proper processing of long and irregular sampling intervals (e.g., when measurements are missing). An alternative methodology is to derive the moment differential equations first. Next, the resulting ordinary differential equations (ODEs) are solved by modern ODE solvers. This approach allows for using variable-step-size solvers and copes accurately with long or irregular sampling intervals. Besides, we use ODE solvers with global error control, which improves the estimation quality further. As a numerical example we consider the batch reactor model studied in the chemical engineering literature.
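The fixed-step discretization that the abstract calls the standard approach can be sketched for a scalar SDE $dX = f(X)\,dt + g(X)\,dW$. The function below propagates the state across one sampling interval with the Euler-Maruyama method, using a chosen number of equal subdivisions. It only illustrates the discretization step, not a filter, and the function name, parameters and test dynamics are assumptions of this example.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, interval, substeps, rng):
    """Propagate a scalar SDE over one sampling interval.

    The interval is split into `substeps` equal steps of size h, and each
    step applies x <- x + drift(x) * h + diffusion(x) * dW, dW ~ N(0, h).
    """
    h = interval / substeps
    x = x0
    for _ in range(substeps):
        dw = rng.gauss(0.0, math.sqrt(h))
        x = x + drift(x) * h + diffusion(x) * dw
    return x


rng = random.Random(0)

# With the noise switched off, the exact answer for dx/dt = -x is exp(-1).
for m in (1, 10, 100):
    approx = euler_maruyama(1.0, lambda x: -x, lambda x: 0.0, 1.0, m, rng)
    print(m, abs(approx - math.exp(-1.0)))

# The same call with noise, e.g. constant diffusion 0.2.
print(euler_maruyama(1.0, lambda x: -x, lambda x: 0.2, 1.0, 100, rng))
```

The noise-free runs show the error shrinking as subdivisions are added, but the step size stays fixed within the interval, which is the limitation that variable-step ODE solvers applied to the moment equations are meant to address.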