# Probability and Statistics Seminar

## Past sessions

### Quasi-analytical solution of an investment problem with decreasing investment cost due to technological innovations

In this talk we address, in the context of real options, an investment problem with two sources of uncertainty: the price (reflected in the revenue of the firm) and the level of technology. The level of technology affects the investment cost, which decreases whenever a technological innovation occurs. The price follows a geometric Brownian motion, whereas the technological innovations are driven by a Poisson process. As a consequence, the investment region may be reached continuously (through an increase in the price) or discontinuously (through a sudden decrease in the investment cost).

No analytical solution is known for this optimal stopping problem, and we therefore propose a quasi-analytical method to find an approximate solution that preserves the qualitative features of the exact solution. The method is based on a truncation procedure, and we prove that the truncated solution converges to the solution of the original problem.

We provide comparative statics results for the investment thresholds. These results show interesting behavior: in particular, investment may be postponed or brought forward depending on the intensity of the technological innovations and on their impact on the investment cost.

(joint work with Carlos Oliveira and Rita Pimentel)

Nunes_C_slides.pdf

### On ARL-unbiased charts to monitor the traffic intensity of a single server queue

We know too well that the effective operation of a queueing system requires maintaining the traffic intensity at a target value. This important measure of congestion can be monitored by using control charts, such as the one found in the seminal work by Bhat and Rao (1972) or, more recently, in Chen and Zhou (2015). This paper focuses on three control statistics chosen by Morais and Pacheco (2016) for their simplicity and their recursive and Markovian character:

- the number of customers left behind in the M/G/1 system by the n-th departing customer;
- the number of customers seen in the GI/M/1 system by the n-th arriving customer;
- the waiting time of the n-th arriving customer to the GI/G/1 system.

Since an upward and a downward shift in the traffic intensity are associated with a deterioration and an improvement (respectively) of the quality of service, the timely detection of these changes is imperative. This calls for the use of ARL-unbiased charts (Pignatiello et al., 1995), in the sense that they detect any shift in the traffic intensity sooner than they trigger a false alarm. In this paper, we focus on the design of this type of chart for the traffic intensity of the three single-server queues mentioned above.
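
The recursive character of the third control statistic can be illustrated with the classical Lindley recursion, $W_{n+1} = \max(0, W_n + S_n - A_{n+1})$, where $S_n$ is a service time and $A_{n+1}$ an interarrival time. The sketch below is only an illustration and not the chart design from the talk: it assumes an M/M/1 queue (a special case of GI/G/1), and the function name and the rates are hypothetical choices.

```python
import random

def simulate_waiting_times(n, arrival_rate, service_rate, seed=0):
    """Waiting times of the first n customers via the Lindley recursion.

    Assumes exponential interarrival and service times (M/M/1), which is
    an illustrative choice; the recursion itself holds for GI/G/1.
    """
    rng = random.Random(seed)
    waits = [0.0]  # the first customer finds an empty system
    for _ in range(n - 1):
        service = rng.expovariate(service_rate)       # S_n
        interarrival = rng.expovariate(arrival_rate)  # A_{n+1}
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return waits

# Traffic intensity rho = arrival_rate / service_rate = 0.8 (hypothetical target)
waits = simulate_waiting_times(10_000, arrival_rate=0.8, service_rate=1.0)
mean_wait = sum(waits) / len(waits)
print(f"average simulated waiting time: {mean_wait:.2f}")
```

Each waiting time depends on the past only through the previous one, which is the Markovian property that makes the statistic convenient for monitoring.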

Joint work with Sven Knoth

Slides of the talk

### On ARL-unbiased charts to monitor the traffic intensity of a single server queue

We know too well that the effective operation of a queueing system requires maintaining the traffic intensity at a target value.
This important measure of congestion can be monitored by using control charts, such as the one found in the seminal work by Bhat and Rao (1972) or, more recently, in Chen and Zhou (2015).
This paper focuses on three control statistics chosen by Morais and Pacheco (2016) for their simplicity and their recursive and Markovian character:

- the number of customers left behind in the M/G/1 system by the n-th departing customer;
- the number of customers seen in the GI/M/1 system by the n-th arriving customer;
- the waiting time of the n-th arriving customer to the GI/G/1 system.

Since an upward and a downward shift in the traffic intensity are associated with a deterioration and an improvement (respectively) of the quality of service, the timely detection of these changes is imperative. This calls for the use of ARL-unbiased charts (Pignatiello et al., 1995), in the sense that they detect any shift in the traffic intensity sooner than they trigger a false alarm.
In this paper, we focus on the design of this type of chart for the traffic intensity of the three single-server queues mentioned above.

Joint work with Sven Knoth

Cancelled due to Covid-19 containment measures.

### Extreme Value Theory applied to Longevity of Humans

There has been a long discussion on whether the distribution of human longevity has a finite or infinite right endpoint. We shall discuss some recent results on Extreme Value Theory (EVT) applied to the longevity of humans. Some basic methods of EVT will be reviewed, with the discussion oriented towards applications to human life-span data. It turns out that the quality of the actual data is a crucial issue. The results are based on data sets from the International Database on Longevity.

Joint work with Fei Huang (RSFAS, College of Business and Economics, Australian National University).

### Monitoring Image Processes

Recent years have seen dramatic changes in the way in which quality features of manufactured products are designed and inspected. The modeling and monitoring problems arising from new inspection methods and fast multi-stream high-speed sensors are quite complex. These measurement tools are used in emerging technologies such as additive manufacturing. It has been shown that in these fields other types of quality characteristics have to be monitored. The behavior of the quality characteristics is mainly reflected not by the mean, the variance, the covariance matrix or a simple profile, but by shapes, surfaces, images and the like. This is a new area for statistical process control (SPC). Note that more complicated characteristics arise in other fields of application as well, e.g., the monitoring of optimal portfolio weights in finance. Since many new approaches have been developed in recent years in the fields of image analysis, spatial statistics and spatio-temporal modeling, a large number of tools is available to model the underlying processes. Thus the main problem lies in the development of monitoring schemes for such structures.

In this talk new procedures for monitoring image processes are introduced. They are based on multivariate exponential smoothing and cumulative sums taking into account the local correlation structure. A comparison is given with existing methods. Within an extensive simulation study the performance of the analyzed methods is discussed.

The presented results are based on joint work with Yarema Okhrin and Ivan Semeniuk.

### A LASSO-type model for the bulk and tail of a heavy-tailed response

As is widely known, in an extreme value framework interest focuses on modelling the most extreme observations, disregarding the central part of the distribution; commonly, the effort centers on modelling the tail of the distribution by the generalized Pareto distribution, in a peaks-over-threshold framework. Yet, in most practical situations it would be desirable to model the bulk of the data along with the extreme values. In this talk, I will introduce a novel regression model for the bulk and the tail of a heavy-tailed response. Our regression model builds on the extended generalized Pareto distribution, as recently proposed by Naveau et al. (2016). The proposed model allows us to learn the effect of covariates on a heavy-tailed response via a LASSO-type specification conducted via a Lagrangian restriction. The performance of the proposed approach will be assessed through a simulation study, and the method will be applied to a real data set.

### First Come, First Served Queues with Two Classes of Impatient Customers

We study systems with two classes of impatient customers who differ across the classes in their distribution of service times and patience times. The customers are served on a first-come, first-served (FCFS) basis, regardless of their class. Such systems are common in customer call centers, which often segment their arrivals into classes of callers whose requests may differ greatly in their complexity and criticality. We first consider an $M/G/1 + M$ queue and then analyze the $M/M/k + M$ case. Analyzing these systems using a queue length process proves intractable, as it would require us to keep track of the class of each customer at each position in the queue. Consequently, we introduce a virtual waiting time process in which the service times of customers who will eventually abandon the system are not considered. We analyze this process to obtain performance characteristics such as the percentage of customers who receive service in each class, the expected waiting times of customers in each class, and the average number of customers waiting in the queue. We use our characterization of the system to perform a numerical analysis of the $M/M/k + M$ system, and find several managerial implications of administering an FCFS system with multiple classes of impatient customers. Finally, we compare the performance of a system based on data from a call center with the steady-state performance measures of a comparable $M/M/k + M$ system. We find that the performance measures of the $M/M/k + M$ system serve as good approximations of the system based on real data.

Joint work with Ivo Adan (Eindhoven University of Technology, the Netherlands) and Brett Hathaway (Kenan-Flagler School of Business, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA).

### Optimal reinsurance of dependent risks

The talk will focus on the optimal reinsurance problem for two dependent risks, from the point of view of the ceding insurance company. We aim at maximizing the expected utility or the adjustment coefficient of the insurer's wealth. The insurer buys reinsurance on each risk separately. By risk we mean a line of business, a portfolio of policies or a policy. We assume a generic known dependence structure, so that the optimal solution depends on the joint distribution. Due to dependencies, the optimal level of reinsurance for each risk involves a trade-off between the reinsurance premia of both risks. We study the shape of this trade-off and characterize the optimal treaties. We show that an optimal solution exists and provide an optimality condition. Unfortunately, explicit optimal treaties are not easy to compute from this condition. We discuss some strategies to obtain numerical approximations for the optimal treaties and discuss some aspects of the structure of the optimal strategy. Numerical results are presented assuming that the two risks are dependent by means of a copula structure and that the reinsurance treaty consists of a combination of quota-share and stop-loss. The sensitivity of the optimal reinsurance strategy to several factors is analyzed numerically, including the dependence structure, through the copula chosen, and the dependence strength, through the dependence parameter, corresponding to different values of Kendall's tau. A variety of reinsurance premium calculation principles are also considered.

### Improving the ARL profile of the Poisson EWMA chart

The Poisson exponentially weighted moving average (PEWMA) chart was proposed by Borror et al. (1998) to monitor the mean of counts of nonconformities. This chart regrettably fails to have an in-control average run length (ARL) larger than any out-of-control ARL, i.e., the PEWMA chart is ARL-biased. Moreover, due to the discrete character of its control statistic, it is difficult to set the control limits of the PEWMA chart in such a way that the in-control ARL takes a desired value, say ARL0. In this paper, we propose an ARL-unbiased counterpart of the PEWMA chart and use the R statistical software to provide illustrations of this chart, which has a decidedly improved ARL profile and an in-control ARL equal to ARL0. We also compare the ARL performance of the proposed chart with that of a few competing control charts for the mean of i.i.d. Poisson counts.
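
As a rough illustration of the chart being improved, the sketch below simulates the standard EWMA recursion $Z_t = (1-\lambda) Z_{t-1} + \lambda X_t$ on Poisson counts and estimates ARLs by Monte Carlo. It does not reproduce the ARL-unbiased design proposed in the talk, it uses Python rather than R, and the smoothing constant, the limit width and the symmetric limits based on the asymptotic variance are all assumed, hypothetical choices.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw one Poisson count with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def ewma_run_length(rng, mean, target, lam, lcl, ucl, max_steps=100_000):
    """Number of samples until the EWMA statistic leaves [lcl, ucl]."""
    z = target  # the statistic starts at the in-control mean
    for step in range(1, max_steps + 1):
        z = (1 - lam) * z + lam * poisson_sample(rng, mean)
        if z < lcl or z > ucl:
            return step
    return max_steps  # censored run length

# Hypothetical design: target mean 4, smoothing 0.2, limit width 2.8
rng = random.Random(1)
target, lam, width = 4.0, 0.2, 2.8
half_width = width * math.sqrt(lam * target / (2 - lam))
lcl, ucl = target - half_width, target + half_width

def estimate_arl(mean, reps=200):
    runs = [ewma_run_length(rng, mean, target, lam, lcl, ucl) for _ in range(reps)]
    return sum(runs) / reps

arl_in_control = estimate_arl(4.0)
arl_upward = estimate_arl(8.0)
print(f"in-control ARL: {arl_in_control:.1f}, ARL after upward shift: {arl_upward:.1f}")
```

Repeating `estimate_arl` over a grid of means slightly below the target is one way to inspect whether a given design has an out-of-control ARL exceeding the in-control one.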

Joint work with Sven Knoth (Department of Mathematics and Statistics — Faculty of Economics and Social Sciences — Helmut Schmidt University, Hamburg, Germany)

### The coupling method in extreme value theory

One of the main goals of extreme value theory is to infer probabilities of extreme events for which only limited observations are available, which requires extrapolating the tail distribution of the observations. One major result is the Balkema-de Haan-Pickands theorem, which provides an approximation of the distribution of exceedances above a high threshold by a generalized Pareto distribution. We revisit these results with coupling arguments and provide quantitative estimates for the Wasserstein distance between the empirical distribution of exceedances and the limit Pareto model. In the second part of the talk, we extend the results to the analysis of a proportional tail model for quantile regression closely related to the heteroscedastic extremes framework developed by Einmahl et al. (JRSSB 2016). We introduce coupling arguments relying on total variation and Wasserstein distances for the analysis of the asymptotic behavior of estimators of the extreme value index and the integrated skedasis function.

Joint work with B. Bobbia and D. Varron (Université de Franche-Comté).
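
For reference, the generalized Pareto approximation of exceedances can be written as follows. The notation (threshold $u$, shape $\xi$, scale $\sigma(u)$) is assumed here for illustration and is not taken from the talk:

$$
P(X - u \le y \mid X > u) \approx 1 - \left(1 + \frac{\xi\, y}{\sigma(u)}\right)^{-1/\xi}, \qquad y \ge 0,\quad 1 + \frac{\xi\, y}{\sigma(u)} > 0,\quad \xi \neq 0,
$$

with the right-hand side read as $1 - e^{-y/\sigma(u)}$ when $\xi = 0$.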

### Geostatistical analysis of sardine eggs data — a Bayesian approach

Understanding the distribution of animals over space, as well as how that distribution is influenced by environmental covariates, is a fundamental requirement for the effective management of animal populations. This is especially the case for populations which are harvested. The sardine is one of the most important fisheries species, for its economic, sociological, anthropological and cultural value.

Here we intend to understand the spatial distribution of the average number of sardine eggs per $m^3$. Our main objectives are to identify the environmental variables that best explain the spatial variation in sardine egg density and to make predictions at spatial points that were not observed.

The data structure presents an excess of zeros and extreme values. To deal with this, we propose a point-referenced zero-inflated model for the probability of presence together with the positive sardine egg density, and a point-referenced generalized Pareto model for the extremes. Finally, we combine the results of these two models to get the spatial predictions of the variable of interest. We follow a Bayesian approach, and the inference is made using the package R-INLA in the software R.

### Distributed Learning Algorithms for Big Data

Modern datasets are increasingly collected by teams of agents that are spatially distributed: sensor networks, networks of cameras, and teams of robots. To extract information in a scalable manner from those distributed datasets, we need distributed learning. In the vision of distributed learning, no central node exists; the spatially distributed agents are linked by a sparse communication network and exchange short messages among themselves to solve the learning problem directly. To work in the real world, a distributed learning algorithm must cope with several challenges, e.g., correlated data, failures in the communication network, and minimal knowledge of the network topology. In this talk, we present some recent distributed learning algorithms that can cope with such challenges. Although our algorithms are simple extensions of known ones, these extensions require new mathematical proofs that elicit interesting applications of probability theory tools, namely, ergodic theory.

### Contributions for the detection of multivariate outliers

The detection of outliers in multivariate models is always a difficult matter, but the subject is even more complex when dealing with dependent structures, as is the case with the Simultaneous Equation Model (SEM). Unlike other models defined by systems of equations, such as the multivariate regression, the SEM assumes that the response variable in each equation can be stated as an explanatory variable in the rest of the system, meaning that explanatory variables can be correlated with the error terms. We present a method of outlier detection that bypasses those difficulties using the asymptotic distribution of adequate robust Mahalanobis distances. The process identifies anomalous data points as outliers of the SEM in simple steps and it provides a clear visualization. We illustrate this procedure with a real econometric data set.

### Robust logistic regression with sparse predictor variables

Nowadays, dealing with high-dimensional data is a recurrent problem that cuts across modern statistics. One main feature of high-dimensional data is that the dimension $p$, that is, the number of covariates, is high, while the sample size $n$ is relatively small. In this circumstance, the bet-on-sparsity principle suggests proceeding under the assumption that most of the effects are not significant. Sparse covariates are frequent in the classification problem, and in this situation the task of variable selection may also be of interest. We focus on the logistic regression model, and our aim is to address robust and sparse estimators of the regression parameter in order to perform estimation and variable selection at the same time. For this purpose, we introduce a family of penalized M-type estimators for the logistic regression parameter that are stable against atypical data. We explore different penalization functions and we introduce the so-called sign penalization. This new penalty has the advantage that it does not shrink the estimated coefficients to $0$ and that it depends only on one parameter. We will discuss the variable selection capability of the proposal as well as its asymptotic behaviour. Through a numerical study, we compare the finite sample performance of the proposal with different penalized estimators, either robust or classical, under different scenarios.

### A Comprehensive Methodology to Analyse Topic Difficulties in Educational Programmes

We propose a comprehensive Learning Analytics methodology to investigate the level of understanding students achieve in the learning process. The goals of this methodology are:

1. To identify topics in which students experience difficulties;
2. To assess whether these difficulties are recurrent along semesters;
3. To decide if there are conceptual associations between topics in which students experience difficulties; and, more generally,
4. To discover statistically significant groups of topics in which students show similar performance.

The proposed methodology uses statistics and data visualization techniques to address the first and the second goals, frequent itemset mining to tackle the third goal, and biclustering to find relationships within educational data, revealing meaningful and statistically significant patterns of students’ performance.

We illustrate the application of the methodology to a Computer Science course.

### Working towards a typology of indices of agreement for clustering evaluation

Indices of agreement (IA) are commonly used to evaluate stability of a clustering solution or its agreement with ground truth – internal and external validation of the same solution, respectively.

IA provide different measures of the accordance between two partitions of the same data set and are based on contingency table data. Despite their frequent use in clustering evaluation, there are still open issues regarding the specific thresholds each index needs in order to draw conclusions about the degree of agreement between the partitions.

To acquire new insights into the behavior of the indices that may help improve clustering evaluation, 14 paired indices of agreement are analyzed within diverse experimental scenarios, with balanced or unbalanced clusters and poorly, moderately or well separated ones. The observed values of the paired indices are all based on a cross-classification table of counts of pairs of observations that both partitions agree to join and/or separate in the clusters. The IADJUST method is used to learn about the behavior of the indices under the hypothesis of agreement between partitions occurring by chance (H0). It relies on the generation of contingency tables under H0 and is a simulation-based procedure that makes it possible to correct any index of agreement by deducting agreement by chance, overcoming previous limitations of analytical or approximate approaches (Amorim and Cardoso, 2015).

The results suggest a preliminary typology of paired indices of agreement based on their distributional characteristics under H0. Inter-scenarios symbolic data referring to location, dispersion and shape measures of IA distributions under H0 are used to build this typology.
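
To make the cross-classification of pairs concrete, the sketch below counts, for two partitions of the same observations, the pairs that both partitions join, that only one joins, and that both separate, and then computes the Rand index as one familiar paired index. The IADJUST correction for chance is not reproduced; the function names and the example labels are hypothetical.

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Cross-classify all pairs of observations by the two partitions.

    Returns (both_join, only_a_joins, only_b_joins, both_separate).
    """
    both_join = only_a_joins = only_b_joins = both_separate = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both_join += 1
        elif same_a:
            only_a_joins += 1
        elif same_b:
            only_b_joins += 1
        else:
            both_separate += 1
    return both_join, only_a_joins, only_b_joins, both_separate

def rand_index(labels_a, labels_b):
    """Share of pairs on which the two partitions agree (join or separate)."""
    both_join, only_a, only_b, both_separate = pair_counts(labels_a, labels_b)
    return (both_join + both_separate) / (both_join + only_a + only_b + both_separate)

# Two hypothetical partitions of six observations
partition_a = [0, 0, 1, 1, 2, 2]
partition_b = [0, 0, 1, 1, 1, 2]
print(pair_counts(partition_a, partition_b))  # (2, 1, 2, 10)
print(rand_index(partition_a, partition_b))   # 12 agreeing pairs out of 15
```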

### Reference

Amorim, M. J., & Cardoso, M. G. (2015). Comparing clustering solutions: The use of adjusted paired indices. Intelligent Data Analysis, 19(6), 1275-1296.

Joint work with Maria José Amorim (Department of Mathematics of ISEL, Lisbon, Portugal).

### Feed-in Tariff Contract Schemes and Regulatory Uncertainty

This paper presents a novel analysis of four finite feed-in tariff (FIT) schemes, namely fixed-price, fixed-premium, minimum price guarantee and sliding premium with a cap and a floor, under market and regulatory uncertainty. Using an analytical real options framework, we derive the project value, the optimal investment threshold and the value of the investment opportunity for the four FIT schemes. Regulatory uncertainty is modeled by allowing the tariff to be reduced before the signature of the contract. While market uncertainty defers investment, a higher and more likely tariff reduction accelerates investment. We also present several findings aimed at policymaking decisions, namely regarding the choice, level and duration of the FIT. For instance, the investment threshold of the sliding premium with a cap and a floor is lower than that of the minimum price guarantee, which suggests that the former regime is a better policy than the latter because it accelerates investment while avoiding overcompensation.

### Selecting differentially expressed genes in samples subgroups on microarray data

A common task in analysing microarray data is to determine which genes are differentially expressed under two (or more) kinds of tissue samples or samples submitted under different experimental conditions. It is well known that biological samples are heterogeneous due to factors such as molecular subtypes or genetic background, which are often unknown to the investigator. For instance, in experiments which involve molecular classification of tumours it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples.

Consequently, genes that are truly differentially expressed in sample subgroups may be lost if the usual statistical approaches are used. In this work we propose a graphical tool which identifies genes with up and down regulation, as well as genes whose differential expression reveals hidden subclasses that are usually missed by current statistical methods.

### Optimal investment decision under switching regimes of subsidy support

We address the problem of making a managerial decision when the investment project is subsidized, which results in the resolution of an infinite-horizon optimal stopping problem of a switching diffusion driven by either a homogeneous or an inhomogeneous continuous-time Markov chain. We provide a characterization of the value function (and optimal strategy) of the optimal stopping problem. On the one hand, broadly, we can prove that the value function is the unique viscosity solution to a system of HJB equations. On the other hand, when the Markov chain is homogeneous and the switching diffusion is one-dimensional, we obtain stronger results: the value function is the difference between two convex functions
