Probability and Statistics Seminar

Past sessions

Processes with jumps in Finance

We will address two investment problems that share, in particular, the following feature: the processes that model the uncertainty exhibit discontinuities in their sample paths. These discontinuities — or jumps — are driven by jump processes, here modelled by Poisson processes. Above all, the problems addressed fall in the category of optimal stopping problems: choose a time to take a given action (here, the time to decide to invest, since we consider investment problems) in order to maximize an expected payoff.

In the first problem, we assume that a firm is currently receiving a profit stream from an already operational project and has the option to invest in a new project, with an impact on its profitability. Moreover, we assume that there are two sources of uncertainty influencing the firm's decision about when to invest: the random fluctuations of the revenue (depending on the random demand) and the changing investment cost. As already mentioned, both processes exhibit discontinuities in their sample paths.

The second problem is set in the scope of technology adoption. Technological innovation is a prime example of a discontinuous process: the technological level does not increase at a steady pace; instead, every now and then some improvement or breakthrough happens. It is therefore natural to assume that technology innovations are driven by jump processes. In this problem we consider a firm that is producing in a declining market but has the option to undertake an innovation investment and thereby replace the old product with a new one, paying a constant sunk cost. As the first product is a well-established one, its price is deterministic. Upon investment in the second product, the price may fluctuate according to a geometric Brownian motion. The decision is when to invest in the new product.
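As a toy illustration (not taken from the talk), a price or revenue process of the kind described above can be simulated by combining a geometric Brownian motion with Poisson-driven multiplicative jumps; all parameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: drift, volatility, jump intensity, relative jump size
mu, sigma, lam, jump = 0.05, 0.2, 0.5, -0.1
T, n = 10.0, 1000
dt = T / n

x = np.empty(n + 1)
x[0] = 1.0
for k in range(n):
    dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment
    dN = rng.poisson(lam * dt)          # number of jumps in (t, t + dt]
    # Exact GBM step between jumps, then one multiplicative jump per arrival
    x[k + 1] = (x[k]
                * np.exp((mu - 0.5 * sigma ** 2) * dt + sigma * dW)
                * (1.0 + jump) ** dN)

print(np.all(x > 0))  # → True: the multiplicative scheme keeps the path positive
```

The multiplicative (log-exact) update is used instead of a plain Euler step so that the simulated path cannot become negative.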

Robust inference for ROC regression

The receiver operating characteristic (ROC) curve is the most popular tool for evaluating the diagnostic accuracy of continuous biomarkers. Often, covariate information that affects the biomarker performance is also available, and several regression methods have been proposed to incorporate covariates in the ROC framework. In this work, we propose robust inference methods for ROC regression, which can be used to safeguard against the presence of outlying biomarker values. Simulation results suggest that the methods perform well in recovering the true conditional ROC curve and the corresponding area under the curve under a variety of data-contamination scenarios. The methods are illustrated using data on the age-specific accuracy of glucose as a biomarker of diabetes.
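For readers unfamiliar with the machinery, the baseline (non-robust) quantity being estimated can be sketched via the empirical AUC, which coincides with the Mann–Whitney statistic; this is only a reference computation, not the robust regression estimator proposed in the talk:

```python
import numpy as np

def empirical_auc(nondiseased, diseased):
    """Empirical AUC: the proportion of (diseased, nondiseased) pairs in which
    the diseased biomarker value is larger, ties counted as 1/2
    (the Mann-Whitney statistic)."""
    h = np.asarray(nondiseased, dtype=float)
    d = np.asarray(diseased, dtype=float)
    greater = (d[:, None] > h[None, :]).sum()
    ties = (d[:, None] == h[None, :]).sum()
    return (greater + 0.5 * ties) / (len(h) * len(d))

print(empirical_auc([1, 2, 3], [4, 5, 6]))  # → 1.0 for perfect separation
```

A single outlying biomarker value shifts this pairwise count directly, which is exactly the sensitivity that robust methods aim to control.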

(Joint work with: Vanda I. de Carvalho & Miguel de Carvalho, University of Edinburgh, UK)

Challenges of Clustering

Grouping similar objects in order to produce a classification is one of the basic abilities of human beings. It is one of the primary milestones of a child's concrete operational stage and continues to be used throughout adult life, playing a very important role in how we analyse our world. Besides being a practical skill, clustering techniques are commonly used in several application areas such as the social sciences, medicine, biology, engineering and computer science. Despite this wide application, two questions remain open research issues: (i) how many clusters should be selected? and (ii) which variables are relevant for clustering? Both questions are crucial in order to obtain the best solution. We will answer them using a model-based approach built on finite mixture distributions and information criteria: the Bayesian Information Criterion (BIC), Akaike's Information Criterion (AIC), the Integrated Completed Likelihood (ICL) and the Minimum Message Length (MML).
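To illustrate the first question, here is a minimal one-dimensional sketch (not the speaker's implementation): fit a one-component Gaussian and a two-component Gaussian mixture by EM, then let BIC pick the number of clusters. All data and initializations are hypothetical:

```python
import numpy as np

def loglik_one(x):
    # Log-likelihood of a single-Gaussian fit at the MLE (mean, variance)
    var = x.var()
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

def loglik_two(x, iters=200):
    # Minimal EM for a two-component univariate Gaussian mixture
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = w / np.sqrt(2 * np.pi * var) * np.exp(
            -0.5 * (x[:, None] - mu) ** 2 / var)      # (n, 2) joint densities
        r = dens / dens.sum(axis=1, keepdims=True)    # responsibilities
        nk = r.sum(axis=0)
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    dens = w / np.sqrt(2 * np.pi * var) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
    return np.log(dens.sum(axis=1)).sum()

def bic(loglik, k_params, n):
    return k_params * np.log(n) - 2.0 * loglik   # smaller is better

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])
b1 = bic(loglik_one(x), 2, len(x))   # 1 component: mean + variance
b2 = bic(loglik_two(x), 5, len(x))   # 2 components: 2 means + 2 variances + 1 weight
print(b2 < b1)  # → True: BIC selects two clusters for well-separated data
```

AIC, ICL and MML follow the same pattern with different penalty terms, which is what the talk compares.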

Accurate implementations of nonlinear Kalman-like filtering methods with application to chemical engineering

A goal in many practical applications is to combine a priori knowledge about a physical system with experimental data to provide on-line estimation of the states and/or parameters of that system. The time evolution of the (hidden) state is modelled by a dynamic system perturbed by a certain process noise, which models the uncertainties in the system dynamics. The term optimal filtering traditionally refers to a class of methods for estimating the state of a time-varying system that is indirectly observed through noisy measurements. In this talk, we discuss the development of advanced Kalman-like filtering methods for estimating continuous-time nonlinear stochastic systems with discrete measurements. We start with a brief overview of existing nonlinear Bayesian methods [1]. Next, we focus on the numerical implementation of Kalman-like filters (the extended Kalman filter, the unscented Kalman filter and the cubature Kalman filter) for estimating the state of continuous-discrete models [2]. The standard approach uses the Euler-Maruyama method to discretize the underlying (process) stochastic differential equation (SDE). To reduce the discretization error, some subdivisions may additionally be introduced in each sampling interval. Some modern continuous-time filtering methods are developed using higher-order methods; see, e.g., the cubature Kalman filter based on the Ito-Taylor expansion for discretizing the underlying SDE in [3]. However, all resulting implementations are fixed-step-size methods and do not allow for proper processing of long and irregular sampling intervals (e.g. when measurements are missing). An alternative methodology is to derive the moment differential equations first; the resulting ordinary differential equations (ODEs) are then solved by modern ODE solvers. This approach allows for variable-step-size solvers and copes accurately with long/irregular sampling intervals. Besides, we use ODE solvers with global error control, which improves the estimation quality further [4]. As a numerical example we consider the batch reactor model studied in the chemical engineering literature [5].
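As a toy illustration of why subdividing the sampling interval reduces discretization error, consider the mean-propagation ODE dm/dt = -m of a linear scalar SDE (a hypothetical example, not the batch reactor model of [5]):

```python
import numpy as np

def predict_mean(m0, T, subdivisions):
    # Euler discretization of the moment ODE dm/dt = -m over one sampling
    # interval of length T, split into `subdivisions` sub-steps
    m, dt = m0, T / subdivisions
    for _ in range(subdivisions):
        m += dt * (-m)
    return m

exact = np.exp(-1.0)                                # exact mean propagation, T = 1
err_1 = abs(predict_mean(1.0, 1.0, 1) - exact)      # a single Euler step
err_100 = abs(predict_mean(1.0, 1.0, 100) - exact)  # 100 sub-steps
print(err_100 < err_1)  # → True: subdividing the interval shrinks the error
```

A variable-step ODE solver with error control automates the choice of these sub-steps, which is what makes long or irregular sampling intervals tractable.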

References

[1] Arasaratnam I., Haykin S. (2009) Cubature Kalman filters. IEEE Transactions on Automatic Control, 54(6): 1254–1269.
[2] Kulikov G. Yu., Kulikova M. V. (2014) Accurate numerical implementation of the continuous-discrete extended Kalman filter. IEEE Transactions on Automatic Control, 59(1): 273–279.
[3] Arasaratnam I., Haykin S., Hurd T. R. (2010) Cubature Kalman filtering for continuous-discrete systems: theory and simulations. IEEE Transactions on Signal Processing, 58(10): 4977–4993.
[4] Kulikov G. Yu., Kulikova M. V. (2016) The accurate continuous-discrete extended Kalman filter for radar tracking. IEEE Transactions on Signal Processing, 64(4): 948–958.
[5] Kulikov G. Yu., Kulikova M. V. (2015) State estimation in chemical systems with infrequent measurements. Proceedings of the European Control Conference, Linz, Austria, pp. 2688–2693.

ARL-unbiased geometric control charts for high-yield processes

The geometric distribution is the basic model for the quality characteristic that represents the cumulative count of conforming (CCC) items between two nonconforming ones in a high-yield process.

In order to control increases and decreases in the fraction nonconforming in a timely fashion, the geometric chart should be set in such a way that the average run length (ARL) curve attains its maximum in the in-control situation, i.e., the chart should be ARL-unbiased.

By exploiting the notion of uniformly most powerful unbiased tests with randomization probabilities, we are able not only to eliminate the bias of the ARL function of existing geometric charts, but also to bring their in-control ARL exactly to a pre-specified value.

Instructive examples are provided to illustrate that the ARL-unbiased geometric charts have the potential to play a major role in the prompt detection of the deterioration and improvement of real high-yield processes.
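The bias being removed can be seen numerically. For a geometric chart with fixed limits and no randomization, the run length is geometric in the per-point signal probability, so ARL(p) = 1/P(signal); with the hypothetical limits below, the ARL curve peaks away from a nominal in-control value such as p0 = 0.005:

```python
import numpy as np

def arl(p, lcl, ucl):
    """ARL of a geometric (CCC) chart: Y, the number of conforming items
    before a nonconforming one, is geometric on {0, 1, 2, ...}; the chart
    signals when Y <= lcl or Y >= ucl, and the run length is geometric in
    the signal probability, so ARL = 1 / P(signal)."""
    p_signal = (1.0 - (1.0 - p) ** (lcl + 1)) + (1.0 - p) ** ucl
    return 1.0 / p_signal

# Hypothetical fixed limits, chosen for illustration only
p_grid = np.linspace(0.0005, 0.01, 200)
curve = np.array([arl(p, lcl=5, ucl=2000) for p in p_grid])
p_max = p_grid[curve.argmax()]
print(p_max < 0.005)  # → True: the ARL peaks off-target, i.e. the chart is ARL-biased
```

Choosing randomized limits so that the maximum of this curve sits exactly at p0, with a prescribed in-control ARL, is precisely what the talk's construction achieves.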

The challenge of inserting wind power generation into the Brazilian hydrothermal optimal dispatch

Brazil has a total of 4,648 power generation projects in operation, totalling 161 GW of installed capacity, of which 74% comes from hydroelectric power plants and about 7% from intermittent generation sources (wind power in particular). An addition of 25 GW to the country's generation capacity is scheduled for the next few years, and 43% of this increment will come from intermittent sources. Nowadays, planning the Brazilian energy sector basically means making decisions about the dispatch of hydroelectric and thermoelectric plants, where the operation strategy minimizes the expected operation cost over the planning period, composed of fuel costs plus penalties for failing to supply the projected expected load. Given the growing trend of wind power generation within the Brazilian energy matrix, concentrated in the Northeast region of the country, it is necessary to include this type of generation in the optimal dispatch approach currently used, so that it is effectively considered in long-term planning. This talk presents the preliminary developments toward modelling this kind of energy generation stochastically, in order to obtain the optimal hydrothermal wind dispatch.

Keywords: hydro, thermal and wind power generation; optimal dispatch; demand forecast; inflow and wind speed uncertainties

Martingales and Survival Analysis

Presentation within the Research Seminar in Probability and Statistics I, in the scope of the Master's in Mathematics and Applications (MMA) and the PhD programme in Statistics and Stochastic Processes, in collaboration with CEMAT.

Bootstrap confidence intervals

Presentation within the Research Seminar in Probability and Statistics I, in the scope of the Master's in Mathematics and Applications (MMA) and the PhD programme in Statistics and Stochastic Processes, in collaboration with CEMAT.

Bayesian nonparametric inference for the covariate-adjusted ROC curve

Accurate diagnosis of disease is of fundamental importance in clinical practice and medical research. Before a medical diagnostic test is routinely used in practice, its ability to distinguish between diseased and nondiseased states must be rigorously assessed through statistical analysis. The receiver operating characteristic (ROC) curve is the most popular tool for evaluating the discriminatory ability of continuous-outcome diagnostic tests. Recently, it has been acknowledged that several factors (e.g., subject-specific characteristics, such as age and/or gender) can affect the test's accuracy beyond disease status. In this work, we develop Bayesian nonparametric inference, based on a combination of dependent Dirichlet process mixture models and the Bayesian bootstrap, for the covariate-adjusted ROC curve (Janes and Pepe, 2009, Biometrika), a measure of covariate-adjusted diagnostic accuracy. Applications to simulated and real data are provided.
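One ingredient named above, the Bayesian bootstrap, admits a compact sketch: posterior draws of a functional are obtained by reweighting the observed sample with flat Dirichlet weights. The toy below applies it to the mean, which is not the ROC functional used in the work, purely for illustration:

```python
import numpy as np

def bayesian_bootstrap_mean(x, draws=2000, seed=5):
    """Bayesian bootstrap (Rubin, 1981): posterior draws of the mean are
    weighted averages of the data with flat Dirichlet(1, ..., 1) weights."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = rng.dirichlet(np.ones(len(x)), size=draws)   # (draws, n) weight vectors
    return w @ x                                     # one posterior draw per row

post = bayesian_bootstrap_mean([1.0, 2.0, 3.0, 4.0])
print(abs(post.mean() - 2.5) < 0.1)  # posterior centred near the sample mean
```

In the covariate-adjusted ROC setting, the same reweighting idea supplies posterior uncertainty for the nondiseased distribution while the mixture model handles the covariate dependence.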

Adaptive SVD-based Kalman filtering for state and parameter estimation of linear Gaussian dynamic stochastic models

In this talk, recently published results on robust adaptive Kalman filtering are presented. Such methods allow for simultaneous state and parameter estimation of dynamic stochastic systems. Any adaptive filtering scheme typically consists of two parts: (i) a recursive optimization method for identifying the uncertain system parameters by minimizing an appropriate performance index (e.g. the negative likelihood function, if the method of maximum likelihood is used for parameter estimation), and (ii) the application of the underlying filter for estimating the unknown dynamic state of the examined model as well as for computing the chosen performance index. Here we study gradient-based adaptive techniques, which require evaluation of the performance index gradient. The goal is to propose a robust computational procedure that is inherently more stable (with respect to roundoff errors) than the classical approach based on straightforward differentiation of the Kalman filtering equations.

Our solution is based on the SVD factorization. First, we designed a new SVD-based Kalman filter implementation method in [1]. Next, we extended the obtained result to the gradient evaluation (with respect to unknown system parameters) and hence designed the SVD-based adaptive scheme in [2]. The newly developed SVD-based methodology is algebraically equivalent to the conventional approach and to the previously derived stable Cholesky-based methods, but outperforms them in estimation accuracy in ill-conditioned situations.

(joint work with Julia Tsyganova)

References:

[1] Kulikova M.V., Tsyganova J.V. (2017) Improved discrete-time Kalman filtering within singular value decomposition. IET Control Theory & Applications, 11(15): 2412–2418.

[2] Tsyganova J.V., Kulikova M.V. (2017) SVD-based Kalman filter derivative computation. IEEE Transactions on Automatic Control, 62(9): 4869–4875.

GIS and geostatistics: Applications to climatology and meteorology

The main focus of the presentation is the use of GIS and geostatistics in climatology/meteorology at the Portuguese Sea and Atmosphere Institute (IPMA). Several application examples have been selected to demonstrate the contribution of geospatial analysis and modelling to these sciences. Special emphasis is placed on the procedures and capabilities for integrating data from various sources, on quality control, and on the presentation and cartography of climate data.

The main applications are related to spatial interpolation (by geostatistical or other methods) of temperature, precipitation, humidity and climate indices. Subjects such as validation and uncertainty analysis will also be discussed.

Spatial analysis and modelling of fire risk, climate monitoring, climate change scenarios, drought, geospatial data management, metadata and data interoperability are other interesting areas to be addressed. Further developments are foreseen on multi-hazard modelling and on climate-driven health applications.

Tests for the Weights of the Global Minimum Variance Portfolio in a High-Dimensional Setting

In this talk, tests for the weights of the global minimum variance portfolio (GMVP) in a high-dimensional setting are presented, namely when the number of assets $p$ depends on the sample size $n$ such that $p/n \to c \in (0,1)$ as $n$ tends to infinity. The introduced tests are based on the sample estimator and on a shrinkage estimator of the GMVP weights (cf. Bodnar et al., 2017). The asymptotic distributions of both test statistics are derived under the null and alternative hypotheses. Moreover, we provide a simulation study in which the performance of the proposed tests is compared with each other and with the approach of Glombeck (2014). The test based on the shrinkage estimator performs well even for values of $c$ close to $1$.
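For reference, the sample-based GMVP weights solve $\min_w w' S w$ subject to $w'\mathbf{1} = 1$, giving $w = S^{-1}\mathbf{1} / (\mathbf{1}' S^{-1}\mathbf{1})$. A minimal sketch with simulated returns (the high-dimensional degradation of this estimator as $p/n \to c$ is what motivates the shrinkage alternative):

```python
import numpy as np

def gmvp_weights(returns):
    """Sample-based global minimum variance portfolio weights:
    w = S^{-1} 1 / (1' S^{-1} 1), with S the sample covariance matrix."""
    S = np.cov(returns, rowvar=False)
    ones = np.ones(S.shape[0])
    w = np.linalg.solve(S, ones)   # solve S w = 1 rather than inverting S
    return w / w.sum()

rng = np.random.default_rng(2)
r = rng.normal(0.0, 0.01, size=(250, 5))   # 250 days, 5 hypothetical assets
w = gmvp_weights(r)
print(abs(w.sum() - 1.0) < 1e-12)  # the weights sum to one by construction
```

When $p$ is a nonnegligible fraction of $n$, the sample covariance matrix is ill-conditioned and these plug-in weights become very noisy, which is the regime the proposed tests address.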

(joint work with Taras Bodnar, Solomiia Dmytriv and Nestor Parolya)

References:

• Bodnar, T. and Schmid, W. (2008). A test for the weights of the global minimum variance portfolio in an elliptical model, Metrika, 67, 127-143.
• Bodnar, T., Parolya, N. and Schmid, W. (2017). Estimation of the minimum variance portfolio in high dimensions, European Journal of Operational Research, in press.
• Glombeck, K. (2014). Statistical inference for high-dimensional global minimum variance portfolios, Scandinavian Journal of Statistics, 41, 845-865.
• Okhrin, Y. and Schmid, W. (2006). Distributional properties of portfolio weights, Journal of Econometrics, 134, 235-256.
• Okhrin, Y. and Schmid, W. (2008). Estimation of optimal portfolio weights, International Journal of Theoretical and Applied Finance, 11, 249-276.

Near-exact distributions – comfortably lying closer to exact distributions than common asymptotic distributions

We are all quite familiar with the concept of an asymptotic distribution. For some sets of statistics, as is for example the case with the likelihood ratio test statistics, mainly those used in Multivariate Analysis, some authors developed what are nowadays seen as "standard" methods of building such asymptotic distributions, as in the seminal paper by Box (1949).

However, such asymptotic distributions quite commonly yield approximations that fall short of the precision we need and/or exhibit problems when some parameters in the exact distributions grow large, as is indeed the case with many asymptotic distributions commonly used in Multivariate Analysis when the number of variables involved grows even just moderately large.

The pertinent question is thus the following: are we willing to pay a bit more in terms of a more elaborate structure for the approximating distribution, while keeping it manageable enough to allow quite easy computation of p-values and quantiles?

If our answer to the above question is affirmative, then we are ready to enter the amazing world of the near-exact distributions.

Near-exact distributions are asymptotic distributions developed under a new concept of approximating distributions. Based on a decomposition (i.e., a factorization or a split into two or more terms) of the characteristic function of the statistic being studied, or of the characteristic function of its logarithm, they are asymptotic distributions which lie much closer to the exact distribution than common asymptotic distributions.

If we are able to keep a good part of the original structure of the exact distribution of the random variable or statistic being studied untouched, we may in this way obtain a much better approximation. Such an approximation not only avoids the problems referred to above, which occur with some asymptotic distributions, but also exhibits extremely good performance even for very small sample sizes and large numbers of variables involved. It is asymptotic not only for increasing sample sizes but also, opposite to what happens with common asymptotic distributions, for increasing numbers of variables involved.
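One schematic way to summarize the construction just described: writing $\Phi_W$ for the characteristic function of the statistic $W$ (or of its logarithm), the decomposition-and-replacement step reads

```latex
\Phi_W(t) \,=\, \Phi_1(t)\,\Phi_2(t)
\quad\longrightarrow\quad
\Phi^{*}_W(t) \,=\, \Phi_1(t)\,\Phi^{*}_2(t),
```

where $\Phi_1$, carrying a good part of the exact structure, is kept untouched, and $\Phi_2$ is replaced by an asymptotic approximation $\Phi^{*}_2$; inverting $\Phi^{*}_W$ yields the near-exact distribution. The particular factorization, and the family used for $\Phi^{*}_2$, depend on the statistic at hand.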

GARCH processes and the phenomenon of misleading and unambiguous signals

In Finance it is quite usual to assume that a process behaves according to a previously specified target GARCH process. The impact of rumours or other events on this process can be frequently described by an outlier responsible for a short-lived shift in the process mean or by a sustained change in the process variance. This calls for the use of joint schemes for the process mean and variance, such as the ones proposed by Schipper (2001) and Schipper and Schmid (2001).
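A minimal sketch of such a target process, a GARCH(1,1) with hypothetical parameters, can be simulated as follows (this is only the in-control process, without the mean shifts or variance changes that the joint schemes monitor):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical GARCH(1,1) target process:
#   x_t = sigma_t * eps_t,  sigma_t^2 = omega + alpha * x_{t-1}^2 + beta * sigma_{t-1}^2
omega, alpha, beta = 0.1, 0.05, 0.9
n = 1000
x, s2 = np.empty(n), np.empty(n)
s2[0] = omega / (1.0 - alpha - beta)     # start at the stationary variance
x[0] = np.sqrt(s2[0]) * rng.normal()
for t in range(1, n):
    s2[t] = omega + alpha * x[t - 1] ** 2 + beta * s2[t - 1]
    x[t] = np.sqrt(s2[t]) * rng.normal()

print(np.all(s2 >= omega))  # → True: conditional variances stay bounded below by omega
```

An outlier added to a few observations of x, or a sustained inflation of s2, reproduces the two out-of-control situations whose signals the talk classifies as misleading or unambiguous.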

Since changes in the mean and in the variance require different actions from traders/brokers, this paper provides an account of the probabilities of misleading and unambiguous signals (PMS and PUNS) of those joint schemes, thus adding valuable insight into their out-of-control performance.

We are convinced that this talk is of interest to business persons/traders/brokers, quality control practitioners, and statisticians alike.

Joint work with:

• Beatriz Sousa (MMA; CGD);
• Yarema Okhrin (Faculty of Business and Economics, University of Augsburg, Germany);
• Wolfgang Schmid (Department of Statistics, European University Viadrina, Germany)

References

• Schipper, S. (2001). Sequential Methods for Detecting Changes in the Volatility of Economic Time Series. Ph.D. thesis, European University, Department of Statistics, Frankfurt (Oder), Germany.
• Schipper, S. and Schmid, W. (2001). Control charts for GARCH processes. Nonlinear Analysis: Theory, Methods & Applications 47, 2049-2060.

Education: a challenge to statistical modelling

In Education, the main random variable is, in general, the knowledge level of the student on a certain subject. This is generally considered to be the result of a process that involves one or more educational agents (e.g. teachers) and the reflection and training of the student. There are many covariates, including innate capacities, resilience, perseverance, family and health conditions, and quality of education.

While this is a subject about which all of us have some a priori knowledge (essential in statistical modelling), it poses an initial difficulty that may lead to significant bias and wrong interpretations: the measurement of the main random variable. In educational psychometrics, the measurement tools are tests or examinations, and the main variable is a latent variable.

In this talk we shall review some recent advances in statistics applied to education, namely the international evaluation of students and the assessment of the impact of educational policies. We also review the main databases from DGEEC (Direção Geral de Estatísticas da Educação e Ciência) and their use for research purposes.

Some hot topics of maximum entropy research in economics and statistics

Maximum entropy is often used for solving ill-posed problems that occur in diverse areas of science (e.g., physics, informatics, biology, medicine, communication engineering, statistics and economics). The works of Kullback, Leibler, Lindley and Jaynes in the fifties of the last century were fundamental in connecting the areas of maximum entropy and information theory with statistical inference. Jaynes states that the maximum entropy principle is a simple and straightforward idea. Indeed, it provides a simple tool to make the best prediction (i.e., the one that is most strongly indicated) from the available information, and it can be seen as an extension of Bernoulli's principle of insufficient reason. The maximum entropy principle provides an unambiguous solution to ill-posed problems by choosing the distribution of probabilities that maximizes the Shannon entropy measure. Some recent research in regularization (e.g., ridGME and MERGE estimators), variable selection (e.g., normalized entropy with information from the ridge trace), inhomogeneous large-scale data (e.g., normalized entropy as an alternative to maximin aggregation) and stochastic frontier analysis (e.g., generalized maximum entropy and generalized cross-entropy with data envelopment analysis as an alternative to maximum likelihood estimation) will be presented, along with several real-world applications in engineering, medicine and economics.
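The principle itself fits in a few lines. In Jaynes' classic dice example, the maximum-entropy distribution on {1, ..., 6} with a prescribed mean has the exponential form p_i ∝ exp(λ i); the sketch below (a toy, not the estimators named above) finds λ by bisection on the mean:

```python
import numpy as np

def maxent_dice(target_mean):
    """Jaynes' dice: the maximum-entropy distribution on {1,...,6} with a
    given mean is p_i proportional to exp(lam * i); the mean is monotone
    increasing in lam, so lam can be found by bisection."""
    support = np.arange(1, 7)

    def mean_for(lam):
        p = np.exp(lam * support)
        p /= p.sum()
        return (p * support).sum()

    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    p = np.exp(lam * support)
    return p / p.sum()

p = maxent_dice(3.5)
print(np.allclose(p, 1.0 / 6.0))  # → True: mean 3.5 gives the uniform distribution
```

With no binding constraint (mean 3.5), the solution reduces to the uniform distribution, exactly as the principle of insufficient reason dictates; a mean of, say, 4.5 tilts the probabilities toward the larger faces.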

Statistical models for the relationship between daily temperature and mortality

The association between daily ambient temperature and health outcomes has frequently been investigated based on a time series design. The temperature–mortality relationship is often found to be substantially nonlinear and to persist, but change shape, with increasing lag. Accordingly, the statistical framework has developed substantially in recent years. In this talk I describe the general features of time series regression, outlining the analysis process for modelling short-term fluctuations in the presence of seasonal and long-term patterns. I also offer an overview of the recently extended family of distributed lag non-linear models (DLNMs), a modelling framework that can simultaneously represent non-linear exposure–response dependencies and delayed effects. To illustrate the methodology, I use an example representing the relationship between temperature and mortality, with data from the MCC Collaborative Research Network, an international research programme on the association between weather and health.
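The linear core of a distributed lag model can be sketched by regressing the outcome on a matrix of lagged exposures; the toy below uses simulated data and omits the non-linear basis and seasonal terms of a full DLNM:

```python
import numpy as np

def lag_matrix(x, max_lag):
    """Design matrix whose columns are x at lags 0..max_lag (rows with
    incomplete lag history dropped): the linear core of a distributed lag model."""
    n = len(x)
    cols = [x[max_lag - l : n - l] for l in range(max_lag + 1)]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
temp = rng.normal(20, 5, 400)                      # hypothetical daily temperature
# Hypothetical outcome responding to temperature at lags 0 and 2
y = 0.5 * temp[2:] + 0.25 * temp[:-2] + rng.normal(0, 0.1, 398)

X = lag_matrix(temp, max_lag=2)                    # columns: lag 0, lag 1, lag 2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(2))  # close to [0.5, 0.0, 0.25] by construction
```

A DLNM replaces each lagged column with a cross-basis of smooth functions in both exposure and lag, so the fitted surface can bend in both dimensions.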

On a class of optimal stopping problems with applications to real option theory

We consider an optimal stopping time problem related to many models found in real options problems. The main goal of this work is to bring to the field of real options different and more realistic pay-off functions, as well as negative interest rates. We present analytical solutions for a wide class of pay-off functions under quite general assumptions on the model. An extensive and general sensitivity analysis of the solutions is also provided, along with an economic example which highlights the mathematical difficulties of the standard approaches.

(joint work with Manuel Guerra and Carlos Oliveira)

Applied environmental time series analysis

Since the very beginning, developments in data analysis and time series methods have been closely associated with phenomena in the natural environment: the power spectra of Schuster, motivated by earthquakes and sunspots; Walker's analysis of the Southern Oscillation in attempting to predict the Indian monsoon; Tukey's cross-spectrum for the analysis of seismic waves; Gumbel's extreme value analysis, inspired by meteorological and hydrological phenomena; or Hurst's long-memory concept, drawn from observations of the Nile's water levels. In the modern era of disposable, low-power devices and cheap storage, the natural environment is monitored at an incredible pace, yielding copious time series of high-resolution (sub-hourly) observations which are widely available. Despite the many computationally intensive approaches developed and currently available to handle streams of data, their success in producing new and environmentally relevant information is surprisingly low. An obvious challenge is the integration of problem-related knowledge and context into the data analysis process, and going from data summaries/visualisations/alarms to an exploratory analysis aiming to discover new and physically relevant information in the environmental data. This talk addresses the practical challenges and opportunities in the analysis of high-resolution environmental time series, as illustrated by time series of environmental radioactivity and by measurements from the ongoing gamma radiation monitoring campaign in the Azores.

Air quality science: putting statistics to work

Several statistical tools have been used to analyse air quality data for different purposes. This talk will highlight some of these examples and how different statistical tools can bring added value to this environmental science area. First, changes in pollutant concentrations were examined and clustered by means of quantile regression, which allows one to analyse trends not only in the mean but across the overall data distribution. The clustering procedure indicated where the largest trends are found, in terms of space (location) and quantiles. Secondly, the individual variance/covariance profiles of a set of air quality hourly time series are embedded in a wavelet-decomposition-based clustering algorithm in order to identify groups of stations exhibiting similar profiles. The results clearly indicate a geographical pattern among the different types of stations and allowed us to identify sites whose classification according to environment/influence type needs revision. Both exercises were particularly important for air quality management practices, in particular the design of the national monitoring network.
