Probability and Statistics Seminar

Past sessions


Bootstrap confidence intervals

Presentation for the Research Seminar in Probability and Statistics I, within the Master's in Mathematics and Applications (MMA) and the PhD in Statistics and Stochastic Processes, in collaboration with CEMAT.

Bayesian nonparametric inference for the covariate-adjusted ROC curve

Accurate diagnosis of disease is of fundamental importance in clinical practice and medical research. Before a medical diagnostic test is routinely used in practice, its ability to distinguish between diseased and nondiseased states must be rigorously assessed through statistical analysis. The receiver operating characteristic (ROC) curve is the most widely used tool for evaluating the discriminatory ability of continuous-outcome diagnostic tests. Recently, it has been acknowledged that several factors (e.g., subject-specific characteristics, such as age and/or gender) can affect the test's accuracy beyond disease status. In this work, we develop Bayesian nonparametric inference, based on a combination of dependent Dirichlet process mixture models and the Bayesian bootstrap, for the covariate-adjusted ROC curve (Janes and Pepe, 2009, Biometrika), a measure of covariate-adjusted diagnostic accuracy. Applications to simulated and real data are provided.
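As a concrete reference point, the classical pooled (unadjusted) empirical ROC curve and its AUC can be computed in a few lines. This is only a sketch on simulated data — the Gaussian outcome distributions and sample sizes are invented for illustration, and this is not the covariate-adjusted Bayesian nonparametric estimator presented in the talk:

```python
import numpy as np

def empirical_roc(nondiseased, diseased, grid_size=200):
    """Pooled empirical ROC: for each threshold c, the pair
    (FPF(c), TPF(c)) = (P(X_nd > c), P(X_d > c))."""
    lo = min(nondiseased.min(), diseased.min()) - 1.0
    hi = max(nondiseased.max(), diseased.max()) + 1.0
    thresholds = np.linspace(lo, hi, grid_size)
    fpf = np.array([(nondiseased > c).mean() for c in thresholds])
    tpf = np.array([(diseased > c).mean() for c in thresholds])
    return fpf, tpf

def auc(nondiseased, diseased):
    """AUC = P(X_d > X_nd), the Mann-Whitney estimator over all pairs."""
    d, nd = diseased[:, None], nondiseased[None, :]
    return (d > nd).mean() + 0.5 * (d == nd).mean()

rng = np.random.default_rng(0)
x_nd = rng.normal(0.0, 1.0, 500)     # nondiseased test outcomes (toy)
x_d = rng.normal(1.5, 1.0, 500)      # diseased test outcomes (toy)
fpf, tpf = empirical_roc(x_nd, x_d)
print(round(auc(x_nd, x_d), 3))
```

The covariate-adjusted version replaces these pooled distribution estimates with conditional ones given the covariate, which is where the dependent Dirichlet process mixtures enter.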

Adaptive SVD-based Kalman filtering for state and parameter estimation of linear Gaussian dynamic stochastic models

In this talk, recently published results on robust adaptive Kalman filtering are presented. Such methods allow for simultaneous state and parameter estimation of dynamic stochastic systems. Any adaptive filtering scheme typically consists of two parts: (i) a recursive optimization method for identifying the uncertain system parameters by minimizing an appropriate performance index (e.g. the negative log-likelihood function, if the method of maximum likelihood is used for parameter estimation), and (ii) the application of the underlying filter for estimating the unknown dynamic state of the examined model as well as for computing the chosen performance index. We study gradient-based adaptive techniques, which require evaluating the gradient of the performance index. The goal is to propose a robust computational procedure that is inherently more stable (with respect to roundoff errors) than the classical approach based on the straightforward differentiation of the Kalman filtering equations.

Our solution is based on the SVD factorization. First, we designed a new SVD-based Kalman filter implementation method in [1]. Next, we extended the obtained result to the gradient evaluation (with respect to the unknown system parameters) and hence designed the SVD-based adaptive scheme in [2]. The newly developed SVD-based methodology is algebraically equivalent to the conventional approach and to the previously derived stable Cholesky-based methods, but outperforms them in estimation accuracy in ill-conditioned situations.

(joint work with Julia Tsyganova)

References:

[1] Kulikova M.V., Tsyganova J.V. (2017) Improved discrete-time Kalman filtering within singular value decomposition. IET Control Theory & Applications, 11(15): 2412-2418

[2] Tsyganova J.V., Kulikova M.V.(2017) SVD-based Kalman filter derivative computation. IEEE Transactions on Automatic Control, 62(9): 4869-4875
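The factored covariance propagation at the heart of SVD-based filtering can be sketched as follows. This is only an illustration of the idea, not the algorithm of [1]–[2]: the covariance is carried as SVD factors $P = U \mathrm{diag}(d) U^T$ and the time update re-factors a stacked square-root array, while the measurement update here falls back to a conventional Joseph-stabilised step followed by SVD refactorization; the model matrices are invented for the example:

```python
import numpy as np

def svd_time_update(U, d, F, Q_sqrt):
    """Propagate covariance factors (P = U diag(d) U^T) through
    x' = F x + w with Cov(w) = Q_sqrt @ Q_sqrt.T, without forming P:
    for A = [[diag(sqrt(d)) (F U)^T], [Q_sqrt^T]] we have
    A^T A = F P F^T + Q, so an SVD of A yields the predicted factors."""
    A = np.vstack([np.sqrt(d)[:, None] * (F @ U).T, Q_sqrt.T])
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T, s ** 2

def measurement_update(U, d, H, R, innovation):
    """Conventional Joseph-stabilised update, then an SVD refactorization
    so the next time update can stay in factored form."""
    P = (U * d) @ U.T
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    I_KH = np.eye(len(d)) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T    # Joseph form
    Un, dn, _ = np.linalg.svd(P_new)
    return Un, dn, K @ innovation              # new factors, state correction

# one predict/update cycle on an invented 2-state constant-velocity model
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q_sqrt = 0.1 * np.eye(2)
R = np.array([[0.5]])
U, d = np.eye(2), np.array([1.0, 1.0])         # P0 = I
U, d = svd_time_update(U, d, F, Q_sqrt)
P_pred = (U * d) @ U.T                         # equals F P0 F^T + Q
U, d, dx = measurement_update(U, d, H, R, np.array([0.3]))
print(np.allclose(P_pred, F @ F.T + 0.01 * np.eye(2)))
```

Keeping the singular-value factors nonnegative by construction is what makes such implementations more robust to roundoff than differentiating the conventional Riccati recursion directly.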

GIS and geostatistics: Applications to climatology and meteorology

The use of GIS and geostatistics in climatology/meteorology at the Portuguese Sea and Atmosphere Institute (IPMA) is the main focus of the presentation. Several examples of applications have been selected to demonstrate the contribution of geospatial analysis and modelling to these sciences. Special emphasis is placed on the procedures and capabilities for integrating data from various sources, on quality control, and on the presentation and cartography of climate data.

The main applications are related to spatial interpolation (by geostatistical or other methods) of temperature, precipitation, humidity, and climate indices. Topics such as validation and uncertainty analysis will also be discussed.

Spatial analysis and modelling of fire risk, climate monitoring, climate change scenarios, drought, geospatial data management, metadata and data interoperability are other interesting areas to be addressed. Further developments are foreseen on multi-hazard modelling and on climate-driven health applications.

Tests for the Weights of the Global Minimum Variance Portfolio in a High-Dimensional Setting

In this talk, tests for the weights of the global minimum variance portfolio (GMVP) in a high-dimensional setting are presented, namely when the number of assets $p$ depends on the sample size $n$ such that $p/n \to c \in (0,1)$ as $n$ tends to infinity. The introduced tests are based on the sample estimator and on a shrinkage estimator of the GMVP weights (cf. Bodnar et al. 2017). The asymptotic distributions of both test statistics under the null and alternative hypotheses are derived. Moreover, we provide a simulation study in which the performance of the proposed tests is compared with each other and with the approach of Glombeck (2014). A good performance of the test based on the shrinkage estimator is observed even for values of $c$ close to $1$.

(joint work with Taras Bodnar, Solomiia Dmytriv and Nestor Parolya)

References:

• Bodnar, T. and Schmid, W. (2008). A test for the weights of the global minimum variance portfolio in an elliptical model, Metrika, 67, 127-143.
• Bodnar, T., Parolya, N. and Schmid, W. (2017). Estimation of the minimum variance portfolio in high dimensions, European Journal of Operational Research, in press.
• Glombeck, K. (2014). Statistical inference for high-dimensional global minimum variance portfolios, Scandinavian Journal of Statistics, 41, 845-865.
• Okhrin, Y. and Schmid, W. (2006). Distributional properties of portfolio weights, Journal of Econometrics, 134, 235-256.
• Okhrin, Y. and Schmid, W. (2008). Estimation of optimal portfolio weights, International Journal of Theoretical and Applied Finance, 11, 249-276.
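For reference, the GMVP weights solve $w = \Sigma^{-1}\mathbf{1} / (\mathbf{1}'\Sigma^{-1}\mathbf{1})$, and a shrinkage estimator combines the sample weights with a deterministic target. The sketch below uses an invented toy shrinkage intensity driven by $c = p/n$ — the optimal data-driven intensity derived in Bodnar et al. (2017) differs:

```python
import numpy as np

def gmvp_weights(Sigma):
    """GMVP weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / w.sum()

def shrinkage_gmvp(returns, target=None):
    """Toy shrinkage estimator: convex combination of the sample GMVP
    weights and a target (equally weighted) portfolio, with an assumed
    intensity 1 - p/n; the intensity in Bodnar et al. (2017) differs."""
    n, p = returns.shape
    if target is None:
        target = np.full(p, 1.0 / p)
    S = np.cov(returns, rowvar=False)          # sample covariance matrix
    w_sample = gmvp_weights(S)
    c = p / n                                  # concentration ratio
    alpha = max(0.0, 1.0 - c)                  # assumed toy intensity
    return alpha * w_sample + (1.0 - alpha) * target

rng = np.random.default_rng(1)
n, p = 250, 50                                 # p/n = 0.2
returns = rng.normal(0.0, 0.01, size=(n, p))   # simulated asset returns
w = shrinkage_gmvp(returns)
print(abs(w.sum() - 1.0) < 1e-10)              # weights sum to one
```

Since both the sample weights and the target sum to one, any convex combination remains a valid portfolio, which is what makes shrinkage attractive when $c$ is close to $1$ and the sample weights become unstable.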

Near-exact distributions – comfortably lying closer to exact distributions than common asymptotic distributions

We are all quite familiar with the concept of an asymptotic distribution. For some sets of statistics, as is for example the case with the likelihood ratio test statistics, mainly those used in Multivariate Analysis, some authors have developed what are nowadays seen as “standard” methods of building such asymptotic distributions, as in the seminal paper by Box (1949).

However, such asymptotic distributions quite commonly yield approximations which fall short of the precision we need, and/or may exhibit problems when some parameters in the exact distributions grow large, as is indeed the case with many asymptotic distributions commonly used in Multivariate Analysis when the number of variables involved grows even just moderately large.

The pertinent question is thus the following: are we willing to pay a bit more, in terms of a more elaborate structure for the approximating distribution, while keeping it manageable enough to allow quite easy computation of p-values and quantiles?

If our answer to the above question is affirmative, then we are ready to enter the amazing world of the near-exact distributions.

Near-exact distributions are asymptotic distributions developed under a new concept of approximating distributions. Based on a decomposition (i.e., a factorization or a split in two or more terms) of the characteristic function of the statistic being studied, or of the characteristic function of its logarithm, they are asymptotic distributions which lie much closer to the exact distribution than common asymptotic distributions.

If we are able to keep a good part of the original structure of the exact distribution of the random variable or statistic being studied untouched, we may obtain a much better approximation. Such an approximation no longer exhibits the problems referred to above that occur with some asymptotic distributions and, on top of this, performs extremely well even for very small sample sizes and large numbers of variables, being asymptotic not only for increasing sample sizes but also (in contrast to common asymptotic distributions) for increasing numbers of variables involved.

GARCH processes and the phenomenon of misleading and unambiguous signals

In Finance it is quite usual to assume that a process behaves according to a previously specified target GARCH process. The impact of rumours or other events on this process can frequently be described by an outlier responsible for a short-lived shift in the process mean, or by a sustained change in the process variance. This calls for the use of joint schemes for the process mean and variance, such as the ones proposed by Schipper (2001) and Schipper and Schmid (2001).

Since changes in the mean and in the variance require different actions from the traders/brokers, this paper provides an account on the probabilities of misleading and unambiguous signals (PMS and PUNS) of those joint schemes, thus adding valuable insights on the out-of-control performance of those schemes.

We are convinced that this talk is of interest to business persons/traders/brokers, quality control practitioners, and statisticians alike.

Joint work with:

• Beatriz Sousa (MMA; CGD);
• Yarema Okhrin (Faculty of Business and Economics, University of Augsburg, Germany);
• Wolfgang Schmid (Department of Statistics, European University Viadrina, Germany)

References

• Schipper, S. (2001). Sequential Methods for Detecting Changes in the Volatility of Economic Time Series. Ph.D. thesis, European University, Department of Statistics, Frankfurt (Oder), Germany.
• Schipper, S. and Schmid, W. (2001). Control charts for GARCH processes. Nonlinear Analysis: Theory, Methods & Applications 47, 2049-2060.
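A minimal simulation of the target process, followed by an illustrative sustained variance shift of the kind such joint schemes are designed to flag, can be sketched as follows (the parameter values, seeds, and shift size are arbitrary choices for the sketch):

```python
import numpy as np

def simulate_garch(n, omega=0.05, alpha=0.1, beta=0.85, seed=0):
    """GARCH(1,1): y_t = sigma_t * eps_t,
    sigma_t^2 = omega + alpha * y_{t-1}^2 + beta * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    y = np.empty(n)
    sigma2 = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    for t in range(n):
        y[t] = np.sqrt(sigma2) * rng.standard_normal()
        sigma2 = omega + alpha * y[t] ** 2 + beta * sigma2
    return y

# target (in-control) stretch followed by a sustained variance inflation,
# the out-of-control scenario whose signal probabilities (PMS/PUNS) matter
y_in = simulate_garch(500)
y_out = 1.5 * simulate_garch(500, seed=1)   # standard deviation inflated by 50%
y = np.concatenate([y_in, y_out])
print(len(y))
```

A misleading signal would occur, for instance, if the mean chart of the joint scheme signalled first on the second stretch, where only the variance has changed.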

Education: a challenge to statistical modelling

In Education, the main random variable is, in general, the knowledge level of the student on a certain subject. This is generally considered to be the result of a process that involves one or more educational agents (e.g. teachers) and the reflection and training of the student. There are many covariates, including innate capacities, resilience, perseverance, family and health conditions, and the quality of education.

While this is a subject about which all of us have some a priori knowledge (essential in statistical modelling), it presents an initial difficulty that may lead to significant bias and wrong interpretations: the measurement of the main random variable. In educational psychometrics, the measurement tools are tests or examinations, and the main variable is a latent variable.

In this talk we shall review some recent advances in statistics applied to education, namely the international evaluation of students and the assessment of the impact of educational policies. We also review the main databases from DGEEC (Direção Geral de Estatísticas da Educação e Ciência) and their use for research purposes.

Some hot topics of maximum entropy research in economics and statistics

Maximum entropy is often used for solving ill-posed problems that occur in diverse areas of science (e.g., physics, informatics, biology, medicine, communication engineering, statistics and economics). The works of Kullback, Leibler, Lindley and Jaynes in the fifties of the last century were fundamental to connect the areas of maximum entropy and information theory with statistical inference. Jaynes states that the maximum entropy principle is a simple and straightforward idea. Indeed, it provides a simple tool for making the best prediction (i.e., the one that is most strongly indicated) from the available information, and it can be seen as an extension of Bernoulli's principle of insufficient reason. The maximum entropy principle provides an unambiguous solution for ill-posed problems by choosing the distribution of probabilities that maximizes the Shannon entropy measure.

Some recent research in regularization (e.g., ridGME and MERGE estimators), variable selection (e.g., normalized entropy with information from the ridge trace), inhomogeneous large-scale data (e.g., normalized entropy as an alternative to maximin aggregation) and stochastic frontier analysis (e.g., generalized maximum entropy and generalized cross-entropy with data envelopment analysis as an alternative to maximum likelihood estimation) will be presented, along with several real-world applications in engineering, medicine and economics.
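Jaynes' classical dice example makes the principle concrete: among all distributions on {1,…,6} with a prescribed mean, the maximum entropy solution is an exponential family $p_i \propto e^{\lambda i}$, recoverable by a one-dimensional search. This is a sketch; the target mean 4.5 is the traditional illustrative choice:

```python
import numpy as np

def maxent_die(target_mean, faces=np.arange(1, 7)):
    """Among all pmfs on the faces with the given mean, the Shannon-entropy
    maximizer is p_i proportional to exp(lam * i); solve for lam by
    bisection on the resulting mean (which is increasing in lam)."""
    def mean_for(lam):
        w = np.exp(lam * faces)
        return (faces * w).sum() / w.sum()
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    w = np.exp(lo * faces)
    return w / w.sum()

p = maxent_die(4.5)   # mean above 3.5, so mass shifts to the high faces
print(np.round(p, 3))
```

With no constraint beyond normalization the same machinery returns the uniform distribution, which is exactly the principle of insufficient reason the abstract mentions.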

Statistical models for the relationship between daily temperature and mortality

The association between daily ambient temperature and health outcomes has frequently been investigated based on a time series design. The temperature–mortality relationship is often found to be substantially nonlinear and to persist, but change shape, with increasing lag. The statistical framework has therefore developed substantially in recent years. In this talk I describe the general features of time series regression, outlining the analysis process for modelling short-term fluctuations in the presence of seasonal and long-term patterns. I also offer an overview of the recently extended family of distributed lag non-linear models (DLNM), a modelling framework that can simultaneously represent non-linear exposure–response dependencies and delayed effects. To illustrate the methodology, I use an example representing the relationship between temperature and mortality, using data from the MCC Collaborative Research Network, an international research program on the association between weather and health.
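A stripped-down version of the time series regression step can be sketched as follows: an unconstrained distributed-lag design matrix and a Poisson fit by Newton-Raphson/IRLS. Real DLNM analyses use spline bases in both the exposure and lag dimensions and adjust for season and long-term trend; here the data, the lag effects, and the lag length are simulated toys:

```python
import numpy as np

def lag_matrix(x, max_lag):
    """Columns are x at lags 0..max_lag; rows with incomplete history dropped."""
    n = len(x)
    return np.column_stack([x[max_lag - l: n - l] for l in range(max_lag + 1)])

def poisson_irls(X, y, n_iter=25):
    """Poisson regression (log link) fitted by Newton-Raphson / IRLS."""
    X = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                  # sensible starting point
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    return beta

rng = np.random.default_rng(2)
n, max_lag = 1000, 3
temp = rng.normal(20.0, 5.0, n)                       # daily temperature (toy)
L = lag_matrix(temp, max_lag)
true_lag_effects = np.array([0.02, 0.01, 0.005, 0.0])  # assumed lag curve
eta = L @ true_lag_effects
y = rng.poisson(np.exp(eta - eta.mean() + 2.0))        # daily counts (toy)
beta = poisson_irls(L, y)
print(beta.shape)
```

The sum of the fitted lag coefficients plays the role of the cumulative (overall) association that DLNM summaries report.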

On a class of optimal stopping problems with applications to real option theory

We consider an optimal stopping problem related to many models found in real options. The main goal of this work is to bring to the field of real options different and more realistic pay-off functions, as well as negative interest rates. We present analytical solutions for a wide class of pay-off functions under quite general assumptions on the model. An extensive and general sensitivity analysis of the solutions, and an economic example which highlights the mathematical difficulties in the standard approaches, are also provided.

(joint work with Manuel Guerra and Carlos Oliveira)

Applied environmental time series analysis

Since the very beginning, developments in data analysis and time series methods have been closely associated with phenomena in the natural environment: the power spectra of Schuster, motivated by earthquakes and sunspots; Walker's analysis of the Southern Oscillation in attempting to predict the Indian monsoon; Tukey's cross-spectrum for the analysis of seismic waves; Gumbel's extreme value analysis, inspired by meteorological and hydrological phenomena; and Hurst's long-memory concept, drawn from observations of the Nile's water levels. In the modern era of disposable, low-power devices and cheap storage, the natural environment is monitored at an incredible pace, yielding copious time series of high-resolution (sub-hourly) observations which are widely available. Despite the many computationally intensive approaches developed and currently available to handle streams of data, their success in producing new and environmentally relevant information is surprisingly low. An obvious challenge is the integration of problem-related knowledge and context in the data analysis process, moving from data summaries, visualisations and alarms to an exploratory analysis that aims to discover new and physically relevant information in the environmental data. This talk addresses the practical challenges and opportunities in the analysis of high-resolution environmental time series, as illustrated by time series of environmental radioactivity and by measurements from the ongoing gamma radiation monitoring campaign in the Azores.

Air quality science: putting statistics to work

Several statistical tools have been used to analyse air quality data for different purposes. This talk will highlight some of these examples and how different statistical tools can bring added value to this environmental science area. First, changes in pollutant concentrations were examined and clustered by means of quantile regression, which makes it possible to analyse trends not only in the mean but across the whole data distribution. The clustering procedure indicated where the largest trends are found, in terms of space (location) and quantiles. Secondly, the individual variance/covariance profiles of a set of hourly air quality time series are embedded in a wavelet-decomposition-based clustering algorithm in order to identify groups of stations exhibiting similar profiles. The results clearly indicate a geographical pattern among the different types of stations and made it possible to identify sites whose classification according to environment/influence type needs revision. Both exercises were particularly important for air quality management practices, in particular the design of the national monitoring network.
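The quantile-regression trend analysis can be illustrated by minimizing the pinball (check) loss of a linear trend directly. This is a sketch on simulated concentrations with an invented declining trend; dedicated LP-based solvers or quantile-regression packages are preferable in practice:

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(params, x, y, tau):
    """Check (pinball) loss of the linear model y ~ a + b*x at quantile tau."""
    a, b = params
    r = y - (a + b * x)
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

def fit_quantile_trend(x, y, tau):
    """Linear quantile-regression trend by direct minimization of the
    pinball loss, started from the least-squares fit."""
    slope0, intercept0 = np.polyfit(x, y, 1)
    res = minimize(pinball_loss, x0=[intercept0, slope0], args=(x, y, tau),
                   method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-10})
    return res.x

rng = np.random.default_rng(3)
x = np.arange(20, dtype=float)                    # years since start (toy)
conc = 50.0 - 0.8 * x + rng.normal(0, 3, 20)      # declining pollutant level
a50, b50 = fit_quantile_trend(x, conc, 0.5)       # median trend
a90, b90 = fit_quantile_trend(x, conc, 0.9)       # upper-tail trend
print(round(b50, 2), round(b90, 2))
```

Comparing the slope across quantiles (here 0.5 versus 0.9) is what reveals trends confined to the tails of the concentration distribution, which a mean-only analysis would miss.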

The max-semistable laws: characterization, estimation and testing

In this talk we present the class of max-semistable distribution functions, which appear as the limit in distribution of the maximum, suitably centered and normalized, of $k_n$ independent and identically distributed random variables, where $k_n$ is an integer-valued geometric sequence with ratio $r \geq 1$. This class of distributions includes all the max-stable distributions, but also multimodal and discrete distributions. We will characterize the max-semistable laws, discuss the estimation of the parameters and of the fractal component, and propose a test that allows us to distinguish between max-stable and max-semistable laws.

Joint work with Luísa Canto e Castro and Maria da Graça Temido.

Comparison of joint schemes for multivariate normal i.i.d. output

The performance of a product frequently relies on more than one quality characteristic. In such a setting, joint control schemes are used to determine whether or not we are in the presence of unfavorable disruptions in the location and spread of a vector of quality characteristics. A common joint scheme for multivariate output comprises two constituent control charts: one for the mean vector based on a weighted Mahalanobis distance between the vector of sample means and the target mean vector; another one for the covariance matrix depending on the ratio between the determinants of the sample covariance matrix and the target covariance matrix. Since we are well aware that there are plenty of quality control practitioners who are still reluctant to use sophisticated control statistics, this paper tackles Shewhart-type charts for the location and spread based on a few pairs of control statistics that depend on the nominal mean vector and covariance matrix. We recall or derive the joint probability density functions of these pairs of control statistics in order to investigate the impact on the ability of the associated joint schemes to detect shifts in the process mean vector or covariance matrix for various out-of-control scenarios.

Joint work with Wolfgang Schmid, Patrícia Ferreira Ramos, Taras Lazariv, António Pacheco.
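The two constituent control statistics described above can be sketched directly: a weighted Mahalanobis distance for the location chart and a determinant (generalized variance) ratio for the spread chart. The in-control parameters, dimension, and sample size below are illustrative, and control limits are omitted:

```python
import numpy as np

def joint_chart_stats(sample, mu0, Sigma0):
    """Control statistics of a joint multivariate Shewhart scheme:
    T2: weighted Mahalanobis distance of the sample mean from mu0;
    W:  ratio of determinants |S| / |Sigma0| (generalized variances)."""
    n, p = sample.shape
    xbar = sample.mean(axis=0)
    S = np.cov(sample, rowvar=False)           # sample covariance matrix
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(Sigma0, diff)
    W = np.linalg.det(S) / np.linalg.det(Sigma0)
    return T2, W

rng = np.random.default_rng(4)
mu0 = np.zeros(3)                              # target mean vector (toy)
Sigma0 = np.eye(3)                             # target covariance (toy)
sample = rng.multivariate_normal(mu0, Sigma0, size=10)   # in-control sample
T2, W = joint_chart_stats(sample, mu0, Sigma0)
print(T2 > 0 and W > 0)
```

A joint scheme signals when either statistic leaves its in-control region; the out-of-control analysis in the talk rests on the joint distribution of such pairs of statistics.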

Modelling extremal temporal dependence in stationary time series

Extreme value theory concerns the statistical study of the extremal properties of random processes. The most common problems treated by extreme value methods involve modelling the tail of an unknown distribution function from a set of observed data, with the purpose of quantifying the frequency and severity of events more extreme than any observed previously. A fundamental issue in applied multivariate extreme value (MEV) analysis is modelling dependence within joint tail regions. In this seminar we suggest modelling the joint tails of the distribution of consecutive pairs $(X_i, X_{i+1})$ of a first-order stationary Markov chain by the dependence model described in Ramos and Ledford (2009). Applications of this modelling approach to real data are then considered.

Ramos and Ledford (2009). A new class of models for bivariate joint tails. J. R. Statist. Soc. B, 71, 219-241.

Binary autoregressive geometric modelling in a DNA context

Symbolic sequences occur in many contexts and can be characterized e.g. by integer-valued intersymbol distances or binary-valued indicator sequences. The analysis of these numerical sequences often sheds light on the properties of the original symbolic sequences. This talk introduces new statistical tools to explore the autocorrelation structure in indicator sequences and to evaluate its impact on the probability distribution of intersymbol distances. The methods are illustrated with data extracted from mitochondrial DNA sequences.

This is a joint work with Manuel Scotto (IST, Lisbon, Portugal), Christian Weiss (Helmut Schmidt University, Hamburg, Germany) and Paulo Ferreira (DETI, IEETA, Aveiro, Portugal).
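The basic objects of such an analysis — the binary indicator sequence, the intersymbol distances, and their sample autocorrelation — can be sketched in a few lines (the DNA string below is an invented toy example, not data from the talk):

```python
import numpy as np

def indicator_sequence(seq, symbol):
    """Binary indicator of a given symbol along a DNA string."""
    return np.array([1 if s == symbol else 0 for s in seq])

def intersymbol_distances(ind):
    """Distances between consecutive occurrences of the symbol."""
    return np.diff(np.flatnonzero(ind))

def sample_acf(x, max_lag):
    """Sample autocorrelation of a sequence at lags 1..max_lag."""
    x = np.asarray(x, float) - np.mean(x)
    denom = (x * x).sum()
    return np.array([(x[:len(x) - k] * x[k:]).sum() / denom
                     for k in range(1, max_lag + 1)])

dna = "ATGCGATACGATTAGCCGATAGCATTACGGATCA"   # toy sequence
ind = indicator_sequence(dna, "A")
dists = intersymbol_distances(ind)
acf = sample_acf(ind, 5)
print(dists)
```

If the indicator sequence were i.i.d., the intersymbol distances would be geometric; autocorrelation in the indicators distorts that distribution, which is the effect the talk's tools are designed to quantify.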

On the peaks-over-threshold method in extreme value theory

The origin, the development and the use of the peaks-over-threshold method (in particular in higher-dimensional spaces) will be discussed as well as some issues that need clarification.
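The basic one-dimensional peaks-over-threshold recipe can be sketched as follows: choose a high threshold, keep the excesses over it, and fit a generalized Pareto distribution. The data are simulated unit-exponential draws, for which the limiting GPD has shape 0 and scale 1; the 95% threshold is an illustrative choice:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(5)
data = rng.standard_exponential(5000)      # toy data in the Gumbel domain
u = np.quantile(data, 0.95)                # high threshold (95% quantile)
excesses = data[data > u] - u              # peaks over the threshold
# fit the generalized Pareto distribution to the excesses (location fixed at 0)
shape, loc, scale = genpareto.fit(excesses, floc=0.0)
print(round(shape, 2), round(scale, 2))
```

The fitted shape parameter governs the tail: positive values indicate heavy tails, zero an exponential tail, and negative values a finite upper endpoint. The higher-dimensional versions discussed in the talk replace the GPD by multivariate analogues.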

Spatial and Spatio-Temporal Nonlinear Time Series

In this talk we present a new spatial model that incorporates heteroscedastic variance depending on neighboring locations. The proposed process is regarded as the spatial equivalent to the temporal autoregressive conditional heteroscedasticity (ARCH) model. We show additionally how the introduced spatial ARCH model can be used in spatiotemporal settings. In contrast to the temporal ARCH model, in which the distribution is known given the full information set of the prior periods, the distribution is not straightforward in the spatial and spatiotemporal setting. However, it is possible to estimate the parameters of the model using the maximum-likelihood approach. Via Monte Carlo simulations, we demonstrate the performance of the estimator for a specific spatial weighting matrix. Moreover, we combine the known spatial autoregressive model with the spatial ARCH model assuming heteroscedastic errors. Finally, the proposed autoregressive process is illustrated using an empirical example. Specifically, we model lung cancer mortality in 3108 U.S. counties and compare the introduced model with two benchmark approaches.

(joint work with Robert Gartho and Philipp Otto)

Distributed and robust network localization

Signal processing over networks has been a broad and active topic in recent years. In most applications, networks of agents rely on known node positions, even when the main goal of the network is not localization. Mobile agents also need localization for, e.g., motion planning or formation control, where GPS might not be an option. Moreover, real-world conditions imply noisy environments, and real-time network operation calls for fast and reliable estimation of the agents' locations. Galvanized by these compelling applications, researchers have dedicated a great amount of work to locating the nodes in networks. With growing networks of devices constrained in energy expenditure and computational power, the need for simple, fast, and distributed algorithms for network localization spurred this work. Here, we approach the problem starting from minimal data collection, aggregating only range measurements and a few landmark positions. We explore tailored solutions, drawing on optimization and probability tools, that can improve performance in noisy and unstructured environments. The main contributions are:
• Distributed localization algorithms characterized by their simplicity but also by strong guarantees;
• Analyses of convergence, iteration complexity, and optimality bounds for the designed procedures;
• Novel majorization approaches which are tailored to the specific problem structure.
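A minimal centralized version of the range-based problem — a single agent, a few landmarks — can be sketched by gradient descent on the range residuals. The positions, noise level, step size, and iteration count are invented for the sketch; the talk's contributions are distributed and majorization-based refinements of this kind of formulation:

```python
import numpy as np

def localize(anchors, ranges, x0, n_iter=2000, step=0.01):
    """Estimate a position by gradient descent on
    sum_i (||x - a_i|| - r_i)^2, the squared range residuals."""
    x = np.array(x0, float)
    for _ in range(n_iter):
        grad = np.zeros_like(x)
        for a, r in zip(anchors, ranges):
            diff = x - a
            dist = np.linalg.norm(diff)
            if dist > 1e-12:                       # avoid division by zero
                grad += 2.0 * (dist - r) * diff / dist
        x -= step * grad
    return x

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # landmarks (toy)
true_pos = np.array([3.0, 4.0])
rng = np.random.default_rng(6)
ranges = np.linalg.norm(anchors - true_pos, axis=1) + rng.normal(0, 0.05, 3)
est = localize(anchors, ranges, x0=[5.0, 5.0])
print(np.round(est, 1))
```

The cost is nonconvex, which is precisely why the convergence analyses and majorization (convex surrogate) approaches listed above matter in the distributed, noisy setting.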
