Fitting and Estimating Parameter Confidence Limits with Sherpa

Sherpa is a generalized modeling and fitting package. Primarily developed for the Chandra Interactive Analysis of Observations (CIAO) package by the Chandra X-ray Center, Sherpa provides an Object-Oriented Programming (OOP) API for parametric data modeling. It is designed to use the forward fitting technique to search for the set of best-fit parameter values in parametrized model functions. Sherpa can also estimate the confidence limits on best-fit parameters using a new confidence method or using an algorithm based on Markov chain Monte Carlo (MCMC). Confidence limits on parameter values are necessary for any data analysis result, but can be non-trivial to compute in a non-linear and multi-parameter space. This new, robust confidence method can estimate confidence limits of Sherpa parameters using a finite convergence rate. The Sherpa extension module, pyBLoCXS, implements a sophisticated Bayesian MCMC-based algorithm for simple single-component spectral models defined in Sherpa. pyBLoCXS has primarily been developed in Python using high-energy X-ray spectral data. We describe the algorithm including the features for defining priors and incorporating deviations in the calibration information. We will demonstrate examples of estimating confidence limits using the confidence method and processing simulations using pyBLoCXS.


Introduction
Sherpa is an extensible, general purpose modeling and fitting application written in Python and Python C/C++/FORTRAN extensions. Originally developed for users of NASA's Chandra Xray Observatory, Sherpa has also been used to analyze data from other astronomy missions, and even non-astronomical data. Sherpa provides Python data classes to encapsulate various types of astronomical data sets (spectra, images, time series, light curves). But to provide the greatest flexibility, Sherpa is also designed to read in any data set that can be represented as a collection of arrays. From its first version, Sherpa has been designed to help scientists analyze data from many different sources, and to be extensible by scientific users, to help solve new problems. Sherpa's main task is to help users fit parametrized models to their data. Sherpa provides a library of physical and mathematical models, also written in Python. These models can be combined in arbitrarily complex expressions, that are interpreted by the Python parser; such expressions can include Sherpa models, arithmetic operators, models written by users in Python, and even other Python functions.
To compare models and data, Sherpa includes statistics such as least-squares, chi-squared based on Gaussian statistics, and maximum likelihood based on Poisson statistics. As model parameters are varied, Sherpa can then measure whether the new model parameter values improve or worsen the fit to the data, using one of these statistics. Sherpa also provides functions to search parameter space for the set of best-fit parameter values: a non-linear least squares using the Levenberg-Marquardt algorithm [lm]; and the Nelder-Mead simplex algorithm [nm].
However, the analysis is not complete when a user has found a set of best-fit model parameter values consistent with the data. Because of measurement errors and statistical noise, there is some probability distribution in parameter space of parameter values that are consistent with the data. If the user can examine this probability distribution in some way, the user can determine how well the best-fit parameter values are constrained. Such constraints are often summarized as confidence limits, stating that parameters are known to a certain level of confidence [avn1976]. For example, after an examination of parameter space, the user might determine that, for a model having temperature as a parameter, a best-fit temperature of 1.2 keV has 90% confidence limits of +0.2 keV, -0.4 keV. Meaning that if the observation and resulting fit were replicated 1000 times, then in 900 trials the best-fit temperature would be between 1.4 and 0.8 keV. The narrower the confidence limits, the better the constraints on the best-fit parameter value.
In this paper, we describe several methods we make available to Sherpa users and Python programmers to put confidence limits on fitted parameter values. We discuss a confidence limit function included in Sherpa, that examines parameter space near the local minimum representing best-fit parameter values, and that returns the desired confidence limits. We provide an interface to this function allowing users to add their own statistic and fitting functions, making this function available to SciPy users. We also discuss the use of simulations in Sherpa to derive limits from distributions of fitted parameter values after many simulations and fits, and show an example to derive both flux and flux errors from a model fitted to spectral data. Finally, we present a new Python module providing a Bayesian approach to deriving fitted parameter values and confidence limits: pyBLoCXS, a new importable Python module, that allows use of prior distributions on model parameters via extensions to Sherpa statistics classes.

Data Preparation
Sherpa provides native Python data classes that encapsulate 1-D and 2-D data sets, i.e., (x, y) and (x_1, x_2, y) respectively. These classes can be extended to contain data of higher dimensionality. In all data classes, the x-array(s) are considered to be the independent variable(s), and the y-array is considered the dependent variable. Any model to be fit to the data must take the form f(x, p), where x is the collection of x-array(s) taken from the data, and p is the array of parameter values that may be varied by the Sherpa fitting function. The model returns an array of model values that are compared to the data's y-values.
Sherpa data classes also include error bars on the dependent variable; these error bars are assumed to be symmetric. (Error bars on the independent x-array(s) are not yet supported, and so are not assumed to be significant.) The data classes can contain both statistical and systematic errors; if both are present, they are added in quadrature to provide the error bars on the data. Systematic errors come from measurements; if not provided along with the data, Sherpa assumes the systematic errors are zero. If statistical errors are not measured and provided with the data, then Sherpa can estimate Gaussian errors as needed in χ 2 fitting; or, the user can use one of the maximum likelihood statistics, in which Poisson statistics are assumed; or, the user can do a simple least-squares fit to the data.

Fitting Models to Data
Sherpa models are assumed to be parametrized functions f(x, p), where x is the collection of x-array(s) from the data, and p is the array of model parameters. When the model is calculated, the return value is an array of predicted data values that can be directly compared to the observed data values (that are contained in the data's y-array). Sherpa includes a model description syntax for users to build composite models that are arbitrarily complex. To support such a powerful feature, the user is not required to provide a function to calculate the derivatives. For least squares fitting using Levenberg-Marquardt, Sherpa estimates the gradient using forward difference approximation (LMDIF) and backward difference approximation if the fit is at an upper parameter boundary. An estimate of the gradient is not needed for fitting using simplex, only the fit statistic value is required.
In some cases, the fit parameters are not necessarily independent and identically distributed (i.i.d.) and correlations between parameters are present. This can lead to non-linear effects and complex parameter spaces, see Figure 1. We present a method designed to calculate confidence intervals in non-linear regression and a Bayesian method to sample the posterior probability distribution.

Confidence Intervals
The optimizer's search for the best-fit parameters stops when the fit statistic or error function has reached an optimal value. For least squares, the optimal value is when the sum of squared residuals is a minimum. For the maximum likelihood estimator, the optimal value is found when the log-likelihood is a maximum. Once the best-fit parameter values are found, users typically determine how well constrained the parameter values are at a certain confidence level by calculating confidence intervals for each parameter. The confidence level is a value of the fit statistic that describes a constraint on the parameter value. The confidence interval is the range that likely contains the parameter value at which the fit statistic reaches its confidence level while other parameters reach new best-fit values. See Figure 2. For example, consider calculating the confidence intervals at a value of σ = 1, or 68% confidence. If the observed data is re-sampled and the model is fit again with new data, there would be a 68% chance that the confidence intervals would constraint the parameter value. The narrower the confidence interval, the more the model parameter value becomes accurately constrained.

Fig. 2: A closeup view of a local minima
In the neighborhood of the fit statistic minimum, the multidimensional parameter space can take the shape of an asymmetric paraboloid. The confidence intervals are calculated for each selected parameter independently by viewing the parameter space along the current parameter's dimension. This view can be represented as a 1-D asymmetric parabola, see Figure 2. Suppose that x 0 represents a parameter's best-fit value. Its associated confidence intervals are represented as x 0 ± δ 1 δ 2 where δ 1 = δ 2 in non-linear parameter spaces, so each confidence limit must be calculated independently. In turn, the statistic value should equal an amount of σ 2 (where σ represents the degree of confidence) at each confidence interval x 0 + δ 1 and x 0 − δ 2 as other parameters vary to new best-fit values. The degree to which the confidence limit is bounded can be characterized by the shape of the well in a multi-dimensional parameter space. A well that is a deep-andnarrow corresponds to a tight confidence interval while a well that is shallow-and-broad represents a wider confidence interval. The confidence intervals can be reduced to a root solving problem by translating the y-axis by an amount equal to σ 2 and selecting points along the fit statistic curve.

Method for Determining Confidence
Calculating the confidence interval for a selected fit parameter can be transformed into a one dimensional root finding problem with the correct coordinate translation. By simply translating the parameter dimension by an amount equal to σ 2 , the confidence intervals now become x-axis intercepts in the parameter dimension. This is an important step in the algorithm because a change in sign will bracket the root. The green and blue points in Figure  3 effectively bracket the requested confidence limit.

Method for Selecting Abscissae
Sherpa's confidence method uses Müller's root finding method to calculate the confidence intervals given three points. Sherpa begins at the best-fit value and calculates points along the fit statistic curve using the covariance, if available, and the secant method. Müller's method is the a good algorithm for finding the root of a curve that is approximated by a parabola near the minimum. We argue that the function curve can be approximated by parabola given that the function can be represented as a Taylor's series. The leading term in series expansion is quadratic since the gradient of the statistic curve can be ignored near the minimum.
The confidence method assumes that the parameter values are located in a minimum approximated by a parabola, that the bestfit is sufficiently far from any parameter boundaries, and that the bracketed parameter interval is larger than the requested machine tolerance.

A Bayesian Approach to Confidence
Fitting Poisson data with χ 2 can lead to biased results. Using likelihood statistics like cash or C do not introduce bias, but lack simple tests for characterizing how well the model fits the data. Such likelihood statistics often require additional methods to validate model selection and to determine "goodness-of-fit". Such methods involve sampling from the posterior probability distribution. Sherpa includes fit statistics derived from the likelihood and complimentary optimization methods, but on its own Sherpa does not include the means to calculate the posterior.
pyBLoCXS is an additional Python module that complements Sherpa to probe the posterior probability and to verify model selection using Bayesian methods. pyBLoCXS is designed to use Markov chain Monte Carlo (MCMC) techniques to explore parameter space at a suspected minimum. pyBLoCXS was originally implemented and tested to handle Bayesian Low-Count X-ray Spectral (BLoCXS) analysis in Sherpa using simple composite spectral models, and additional research is underway to test more complex cases.
The underlying statistical model in pyBLoCXS employs Bayes' Rule 2 where the posterior probability distribution is proportional to the product of the conditional and prior distributions.
Where p(θ |d, I) represents the posterior distribution; p(d|θ , I), the likelihood; p(θ |I), the prior; and p(d|I) is considered constant.
Where θ represents the model parameters; d, the observed data; and I, the initial information.
The pyBLoCXS package includes a method get_draws to sample the posterior distribution for a specified number of iterations. The loop draws parameter values from a multi-variate Student's t distribution and calculates the likelihood on the parameter proposal given the observed data. The proposal is then accepted or rejected according to the current Metropolis-Hastings acceptance criterion and repeat. See Figure 4 for a graphical representation of the MCMC loop. pyBLoCXS currently has two sampling methods. The Python class, MH, implements a Metropolis-Hastings jumping rule characterized by the Student's t distribution based on the input scales, best-fit values, and user-specified degrees of freedom. The second class, MetropolisMH, is a variation on MH in that it implements a Metropolis-Hastings jumping rule with a Metropolis jumping rule centered on the current draw.
The pyBLoCXS package can be used separately from Sherpa using just Python and NumPy. The main inputs to pyBLoCXS are a callable function to calculate the log-likelihood, an ndarray of best-fit parameter values of size n, an ndarray of the multi-variate scales of size n x n, and the degrees of freedom. The ndarray of multi-variate scales is typically the covariance matrix calculated at the best-fit parameter values.
pyBLoCXS is based on the techniques described in the paper [van2001], however, pyBLoCXS implements a different type of sampler. A description of the MCMC methods implemented in pyBLoCXS can be found in Chapter 11 of [gel2004].

Example
The Thurber problem is an example of Non-linear least squares regression from the Statistical Reference Datasets (StRD) at the National Institute of Standards and Technology (NIST). The observed data results from a NIST study of semiconductor electron mobility. The data includes 37 observations with the dependent variable (y) represented as electron mobility and the independent variable (x) as the log of the density.
We define a compact high-level UI to access the Sherpa confidence method. The illustrative example below minimizes the Thurber function using least-squares and Sherpa's implementation of Levenberg-Marquardt (LMDIF). The results can be found in Table 1. The fit results agree to 99.99% for all parameters.   # define a tolerance tol = 1.e-9

Sherpa Confidence Method
The example below highlights the calculation of the asymmetric 1σ confidence limits on seven parameters using conf using the C-statistic and simplex. The confidence limits are accessible as NumPy arrays pmins and pmaxes.   Confidence limits on the example Thurber problem are listed in Table 2.

Sherpa Covariance Method
To compute the covariance matrix, Sherpa first estimates the information matrix by finite differences by reducing a multidimensional problem to a series of 1-D problems. Sherpa then iteratively applies second central differencing with extrapolation (Kass 1987). The covariance matrix follows by inverting the information matrix. The example below calculates the covariance matrix accessible as a NumPy array for the seven parameter values. An estimation of the symmetric confidence limits are found in the NumPy arrays pmins and pmaxes. It is important to note that the parameter uncertainties computed by covariance do not consider correlations between parameters and can underestimate or overestimate the true uncertainty. Compare the differences in uncertainties computed by conf and covar in Tables 2 and 3. pyBLoCXS includes high level plotting functions to display the trace, the cumulative distribution function, and the probability distribution function. The trace plot for β 1 includes gaps in the line that indicate rejected parameter proposals. This example has an acceptance rate of~24%, well within the accepted range for an MCMC chain.
The scatter function in matplotlib can be used to visualize the log-likelihood according to two selected parameters. Using Metropolis-Hastings as the sampler, the density plot is shown in Figure 7 . For parameters β 3 and β 4 , a distinct correlation is shown as a long and narrow well.

Accounting for Calibration Uncertainties
Future released versions of pyBLoCXS will include methods to incorporate the systematic uncertainties in modeling high energy spectra. These uncertainties which have largely been ignored due to the lack of a comprehensive method, can introduce bias in the calculation of model parameters and can underestimate their variance. Specifically, pyBLoCXS will utilize the calibration uncertainties in the effective area curve for spectral analysis. The effective area for high energy detectors records the sensitivity of the detector as a function of energy.
Calibration samples of the effective area are described in Drake et al. (2006) using Principle Component Analysis (PCA) to represent the curve's variability. Samples of the effective area can also be found using simulations.
pyBLoCXS perturbs the effective area curve by sampling from the calibration information at each iteration in the MCMC loop accurately accounting for the non-linear effects in the systematic uncertainty. With this method, best-fit model parameters values and their uncertainty are estimated more accurately and efficiently using Sherpa and pyBLoCXS.

Conclusion
We describe the Sherpa confidence method and the techniques included in pyBLoCXS to estimate parameter confidence when fit parameters present with correlations or the parameters are not themselves normally distributed. Multi-dimensional parameter space is typically non-uniform and Sherpa provides the user with options to explore its topology. The included code example describes an application of the Sherpa confidence method and the pyBLoCXS sampling method. Support of the development of Sherpa is provided by National Aeronautics and Space Administration through the Chandra Xray Center, which is operated by the Smithsonian Astrophysical Observatory for and on behalf of the National Aeronautics and Space Administration contract NAS8-03060.