

Data Reconciliation - Work to Do Before Data Analysis

Data reconciliation is a technique aimed at correcting measurement errors that are due to measurement noise, i.e. random errors. From a statistical point of view, the main assumption is that no systematic errors exist in the set of measurements, since they may bias the reconciliation results and reduce the robustness of the reconciliation.


Definition and overview of Data Reconciliation

Data reconciliation provides estimates of process variables based on combining measurement information with process knowledge (such as mass and energy balances) in the form of equations and inequality constraints. If the constraints are correct, and measurements fit assumptions about their noise, the resulting estimates will be better than those obtained just from raw measurements. Data reconciliation is generally accomplished by formulating a least-squares optimization problem to minimize a weighted sum of the measurement adjustments and solving it.
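As a concrete illustration (this is the standard textbook formulation rather than anything specific to this page; the symbols y, Σ, and A are introduced here only for the example), the linear steady state problem can be written as

\min_{\hat{x}} \; (y - \hat{x})^{\top} \Sigma^{-1} (y - \hat{x}) \quad \text{subject to} \quad A\hat{x} = 0,

where y is the vector of raw measurements, Σ is the measurement error covariance (often diagonal, built from the sensor variances), and the rows of A express the linear mass and energy balances. When every variable is measured, this has the closed-form solution

\hat{x} = y - \Sigma A^{\top} \left( A \Sigma A^{\top} \right)^{-1} A y,

so each measurement is adjusted in proportion to how strongly it conflicts with the balances, weighted by how noisy that sensor is assumed to be.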

The field of data reconciliation generally also addresses gross error detection and handling (handling unexpectedly large measurement errors or model errors like leaks). Because of the importance of this, it is sometimes referred to as “Data Validation and Reconciliation”, or DVR for short.

Other key parts of the data reconciliation field include observability (what variables can be estimated) and redundancy (which measurements could have been estimated even without a sensor -- required for data reconciliation to adjust data to better than raw measurement values). Other issues that must be addressed in practical implementations include measurement filtering before reconciliation to eliminate high frequency noise, estimating the typical variability (variance) of sensors, and recognizing when steady state assumptions do not apply. Recognizing and dealing with sensors that have failed is referred to as data validation. This is generally done prior to the final data reconciliation run for a given data set, and is often included in the overall data reconciliation process. Software implementing data reconciliation, like other software, must have a usable GUI for model development and end users, and effective data integration to get the sensor data.

The initial focus of data reconciliation was on using algebraic equations, but later “dynamic data reconciliation” also addressed systems changing over time. The Kalman filters popular for dynamic systems did not directly apply because of data reconciliation’s emphasis on algebraic equations, the need to use measurements of process inputs as well as outputs, and the need to estimate values for those inputs.

The field of data reconciliation got its start in 1961 with a paper by Kuehn and Davidson formulating and analytically solving the case with linear constraints. Subsequent papers by Vaclavek and coworkers introduced many of the basic ideas in the field. The earliest published installation was at Amoco (American Oil Company, now part of BP). Exxon was the first company known to provide a formal software package (for internal use) and apply it widely. Currently, numerous commercial software packages are available.

Terminology associated with Data Reconciliation

Data Reconciliation

Estimation of a set of variables consistent with a set of constraints (such as material and energy balances), given a set of measurements. If the constraints are correct, and measurements fit assumptions about their noise, the resulting estimates will be better than those obtained just from raw measurements.

Gross Error

Gross errors are significant deviations from assumptions such as assumed error probability distributions in the case of measurements, or incorrect constraints. Gross errors in measurements typically reflect instrument failures, bias errors, or unusual noise spikes if only a short time averaging period is used. An example of a gross error due to an incorrect constraint is the unexpected presence of a significant leak. Gross errors invalidate the assumptions made in data reconciliation, so it is important to detect them and remove their effects. Otherwise, the reconciled estimates could actually be worse than just using the raw data. So, gross error detection is generally done prior to final estimates, although some techniques modify the problem to try to minimize the damage done by the gross errors.

Observability

Observability analysis answers the basic question of what variables can be determined given a set of measurements and constraints. Observability is checked to ensure that data reconciliation works at all.

Redundancy

Redundancy analysis determines which measurements could be estimated from other variables using the constraint equations. Without redundancy, data reconciliation cannot use the constraints to improve the estimates, so recognizing a lack of redundancy is key to knowing whether data reconciliation will be useful.

Variance

A measure of the variability of a sensor. It is the square of the standard deviation.
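As a small example (the numbers below are hypothetical, not from the original page), the variance of a sensor can be estimated from repeated readings taken while the process is believed to be at steady state; a Python sketch:

import numpy as np

# Hypothetical repeated readings from one flow sensor at steady state (kg/h)
readings = np.array([101.2, 99.8, 100.5, 100.9, 99.4, 100.1])

std_dev = readings.std(ddof=1)   # sample standard deviation
variance = std_dev ** 2          # variance is the square of the standard deviation

print(f"standard deviation = {std_dev:.3f}, variance = {variance:.3f}")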

 

The published technical papers presented next (based on Ph.D. thesis work and later experience at Exxon and Gensym) formalized the framework for understanding data reconciliation and the related topics of gross error detection, observability, and redundancy; partitioning a system based on observability and redundancy; extensions for dynamic data reconciliation; and the use of data reconciliation for generating signatures for gross error detection to be analyzed by techniques such as neural networks.

Data Reconciliation in steady state systems

The technical paper by Mah, Stanley, and Downing, Reconciliation and Rectification of Process Flow and Inventory Data, formalized and popularized data reconciliation in flow networks. It also introduced tests and an algorithm for detecting gross errors in flow networks (measurement errors and leaks) by analyzing nodal imbalances. It applied graph theory, significantly simplified the analysis and decomposition of problems, showed a practical application, and introduced a variety of concepts such as the “environment node” in the process graph to eliminate the distinction between “internal” and “external” flows. This paper provided the first formal use of graph theory both for analyzing flow reconciliation and for diagnosing gross errors. Although the formal solution is for steady state systems, the paper points out that inventory changes as measured by tank level changes can be easily accounted for by treating the inventory changes as equivalent to an additional flow. The abstract for the paper is:

This paper shows how information inherent in the process constraints and measurement statistics can be used to enhance flow and inventory data. Two important graph-theoretic results are derived and used to simplify the reconciliation of conflicting data and the estimation of unmeasured process streams. The scheme was implemented and evaluated on a CDC-6400 computer. For a 32-node 61-stream problem, the results indicate a 42 to 60 % reduction in total absolute errors, for the three cases in which the number of measured streams were 36, 50, and 61 respectively. A gross error detection criterion based on nodal imbalances is proposed. This criterion can be evaluated prior to any reconciliation calculations and appeared to be effective for errors of 20 % or more for the simulation cases studied. A logically consistent scheme for identifying the error sources was developed using this criterion. Such a scheme could be used as a diagnostic aid in process analysis.

That paper emphasized the analytical solutions for linear systems, and was mostly dedicated specifically to flow networks.
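As a toy illustration of that linear, flow-network case (a minimal sketch with a made-up 2-node, 3-stream network and made-up measurement values, not the implementation from the paper), the closed-form reconciliation and a standardized nodal imbalance check can be written in a few lines of Python:

import numpy as np

# Hypothetical series network: stream 1 -> node A -> stream 2 -> node B -> stream 3
# Each row of A is a node mass balance (flows in minus flows out = 0).
A = np.array([[1.0, -1.0,  0.0],    # node A: x1 - x2 = 0
              [0.0,  1.0, -1.0]])   # node B: x2 - x3 = 0

y     = np.array([100.0, 97.0, 103.0])   # raw flow measurements (made-up values)
sigma = np.array([  2.0,  2.0,   2.0])   # assumed sensor standard deviations
Sigma = np.diag(sigma**2)                # measurement error covariance

# Nodal imbalances and their standardized values: a simple gross error check
# that can be evaluated before any reconciliation, in the spirit of the paper.
r     = A @ y
cov_r = A @ Sigma @ A.T
z     = r / np.sqrt(np.diag(cov_r))
print("standardized nodal imbalances:", np.round(z, 2))

# Closed-form linear reconciliation: project y onto the balance constraints.
x_hat = y - Sigma @ A.T @ np.linalg.solve(cov_r, A @ y)
print("reconciled flows:", np.round(x_hat, 2))   # all balances now close exactly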

Extensions for nonlinear systems and for dynamics

Numerous enhancements in algorithms have been made since the early work. For steady state systems, the emphasis shifted from analytical solutions for the linear problems to numerical solutions to nonlinear problems, using nonlinear optimization. The optimization approach can also account for inequalities such as the physical constraints that flows are non-negative in most normal operations, and must be non-negative even during abnormal situations if check valves are present.
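A minimal sketch of that optimization formulation (using scipy and a made-up splitter node; the measurement values and standard deviations are assumptions for illustration, not from the page above):

import numpy as np
from scipy.optimize import minimize

# Hypothetical splitter: measured feed x0 splits into x1 and x2 (x0 = x1 + x2),
# and all flows must be non-negative (e.g. check valves are present).
y     = np.array([100.0, 65.0, 30.0])   # made-up raw measurements
sigma = np.array([  2.0,  2.0,  2.0])   # assumed standard deviations

def objective(x):
    # Weighted sum of squared measurement adjustments
    return np.sum(((x - y) / sigma) ** 2)

constraints = [{"type": "eq", "fun": lambda x: x[0] - x[1] - x[2]}]  # mass balance
bounds      = [(0.0, None)] * 3                                      # flows >= 0

result = minimize(objective, x0=y, method="SLSQP",
                  bounds=bounds, constraints=constraints)
print("reconciled flows:", np.round(result.x, 2))

The same pattern extends to nonlinear balance equations by writing them as additional equality constraint functions.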

Numerous extensions for dynamic systems have also been developed. The first paper to address dynamics along with steady state constraints was Estimation of Flows and Temperatures in Process Networks. This paper by Stanley and Mah was the first to introduce the combination of Kalman Filtering and Data Reconciliation, by estimating biases and other slowly changing variables such as heat transfer coefficients. These systems were defined as “Quasi Steady State” (QSS). The paper introduced the terminology “spatial redundancy” (redundancy due to the algebraic equations over one time period), and “temporal redundancy” (extra information available from sampling at multiple time intervals). It also introduced the algorithms for taking advantage of both forms of redundancy.

That paper also addressed estimation in nonlinear systems (e.g., including temperatures and energy flows as well as material flows), by using an Extended Kalman Filter approach. The abstract for the paper is:

It is shown that temperatures and flows in a process network can be estimated from a quasi steady state model and a discrete Kalman filter. The data needed for such an application are readily available in many operating plants, and the computational requirements are within the capabilities of available process computers.
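For readers unfamiliar with the filtering side, the sketch below shows one generic predict/update cycle of a discrete Kalman filter applied to a slowly changing quantity such as a sensor bias (this is the textbook filter with made-up numbers, not the specific quasi steady state formulation from the paper):

import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One predict/update cycle of a discrete Kalman filter.
    x, P : prior state estimate and its covariance
    z    : new measurement vector
    F, Q : state transition matrix and process noise covariance
    H, R : measurement matrix and measurement noise covariance
    """
    # Predict (for slowly changing parameters, F is near identity and Q is small)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the new measurements (temporal redundancy)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Example: track a single slowly drifting bias from a sequence of hypothetical
# observed discrepancies between a suspect sensor and a redundant estimate.
x, P = np.array([0.0]), np.array([[4.0]])
F, Q = np.eye(1), np.array([[0.01]])
H, R = np.eye(1), np.array([[1.0]])
for z in [0.8, 1.1, 0.9, 1.2]:
    x, P = kalman_step(x, P, np.array([z]), F, Q, H, R)
print("estimated bias:", np.round(x, 2))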

Observability and redundancy in process data estimation

The technical paper Observability and redundancy in process data estimation by Stanley and Mah addressed questions that remained unanswered in earlier work on data reconciliation. First of all, when will data reconciliation or QSS filtering perform adequately? Are there situations in which it will fail? What is the effect of measurement placement on estimator performance? Redundancy had already been shown to be useful, but how does one determine if a measurement is redundant? These questions are clearly of importance in selecting a measurement strategy. 

The paper answered these questions with a general theory of observability and redundancy. Originally, observability was defined by Kalman for dynamic systems. But the fundamental issue is the same in steady state and dynamic systems: a system is observable if a given set of measurements can be used to uniquely determine the state of the system. In this paper, observability was defined as a property of a steady state system characterized by set membership constraints such as those described by equations and inequalities. Redundancy has a simple definition: a measurement is redundant if its removal causes no loss of observability. So, a redundant measurement could be estimated using other measurements and constraints, even if its measurement values were missing.

The paper provided the first rigorous definitions of observability and redundancy for steady state and quasi-steady state systems, whether linear or described by nonlinear equations and set constraints such as inequalities. It provided the first practical tests for observability and redundancy for steady state systems, and was the first to fully explore the implications for estimator performance and problem decomposition. For nonlinear systems, observability and redundancy can be global (independent of specific values) or local (tied to specific sets of values). Simple examples of blending nodes, heat exchangers, and flow meters with different ranges illustrated the point.
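In the linear case, such tests reduce to simple rank checks on the constraint matrix. The sketch below (a simplified illustration under the standard linear model A_m x_m + A_u x_u = 0, with a made-up network; it is not code from the paper) checks observability of the unmeasured variables and redundancy of one measurement:

import numpy as np

def observable(A_u):
    """Linear steady state case: the unmeasured variables are observable
    iff the constraint columns for the unmeasured variables (A_u) are
    linearly independent, i.e. A_u has full column rank."""
    return np.linalg.matrix_rank(A_u) == A_u.shape[1]

def redundant(A_u, a_j):
    """A measurement is redundant if dropping it (moving its constraint
    column a_j to the unmeasured side) causes no loss of observability."""
    A_u_new = np.column_stack([A_u, a_j])
    return np.linalg.matrix_rank(A_u_new) == A_u_new.shape[1]

# Hypothetical series network x1 -> node A -> x2 -> node B -> x3,
# with x2 unmeasured and x1, x3 measured.
A = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])
A_u = A[:, [1]]                    # column for the unmeasured flow x2
print(observable(A_u))             # True: x2 can be computed from x1 or x3
print(redundant(A_u, A[:, 0]))     # True: x1 could still be estimated if its sensor failed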

The paper demonstrated the importance of these concepts in predicting qualitative estimator performance, not only for a QSS filter, but also for any constrained least-squares estimator like data reconciliation and others. When estimates approach points that are unobservable, the estimator breaks down. When estimates approach points where redundancy is lost, raw sensor data cannot be improved, and are used directly (and hence estimates are the most sensitive to any errors, including gross errors).

Observability was defined in a very general way using topological properties of sets. Results to classify observability, predict estimator performance, and decompose problems based on observability and redundancy were then made more and more specific as additional assumptions were made, such as the existence of derivatives or second derivatives for nonlinear equations, and the extreme but important case of linear constraints and measurements. The abstract for the paper is:

By analogy to the development for dynamic systems, concepts of observability and redundancy may be developed with respect to a steady state system. These concepts differ from their counterparts for dynamic systems in that they can be used to characterize individual variables and local behavior as well as system and global behavior. Relations between local observability, global observability, calculability and redundancy are established and explored in this paper. It is shown that these concepts are useful in characterizing the performance of process data estimators with regard to bias and uniqueness of an estimate, convergence of estimation procedures and the feasibility and implications of problem decomposition.

Observability and redundancy classification in process networks

The paper Observability and redundancy classification in process networks by Stanley and Mah specialized the analysis of observability and redundancy to process networks - that is, systems defined by material and energy balances. This typically meant estimating mass flows, temperatures and energy flows, with additional relationships between temperature and enthalpy built into the measurement equations. Given the special structure of process networks, it was possible to use graph theory to predict observability and redundancy. For instance, this paper was the first to point out that for mass flow constraints, lack of observability is associated with cycles of unmeasured flow arcs (where the cycles may pass through the “environment node”). Similarly, lack of redundancy is associated with cycles of flow arcs containing exactly one measurement. Forms of the cycle criteria also apply when energy balances are considered. Because of the nonlinearities introduced by energy balances, both local and global observability are addressed. Based on the previous paper, these criteria could then be used to predict the performance of data reconciliation, in terms of the ability to estimate the system state, the ability to improve on raw measurements, and the sensitivity to errors such as gross errors. The abstract for the paper is:

The utility of observability and redundancy in characterizing the performance of process data estimators was established in previous studies. In this paper two classification algorithms for determining local and global observability and redundancy for individual variables and measurements are presented. The concepts of biconnected components, perturbation subgraphs and feasible unmeasurable perturbations are introduced, and their properties are developed and used to effect classification, simplification and dimensional reduction. Step-by-step application of these algorithms is illustrated by examples. 
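A rough illustration of the mass flow cycle criterion described above (a sketch with a made-up three-stream network, using a simple union-find cycle check rather than the biconnected-component algorithms developed in the paper):

def has_cycle(num_nodes, arcs):
    # Return True if the undirected graph formed by the arcs contains a cycle.
    parent = list(range(num_nodes))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for u, v in arcs:
        ru, rv = find(u), find(v)
        if ru == rv:
            return True          # adding this arc closes a cycle
        parent[ru] = rv
    return False

# Nodes: 0 = environment node, 1 and 2 = process units.
# Arcs are (from, to, measured?):
arcs = [(0, 1, True),    # feed, measured
        (1, 2, False),   # internal stream, unmeasured
        (2, 0, False)]   # product, unmeasured

unmeasured = [(u, v) for u, v, measured in arcs if not measured]
# A cycle made entirely of unmeasured arcs (possibly through the environment
# node) would mean those flows are unobservable; a cycle containing exactly one
# measured arc would mean that measurement is non-redundant.
print("unobservable flows present:", has_cycle(3, unmeasured))   # False here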

Online data reconciliation for process control

The technical paper Online data reconciliation for process control by Stanley documents the theory and applications of data reconciliation for process control, used online in a chemical plant at Exxon. It introduced an approach for “dynamic reconciliation” that accounted for process dynamics separately from the algebraic constraints. This included some techniques for accomplishing dynamic data reconciliation, such as cascade estimation. It also provided insights from a frequency response viewpoint, such as the key role of data reconciliation in estimating slowly-changing biases like those introduced by sensors. (This was the first published data reconciliation paper pointing out that high frequency noise is eliminated by simple filtering of the raw data with exponential filters or moving averages -- what is left is estimation of bias errors and elimination of gross errors.) The paper emphasized not just getting better estimates, but providing robust estimates of process variables used in closed loop control schemes -- estimates that provided bumpless transfer when gross errors were detected and redundant sensors were removed from an estimator feeding a control scheme. Despite the “steady state” orientation of data reconciliation, it is possible to exploit its use of “spatial redundancy” and use it in certain circumstances with closed loop control, as outlined in the paper. The abstract for the paper is:

Combined data reconciliation with estimation of slowly changing parameters has been implemented for closed-loop control in a Chemical Plant. Goals include streamlining use of redundant measurements for backing up failed instruments, filtering noise, and, in some cases, reducing steady state estimation errors. Special considerations include bumpless transfer from failed instruments and automatic equipment up/down classification. Parameters are calculated and filtered, then held fixed during each data reconciliation.
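The exponential filtering mentioned above is straightforward; a minimal sketch (with made-up data and a hypothetical filter constant) shows how high frequency noise is removed while a slowly changing sensor bias passes through, leaving the bias for data reconciliation to estimate:

import numpy as np

def exponential_filter(raw, alpha=0.1):
    # First-order exponential filter: y[k] = alpha*raw[k] + (1 - alpha)*y[k-1]
    filtered = np.empty(len(raw))
    filtered[0] = raw[0]
    for k in range(1, len(raw)):
        filtered[k] = alpha * raw[k] + (1.0 - alpha) * filtered[k - 1]
    return filtered

# Hypothetical noisy flow signal with a constant +2.0 sensor bias
rng = np.random.default_rng(0)
raw = 100.0 + 2.0 + rng.normal(0.0, 1.5, size=200)
smoothed = exponential_filter(raw, alpha=0.1)
print("filtered value after 200 samples:", round(smoothed[-1], 2))  # near 102: noise gone, bias remains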

Gross error detection / fault diagnosis

“Gross errors” are unexpectedly large errors, due to instrument problems or un-modeled problems such as leaks. Initial tests for these were already mentioned in the papers cited above. Much additional work has been done on this, described in the books cited below. 

Gross error detection can be considered as one part of the overall more general problem of fault detection and diagnosis, which may be more effective when considering additional models and heuristics, and a larger number of sensors, controller modes, and valve positions not involved in just the steady state balance equations. This includes the analysis of measurement noise, and sudden jumps in value that can reveal problems or instrument calibration procedures that would be masked in the averages used as inputs to data reconciliation. 

An example is detecting stuck measurements for sensors normally involved in closed loop control. This can be detected outside of data reconciliation because a stuck measurement will lead to the calculation of near-zero standard deviation in the raw (unfiltered, unreconciled) values sampled at a shorter time interval than the reconciliation interval. That evidence might be combined with observing the controller output swinging to either the minimum or maximum value as long as there is some integral action in the controller. (Similar controller behavior but with normal sensor noise could indicate a stuck valve - a process problem rather than a sensor problem). When the operator notices the problem, they will put the controller into manual, which is also a heuristic indication of a possible failure, while the sensor or valve is being fixed.
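The stuck-measurement heuristic just described can be sketched in a few lines (the thresholds, scan counts, and controller limits below are made-up assumptions, not values from any particular control system):

import numpy as np

def stuck_sensor_suspected(raw_samples, controller_output, out_lo=0.0, out_hi=100.0):
    # Near-zero standard deviation of the fast raw samples suggests a frozen value...
    frozen = np.std(raw_samples) < 1e-3
    # ...and integral action drives the controller output toward a limit.
    saturated = (controller_output <= out_lo + 0.5) or (controller_output >= out_hi - 0.5)
    return frozen and saturated

# Hypothetical 1-second scans over one reconciliation interval: identical readings
raw_samples = np.full(60, 57.3)
print(stuck_sensor_suspected(raw_samples, controller_output=100.0))   # True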

This technical paper by Stanley shows an approach to model-based diagnostics using either model errors or data reconciliation, combined with a pattern analyzer such as a neural net: Neural nets for fault diagnosis based on model errors or data reconciliation. The abstract is:

 Instrument faults and equipment problems can be detected by pattern analysis tools such as neural networks. While pattern recognition alone may be used to detect problems, accuracy may be improved by "building in" knowledge of the process. When models are known, accuracy, sensitivity, training, and robustness for interpolation and extrapolation should be improved by building in process knowledge. This can be done by analyzing the patterns of model errors, or the patterns of measurement adjustments in a data reconciliation procedure. Using a simulation model, faults are hypothesized, during "training", for later matching at run time. Each fault generates specific model deviations. When measurement standard deviations can be assumed, data reconciliation can be applied, and the measurement adjustments can be analyzed using a neural network. This approach is tested with simulation of flows and pressures in a liquid flow network.  A generic, graphically-configured simulator & case-generating mechanism simplified case generation. 
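A compressed sketch of that run-time pattern matching idea (using scikit-learn's MLPClassifier as a stand-in neural network; the adjustment patterns and fault labels below are invented for illustration, not results from the paper):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical training data: each row is the pattern of data reconciliation
# measurement adjustments produced by simulating one fault class.
X_train = np.array([
    [ 0.1, -0.1,  0.0],   # normal operation
    [ 2.5, -0.2,  0.1],   # bias on sensor 1
    [ 0.2,  0.1, -2.8],   # bias on sensor 3
    [ 1.4, -1.5, -1.3],   # leak between nodes
])
y_train = ["normal", "sensor 1 bias", "sensor 3 bias", "leak"]

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
clf.fit(X_train, y_train)

# At run time, classify the adjustment pattern from the latest reconciliation;
# a pattern close to the second training row should be labeled as a sensor 1 bias.
print(clf.predict([[2.3, -0.3, 0.2]]))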

The concept paper Pipeline Diagnosis Emphasizing Leak Detection: An Approach And Demonstration outlines an approach to pipeline leak detection that combines causal models of abnormal behavior with both static (algebraic) models and dynamic models, making use of data reconciliation.

Copyright 2010-2013, Greg Stanley

External links

Books

S. Narasimhan and C. Jordache, Data Reconciliation and Gross Error Detection: An Intelligent Use of Process Data, Gulf Publishing Company, Houston, 2000.

J. Romagnoli and M. Sanchez, Data Processing and Reconciliation for Chemical Process Operations, Volume 2 (Process Systems Engineering), Academic Press, San Diego, 2000.

Tutorials

 Introduction to data reconciliation and gross error diagnosis (Narasimhan)

Data Reconciliation - Validation Intro (Heyen)

General

LinkedIn Data Reconciliation group

 


Source: http://gregstanleyandassociates.com/whitepapers/DataRec/datarec.htm