The use of complementary techniques of machine learning to discover knowledge in real complex domains

Author: David F. Nettleton
University: Universitat Politècnica de Catalunya
Advisor: Vicenç Torra
Year: 2002

This thesis is concerned with developing and refining a collection of methods and tools which can be applied to the different steps of the Data Mining process. Data Mining is understood as the analysis of data using sophisticated tools and methods, which include aspects of data representation, data exploration, knowledge discovery, data modelling and data aggregation. Data Mining can be applied in real and complex domains, such as clinical prognosis, as well as to artificial test or benchmark data. Medical informatics is a dynamic area where new approaches and techniques are constantly being developed, the objective being to improve current data representation, modelling and aggregation methods to achieve better diagnosis and prognosis. In this work we focus on two medical data domains: prognosis for ICU patients and diagnosis of Sleep Apnea cases, although it is proposed that the techniques are of general use for any data domain. Fuzzy logic techniques are a key approach used for data processing and representation. Existing techniques, such as neural networks, tree induction and standard statistical analysis methods (correlation, principal components and regression models), are benchmarked against the data.
We carry out a survey of existing techniques, authors and their approaches, in order to establish their strong and weak points, limitations, and opportunities where improvement may be achieved.
The first major area under consideration is data representation: how to define a unified scheme which encompasses different data types, such as numeric, continuous, ordered categorical, unordered categorical, binary and fuzzy; how to define membership functions; and how to measure differences and similarities in the data. This is followed by a comprehensive benchmarking of existing AI and statistical algorithms on a real ICU medical dataset, comparing the ‘Data Mining’ results with those of methods proposed by the author.
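As an illustration of what defining a membership function involves, the following minimal sketch computes the grade of membership of a numeric value in a trapezoidal fuzzy set. The trapezoidal shape, the function name `trapezoid` and the breakpoints in the usage lines are assumptions chosen for illustration; they are not taken from the thesis, which may define its membership functions differently.

```python
def trapezoid(x, a, b, c, d):
    """Grade of membership of x in a trapezoidal fuzzy set.

    a and d are the feet (membership 0), b and c the shoulders
    (membership 1), with a <= b <= c <= d. Illustrative only.
    """
    if b <= x <= c:
        return 1.0          # on the plateau
    if x <= a or x >= d:
        return 0.0          # outside the support
    if x < b:
        return (x - a) / (b - a)   # rising edge, a < x < b
    return (d - x) / (d - c)       # falling edge, c < x < d


# Hypothetical label "medium" on a 0-100 scale.
print(trapezoid(50, 20, 40, 60, 80))  # on the plateau
print(trapezoid(30, 20, 40, 60, 80))  # halfway up the rising edge
```

A value can belong to several such sets at once with different grades, which is what lets numeric, categorical and fuzzy attributes be placed on a common footing.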
We define ‘fuzzy covariance’ as a value which permits the measurement of the relation between two fuzzy variables. Previous fuzzy covariance work was limited to the covariance of a fuzzy cluster to its fuzzy prototype [Gustafson79]. More recent authors [Nakamori97][Wangh95][Watada94] have created specialised fuzzy covariance calculations tailored for specific applications. In this work, a general fuzzy covariance algorithm, which measures the fuzzy covariance between two fuzzy variables, has been conceived, developed and tested. The initial work, based on the Hartigan joining algorithm and fuzzy covariances, evolves into and is contrasted with the later work on data and attribute fusion using the WOWA (Weighted Ordered Weighted Averaging) aggregation operator.
‘Aggregation operators’ are considered as a method for modelling data for clinical diagnosis, and use ‘relevance’ and ‘reliability’ meta-data together with grades of membership to enhance the information which the aggregation operator receives in order to model the data. We also make enhancements to the WOWA operator to enable it to process data with missing values, and we develop a novel method for learning the weighting vectors.
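For readers unfamiliar with the WOWA operator, the sketch below shows the standard operator in its basic form, using piecewise-linear interpolation of the cumulative OWA weights. It is a minimal illustration under stated assumptions: the function name `wowa` and the sample inputs are hypothetical, the choice of linear interpolation is one common option, and the thesis's enhancements (missing-value handling and the method for learning the weighting vectors) are not reproduced here.

```python
def wowa(values, p, w):
    """Basic WOWA aggregation (illustrative sketch).

    values: the inputs to aggregate
    p: importance weights, one per input source (sum to 1)
    w: OWA weights, one per rank position (sum to 1)
    """
    n = len(values)
    if len(p) != n or len(w) != n:
        raise ValueError("values, p and w must have the same length")

    # Cumulative OWA weights at the points 0, 1/n, 2/n, ..., 1.
    cum_w = [0.0]
    for wi in w:
        cum_w.append(cum_w[-1] + wi)

    def w_star(x):
        # Piecewise-linear interpolation through (i/n, cum_w[i]).
        x = min(max(x, 0.0), 1.0)
        k = min(int(x * n), n - 1)
        frac = x * n - k
        return cum_w[k] + frac * (cum_w[k + 1] - cum_w[k])

    # Visit the inputs from largest to smallest value.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    total, acc = 0.0, 0.0
    for i in order:
        omega = w_star(acc + p[i]) - w_star(acc)  # weight for this input
        total += omega * values[i]
        acc += p[i]
    return total


# With uniform OWA weights, WOWA reduces to the weighted mean.
print(wowa([1.0, 2.0, 3.0], [0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3]))
```

In this form `p` weights each information source, while `w` weights the values by their rank. With uniform `p` the operator reduces to an ordinary OWA, and with uniform `w` it reduces to a weighted mean.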