
## Selection of descriptors

A rigorously correct solution to descriptor selection requires a full search of the discrete descriptor space. Unfortunately, combinatorial explosion does not allow the application of a full search procedure to real tasks. For example, if we search for a 5-parameter correlation over a space of 1000 descriptors (realistic numbers for a typical search), we would have to test over 8×10^12 correlations against some criterion (usually the squared correlation coefficient). Modern machines have achieved sufficient productivity to calculate one correlation every 0.0001-0.0002 seconds using highly optimized linear algebra libraries (CODESSA PRO uses Intel MKL, a LAPACK [3] clone) and highly optimized low-level code. Even with this level of optimization, the time required to solve the aforementioned task by a full search procedure is 8×10^12 × 0.0001 seconds, or about 26 years.
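The count above can be reproduced directly: the number of 5-descriptor subsets of a 1000-descriptor pool is the binomial coefficient C(1000, 5). A minimal sketch (the 0.0001 s per regression figure is taken from the text):

```python
import math

# Number of distinct 5-descriptor subsets from a pool of 1000 descriptors.
n_subsets = math.comb(1000, 5)
print(n_subsets)  # 8_250_291_250_200, i.e. over 8*10^12

# At roughly 0.0001 s per regression, an exhaustive search takes decades.
seconds = n_subsets * 1e-4
years = seconds / (365.25 * 24 * 3600)
print(round(years, 1))  # about 26 years
```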

Because it is impossible to solve typical tasks in a reasonable amount of time using a full search, simplified methods have been developed. Such methods of descriptor selection can be categorized as either deterministic or stochastic.

Over years of research, many algorithms for non-exhaustive searches have been developed. The best-known deterministic algorithms [1] are forward entry and backward removal of effects (in our case, descriptors). Forward and backward stepwise searches combine the entry and removal of effects at each step. Each of these methods has many limitations, [2] the majority of which concern the absence of a consistent set of correlations (models) representing the upper segment of the search space. The best-subset methods (proposed in this work) are the next alternative to the full-search procedure, but do not possess such limitations.
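The forward-entry strategy mentioned above can be sketched in a few lines: at each step, greedily add the descriptor that most improves the fit. This is an illustrative sketch, not the implementation used in the cited work; the function names and the use of R^2 as the entry criterion are assumptions for the example.

```python
import numpy as np

def r_squared(X, y):
    """Squared correlation coefficient of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_selection(X, y, n_params):
    """Greedy forward entry: repeatedly add the descriptor that most
    improves R^2 of the multilinear correlation."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_params):
        best = max(remaining,
                   key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With n descriptors and p parameters this tests only on the order of n·p correlations instead of C(n, p), which is exactly the trade-off the deterministic methods make: speed at the cost of possibly missing the globally best subset.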

Two methods for reducing the full search procedure have been utilized by our group: [4] the heuristic method and the best multilinear regression method.

The heuristic method of descriptor selection begins with a pre-selection that sequentially eliminates descriptors meeting any of the following criteria: (i) Fisher F-criterion below 1.0; (ii) R2 value below a value defined at the start; (iii) Student's t-criterion below a defined value; (iv) squared intercorrelation coefficient with an already retained descriptor above a predetermined level (the descriptor with the higher R2 with respect to the property is retained). The remaining descriptors are then listed in decreasing order of their correlation coefficients and used in a global search for 2-parameter correlations. Each 2-parameter correlation significant by the F-criterion is recursively extended to an n-parameter correlation as long as the normalized F-criterion remains greater than the startup value. The best N correlations by R2, as well as by F-criterion, are saved.
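The pre-selection filters can be sketched as follows. This is a simplified illustration of criteria (ii) and (iv) only, not the CODESSA PRO implementation; the function name and threshold values are assumptions for the example.

```python
import numpy as np

def preselect(X, y, r2_min=0.01, r2_intercorr_max=0.9):
    """Sketch of heuristic pre-selection: drop descriptors with a
    negligible one-parameter R^2 against the property, then drop the
    weaker member of each highly intercorrelated pair."""
    n = X.shape[1]
    # One-parameter R^2 of each descriptor against the property.
    r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(n)])
    keep = [j for j in range(n) if r2[j] >= r2_min]
    # Visit descriptors in decreasing order of R^2, so that when two are
    # intercorrelated the one with the higher R^2 is retained.
    keep.sort(key=lambda j: -r2[j])
    result = []
    for j in keep:
        if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 < r2_intercorr_max
               for k in result):
            result.append(j)
    return sorted(result)
```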

The best multilinear regression method is based on (i) the selection of orthogonal descriptor pairs and (ii) the extension of the correlations saved at the previous step with new descriptors until the F-criterion becomes less than that of the best 2-parameter correlation. The best N correlations (by R2) are saved.
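Both methods use the F-criterion as a stopping rule. For an OLS correlation it takes the standard form below (the normalized variant mentioned in the text may differ; this sketch shows only the textbook statistic):

```python
def f_criterion(r2, n_points, n_params):
    """Standard F statistic of a multilinear correlation: explained vs.
    residual variance, each per degree of freedom."""
    return (r2 / n_params) / ((1.0 - r2) / (n_points - n_params - 1))

# Example: R^2 = 0.5 over 22 data points with one descriptor gives F = 20.
print(f_criterion(0.5, 22, 1))
```

Adding a descriptor always raises R2 but costs a degree of freedom, so F eventually drops; stopping when F falls below the 2-parameter benchmark is what keeps the extension step finite.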

Both methods successfully solve the initial selection problem by reducing the number of descriptor pairs in the "starting set". Their major limitations are the pairwise selection at the first step and the poor consistency with which the upper segment of the search space (according to the selected criteria) is represented, owing to the small size of the correlation selection (N = 400 in both cases).

A review of the stochastic methods, the genetic algorithm (GA) in particular, was recently published by Leardi. [5] The same author also published the first application of the genetic algorithm to feature selection [6] and, in the review, mentioned two major disadvantages of GA: the limited repeatability of the optimization and the unpredictable coverage of the search space. The repeatability problem affects all stochastic methods by definition and is therefore unacceptable for an industrial-strength system. The possibility of chance correlations is a disadvantage of all methods of effect (descriptor) selection, and it is surely the most important factor limiting the generalized and extensive use of GA. [7]
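For contrast with the deterministic methods, a minimal GA for descriptor selection can be sketched as below. This is a toy illustration, not any published implementation; all parameters (population size, mutation rate, fitness = R^2 of the selected subset) are assumptions for the example. Note that the result depends on the random seed, which is precisely the repeatability problem discussed above.

```python
import numpy as np

def r2(X, y, mask):
    """R^2 of an OLS fit on the descriptor columns selected by the mask."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def ga_select(X, y, n_gen=15, pop_size=20, p_mut=0.05, seed=0):
    """Toy GA over descriptor bit masks; elitism keeps the best individual."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.2          # sparse initial masks
    for _ in range(n_gen):
        fit = np.array([r2(X, y, m) for m in pop])
        elite = pop[np.argmax(fit)].copy()
        # Tournament selection of parents.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[a] > fit[b])[:, None], pop[a], pop[b])
        # Uniform crossover between consecutive parents, then mutation.
        cross = rng.random((pop_size, n)) < 0.5
        children = np.where(cross, parents, np.roll(parents, 1, axis=0))
        pop = children ^ (rng.random((pop_size, n)) < p_mut)
        pop[0] = elite                              # elitism
    fit = np.array([r2(X, y, m) for m in pop])
    return pop[np.argmax(fit)]
```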

References:

1. Darlington, R. B. Regression and linear models. New York: McGraw-Hill, 1990.

2. Neter, J.; Wasserman, W.; Kutner, M. H. Applied linear regression models (2nd ed.). Homewood, IL: Irwin, 1989.

3. Anderson, E.; Bai, Z.; Bischof, C.; Demmel, J.; Dongarra, J.; Ducroz, J.; Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D. LAPACK: A portable linear algebra library for high-performance computers. Computer Science Dept. Technical Report CS-90-105, University of Tennessee: Knoxville, TN, 1990.

4. Katritzky, A. R.; Lobanov, V.; Karelson, M. CODESSA Reference Manual. University of Florida: Gainesville, FL, 1996.

5. Leardi, R. Genetic algorithm in chemometrics and chemistry: a review. J. Chemometrics 2001, 15, 559-569.

6. Leardi, R.; Boggia, R.; Terrile, M. Genetic algorithm as a strategy for feature selection, J. Chemometrics 1992, 6, 267-281.

7. Leardi, R.; Gonzalez, A. L. Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometr. Intell. Lab. 1998, 41, 195-207.