
Selection of descriptors
A rigorously correct solution to descriptor selection requires a full search of the discrete descriptor space. Unfortunately, combinatorial explosion rules out a full search for real tasks. For example, a search for a 5-parameter correlation in a space of 1000 descriptors (realistic numbers for a typical search) would require testing over 8*10^{12} correlations against some criterion (usually the squared correlation coefficient). Modern machines can calculate one correlation every 0.0001-0.0002 seconds using highly optimized linear algebra libraries (CODESSA PRO uses an Intel MKL – LAPACK [3] clone) and highly optimized low-level code. Even at this level of optimization, the time required to solve the aforementioned task by full search is 8*10^{12}*0.0001 seconds (about 26 years). Because typical tasks cannot be solved in a reasonable amount of time by full search, simplified methods have been developed. Such methods of descriptor selection can be categorized as either deterministic or stochastic. Over years of research, many algorithms for non-full searches have been developed. The best-known deterministic algorithms [1] are forward entry and backward removal of effects (in our case, descriptors). The forward and backward stepwise search methods combine the entry and removal of effects at each step. Each of the methods mentioned above has many limitations, [2] most of which concern the absence of a consistent set of correlations (models) representing the upper segment of the search space. The best-subset methods (proposed in this work) are the next alternative to the full-search procedure, but possess such limitations.
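The full-search estimate above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python; the function name `full_search_cost` and the year length of 365.25 days are assumptions made for illustration and are not part of CODESSA PRO.

```python
import math

# Assumed year length (365.25 days), used only to convert seconds to years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def full_search_cost(n_descriptors, n_parameters, seconds_per_correlation):
    """Return (number of candidate correlations, estimated years of computation).

    Hypothetical helper: counts every unordered subset of n_parameters
    descriptors and multiplies by the time needed to evaluate one correlation.
    """
    n_correlations = math.comb(n_descriptors, n_parameters)
    years = n_correlations * seconds_per_correlation / SECONDS_PER_YEAR
    return n_correlations, years


if __name__ == "__main__":
    count, years = full_search_cost(1000, 5, 0.0001)
    print(f"{count:.2e} correlations, about {years:.0f} years")
```

For 1000 descriptors and 5 parameters the count is 8,250,291,250,200 subsets, which at 0.0001 seconds per correlation corresponds to roughly 26 years, in line with the figures quoted in the text.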
Two methods for reducing the full search procedure were utilized by our group: [4] the heuristic method and the best multilinear regression. The heuristic method for descriptor selection proceeds with a preselection of descriptors by sequentially eliminating descriptors that do not match any of the following criteria: (i) Fisher F-criterion greater than 1.0; (ii) R^{2} value less than a value defined at the start; (iii) Student's t-criterion less than a defined value; (iv) duplicate descriptors having a higher squared intercorrelation coefficient than a predetermined level (retaining the descriptor with the higher R^{2} with reference to the property). The descriptors that remain are then listed in decreasing order of correlation coefficient when used in a global search for 2-parameter correlations. Each 2-parameter correlation that is significant by the F-criterion is recursively expanded to an n-parameter correlation for as long as the normalized F-criterion remains greater than the startup value. The best N correlations by R^{2}, as well as by F-criterion, are saved. The best multilinear regression method is based on (i) the selection of orthogonal descriptor pairs and (ii) the extension of the correlation (saved in the previous step) with new descriptors until the F-criterion becomes less than that of the best 2-parameter correlation. The best N correlations (by R^{2}) are saved. Both methods successfully solve the initial selection problem by reducing the number of descriptor pairs in the "starting set". The major limitations of these methods are the pairwise selection in the first step and the low consistency with which the upper segment of the search space (according to the selected criteria) is represented, due to the small size of the correlation selection (N is 400 in both cases).
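Part of the preselection step described above can be sketched in Python. This is a minimal illustration, not the CODESSA PRO implementation: it covers only the R^{2} threshold (criterion ii, read here as a minimum R^{2} with the property) and the removal of highly intercorrelated duplicates (criterion iv). The F- and t-criteria are omitted, and the names `squared_correlation` and `preselect` and the default thresholds are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column cannot correlate with anything
    return (sxy * sxy) / (sxx * syy)


def preselect(descriptors, prop, r2_min=0.01, r2_inter_max=0.8):
    """Return the names of retained descriptors, ordered by decreasing R^2.

    descriptors: dict mapping descriptor name -> list of values
    prop: list of property values, same length as each descriptor column
    r2_min, r2_inter_max: illustrative thresholds (assumed, not from the source)
    """
    r2 = {name: squared_correlation(col, prop)
          for name, col in descriptors.items()}
    # Criterion (ii): drop descriptors whose R^2 with the property is too low,
    # then rank the survivors by decreasing R^2.
    ranked = sorted((name for name in r2 if r2[name] >= r2_min),
                    key=lambda name: r2[name], reverse=True)
    kept = []
    for name in ranked:
        # Criterion (iv): skip a descriptor that is highly intercorrelated with
        # one already kept. Because of the ranking, the kept descriptor is the
        # one with the higher R^2 with the property.
        if all(squared_correlation(descriptors[name], descriptors[k]) <= r2_inter_max
               for k in kept):
            kept.append(name)
    return kept
```

In this sketch the retained list is already in decreasing order of R^{2}, so it could serve as the starting set for a subsequent search for 2-parameter correlations.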
A review of stochastic methods, the genetic algorithm (GA) in particular, was recently published by Leardi. [5] The same author also published the first application of the genetic algorithm [6] and, in a review, mentioned the two major disadvantages of GA: the repeatability of the optimization and the unpredictable coverage of the search space. The repeatability problem is inherent in all stochastic methods by definition and makes them unacceptable for an industrial-strength system. The possibility of chance correlations is a disadvantage of all methods of effect (descriptor) selection, and it is surely the most important factor limiting the generalized and extensive use of GA. [7]
References:


University of Florida 2001. All rights reserved