The idea of simulated evaluation has been demonstrated as very powerful idea and was applied to construct a variety of effective calculation algorithms. In the book [Fog66] it is considered a simulated evaluation of population of finite automata and a number of applications to the problems of prediction and computer learning are described. Method of stochastic search with adaptation is considered in [Sla71]. It is applied to the problem of selection of the best set of features in regression analysis and pattern recognition. Simulated evaluation of regression models is the main idea of adaptive learning networks or group method of data handling [Iva71]. Implementation of this technique in APL is presented in [Skom93].
The works mentioned above use the general idea of simulated evaluation to solve some specific problems of data analysis or computer science. Genetic algorithms make a significant step forward and suggest an universal approach for a wide variety of computational problems.
Implementation of genetic algorithms in APL was considered in papers [Alf91] and [Gey95]. In this paper we try to introduce more application oriented approach. That means
The aim of genetic algorithms is searching an optimal alternatives or solutions among a set of given variants. That is quite general problem formulation. Lots of computer science and data analysis problems may be formulated this way. The following is set of some examples:
The problems we mentioned above are from different areas of optimization, regression analysis, pattern recognition, statistics, but may be solved in uniform manner using genetic algorithms approach.
The first, technical step of genetic algorithm application is encoding of searching space points or possible variants of the given problem solution. The character string of symbols from a given alphabet is used for that in general. To simplify further consideration we will assume that the alphabet has two symbols only. That is absolutely enough to solve wide variety of problems and allows to encode points of searching space as binary vectors of 0 and 1. Let us consider an examples.
In the problems of choosing the best subset of input factors for prediction an output or informative features in pattern recognition we have an universal of all possible factors, let say nine of them:
x1, x2, x3, x4, x5, x6, x7, x8, x9
Then binary vector
0 0 1 0 1 0 0 1 0
presents a variant when factors
x3, x5, and x8
are temporarily considered as the best subset.
Genetic algorithms require that the set of alternatives to be searched through be finite (as in an example above). If we want to apply them to an optimization problem where this requirement is not satisfied, we have to discretize the set involved and select an appropriate finite subset. That may be easy done in most cases. For instance if we are finding a maximum value of a continued function we may represent a range of function argument as a finite set of values given with some small step. In this case encoding may be binary representation of number of an argument value in sorted sequence.
The strings (binary vectors) which represent encoded alternatives are called chromosomes. And the symbols that form them (0 and 1 in our case) or items of binary vectors are called genes (see [Bio71] for analogy).
The natural question for now is what is the purpose of such encoding of alternatives? The main purpose is an easy and universal generation of new alternatives from a given ones. This generation is just changing, mostly at random, one or more genes of chromosome items of our binary vector. Then we should decode the new chromosome and get the new alternative of optimization procedure.
Genetic algorithms search for the best alternative (in the sense of a given fitness function) through chromosomes evolution.
Basic steps in genetic algorithms are the following.
The same steps of this process, evaluation and natural selection, are then applied to chromosomes of the resulting population. The whole process is repeated until given stopping criteria are met.
The solution in expressed by the best chromosomes in the final population.
The size of the population of chromosomes is usually kept constant during the execution of a genetic algorithm. When new members are added to the population, the corresponding number of old members are excluded.
Let us consider APL2 implementation of genetic algorithms together with simple numeric example, presented in [Fuz95].
Given function
f(x) = 2x - x² ÷ 16
on the interval [0,31] integer values of argument x. The problem is to find the maximum of the function in the given interval.
The following simple APL function performs calculation of f(x).
[0] Y„F_TEST X [1] © Test function to find maximum [2] Y„(2×X)-(X*2)÷16
The whole set of alternatives is 32 integer points: 0, 1,...,31. This case chromosome may be treated as binary representation of these points. Then all possible chromosomes are binary integers from 00000 through 11111.
The following functions are used to generate initial population of chromosomes of a given size M:
[0] P„INIT_TEST M;N [1] © M is constant population size (memory) [2] N„32 © Total N of chromosomes (alternatives) [3] © Generate starting population at random [4] P„N BINARY¨¯1+M?N
[0] B„MAX BINARY D [1] © Decimal D to binary B with number [2] © of digits defined by MAX - maximum possible D [3] B„((—2µMAX)½2)‚D
To generate a new chromosome we used mutation. Given a chromosome x = (x1, x2,..., xn) and an integer i, which is called mutation position, the operation of mutation replaces x with
x = (x1,..., xi-1, z, xi+1,..., xn)
where z is a randomly chosen gene from the gene pool G. This operation implemented as function MUTATION.
[0] CHR1„I MUTATION CHR0 [1] © Generates mutated chromosome changing I-th bit [2] © to random from GENE_POOL [3] CHR1„CHR0 [4] CHR1[I]„(?½GENE_POOL)ÞGENE_POOL
Global variable GENE_POOL is logical vector of 0,1 in our case.
Fitness of a given chromosome is calculated for the test using function:
[0] FIT„EVAL_TEST P [1] © Evaluate chromosome P in terms of its fitness [2] FIT„F_TEST 2ƒP
In line 2 chromosome is decoded to decimal value of argument x and then value of f(x) is calculated. As we are interested in finding maximum of this function we are maximizing the criterion.
The whole genetic algorithm implemented as APL2 operator GENETIC. Its operands are functions EVALUATE to calculate criteria and GENERATION to calculate next generation.
[0] R„GENS(GENERATION GENETIC EVALUATE)POP;POP1;FIT;FIT1;E;BEST [1] © Main operator for Genetic Algorithm [2] © POP - starting population, may be initialized while [3] © calling GENETIC using problem specific function [4] © GENS - number of generations (steps) of genetic algorithm [5] © EVALUATE - problem specific function to calculate fitness [6] © of each chromosome from population [7] © GENERATION - function to generate new population, includes [8] © type of mutation and approach of chromosomes replacement. [9] [10] ©--- Define parameters [11] M„½POP © Constant population size (memory) [12] R„¼0 © To keep learning results for analysis [13] [14] EVAL: [15] ©--- Evaluate each chromosome in terms of its fitness [16] FIT„¹EVALUATE¨POP © Fitness calculated for population [17] BEST„FIT¼—/FIT © Maximize criteria [18] R„R,›BESTœ¨POP FIT © Remeber results for this population [19] [20] …(0=GENS„GENS-1)/QUIT © Stopping criteria [21] [22] ©--- Produce a population of new chromosomes (next generation) [23] POP1„GENERATION POP [24] FIT1„¹EVALUATE¨POP1 [25] ((FIT1>FIT)/POP)„(FIT1>FIT)/POP1 © Replace if new are better [26] …EVAL [27] QUIT:POP_1„POP © Keep all last population for analysis
Let us now apply genetic algorithm to a given problem. At first we initialize starting population of chromosomes with size equal to 2. And then apply operator GENETIC with appropriate operands EVAL_TEST and NEXT_TEST and number of generation equal to 10 as a left argument:
P„INIT_TEST 2
REZ„10(NEXT_TEST GENETIC EVAL_TEST)P
The result of learning is written into variable REZ. The size of this variable is NG+1. Each item is vector of two elements the best chromosome from the generation and the value of the criterion. Lets see this variable:
œREZ
1 1 1 1 0 3.75
0 1 1 1 0 15.75
0 1 1 1 0 15.75
0 1 1 1 0 15.75
0 1 1 1 0 15.75
0 1 1 1 1 15.9375
1 0 0 0 0 16
1 0 0 0 0 16
1 0 0 0 0 16
1 0 0 0 0 16
1 0 0 0 0 16
We see that the best chromosome at the end of learning is binary vector 1 0 0 0 0, which is decimal 16 and the criterion is also equal to 16. It is maximum value of the function we investigated. So, the problem is solved.
In the papers [Skom94] and [Skom94] we gave a short description of automated mail warehouse for the packaging and shipment of financial forms. This modern technological system is operated at DATEV large Information Systems Consulting Company in Germany. The requests for processing financial forms are received from tax accountants, certified tax advisors, auditors and book keepers, attorneys, as well as major taxation, auditing and bookkeeping firms all over Germany.
In our previous work we concentrated on simulation, optimization and control of automated mail warehouse, technological process itself. That was an internal view to the system. The other important view is an external one, when the whole system is considered as a queueing system. Here the problem is to predict the queue of orders for the given day to make working plan. The purpose of this application is to develop a mathematical model of input queue and to use this model for input stream of orders prediction.
To predict daily amount of work we should take into consideration some dependencies and correlations which are known for today at quantitative level. The main features to be considered are the following:
It is clear that summer months of vocations are different from winter month, days of payment salary are different to the holidays and Monday or Tuesday are different from Saturday. The difficulties of estimation these dependences are that they are different for different clients of the system and they are different for different types of financial forms.
To maintain mentioned and some other dependences the data base was implemented on the mainframe computer of DATEV Computing Center. This data base retrieves records for each of approximately 70,000 of DATEV clients and takes about 1.5GB of hard disk. Data base is implemented in APL2 with using Associated Processor 12 to perform accessing files as arrays.
Each record keeps truck of statistic on number of orders from a given client for any type of financial form, separately for each day of week, day of month and month of year.
For now we have in data base records for September of 1995 only. This allows to see how monthly orders to DATEV are distributed.
For each consultant (client of the system) we calculated number of days when he ordered at least one application (type of financial form). And than we calculated how many clients ordered something for 0,1,...,25 days (number of working days and Saturdays in September).
The result was represented in variable DAYS_IDS numeric vector of 25 items. Value of item number k is equal to the total number of consultants, which ordered (at least one application) k+1 days in September.
Sum of elements of vector DAYS_IDS represents total number of consultants. And first item of DAYS_IDS represents number of clients which didnt order anything in September.
Currently it is
+/DAYS_IDS
69144
DAYS_IDS[1]
4391
All distribution is shown in Figure 1

Let us start from quantitative analysis of the distribution shown in Figure 1. At least two important things must be noticed after the first look at Figure 1.
First, distribution estimated from real world data without any kind of smoothing or approximation looks unexpectedly smooth. That may be truth if (a) we have large amount of statistical data and (b) there is strong dependence and real data are generated in accordance to this dependence.
The second peculiarity, we can easily see in Figure 1, are two well-distinguished groups of consultants. This peculiarity is much more important we may say that there are at least two different types of consultants and members of different groups have different behavior.
The first is group of consultants, which order from DATEV relatively seldom 1,2 or 3 days per month. This is group related to left part of distribution with maximum values consultants - about 9, 8 or 7 thousand of consultants for 1,2 or 3 values of monthly orders (days per month). Average two days per month looks very much like days of salary payment. These may be consultants that serve companies rather than individuals.
The second group is the active consultants, which order something from DATEV 22-23 days per month almost every working day and even Saturdays. This group related to right part of distribution shown in Figure 1, where the local maximum of distribution exists. This is probably group of consultants who works mostly with random and uniform stream of individuals.
The intermediate part of distribution in Figure 1 may be crossover of mentioned above two different distributions, or probably the third one.
The Binomial distribution looks being very appropriated for considered data. In terms of this distribution we assume that some consultant has the constant probability P to order at least one application per day. Then we may easy estimate N the average number of days per month when this consultant order something. For 30 days of September it is
N = P × 30
Moreover we may calculate probability of k any specific number of orders per month, including 0 (no orders):
k k 30-k
P(k) = C × P × (1-P)
N
where 1-P is probability of do not order from DATEV and CkN is number of cases of k things taken from N.
The first group of discussed above related to Binomial distribution with low level of probability P and looks like exponential one (see Figure 1). The second group is related to Binomial distribution with high value of P and looks like normal or Gaussian distribution (it is usual approximation for Binomial distributions with high P).
In terms of Binomial distribution one test interpreted as seeing if this consultant ordered something or not this day (like in classical model of probability theory with coins hurling and checking the side is above). Length of testing serial is number of days in month in our case.
For brief testing of our hypothesis we performed the following preliminary numeric analysis. From Figure 1 the mean values for two groups were assumed as 3 and 22. Then Binomial probabilities are:
3 ÷ 26=0.1153846154 and 22 ÷ 26=0.8461538462
Using these probabilities we calculated two distributions and presented mixture of distribution as their summation. The result is presented in Figure 2 in comparison to real distribution.

We may see that the main feature of data grouping is reproduced by a model. Meanwhile we have zero probability for the region from approximately 10 up to 15 orders per month. That tells that there are actually three groups of consultants and the third one has Binomial probability related to the mentioned region. One more thing should be improved is optimization of fractions of dividing the whole set of consultants into groups. In our preliminary analysis we assumed equal fractions.
We formulated the problem as a problem of separating mixture of three Binomial distributions and optimization of fractions values and Binomial probabilities for each group of clients. Totally 5 parameters were under optimization:
Average error of prediction measured in percents was chosen as fitness criteria. Learning process is illustrated in Figure 3. First the accuracy of prediction was increasing very quickly and after about 100 generations (steps of genetic algorithm work) it stabilized at the value of 32% of prediction accuracy.

In Figure 4 the real at predicted distributions are shown. It is clearly that the fitness is very high and the proposed model of monthly requests to DATEV (the model of 3 different groups of consultants) is close to reality and allows to perform good prediction.

Lets discuss in brief meaning of estimated parameters of the model. They are presented in the table.
Û Group Û Binomial Û Fraction Û Working Û Count Û Û number Û Probability Û Û days Û Saturdays Û --------------------------------------------------------- ÛGroup I Û Low 0.09 Û 60% Û 1.9 Û 2.3 Û ÛGroup II Û Middle 0.43 Û 13% Û 9.0 Û 11.2 Û ÛGroup III Û Low 0.82 & Û 27% Û 17.2 Û 21.3 Û
We see that the first group is 60% of all consultants, the second and the third are 9% and 43% accordingly. Having Binomial probabilities we may calculate average number of days when consultant from the given group ordered something from DATEV during N days of observance. This N may be number of working days (21 for September 1995) or number of working days plus Saturdays (26 for September 1995). These results are also represented in the table above.
Now we may see that the consultants from the Group I order about twice per month. And that should be days of salary payment. As we told above these are probably consultants that serve companies rather than individuals.
The Group III consultants are the most active and order something from DATEV almost every working day and even Saturday. This is probably group of consultants who works mostly with random and uniform stream of individuals.
The Group II is a middle (and smallest) one. Here the average activity is about half of working days.
The main benefit of discussed dependence and analysis results is improving prediction quality and simplifying prediction in some sense. Now we should classify each consultant to one of the three (or maybe later analysis show more) groups and formulate the individual rule (mostly fuzzy rules) for each group. For instance if some consultant belongs to the Group I and ordered twice per month only in salary days and it is NOT salary days tomorrow we may skip him or her with result of prediction equal to 0 requests.
The paper demonstrates benefits of using APL2 to implement and perform genetic algorithms. The main reasons for that are the following:
The core of the paper is real life application of genetic algorithm. The approach we used allowed solve complex problem which is difficult or impossible using traditional optimization technique. The resulting mathematical model gives possibility to predict amount of work in large queueing system with high accuracy.
Alexander O. Skomorokhov
Institute of Atomic Energetics
P.O.Box 5061, Obninsk-5
Kaluga Region249020
RUSSIA
phone: +7-08439-31463
e-mail: askom@APL2.obninsk.su