Journal of Computational and Applied Mathematics 84 (1997) 207-217

A comparison of some estimators of the mixture proportion of mixed normal distributions

M.C. Pardo
University School of Statistics, Complutense University of Madrid, 28040 Madrid, Spain

Received 29 October 1996; received in revised form 16 June 1997

This work was supported by Grant DGICYT PB94-0308.

Abstract

Fisher's method of maximum likelihood breaks down when applied to the problem of estimating the five parameters of a mixture of two normal densities from a continuous random sample of size n. Alternative methods based on minimum-distance estimation by grouping the underlying variable are proposed. Simulation results compare the efficiency as well as the robustness under symmetric departures from component normality of these estimators. Our results indicate that the estimator based on Rao's divergence is better than other classical ones.

Keywords: Minimum-distance estimator; Simulation; Relative efficiency

AMS classification: 62F10; 62F35

1. Introduction

Distributions which result from the mixing of two or more component distributions are designated as "compound" or "mixed" distributions. Such distributions arise in a wide variety of practical situations, ranging from distributions of wind velocities to distributions of physical dimensions of various mass-produced items. The moment solution to the problem of estimating the five parameters of an arbitrary mixture of two unspecified normal densities was studied as early as 1894 by Karl Pearson [19]. Yet, despite the fact that many random phenomena have subsequently been shown to follow this distribution, it is only recently that the estimation problem has been seriously reconsidered. Hasselblad [12] seems to have been the first to reopen the question. Since then, the problem has also attracted the attention of Cohen [7], who shows how the computation of Pearson's moment method can be lightened to some extent. Maximum-likelihood estimates computed with all the information available are reported to be the best under all circumstances; however, they plainly misbehave in estimating mixed distributions because the likelihood function is not bounded in this case (Le Cam [15]). Hence the maximum-likelihood procedure with the original data cannot be universally recommended. Day [9] and Behboodian [2] therefore find an appropriate local maximum of the likelihood function by using iterative techniques. Fryer and Robertson [10] compare the moment estimates with the multinomial maximum-likelihood and minimum chi-square estimates obtained by grouping the underlying variable. They show that the grouped estimates are more accurate than the moment estimates for most distributions. Recently, Woodward et al. [24, 25] have carried out an interesting comparison between the maximum-likelihood estimator and the minimum-distance estimators based on the Cramér-von Mises and Hellinger distances, respectively.
In this paper we examine the use of minimum-distance estimation based on the Burbea and Rao divergence [6] (the $MR_{\phi_a}E$) as an alternative to maximum-likelihood (ML) estimation, grouping the underlying variable in both cases, for the estimation of the parameters of the mixture density
$$f(x) = p f_1(x) + (1-p) f_2(x)$$
when the component distributions in the simulated samples are normal and when they are not. There is no doubt that the choice of the number of classes, $M$, used to group the underlying variable is an important question. However, in this paper it is not so important, since we only compare estimators obtained by grouping the underlying variable; we are therefore interested in studying the behavior of the estimators under the same conditions. In fact, there are no papers on mixtures of normal distributions which study the choice of $M$. Fryer and Robertson [10] said, "The method of grouping was dictated to a large extent by practical considerations, and it is not claimed that the groupings are in any sense optimal". We share that feeling. In any case, suppose that we need the estimates in order to construct a goodness-of-fit test. There are many papers that study the problem of choosing cells in this situation. One alternative is to use the same partition to estimate the parameters as to test the null hypothesis. This choice is guided by two considerations: the power of the resulting test, and the desire to use the asymptotic distribution of the statistic as an approximation to the exact distribution for sample size $n$. Mann and Wald [16] initiated the study of the choice of cells in the Pearson test of fit to a continuous distribution. They recommended, first, that the cells be chosen to have equal probabilities under the hypothesized distribution. The advantages of such a choice for the Pearson tests are (1) unbiasedness, (2) maximal power, and (3) empirical studies have shown that the asymptotic distribution of these statistics is then a more accurate approximation to the exact distribution. Mann and Wald then made recommendations on the number $M$ of equiprobable cells to be used. They found that for a sample of size $n$ (large) and significance level $\alpha$, one should use approximately
$$M = 4\left(\frac{2n^2}{c(\alpha)^2}\right)^{1/5},$$
where $c(\alpha)$ is the upper $\alpha$-point of the standard normal distribution. Retracing the Mann-Wald calculations using better approximations, as in Schorr [22], confirms that the optimum $M$ is smaller than this value; he recommended using $M = 2n^{2/5}$. Another alternative is to consider different values of $M$ and to calculate the corresponding estimator for each one. The best $M$ would be the one corresponding to the estimator with the smallest bias and mean-squared error.

In Section 2 we provide background material on the minimum $R_{\phi_a}$-divergence estimator ($MR_{\phi_a}E$). In Section 3 we carry out a simulation study comparing the ML estimator (MLE), the minimum chi-square estimator (MCSE) and the $MR_{\phi_a}E$ with different values of $a$ for grouped data.

2. The minimum Burbea and Rao distance estimator

Consider the probability densities $f_\theta(x)$ with respect to a $\sigma$-finite measure $\mu$ on the statistical space $(\mathcal{X}, \beta_{\mathcal{X}}, P_\theta)_{\theta \in \Theta \subset \mathbb{R}^{M_0}}$ and a partition $\{A_1, \ldots, A_M\}$ of $\mathcal{X}$. Then the formula $P_\theta(A_i) = q_i(\theta)$, $i = 1, \ldots, M$, defines a discrete statistical model. Let $X_1, \ldots, X_n$ be a random sample drawn from this population and let $\hat{p}_i = n_i/n$ be the relative frequency of $A_i$, $i = 1, \ldots, M$.
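For concreteness, the grouping step and the two recommendations for $M$ mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the helper names, the use of NumPy/SciPy and the choice of cut points are assumptions made only for exposition.

```python
# A minimal sketch of the grouping step (not from the paper): the helper names,
# the NumPy/SciPy dependencies and the cut points are illustrative assumptions.
import numpy as np
from scipy.stats import norm


def mann_wald_M(n, alpha=0.05):
    """Mann-Wald recommendation M = 4 * (2 n^2 / c(alpha)^2)^(1/5)."""
    c = norm.ppf(1.0 - alpha)                      # upper alpha-point of N(0, 1)
    return int(round(4.0 * (2.0 * n**2 / c**2) ** 0.2))


def schorr_M(n):
    """Schorr's smaller recommendation M = 2 n^(2/5)."""
    return int(round(2.0 * n ** 0.4))


def mixture_cdf(x, p, mu1, sigma1, mu2, sigma2):
    """CDF of the two-component normal mixture p*F1 + (1-p)*F2."""
    return p * norm.cdf(x, mu1, sigma1) + (1.0 - p) * norm.cdf(x, mu2, sigma2)


def group(sample, cuts):
    """Relative frequencies p_hat_i of the M cells defined by M-1 cut points."""
    idx = np.searchsorted(np.asarray(cuts), sample, side="right")
    counts = np.bincount(idx, minlength=len(cuts) + 1)
    return counts / len(sample)


def cell_probs(theta, cuts):
    """Model cell probabilities q_i(theta), theta = (p, mu1, sigma1, mu2, sigma2)."""
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    return np.diff(mixture_cdf(edges, *theta))
```

With the cells fixed, the vector $\hat P = (\hat p_1, \ldots, \hat p_M)^t$ and the model probabilities $Q(\theta)$ computed in this way are all that the grouped estimators discussed below require.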
If we wish to estimate $\theta$ by the maximum-likelihood method we have to maximize, for fixed $(n_1, \ldots, n_M)$,
$$P_\theta(N_1 = n_1, \ldots, N_M = n_M) = \frac{n!}{n_1! \cdots n_M!}\, q_1(\theta)^{n_1} \cdots q_M(\theta)^{n_M},$$
so that
$$\log P_\theta(N_1 = n_1, \ldots, N_M = n_M) = -n\, D_{\mathrm{Kullback}}(\hat P, Q(\theta)) + k,$$
where $\hat P = (\hat p_1, \ldots, \hat p_M)^t$, $Q(\theta) = (q_1(\theta), \ldots, q_M(\theta))^t$, $D_{\mathrm{Kullback}}$ is the Kullback divergence [14] and $k$ is a quantity that does not depend on $\theta$. Hence, estimating $\theta$ by the maximum-likelihood estimator of the discrete model is equivalent to minimizing the Kullback divergence over $\theta \in \Theta \subset \mathbb{R}^{M_0}$. The Kullback divergence is not the only divergence measure, however, so we can choose as estimator of $\theta$ the value $\hat\theta$ which verifies
$$D(\hat P, Q(\hat\theta)) = \inf_{\theta \in \Theta} D(\hat P, Q(\theta)),$$
$D$ being any divergence measure. Depending on the divergence measure chosen, different estimators are obtained. On the one hand, if
$$D(\hat P, Q(\theta)) = n \sum_{i=1}^{M} \frac{(\hat p_i - q_i(\theta))^2}{q_i(\theta)},$$
the corresponding $\hat\theta$ is the well-known minimum chi-square estimator, studied in this context by Fryer and Robertson [10]. On the other hand, if we consider the Burbea and Rao divergence [6],
$$R_{\phi_a}(P, Q) = H_{\phi_a}\!\left(\frac{P+Q}{2}\right) - \frac{H_{\phi_a}(P) + H_{\phi_a}(Q)}{2},$$
where
$$H_{\phi_a}(P) = \begin{cases} \displaystyle\sum_{i=1}^{M} \frac{p_i^{a} - p_i}{1-a}, & a \neq 1,\\[2mm] -\displaystyle\sum_{i=1}^{M} p_i \ln p_i, & a = 1,\end{cases}$$
is the entropy of degree $a$ due to Havrda and Charvát [13], the corresponding $\hat\theta$ will be called the minimum $R_{\phi_a}$-divergence estimator. Rao [21] used the family of $\phi_a$-entropies to measure genetic diversity between populations. In the particular case $a = 2$ we obtain the Gini-Simpson index. This measure of entropy was introduced by Gini [11] and by Simpson [23] in biometry, and its properties have been studied by various authors (Bhargava and Doyle [3], Bhargava and Uppuluri [4], Agresti and Agresti [1]). Note that if we consider the Gini-Simpson index, then the associated $R_{\phi_2}$-divergence is proportional to the square of the Euclidean distance,
$$R_{\phi_2}(P, Q) = \frac{1}{4}\sum_{i=1}^{M}(p_i - q_i)^2.$$

In order to solve the problem of estimating the mixture proportion of mixed normal distributions, we define the minimum $R_{\phi_a}$-divergence estimator in a convenient way. The following definition was given in Pardo [17].

Definition 1. Suppose that $n$ observations are drawn at random and with replacement from a population with statistical space $(\mathcal{X}, \beta_{\mathcal{X}}, P_\theta)_{\theta \in \Theta \subset \mathbb{R}^{M_0}}$. The minimum $R_{\phi_a}$-divergence estimator of $\theta$ is any $\hat\theta_a \in \Theta$ which verifies
$$R_{\phi_a}(\hat P, Q(\hat\theta_a)) = \inf_{\theta \in \Theta \subset \mathbb{R}^{M_0}} R_{\phi_a}(\hat P, Q(\theta)),$$
where $\hat P$ is the relative-frequency vector. So the minimum $R_{\phi_a}$-divergence estimator will be
$$\hat\theta_a = \arg \inf_{\theta \in \Theta \subset \mathbb{R}^{M_0}} R_{\phi_a}(\hat P, Q(\theta)).$$

The importance of the family of divergence measures considered in the previous definition can be seen in the aforementioned paper of Burbea and Rao [6]. For example, a surprising result is the fact that the $R_{\phi_a}$-divergence is convex on $\Delta_M \times \Delta_M$, where $\Delta_M = \{(p_1, \ldots, p_M)^t : \sum_{i=1}^{M} p_i = 1,\ p_i \ge 0,\ i = 1, \ldots, M\}$, if and only if $a \in [1,2]$ for $M > 2$, and if and only if $a \in [1,2]$ or $a \in [3, 11/2]$ for $M = 2$. This establishes the range of $a$ for which this measure is useful in practical applications. Some important properties of this divergence family can be seen in Pardo and Vajda [18].
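Definition 1 can be turned into a small numerical routine. The sketch below is again only an assumed illustration: it reuses `group` and `cell_probs` from the previous sketch, treats the component means and scales as known so that only the mixing proportion $p$ is estimated, and minimizes the Burbea-Rao divergence with SciPy's bounded scalar optimizer.

```python
# Assumed sketch of the minimum R_phi_a-divergence estimator of the mixing
# proportion p for grouped data; group() and cell_probs() are the helpers
# defined in the previous sketch.
import numpy as np
from scipy.optimize import minimize_scalar


def havrda_charvat(P, a):
    """phi_a-entropy of Havrda and Charvat [13]: sum((p_i^a - p_i)/(1-a)), a != 1."""
    P = np.asarray(P, dtype=float)
    if a == 1.0:
        Ppos = P[P > 0]
        return -np.sum(Ppos * np.log(Ppos))    # Shannon entropy as the a -> 1 limit
    return np.sum(P ** a - P) / (1.0 - a)


def burbea_rao(P, Q, a):
    """Burbea-Rao R_phi_a divergence: the Jensen difference of the phi_a-entropy [6]."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return havrda_charvat((P + Q) / 2.0, a) - 0.5 * (havrda_charvat(P, a) + havrda_charvat(Q, a))


def min_rphi_proportion(sample, cuts, a, mu1, sigma1, mu2, sigma2):
    """Minimum R_phi_a-divergence estimate of p, the other four parameters being known."""
    P_hat = group(sample, cuts)                # relative frequencies p_hat_i
    obj = lambda p: burbea_rao(P_hat, cell_probs((p, mu1, sigma1, mu2, sigma2), cuts), a)
    res = minimize_scalar(obj, bounds=(1e-4, 1.0 - 1e-4), method="bounded")
    return res.x
```

Replacing `burbea_rao` by the chi-square objective given above yields the MCSE within the same framework.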
Throughout, we assume that the model is correct and that $M_0 < M - 1$. Furthermore, we restrict ourselves to unknown parameters $\theta^0$ satisfying the regularity conditions of Birch [5], which are necessary to prove that the MLE for grouped data is asymptotically normally distributed. Consider
$$A(\theta) = \mathrm{diag}\left(Q(\theta)^{(a/2)-1}\right) J(\theta),$$
where $J(\theta) = (J_{jk}(\theta))_{j=1,\ldots,M;\ k=1,\ldots,M_0}$ is an $M \times M_0$ Jacobian matrix with $J_{jk}(\theta) = \partial q_j(\theta)/\partial \theta_k$. Then, assuming that the function $Q : \Theta \to \Delta_M$ has continuous second partial derivatives in a neighborhood of $\theta^0$, the following asymptotic properties were shown in Pardo [17]:

(1) $\hat\theta_a = \theta^0 + (A(\theta^0)^t A(\theta^0))^{-1} A(\theta^0)^t\, \mathrm{diag}\left(Q(\theta^0)^{(a/2)-1}\right)(\hat P - Q(\theta^0)) + o(\|\hat P - Q(\theta^0)\|)$, where $\hat\theta_a$ is unique in a neighborhood of $\theta^0$.

(2) $\sqrt{n}\,(\hat\theta_a - \theta^0) \xrightarrow{\ L\ } N(0, \Sigma)$, where $\Sigma = B(\theta^0)\left(\mathrm{diag}(Q(\theta^0)) - Q(\theta^0) Q(\theta^0)^t\right) B(\theta^0)^t$ with $B(\theta^0) = (A(\theta^0)^t A(\theta^0))^{-1} A(\theta^0)^t\, \mathrm{diag}\left(Q(\theta^0)^{(a/2)-1}\right)$.

(3) $Q(\hat\theta_a)$ is a $\sqrt{n}$-consistent estimator of $Q(\theta^0)$, i.e., $\sqrt{n}\,\|Q(\hat\theta_a) - Q(\theta^0)\| = O_p(1)$.

Remark 1. We note that if we consider the $R_{\phi_1}$-divergence, or equivalently let $a \to 1$, we get
$$\hat\theta_1 = \theta^0 + (A(\theta^0)^t A(\theta^0))^{-1} A(\theta^0)^t\, \mathrm{diag}\left(Q(\theta^0)^{-1/2}\right)(\hat P - Q(\theta^0)) + o(\|\hat P - Q(\theta^0)\|),$$
where $A(\theta) = \mathrm{diag}\left(Q(\theta^0)^{-1/2}\right) J(\theta)$, and
$$\sqrt{n}\,(\hat\theta_1 - \theta^0) \xrightarrow{\ L\ } N(0, I(\theta^0)^{-1}),$$
$I(\theta^0)$ being the Fisher information matrix. So the $\hat\theta_1$ estimator is a BAN (best asymptotically normal) estimator.

In the following section we present a simulation study in order to assess the behavior of our estimator.

3. Simulation results

In this section we report the results of simulations designed to compare empirically the ML, minimum chi-square (MCS) and minimum $R_{\phi_a}$-divergence estimations, for different values of $a$, of the parameters of a mixture of normals, analyzing both their efficiency and their robustness. Simulations reported in this section are based on mixing proportions 0.25, 0.5 and 0.75. For each of these mixing proportions we first considered mixtures of the densities $f_1(x)$ and $f_2(x)$, where $f_1(x)$ is the density of the random variable $X = aY$ and $f_2(x)$ is the density associated with $X = Y + b$, where $a > 0$, $b > 0$ and the distribution of $Y$ is normal. Secondly, we took $Y$ to be a Student's $t$ with two or four degrees of freedom, or double exponential, in order to study the robustness under symmetric departures from component normality. Thus, "$a$" is the ratio of scale parameters, which we take to be 1 and $\sqrt{2}$, while "$b$" was selected to provide the desired overlap between the two distributions.
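The sampling scheme just described can be mimicked as follows. This is an assumed Python sketch of the design (component 1 distributed as $aY$, component 2 as $Y + b$), not the author's simulation code, and the function name and random-generator choices are ours.

```python
# Assumed sketch of the sampling design described above: component 1 is a*Y,
# component 2 is Y + b, with Y standard normal, Student's t or double exponential.
import numpy as np


def draw_mixture(n, p, a, b, component="normal", df=4, rng=None):
    """Draw n observations from p*f1 + (1-p)*f2, f1 the density of a*Y, f2 of Y + b."""
    rng = np.random.default_rng() if rng is None else rng
    if component == "normal":
        Y = rng.standard_normal(2 * n)
    elif component == "t":
        Y = rng.standard_t(df, size=2 * n)
    elif component == "double-exponential":
        Y = rng.laplace(0.0, 1.0, size=2 * n)
    else:
        raise ValueError("unknown component distribution")
    from_1 = rng.random(n) < p                      # component-membership indicator
    return np.where(from_1, a * Y[:n], Y[n:] + b)   # a*Y for comp. 1, Y + b for comp. 2


# e.g. a sample of size 100 with p = 0.25, scale ratio sqrt(2) and an arbitrary shift b:
# x = draw_mixture(100, 0.25, np.sqrt(2.0), 2.0, component="t", df=4)
```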
We considered the "overlap" [24] as the probability of misclassification using the following rule: classify an observation $x$ as coming from population 1 if $p f_1(x) \ge (1-p) f_2(x)$.

In Table 1 we display, for normal components, the relative efficiencies of the MCSE and the $MR_{\phi_a}$E (for the different values of $a$ considered) with respect to the MLE, i.e.,
$$\widehat{e}(\hat p) = \frac{\widehat{\mathrm{MSE}}(\hat p_{\mathrm{MLE}})}{\widehat{\mathrm{MSE}}(\hat p)}.$$

[Table 1: estimated relative efficiencies, normal components. Table 2: estimated relative efficiencies, $t(2)$, $t(4)$ and double-exponential components.]

Analyzing the results of Table 1, we find that the estimated bias and the $\widehat{\mathrm{MSE}}$ associated with the $MR_{\phi_a}$E are generally smaller than those for the MLE and the MCSE.

In Table 2 we display the results for the nonnormal components. In the case of double-exponential components it is not clear which estimator is best because, in general, the MCSE has smaller bias than the others but the MLE has smaller $\widehat{\mathrm{MSE}}$. However, in the case of Student's $t$ components it is very clear that the $MR_{\phi_a}$E is the best: it has smaller bias and $\widehat{\mathrm{MSE}}$ than the others. Its superiority is even clearer for $t(2)$ components, i.e., when the departure from normality is more extreme; in this setting the performance of the MLE and MCSE deteriorates further with respect to that of the $MR_{\phi_a}$E.

Although our emphasis here has been on the estimation of the mixing proportion, the estimation routines used here obtain estimates of all five of the parameters. So it is natural to ask whether the results shown for $p$ are similar for the rest of the parameters $\mu_1$, $\sigma_1$, $\mu_2$ and $\sigma_2$. In Table 3 we display empirical relative efficiencies for all the parameters for normal and $t(4)$ mixtures. From the table we see that the results for the other parameters also exhibit patterns similar to those shown in Tables 1 and 2, i.e., the $MR_{\phi_a}$E is a very attractive alternative to both the MLE and the MCSE.
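For completeness, the Monte Carlo summaries discussed above (estimated bias, $\widehat{\mathrm{MSE}}$ and relative efficiency) can be obtained from replicated estimates as in the following assumed sketch, which takes the relative efficiency to be the ratio $\widehat{\mathrm{MSE}}(\hat p_{\mathrm{MLE}})/\widehat{\mathrm{MSE}}(\hat p)$ used above.

```python
# Assumed sketch of the Monte Carlo summaries: estimated bias, MSE and the
# relative efficiency of an estimator with respect to the MLE.
import numpy as np


def bias_mse(estimates, p_true):
    """Estimated bias and mean-squared error over the replications."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - p_true
    mse = np.mean((estimates - p_true) ** 2)
    return bias, mse


def relative_efficiency(estimates, mle_estimates, p_true):
    """MSE of the MLE divided by the MSE of the competing estimator."""
    _, mse_est = bias_mse(estimates, p_true)
    _, mse_mle = bias_mse(mle_estimates, p_true)
    return mse_mle / mse_est
```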
4. Concluding remarks

Our results indicate that the $MR_{\phi_a}$E is better than the MLE and the MCSE at the true model and under Student's $t$ components, while all of them perform comparably under double-exponential components. As would be expected, the performance of the estimators declines as the overlap between the two components increases.

References

[1] A. Agresti, B.F. Agresti, Statistical analysis of qualitative variation, in: K.F. Schuessler (Ed.), Sociological Methodology, 1978, pp. 204-237.
[2] J. Behboodian, On a mixture of normal distributions, Biometrika 57 (1970) 215-217.
[3] T.N. Bhargava, P.H. Doyle, A geometric study of diversity, J. Theoret. Biol. 43 (1974) 241-251.
[4] T.N. Bhargava, V.R.R. Uppuluri, On an axiomatic derivation of Gini diversity, with applications, Metron 30-VI (1975) 1-13.
[5] M.W. Birch, A new proof of the Pearson-Fisher theorem, Ann. Math. Statist. 35 (1964) 817-824.
[6] J. Burbea, C.R. Rao, On the convexity of some divergence measures based on entropy functions, IEEE Trans. Inform. Theory 28 (1982) 489-495.
[7] A.C. Cohen, Estimation in mixtures of two normal distributions, Technometrics 9 (1967) 15-28.
[8] I. Csiszár, A class of measures of informativity of observation channels, Period. Math. Hungar. 2 (1972) 191-213.
[9] N.E. Day, Estimating the components of a mixture of normal distributions, Biometrika 56 (3) (1969) 463-474.
[10] J.G. Fryer, C.A. Robertson, A comparison of some methods for estimating mixed normal distributions, Biometrika 59 (3) (1972) 639-648.
[11] C. Gini, Variabilità e mutabilità, Studi Economico-Giuridici della Facoltà di Giurisprudenza dell'Università di Cagliari, a. III, Parte II (1912).
[12] V. Hasselblad, Estimation of parameters for a mixture of normal distributions, Technometrics 8 (1966) 431-434.
[13] J. Havrda, F. Charvát, Quantification method of classification processes: concept of structural a-entropy, Kybernetika 3 (1967) 30-35.
[14] S. Kullback, R. Leibler, On information and sufficiency, Ann. Math. Statist. 22 (1951) 79-86.
[15] L. Le Cam, Maximum likelihood: an introduction, Internat. Statist. Rev. 58 (2) (1990) 153-171.
[16] H.B. Mann, A. Wald, On the choice of the number of class intervals in the application of the chi-square test, Ann. Math. Statist. 13 (1942) 306-317.
[17] M.C. Pardo, Asymptotic behaviour of an estimator based on Rao's divergence, Kybernetika 33 (1997).
[18] M.C. Pardo, I. Vajda, About distances of discrete distributions satisfying the data processing theorem of information theory, IEEE Trans. Inform. Theory 43 (4) (1997) 1288-1293.
[19] K. Pearson, Contributions to the mathematical theory of evolution, Philos. Trans. Roy. Soc. Ser. A 185 (1894) 71-110.
[20] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philos. Mag. 50 (1900) 157-172.
[21] C.R. Rao, Diversity and dissimilarity coefficients: a unified approach, Theoret. Pop. Biol. 21 (1982) 24-43.
[22] A.R. Schorr, On the choice of the class intervals in the application of the chi-square test, Math. Oper. Forsch. u. Statist. 5 (1974) 357-377.
[23] E.H. Simpson, Measurement of diversity, Nature 163 (1949) 688.
[24] W.A. Woodward, W.C. Parr, W.R. Schucany, H. Lindsey, A comparison of minimum distance and maximum likelihood estimation of a mixture proportion, J. Amer. Statist. Assoc. 79 (1984) 590-598.
[25] W.A. Woodward, P. Whitney, P.W. Eslinger, Minimum Hellinger distance estimation of mixture proportions, J. Statist. Plann. Inference 48 (1995) 303-319.