Clustering with finite data from semi-parametric mixture distributions
Yong Wang and Ian H. Witten
Computer Science Department, University of Waikato, New Zealand
Email: [email protected], [email protected]
Abstract

Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.
1 Introduction

A common practical problem is to fit an underlying statistical distribution to a sample. In some applications, this involves estimating the parameters of a single distribution function, e.g. the mean and variance of a normal distribution. In others, an appropriate mixture of elementary distributions must be found, e.g. a set of normal distributions, each with its own mean and variance. Among the many kinds of mixture distribution, one in particular is attracting increasing research attention because it has many practical applications: the semi-parametric mixture distribution.
A semi-parametric mixture distribution is one whose cumulative distribution function (CDF) has the form

$$F_G(x) = \int_\Theta F(x; \theta) \, dG(\theta), \qquad (1)$$

where $\theta \in \Theta$, the parameter space, and $x \in \mathcal{X}$, the sample space. This gives the CDF of the mixture distribution $F_G(x)$ in terms of two more elementary distributions: the component distribution $F(x; \theta)$, which is given, and the mixing distribution $G(\theta)$, which is unknown. The former has a single unknown parameter $\theta$, while the latter gives a CDF for $\theta$. For example, $F(x; \theta)$ might be the normal distribution with mean $\theta$ and unit variance, where $\theta$ is a random variable distributed according to $G(\theta)$.

The problem that we will address is the estimation of $G(\theta)$ from sampled data that are independent and identically distributed according to the unknown distribution $F_G(x)$. Once $G(\theta)$ has been obtained, it is a straightforward matter to obtain the mixture distribution.
The CDF $G(\theta)$ can be either continuous or discrete. In the latter case, $G(\theta)$ is composed of a number of mass points, say, $\theta_1, \ldots, \theta_k$ with masses $w_1, \ldots, w_k$ respectively, satisfying $\sum_{i=1}^{k} w_i = 1$. Then (1) can be re-written as

$$F_G(x) = \sum_{i=1}^{k} w_i F(x; \theta_i), \qquad (2)$$

each mass point providing a component, or cluster, in the mixture with the corresponding weight. If the number of components $k$ is finite and known a priori, the mixture distribution is called finite; otherwise it is treated as countably infinite. The qualifier "countably" is necessary to distinguish this case from the situation with continuous $G(\theta)$, which is also infinite.
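To make (2) concrete, here is a minimal Python sketch (ours, not part of the original paper) that evaluates the CDF of a discrete mixture whose component distribution is normal with mean $\theta_i$ and unit variance; the function name and example values are illustrative only.

```python
# Illustrative sketch of equation (2): F_G(x) = sum_i w_i F(x; theta_i)
# with a normal component distribution of mean theta_i and unit variance.
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, thetas, weights):
    """CDF of a discrete mixture; the weights are assumed to sum to one."""
    x = np.asarray(x, dtype=float)
    # One column per mass point theta_i, evaluated at every x.
    components = norm.cdf(x[:, None], loc=np.asarray(thetas), scale=1.0)
    return components @ np.asarray(weights)

# Two well-separated clusters with weights 0.8 and 0.2.
print(mixture_cdf([0.0, 50.0, 100.0], thetas=[0.0, 100.0], weights=[0.8, 0.2]))
```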
We will focus on the estimation of arbitrary mixing distributions, i.e., $G(\theta)$ may be any general probability distribution: finite, countably infinite or continuous. A few methods for tackling this problem can be found in the literature. However, as we shall see, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced.
This phenomenon seems to have been overlooked, presumably for three reasons: small amounts of data may be assumed to represent a small loss; a few data points can easily be dismissed as outliers; and in the limit the problem evaporates, because most estimators possess the property of strong consistency, which means that, almost surely, they converge weakly to any given $G(\theta)$ as the sample size approaches infinity. However, these reasons are often inappropriate: the loss function may be sensitive to the distance between clusters; the small number of outlying data points may actually represent small clusters; and any practical clustering situation will necessarily involve finite data.

This paper proposes a new method, based on the idea of local fitting, that successfully solves the problem. The experimental results presented below illustrate its superiority to existing methods when small clusters are present in finite data sets. Moreover, they also suggest that it is more accurate and stable than other methods even when there are no small clusters. Existing clustering methods for semi-parametric mixture distributions are briefly reviewed in the next section. Section 3 identifies a common problem from which these current methods suffer. Then we present the new solution, and in Section 5 we describe experiments that illustrate the problem that has been identified and show how the new method overcomes it.
2 Clustering methods

The general problem of inferring mixture models is treated extensively and in considerable depth in books by Titterington et al. (1985), McLachlan and Basford (1988) and Lindsay (1995). For semi-parametric mixture distributions there are three basic approaches: minimum distance, maximum likelihood, and Bayesian. We briefly introduce the first approach, which is the one adopted in this paper, review the other two to show why they are not suitable for arbitrary mixtures, and then return to the chosen approach and review the minimum distance estimators for arbitrary semi-parametric mixture distributions that have been described in the literature.
The idea of the minimum distance method is to define some measure of the goodness of the clustering and optimize this by suitable choice of a mixing distribution $G_n(\theta)$ for a sample of size $n$. We generally want the estimator to be strongly consistent as $n \to \infty$, in the sense defined above, for arbitrary mixing distributions. We also generally want to take advantage of the special structure of semi-parametric mixtures to come up with an efficient algorithmic solution.

The maximum likelihood approach maximizes the likelihood (or equivalently the log-likelihood) of the data by suitable choice of $G_n(\theta)$. It can in fact be viewed as a minimum distance method that uses the Kullback-Leibler distance (Titterington et al., 1985). This approach has been widely used for estimating finite mixtures, particularly when the number of clusters is fairly small, and it is generally accepted to be more accurate than other methods. However, it has not been used to estimate arbitrary semi-parametric mixtures, presumably because of its high computational cost. Its speed drops dramatically as the number of parameters that must be determined increases, which makes it computationally infeasible for arbitrary mixtures, since each data point might represent a component of the final distribution with its own parameters.
Bayesian methods assume prior knowledge, often given by some kind of heuristic, to determine a suitable a priori probability density function. They are often used to determine the number of components in the final distribution, particularly when outliers are present. Like the maximum likelihood approach, they are computationally expensive, for they use the same computational techniques.

We now review existing minimum distance estimators
for arbitrary semi-parametric mixture distributions. We begin with some notation. Let $x_1, \ldots, x_n$ be a sample chosen according to the mixture distribution, and suppose (without loss of generality) that the sequence is ordered so that $x_1 \le x_2 \le \ldots \le x_n$. Let $G_n(\theta)$ be a discrete estimator of the underlying mixing distribution with a set of support points at $\{\theta_{nj};\ j = 1, \ldots, k_n\}$. Each $\theta_{nj}$ provides a component of the final clustering with weight $w_{nj} \ge 0$, where $\sum_{j=1}^{k_n} w_{nj} = 1$. Given the support points, obtaining $G_n(\theta)$ is equivalent to computing the weight vector $w_n = (w_{n1}, w_{n2}, \ldots, w_{nk_n})'$. Denote by $F_{G_n}(x)$ the estimated mixture CDF with respect to $G_n(\theta)$.
Two minimum distance estimators were proposed in the late 1960s. Choi and Bulgren (1968) used

$$\frac{1}{n} \sum_{i=1}^{n} \left[ F_{G_n}(x_i) - i/n \right]^2 \qquad (3)$$

as the distance measure. Minimizing this quantity with respect to $G_n$ yields a strongly consistent estimator. A slight improvement is obtained by using the Cramér-von Mises statistic

$$\frac{1}{n} \sum_{i=1}^{n} \left[ F_{G_n}(x_i) - (i - 1/2)/n \right]^2 + \frac{1}{12 n^2}, \qquad (4)$$

which essentially replaces $i/n$ in (3) with $(i - \frac{1}{2})/n$ without affecting the asymptotic result. As might be expected, this reduces the bias in small-sample cases, as was demonstrated empirically by Macdonald (1971) in a note on Choi and Bulgren's paper.
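As an illustration, the following sketch (ours) shows how the distance measures (3) and (4) could be computed for a candidate mixing distribution with fixed support points; it assumes the sample is sorted and reuses the hypothetical mixture_cdf helper from Section 1.

```python
# Sketch of the distance measures (3) and (4) for a candidate mixing
# distribution G_n, given a sorted sample and the mixture_cdf helper above.
import numpy as np

def choi_bulgren_distance(x_sorted, thetas, weights):
    """Quantity (3): (1/n) * sum_i [F_{G_n}(x_i) - i/n]^2."""
    n = len(x_sorted)
    F = mixture_cdf(x_sorted, thetas, weights)
    i = np.arange(1, n + 1)
    return np.mean((F - i / n) ** 2)

def cramer_von_mises_distance(x_sorted, thetas, weights):
    """Quantity (4): replaces i/n by (i - 1/2)/n and adds 1/(12 n^2)."""
    n = len(x_sorted)
    F = mixture_cdf(x_sorted, thetas, weights)
    i = np.arange(1, n + 1)
    return np.mean((F - (i - 0.5) / n) ** 2) + 1.0 / (12 * n * n)
```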
At about the same time, Deely and Kruse (1968) used the sup-norm associated with the Kolmogorov-Smirnov test. The minimization is over

$$\sup_{1 \le i \le n} \left\{ \left| F_{G_n}(x_i) - (i-1)/n \right|, \; \left| F_{G_n}(x_i) - i/n \right| \right\}, \qquad (5)$$

and this leads to a linear programming problem. Deely and Kruse also established the strong consistency of their estimator $G_n$.
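To make the linear-programming connection concrete, here is a sketch (ours, not Deely and Kruse's formulation verbatim) that minimizes the sup-norm (5) over the weights of $G_n$ with the support points held fixed, assuming the unit-variance normal component used earlier; an auxiliary variable t bounds every absolute deviation.

```python
# Sketch: minimizing the sup-norm (5) over the weights of G_n as a linear
# program. Decision variables are the weights w_1..w_k and the bound t.
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def deely_kruse_weights(x_sorted, thetas):
    n, k = len(x_sorted), len(thetas)
    # F[i, j] = F(x_i; theta_j) for the unit-variance normal component.
    F = norm.cdf(np.asarray(x_sorted)[:, None], np.asarray(thetas), 1.0)
    i = np.arange(1, n + 1)
    targets = np.concatenate([(i - 1) / n, i / n])
    FF = np.vstack([F, F])
    # |F w - target| <= t becomes two linear inequalities per row.
    A_ub = np.block([[FF, -np.ones((2 * n, 1))],
                     [-FF, -np.ones((2 * n, 1))]])
    b_ub = np.concatenate([targets, -targets])
    A_eq = np.concatenate([np.ones(k), [0.0]])[None, :]   # sum(w) = 1
    c = np.concatenate([np.zeros(k), [1.0]])              # minimize t
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (k + 1))
    return res.x[:k]
```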
Ten years later, this approach was extended by Blum and Susarla (1977), who use any sequence $\{f_n\}$ of functions that satisfies $\sup |f_n - f_G| \to 0$ a.s. as $n \to \infty$. Each $f_n$ can, for example, be obtained by a kernel-based density estimator. Blum and Susarla approximated the function $f_n$ by the overall mixture pdf $f_{G_n}$, and established the strong consistency of the estimator $G_n$ under weak conditions.

For reasons of simplicity and generality, we will denote the approximation between two mathematical entities of the same type by $\cong$, which implies the minimization, with respect to an estimator, of a distance measure between the entities on either side. The types of entity involved in this paper include vectors, functions and measures, and we use the same symbol $\cong$ for each.
In the work reviewed above, two kinds of estimator are used: CDF-based (Choi and Bulgren; Macdonald; Deely and Kruse) and pdf-based (Blum and Susarla). CDF-based estimators involve approximating an empirical distribution with an estimated one $F_{G_n}$. We write this as

$$F_{G_n} \cong F_n, \qquad (6)$$

where $F_n$ is the Kolmogorov empirical CDF, or indeed any empirical CDF that converges to it. Pdf-based estimators involve the approximation between probability density functions:

$$f_{G_n} \cong f_n, \qquad (7)$$

where $f_{G_n}$ is the estimated mixture pdf and $f_n$ is the empirical pdf described above.

The entities involved in (6) and (7) are functions. When the approximation is computed, however, it is computed between vectors that represent the functions. These vectors contain the function values at a particular set of points, which we call "fitting points." In the work reviewed above, the fitting points are chosen to be the data points themselves.

3 The problem of minority clusters
Although they perform well asymptotically, all the minimum distance methods described above suffer from the finite-sample problem discussed earlier: they can neglect small groups of outlying data points no matter how far they lie from the dominant data points. The underlying reason is that the objective function to be minimized is defined globally rather than locally. A global approach means that the value of the estimated probability density function at a particular place will be influenced by all data points, no matter how far away they are. This can cause small groups of data points to be ignored even if they are a long way from the dominant part of the data sample. From a probabilistic point of view, however, there is no reason to subsume distant groups within the major clusters just because they are relatively small.
The ultimate effect of suppressing distant minority clusters depends on how the clustering is applied. If the application's loss function depends on the distance between clusters, the result may prove disastrous, because there is no limit to how far away these outlying groups may be. One might argue that small groups of points can easily be explained away as outliers, because the effect will become less important as the number of data points increases, and it will disappear in the limit of infinite data. However, in a finite-data situation (and all practical applications necessarily involve finite data), the "outliers" may equally well represent small minority clusters. Furthermore, outlying data points are not really treated as outliers by these methods: whether or not they are discarded is merely an artifact of the global fitting calculation. When clustering, the final mixture distribution should take all data points into account, including outlying clusters if any exist. If practical applications demand that small outlying clusters be suppressed, this should be done in a separate stage.
In distance-based clustering, each data point has a far-reaching effect because of two global constraints. One is the use of the cumulative distribution function; the other is the normalization constraint $\sum_{j=1}^{k_n} w_{nj} = 1$. These constraints may sacrifice a small number of data points, at any distance, for a better overall fit to the data as a whole. Choi and Bulgren (1968), the Cramér-von Mises statistic (Macdonald, 1971), and Deely and Kruse (1968) all enforce both the CDF and the normalization constraints. Blum and Susarla (1977) drop the CDF, but still enforce the normalization constraint. The result is that these clustering methods are only appropriate for finite mixtures without small clusters, where the risk of suppressing clusters is low.

This paper addresses the general problem of arbitrary mixtures. Of course, the minority cluster problem exists for all types of mixture, including finite mixtures. Even here, the maximum likelihood and Bayesian approaches do not solve the problem, because they both introduce a global normalization constraint.
4 Solving the minority cluster problem

Now that the source of the problem has been identified, the solution is clear, at least in principle: drop both the approximation of CDFs, as Blum and Susarla (1977) do, and the normalization constraint, no matter how seductive it may seem.

Let $G'_n$ be a discrete function with masses $\{w_{nj}\}$ at $\{\theta_{nj}\}$; note that we do not require the $w_{nj}$ to sum to one. Since the new method operates in terms of measures rather than distribution functions, the notion of approximation is altered to use intervals rather than points. Using the formulation described in Section 2, we have

$$P_{G'_n} \cong P_n, \qquad (8)$$

where $P_{G'_n}$ is the estimated measure and $P_n$ is the empirical measure. The intervals over which the approximation takes place are called "fitting intervals." Since (8) is not subject to the normalization constraint, $G'_n$ is not a CDF and $P_{G'_n}$ is not a probability measure. However, $G'_n$ can easily be converted into a CDF estimator by normalizing it after equation (8) has been solved.
To define the estimation procedure fully, we need to determine (a) the set of support points, (b) the set of fitting intervals, (c) the empirical measure, and (d) the distance measure. Here we discuss these in an intuitive manner; Wang and Witten (1999) show how to determine them in a way that guarantees a strongly consistent estimator.
Support points. The support points are usually suggested by the data points in the sample. For example, if the component distribution $F(x; \theta)$ is the normal distribution with mean $\theta$ and unit variance, each data point can be taken as a support point. In fact, the support points are more accurately described as potential support points, because their associated weights may become zero after solving (8); in practice, many often do.

Fitting intervals. The fitting intervals are also suggested by the data points. In the normal distribution example, each data point $x_i$ can provide one interval, such as $[x_i - 3\sigma, x_i]$, or two, such as $[x_i - 3\sigma, x_i]$ and $[x_i, x_i + 3\sigma]$, or more. There is no problem if the fitting intervals overlap. Their length should not be so large that points can exert an influence on the clustering at an unduly remote place, nor so small that the empirical measure is inaccurate. The experiments reported below use intervals of a few standard deviations around each data point, and, as we will see, this works well.

Empirical measure. The empirical measure can be the probability measure determined by the Kolmogorov empirical CDF, or any measure that converges to it. The fitting intervals discussed above can be open, closed, or semi-open. This will affect the empirical measure if data points are used as interval boundaries, although it does not change the values of the estimated measure because the corresponding distribution is continuous. In small-sample situations, bias can be reduced by careful attention to this detail, as Macdonald (1971) discusses with respect to Choi and Bulgren's (1968) method.

Distance measure. The choice of distance measure determines what kind of mathematical programming problem must be solved. For example, a quadratic distance will give rise to a least squares problem under linear constraints, whereas the sup-norm gives rise to a linear programming problem that can be solved using the simplex method. These two measures have efficient solutions that are globally optimal.
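Putting these four choices together, the following sketch shows one way the quadratic-distance version of (8) could be set up and solved as a non-negative least squares problem. It is a minimal illustration under the choices described above (every data point a potential support point, two fitting intervals of width $3\sigma$ per point, boundary points counted as half), not the authors' implementation; the function name and defaults are ours.

```python
# Minimal sketch of the local-fitting estimator (8) under a quadratic
# distance: build the matrix of component-interval probabilities, the
# empirical measure of each fitting interval, and solve for w >= 0.
import numpy as np
from scipy.stats import norm
from scipy.optimize import nnls

def estimate_mixing_weights(x, sigma=1.0):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    supports = x                                        # potential support points
    intervals = [(xi - 3 * sigma, xi) for xi in x] + \
                [(xi, xi + 3 * sigma) for xi in x]      # fitting intervals

    # Empirical measure of each interval: fraction of data points inside,
    # with points on a boundary counted as half a point.
    P_n = np.array([(np.sum((x > a) & (x < b)) +
                     0.5 * np.sum((x == a) | (x == b))) / n
                    for a, b in intervals])

    # A[k, j] = probability of component N(theta_j, sigma^2) on interval k.
    A = np.array([norm.cdf(b, supports, sigma) - norm.cdf(a, supports, sigma)
                  for a, b in intervals])

    w, _ = nnls(A, P_n)           # unnormalized weights; many become zero
    return supports, w / w.sum()  # normalize afterwards to obtain a CDF
```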
It is worth pointing out that abandoning the global constraints associated with both CDFs and normalization brings with it a computational advantage. In vector form, we write $P_{G'_n} = A_{G'_n} w_n$, where $w_n$ is the (unnormalized) weight vector and each element of the matrix $A_{G'_n}$ is the probability value of a component distribution over a fitting interval. Then, provided the support points corresponding to $w'_n$ and $w''_n$ lie outside each other's sphere of influence as determined by the component distributions $F(x; \theta)$, the estimation procedure becomes

$$\begin{pmatrix} A'_{G'_n} & 0 \\ 0 & A''_{G'_n} \end{pmatrix} \begin{pmatrix} w'_n \\ w''_n \end{pmatrix} \cong \begin{pmatrix} P'_n \\ P''_n \end{pmatrix}, \qquad (9)$$

subject to $w'_n \ge 0$ and $w''_n \ge 0$. This is the same as combining the solutions of two sub-equations, $A'_{G'_n} w'_n \cong P'_n$ subject to $w'_n \ge 0$, and $A''_{G'_n} w''_n \cong P''_n$ subject to $w''_n \ge 0$. If the relevant support points continue to lie outside each other's spheres of influence, the sub-equations can be further partitioned. This implies that when data points are sufficiently far apart, the mixing distribution $G$ can be estimated by grouping data points in different regions. Moreover, the solution in each region can be normalized separately before the solutions are combined, which yields a better estimate of the mixing distribution.
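The decomposition can be sketched as follows (ours, with an arbitrary $6\sigma$ gap threshold): split the sample wherever neighbouring points are far apart, estimate each region separately with the hypothetical estimate_mixing_weights routine above, and normalize each region's solution in proportion to the number of points it contains before combining.

```python
# Sketch of the decomposition suggested by (9): well-separated groups of
# data points are estimated separately and normalized before combining.
import numpy as np

def estimate_by_regions(x, sigma=1.0, gap=6.0):
    x = np.sort(np.asarray(x, dtype=float))
    # Split wherever consecutive points are more than gap * sigma apart.
    breaks = np.where(np.diff(x) > gap * sigma)[0] + 1
    supports, weights = [], []
    for group in np.split(x, breaks):
        s, w = estimate_mixing_weights(group, sigma)   # sketch above
        supports.append(s)
        weights.append(w * len(group) / len(x))        # per-region normalization
    return np.concatenate(supports), np.concatenate(weights)
```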
If the normalization constraint $\sum_{j=1}^{k_n} w_{nj} = 1$ is retained when estimating the mixing distribution, the estimation procedure becomes

$$P_{G_n} \cong P_n, \qquad (10)$$

where the estimator $G_n$ is a discrete CDF on $\Theta$. This constraint is necessary for the left-hand side of (10) to be a probability measure. Although he did not develop an operational estimation scheme, Barbe (1998) suggested exploiting the fact that the empirical probability measure is approximated by the estimated probability measure, but he retained the normalization constraint. As noted above, relaxing the constraint loosens the throttling effect of large clusters on small groups of outliers, and our experimental results show that the estimator obtained from (10), which retains it, still suffers from the drawback noted earlier.
Both estimators, $G_n$ obtained from (10) and $G'_n$ from (8), have been shown to be strongly consistent under weak conditions similar to those used by others (Wang & Witten, 1999). Of course, the weak convergence of $G'_n$ is in the sense of general functions, not CDFs. The strong consistency of $G'_n$ immediately implies the strong consistency of the CDF estimator obtained by normalizing $G'_n$.
 5 Experimental validation
We have conducted experiments to illustrate the failure of existing methods to detect small outlying clusters, and the improvement achieved by the new scheme. The results also suggest that the new method is more accurate and stable than the others.

When comparing clustering methods, it is not always easy to evaluate the clusters obtained. To finesse this problem we consider simple artificial situations in which the proper outcome is clear. Some practical applications of clustering do provide objective evaluation functions; however, these are beyond the scope of this paper.
The methods used are Choi and Bulgren's (1968) (denoted choi), Macdonald's application of the Cramér-von Mises statistic (cramér), the new method with the normalization constraint (test), and the new method without that constraint (new). In each case, equations involving non-negativity and/or linear equality constraints are solved as quadratic programming problems using the elegant and efficient procedures nnls and lsei provided by Lawson and Hanson (1974). All four methods have the same computational time complexity.

We set the sample size $n$ to 100 throughout the experiments. The data points are artificially generated from a mixture of two clusters: $n_1$ points from $N(0, 1)$ and $n_2$ points from $N(100, 1)$. The values of $n_1$ and $n_2$ are in the ratios 99:1, 97:3, 93:7, 80:20 and 50:50. Every data point is taken as a potential support point in all four methods: thus the number of potential components in the clustering is 100. For test and new, fitting intervals need to be determined. In the experiments, each data point $x_i$ provides the two fitting intervals $[x_i - 3, x_i]$ and $[x_i, x_i + 3]$. Any data point located on the boundary of an interval is counted as half a point when determining the empirical measure over that interval.
These choices are admittedly crude, and further improvements in the accuracy and speed of test and new are possible that take advantage of the flexibility provided by (10) and (8). For example, accuracy will likely increase with more, and more carefully chosen, support points and fitting intervals. The fact that the method performs well even with crudely chosen support points and fitting intervals testifies to its robustness.
Our primary interest in this experiment is the weights of the clusters that are found. To cast the results in terms of the underlying models, we use the cluster weights to estimate values for $n_1$ and $n_2$. Of course, the results often do not contain exactly two clusters, but because the underlying cluster centres, 0 and 100, are well separated compared to their standard deviation of 1, it is highly unlikely that any data points from one cluster will fall anywhere near the other. Thus we use a threshold of 50 to divide the clusters into two groups: those near 0 and those near 100. The final cluster weights are normalized, and the weights for the first group are summed to obtain an estimate $\hat{n}_1$ of $n_1$, while those for the second group are summed to give an estimate $\hat{n}_2$ of $n_2$.
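For illustration, a single run of this evaluation for the new method might look as follows (ours; the random seed is arbitrary and estimate_mixing_weights is the hypothetical sketch from Section 4).

```python
# Sketch of the evaluation: generate n1 points from N(0,1) and n2 from
# N(100,1), estimate the mixing weights, then split the normalized weights
# at the threshold 50 to obtain n1_hat and n2_hat.
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for illustration
n1, n2 = 97, 3
x = np.concatenate([rng.normal(0.0, 1.0, n1), rng.normal(100.0, 1.0, n2)])

supports, weights = estimate_mixing_weights(x)   # sketch from Section 4
n = n1 + n2
n1_hat = n * weights[supports < 50].sum()
n2_hat = n * weights[supports >= 50].sum()
print(n1_hat, n2_hat)
```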
Table 1 shows results for each of the four methods. Each cell represents one hundred separate experimental runs. Three figures are recorded. At the top is the number of times the method failed to detect the smaller cluster, that is, the number of times $\hat{n}_2 = 0$. In the middle are the average values of $\hat{n}_1$ and $\hat{n}_2$. At the bottom is the standard deviation of $\hat{n}_1$ and $\hat{n}_2$ (which are equal). These three figures can be thought of as measures of reliability, accuracy and stability respectively.
The top figures in Table 1 show clearly that only new is always reliable, in the sense that it never fails to detect the smaller cluster. The other methods fail mostly when $n_2 = 1$; their failure rate gradually decreases as $n_2$ grows. The centre figures show that, under all conditions, new gives a more accurate estimate of the correct values of $n_1$ and $n_2$ than the other methods. As expected, cramér shows a noticeable improvement over choi, but it is very minor. The test method has lower failure rates and produces estimates that are more accurate and far more stable (indicated by the bottom figures) than those for choi and cramér, presumably because it is less constrained. Of the four methods, new is clearly and consistently the winner in terms of all three measures: reliability, accuracy and stability.

Table 1: Experimental results for detecting small clusters

                        n1 = 99    n1 = 97    n1 = 93    n1 = 80    n1 = 50
                        n2 = 1     n2 = 3     n2 = 7     n2 = 20    n2 = 50
choi     Failures       86         42         4          0          0
         n̂1 / n̂2        99.9/0.1   99.2/0.8   95.8/4.2   82.0/18.0  50.6/49.4
         SD(n̂1)         0.36       0.98       1.71       1.77       1.30
cramér   Failures       80         31         1          0          0
         n̂1 / n̂2        99.8/0.2   98.6/1.4   95.1/4.9   81.6/18.4  49.7/50.3
         SD(n̂1)         0.50       1.13       1.89       1.80       1.31
test     Failures       52         5          0          0          0
         n̂1 / n̂2        99.8/0.2   98.2/1.8   94.1/5.9   80.8/19.2  50.1/49.9
         SD(n̂1)         0.32       0.83       0.87       0.78       0.55
new      Failures       0          0          0          0          0
         n̂1 / n̂2        99.0/1.0   96.9/3.1   92.8/7.2   79.9/20.1  50.1/49.9
         SD(n̂1)         0.01       0.16       0.19       0.34       0.41
The results of the new method can be further improved. If the decomposed form (9) is used instead of (8), and the solutions of the sub-equations are normalized before combining them, which is feasible because the two underlying clusters are so distant from each other, the correct values are obtained for $\hat{n}_1$ and $\hat{n}_2$ in virtually every trial.

6 Conclusions
We have identified a shortcoming of existing clustering methods for arbitrary semi-parametric mixture distributions: they fail to detect very small clusters reliably. This is a significant weakness when the minority clusters are far from the dominant ones and the loss function takes account of the distance of misclustered points.

We have described a new clustering method for arbitrary semi-parametric mixture distributions, and shown experimentally that it overcomes the problem. Furthermore, the experiments suggest that the new estimator is more accurate and more stable than existing ones.
References

Barbe, P. (1998). Statistical analysis of mixtures and the empirical probability measure. Acta Applicandae Mathematicae, 50(3), 253-340.

Blum, J. R. & Susarla, V. (1977). Estimation of a mixing distribution function. Ann. Probab., 5, 200-209.

Choi, K. & Bulgren, W. B. (1968). An estimation procedure for mixtures of distributions. J. R. Statist. Soc. B, 30, 444-460.

Deely, J. J. & Kruse, R. L. (1968). Construction of sequences estimating the mixing distribution. Ann. Math. Statist., 39, 286-288.

Lawson, C. L. & Hanson, R. J. (1974). Solving Least Squares Problems. Prentice-Hall.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications, Volume 5 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics: Hayward, CA.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren. J. R. Statist. Soc. B, 33, 326-329.

McLachlan, G. & Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons.

Wang, Y. & Witten, I. H. (1999). The estimation of mixing distributions by approximating empirical measures. Technical Report (in preparation), Department of Computer Science, University of Waikato, New Zealand.