Adapted G-mode Clustering Method applied to Asteroid Taxonomy

The original G-mode was a clustering method developed by A. I. Gavrishin in the late 60’s for geochemical classification of rocks, but was also applied to asteroid photometry, cosmic rays, lunar sample and planetary science spectroscopy data. In this work, we used an adapted version to classify the asteroid photometry from SDSS Moving Objects Catalog. The method works by identifying normal distributions in a multidimensional space of variables. The identification starts by locating a set of points with smallest mutual distance in the sample, which is a problem when data is not planar. Here we present a modified version of the G-mode algorithm, which was previously written in FORTRAN 77, in Python 2.7 and using NumPy, SciPy and Matplotlib packages. The NumPy was used for array and matrix manipulation and Matplotlib for plot control. The Scipy had a import role in speeding up G-mode, Scipy.spatial.distance.mahalanobis was chosen as distance estimator and Numpy.histogramdd was applied to find the initial seeds from which clusters are going to evolve. Scipy was also used to quickly produce dendrograms showing the distances among clusters. Finally, results for Asteroids Taxonomy and tests for different sample sizes and implementations are presented.


Introduction
The clusters are identified using the G-mode multivariate clustering method, designed by A. I. Gavrishin and published in Russia in the late 60's [Cor76]. The algorithm was originally written in FORTRAN V by A. Coradini in the 70's [Cor77] to classify geochemical samples [Cor76,Bia80], but is also applicable to a wide range of astrophysical fields, as Small Solar System Bodies [Bar87,Bir96,Ful08,Per10], disk-resolved remote sensing [Pos80,Tos05,Cor08,Ley10,Tos10], cosmic rays [Gio81] and quasars [Cor83]. In 1987, Bar87 used original Gmode implementation to classify measurements of asteroids made by the Eight-Color Asteroid Survey [Zel85] and IRAS geometric albedos [Mat86] to produce a taxonomic scheme. Using a sample of 442 asteroids with 8 variables, they recognized 18 classes using a confidence level of 97.7%. Those classes were grouped to represent the asteroid taxonomic types. G-mode also identified that just 3 variables were enough to characterize the asteroid taxonomy.
The G-mode classifies N elements into Nc unimodal clusters containing Na elements each. Elements are described by M variables. This method is unsupervised, which allows an automatic identification of clusters without any a priori knowledge of sample distribution. For that, user must control only one critical parameter for the classification, the confidence levels q 1 or its corresponding critical value G q1 . Smaller this parameter get, more clusters are resolved and smaller their spreads are.
So, we chose this method to classify the asteroid observations from Sloan Digital Sky Moving Object Catalog, the largest data set on photometry containing around 400,000 moving object entries, due to its previous success on asteroid taxonomy, unsupervision and lower number of input parameters. However, we were aware the computational limitation we were going to face, since the method never was applied to samples larger than 10,000 elements [Ley10] and its last implementation was outdated. Therefore, the G-mode used here follows an adapted version of the original method published by Gav92, briefly described by Ful00 and reviewed by Tos05 . Median central tendency and absolute deviation estimators, a faster initial seed finder and statistical whitening were introduced to produce a more robust set of clusters and optimize the processing time. The coding was performed using Python 2.7 with support of Matplotlib, NumPy and SciPy packages * . The algorithm can be briefly summarized by two parts: the first one is the cluster recognition and the second evaluates each variable in the classification process. Each one is going to be described in the following sections.

Recognition Of The Unimodal Clusters
The first procedure can be summarized by the following topics and code snippets: • The data is arranged in N X M matrix. All variables are Scipy.cluster.vq.whiten , which means they are divided by their absolute deviation to scale all them up. This is a important measure when dealing with percentage variables, such as geometric albedos.
• Initial seed of a forming cluster is identified. At the original implementation, the G-mode relied on a bruteforce algorithm to find the three closest elements as initial seed, which required long processing time. Therefore, in this version, the initial seeds are searched recursively using Numpy.histogramdd , which speeds up the output: The function above divides the variable hyperspace into large sectors, and the initial seed is searched for only in the most crowded sector. Recursively, the most crowded sector is once divided as long as the density increases. When density decreases or the minimal number of points set by the user is reached, the procedure stops. The initial seed is chosen from the elements of the most crowded sector. In the end, starting central tendency µ i and standard deviation σ i are estimated from the initial seed. If any standard deviation is zero, the value is replaced by the median uncertainty of the variable.
• Z² criterion. In the next step, the Mahalanobis distance (Scipy.spatial.distance.mahalanobis) between the tested cluster and all elements are computed: where χ j is the jth element and S is covariance matrix of the tested cluster.
• Hypothesis Testing. The Z² estimator follows a χ 2 distribution, but for sake of simplification, Z² can be transformed to Gaussian estimator G if the degree of freedom f is large enough, which is satisfied for most of samples. Now, the critical value G q1 in hypothesis testing are given as multiples of σ , simplifying its interpretation. Therefore, the vectorized transformation [Abr72] can be written: while the elements of the vector degree of freedom are given by: for f k > 100 , where r 2 ks is the correlation coefficient. For 30 < f k < 100 , the G parameter becomes: Then the null hypothesis χ i j = µ i is tested with a statistical significance level of P(G j ≤ G q 1 , f ) (P, probability) for a χ j to belong to a tested class, i.e., a class contains the χ j element if its estimator G j satisfies G j ≤ G q 1 .
• µ i and σ i are redefined on each iteration. The iteration is executed until the Na and correlation matrix R converge to stable values. Once the first unimodal cluster is formed, its members are removed from sample and the above procedure is applied again until all the sample is depleted, no more initial seeds are located or the condition N > M-1 is not satisfied anymore. If a initial seed fails to produce a cluster, its elements are also excluded from the sample.
As soon as all unimodal clusters are found and its central tendency and absolute deviation are computed, the method goes to the next stage: to measure the hyper-dimension distance between classes and evaluate the variable relevance to the classification.

Variable Evaluation and Distance Matrix
This part of the method is also based on Z² criterion, but now the objects of evaluation are the clusters identified on the previous stage. The variables are tested for their power to discriminate clusters against each other. For this purpose, the Nc × Nc (Nc, the number of clusters) symmetric matrices of Gaussian estimators are computed for each variable i as follows: where Na and Nb are respectively the number of members in the a-th and b-th class, while Z 2 i (a, b) and Z 2 i (b, a) are a reformulation of Z² estimator, now given by: can be found just by permuting the equation indices. The Gc i matrix gives the efficiency of variable i to resolve the clusters, each element represent the capacity of a variable i to discriminate a pair of cluster from each other. If all the elements are lower then a given critical value, then this variable is not significant for the classification procedure. Thus, smaller matrix values indicate less distinction between clusters. To discriminate the redundant variables, all the elements of Gc i matrix are tested against the null hypothesis µ i,a = µ i,b , and if none of them satisfy Gc i (a, b) < G q 1 , the method is iterated again without the variable i.
The method is repeated until stability is found on the most suitable set of meaningful variables for the sample.
The Nc × Nc symmetric Distance Matrix between clusters with respect to all meaningful variables is also calculated. The same interpretation given to Gc i matrices can be used here: higher D²(a,b) elements, more distinction between clusters are presented. D²(a,b) matrix is used to produce a Scipy.cluster.hierarchy.dendrogram , which graphically shows the relation among all clusters.

Robust Median Statistics
Robust Statistics seeks alternative estimators which are not excessively affected by outliers or departures from an assumed sample distribution. For central tendency estimator µ i , the median was chosen over mean due to its breakdown point of 50% against 0% for mean. Higher the breakdown point, the estimator is more resistant to variations due to errors or outliers. Following a median-based statistics, the Median of Absolute Deviation (MAD) was selected to represent the standard deviation estimator σ . The MAD is said to be conceived by Gauss in 1816 [Ham74] and can be expressed as: To be used as a estimator of standard deviation, the MAD must be multiplied by a scaling factor K, which adjusts the value for a assumed distribution. For Gaussian distribution, which is the distribution assumed for clusters in the G-mode, K = 1.426 . Therefore: Computing the Mahalanobis distance is necessary to estimate the covariance matrix. MAD is expanded to calculate its terms: The correlation coefficient r s,k used in this G-mode version was proposed by She97 to be a median counterpart to the Pearson correlation coefficient, with breakpoint of 50%, similar to MAD versus standard deviation. The coefficient is based on linear data transformation and depends on MAD and the deviation of each element from the median: The application of median statistics on G-mode is a departure from the original concept of the method. The goal is producing more stable classes and save processing time from unnecessary successive iterations.

Code Structure, Input And Output
The GmodeClass package, hosted in GitHub , is organized in a object-oriented structure. The code snippets below show how the main class and its objects are implemented, explaining what each one does, and also highlighting its dependences: Originally, G-mode relied on a single parameter, the confidence level q1, to resolve cluster from a sample. However, tests on simulated sample and asteroid catalogs (More in next sections), plus changes on initial seed finder, revealed that three more parameters were necessary for high quality classification. Thus, the last code version ended up with the following input parameters: • q 1 or G q 1 ( --q1, self.q1) : Confidence level or critical value. Must be inserted in multiple of σ . Usually it assumes values between 1.5 and 3.0 .
• Grid (--grid, -g, self.grid) : Number of times which barycenter.barycenter_density() will divide each variable up on each iteration, according to sample's upper and lower ranges. Values between 2 and 4 are preferable.

Minimum Deviation Limit
(--mlim, -m, self.mlim) : Sometimes the initial seeds starts with zeroth deviation, thus this singularity is corrected replacing all deviation by the minimum limit when lower than it. This number is given in fraction of median error of each variable.

Upper Deviation Limit
(--ulim, -u, self.ulim) : This optional parameter is important when the clusters have high degree of superposition and its necessary the identification of smaller mingled clusters. The upper limit is a restriction which determines how much a cluster might grow up. This value is given in fraction of total standard deviation of each variable.
The output is contained in a directory created in /TESTS/ and organized in a series of lists and plots. On the directory /TESTS/.../maps/ , there are on-the-fly density distribution plots showing the locus of each cluster in sample. On /TESTS/.../plots/ , a series of variable plots permits the user to verify each cluster profile. On the lists clump_xxx.dat , gmode1_xxx.dat , gmode2_xxx.dat and log_xxx.dat the informations about cluster statistics, classification per each data element, classification per unique ID and report of the formation of clusters and distance matrices are gathered. Working on a Python Interpreter, once Gmode.Run() was executed, users might call self.cluster_members to get a list of sample indexes organized into each cluster they are members of. The self.cluster_stats returns a list with each cluster statistics. Gmode.Evaluate() gives the self.Gc matrix and self.D2 distance matrix among clusters.
Users must be aware that input data should be formatted in columns in this order: measurement designation, unique identification, variables, errors. If errors are not available, its values should be replaced by 0.0 and mlim parameter might not be used. There is no limit on data size, however the processing time is very sensitive to the number of identified cluster, which may slow down the method for a bigger number. For example, with 20,000 elements and 41 clusters, the G-mode takes around to 2 minutes for whole procedure (plots creation not included) when executed in a Intel Core 2 Quad 2.4 GHz with 4 Gb RAM.
Our implementation also allows to import Gmode and use it on a Python Interpreter or through shells as in the example below: python Gmode.py --in path/to/file \ --q1 2.0 -g 3 -u 0.5 -m 0.5 -n Nickname Finally, since the plot limits, normalization and axis are optimized to asteroid photometry, users on shell are invited to directly change this parameters in config.cfg. If data is not normalized thus norm = None. More aesthetic options are going to be implemented in future versions using Matplotlib.rcParams.

Code Testing
For testing the efficiency of the Adapted G-mode version, a bidimensional sample of 2000 points was simulated using Numpy.random. The points filled a range of 0 to 10. Three random Gaussian distributions containing 500 points each (Numpy.random.normal), plus 500 random points (Numpy.random.rand) composed the final sample ( Figure  1). These Gaussians were the aim for the recognition ability of clustering method, while the random points worked as background noise. Then, simulated sample was classified using the Original [Gav92] and Adapted G-mode version. The results are presented in Table 1 and figures below. Comparing results from both versions, it is noticeable how each version identifies clusters differently. Since the initial seed in †. Central Tendency.
‡. Standard Deviation.   the Original G-mode starts from just the closest points, there is no guarantee that initial seeds will start close or inside clusters. The Original version is also limited for misaligned-axis clusters, due to the use of a normalized euclidean distance estimator, that does not have correction for covariance. This limitation turn impossible the identification of misaligned clusters without including random elements in, as seen in Figure 2 . The Adapted version, otherwise, seeks the initial seed through densest regions, thus ensuring its start inside or close to clusters. Moreover, by using the Mhalonobis distance as estimator, the covariance matrix is taken into account, which makes a more precise identification of cluster boundaries (Figure 3). Nevertheless, Adapted G-mode has a tendency to undersize the number of elements on the misaligned clusters. For cluster number 3 in Table 1 , a anti-correlated gaussian distribution, the undersizing reaches 30.8%. If the undersizing becomes too large, its possible that "lost elements" are identified as new cluster. Therefore, it may be necessary to group clusters according to its d²(a,b) distances.  containing 471,569 detections of moving objects, where 202,101 are linked to 104,449 unique objects. It has a system of five magnitudes in the visible [Fuk96] , providing measurements and corresponding uncertainties. As the photometric observations are obtained almost simultaneously, rotational variations can be discarded for most of the asteroids. The SDSS-MOC4 magnitudes employed here are first converted to normalized reflected intensities 1 [Lup99]. Thereby solar colors were obtained from Ive01 and extracted from asteroid measurements. A middle band called g' was chosen as reference [Car10], thus being discarded from the classification procedure.
In what follows, all observations of non-numbered asteroids, with uncertainties in each filter greater than the 3rd quartile, have been excluded. Moreover, all detections 15 degrees from the Galactic Plane and with |DEC| < 1.26 were eliminated due to inclusion of sources in crowded stellar regions, which have a high possibility of misidentification 2 . Finally, the sample contained 21,419 detections linked to 17,027 asteroids.

Preliminary Results on Asteroid Photometric Classification
When looking at the density distributions ( Figure 4) it is possible to notice two large agglomerations with accentuated superposition between them. Previous photometry-based taxonomic systems [Tho84,Bar87] were developed over smaller samples, with less than 1,000 asteroids, thus overlay was not a huge problem. Those two groups are the most common asteroid types S (from Stone) and C (from Carbonaceous). A important indicative that a classification method is working for asteroid taxonomy is at least the detachment of both groups. Nonetheless, even though both groups are being identified in the first and second clusters when SDSSMOC4 sample is classified, the third cluster was engulfing part of members left from both groups and other smaller groups mingled among them ( Figure 5). The loss of obvious unimodal distribution patterns on data may be the cause for such generalization in the third cluster. This behavior was interrupting the capacity of the method to identify smaller clusters. Therefore, to deal with that, a upper deviation limit was introduced to halt the cluster evolution, thus not permiting clusters to become comparable in sample size. Figure 6 is a example of a cluster recognized with upper deviation limit on, showing that third cluster is not getting  into a large size anymore, allowing other cluster to be identified. This specific test resulted in 58 cluster recognitions, most of them with lower than 100 members. Thus, the upper limit parameter turned up useful for sample with varied degrees of superposition.

Conclusions
In this paper, a refined version of a clustering method developed in the 70's was presented. The Adapted G-mode used Mahalonobis distance as estimator to better recognize misaligned clusters, and used Numpy.histogramdd to faster locate initial seeds. Robust median statistics was also implemented to more precisely estimate central tendency and standard deviation, and take less iteration to stabilize clusters.
Tests with simulated samples showed a quality increase in classification and successful recognition of clusters among random points. However, tests with asteroid samples indicated that for presence of superposition is necessary introduction of one more parameter. Therefore, users must previously inspect their samples before enabling an upper limit parameter.
Finally, the Adapted G-mode is available for anyone through GitHub . The codebase has no restriction on sample or variable size. Users must only fulfill the requirements related to installed packages and data format.