Automatic Classification of Variable Stars integrating multiple Catalogs

supported by Fondecyt Project 11140643

General Description

In the last few years, there has been increased interest in computer applications for astronomical research. This interest has been mainly triggered by ongoing and future observational projects expected to deliver huge amounts of high-quality data, which need to be promptly analyzed. Some illustrative examples are the upcoming Large Synoptic Survey Telescope (LSST), the ongoing VISTA Variables in the Vía Láctea (VVV) ESO Public Survey, and the Atacama Large Millimeter/submillimeter Array (ALMA), among others. Besides the upcoming telescopes, today we have hundreds of astronomical catalogs of tens of millions of objects available online. All this information makes the analysis and classification of objects by astronomers impossible without the help of computer systems.

So far, there are many computer programs able to classify objects with high accuracy for specific catalogs. Unfortunately, those programs can only deal with catalog-specific data (specific variables and classes) and cannot be directly extended to work with new classes or variables without re-training the classifier. For example, there are very accurate models that separate periodic from non-periodic stars, good models that separate quasars from non-quasars (with many periodic stars in the non-quasar class), and good models that classify among many types of variable stars (Cepheids, Eclipsing Binaries, Mira, Be stars, quasars, etc.), where some of those classes are sub-classes of periodic and non-periodic stars. Moreover, each of these models uses its own set of variables to perform classification.

When new data become available, the set of classes involved will rarely match exactly the set of classes already modeled with previous datasets, but this should not force the work to be repeated. Instead, one can reuse the previous models and spend resources only on what is necessary to complement the models already developed. Given the huge size of astronomical catalogs, joining all of them and training one big model able to classify everything is not feasible.


MACHO Data Set

The MACHO Project (Cook et al. 1995) observed the Magellanic Clouds and the Galactic bulge with the main purpose of detecting microlensing events. Observations were made in blue (∼4500–6300 Å) and red (∼6300–7600 Å) passbands. The cadence is about one observation every two days over 7.4 years, which yields approximately 1000 observations per object. The light curves used in this work are from the Small and Large Magellanic Clouds. The fields cover almost the entire LMC bar (10 square degrees) to a limiting magnitude of V ≈ 22. The training set contains 6059 labeled light curves (Kim et al. 2011).

EROS Data Set

The EROS project (Derue et al. 1999) observed the Galactic Spiral Arms (GSA), the LMC, the SMC, and the Galactic bulge for 6.7 years, with the aim of detecting microlensing events. Observations were made in two nonstandard passbands: the EROS-red passband Re, centered on λ = 762 nm, and the EROS-visible passband Ve, centered on λ = 600 nm. The light curves used in this work are from the LMC (60 fields) and the SMC (10 fields). The limiting magnitude of the EROS Ve band is ∼20. The cadence varies among the fields but, on average, about 500 observations were obtained for each light curve. The training set contains 68,718 labeled light curves, obtained from Kim et al. (2014).

OGLE-III Catalog of Variable Stars

The Optical Gravitational Lensing Experiment (OGLE) is a wide-field sky survey originally designed to search for microlensing events (Paczynski 1986). The brightness of more than 200 million stars in the Magellanic Clouds and the Galactic bulge is regularly monitored on a timescale of years. A by-product of these observations is an enormous database of photometric measurements. The OGLE-III Catalog of Variable Stars (Udalski et al. 2008) corresponds to the photometric data collected during the third phase of this survey, which began in 2001.

The HiTS Survey

The first campaign of the High Cadence Transient Survey (HiTS) started in 2013 with the objective of exploring transient and periodic objects with characteristic timescales between a few hours and days. This discovery survey uses high-cadence data obtained with the Dark Energy Camera (DECam), mounted on a 4 m telescope at the Cerro Tololo Inter-American Observatory (CTIO). The large etendue (product of collecting area and field of view) of DECam allows the observation of objects as faint as 24.5 mag in apparent magnitude (Förster et al. 2016).

New representation of Time Series for variable stars classification

The success of automatic classification of variable stars depends strongly on the lightcurve representation. Usually, lightcurves are represented as a vector of descriptors, called features, designed by astronomers. These descriptors are computationally expensive, require substantial research effort to develop, and do not guarantee good classification. Today, lightcurve representation is not entirely automatic; algorithms must be designed and manually tuned for every survey. The amount of data that will be generated in the future means that astronomers must develop scalable and automated analysis pipelines. In this work we present a feature learning algorithm designed for variable objects. Our method works by extracting a large number of lightcurve subsequences from a given set, which are then clustered to find common local patterns in the time series. Representatives of these common patterns are then used to transform lightcurves of a labeled set into a new representation that can be used to train a classifier. The proposed algorithm learns the features from both labeled and unlabeled lightcurves, overcoming the bias of using only labeled data. We test our method on data sets from the Massive Compact Halo Object (MACHO) survey and the Optical Gravitational Lensing Experiment (OGLE); the results show that our classification performance is as good as, and in some cases better than, the performance achieved using traditional statistical features, while the computational cost is significantly lower. With these promising results, we believe that our method constitutes a significant step toward the automation of the lightcurve classification pipeline.
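
The extract, cluster, and encode steps described above can be illustrated with a small sketch. It is not the project's implementation: it assumes evenly sampled toy curves, uses plain k-means with Euclidean distance as a stand-in for the clustering step, and all names (subsequences, kmeans, encode, toy_curve) and parameter values are hypothetical.

```python
import math
import random

def subsequences(series, width, step):
    """Slide a window over a series and return mean-subtracted fragments."""
    out = []
    for start in range(0, len(series) - width + 1, step):
        window = series[start:start + width]
        mean = sum(window) / width
        out.append([v - mean for v in window])
    return out

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(fragment, centers):
    """Index of the center closest to the fragment."""
    return min(range(len(centers)), key=lambda k: sq_dist(fragment, centers[k]))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumption; the source does not name the clustering)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def encode(series, centers, width, step):
    """New representation: normalized histogram of nearest-pattern counts."""
    fragments = subsequences(series, width, step)
    hist = [0.0] * len(centers)
    for f in fragments:
        hist[nearest(f, centers)] += 1.0
    return [h / len(fragments) for h in hist] if fragments else hist

def toy_curve(kind, n=60, period=12.0):
    """Synthetic, evenly sampled stand-in for a lightcurve."""
    if kind == "sine":
        return [math.sin(2 * math.pi * t / period) for t in range(n)]
    return [(t % period) / period for t in range(n)]  # sawtooth

# Patterns are learned from a pool that needs no labels.
pool = [toy_curve("sine"), toy_curve("saw")]
fragments = [f for curve in pool for f in subsequences(curve, width=6, step=2)]
centers = kmeans(fragments, k=4)

# Any lightcurve, labeled or not, can then be encoded for a classifier.
features = encode(toy_curve("sine"), centers, width=6, step=2)
```

Each encoded lightcurve becomes a fixed-length vector regardless of its original length, which is what lets a standard classifier be trained on it. Real survey data are irregularly sampled, so the windowing and the distance between fragments would need to account for observation times.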

Overview of the representation method

Meta-Classification for variable stars

The need for automatic tools to explore astronomical databases has been recognized since the inception of CCDs and modern computers. Astronomers have already developed solutions to tackle several science problems, such as automatic classification of stellar objects, outlier detection, and globular cluster identification, among others. New scientific problems emerge, and it is critical to be able to reuse the models learned before, without rebuilding everything from the beginning when the scientific problem changes. In this paper, we propose a new meta-model that automatically integrates existing classification models of variable stars. The proposed meta-model incorporates existing models that were trained in different contexts, answering different questions and using different representations of the data. Conventional mixture-of-experts algorithms from the machine learning literature cannot be used, since each expert (model) uses different inputs. We also take the computational complexity of the model into account by using the most expensive models only when necessary. We test our model with the EROS-2 and MACHO data sets, and we show that we solve most of the classification challenges only by training a meta-model to learn how to integrate the previous experts.
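
As a rough illustration of the idea, the sketch below combines two pre-trained experts that read different inputs and calls the more expensive one only when the cheaper one leaves the answer ambiguous. Everything in it is an assumption made for the example: the experts are hypothetical threshold rules, the feature names and costs are invented, and the meta-model is a simple count table over expert outputs rather than the model used in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical pre-trained experts: each reads only its own features.
def periodic_expert(star):
    return "periodic" if star["period_power"] > 0.5 else "non-periodic"

def quasar_expert(star):
    return "quasar" if star["color_index"] > 1.0 else "non-quasar"

# (name, cost, callable), ordered from cheapest to most expensive.
EXPERTS = [
    ("periodic", 1.0, periodic_expert),
    ("quasar", 5.0, quasar_expert),
]

def fit_meta(stars, labels):
    """Count the true labels seen for every prefix of expert outputs."""
    table = defaultdict(Counter)
    for star, label in zip(stars, labels):
        outputs = ()
        for _, _, expert in EXPERTS:
            outputs += (expert(star),)
            table[outputs][label] += 1
    return table

def predict(table, star, purity=0.95):
    """Query experts in cost order; stop once the answer is pure enough."""
    outputs, spent, best = (), 0.0, None
    for _, cost, expert in EXPERTS:
        outputs += (expert(star),)
        spent += cost
        counts = table.get(outputs)
        if not counts:
            break  # unseen combination: keep the last known answer
        best, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= purity:
            break  # no need to pay for the remaining experts
    return best, spent

# Toy labeled set whose classes differ from those of either expert.
train = [
    ({"period_power": 0.9, "color_index": 0.2}, "Cepheid"),
    ({"period_power": 0.8, "color_index": 0.3}, "Cepheid"),
    ({"period_power": 0.1, "color_index": 1.5}, "quasar"),
    ({"period_power": 0.2, "color_index": 1.3}, "quasar"),
    ({"period_power": 0.2, "color_index": 0.3}, "Be star"),
    ({"period_power": 0.3, "color_index": 0.4}, "Be star"),
]
table = fit_meta([s for s, _ in train], [y for _, y in train])

cepheid_result = predict(table, {"period_power": 0.8, "color_index": 0.1})
quasar_result = predict(table, {"period_power": 0.1, "color_index": 1.4})
```

In this toy set the cheap expert alone identifies the Cepheid-like star, while the quasar-like star requires both experts, so the second query pays the full cost. Only the table is trained; the experts themselves are reused unchanged.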

Meta Classifier learned from MACHO dataset

Meta Classifier learned from EROS dataset

Transfer learning between astronomical catalogs

Machine learning techniques have been successfully used to classify variable stars in widely studied astronomical surveys. These datasets have been available to astronomers long enough to allow deep analysis of many variable sources and to generate useful catalogs of identified variable stars. The products of these studies are labeled data that allow supervised learning models to be trained successfully. However, when these models are blindly applied to data from new sky surveys, their performance drops significantly. Furthermore, unlabeled data become available at a much higher rate than their labeled counterpart, since labeling is a manual and time-consuming effort. Domain adaptation techniques aim to learn from a domain where labeled data are available, the source domain, and, through some adaptation, to perform well on a different domain, the target domain. We propose a new probabilistic model that finds a transformation of the feature distributions between one survey and another, effectively transferring labeled data to a survey with no labeled data. Our approach allows running a variable star classification model (trained on a given initial survey) on a new survey without the need to re-train from scratch. Our model represents the features of each domain as a Gaussian mixture and models the transformation as a translation, rotation, and scaling. We perform tests using three different variability catalogs, EROS, MACHO, and HiTS, which differ in the number of observations per star, cadence, observational time, and optical bands, among others.
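
To make the form of the transformation concrete, the sketch below fits a translation, rotation, and scaling between two sets of 2-D points, which stand in for the component means of the source and target Gaussian mixtures. It relies on a strong assumption that the probabilistic model itself does not need: the components are taken to be already matched one-to-one. The function names and all numbers are hypothetical.

```python
import cmath
import math

def fit_similarity(source_means, target_means):
    """Least-squares a, b such that target ≈ a * source + b.

    Points are treated as complex numbers, so a = scale * exp(i * angle)
    encodes scaling and rotation, and b encodes the translation.
    """
    z = [complex(x, y) for x, y in source_means]
    w = [complex(x, y) for x, y in target_means]
    z_bar = sum(z) / len(z)
    w_bar = sum(w) / len(w)
    num = sum((wi - w_bar) * (zi - z_bar).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_bar) ** 2 for zi in z)
    a = num / den
    b = w_bar - a * z_bar
    return a, b

def transform(point, a, b):
    """Map one 2-D feature vector from the source to the target domain."""
    p = a * complex(*point) + b
    return (p.real, p.imag)

# Hypothetical component means in the source survey's feature space.
source_means = [(0.0, 0.0), (1.0, 0.5), (-0.5, 2.0)]

# Build target means with a known transformation so the fit can be checked.
true_a = 2.0 * cmath.exp(1j * math.pi / 6)  # scale 2, rotation 30 degrees
true_b = complex(1.0, -3.0)                 # translation
target_means = [transform(p, true_a, true_b) for p in source_means]

a, b = fit_similarity(source_means, target_means)
scale, angle = abs(a), cmath.phase(a)

# Labeled source features can now be moved into the target domain.
moved = transform((0.2, 0.7), a, b)
```

With exact correspondences the fit recovers the scale, angle, and offset used to build the target means. In the setting described above no such correspondences are given, which is why the transformation has to be inferred jointly with the mixtures.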

Probabilistic Graphical Model to perform transfer learning between catalogs

Overview of the transfer learning process

Example of adapted features between catalogs