Karim Pichara

I'm currently an Associate Professor in the Computer Science Department at Universidad Católica de Chile and a Research Assistant at the Institute for Applied Computational Science (IACS) at Harvard University. I'm also a researcher at the Millennium Institute of Astrophysics (MAS). I received a Ph.D. in Computer Science from Universidad Católica de Chile in 2010. During 2011-2012, I did a postdoc in Machine Learning for Astronomy at Harvard University with Professor Pavlos Protopapas. My main research areas are Data Science and Machine Learning for Astronomy, focusing on the development of new tools for automatic and streaming classification of variable stars, quasar detection, missing-data algorithms, meta-classification, transfer learning, Bayesian model selection, variational inference and deep learning, among others. Besides my academic research, I'm the CTO and co-founder of NotCo, an innovative tech startup that is changing the way food is made around the world.

Research Interests: Data Mining, Machine Learning, Astro-Statistics and Astro-Informatics
kpb@ing.puc.cl - kpichara@seas.harvard.edu

Data Mining

In this course we study and implement most of the basic algorithms in Data Mining, such as association rules (Apriori and FP-Growth), linear regression (Gaussian and polynomial basis functions), automatic classification (logistic regression, decision trees, Random Forest and Nearest Neighbours) and clustering (K-Means, DBSCAN, Gaussian Mixtures with EM, and hierarchical clustering). We also study techniques for data discretization, integration, transformation and cleaning.
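
To give a flavor of the implementations, below is a minimal K-Means sketch in plain Python/NumPy; the toy two-blob dataset is made up for illustration, and the course versions are more complete.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # assign each point to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# toy usage with two synthetic blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)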


Approximate Bayesian Inference

This course focuses on the Bayesian approach to Machine Learning. We mainly study approximate inference methods, such as Monte Carlo and variational inference. The course is very practical: we implement most of the techniques we study, mainly in Python.
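
As a small taste of the Monte Carlo material, here is a minimal random-walk Metropolis-Hastings sampler in Python; the standard-normal target and the step size are illustrative choices only, not course content.

import numpy as np

def metropolis_hastings(log_target, x0, n_samples=5000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: draws samples from exp(log_target)."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        proposal = x + step * rng.normal()          # symmetric Gaussian proposal
        log_alpha = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:       # accept with prob min(1, alpha)
            x = proposal
        samples[i] = x
    return samples

# toy usage: sample from a standard normal target
samples = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=0.0)
print(samples.mean(), samples.std())  # should be close to 0 and 1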


Probabilistic Graphical Models

In this course we study in depth the theory behind Graphical Models, specifically Bayesian and Markov networks. After building this theoretical understanding, we cover the main algorithms for exact inference, structure learning and parameter learning.
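
As a minimal illustration of exact inference, the sketch below computes a posterior by brute-force enumeration on a toy two-parent Bayesian network; the network and its conditional probability tables are invented for the example.

import itertools

# Toy Bayesian network: Rain -> Wet <- Sprinkler (made-up CPTs).
p_rain = {1: 0.2, 0: 0.8}
p_sprinkler = {1: 0.3, 0: 0.7}
p_wet = {  # P(Wet=1 | Rain, Sprinkler)
    (1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.05,
}

def joint(rain, sprinkler, wet):
    p_w = p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_w if wet == 1 else 1 - p_w)

# Exact inference by enumeration: P(Rain=1 | Wet=1)
numerator = sum(joint(1, s, 1) for s in (0, 1))
evidence = sum(joint(r, s, 1) for r, s in itertools.product((0, 1), repeat=2))
print(numerator / evidence)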


Advanced Python Programming

In this course we study (and mainly implement) advanced topics in computer programming, such as object-oriented design, data structures, functional programming, threading, simulation, metaprogramming, input/output, unit testing and graphical interfaces. We are about to finish a book, which will be available online, covering all the contents of this course.
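
As one example of the metaprogramming topic, here is a short, self-contained sketch of a metaclass that keeps a registry of its subclasses; the class names are hypothetical.

class PluginMeta(type):
    """Metaclass that keeps a registry of every concrete subclass."""
    registry = {}

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if bases:  # skip the base class itself
            PluginMeta.registry[name] = cls
        return cls

class Plugin(metaclass=PluginMeta):
    def run(self):
        raise NotImplementedError

class CsvLoader(Plugin):
    def run(self):
        return "loading csv"

class JsonLoader(Plugin):
    def run(self):
        return "loading json"

print(sorted(PluginMeta.registry))  # ['CsvLoader', 'JsonLoader']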


Publications

Plenary talk at ACAT 2016

I was invited to give a plenary talk at the 17th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT); more information here: https://indico.cern.ch/event/397113/timetable/#20160121.detailed

Harvard-Chile student exchange program started

During January 2016, several students and professors from Harvard came to Chile to work with students and professors from Pontificia Universidad Católica and Universidad de Chile. The program mainly involves collaborative work on problems related to data science for astronomy, and it will continue at Harvard during July 2016.

Presentation at the Semantic Web Seminar

I was invited to give a talk at the Semantic Web Seminar (http://ciws.cl).

Research

Our research is focused on Machine Learning and Data Science, mainly applied to the analysis of astronomical time series. Our main interest is to develop automatic tools to classify objects in the Universe based on information provided by telescopes. There are several important challenges to overcome, such as algorithms for processing huge amounts of data, parallel programming, representation of unstructured and multivariate time series, intelligent integration of expert models, dealing with missing data and unsupervised representation, among others. Chile is one of the most attractive places to carry out scientific research in Astronomy, given that most state-of-the-art telescopes are installed in the north of the country.

A Global Data Warehouse for Astronomy

This project is led by Javier Machin (Master's student). Javier is developing a Data Warehouse for Astronomy (currently focused on time-domain astronomy) that brings together several astronomical catalogs (OGLE, VVV, MACHO, EROS, Catalina Survey, Stripe 82, etc.), soon to include most of them, preprocessed and integrated, with visualization tools, visual querying, selection and integration tools, slicing-and-dicing capabilities, data sharing, machine learning models and lightcurve feature extraction (FATS), among others. This project has been developed in collaboration with Professor Andrés Neyém.



Automatic Survey-Invariant Classification of Variable Stars

This project has been developed by Patricio Benavente (Master's student), in collaboration with Professor Pavlos Protopapas. Lightcurve datasets have been available to astronomers for long enough to allow deep analyses of several variable sources and the production of useful catalogs of identified variable stars. The products of these studies are labeled data that enable supervised learning models to be trained successfully. However, when these models are blindly applied to data from new sky surveys, their performance drops significantly. Furthermore, unlabeled data becomes available at a much higher rate than its labeled counterpart, since labeling is a manual and time-consuming effort. Domain adaptation techniques aim to learn from a domain where labeled data is available, the source domain, and through some adaptation perform well on a different domain, the target domain. We propose a full probabilistic model that represents the joint distribution of features from two surveys as well as a probabilistic transformation of the features from one survey to the other.



Time Series Variability Tree for fast light curve retrieval

This project has been developed by Lucas Valenzuela (Master's student). We propose a new algorithm and data structure to index astronomical lightcurves, providing astronomers with fast lightcurve search in catalogs of millions of objects. We believe this tool is very useful for manual exploration of catalogs, where astronomers need to find lightcurves of a certain type, similar to a lightcurve they already have. Classification algorithms based on similarity can also rely on this model: instead of using traditional search methods, they can use this index and run much faster.
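
The variability tree itself is described in the paper; purely as a simplified illustration of fast similarity retrieval, the sketch below indexes hypothetical fixed-length lightcurve feature vectors with a generic BallTree from scikit-learn, which is not the structure proposed in this work.

import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical setup: each lightcurve is summarized by a fixed-length feature
# vector (e.g. period, amplitude, skewness, ...). The paper's variability tree
# is a dedicated structure; a generic BallTree only gives the flavor of fast retrieval.
rng = np.random.default_rng(0)
features = rng.normal(size=(100_000, 8))      # 100k catalog objects, 8 features each
tree = BallTree(features)

query = rng.normal(size=(1, 8))               # features of the lightcurve we already have
dist, idx = tree.query(query, k=10)           # 10 most similar catalog objects
print(idx[0])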



A full probabilistic model for yes/no query type Crowdsourcing

This project has been developed by Belén Saldías (Master's student), in collaboration with Professor Pavlos Protopapas. Crowdsourcing has become widely used in supervised scenarios where unlabeled data is abundant and labeled data is scarce and hard to obtain. Although there are several crowdsourcing models in the literature, most of them assume annotators can provide answers to standard complete questions. In classification contexts, complete questions mean that an annotator is asked to discern among all the possible classes. Unfortunately, that assumption does not always hold in realistic scenarios. In this work we provide a full probabilistic model for a new type of query where, instead of asking labelers complete questions, we only ask queries that require a "yes" or "no" response.



Visualization tool for relevant patterns in Light Curves

This project has been developed by Christian Pieringer, in collaboration with Professors Márcio Catelán and Pavlos Protopapas. Current machine learning methods for automatic lightcurve classification have shown successful results, reaching high classification performance on known catalogs. Recently, visualization of time series has been attracting more attention in machine learning as a tool to help experts visually recognize significant patterns in the complex dynamics of astronomical time series. Inspired by dictionary-based classifiers, we present a method that naturally provides the visualization of salient parts of light curves. These classifiers intrinsically encode the relevant parts by assigning weights to each word in the dictionary according to its contribution to the signal approximation. Our approach delivers an intuitive visualization according to the relevance of each part of the time series. Results suggest the effectiveness of this method in highlighting salient patterns. We also propose a pipeline that uses our method for observational time scheduling.










Automatic classification of poorly observed variable stars

This project has been developed by Nicolás Castro (Master's student), in collaboration with Professor Pavlos Protopapas. Automatic classification methods applied to sky surveys have revolutionized the astronomical target selection process. Most surveys generate a vast amount of time series, or lightcurves. Unfortunately, astronomical observations take several years to complete, producing partial time series that generally remain unanalyzed until the observations are finished. This happens because state-of-the-art methods rely on a variety of statistical descriptors, or features, whose dispersion increases as the number of observations decreases, which reduces their precision. In this work we propose a novel method that increases the performance of automatic classifiers of variable stars by incorporating the deviations that scarcity of observations produces. Our method uses Gaussian Process Regression to form a probabilistic model of the values observed for each lightcurve.
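
A minimal sketch of the Gaussian Process step, using scikit-learn instead of a custom implementation; the handful of (time, magnitude) points and the kernel choice are hypothetical.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical partial lightcurve: a handful of (time, magnitude) observations.
t = np.array([[1.0], [2.3], [4.1], [5.0], [7.8]])
mag = np.array([15.2, 15.5, 15.1, 14.9, 15.3])

# Fit a GP to model the distribution of plausible magnitudes at unobserved times.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, mag)

t_new = np.linspace(0, 10, 50).reshape(-1, 1)
mean, std = gp.predict(t_new, return_std=True)  # predictive mean and uncertainty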



Meta Classification for variable stars

This project has been developed by Karim Pichara, in collaboration with Professor Pavlos Protopapas and Daniel León. The need for automatic tools to explore astronomical databases has been recognized since the inception of CCDs and modern computers. Astronomers have already developed solutions to tackle several science problems, such as automatic classification of stellar objects, outlier detection and globular cluster identification, among others. New science problems emerge, and it is critical to be able to reuse previously learned models without rebuilding everything from scratch when the science problem changes. In this paper, we propose a new meta-model that automatically integrates existing classification models of variable stars. The proposed meta-model incorporates existing models that were trained in a different context, answering different questions and using different representations of the data. Conventional mixture-of-experts algorithms in the machine learning literature cannot be used, since each expert (model) uses different inputs. We also take the computational complexity of the model into account by using the most expensive models only when necessary. We test our model on the EROS-2 and MACHO datasets, and we show that we solve most of the classification challenges only by training a meta-model to learn how to integrate the previous experts.


Clustering based feature learning on variable stars

This project was developed by Cristóbal Mackenzie, in collaboration with Professor Pavlos Protopapas. The success of automatic classification of variable stars strongly depends on the lightcurve representation. Usually, lightcurves are represented as a vector of many descriptors designed by astronomers, called features. These descriptors are computationally expensive, require substantial research effort to develop and do not guarantee good classification. Today, lightcurve representation is not entirely automatic; algorithms must be designed and manually tuned for every survey. The amounts of data that will be generated in the future mean astronomers must develop scalable and automated analysis pipelines. In this work we present a feature learning algorithm designed for variable objects. Our method works by extracting a large number of lightcurve subsequences from a given set, which are then clustered to find common local patterns in the time series. Representatives of these common patterns are then used to transform the lightcurves of a labeled set into a new representation that can be used to train a classifier. The proposed algorithm learns the features from both labeled and unlabeled lightcurves, overcoming the bias of using only labeled data.
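
A simplified sketch of this pipeline, assuming evenly sampled lightcurves and using K-Means from scikit-learn as the clustering step (the actual method is more general): extract subsequences, cluster them into common patterns, and encode each lightcurve as a histogram over those patterns.

import numpy as np
from sklearn.cluster import KMeans

def subsequences(lc, window=20, step=5):
    """Extract fixed-length subsequences (sliding windows) from one lightcurve."""
    return np.array([lc[i:i + window] for i in range(0, len(lc) - window + 1, step)])

def learn_patterns(lightcurves, n_patterns=50, **kw):
    """Cluster all subsequences to find common local patterns."""
    segments = np.vstack([subsequences(lc, **kw) for lc in lightcurves])
    return KMeans(n_clusters=n_patterns, n_init=10).fit(segments)

def encode(lc, kmeans, **kw):
    """Represent a lightcurve as a histogram of its nearest learned patterns."""
    labels = kmeans.predict(subsequences(lc, **kw))
    return np.bincount(labels, minlength=kmeans.n_clusters)

# toy usage with random "lightcurves" (equally sampled magnitudes)
lcs = [np.random.randn(200) for _ in range(30)]
kmeans = learn_patterns(lcs)
X = np.array([encode(lc, kmeans) for lc in lcs])   # new representation for a classifier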


Automatic classification of variable stars in catalogs with missing data

This project was developed by Karim Pichara, in collaboration with Professor Pavlos Protopapas. We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks, a probabilistic graphical model that allows us to perform inference to predict missing values given the observed data and the dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing-data catalogs is included, how our method compares to traditional missing-data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent, and by 15% for quasar detection, while keeping the computational cost the same.
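
The paper learns a Bayesian network with expectation maximization to infer the missing values; purely as a generic stand-in for the overall workflow (impute, then classify), here is a sketch with scikit-learn's IterativeImputer on hypothetical data.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical catalog: feature matrix with missing entries (NaN) and class labels.
X = np.random.randn(1000, 12)
X[np.random.rand(*X.shape) < 0.3] = np.nan    # 30% of the values are missing
y = np.random.randint(0, 3, size=1000)

# Infer plausible values for the missing entries from the observed ones
# (a generic imputer, not the Bayesian network of the paper),
# then train a classifier on the completed feature matrix.
X_filled = IterativeImputer(max_iter=10).fit_transform(X)
clf = RandomForestClassifier(n_estimators=200).fit(X_filled, y)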


Supervised detection of anomalous light-curves in massive astronomical catalogs

This project was developed by Isadora Nun, in collaboration with Professor Pavlos Protopapas. The development of synoptic sky surveys has led to a massive amount of data for which the resources needed for analysis are beyond human capabilities. To process this information and extract all possible knowledge, machine learning techniques become necessary. Here we present a new methodology to automatically discover unknown variable objects in large astronomical catalogs. With the aim of taking full advantage of all the information we have about known objects, our method is based on a supervised algorithm. In particular, we train a random forest classifier using known variability classes of objects and obtain votes for each of the objects in the training set. We then model this voting distribution with a Bayesian network and obtain the joint voting distribution among the training objects. Consequently, an unknown object is considered an outlier insofar as it has a low joint probability. By leaving out one of the classes in the training set we perform a validity test and show that when the random forest classifier attempts to classify unknown lightcurves (the class left out), it votes with an unusual distribution among the classes. This rare voting is detected by the Bayesian network and expressed as a low joint probability. Our method is suitable for exploring massive datasets, given that the training process is performed offline.
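
A simplified sketch of the pipeline on hypothetical data: the random forest votes are modeled here with a kernel density estimate rather than the Bayesian network used in the paper, and the 1% threshold is an arbitrary illustrative choice.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KernelDensity

# Hypothetical feature matrix and class labels for known variable stars.
X_train = np.random.randn(500, 10)
y_train = np.random.randint(0, 4, size=500)

# 1. Train a random forest and collect its class-vote vectors on the training set.
rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
votes_train = rf.predict_proba(X_train)

# 2. Model the joint distribution of votes (the paper uses a Bayesian network;
#    a kernel density estimate is used here only as a simple stand-in).
density = KernelDensity(bandwidth=0.1).fit(votes_train)

# 3. A new object whose vote vector has low probability is flagged as an outlier.
X_new = np.random.randn(20, 10)
log_prob = density.score_samples(rf.predict_proba(X_new))
outliers = log_prob < np.percentile(density.score_samples(votes_train), 1)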


Fondecyt Projects

Automatic Classification of Variable Stars integrating multiple Catalogs

In the last few years, there has been increasing interest in computer applications for astronomical research. This interest has been mainly triggered by ongoing and future observational projects expected to deliver huge amounts of high-quality data, which need to be promptly analyzed. Some illustrative examples are the upcoming Large Synoptic Survey Telescope (LSST), the ongoing Vista Variables in the Via Láctea (VVV) ESO Public Survey, and the Atacama Large Millimeter/submillimeter Array (ALMA), among others...

Main Collaborators

Pavlos Protopapas

IACS, Harvard University

Felipe Barrientos

Centro de Astro Ingeniería UC

Márcio Catelán

Astronomía UC

Felipe Tobar

CMM Universidad de Chile

Paulina Troncoso

Centro de Astro Ingeniería UC

Francisco Forster

CMM Universidad de Chile

Guillermo Cabrera

CMM Universidad de Chile

Pablo Huijse

Instituto Milenio de Astrofísica (MAS)

Postdocs

Christian Pieringer

PhD Students

Francisco Pérez Galarce

Astrid San Martin

Ignacio Becker

Orlando Vásquez

Felipe Rojas Aracena

Master Students

Matias Vergara

Lukas Zorich

Javiera Astudillo

Rodrigo Hurtado

Gonzalo Cáceres

Nebil Kawas

Alumni

Nicolás De La Maza

Patricio Benavente

Belén Saldías

Nicolás Castro

Cristóbal Berger

Isadora Nun

Carlos Aguirre

Javier Machin Matos

Lucas Valenzuela

Tomás Yany

Andrés Riveros

Cristóbal Mackenzie

Augusto Sandoval

Daniela Carrasco

Visiting Students

Yesenia Helem

Advanced Computer Programming in Python

This book covers most of the advanced topics in computer programming, such as Object-Oriented Design, Data Structures, Functional Programming, Metaclasses, Abstract Classes, Exceptions, Testing, Threading, Simulation, Graphical Interfaces, Input/Output, Networking and Web Services. All these topics are based on the Python 3 language. In each chapter, besides the theory, there is always code showing examples of applications. At the end of each chapter there are also two hands-on activities, where a harder problem is defined, encouraging readers to solve it by themselves. We include all the solutions at the end of the book.
Buy the online version