How to use cheap latent variable information to augment data


  The area of the small circle in the illustration is proportional to the time needed to generate an event via Monte Carlo up to a latent state z, which has fewer than 100 features. The large area represents the time needed to simulate the subsequent steps up to the observables x (about 100 million features); this part of the simulation is independent of the theory parameters we want to infer. Typically, we train our machine learning models on features of x and the ground truth. The structure described above invites us to go beyond this and profit from the compact, informative latent state z. In some cases, the probability distribution of the parameter of interest conditional on z (and its score) is even tractable, or else it can be learned from large "cheap" samples. In [1] we show how to augment the loss function with this latent-state information to increase the data efficiency of training on x. This method has the potential to reduce the amount of CPU needed for simulation, which is a major concern for the future of the LHC.
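The idea can be sketched in a few lines of numpy. This is a toy illustration with hypothetical names, not the estimator from [1]: a 1-D Gaussian stands in for the cheap part of the simulator, so the joint likelihood ratio at the latent state z is tractable, and the hard 0/1 labels of the usual cross-entropy are replaced by a soft target computed from z.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_ratio(z, theta0=0.0, theta1=1.0):
    # p(z | theta0) / p(z | theta1) for unit-variance Gaussians --
    # tractable because it only involves the cheap latent state z
    return np.exp(-0.5 * (z - theta0) ** 2 + 0.5 * (z - theta1) ** 2)

# "Expensive" observables x are a noisy function of z; this step is
# independent of the theory parameters, mirroring the structure above
z = rng.normal(1.0, 1.0, size=1000)
x = z + 0.1 * rng.normal(size=1000)

r_joint = joint_ratio(z)

def augmented_loss(s_of_x, r_joint):
    """Cross-entropy whose 0/1 labels are replaced by the exact joint
    quantity 1 / (1 + r_joint) computed from the latent state; the extra
    information reduces the variance of the training signal."""
    target = 1.0 / (1.0 + r_joint)
    s = np.clip(s_of_x, 1e-6, 1 - 1e-6)
    return -np.mean(target * np.log(s) + (1 - target) * np.log(1 - s))
```

In practice `s_of_x` would be the output of a neural network evaluated on x; the sketch only shows how the latent information enters the loss.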


[1] M. Stoye et al., "Likelihood-free inference with an improved cross-entropy estimator", arXiv:1808.00973

[2] Poster at the MLPS workshop at NeurIPS, and the corresponding proceedings paper

Which particle is that?

Each line or block represents a complex multi-dimensional measurement

To answer such questions, my team developed a custom deep neural network architecture for the multi-class classification tasks in this domain. The resulting performance improvement over previous methods has a big positive impact on the CMS experiment: detector upgrades reaching similar performance gains would cost tens of millions of dollars. The classifiers are now implemented in the CMS experiment's software and are currently being validated on real data.
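For readers unfamiliar with the setup, the task boils down to multi-class classification of reconstructed objects. The sketch below is a deliberately minimal stand-in, not the custom CMS architecture (which consumes variable-length lists of particle candidates); here each object is reduced to a fixed-size feature vector and classified with a linear softmax model trained on toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_features, n = 3, 4, 600

# Toy "particles": each class scatters around a well-separated mean vector
means = 2.0 * np.eye(n_classes, n_features)
y = rng.integers(0, n_classes, size=n)
X = means[y] + 0.5 * rng.normal(size=(n, n_features))

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Train a linear softmax classifier with the multi-class cross-entropy
W, b = np.zeros((n_features, n_classes)), np.zeros(n_classes)
onehot = np.eye(n_classes)[y]
for _ in range(300):
    grad = (softmax(X @ W + b) - onehot) / n   # gradient of the cross-entropy
    W -= 1.0 * X.T @ grad
    b -= 1.0 * grad.sum(axis=0)

accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
```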

A summary can be found in my NIPS DLPS workshop paper.

How to deal with simulation and real data differences


Picture: arXiv:1709.07857

Similar to practice in many other domains (e.g. robotics, left illustration), in particle physics we train on simulation and apply the neural network to real data. My team customises off-the-shelf unsupervised domain adaptation algorithms to our special needs. We simultaneously estimate the label fractions in real data and perform domain adaptation, since in particle physics the label fractions in real data are typically not known precisely either.
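The label-fraction part of this can be sketched on its own. In the hypothetical toy below, the per-class score densities are taken as known from simulation, and the unknown signal fraction in "real data" is fit by maximum likelihood; the full approach would alternate such a step with a domain-adaptation update.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_sig(s):  # simulated signal score density on [0, 1] (toy: rising)
    return 2.0 * s

def p_bkg(s):  # simulated background score density (toy: falling)
    return 2.0 * (1.0 - s)

# "Real data" scores with an unknown-to-the-fit true signal fraction
true_f = 0.3
n = 20000
is_sig = rng.random(n) < true_f
# Inverse-CDF sampling from the two toy densities
s_data = np.where(is_sig, np.sqrt(rng.random(n)), 1.0 - np.sqrt(rng.random(n)))

def nll(f):
    # Negative log-likelihood of the mixture f * p_sig + (1 - f) * p_bkg
    return -np.sum(np.log(f * p_sig(s_data) + (1 - f) * p_bkg(s_data)))

fractions = np.linspace(0.01, 0.99, 99)
f_hat = fractions[np.argmin([nll(f) for f in fractions])]
```

A coarse grid scan is enough here; in practice one would minimise the same likelihood with a proper optimiser.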

Preliminary results presented at the 2nd LHC machine learning workshop at CERN

Using simulation to help classifier calibration in real data

Calibration of classifiers on real data is key for particle physics. Sometimes we have several real datasets with different label proportions. In the binary case, that is sufficient to determine the density function of the score for a given label up to a two-dimensional degeneracy in the solution space, namely rotations in label space. I broke the degeneracy by selecting the solution that requires the smallest relative correction to simulation. After this procedure, the performance of the corrected simulation and the "data" was very similar; thus, the calibration estimated from simulation for real data improved.
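The core of the decomposition can be shown with toy numbers. Assuming the two label proportions are known (in the general case they are only determined up to the degeneracy described above), each score bin gives a 2x2 linear system whose solution recovers the per-label score densities from the two observed mixtures.

```python
import numpy as np

f1, f2 = 0.7, 0.2  # known signal fractions of the two datasets (toy values)

# True (unknown to the method) binned score densities per label
p_sig = np.array([0.1, 0.2, 0.3, 0.4])
p_bkg = np.array([0.4, 0.3, 0.2, 0.1])

# Observed mixtures: h_i = f_i * p_sig + (1 - f_i) * p_bkg
h1 = f1 * p_sig + (1 - f1) * p_bkg
h2 = f2 * p_sig + (1 - f2) * p_bkg

# Solve the 2x2 mixture system for all bins at once (invertible iff f1 != f2)
A = np.array([[f1, 1 - f1],
              [f2, 1 - f2]])
p_sig_hat, p_bkg_hat = np.linalg.solve(A, np.vstack([h1, h2]))
```

When f1 = f2 the matrix is singular, which is the binned analogue of why datasets with identical label proportions carry no decomposition information.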

Preliminary results presented at the 2nd LHC machine learning workshop at CERN