Creating a classifier

The aim of this study is to design and compare algorithms that classify top quarks according to their origin. Top quarks originating from the decay of the resonance should be labelled as such (e.g. with a label "resonance: 1"), and top quarks originating from associated production should get a different label (e.g. "resonance: 0").

You can find the project here:

Setup

To start the project, please open a new terminal and create a new directory for your summer student project. In case you are not familiar with the command line, have a look at this tutorial: https://www.codecademy.com/learn/learn-the-command-line

First, download the git project (not required if you have already done so for exploring the dataset):

git clone git@github.com:philippgadow/bsm4tops-gnn.git

Then move into the directory for the classification study:

cd bsm4tops-gnn/project_simple

We will use Python throughout the project to benefit from a number of useful packages. Setting these up works best in a clean environment, so we use a Python virtual environment (similar to a conda environment, if you have heard of that).

A setup script is provided which takes care of that for you:

source setup.sh

Now the virtual environment is activated. You will notice that the command-line prompt now starts with "(venv)". Whenever you return to the project and open a new terminal, execute the setup script again before you start.

If you already set up a different virtual environment, you might encounter issues. Make sure to use a clean shell.

Running the code

To run the script, first make sure that the virtual environment is active by sourcing the setup.sh script, then execute:

python scripts/bsm4tops.py ../data/simple/unweighted_events.root

This only runs a plotting routine based on seaborn, which shows the distributions of the dataset's features for the two classes "resonance: 1" and "resonance: 0" as well as the correlations between the features.

The code has several command line arguments you can use to run specific classifiers and test their performance. To learn about them, check out the --help command line argument:

python scripts/bsm4tops.py --help

A very simple classifier

To begin with, we will set aside machine learning and focus on human learning instead. It is now your task to implement a classifier.

Have a look at the runSimpleClassifier(X_train, X_test, y_train, y_test) function in the script. A simple requirement based on the transverse momentum of top quarks is used to decide whether they originate from the resonance or from associated production.
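To illustrate the idea of such a requirement, here is a minimal sketch of a cut-based classifier. It is not the code from runSimpleClassifier: the function names, the threshold of 200 (assumed to be in GeV), the direction of the cut (high transverse momentum means "resonance") and the toy values are all assumptions made for this example.

```python
def classify_by_pt(pt_values, threshold=200.0):
    """Label a top quark 1 (resonance) if its pT exceeds the threshold, else 0.

    The threshold and the direction of the cut are illustrative assumptions.
    """
    return [1 if pt > threshold else 0 for pt in pt_values]


def accuracy(predicted, truth):
    """Fraction of top quarks whose predicted label matches the true label."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Toy transverse momenta and true labels (made up, not taken from the dataset).
toy_pt = [50.0, 120.0, 250.0, 400.0, 180.0, 310.0]
toy_labels = [0, 0, 1, 1, 1, 0]

toy_pred = classify_by_pt(toy_pt, threshold=200.0)
print("predicted labels:", toy_pred)
print("accuracy:", accuracy(toy_pred, toy_labels))
```

Changing the threshold, or replacing the transverse momentum with another feature, is all it takes to try a different criterion.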

Let's quantify the performance. A useful measure is the so-called Receiver Operating Characteristic (ROC) curve.

Read the Wikipedia article: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Then have a look at the ROC curve and the area-under-curve (AUC) measure in plots/simple_ROC_test.png. The performance is poor: only slightly better than random guessing.
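To make the ROC curve and the area under it more concrete, the following sketch computes both by hand for a handful of made-up scores. It uses only the Python standard library and is not how the script produces its plot; the function names and the toy numbers are assumptions for illustration. An area of 0.5 corresponds to random guessing and 1.0 to perfect separation.

```python
def roc_curve_points(scores, labels):
    """Return (false-positive rates, true-positive rates) for all thresholds.

    Candidates are scanned from the highest to the lowest score; candidates
    with identical scores are added to the curve together.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    previous_score = None
    for i in order:
        # Emit a point each time the score (i.e. the threshold) changes.
        if previous_score is not None and scores[i] != previous_score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        previous_score = scores[i]
    fpr.append(fp / n_neg)
    tpr.append(tp / n_pos)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# Toy classifier scores and true labels (made up for this example).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_truth = [1, 1, 0, 1, 0, 0]

toy_fpr, toy_tpr = roc_curve_points(toy_scores, toy_truth)
toy_auc = area_under_curve(toy_fpr, toy_tpr)
print("AUC:", toy_auc)
```

For a simple cut on one feature, the feature value itself (for example the transverse momentum) can serve as the score, so that scanning the threshold traces out the whole curve.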

Try to come up with a better criterion! You can look at the distributions of the features (this is the "learning" step of this algorithm: you, the human, have to learn a good criterion and then implement it).

Letting the machine do the job

You will notice that there are more options in the script. Two machine-learning-based classifiers are already implemented:

  1. k-nearest-neighbour classification [details]

  2. Boosted Decision Tree classification [details][explanation of how BDTs work]

You can play around with them and check their performance!
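If you are curious what a k-nearest-neighbour classifier does internally, the sketch below shows the core idea in plain Python: a new top quark receives the majority label of the k closest training examples in feature space. This is not the implementation used in the script; the function name, the two features (assumed here to be transverse momentum and pseudorapidity) and the toy values are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set: (pT, eta) pairs with made-up values and labels.
toy_points = [
    (50.0, 0.1), (60.0, -0.5), (70.0, 1.0),
    (300.0, 0.2), (320.0, -0.3), (350.0, 0.8),
]
toy_classes = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_classes, (310.0, 0.0), k=3))
print(knn_predict(toy_points, toy_classes, (55.0, 0.0), k=3))
```

Note that the features in this toy example are not rescaled, so the transverse momentum dominates the distance. Bringing all features to a comparable scale before computing distances is usually worthwhile.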
