Model Training and Evaluation for Higgs Dataset
This repository demonstrates training and evaluating a Keras model using the Higgs dataset available from the UCI ML Repository.
The dataset has been studied in the following publication: Baldi, P., Sadowski, P., & Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 4308.
The ML pipeline includes downloading the dataset, data preparation, model training, evaluation, feature importance analysis, and visualization of results. Dask is utilised to handle this large dataset with parallel processing.
Set up a virtual environment and install the dependencies:

```sh
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
pip install -r requirements.txt
```
The Higgs dataset can be downloaded and prepared using the provided scripts in separate steps:

- `download_data.py`: downloads the zip archive (~ 2.6 GB)
- `data_extraction.py`: extracts and decompresses the CSV (~ 7 GB)
- `data_preparation.py`: writes the prepared datasets (test dataset: ~ 240 MB, training dataset: ~ 5 GB)

Alternatively, you can run the main script directly from `data/src/main.py`:
```sh
python data/src/main.py
```
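For reference, a minimal sketch of what such an entry point could look like, assuming the helpers shown in the following sections (`downloadDataset`, `extractZippedData`, `decompressGzFile`, `prepareData`, `cleanUp`) are importable from their scripts; the actual `main.py` may be organised differently:

```python
# Hypothetical orchestration of the full pipeline; module names are assumptions.
import os

from download_data import downloadDataset, cleanUp
from data_extraction import extractZippedData, decompressGzFile
from data_preparation import prepareData

def main():
    baseDir = '../higgs'
    zipPath = os.path.join(baseDir, 'higgs.zip')
    gzCsvPath = os.path.join(baseDir, 'higgs.csv.gz')
    csvPath = os.path.join(baseDir, 'higgs.csv')

    downloadDataset('https://archive.ics.uci.edu/static/public/280/higgs.zip', zipPath)
    extractZippedData(zipPath, baseDir)
    decompressGzFile(gzCsvPath, csvPath)
    prepareData(csvPath, os.path.join(baseDir, 'prepared-higgs.csv'))

    # Remove the intermediate artifacts once the prepared CSVs exist
    for path in (zipPath, gzCsvPath, csvPath):
        cleanUp(path)

if __name__ == '__main__':
    main()
```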
`download_data.py` downloads the dataset file from the specified URL with a progress bar.

```sh
python data/download_data.py
```

```python
zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip'  # Higgs dataset URL
zipPath = '../higgs/higgs.zip'

downloadDataset(zipDataUrl, zipPath)
cleanUp(zipPath)  # Clean up downloaded zip file (~ 2.6 GB)
```
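For illustration, a minimal sketch of such a downloader, assuming `requests` and `tqdm` are available; the shipped `downloadDataset` may be implemented differently:

```python
import os
import requests
from tqdm import tqdm

def downloadDataset(url: str, destPath: str, chunkSize: int = 1024 * 1024) -> None:
    """Stream a file from `url` to `destPath` with a progress bar."""
    os.makedirs(os.path.dirname(destPath), exist_ok=True)
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    totalBytes = int(response.headers.get('content-length', 0))
    with open(destPath, 'wb') as f, tqdm(total=totalBytes, unit='B', unit_scale=True) as bar:
        for chunk in response.iter_content(chunk_size=chunkSize):
            f.write(chunk)          # write each streamed chunk to disk
            bar.update(len(chunk))  # advance the progress bar
```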
`data_extraction.py` extracts the contents of the zip archive and decompresses the `.gz` dataset file to a specified output path.

```sh
python data/data_extraction.py
```

```python
import os

zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip'  # Higgs dataset URL
extractTo = '../higgs'
zipPath = os.path.join(extractTo, 'higgs.zip')
gzCsvPath = os.path.join(extractTo, 'higgs.csv.gz')
finalCsvPath = os.path.join(extractTo, 'higgs.csv')

extractZippedData(zipPath, extractTo)
decompressGzFile(gzCsvPath, finalCsvPath)
cleanUp(gzCsvPath)  # Clean up gzipped file (~ 2.6 GB)
```
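For illustration, a minimal sketch of the two helpers using only the standard library (`zipfile`, `gzip`, `shutil`); the shipped implementations may differ:

```python
import gzip
import shutil
import zipfile

def extractZippedData(zipPath: str, extractTo: str) -> None:
    """Extract every member of the zip archive into `extractTo`."""
    with zipfile.ZipFile(zipPath, 'r') as archive:
        archive.extractall(extractTo)

def decompressGzFile(gzPath: str, outPath: str) -> None:
    """Stream-decompress a .gz file without loading it fully into memory."""
    with gzip.open(gzPath, 'rb') as src, open(outPath, 'wb') as dst:
        shutil.copyfileobj(src, dst)
```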
`data_preparation.py` sets the column names and separates the test set from the training data based on the dataset description (the last 500,000 rows form the test set).

Dataset description: the first column is the class label (1 for signal, 0 for background), followed by 28 features (21 low-level features, then 7 high-level features). The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21.
```sh
python data/data_preparation.py
```

```python
import os

prepareFrom = '../higgs'
csvPath = os.path.join(prepareFrom, 'higgs.csv')
preparedCsvPath = os.path.join(prepareFrom, 'prepared-higgs.csv')

prepareData(csvPath, preparedCsvPath)
cleanUp(csvPath)  # Clean up extracted CSV file (~ 7.5 GB)
```
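For illustration, a minimal sketch of the preparation step with pandas. The column names and the 500,000-row test split follow the dataset description above, but the real `prepareData` may differ (for a file this size, chunked I/O or Dask would be preferable):

```python
import pandas as pd

# 1 label column followed by 21 low-level and 7 high-level features
COLUMNS = ['label'] + [f'feature_{i}' for i in range(1, 29)]
TEST_ROWS = 500_000  # the dataset description reserves the last 500,000 rows for testing

def prepareData(csvPath: str, preparedCsvPath: str) -> None:
    # Note: reading the full ~7.5 GB CSV at once needs ample memory
    df = pd.read_csv(csvPath, header=None, names=COLUMNS)
    df.iloc[:-TEST_ROWS].to_csv(preparedCsvPath.replace('.csv', '_train.csv'), index=False)
    df.iloc[-TEST_ROWS:].to_csv(preparedCsvPath.replace('.csv', '_test.csv'), index=False)
```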
Use the `dataLoader/data_loader.py` script to load the prepared dataset into a Pandas DataFrame.

```sh
python data/src/data_loader.py
```

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

dataLoader = DataLoader(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```
Use the `dataLoader/data_loader_dask.py` script to load the prepared dataset into a Dask DataFrame, which is beneficial for a dataset of this size.

```sh
python data/src/data_loader_dask.py
```

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

dataLoader = DataLoaderDask(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```
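For illustration, a sketch of what the Dask-backed loader could look like; `dask.dataframe.read_csv` splits the file into partitions so subsequent operations run in parallel (the actual `DataLoaderDask` may differ, and the Pandas `DataLoader` is analogous with `pd.read_csv`):

```python
import dask.dataframe as dd

class DataLoaderDask:
    def __init__(self, filepath: str):
        self.filepath = filepath

    def loadData(self) -> dd.DataFrame:
        # blocksize controls partition size; smaller blocks mean more parallelism
        return dd.read_csv(self.filepath, blocksize='64MB')

    def previewData(self, dataFrame: dd.DataFrame) -> None:
        # head() only computes the first partition, so it stays cheap
        print(dataFrame.head())
```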
The `exploration/eda.py` script provides various functions for performing EDA, including visualising correlations, checking missing values, and plotting feature distributions. The data analysis plots are saved under `eda/plots`.
```sh
python exploration/eda.py
```

```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # or prepared-higgs_test.csv

# Using a Dask data frame
dataLoaderDask = DataLoaderDask(filepath)
dataFrame = dataLoaderDask.loadData()

eda = EDA(dataFrame)
eda.describeData()
eda.checkMissingValues()
eda.visualiseFeatureCorrelation()
eda.visualizeTargetDistribution()
eda.visualizeFeatureDistribution('feature_1')
eda.visualizeAllFeatureDistributions()
eda.visualizeFeatureScatter('feature_1', 'feature_2')
eda.visualizeFeatureBoxplot('feature_2')
```
The model is defined using Keras with a default architecture for binary classification. You can customise the architecture by providing a different `modelBuilder` callable to the `ModelTrainer` class.

The trained models and training loss plots are saved under `kerasModel/trainer/trainedModels`.
```sh
python kerasModel/trainer/model_trainer.py
```

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

filePath = '../../data/higgs/prepared-higgs_train.csv'

def customModel(inputShape: int) -> Model:
    """Example of a custom model builder function for classification"""
    model = keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
    ])
    return model

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

# Optional: define training/compilation parameters as a dictionary
# and pass it to the class constructor
params = {
    "epochs": 10,
    "batchSize": 32,
    "minSampleSize": 100000,
    "learningRate": 0.001,
    "modelBuilder": customModel,  # callable
    "loss": 'binary_crossentropy',
    "metrics": ['accuracy']
}

trainer = ModelTrainer(dataFrame, params)
trainer.trainKerasModel()  # Optional: train on a sample with trainKerasModel(sample=True, frac=0.1)
trainer.plotTrainingHistory()
```
The evaluation script computes standard classification metrics and produces accompanying visualizations. The evaluation results are logged and saved to a file under `kerasModel/evaluator/evaluationPlots`.
```sh
python kerasModel/evaluator/model_evaluator.py
```

```python
modelPath = '../trainer/trainedModels/keras_model_trained_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_train.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = ModelEvaluator(modelPath, dataFrame)
evaluator.evaluate()
```
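For illustration, a sketch of the kind of computation such an evaluator performs, assuming scikit-learn metrics over thresholded Keras predictions; the actual `ModelEvaluator` may compute a different set of metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from tensorflow import keras

def evaluateModel(modelPath: str, X: np.ndarray, y: np.ndarray) -> dict:
    model = keras.models.load_model(modelPath)
    probs = model.predict(X, batch_size=4096).ravel()  # sigmoid outputs in [0, 1]
    preds = (probs >= 0.5).astype(int)                 # threshold at 0.5
    return {
        'accuracy': accuracy_score(y, preds),
        'auc': roc_auc_score(y, probs),
        'confusionMatrix': confusion_matrix(y, preds),
    }
```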
Feature importance is computed using permutation importance and visualised as a bar chart. It is implemented twice: once with Pandas (using scikit-learn) and once with Dask for parallel processing. The chart and the result CSV file are saved under `kerasModel/featureImportance/featureImportancePlots`.
```sh
python kerasModel/featureImportance/feature_importance.py
```

```python
modelPath = '../trainer/trainedModels/keras_model_test_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_test.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = FeatureImportanceEvaluator(modelPath, dataFrame)
evaluator.evaluate()

# Alternatively, with sampling and repeated permutations
evaluator = FeatureImportanceEvaluator(modelPath, dataFrame, sampleFraction=0.1, nRepeats=32)
evaluator.evaluate(withDask=False)  # with pandas
```
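For reference, permutation importance boils down to shuffling one feature column at a time and measuring how far a baseline metric drops; a minimal sketch of that idea (the `FeatureImportanceEvaluator` wraps it with scikit-learn and Dask):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutationImportance(model, X: np.ndarray, y: np.ndarray,
                          nRepeats: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict(X).ravel())
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                   # permute one feature at a time
        drops = []
        for _ in range(nRepeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature/target link
            drops.append(baseline - roc_auc_score(y, model.predict(Xp).ravel()))
        importances[j] = np.mean(drops)           # larger drop = more important feature
    return importances
```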