Domino Data Lab provides a collection of open-source solutions called Domino Reference Projects. These projects are freely available and were built with the following goals:
- To educate you about a specific data science topic.
- To accomplish a specific analytical method or task in Domino, including relevant best practices.
- To provide an easy way to share pre-built assets such as Launchers, Scheduled Jobs, Apps, and Endpoints.
- To provide end-to-end implementations that new team members can use to get experience with the platform while onboarding.
All the projects follow a common pattern: a use case developed in Python or R. The datasets they use are freely available collections of data that are either bundled with the reference project or can be downloaded from external sources.
Typically, the projects contain a Jupyter notebook, which provides background and context for the use case. Most of the projects also include the relevant scripts for operationalization (such as model retraining job scripts, Model API scripts, and web applications). The projects and all accompanying assets are available on GitHub.
The following table lists the reference projects that are currently available.
| Project Name | Brief description | GitHub Link |
| --- | --- | --- |
| Credit Card Fraud Detection | Uses XGBoost to detect credit card transaction fraud | |
| Named Entity Recognition | Locates and classifies named entities with a BiLSTM-CRF model | |
The GitHub repositories include instructions about how to use the project assets and how to create a dedicated compute environment, if needed.
To bring the project assets into your Domino installation, import the GitHub repository or use a Git-based project.
Credit card fraud represents a significant problem for financial institutions, and reliable fraud detection is generally challenging. You can use this project as a template to facilitate training a machine learning model on a real-world credit card fraud dataset. It employs techniques such as oversampling and threshold moving to address class imbalance.
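To make the two techniques concrete, here is a minimal sketch of oversampling with SMOTE and threshold moving, using synthetic imbalanced data in place of the real dataset; all names and hyperparameters are illustrative and do not come from the project's code.

```python
# A minimal sketch of oversampling and threshold moving, assuming an
# imbalanced binary classification problem. Synthetic data stands in
# for the real credit card dataset; all names here are illustrative.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# ~2% positive (fraud) class, mimicking a heavily imbalanced dataset.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversampling: SMOTE synthesizes minority-class samples so the model
# trains on a balanced set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)

# Threshold moving: rather than the default 0.5 cut-off, choose the
# probability threshold that maximizes F1 on held-out data.
probs = model.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
best_t = thresholds[int(np.argmax([f1_score(y_test, probs >= t)
                                   for t in thresholds]))]
print(f"selected decision threshold: {best_t:.2f}")
```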
The dataset used in this project was collected as part of a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles. You can download the raw data from Kaggle.
The following assets are included in the project:
- `FraudDetection.ipynb` - A notebook that performs exploratory data analysis, data wrangling, hyperparameter optimization, model training, and evaluation. The notebook introduces the use case and describes the key techniques needed to implement a classification model (such as oversampling and threshold moving).
- `model_train.py` - A training script that can be operationalized to retrain the model on demand or on a schedule. You can use the script as a template. The key elements that must be customized for other datasets are:
  - `load_data` - the data ingestion function
  - `feature_eng` - the data wrangling logic
  - `xgboost_search` - more specifically, the values in `params` that define the grid search scope
- `model_api.py` - A scoring function that exposes the persisted model as a Model API. The score function accepts all independent parameters of the dataset as arguments and uses the model to compute the fraud probability for the individual transaction (a minimal sketch of this pattern follows this list).
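For orientation, the following sketch shows the general shape of such a scoring script. The artifact name `model.pkl` and the argument list are assumptions made for illustration, not the project's actual interface.

```python
# Illustrative sketch of a Model API scoring script; the persisted
# model path and exact signature are assumptions, not the project's
# actual code.
import pickle

import numpy as np

with open("model.pkl", "rb") as f:  # hypothetical artifact name
    model = pickle.load(f)

def score(time, amount, *v_features):
    """Compute the fraud probability for a single transaction.

    In the real script, every independent variable of the dataset is
    a named argument; *v_features stands in for the anonymized
    V1-V28 features here to keep the sketch short.
    """
    row = np.array([[time, *v_features, amount]])
    proba = model.predict_proba(row)[:, 1]
    return {"fraud_probability": float(proba[0])}
```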
This project uses Python packages that are not included in the Domino standard environments: `imblearn` and `xgboost`. You can customize a copy of the Domino Standard Environment or create a new environment with the Dockerfile instructions in the `README.md` file of the project.
Named Entity Recognition (NER) is a natural language processing (NLP) problem that involves locating and classifying named entities (people, places, organizations, and so on) mentioned in unstructured text. NER is used in many NLP applications, such as machine translation, information retrieval, and chatbots. In this project, we fit a BiLSTM-CRF model using a freely available annotated corpus and Keras.
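For orientation, the following sketch shows the typical shape of a BiLSTM-CRF in Keras using the CRF layer from `keras-contrib`. The layer sizes, vocabulary size, and tag count below are placeholder assumptions, not the project's actual hyperparameters.

```python
# A minimal BiLSTM-CRF sketch with the keras-contrib CRF layer. All
# sizes below are placeholder assumptions; the project sets its own
# values in ner.ipynb / model_train.py.
from keras.layers import (Bidirectional, Dense, Embedding, Input,
                          LSTM, TimeDistributed)
from keras.models import Model
from keras_contrib.layers import CRF

MAX_LEN = 50     # assumed padded sentence length
N_WORDS = 20000  # assumed vocabulary size
N_TAGS = 17      # assumed number of IOB tags

inputs = Input(shape=(MAX_LEN,))
x = Embedding(input_dim=N_WORDS, output_dim=50)(inputs)
# The BiLSTM reads each sentence in both directions and emits one
# feature vector per token.
x = Bidirectional(LSTM(units=100, return_sequences=True,
                       recurrent_dropout=0.1))(x)
x = TimeDistributed(Dense(50, activation="relu"))(x)
# The CRF layer models transitions between tags, so predictions form
# a coherent IOB sequence instead of independent per-token labels.
crf = CRF(N_TAGS)
outputs = crf(x)

model = Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss=crf.loss_function,
              metrics=[crf.accuracy])
model.summary()
```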
This project uses the Annotated Corpus for Named Entity Recognition dataset. This dataset is based on the GMB (Groningen Meaning Bank) corpus and has been tagged, annotated, and built specifically to train a classifier to predict named entities such as names and locations.
The assets included in the project are:
- `ner.ipynb` - A notebook that performs exploratory data analysis, data wrangling, hyperparameter optimization, model training, and evaluation. The notebook introduces the use case and describes the key techniques needed to implement an NER classification model.
- `model_train.py` - A training script that can be operationalized to retrain the model on demand or on a schedule. You can use the script as a template. The key elements that must be customized for other datasets are:
  - `load_data` - the data ingestion function
  - `pre_process` - the data wrangling logic

  Most of the important parameters are controlled through command-line arguments to the script.
- `model_api.py` - A scoring function that exposes the persisted model as a Model API. The score function accepts a string of plain text and outputs the tokenized version of the text with the corresponding IOB tags (a minimal sketch of this pattern follows this list).
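As with the fraud project, here is a rough sketch of what such a scoring function might look like. The artifact file names, the whitespace tokenizer, and the mapping objects are all assumptions made for illustration.

```python
# Illustrative sketch of an NER scoring script; artifact names and the
# simple whitespace tokenizer are assumptions, not the project's code.
import pickle

import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

MAX_LEN = 50  # assumed: must match the padding length used in training

# A model containing a CRF layer needs its custom objects at load time.
model = load_model("ner_model.h5",  # hypothetical artifact name
                   custom_objects={"CRF": CRF, "crf_loss": crf_loss,
                                   "crf_viterbi_accuracy": crf_viterbi_accuracy})
with open("mappings.pkl", "rb") as f:  # hypothetical artifact name
    word2idx, idx2tag = pickle.load(f)

def score(text):
    """Tokenize plain text and return (token, IOB tag) pairs."""
    tokens = text.split()  # the real script may use a proper tokenizer
    seq = [word2idx.get(t, word2idx["UNK"]) for t in tokens]
    padded = pad_sequences([seq], maxlen=MAX_LEN,
                           padding="post", value=word2idx["PAD"])
    pred = model.predict(padded)              # shape: (1, MAX_LEN, n_tags)
    tag_ids = np.argmax(pred, axis=-1)[0][:len(tokens)]
    return [{"token": t, "tag": idx2tag[int(i)]}
            for t, i in zip(tokens, tag_ids)]
```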
This project uses Python packages that are not included in the Domino standard environments: `plot-keras-history` and `keras-contrib`. You can customize a copy of the Domino Standard Environment or create a new environment with the Dockerfile instructions in the `README.md` file of the project.