
Commit

Merge pull request #52 from UBC-MDS/update_docker_instructions_in_readme
Update docker instructions in readme
RussDim authored Dec 10, 2022
2 parents 4019eae + 1cacd60 commit b506919
Showing 2 changed files with 113 additions and 0 deletions.
18 changes: 18 additions & 0 deletions README.md
@@ -118,6 +118,24 @@ There are two methods to replicate this analysis:
```
### Usage with Docker
The Docker image for rendering the results of this project is hosted on
[Docker Hub](https://hub.docker.com/repository/docker/creditapprovalprediction/credit_approval_prediction).
To run the image, first install Docker Desktop.
Then, from the folder where you cloned the project repository, run the following in a terminal:
```
docker run --rm -v /$(pwd):/home/credit_approval_prediction creditapprovalprediction/credit_approval_prediction make -C /home/credit_approval_prediction all
```
To clean up the output generated by the above command, run:
```
docker run --rm -v /$(pwd):/home/credit_approval_prediction creditapprovalprediction/credit_approval_prediction make -C /home/credit_approval_prediction clean
```
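If the containerized build needs debugging, it can also help to open an interactive shell inside the image (a sketch; this assumes the image provides `bash`):
```
docker run --rm -it -v /$(pwd):/home/credit_approval_prediction creditapprovalprediction/credit_approval_prediction bash
```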
## Makefile Dependencies Graph
![makefile](https://github.com/UBC-MDS/Credit_Approval_Prediction/blob/014b7405e9f88d85d87cf94eb2b099ec94611d55/Makefile.png)
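If the project's dependencies are installed locally, the same Make targets used in the Docker commands above can, in principle, be run directly from the repository root (a sketch; assumes GNU Make and all required software are available on your machine):
```
# Run the full analysis pipeline
make all

# Remove all generated outputs
make clean
```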
95 changes: 95 additions & 0 deletions doc/Proposal.md
@@ -0,0 +1,95 @@
# Credit Approval Prediction

## Contributors:

- Spencer Gerlach
- Ruslan Dimitrov
- Daniel Merigo
- Mengjun Chen

This was a data analysis term project completed for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.

## Introduction

The overall goal of this project was to use a publicly available dataset to answer a question about the data, and to automate the data science workflow associated with the analysis.

This data analysis project includes an analysis of the [Credit Approval dataset](https://archive-beta.ics.uci.edu/dataset/27/credit+approval), made publicly available via the UC Irvine Machine Learning Repository.

The project included the following major deliverables:

- Writing 4-5 R/Python scripts,
- Creating a reproducible report in Jupyter Lab or R Markdown,
- Automating the analysis workflow using `GNU Make`

## Exploratory Data Analysis

The [Credit Approval dataset](https://archive-beta.ics.uci.edu/dataset/27/credit+approval) offers a good selection of features upon which to build a simple, automated machine learning and statistical exercise. It contains records from Japanese credit card screening of applications. All attribute names and values have been anonymized to protect the confidentiality of the applicants; a high-level characterization of the features is available at the dataset page linked above. The raw dataset contains a mixture of categorical and numeric features named A1-A16, where the target feature A16 takes the values `+` or `-`, indicating whether the applicant was approved.

An EDA analysis, [linked here](https://github.com/UBC-MDS/Credit_Approval_Prediction/blob/main/src/Exploratory_Data_Analysis.ipynb), was conducted to investigate the contents of the dataset, relabel and remove missing values, visualize the distribution of various feature values, and to detect any existing correlation between numeric features.

The Credit Approval dataset is anonymized, so information gleaned from the EDA can only tell us which features (A1-A16) may or may not be important when predicting the target, and which features may be correlated or distributed according to certain known distributions. We are not able to apply any real-world contextual background or domain knowledge to the dataset without labelled feature names.

The EDA generated the following conclusions about the dataset:
- There are 690 rows in the original dataset, 522 of which will be used to train the ML models after an 80%/20% train-test data split (as sketched below). Some rows contain missing values, which are replaced or filtered during the EDA.
- The dataset has 16 columns: 6 numeric and 10 categorical.
- Numeric columns will require scaling during the preprocessing stage of model creation.
- No significant correlation was found between any pair of numeric features in the dataset.
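
A minimal sketch of the split and column typing described above (illustrative only, not the project's actual EDA code; the file path, random seed, and missing-value handling are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The raw file has no header row; columns follow the anonymized A1-A16 convention,
# and missing values are recorded as "?"
columns = [f"A{i}" for i in range(1, 17)]
df = pd.read_csv("data/raw/crx.csv", names=columns, na_values="?")

# One simple way to handle missing values is to drop incomplete rows
df = df.dropna()

# 80%/20% train-test split, stratified on the target A16 (+/-)
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=522, stratify=df["A16"]
)

# Separate numeric and categorical feature names for later preprocessing
features = train_df.drop(columns=["A16"])
numeric_cols = features.select_dtypes("number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
```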

## Analysis Question

This analysis will focus on predicting whether a credit card applicant will be approved or not approved based on a set of features describing that applicant. The models will be trained on the training portion of the initial dataset (defined during the EDA phase) and evaluated against the smaller test portion.

Specifically, our analysis prediction question is:

> "Given features about a credit card applicant, will the applicant be approved for a credit card?"

In our predictive study, we will evaluate the prediction accuracy of several simple machine learning models. After splitting the data into train and test sets during EDA and preprocessing the features, we will train and evaluate the following models:

- Support Vector Machine Classifier (RBF Kernel), which we will refer to as `SVC`
- k-Nearest Neighbours model, which we will refer to as `kNN`
- Logistic Regression model, which we will refer to as `Logistic Regression`

These models will undergo hyperparameter optimization, and the optimized models will be scored against the test data.
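
Continuing from the EDA sketch above, here is a minimal illustration of how this comparison might be wired up with scikit-learn (a sketch only; the actual scripts, preprocessing choices, and hyperparameter grids may differ):

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Scale numeric features and one-hot encode categorical ones
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), categorical_cols),
)

# Candidate models with illustrative hyperparameter grids (values are assumptions)
models = {
    "SVC": (SVC(), {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}),
    "kNN": (KNeighborsClassifier(), {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]}),
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"logisticregression__C": [0.01, 0.1, 1, 10]},
    ),
}

# numeric_cols, categorical_cols, train_df and test_df come from the EDA sketch above
X_train, y_train = train_df.drop(columns=["A16"]), train_df["A16"]
X_test, y_test = test_df.drop(columns=["A16"]), test_df["A16"]

for name, (model, grid) in models.items():
    pipe = make_pipeline(preprocessor, model)
    search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.score(X_test, y_test), 3))
```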

## Report

The final report will be linked here once completed.

## Usage

In order to replicate this analysis:

1. Clone this repo, following the [cloning a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) documentation if required.

2. Navigate to this repository and ensure it is your current working directory.

3. Run the following command in your terminal:

```
python src/download_data.py --url="https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data" --out_path=<supply an output location> [--filename=<supply a suitable filename>]
```

The `--filename` argument is optional; if it is not supplied, it defaults to `crx.csv`.

This ensures that, by default, the data file is downloaded and converted to a CSV file that the analysis scripts can read.
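
For example (the `data/raw` output location below is purely illustrative; use whatever path suits your setup):

```
python src/download_data.py --url="https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data" --out_path=data/raw --filename=crx.csv
```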

## Dependencies

- ipykernel
- ipython>=7.15
- vega_datasets
- altair_saver
- selenium<4.3.0
- pandas<1.5
- pip
- docopt=0.6.2
- requests=2.22.0
- feather-format=0.4.0
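
If you are not using the Docker image, one way to install these dependencies is with `pip` (a sketch only; the project may instead provide a conda environment file, and the conda-style pins above translate to `==` pins in pip):

```
pip install ipykernel "ipython>=7.15" vega_datasets altair_saver "selenium<4.3.0" "pandas<1.5" docopt==0.6.2 requests==2.22.0 feather-format==0.4.0
```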

## Licenses

The Credit Approval materials here are licensed under the MIT License and the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA). If re-using or re-mixing, please provide attribution and a link to this page.

The license information can be viewed in the `LICENSE` file found in the root directory of this repository.

## Attribution

The download script `src/download_data.py` is based on the `download_data.py` script created by Tiffany Timbers (2019-12-18), which can be found [here](https://github.com/ttimbers/breast_cancer_predictor/blob/master/src/download_data.py).
