forked from harestakesyan/Team2_final-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTeam2_TopicProposal.rmd
52 lines (35 loc) · 3.21 KB
/
Team2_TopicProposal.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
# Team 2 - Topic Proposal
**Hovhannes Arestakesyan, Zhe Yu, Chenxu Wang, Karina Martinez, Timberley Thompson**
Date: 18-Jul-2022
**1. Research Topic:**
The topic of our research is cardiovascular disease. According to the American Heart Association, cardiovascular disease can refer to a number of conditions, including heart disease, heart attack, stroke, heart failure, arrhythmia, and heart valve problems. Heart disease and stroke are the leading causes of death worldwide, ranked first and second respectively. The World Heath Organization cites unhealthy diet, physical inactivity, tobacco and alcohol use as the most important behavioral risk factors of heart disease and stroke risk. As a result, patients with cardiovascular disease often present with raised blood pressure, raised blood glucose, raised blood lipids, and BMI in the overweight or obese range.
**2. SMART question(s):**
* What are the top three factors associated with cardiovascular disease? Does this agree with the risk factors observed by the World Health Organization?
* What recommendations can be made to reduce cardiovascular risk and prevent heart attacks and strokes?
* Can the model predict the probability that a patient has cardiovascular disease?
* Is there any association between an individual's sex and whether they has the cardiovascular disease ? Are men more prone to cardiovascular diseases as compared to women?
**3. Source of your data set and how many (roughly) observations:**
The Cardiovascular Disease dataset was obtained from Kaggle (https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset). There are 70000 total values (observations) in 11 features, such as age, gender, systolic blood pressure, diastolic blood pressure, and etc. The target class “cardio” equals to 1, when patient has cardiovascular disease, and it’s 0, if patient is healthy.
**4. Link to GitHub repo:**
https://github.com/harestakesyan/Team2_final-project
# **Professor comments**
1. Top 3 factors. Do you mean according to a particular model? And you are looking at feature importance? Just something to think about.
2. Again, be a little more specific. Do you mean recommendations like "screen early", "eat healthy", "do exercise". Are these the kind of recommendations you have in mind? Are they supported in your dataset?
3. Instead of "Can the model predict", it's probably more like "How accurately can the model predict"
# **Plan:**
## EDA
1. Dataframe structure, variable types, summary...
2. Check the missing values
3. Outlier: box graph
4. Whether the data is normally distributed?
5. Normalization?
6. General information we can get from each variable
## Modeling
**Question 1:** Top 3 factors: feature importance?
feature selection, the Leaps package or the Bestglm package?
**Question 3:**. How accurately can the model predict:
which model should we choose? Or choose multiple models and compare their accuracy?
**Linear Regression**:can do correlation but y must be continuous variable, can answer the question like whether body weight is correlated with cholesterol?
**Logistic Regression**: perdition?
**KNN**:perdition?
How to compare the accuracy: Confusion matrix and ROC curve of the models?