Depression is a growing concern, particularly among students and young professionals who often face significant academic, professional, and personal pressures. It is critical to identify individuals at risk early on to provide timely support and interventions. In this project we use data analysis and machine learning models to explore various factors potentially associated with depression, including academic pressure, sleep patterns, dietary habits, work-study hours, and demographic characteristics. By examining these factors, the aim is to understand the underlying relationships and identify patterns that can contribute to better mental health awareness and support systems.
The primary goal of this project is to identify key factors associated with depression among students and young professionals and to build predictive models that can accurately identify individuals at risk. By analyzing variables like academic pressure, sleep patterns, dietary habits, work hours, and demographic characteristics, we seek to determine which factors are most strongly related to depression risk. While the focus is on prediction rather than inference, this knowledge can be used to create early warning systems and design interventions to address mental health issues effectively.
In this project, recall is prioritized over precision as the primary evaluation metric. Recall measures the proportion of actual cases of depression that are correctly identified by the model. This is crucial because missing someone who is truly at risk of depression (a false negative) could lead to severe consequences, such as not receiving the help they urgently need. While false positives (classifying someone as depressed when they are not) are not desirable either, they are far less harmful compared to false negatives. By focusing on recall, in this project we aim to ensure that as many at-risk individuals as possible are correctly identified.
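To make this concrete, here is a minimal sketch, using purely hypothetical confusion-matrix counts, of how recall and precision are computed:

```r
# Toy confusion-matrix counts -- hypothetical numbers for illustration only
tp <- 80  # truly depressed, flagged by the model
fn <- 20  # truly depressed, missed (the costly error)
fp <- 30  # not depressed, flagged anyway
recall    <- tp / (tp + fn)  # 0.80: share of at-risk individuals we catch
precision <- tp / (tp + fp)  # ~0.73: share of flags that are correct
```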
We start by loading the Student Depression Dataset, downloaded from kaggle.com.
Source: Kaggle - Student Depression Dataset
data <- read.csv("student_depression_dataset.csv")
str(data)
## 'data.frame': 27901 obs. of 18 variables:
## $ id : int 2 8 26 30 32 33 52 56 59 62 ...
## $ Gender : chr "Male" "Female" "Male" "Female" ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ City : chr "Visakhapatnam" "Bangalore" "Srinagar" "Varanasi" ...
## $ Profession : chr "Student" "Student" "Student" "Student" ...
## $ Academic.Pressure : num 5 2 3 3 4 2 3 2 3 2 ...
## $ Work.Pressure : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : num 2 5 5 2 3 3 4 4 1 3 ...
## $ Job.Satisfaction : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sleep.Duration : chr "'5-6 hours'" "'5-6 hours'" "'Less than 5 hours'" "'7-8 hours'" ...
## $ Dietary.Habits : chr "Healthy" "Moderate" "Healthy" "Moderate" ...
## $ Degree : chr "B.Pharm" "BSc" "BA" "BCA" ...
## $ Have.you.ever.had.suicidal.thoughts..: chr "Yes" "No" "No" "Yes" ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : chr "1.0" "2.0" "1.0" "5.0" ...
## $ Family.History.of.Mental.Illness : chr "No" "Yes" "Yes" "Yes" ...
## $ Depression : int 1 0 0 1 0 0 0 0 1 1 ...
This dataset comprises 27,901 observations and 18 variables from a study of mental health and depression, particularly among students. The target variable is Depression, which is binary (1 = depressed, 0 = not depressed); the remaining variables serve as predictors. The numeric variables are id, Age, Academic.Pressure, Work.Pressure, CGPA, Study.Satisfaction, Job.Satisfaction, Work.Study.Hours, and Depression. The categorical (character) variables are Gender, City, Profession, Sleep.Duration, Dietary.Habits, Degree, Have.you.ever.had.suicidal.thoughts.., Financial.Stress, and Family.History.of.Mental.Illness. Note that Financial.Stress represents a 1–5 numeric scale but is stored as character, which we will address later.
Now, we are going to look for any missing values present in our dataset.
sum(is.na(data))
## [1] 0
sum(data == "?")
## [1] 3
sum(data$Financial.Stress == "?")
## [1] 3
The dataset contains a small number of missing values in the form of “?”. Specifically, there are three instances of “?” in the column Financial.Stress. Since this is a very small proportion of the dataset, these observations will be removed to ensure data quality.
The following code was used to remove these observations:
data <- data[data$Financial.Stress != "?", ]
sum(data$Financial.Stress == "?")
## [1] 0
After this operation, the dataset no longer contains any missing values.
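An equivalent approach, sketched here as an alternative rather than what was actually run above, is to declare "?" as an NA marker at load time:

```r
# Alternative sketch: read "?" as NA directly, then drop incomplete rows.
# Assumes the same file and that "?" occurs only in Financial.Stress,
# so the result is equivalent to the filtering above.
data <- read.csv("student_depression_dataset.csv", na.strings = "?")
data <- data[complete.cases(data), ]
```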
Next, we explore the non-zero values in the columns Work.Pressure and Job.Satisfaction, since manual inspection of the dataset showed that they contain only a small number of non-zero values.
sum(data$Work.Pressure != 0)
## [1] 3
sum(data$Job.Satisfaction != 0)
## [1] 8
sum(data$Profession != "Student")
## [1] 31
The results show that the column Work.Pressure has only 3 non-zero values and Job.Satisfaction has only 8, out of the 27,898 remaining observations. This indicates that the vast majority of values in these columns are zero. Similarly, the column Profession contains only 31 entries that are not categorized as ‘Student’. Since we are focusing solely on students, we will retain only the entries where Profession is ‘Student’. As a result, the column Profession no longer holds any explanatory power and will be removed from the dataset.
Moreover, since the number of non-zero values in these columns is extremely low relative to the dataset size, these variables are unlikely to provide meaningful insight. We therefore exclude Work.Pressure, Job.Satisfaction, and Profession from further analysis to focus on variables that contribute more to our objectives.
data$Work.Pressure <- NULL
data$Job.Satisfaction <- NULL
# We further remove rows where Profession is not "Student", as we are interested only in students
data <- data[data$Profession == "Student", ]
data$Profession <- NULL
dim(data)
## [1] 27867 15
The updated dataset, data, now consists of 27,867 observations and 15 variables.
To begin with, we would like to use summary statistics of the dataset. This provides an overview of both numerical and categorical variables, helping us to identify any patterns at a glance.
summary(data)
## id Gender Age City
## Min. : 2 Length:27867 Min. :18.00 Length:27867
## 1st Qu.: 35057 Class :character 1st Qu.:21.00 Class :character
## Median : 70659 Mode :character Median :25.00 Mode :character
## Mean : 70440 Mean :25.82
## 3rd Qu.:105816 3rd Qu.:30.00
## Max. :140699 Max. :59.00
## Academic.Pressure CGPA Study.Satisfaction Sleep.Duration
## Min. :0.000 Min. : 0.000 Min. :0.000 Length:27867
## 1st Qu.:2.000 1st Qu.: 6.290 1st Qu.:2.000 Class :character
## Median :3.000 Median : 7.770 Median :3.000 Mode :character
## Mean :3.141 Mean : 7.656 Mean :2.944
## 3rd Qu.:4.000 3rd Qu.: 8.920 3rd Qu.:4.000
## Max. :5.000 Max. :10.000 Max. :5.000
## Dietary.Habits Degree Have.you.ever.had.suicidal.thoughts..
## Length:27867 Length:27867 Length:27867
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness
## Min. : 0.000 Length:27867 Length:27867
## 1st Qu.: 4.000 Class :character Class :character
## Median : 8.000 Mode :character Mode :character
## Mean : 7.158
## 3rd Qu.:10.000
## Max. :12.000
## Depression
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5852
## 3rd Qu.:1.0000
## Max. :1.0000
From the summary statistics we can describe the variables: Age ranges from 18 to 59 (mean 25.8); Academic.Pressure and Study.Satisfaction are rated on 0–5 scales (means 3.14 and 2.94); CGPA ranges from 0 to 10 (mean 7.66); Work.Study.Hours ranges from 0 to 12 (mean 7.16); and the mean of Depression is 0.585, meaning roughly 58.5% of the students in the sample are labeled as depressed.
Now, we would like to look at the correlation matrix; to include the "Yes"/"No" columns, we first encode them as 1/0.
# We create a copy of the dataset for the correlation matrix, as we do not
# want to alter the main one
data_encoded <- data
binary_columns <- c("Have.you.ever.had.suicidal.thoughts..",
"Family.History.of.Mental.Illness")
data_encoded[binary_columns] <- lapply(data_encoded[binary_columns],
function(x) ifelse(x == "Yes", 1, 0))
numeric_columns <- c("id", "Financial.Stress")
data_encoded[numeric_columns] <- lapply(data_encoded[numeric_columns], as.numeric)
# Now we exclude all the categorical variables
data_encoded <- data_encoded[sapply(data_encoded, is.numeric)]
correlation_matrix <- cor(data_encoded, use = "pairwise.complete.obs")
# Force exact symmetry (guards against floating-point asymmetry)
correlation_matrix <- (correlation_matrix + t(correlation_matrix))/2
corrplot(correlation_matrix,
method = "color",
addCoef.col = 'black',
type = "lower",
diag = FALSE,
number.cex = 0.7,
tl.cex = 0.8, # Text label size
cl.cex = 0.8,
tl.col = "black",
cl.pos = "r",
mar = c(0, 0, 1, 0),
title = "Correlation between variables")
From the correlation plot we can observe the following:
The variable Depression is moderately positively correlated with Financial Stress (0.36), Have you ever had suicidal thoughts.. (0.55), and Academic Pressure (0.47), indicating these factors may be associated with higher depression rates. There is also a weak-to-moderate correlation with Work Study Hours (0.21). Additionally, Depression is negatively correlated with Study Satisfaction (-0.17) and Age (-0.23).
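The Depression correlations quoted above can also be listed directly from the matrix; this sketch assumes `correlation_matrix` from the previous chunk is still in scope:

```r
# Sort all correlations with the target, strongest first
sort(correlation_matrix[, "Depression"], decreasing = TRUE)
```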
Now, we take a deeper look at the main predictors of depression, along with some additional variables.
p1 <- ggplot(data, aes(
x = as.factor(Academic.Pressure),
fill = factor(Depression, levels = c(0, 1), labels = c("No", "Yes"))
)) +
geom_bar(position = "fill") +
labs(
title = "Proportion of Depression by Academic Pressure",
x = "Academic Pressure (0 = Low, 5 = High)",
y = "Proportion",
fill = "Depression"
) +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Pastel1")+
theme_minimal(base_size = 5)
p2 <- ggplot(data, aes(
x = as.factor(Study.Satisfaction),
fill = factor(Depression, levels = c(0, 1), labels = c("No", "Yes"))
)) +
geom_bar(position = "fill") +
labs(
title = "Proportion of Depression by Study Satisfaction",
x = "Study Satisfaction (0 = Low, 5 = High)",
y = "Proportion",
fill = "Depression"
) +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Pastel1") +
theme_minimal(base_size = 5)
# Boxplot for Academic Pressure vs. Depression
brewer_colors <- brewer.pal(3, "Pastel1")
p3 <- ggplot(data, aes(x = as.factor(Depression), y = Academic.Pressure, fill = as.factor(Depression))) +
geom_boxplot() +
labs(
title = "Boxplot of Academic Pressure vs. Depression",
x = "Depression (0 = No, 1 = Yes)",
y = "Academic Pressure",
fill = "Depression"
) +
scale_fill_manual(
values = brewer_colors[1:2], # Select 2 colors from Pastel1
labels = c("0", "1") # Customize the legend labels here
)+
theme_minimal(base_size = 5)
# Boxplot for Financial Stress vs. Depression
p4 <- ggplot(data, aes(x = as.factor(Depression), y = as.numeric(Financial.Stress), fill = as.factor(Depression))) +
geom_boxplot() +
labs(
title = "Boxplot of Financial Stress vs. Depression",
x = "Depression (0 = No, 1 = Yes)",
y = "Financial Stress",
fill = "Depression"
) +
scale_fill_manual(
values = brewer_colors[1:2], # Select 2 colors from Pastel1
labels = c("0", "1") # Customize the legend labels here
) +
theme_minimal(base_size = 5)
grid.arrange(p1, p2, p3, p4, ncol = 2)
Proportional Bar Chart: We can observe a clear upward trend in depression prevalence (blue bars) as academic pressure increases. Students experiencing higher academic pressure (levels 4 and 5) report higher rates of depression.
Boxplot: This complementary view shows that students with depression report markedly higher academic pressure than those without.
From the proportional bar chart we can observe that depression rates decline as study satisfaction increases. Students most satisfied with their studies (level 5) show the lowest prevalence of depression.
From the boxplot we can see that students experiencing depression tend to report higher financial stress levels, suggesting a strong link between financial pressure and mental health challenges.
p5 <- ggplot(data, aes(x = as.factor(Depression), y = Age, fill = Gender)) +
geom_boxplot() +
labs(title = "Age Distribution by Depression Status and Gender",
x = "Depression",
y = "Age") + scale_fill_brewer(palette = "Pastel1") +
theme_minimal(base_size = 5)
sleep <- factor(
data$Sleep.Duration,
levels = c("'Less than 5 hours'", "'5-6 hours'", "'7-8 hours'", "'More than 8 hours'", "Others"))
p6 <- ggplot(data, aes(
x = sleep,
fill = factor(Depression, levels = c(0, 1), labels = c("No", "Yes"))
)) +
geom_bar(position = "fill") +
labs(
title = "Proportion of Depression by Sleep Duration Group",
x = "Sleep Duration (hours)",
y = "Proportion",
fill = "Depression"
) +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Pastel1") +
theme_minimal(base_size = 5)
p7 <- ggplot(data, aes( x = factor(Dietary.Habits), fill = factor(Depression, levels = c(0, 1), labels = c("No", "Yes")) )) + geom_bar(position = "fill") + labs( title = "Depression Proportions by Dietary Habits", x = "Dietary Habits", y = "Proportion", fill = "Depression" ) + scale_y_continuous(labels = scales::percent) + scale_fill_brewer(palette = "Pastel1") + theme_minimal(base_size = 5)
p8 <- ggplot(data, aes( x = factor(Have.you.ever.had.suicidal.thoughts..), fill = factor(Depression, levels = c(0, 1), labels = c("No", "Yes")) )) + geom_bar(position = "fill") + labs( title = "Depression Proportions by Suicidal Thoughts", x = "Suicidal Thoughts (0 = No, 1 = Yes)", y = "Proportion", fill = "Depression" ) + scale_y_continuous(labels = scales::percent) + scale_fill_brewer(palette = "Pastel1") + theme_minimal(base_size = 5)
categorical_grid_top <- grid.arrange(p5, p6, p7,p8, ncol = 2)
In a similar manner, the boxplot shows that the age distributions of depressed and non-depressed students are similar, with slightly more outliers among non-depressed females. No clear age trend is evident, though male students appear slightly older on average.
From the proportional bar chart, we can see that students who sleep more than 8 hours have the lowest depression rates, while short sleep durations (less than 5 hours) are associated with the highest depression prevalence.
This proportional bar chart shows a clear trend: depression rates rise as eating habits worsen. Participants with healthy eating habits report depression about 50% of the time, while nearly three out of four with unhealthy eating habits report depression.
As one might expect, students who have had suicidal thoughts are more likely to report depression. Surprisingly, as shown in the proportional bar chart, roughly 25% of students who report suicidal thoughts do not report depression.
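The percentages discussed above can be verified with simple cross-tabulations; this is a sketch assuming `data` as prepared in the earlier chunks:

```r
# Row-wise proportions: P(Depression | Dietary.Habits) and
# P(Depression | suicidal thoughts)
prop.table(table(data$Dietary.Habits, data$Depression), margin = 1)
prop.table(table(data$Have.you.ever.had.suicidal.thoughts.., data$Depression),
           margin = 1)
```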
We will now take a closer look at the outliers in our dataset, focusing specifically on the numeric columns. To achieve this, we will use the Inter-Quartile Range (IQR) method to calculate and display the upper and lower boundaries for each numeric feature, along with the corresponding outlier values.
# Outliers detection using IQR for numeric features
data_subset <- data[, -c(1, 15)]  # drop id and the target Depression
# Identify numeric columns
numeric_cols <- sapply(data_subset, is.numeric)
numeric_col_names <- names(data_subset)[numeric_cols]
# Detect outliers using IQR
outliers <- lapply(numeric_col_names, function(col_name) {
column <- data_subset[[col_name]]
Q1 <- quantile(column, 0.25, na.rm = TRUE)
Q3 <- quantile(column, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Boundaries for normal values and outliers
cat("Feature:", col_name, "\n")
cat("Lower Bound:", lower_bound, "\n")
cat("Upper Bound:", upper_bound, "\n")
cat("Outlier Values:\n")
outlier_values <- column[column < lower_bound | column > upper_bound]
print(outlier_values)
cat("\n")
list(
lower_bound = lower_bound,
upper_bound = upper_bound,
outlier_values = outlier_values
)
})
## Feature: Age
## Lower Bound: 7.5
## Upper Bound: 43.5
## Outlier Values:
## [1] 58 49 51 44 46 59 54 48 56 46 48 48
##
## Feature: Academic.Pressure
## Lower Bound: -1
## Upper Bound: 7
## Outlier Values:
## numeric(0)
##
## Feature: CGPA
## Lower Bound: 2.345
## Upper Bound: 12.865
## Outlier Values:
## [1] 0 0 0 0 0 0 0 0 0
##
## Feature: Study.Satisfaction
## Lower Bound: -1
## Upper Bound: 7
## Outlier Values:
## numeric(0)
##
## Feature: Work.Study.Hours
## Lower Bound: -5
## Upper Bound: 19
## Outlier Values:
## numeric(0)
# Assign names to the output list
names(outliers) <- numeric_col_names
Feature: Age The normal range for the feature Age is between 7.5 and 43.5. Values outside this range are considered outliers. The outliers detected for this feature are: 58, 49, 51, 44, 46, 59, 54, 48, 56, 46, 48, 48. These values represent unusually high ages compared to the rest of the dataset.
Feature: Academic.Pressure The normal range for the feature Academic.Pressure is between -1 and 7. No outliers were found for this feature, meaning all data points fall within the expected range.
Feature: CGPA The normal range for the feature CGPA is between 2.345 and 12.865. The outliers detected for this feature are nine values of 0. These fall well below the lower bound and likely indicate missing or erroneous data.
Feature: Study.Satisfaction The normal range for the feature Study.Satisfaction is between -1 and 7. No outliers were found for this feature, indicating that all values are within the expected range.
Feature: Work.Study.Hours The normal range for the feature Work.Study.Hours is between -5 and 19. No outliers were detected for this feature, meaning all values are within the acceptable range.
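One possible treatment of the implausible CGPA values, sketched here on a copy but not applied to our working dataset, is to recode them as missing:

```r
# Hypothetical cleaning step: a CGPA of exactly 0 likely indicates missing
# or erroneous data, so recode it as NA in a copy of the data
data_alt <- data
data_alt$CGPA[data_alt$CGPA == 0] <- NA
sum(is.na(data_alt$CGPA))  # should match the 9 outliers found above
```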
Now, we would like to further check for the duplicates present in our dataset.
if (any(duplicated(data))) {
cat("The dataset contains duplicates. Here are the duplicated rows:\n")
duplicates <- data[duplicated(data) | duplicated(data, fromLast = TRUE), ]
print(duplicates)
} else {
cat("The dataset does not contain any duplicates.\n")
}
## The dataset does not contain any duplicates.
The code above confirms that there are no duplicate rows in our dataset.
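As a quicker equivalent check, base R's `anyDuplicated()` returns the index of the first duplicated row, or 0 if there is none:

```r
anyDuplicated(data)  # 0 means no duplicate rows
```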
We will perform logistic regression to predict Depression, as it is a binary variable. The Financial.Stress variable, an ordered scale from 1 to 5, is converted to numeric to simplify the model and maintain interpretability. Additionally, the level order of the Sleep.Duration and Dietary.Habits factors is changed so that standard healthy behavior is the baseline, which better highlights how deviations from it relate to depression.
data$Depression <- factor(data$Depression, levels = c(0, 1), labels = c("No", "Yes"))
data$Financial.Stress <- as.numeric(data$Financial.Stress)
data$Sleep.Duration <- gsub("'", "", data$Sleep.Duration)
data$Sleep.Duration <- factor(data$Sleep.Duration,
levels = c('More than 8 hours', '7-8 hours', '5-6 hours',
'Less than 5 hours', 'Others'))
data$Dietary.Habits <- factor(data$Dietary.Habits,
levels = c('Healthy', 'Moderate', 'Unhealthy',
'Others'))
m <- glm(Depression ~ ., data = data, family = binomial())
summary(m)
##
## Call:
## glm(formula = Depression ~ ., family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.330e+01 5.354e+02 -0.025 0.98018
## id 1.568e-07 4.494e-07 0.349 0.72713
## GenderMale -1.044e-02 3.724e-02 -0.280 0.77918
## Age -1.202e-01 4.974e-03 -24.168 < 2e-16
## City'Less than 5 Kalyan' 1.170e+00 7.572e+02 0.002 0.99877
## City3.0 1.926e+01 7.572e+02 0.025 0.97971
## CityAgra 9.033e+00 5.354e+02 0.017 0.98654
## CityAhmedabad 9.607e+00 5.354e+02 0.018 0.98568
## CityBangalore 9.397e+00 5.354e+02 0.018 0.98600
## CityBhavna 2.205e+01 6.300e+02 0.035 0.97208
## CityBhopal 9.449e+00 5.354e+02 0.018 0.98592
## CityChennai 9.215e+00 5.354e+02 0.017 0.98627
## CityCity 1.107e+01 5.354e+02 0.021 0.98351
## CityDelhi 9.368e+00 5.354e+02 0.017 0.98604
## CityFaridabad 9.451e+00 5.354e+02 0.018 0.98592
## CityGaurav 1.922e+01 7.572e+02 0.025 0.97975
## CityGhaziabad 9.330e+00 5.354e+02 0.017 0.98610
## CityHarsh 1.881e+01 7.572e+02 0.025 0.98018
## CityHarsha 9.507e+00 5.354e+02 0.018 0.98583
## CityHyderabad 9.586e+00 5.354e+02 0.018 0.98572
## CityIndore 9.310e+00 5.354e+02 0.017 0.98613
## CityJaipur 9.056e+00 5.354e+02 0.017 0.98650
## CityKalyan 9.378e+00 5.354e+02 0.018 0.98603
## CityKanpur 9.319e+00 5.354e+02 0.017 0.98611
## CityKhaziabad 2.079e+01 7.572e+02 0.027 0.97810
## CityKibara -5.395e+00 7.572e+02 -0.007 0.99431
## CityKolkata 9.489e+00 5.354e+02 0.018 0.98586
## CityLucknow 9.222e+00 5.354e+02 0.017 0.98626
## CityLudhiana 9.370e+00 5.354e+02 0.018 0.98604
## CityM.Com 1.910e+01 7.572e+02 0.025 0.97988
## CityM.Tech -1.170e+00 7.572e+02 -0.002 0.99877
## CityME -3.162e+00 7.572e+02 -0.004 0.99667
## CityMeerut 9.466e+00 5.354e+02 0.018 0.98589
## CityMihir 2.131e+01 7.572e+02 0.028 0.97755
## CityMira 2.131e+01 7.572e+02 0.028 0.97754
## CityMumbai 9.008e+00 5.354e+02 0.017 0.98658
## CityNagpur 8.974e+00 5.354e+02 0.017 0.98663
## CityNalini 1.867e+01 7.572e+02 0.025 0.98033
## CityNalyan -5.555e+00 7.572e+02 -0.007 0.99415
## CityNandini -2.786e+00 7.572e+02 -0.004 0.99706
## CityNashik 9.357e+00 5.354e+02 0.017 0.98606
## CityPatna 9.501e+00 5.354e+02 0.018 0.98584
## CityPune 9.214e+00 5.354e+02 0.017 0.98627
## CityRajkot 9.373e+00 5.354e+02 0.018 0.98603
## CityRashi -6.662e-01 7.572e+02 -0.001 0.99930
## CityReyansh -2.568e+00 7.572e+02 -0.003 0.99729
## CitySaanvi 2.243e+01 6.545e+02 0.034 0.97266
## CitySrinagar 9.250e+00 5.354e+02 0.017 0.98622
## CitySurat 9.312e+00 5.354e+02 0.017 0.98612
## CityThane 9.407e+00 5.354e+02 0.018 0.98598
## CityVaanya 2.302e+01 7.572e+02 0.030 0.97575
## CityVadodara 9.264e+00 5.354e+02 0.017 0.98620
## CityVaranasi 9.110e+00 5.354e+02 0.017 0.98643
## CityVasai-Virar 9.269e+00 5.354e+02 0.017 0.98619
## CityVisakhapatnam 9.343e+00 5.354e+02 0.017 0.98608
## Academic.Pressure 8.416e-01 1.501e-02 56.072 < 2e-16
## CGPA 5.534e-02 1.252e-02 4.419 9.92e-06
## Study.Satisfaction -2.418e-01 1.367e-02 -17.688 < 2e-16
## Sleep.Duration7-8 hours 2.778e-01 5.254e-02 5.287 1.25e-07
## Sleep.Duration5-6 hours 2.617e-01 5.458e-02 4.794 1.63e-06
## Sleep.DurationLess than 5 hours 6.231e-01 5.213e-02 11.954 < 2e-16
## Sleep.DurationOthers 5.973e-02 6.375e-01 0.094 0.92535
## Dietary.HabitsModerate 4.710e-01 4.549e-02 10.354 < 2e-16
## Dietary.HabitsUnhealthy 1.097e+00 4.665e-02 23.512 < 2e-16
## Dietary.HabitsOthers 8.831e-01 7.822e-01 1.129 0.25885
## DegreeB.Arch 1.875e-01 9.612e-02 1.951 0.05108
## DegreeB.Com 6.020e-02 9.519e-02 0.632 0.52712
## DegreeB.Ed 1.500e-01 9.166e-02 1.637 0.10173
## DegreeB.Pharm 8.238e-02 1.178e-01 0.700 0.48421
## DegreeB.Tech 2.715e-01 1.051e-01 2.584 0.00976
## DegreeBA -5.987e-03 1.283e-01 -0.047 0.96279
## DegreeBBA 2.368e-01 1.241e-01 1.909 0.05632
## DegreeBCA 1.331e-01 9.528e-02 1.397 0.16247
## DegreeBE 2.133e-01 1.292e-01 1.651 0.09884
## DegreeBHM 6.411e-02 1.130e-01 0.567 0.57060
## DegreeBSc 3.273e-01 1.150e-01 2.847 0.00442
## DegreeLLB 3.284e-01 1.288e-01 2.549 0.01079
## DegreeLLM 4.491e-01 1.489e-01 3.016 0.00256
## DegreeM.Com 1.477e-01 1.240e-01 1.192 0.23340
## DegreeM.Ed 2.485e-01 1.225e-01 2.030 0.04240
## DegreeM.Pharm 1.848e-01 1.375e-01 1.345 0.17877
## DegreeM.Tech 9.237e-02 1.114e-01 0.829 0.40693
## DegreeMA 7.737e-02 1.364e-01 0.567 0.57066
## DegreeMBA 1.583e-02 1.353e-01 0.117 0.90688
## DegreeMBBS 3.064e-01 1.262e-01 2.429 0.01514
## DegreeMCA 2.513e-01 1.076e-01 2.334 0.01957
## DegreeMD 2.353e-01 1.394e-01 1.687 0.09154
## DegreeME 7.903e-02 2.248e-01 0.352 0.72513
## DegreeMHM 1.939e-01 2.133e-01 0.909 0.36338
## DegreeMSc 4.970e-02 1.022e-01 0.486 0.62680
## DegreeOthers 6.948e-01 4.809e-01 1.445 0.14849
## DegreePhD 3.101e-01 1.435e-01 2.160 0.03074
## Have.you.ever.had.suicidal.thoughts..Yes 2.518e+00 3.938e-02 63.936 < 2e-16
## Work.Study.Hours 1.182e-01 4.999e-03 23.640 < 2e-16
## Financial.Stress 5.541e-01 1.341e-02 41.324 < 2e-16
## Family.History.of.Mental.IllnessYes 2.483e-01 3.663e-02 6.779 1.21e-11
##
## (Intercept)
## id
## GenderMale
## Age ***
## City'Less than 5 Kalyan'
## City3.0
## CityAgra
## CityAhmedabad
## CityBangalore
## CityBhavna
## CityBhopal
## CityChennai
## CityCity
## CityDelhi
## CityFaridabad
## CityGaurav
## CityGhaziabad
## CityHarsh
## CityHarsha
## CityHyderabad
## CityIndore
## CityJaipur
## CityKalyan
## CityKanpur
## CityKhaziabad
## CityKibara
## CityKolkata
## CityLucknow
## CityLudhiana
## CityM.Com
## CityM.Tech
## CityME
## CityMeerut
## CityMihir
## CityMira
## CityMumbai
## CityNagpur
## CityNalini
## CityNalyan
## CityNandini
## CityNashik
## CityPatna
## CityPune
## CityRajkot
## CityRashi
## CityReyansh
## CitySaanvi
## CitySrinagar
## CitySurat
## CityThane
## CityVaanya
## CityVadodara
## CityVaranasi
## CityVasai-Virar
## CityVisakhapatnam
## Academic.Pressure ***
## CGPA ***
## Study.Satisfaction ***
## Sleep.Duration7-8 hours ***
## Sleep.Duration5-6 hours ***
## Sleep.DurationLess than 5 hours ***
## Sleep.DurationOthers
## Dietary.HabitsModerate ***
## Dietary.HabitsUnhealthy ***
## Dietary.HabitsOthers
## DegreeB.Arch .
## DegreeB.Com
## DegreeB.Ed
## DegreeB.Pharm
## DegreeB.Tech **
## DegreeBA
## DegreeBBA .
## DegreeBCA
## DegreeBE .
## DegreeBHM
## DegreeBSc **
## DegreeLLB *
## DegreeLLM **
## DegreeM.Com
## DegreeM.Ed *
## DegreeM.Pharm
## DegreeM.Tech
## DegreeMA
## DegreeMBA
## DegreeMBBS *
## DegreeMCA *
## DegreeMD .
## DegreeME
## DegreeMHM
## DegreeMSc
## DegreeOthers
## DegreePhD *
## Have.you.ever.had.suicidal.thoughts..Yes ***
## Work.Study.Hours ***
## Financial.Stress ***
## Family.History.of.Mental.IllnessYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37819 on 27866 degrees of freedom
## Residual deviance: 19333 on 27771 degrees of freedom
## AIC: 19525
##
## Number of Fisher Scoring iterations: 12
Based on the summary, the variables id, City, and Degree provide little explanatory value for predicting the outcome. Although a few Degree categories are statistically significant (e.g., DegreeBSc and DegreeLLM at the 0.01 level, and DegreeM.Ed at the 0.05 level), the overall contribution of the Degree variable is limited. Including Degree risks overfitting due to its 28 sparse categories, which add little predictive value and lack a strong theoretical connection to mental health outcomes. Therefore, we exclude id, City, and Degree from the model to improve interpretability and avoid unnecessary complexity. We now fit the reduced logistic model and describe it fully, after splitting the data into training (80%) and test (20%) sets.
data_n <- data
data_n$id <- NULL
data_n$City <- NULL
data_n$Degree <- NULL
set.seed(123)
n <- nrow(data_n)
n1 <- floor(0.8 * n)  # integer training-set size (sample() expects an integer)
id_train <- sample(1:n, n1)
trainingdata <- data_n[id_train, ]
testdata <- data_n[-id_train, ]
m1 <- glm(Depression ~ ., data = trainingdata, family = binomial())
summary(m1)
##
## Call:
## glm(formula = Depression ~ ., family = binomial(), data = trainingdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.084251 0.190749 -21.412 < 2e-16
## GenderMale 0.007223 0.041014 0.176 0.860
## Age -0.109394 0.004271 -25.611 < 2e-16
## Academic.Pressure 0.838378 0.016555 50.643 < 2e-16
## CGPA 0.054395 0.013821 3.936 8.30e-05
## Study.Satisfaction -0.250605 0.015149 -16.542 < 2e-16
## Sleep.Duration7-8 hours 0.278485 0.058565 4.755 1.98e-06
## Sleep.Duration5-6 hours 0.258794 0.060548 4.274 1.92e-05
## Sleep.DurationLess than 5 hours 0.590262 0.057891 10.196 < 2e-16
## Sleep.DurationOthers -0.013578 0.675266 -0.020 0.984
## Dietary.HabitsModerate 0.469863 0.050444 9.315 < 2e-16
## Dietary.HabitsUnhealthy 1.092463 0.051685 21.137 < 2e-16
## Dietary.HabitsOthers 0.459458 0.853115 0.539 0.590
## Have.you.ever.had.suicidal.thoughts..Yes 2.504515 0.043746 57.251 < 2e-16
## Work.Study.Hours 0.119190 0.005558 21.445 < 2e-16
## Financial.Stress 0.557224 0.014880 37.447 < 2e-16
## Family.History.of.Mental.IllnessYes 0.248295 0.040709 6.099 1.07e-09
##
## (Intercept) ***
## GenderMale
## Age ***
## Academic.Pressure ***
## CGPA ***
## Study.Satisfaction ***
## Sleep.Duration7-8 hours ***
## Sleep.Duration5-6 hours ***
## Sleep.DurationLess than 5 hours ***
## Sleep.DurationOthers
## Dietary.HabitsModerate ***
## Dietary.HabitsUnhealthy ***
## Dietary.HabitsOthers
## Have.you.ever.had.suicidal.thoughts..Yes ***
## Work.Study.Hours ***
## Financial.Stress ***
## Family.History.of.Mental.IllnessYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30271 on 22292 degrees of freedom
## Residual deviance: 15564 on 22276 degrees of freedom
## AIC: 15598
##
## Number of Fisher Scoring iterations: 6
The model shows that the following variables have a significant relationship with depression at the p < 0.05 level: Age (negative), Academic.Pressure, CGPA, Study.Satisfaction (negative), the named Sleep.Duration levels (relative to the ‘More than 8 hours’ baseline), the Dietary.Habits levels Moderate and Unhealthy (relative to Healthy), Have.you.ever.had.suicidal.thoughts.., Work.Study.Hours, Financial.Stress, and Family.History.of.Mental.Illness. Gender and the ‘Others’ levels of Sleep.Duration and Dietary.Habits are not significant.
Now we perform backward stepwise model selection on the training sample: at each step, the predictor whose removal most reduces AIC is dropped, and the model with the lowest AIC is retained.
glm_step <- step(m1)
## Start: AIC=15597.89
## Depression ~ Gender + Age + Academic.Pressure + CGPA + Study.Satisfaction +
## Sleep.Duration + Dietary.Habits + Have.you.ever.had.suicidal.thoughts.. +
## Work.Study.Hours + Financial.Stress + Family.History.of.Mental.Illness
##
## Df Deviance AIC
## - Gender 1 15564 15596
## <none> 15564 15598
## - CGPA 1 15579 15611
## - Family.History.of.Mental.Illness 1 15601 15633
## - Sleep.Duration 4 15672 15698
## - Study.Satisfaction 1 15843 15875
## - Dietary.Habits 3 16033 16061
## - Work.Study.Hours 1 16039 16071
## - Age 1 16260 16292
## - Financial.Stress 1 17121 17153
## - Academic.Pressure 1 18768 18800
## - Have.you.ever.had.suicidal.thoughts.. 1 19544 19576
##
## Step: AIC=15595.92
## Depression ~ Age + Academic.Pressure + CGPA + Study.Satisfaction +
## Sleep.Duration + Dietary.Habits + Have.you.ever.had.suicidal.thoughts.. +
## Work.Study.Hours + Financial.Stress + Family.History.of.Mental.Illness
##
## Df Deviance AIC
## <none> 15564 15596
## - CGPA 1 15580 15610
## - Family.History.of.Mental.Illness 1 15601 15631
## - Sleep.Duration 4 15672 15696
## - Study.Satisfaction 1 15843 15873
## - Dietary.Habits 3 16036 16062
## - Work.Study.Hours 1 16040 16070
## - Age 1 16260 16290
## - Financial.Stress 1 17121 17151
## - Academic.Pressure 1 18769 18799
## - Have.you.ever.had.suicidal.thoughts.. 1 19544 19574
summary(glm_step)
##
## Call:
## glm(formula = Depression ~ Age + Academic.Pressure + CGPA + Study.Satisfaction +
## Sleep.Duration + Dietary.Habits + Have.you.ever.had.suicidal.thoughts.. +
## Work.Study.Hours + Financial.Stress + Family.History.of.Mental.Illness,
## family = binomial(), data = trainingdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.080900 0.189792 -21.502 < 2e-16
## Age -0.109393 0.004271 -25.610 < 2e-16
## Academic.Pressure 0.838317 0.016551 50.651 < 2e-16
## CGPA 0.054472 0.013814 3.943 8.04e-05
## Study.Satisfaction -0.250645 0.015148 -16.547 < 2e-16
## Sleep.Duration7-8 hours 0.278425 0.058564 4.754 1.99e-06
## Sleep.Duration5-6 hours 0.258818 0.060548 4.275 1.91e-05
## Sleep.DurationLess than 5 hours 0.590290 0.057891 10.197 < 2e-16
## Sleep.DurationOthers -0.013658 0.675354 -0.020 0.984
## Dietary.HabitsModerate 0.470012 0.050437 9.319 < 2e-16
## Dietary.HabitsUnhealthy 1.093104 0.051557 21.202 < 2e-16
## Dietary.HabitsOthers 0.460695 0.853049 0.540 0.589
## Have.you.ever.had.suicidal.thoughts..Yes 2.504527 0.043746 57.251 < 2e-16
## Work.Study.Hours 0.119213 0.005556 21.455 < 2e-16
## Financial.Stress 0.557204 0.014880 37.447 < 2e-16
## Family.History.of.Mental.IllnessYes 0.248219 0.040707 6.098 1.08e-09
##
## (Intercept) ***
## Age ***
## Academic.Pressure ***
## CGPA ***
## Study.Satisfaction ***
## Sleep.Duration7-8 hours ***
## Sleep.Duration5-6 hours ***
## Sleep.DurationLess than 5 hours ***
## Sleep.DurationOthers
## Dietary.HabitsModerate ***
## Dietary.HabitsUnhealthy ***
## Dietary.HabitsOthers
## Have.you.ever.had.suicidal.thoughts..Yes ***
## Work.Study.Hours ***
## Financial.Stress ***
## Family.History.of.Mental.IllnessYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30271 on 22292 degrees of freedom
## Residual deviance: 15564 on 22277 degrees of freedom
## AIC: 15596
##
## Number of Fisher Scoring iterations: 6
The stepwise model shows that every retained predictor is significantly related to depression at the p < 0.05 level (the only exceptions being the sparsely populated "Others" levels of Sleep.Duration and Dietary.Habits): Age, Academic.Pressure, CGPA, Study.Satisfaction, Sleep.Duration, Dietary.Habits, suicidal thoughts, Work.Study.Hours, Financial.Stress, and Family.History.of.Mental.Illness.
Overall, the model is very similar to the full logistic regression: some coefficients and p-values have shifted slightly, and the one non-significant predictor, Gender, was excluded.
Models can be compared based on AIC:
AIC(m1, glm_step)
The stepwise model has the lower AIC of the two (lower is better), so we will use only the stepwise model from here onwards.
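As a quick sanity check on the reported value, for a GLM the AIC equals the residual deviance plus twice the number of estimated coefficients; using the rounded figures from the summary above:

```r
# AIC = residual deviance + 2 * number of estimated coefficients.
# The stepwise model reports deviance 15564 (rounded) and 16 coefficients.
deviance_glm <- 15564
n_coef <- 16
aic_approx <- deviance_glm + 2 * n_coef
aic_approx  # ~15596, matching the reported AIC (15595.92 before rounding)
```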
To evaluate the performance of the step model, we analyze the confusion matrix.
predicted_p_glm_step <- predict(glm_step, newdata = testdata, type = "response")
predicted_glm_step <- ifelse(predicted_p_glm_step > 0.5, "Yes", "No")
predicted_glm_step <- factor(predicted_glm_step, levels = c("No", "Yes"))
actual <- testdata$Depression
# Confusion matrix
conf_matrix_glm <- confusionMatrix(predicted_glm_step, actual)
# Extract metrics confusionMatrix
glm_accuracy <- conf_matrix_glm$overall["Accuracy"]
glm_precision <- conf_matrix_glm$byClass["Pos Pred Value"]
glm_recall <- conf_matrix_glm$byClass["Sensitivity"]
glm_f1 <- conf_matrix_glm$byClass["F1"]
glm_metrics2 <- data.frame(
Model = "GLM (Stepwise)",
Accuracy = glm_accuracy,
Precision = glm_precision,
Recall = glm_recall,
F1_Score = glm_f1
)
print(glm_metrics2)
## Model Accuracy Precision Recall F1_Score
## Accuracy GLM (Stepwise) 0.8471475 0.8258503 0.7955439 0.8104139
From the confusion matrix for the stepwise logistic regression above, we can make the following observations:
Overall, the model demonstrates strong out-of-sample performance with a solid balance between correctly identifying positive cases and avoiding false positives. Given that recall is the primary metric of interest, the model shows promising results in identifying the majority of actual positive instances.
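Since recall is the priority, it is worth noting that the 0.5 cutoff used above is a tunable choice, not a fixed rule. A minimal sketch with simulated probabilities (not the fitted glm_step model) shows how lowering the cutoff trades precision for recall:

```r
# Illustrative sketch with simulated data: lowering the classification cutoff
# increases recall at the cost of precision, which may be desirable when
# false negatives (missed at-risk individuals) are the costlier error.
set.seed(1)
p <- runif(1000)          # hypothetical predicted probabilities
y <- rbinom(1000, 1, p)   # hypothetical true labels drawn from those probabilities
for (cutoff in c(0.5, 0.4, 0.3)) {
  pred <- as.integer(p > cutoff)
  tp <- sum(pred == 1 & y == 1)
  fn <- sum(pred == 0 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  cat(sprintf("cutoff %.1f: recall %.3f, precision %.3f\n",
              cutoff, tp / (tp + fn), tp / (tp + fp)))
}
```

Recall rises monotonically as the cutoff drops (every positive at 0.5 is still positive at 0.3), at the cost of more false positives; the same loop applied to the model's predicted probabilities would let us pick an operating point deliberately.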
We fit the k-NN model on the training set, using the caret package to run 5-fold cross-validation to pick an appropriate k. We also scale the data, since k-NN is distance-based and requires features on comparable scales.
set.seed(123)
# Set up fitControl:
fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = prSummary, savePredictions = "final")
# Scale the data:
train_scaled <- trainingdata
test_scaled <- testdata
num_cols <- sapply(train_scaled, is.numeric)
train_scaled[, num_cols] <- scale(train_scaled[, num_cols])
# Scale test set using training set mean and sd:
means <- sapply(trainingdata[, num_cols], mean)
sds <- sapply(trainingdata[, num_cols], sd)
test_scaled[, num_cols] <- scale(test_scaled[, num_cols], center = means, scale = sds)
# Apply KNN on training data:
knnFit <- train(Depression ~ ., data = train_scaled,
method = "knn",
trControl = fitControl,
tuneGrid = data.frame(k = c(3, 5, 7, 9, 11, 13, 15)),
metric = "Recall")
knnFit
## k-Nearest Neighbors
##
## 22293 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 17835, 17835, 17834, 17834, 17834
## Resampling results across tuning parameters:
##
## k AUC Precision Recall F
## 3 0.3880346 0.7871597 0.7338997 0.7595644
## 5 0.5124437 0.8050061 0.7423136 0.7723297
## 7 0.5880311 0.8162197 0.7470596 0.7800673
## 9 0.6450068 0.8247330 0.7480304 0.7844436
## 11 0.6853104 0.8323770 0.7509423 0.7894642
## 13 0.7208451 0.8356487 0.7499717 0.7904140
## 15 0.7442683 0.8390105 0.7509423 0.7924524
##
## Recall was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
plot(knnFit, main = "Optimal k Selection for k-NN")
As a result, k values of 3, 5, 7, 9, 11, 13, and 15 were tested, and the model with k = 15 was chosen as optimal, as it yielded the highest recall (0.7509, a value also reached at k = 11). This means the model correctly identifies approximately 75% of all actual depression cases. While the differences in recall across the models were relatively small, the recall of the 15-nearest-neighbors model was about 1.7 percentage points higher than that of the 3-nearest-neighbors model (0.7509 vs. 0.7339). Additionally, the chosen model also achieved the highest precision (0.8390) and F1-score (0.7925).
To evaluate the performance of the k-NN model, we analyze the confusion matrix.
# Prediction
predicted_probs_knn <- predict(knnFit, newdata = test_scaled, type = "prob")[, "Yes"]
predicted_knn <- ifelse(predicted_probs_knn > 0.5, "Yes", "No")
predicted_knn <- factor(predicted_knn, levels = c("No", "Yes"))
actual <- factor(test_scaled$Depression, levels = c("No", "Yes"))
# Confusion matrix
conf_matrix_knn <- confusionMatrix(predicted_knn, actual)
# Extract metrics from confusionMatrix
knn_accuracy <- conf_matrix_knn$overall["Accuracy"]
knn_precision <- conf_matrix_knn$byClass["Pos Pred Value"]
knn_recall <- conf_matrix_knn$byClass["Sensitivity"]
knn_f1 <- conf_matrix_knn$byClass["F1"]
knn_metrics2 <- data.frame(
Model = "k-NN",
Accuracy = knn_accuracy,
Precision = knn_precision,
Recall = knn_recall,
F1_Score = knn_f1
)
print(knn_metrics2)
## Model Accuracy Precision Recall F1_Score
## Accuracy k-NN 0.8322569 0.822381 0.7544779 0.7869674
From the confusion matrix for the k-NN model above, we can make the following observations:
These results indicate that the k-NN model is slightly worse than the stepwise logistic regression at correctly detecting people with depression.
nbFit <- naiveBayes(Depression ~ ., data = trainingdata)
nbFit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.4158705 0.5841295
##
## Conditional probabilities:
## Gender
## Y Female Male
## No 0.4456909 0.5543091
## Yes 0.4444786 0.5555214
##
## Age
## Y [,1] [,2]
## No 27.16395 4.949745
## Yes 24.87014 4.673351
##
## Academic.Pressure
## Y [,1] [,2]
## No 2.355301 1.256081
## Yes 3.697205 1.188064
##
## CGPA
## Y [,1] [,2]
## No 7.616958 1.492570
## Yes 7.681386 1.456439
##
## Study.Satisfaction
## Y [,1] [,2]
## No 3.203646 1.326500
## Yes 2.739671 1.351101
##
## Sleep.Duration
## Y More than 8 hours 7-8 hours 5-6 hours Less than 5 hours
## No 0.2544493582 0.2553122640 0.2326609859 0.2567144860
## Yes 0.1893718323 0.2680847796 0.2167101828 0.3252956535
## Sleep.Duration
## Y Others
## No 0.0008629058
## Yes 0.0005375518
##
## Dietary.Habits
## Y Healthy Moderate Unhealthy Others
## No 0.3633912199 0.3755797649 0.2605975623 0.0004314529
## Yes 0.2159422516 0.3391184150 0.4444785747 0.0004607587
##
## Have.you.ever.had.suicidal.thoughts..
## Y No Yes
## No 0.6787833 0.3212167
## Yes 0.1460605 0.8539395
##
## Work.Study.Hours
## Y [,1] [,2]
## No 6.234171 3.865994
## Yes 7.780986 3.457171
##
## Financial.Stress
## Y [,1] [,2]
## No 2.512566 1.345747
## Yes 3.584165 1.334681
##
## Family.History.of.Mental.Illness
## Y No Yes
## No 0.5464351 0.4535649
## Yes 0.4950852 0.5049148
The a-priori probabilities, i.e. the base rates of the two outcomes in the training data, are:
No (person does not have depression): 41.59 %
Yes (person has depression): 58.41 %
There are more people with depression in our dataset than without. Any model must beat 58.41% accuracy to be useful, since always predicting "Yes" would already achieve that level.
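To make that baseline concrete, a tiny sketch (with label counts reproducing the 41.59%/58.41% split, not the actual data frame) confirms that a constant "Yes" classifier scores exactly the prevalence:

```r
# Hypothetical labels reproducing the 58.41% prevalence of "Yes"
labels <- factor(c(rep("Yes", 5841), rep("No", 4159)))
# A classifier that always predicts "Yes" is right whenever the label is "Yes"
baseline_accuracy <- mean(labels == "Yes")
baseline_accuracy  # 0.5841
```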
To evaluate the performance of the Naive Bayes model, we analyze the confusion matrix.
# Predicted classes (predict() on a naiveBayes fit returns class labels by default)
predicted_nb <- predict(nbFit, newdata = testdata)
# Create confusion matrix
conf_matrix_nb <- confusionMatrix(predicted_nb, testdata$Depression)
# Extract metrics from confusionMatrix
nb_accuracy <- conf_matrix_nb$overall["Accuracy"]
nb_precision <- conf_matrix_nb$byClass["Pos Pred Value"]
nb_recall <- conf_matrix_nb$byClass["Sensitivity"]
nb_f1 <- conf_matrix_nb$byClass["F1"]
nb_metrics2 <- data.frame(
Model = "Naive Bayes",
Accuracy = nb_accuracy,
Precision = nb_precision,
Recall = nb_recall,
F1_Score = nb_f1
)
print(nb_metrics2)
## Model Accuracy Precision Recall F1_Score
## Accuracy Naive Bayes 0.8466093 0.8253176 0.7946702 0.809704
From the confusion matrix for the Naive Bayes model above, we can make the following observations:
The metrics suggest better performance than the k-NN model, but slightly worse than the stepwise logistic regression.
trainingdata$Depression <- factor(trainingdata$Depression)
tree <- rpart(Depression ~ ., data = trainingdata)
plot(as.party(tree))
The classification tree analyzes factors related to suicidal thoughts, with the root node asking if someone has experienced suicidal thoughts before branching into different paths. For those who answered “No,” the model evaluates academic pressure using a threshold of 3.5, with those below this level classified into Node 3 (n=5949), while those above this threshold are further assessed based on financial stress. Those who answered “Yes” to suicidal thoughts are evaluated on academic pressure with a lower threshold of 2.5; if their academic pressure is below 2.5, they’re further classified by financial stress, but if it’s 2.5 or higher, they go directly to Node 11, which contains the largest population (n=10784) and displays the highest proportion of risk in its bar graph. The tree suggests that both academic pressure and financial stress are important predictors, with different threshold levels mattering depending on whether someone has already experienced suicidal thoughts, making Node 11 (individuals with suicidal thoughts and academic pressure ≥2.5) the highest risk group.
To evaluate the performance of the classification tree model, we analyze the confusion matrix.
# Predicted classes from the tree
predicted_tree <- predict(tree, newdata = testdata, type = "class")
conf_matrix <- confusionMatrix(predicted_tree, testdata$Depression)
# Extract confusion Matrix components
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]
# Create metrics dataframe
metrics_tree2 <- data.frame(
Model = "Classification Tree",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1
)
print(metrics_tree2)
## Model Accuracy Precision Recall F1_Score
## Accuracy Classification Tree 0.8177252 0.8209783 0.7112276 0.7621723
From the confusion matrix for the classification tree above, we can make the following observations:
Overall, the classification tree underperforms all the other models on every metric.
rf <- ranger(Depression ~ ., data = trainingdata, probability = TRUE, importance = "permutation")
print(rf)
## Ranger result
##
## Call:
## ranger(Depression ~ ., data = trainingdata, probability = TRUE, importance = "permutation")
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 22293
## Number of independent variables: 11
## Mtry: 3
## Target node size: 10
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1130589
The model uses 500 trees and 11 predictors on a dataset of 22,293 observations, with 3 variables evaluated per split (mtry = 3) and a target node size of 10. It outputs depression probabilities, and its out-of-bag Brier score of about 0.113 indicates good probabilistic accuracy (a constant 0.5 prediction would score 0.25). Variable importance is not displayed in this summary, but the permutation method ranks predictors by how much shuffling their values increases prediction error; we examine these rankings next.
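For reference, the Brier score is simply the mean squared difference between the predicted probability and the 0/1 outcome. A toy calculation (hypothetical values, not the model's actual predictions):

```r
# Toy Brier score: lower is better, 0 is perfect,
# and a constant 0.5 prediction scores 0.25.
y <- c(1, 0, 1, 1, 0)            # hypothetical 0/1 outcomes
p <- c(0.9, 0.2, 0.7, 0.6, 0.1)  # hypothetical predicted probabilities
brier <- mean((p - y)^2)
brier  # 0.062
```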
# Get the importance values
importance_values <- importance(rf)
# Create a data frame for ggplot
importance_df <- data.frame(Feature = names(importance_values), Importance = importance_values)
# Sort the data frame by importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
# Create a new column for color based on importance
importance_df$Color <- ifelse(importance_df$Importance == max(importance_df$Importance),
"grey", "#003c75")
# Create the plot using ggplot2
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Color)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip the coordinates for better readability
labs(x = "Features", y = "Importance", title = "Most important features to predict depression") +
scale_fill_identity() + # Use the fill values directly
theme_minimal() +
theme(legend.position = "none") # Remove the legend
The most critical predictor is “Have you ever had suicidal thoughts…”, showing the highest importance score, which aligns with common knowledge that suicidal thoughts are strongly linked to depression. Following this, Academic Pressure and Financial Stress emerge as significant contributors, reflecting how environmental stress impacts mental health. Age and Work Study Hours also play notable roles, suggesting younger individuals and those with heavier workloads may be more vulnerable. Lifestyle factors like Dietary Habits, Sleep Duration, and Study Satisfaction appear moderately influential, while CGPA, Family History of Mental Illness, and Gender show relatively lower importance, indicating weaker associations with depression in the model.
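The permutation idea behind these rankings can be illustrated on toy data (a simple glm stands in for the random forest): shuffling an informative predictor degrades accuracy noticeably, while shuffling an uninformative one barely changes it.

```r
# Conceptual sketch of permutation importance on simulated data.
set.seed(7)
x1 <- rnorm(500)
x2 <- rnorm(500)
y  <- as.integer(x1 + rnorm(500, sd = 0.5) > 0)  # outcome depends on x1 only
fit <- glm(y ~ x1 + x2, family = binomial())
# Misclassification rate for a given version of the predictors
err <- function(d) mean((predict(fit, newdata = d, type = "response") > 0.5) != y)
base <- err(data.frame(x1 = x1, x2 = x2))
imp_x1 <- err(data.frame(x1 = sample(x1), x2 = x2)) - base  # large increase
imp_x2 <- err(data.frame(x1 = x1, x2 = sample(x2))) - base  # near zero
c(imp_x1 = imp_x1, imp_x2 = imp_x2)
```

ranger applies the same logic out-of-bag across all 500 trees, which is what produces the scores plotted above.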
Now that we have identified the most important features, we can create visualizations for the top three predictors (suicidal thoughts, academic pressure, and financial stress). This will help us further explore and understand the relationships between these predictors and Depression, providing deeper insights into their impact.
trainingdata$Have.you.ever.had.suicidal.thoughts.. <- as.factor(trainingdata$Have.you.ever.had.suicidal.thoughts..)
grd <- levels(trainingdata$Have.you.ever.had.suicidal.thoughts..)
pdp_data <- data.frame()
for (i in 1:length(grd)) {
temp_data <- trainingdata
temp_data$Have.you.ever.had.suicidal.thoughts.. <- factor(grd[i], levels = grd)
p <- predict(rf, data = temp_data, type = "response")$predictions[, "Yes"]
pdp_data <- rbind(pdp_data, data.frame(Level = grd[i], Predicted_Probability = p))
}
p1<- ggplot(pdp_data, aes(x = Level, y = Predicted_Probability)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
labs(
title = "Partial Dependence: Suicidal Thoughts",
x = "Have you ever had suicidal thoughts?",
y = "Predicted Probability of Depression ('Yes')"
) +
theme_minimal(base_size = 5)
# Academic Pressure
trainingdata$Academic.Pressure <- as.factor(trainingdata$Academic.Pressure)
grd_academic <- levels(trainingdata$Academic.Pressure)
pdp_data_academic <- data.frame()
for (i in 1:length(grd_academic)) {
temp_data <- trainingdata
temp_data$Academic.Pressure <- factor(grd_academic[i], levels = grd_academic)
p <- predict(rf, data = temp_data, type = "response")$predictions[, "Yes"]
pdp_data_academic <- rbind(pdp_data_academic, data.frame(Level = grd_academic[i], Predicted_Probability = p))
}
p2<- ggplot(pdp_data_academic, aes(x = Level, y = Predicted_Probability)) +
geom_boxplot(fill = "lightgreen", color = "darkgreen") +
labs(
title = "Partial Dependence: Academic Pressure",
x = "Academic Pressure Levels",
y = "Predicted Probability of Depression ('Yes')"
) +
theme_minimal(base_size = 5)
# Financial Stress
trainingdata$Financial.Stress <- as.factor(trainingdata$Financial.Stress)
grd_financial <- levels(trainingdata$Financial.Stress)
pdp_data_financial <- data.frame()
for (i in 1:length(grd_financial)) {
temp_data <- trainingdata
temp_data$Financial.Stress <- factor(grd_financial[i], levels = grd_financial)
p <- predict(rf, data = temp_data, type = "response")$predictions[, "Yes"]
pdp_data_financial <- rbind(pdp_data_financial, data.frame(Level = grd_financial[i], Predicted_Probability = p))
}
p3 <- ggplot(pdp_data_financial, aes(x = Level, y = Predicted_Probability)) +
geom_boxplot(fill = "lightcoral", color = "darkred") +
labs(
title = "Partial Dependence: Financial Stress",
x = "Financial Stress Levels",
y = "Predicted Probability of Depression ('Yes')"
) +
theme_minimal(base_size = 5)
grid.arrange(p1, p2, p3, ncol = 3)
This plot shows how the predicted probability of depression changes based on whether an individual has ever had suicidal thoughts. Individuals who answered “Yes” have a significantly higher predicted probability of depression compared to those who answered “No.”
This plot illustrates the relationship between academic pressure levels and the predicted probability of depression. As academic pressure increases, the predicted probability of depression generally rises, indicating a positive correlation.
This plot demonstrates how financial stress levels affect the predicted probability of depression. Higher financial stress levels correspond to an increase in the predicted probability of depression, highlighting its strong impact.
To evaluate the performance of the random forest model, we analyze the confusion matrix.
# Get predictions
predicted_probs_rf <- predict(rf, data = testdata, type = "response")$predictions
predicted_rf <- ifelse(predicted_probs_rf[,2] > 0.5, "Yes", "No")
# Ensure factors have matching levels
actual <- factor(testdata$Depression, levels = c("No", "Yes"))
predicted_rf <- factor(predicted_rf, levels = c("No", "Yes"))
# Using caret for confusion matrix (more robust)
conf_matrix <- confusionMatrix(predicted_rf, actual)
# Extract metrics directly from confusionMatrix
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]
metrics_rf2 <- data.frame(
Model = "Random Forest",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1
)
print(metrics_rf2)
## Model Accuracy Precision Recall F1_Score
## Accuracy Random Forest 0.8448152 0.8215899 0.7946702 0.8079058
From the confusion matrix for the random forest above, we can make the following observations:
Overall, the random forest performs strongly: its recall ties Naive Bayes for second place behind the stepwise logistic regression, trailing the best model by less than 0.1 percentage points.
all_metrics <- rbind(glm_metrics2, knn_metrics2, nb_metrics2, metrics_tree2, metrics_rf2)
print(all_metrics)
## Model Accuracy Precision Recall F1_Score
## Accuracy GLM (Stepwise) 0.8471475 0.8258503 0.7955439 0.8104139
## Accuracy1 k-NN 0.8322569 0.8223810 0.7544779 0.7869674
## Accuracy2 Naive Bayes 0.8466093 0.8253176 0.7946702 0.8097040
## Accuracy3 Classification Tree 0.8177252 0.8209783 0.7112276 0.7621723
## Accuracy4 Random Forest 0.8448152 0.8215899 0.7946702 0.8079058
From the combined metrics table we can see the following results:
Overall, the GLM (Stepwise) model achieved the best combination of metrics, particularly in recall (0.7955), making it the most suitable choice for this project. Naive Bayes and Random Forest also performed well and can be considered as backup models. Models like k-NN and Classification Tree are less favorable due to their lower recall.
# Data Preprocessing
data_n$Financial.Stress <- as.numeric(data_n$Financial.Stress)
# 10 folds for cross-validation
set.seed(123)
folds <- createFolds(data_n$Depression, k = 10, list = TRUE, returnTrain = TRUE)
# Creating the placeholders for results
recall_scores <- data.frame(step_glm = numeric(), knn = numeric(), naive_bayes = numeric(), tree = numeric(), random_forest = numeric())
# Scaling numeric variables once for all folds
numeric_cols <- sapply(data_n, is.numeric)
data_new_scaled <- data_n
data_new_scaled[, numeric_cols] <- lapply(data_n[, numeric_cols], scale)
for (fold_index in 1:10) {
# Here we are splitting data into training and test sets
train_indices <- folds[[fold_index]]
training_data <- data_new_scaled[train_indices, ]
test_data <- data_new_scaled[-train_indices, ]
# Stepwise Logistic Regression (as it has lower AIC compared to Logistic)
base_model <- glm(Depression ~ ., data = training_data, family = "binomial")
step_model <- step(base_model, direction = "both", trace = 0) # Use stepwise selection
step_predictions <- predict(step_model, newdata = test_data, type = "response")
step_pred_class <- ifelse(step_predictions > 0.5, "Yes", "No")
step_conf_matrix <- confusionMatrix(factor(step_pred_class, levels = c("No", "Yes")),
test_data$Depression)
# k-NN Model
knn_model <- train(Depression ~ ., data = training_data,
method = "knn",
trControl = trainControl(method = "none"),
tuneGrid = data.frame(k = 15),
metric = "Recall")
knn_predictions <- predict(knn_model, newdata = test_data)
knn_conf_matrix <- confusionMatrix(knn_predictions, test_data$Depression)
# Naive Bayes
nb_model <- naiveBayes(Depression ~ ., data = training_data)
nb_predictions <- predict(nb_model, newdata = test_data)
nb_conf_matrix <- confusionMatrix(nb_predictions, test_data$Depression)
# Decision Trees
tree_model <- rpart(Depression ~ ., data = training_data)
tree_predictions <- predict(tree_model, newdata = test_data, type = "class")
tree_conf_matrix <- confusionMatrix(tree_predictions, test_data$Depression)
# Random Forest
rf_model <- ranger(Depression ~ ., data = training_data, probability = TRUE, importance = "permutation")
rf_predictions <- predict(rf_model, data = test_data)$predictions
rf_pred_class <- ifelse(rf_predictions[, 2] > 0.5, "Yes", "No")
rf_conf_matrix <- confusionMatrix(factor(rf_pred_class, levels = c("No", "Yes")),
test_data$Depression)
recall_scores <- rbind(recall_scores, data.frame(
step_glm = step_conf_matrix$byClass["Recall"],
knn = knn_conf_matrix$byClass["Recall"],
naive_bayes = nb_conf_matrix$byClass["Recall"],
tree = tree_conf_matrix$byClass["Recall"],
random_forest = rf_conf_matrix$byClass["Recall"]
))
}
boxplot(recall_scores,
ylab = "Recall (Sensitivity)",
col = c('plum', 'palevioletred', 'pink', 'lightblue', 'lightgreen'),
main = "10-Fold CV Recall Comparison")
From the 10-fold CV recall comparison, we can observe that Naive Bayes and stepwise GLM have the highest median recall scores, both around 0.79–0.80, indicating they are the most effective at capturing true positives. Random forest also performs well, with a slightly lower median recall near 0.78. k-NN has moderate recall (around 0.75), while the decision tree has the lowest, with a median near 0.72. Overall, Naive Bayes and stepwise GLM outperform the other models in terms of recall, while the decision tree lags behind.
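Because each model was evaluated on the same 10 folds, the fold-wise recalls are paired, and a paired test can check whether a gap between two models is systematic rather than fold-to-fold noise. A sketch with toy fold recalls (not the actual recall_scores data frame):

```r
# Hypothetical fold-level recalls for two models evaluated on the same folds
set.seed(42)
recall_glm_toy  <- 0.79 + rnorm(10, sd = 0.01)  # stand-in for stepwise-GLM recalls
recall_tree_toy <- 0.72 + rnorm(10, sd = 0.01)  # stand-in for tree recalls
# Paired t-test: is the mean fold-wise difference nonzero?
tt <- t.test(recall_glm_toy, recall_tree_toy, paired = TRUE)
tt
```

With the real results, t.test(recall_scores$step_glm, recall_scores$tree, paired = TRUE) would test the stepwise-GLM-vs-tree gap directly (a Wilcoxon signed-rank test is a nonparametric alternative).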
This project has provided important insights into the factors associated with depression among students, along with predictive models to identify individuals at risk. However, there are several areas where further research and development can enhance the outcomes. One key area for future work is the incorporation of longitudinal data, which would allow for the study of how depression risk factors change over time, offering a deeper understanding of causal relationships. Expanding the dataset to include individuals from a wider range of geographic, cultural, and socioeconomic backgrounds would also improve the generalizability of the findings.
Additionally, including new variables, such as social media usage, physical activity levels, and access to mental health resources, could make the models more comprehensive. Another promising area is the development of real-time prediction systems that integrate with wearable devices or mobile applications to monitor lifestyle factors like sleep patterns and activity levels, enabling early interventions.
Collaboration with mental health professionals is another important step, as their expertise can help interpret the findings and design effective, tailored interventions for individuals at risk. Ethical considerations, such as addressing privacy concerns and algorithmic bias, must also be prioritized to ensure the responsible use of predictive models. Finally, testing the models in real-world settings, such as schools, workplaces, and healthcare facilities, would provide valuable insights into their practical utility and effectiveness. By focusing on these areas, future work can build upon this project to create more robust and impactful tools for addressing mental health challenges.
In conclusion, this project has helped us understand the important factors that contribute to depression among students and young professionals. We found that things like sleep habits, diet, work hours, and personal background can all affect someone’s risk of depression, but the main influential factors are still suicidal thoughts, academic pressure, and financial stress. Both through 10-fold cross-validation and out-of-sample performance, we observed that the Naive Bayes and stepwise GLM models had the highest recall, meaning they were most effective at correctly identifying individuals at risk of depression. By focusing on prediction rather than inference, the project offers practical tools that stakeholders can use to address mental health challenges effectively.
For educational institutions, the insights gained from this study can guide the development of policies aimed at reducing academic stress and promoting mental well-being among students. Workplace stakeholders, including employers and human resource managers, can leverage these findings to create healthier work environments, such as balancing workloads and encouraging better work-life integration. Healthcare professionals and policymakers can also use the predictive models to develop early warning systems, enabling timely interventions for individuals at risk.
The implications of this work highlight the importance of integrating data-driven approaches into mental health strategies. By identifying at-risk individuals early, stakeholders can implement targeted interventions that not only improve individual well-being but also contribute to healthier, more productive communities. The findings underline the urgent need for collaborative efforts among educators, employers, healthcare providers, and policymakers to address the growing mental health challenges faced by young populations. So, we hope that our project provides a foundation for these efforts, paving the way for future advancements in mental health research and practice.