Class distribution:
df['Class'].value_counts()
Class
0 86
1 84
Name: count, dtype: int64
The dataset appears fairly balanced, with 86 divorced and 84 still married.
Train-test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
The dataset is split 80:20 into training and test sets respectively.
Several models were trained and evaluated to determine the best performing one. Model results will show accuracy, the confusion matrix and the classification report for better understanding.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
estimators = {
    'Decision Tree' : DecisionTreeClassifier(random_state=3),
    'Random Forest' : RandomForestClassifier(random_state=3),
    'Extra Trees' : ExtraTreesClassifier(random_state=3),
    'Gradient Boost' : GradientBoostingClassifier(random_state=3),
    'AdaBoost' : AdaBoostClassifier(random_state=3),
    'Logistic Regression' : LogisticRegression(random_state=3),
    'SGDC' : SGDClassifier(random_state=3),
    'Ridge' : RidgeClassifier(random_state=3)
}
for name, model in estimators.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Print the model name
    print(f'{name}')
    # Print the accuracy score
    print(f' Accuracy: {accuracy_score(y_test, y_pred):.3f}')
    # Print the confusion matrix
    print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
    # Print the classification report
    print(f' Report: \n{classification_report(y_test, y_pred)}')
    print("*" * 100)
1. K-Nearest Neighbors (KNN):
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
expected = y_test
# Use a list comprehension to find any wrong predictions
wrong_pred = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
wrong_pred  # knn wrongly predicted 1 out of our 34 values
[(0, 1)]
print(f'{knn.score(X_test, y_test):.2%}')
97.06%
KNN has a predictive accuracy of 97.06%, recording one misprediction out of 34 test values.
# K-FOLD CROSS VALIDATION
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=8, shuffle=True)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=X_test, y=y_test, cv=kfold)
scores: array([1. , 1. , 1. , 1. , 0.83333333])
scores.mean() = 96.67%
The cross-validation score is 96.67%.
2. Decision Tree:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=3)
decision_tree.fit(X_train, y_train)
Decision Tree
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
The Decision Tree model perfectly classified all instances in the test set, indicating potential overfitting.
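A quick way to probe this, sketched below as a rough check rather than a definitive test, is to cross-validate the Decision Tree on the full dataset instead of the single 34-sample test set; it reuses the X, y and kfold objects defined earlier.
# Overfitting check: 5-fold cross-validation of the Decision Tree on the full data.
# A mean score noticeably below the perfect test-set accuracy would point towards overfitting.
from sklearn.model_selection import cross_val_score
dt_scores = cross_val_score(estimator=decision_tree, X=X, y=y, cv=kfold)
print(f'Decision Tree CV scores: {dt_scores}')
print(f'Mean CV accuracy: {dt_scores.mean():.2%}')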
3. Random Forest:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state=3)
random_forest.fit(X_train, y_train)
Random Forest
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score support
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Random Forest achieved high accuracy, similar to KNN, but with slightly more misclassifications compared to the Decision Tree.
4. Gradient Boosting:
from sklearn.ensemble import GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier(random_state=3)
gradient_boost.fit(X_train, y_train)
Gradient Boost
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
Like the Decision Tree, Gradient Boosting also achieved perfect accuracy on the test set.
5. Logistic Regression:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(random_state=3)
logistic_regression.fit(X_train, y_train)
Logistic Regression
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score support
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Logistic Regression also performed well, demonstrating its effectiveness for this classification problem.
6. SGD Classifier and Ridge Classifier:
Both models achieved perfect accuracy, similar to the Decision Tree and Gradient Boosting models.
Overfitting might be an issue, as many of the models show 100% accuracy when predicting. Two remedies are tried below:
- using SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples for better class balance. As highlighted earlier, our data seems well balanced, but let's see if the model can be improved.
- Regularization, which will help the model generalize better.
1. SMOTE:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Load the data
df = pd.read_csv(r'C:\Users\ADMin\data analytics\analyst HQ\divorce.csv', delimiter=';')
# Split data into features and target
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3, stratify=y)
# Apply SMOTE to generate synthetic samples
smote = SMOTE(random_state=3)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Check the class distribution after applying SMOTE
print(f'Class distribution after SMOTE: \n{pd.Series(y_train_smote).value_counts()}')
Class distribution after SMOTE:
Class
1 69
0 69
Name: count, dtype: int64
Perfectly balanced.
2. LASSO AND RIDGE LOGISTIC REGRESSION:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# L1 Regularization (Lasso)
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=3)
logreg_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = logreg_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
logreg_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=3)
logreg_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = logreg_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
3. LASSO AND RIDGE SGDClassifier:
from sklearn.linear_model import SGDClassifier
SGD_l1 = SGDClassifier(penalty='l1', random_state=3)
SGD_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = SGD_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
SGD_l2 = SGDClassifier(penalty='l2', random_state=3)
SGD_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = SGD_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[16 1]
[ 0 17]]
Classification Report:
precision recall f1-score support
0 1.00 0.94 0.97 17
1 0.94 1.00 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 1.000
Confusion Matrix:
[[17 0]
[ 0 17]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 17
1 1.00 1.00 1.00 17
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
The lasso and ridge regularization approaches appear to have minimal impact. Notably, only the lasso regularization on the SGDClassifier shows a slight effect, reducing overfitting and lowering the predictive accuracy from 100% to 97.1%. However, both regularization approaches for logistic regression yield the same predictive accuracy of 97.1% as the non-regularized model, indicating no significant improvement.
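Part of the reason regularization changes so little may be that the default regularization strength is used throughout. Below is a minimal sketch, assuming the SMOTE-resampled split from above, of sweeping the inverse regularization strength C for the L1-penalized logistic regression; the grid of values is illustrative rather than tuned.
# Illustrative sweep of the regularization strength; smaller C means stronger regularization.
# Each model is fitted on the SMOTE-resampled training data and scored on the untouched test set.
for C in [0.01, 0.1, 1, 10]:
    logreg = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=3)
    logreg.fit(X_train_smote, y_train_smote)
    print(f'C={C}: test accuracy = {logreg.score(X_test, y_test):.3f}')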
- High Accuracy Models: The Decision Tree, Gradient Boosting, and SGD Classifier achieved perfect accuracy on the test set, which prompted the regularization step above, though it changed little. Such high accuracy may indicate overfitting, necessitating further validation with a larger dataset.
- Strong Performance: KNN, Logistic Regression and Random Forest provided consistently high accuracy with good generalization capabilities.
- Feature Importance: Investigating feature importance for models like Random Forest can provide insights into key factors influencing divorce, potentially guiding relationship counseling and interventions (see the sketch after this list).
- Future Work: To ensure robustness, further cross-validation, hyperparameter tuning, and testing on an expanded dataset are recommended (a tuning sketch follows the feature-importance one below).
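A minimal sketch of how feature importance could be inspected, assuming the random_forest model and the feature matrix X defined earlier; it simply ranks the forest's impurity-based importance scores.
# Rank the questionnaire items by the Random Forest's impurity-based importance scores
# and show the ten most influential ones.
import pandas as pd
importances = pd.Series(random_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))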
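And a minimal sketch of the suggested hyperparameter tuning combined with cross-validation, here for KNN via GridSearchCV, assuming the KNeighborsClassifier import and the train/test split from earlier; the parameter grid is an illustrative assumption, not a tuned choice.
# Illustrative grid search over the number of neighbours with 5-fold cross-validation
# on the training split; the refitted best model is then scored on the held-out test set.
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f'Best parameters: {grid.best_params_}, CV accuracy: {grid.best_score_:.3f}')
print(f'Test accuracy: {grid.score(X_test, y_test):.3f}')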