Class distribution:
df['Class'].value_counts()
Class
0 86
1 84
Name: count, dtype: int64
The dataset looks fairly balanced, with 86 divorced and 84 still married.
Train-test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
The dataset is split 80:20 into training and test sets respectively.
Various models were trained and evaluated to find the best performing one. For each model, the results show the accuracy, confusion matrix and classification report for a better understanding.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
estimators = {
    'Decision Tree' : DecisionTreeClassifier(random_state=3),
    'Random Forest' : RandomForestClassifier(random_state=3),
    'Extra Trees' : ExtraTreesClassifier(random_state=3),
    'Gradient Boosting' : GradientBoostingClassifier(random_state=3),
    'AdaBoost' : AdaBoostClassifier(random_state=3),
    'Logistic Regression' : LogisticRegression(random_state=3),
    'SGDC' : SGDClassifier(random_state=3),
    'Ridge' : RidgeClassifier(random_state=3)
}
for name, model in estimators.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Print the model name
    print(f'{name}')
    # Print the accuracy score
    print(f' Accuracy: {accuracy_score(y_test, y_pred):.3f}')
    # Print the confusion matrix
    print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
    # Print the classification report
    print(f' Report: \n{classification_report(y_test, y_pred)}')
    print("*" * 100)
1. K-Nearest Neighbors (KNN):
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
expected = y_test
# Use a list comprehension to find any wrong predictions
wrong_pred = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
wrong_pred # knn wrongly predicted 1 out of our 34 values
[(0, 1)]
print(f'{knn.score(X_test, y_test):.2%}')
97.06%
KNN has a predictive accuracy of 97.06%, recording one misprediction out of 34 values.
# K-FOLD CROSS VALIDATION
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=8, shuffle=True)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=X_test,
                         y=y_test, cv=kfold)
scores: array([1. , 1. , 1. , 1. , 0.83333333])
scores.mean() = 96.67%
Cross-validation score: 96.67%
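Note that the 5-fold split above is drawn from only the 34 test rows, so each fold is very small. As an extra sanity check (not part of the original workflow), the same cross-validation could be run over the full dataset; a minimal sketch, assuming knn, X and y are the fitted classifier and the full feature matrix and labels loaded earlier:
from sklearn.model_selection import KFold, cross_val_score
# Re-run the 5-fold CV over all rows rather than only the 34-row test split
kfold_full = KFold(n_splits=5, random_state=8, shuffle=True)
scores_full = cross_val_score(estimator=knn, X=X, y=y, cv=kfold_full)
print(f'Fold accuracies: {scores_full}')
print(f'Mean CV accuracy: {scores_full.mean():.2%}')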
2. Decision Tree:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=3)
decision_tree.fit(X_train, y_train)
Decision Tree
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
The Decision Tree model perfectly classified all instances in the test set, indicating potential overfitting.
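One quick way to probe that suspicion (an added check, not from the original notebook) is to compare the tree's training accuracy with its cross-validated accuracy; a minimal sketch, assuming decision_tree, X_train, y_train, X and y are the objects defined above:
from sklearn.model_selection import cross_val_score
# A large gap between training accuracy and cross-validated accuracy
# is a typical symptom of overfitting.
train_acc = decision_tree.score(X_train, y_train)
cv_scores = cross_val_score(decision_tree, X, y, cv=5)
print(f'Training accuracy: {train_acc:.3f}')
print(f'5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})')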
3. Random Forest:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state=3)
random_forest.fit(X_train, y_train)
Random Forest
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score support
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Random Forest achieved high accuracy, similar to KNN, though with slightly more misclassifications than the Decision Tree.
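Since Random Forest also exposes learned feature importances, a quick look at which questionnaire items drive its predictions takes only a few lines; a minimal sketch, assuming random_forest is the fitted model above and X is the feature DataFrame used earlier:
import pandas as pd
# Rank the questionnaire attributes by the forest's impurity-based importance
importances = pd.Series(random_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))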
4. Gradient Boosting:
from sklearn.ensemble import GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier(random_state=3)
gradient_boost.fit(X_train, y_train)
Gradient Boosting
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
Like the Decision Tree, Gradient Boosting also achieved perfect accuracy on the test set.
5. Logistic Regression:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(random_state=3)
logistic_regression.fit(X_train, y_train)
Logistic Regression
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score support
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Logistic Regression also performed well, demonstrating its effectiveness for this classification problem.
6. SGD Classifier and Ridge Classifier:
Both models achieved perfect accuracy, similar to the Decision Tree and Gradient Boosting models.
Overfitting may be a concern, as several of the models show 100% accuracy on the test set.
We apply SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples for better class balance. As highlighted earlier, our data already seems well balanced, but let's see if the models can be improved.
Regularization may also help the models generalize better.
1. SMOTE:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Load the data
df = pd.read_csv(r'C:\Users\ADMin\data analytics\analyst HQ\divorce.csv', delimiter=';')
# Split data into features and target
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3, stratify=y)
# Apply SMOTE to generate synthetic samples
smote = SMOTE(random_state=3)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Check the class distribution after applying SMOTE
print(f'Class distribution after SMOTE: \n{pd.Series(y_train_smote).value_counts()}')
Class distribution after SMOTE:
Class
1 69
0 69
Name: count, dtype: int64
Perfectly balanced…
2. LASSO AND RIDGE LOGISTIC REGRESSION:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# L1 Regularization (Lasso)
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=3)
logreg_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = logreg_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
logreg_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=3)
logreg_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = logreg_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
3. LASSO AND RIDGE SGDClassifier:
from sklearn.linear_model import SGDClassifier
# L1 Regularization (Lasso)
SGD_l1 = SGDClassifier(penalty='l1', random_state=3)
SGD_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = SGD_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
SGD_l2 = SGDClassifier(penalty='l2', random_state=3)
SGD_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = SGD_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[16 1]
[ 0 17]]
Classification Report:
precision recall f1-score support
0 1.00 0.94 0.97 17
1 0.94 1.00 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 1.000
Confusion Matrix:
[[17 0]
[ 0 17]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 17
1 1.00 1.00 1.00 17
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
The lasso and ridge regularization methods appear to have minimal effect. Notably, only the lasso penalty on the SGDClassifier shows a slight impact, reducing overfitting and lowering the predictive accuracy from 100% to 97.1%. Both regularization methods for logistic regression yield the same predictive accuracy of 97.1% as the non-regularized model, indicating no significant improvement.
- High Accuracy Models: Decision Tree, Gradient Boosting, and SGD Classifier achieved perfect accuracy on the test set, prompting us to apply regularization, which had little effect. However, such high accuracy may indicate overfitting, necessitating further validation on a larger dataset.
- Strong Performance: KNN, Logistic Regression and Random Forest consistently provided high accuracy with good generalization capabilities.
- Feature Importance: Investigating feature importance for models like Random Forest can provide insights into the key factors influencing divorce, potentially guiding relationship counseling and interventions.
- Future Work: To ensure robustness, further cross-validation, hyperparameter tuning, and testing on an expanded dataset are recommended; one possible tuning setup is sketched below.
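As an illustration of that future work (not part of the original analysis), a cross-validated grid search over a few Random Forest hyperparameters could look like the following; the parameter grid values are assumptions chosen purely for demonstration:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Hypothetical search space; widen or narrow it as the data demands
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 3, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=3),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
)
grid.fit(X_train, y_train)
print(f'Best parameters: {grid.best_params_}')
print(f'Best CV accuracy: {grid.best_score_:.3f}')
print(f'Held-out test accuracy: {grid.score(X_test, y_test):.3f}')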