Open-Course-Data-Analysis-Project

Open-Course-Elective-Choice-Trends-of-Students [2019-2020]

Data Source : Department of Statistics, Mar Athanasius College of Arts and Science (Autonomous), Kothamangalam
Language : Python
Workflow : JupyterLab, Microsoft Excel

*Certain columns with Personally Identifiable Information [PII] have been removed for privacy reasons.

TABLE OF CONTENTS : 📌

  1. Project Background
  2. About the Data
  3. Project Summary
  4. Exploratory Data Analysis [EDA]
  5. K-Means Clustering
  6. Multinomial Logistic Regression

1. PROJECT BACKGROUND

In the fifth semester, students across 12 departments enroll in one open course out of 13 options. Students are required to fill in a form with details such as their gross points from the first two semesters, their first through sixth choice of course, their unique ID, and their parent department. Students are then allotted a course based on the priority order of their choices and their gross percentage scores.
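The allotment rule described above (choice priority, broken by gross percentage) can be sketched as a greedy assignment. This is a minimal illustration under assumed data shapes and seat limits, not the college's actual procedure:

```python
def allot_courses(students, seats):
    """Greedy allotment: rank students by gp, then give each student the
    highest-priority choice that still has an open seat.

    students: list of dicts with 'id', 'gp', and 'choices' (an ordered list)
    seats: dict mapping course code -> number of open seats
    """
    seats = dict(seats)  # copy so the caller's dict is untouched
    allotment = {}
    # Higher gp gets priority in the queue
    for s in sorted(students, key=lambda s: s['gp'], reverse=True):
        for course in s['choices']:
            if seats.get(course, 0) > 0:
                seats[course] -= 1
                allotment[s['id']] = course
                break
        else:
            allotment[s['id']] = None  # none of the listed choices had a seat
    return allotment
```

For example, with one seat each in ST and CO, a student with gp 80 listing [ST, CO] takes ST, and a student with gp 60 listing the same choices falls through to CO.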


2. ABOUT THE DATA 🧩

The data was collected by the Department of Statistics using Google Forms, which the students were required to fill in. Data from the years 2019 and 2020 were used for comparison. The collected data then underwent a few transformations and some basic cleaning in Microsoft Excel.

2.1. Data Files :

  1. Allotment-Data-2019.xlsx
  2. Allotment-Data-2020.xlsx

2.2. Data Info :

  Year  Rows  Columns
  2019   368       10
  2020   413       12

2.3. Columns :

\[gp = \frac{\text{Sem I Score} + \text{Sem II Score}}{\text{Total Marks}}\]
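As a sketch, the gp column can be derived in pandas; the column names and the final scaling by 100 (so that gp reads as a percentage, matching the tables in this report) are assumptions:

```python
import pandas as pd

def compute_gp(df: pd.DataFrame) -> pd.Series:
    """Derive gp from the semester scores.

    Column names ('Sem I Score', 'Sem II Score', 'Total Marks') are
    assumed; the result is scaled by 100 to read as a percentage.
    """
    return (df['Sem I Score'] + df['Sem II Score']) / df['Total Marks'] * 100
```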

2.4. Departments :

  1. B.Com. Model I
  2. B.A. Economics
  3. B.A. Sociology
  4. B.A. English
  5. B.A. History
  6. Physical Education
  7. B.Sc. Chemistry
  8. B.Sc. Mathematics
  9. B.Sc. Statistics
  10. B.Sc. Zoology
  11. B.Sc. Botany
  12. B.Sc. Physics
  13. B.A. Hindi

2.5. Notations of Course Choices :


3. PROJECT SUMMARY 🏆

  1. Nearly 75% to 80% of students were allotted their first-choice department.
  2. The college-wide average gp increased from 2019 to 2020.
  3. B.Sc. Botany and B.Sc. Chemistry were the only two programs whose average gp scores decreased.
  4. Science subjects had higher gp scores than arts subjects.
  5. B.Sc. Mathematics had the highest average gp scores in both years.
  6. Six students in 2020 were admitted to a course outside their six choices.
  7. Science students had better chances of being admitted to the course of their choice than arts students.
  8. Even though science departments had the highest marks in both years, the largest improvements in gp scores were observed in the arts departments.

4. EXPLORATORY DATA ANALYSIS [EDA] 💡

EDA is an essential step in any data analysis project.

4.1. Gross Percentage Descriptive Stats

Department         Avg. GP 2019   Avg. GP 2020   Change +/-
B.A Economics             47.69          56.23        +8.54
B.A English               58.43          64.23        +5.80
B.A Hindi                 45.50          55.03        +9.53
B.A History               46.51          50.29        +3.78
B.A Sociology             43.22          53.01        +9.79
B.Com Model I             65.70          75.01        +9.31
B.Sc Botany               68.83          64.07        -4.76 🔻
B.Sc Chemistry            76.37          75.69        -0.68 🔻
B.Sc Mathematics          76.67          78.30        +1.63
B.Sc Physics              67.98          76.24        +8.26
B.Sc Statistics           71.97          76.46        +4.49
B.Sc Zoology              61.14          63.06        +1.92

Table 1.0. shows the department-wise average Gross Percentage of students and the change from 2019 to 2020. From the table, we can observe the following:

  1. B.Sc. Botany and B.Sc. Chemistry were the only two departments whose average gp score decreased.
  2. Science departments had higher average scores than most arts departments.

4.1.1. GP Descriptive Stats 2019
Department count mean std min 25% 50% 75% max
B.A Economics 35.0 47.70 22.44 0.00 32.82 44.10 70.30 76.90
B.A English 35.0 58.44 20.54 10.60 37.90 63.80 75.60 87.40
B.A Hindi 22.0 45.50 23.93 7.00 26.10 47.00 62.67 84.17
B.A History 36.0 46.51 17.07 13.40 32.98 49.20 55.98 87.20
B.A Sociology 33.0 43.22 21.45 5.00 29.00 44.60 59.00 87.70
B.Com Model I 50.0 65.70 16.78 23.00 60.04 71.71 76.98 88.58
B.Sc Botany 28.0 68.84 13.90 31.17 64.19 71.21 79.62 85.42
B.Sc Chemistry 24.0 76.37 15.43 34.83 67.33 82.46 87.77 92.25
B.Sc Mathematics 25.0 76.67 12.67 36.24 70.48 76.50 87.17 91.92
B.Sc Physics 23.0 67.99 18.20 29.17 56.88 75.08 79.79 89.75
B.Sc Statistics 29.0 71.97 10.47 47.67 65.58 74.08 78.08 88.67
B.Sc Zoology 28.0 61.14 16.95 24.17 52.63 65.09 74.17 88.50

Table 1.1. shows the department-wise descriptive statistics of Gross Percentage in the year 2019.
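Department-wise descriptive statistics like these can be reproduced with a single pandas groupby. A minimal sketch on toy data (the column names are assumed):

```python
import pandas as pd

# Toy stand-in for the allotment data; column names are assumed
df = pd.DataFrame({
    'Department': ['B.A Economics', 'B.A Economics', 'B.A English', 'B.A English'],
    'gp': [44.1, 51.3, 63.8, 53.0],
})

# count / mean / std / min / quartiles / max of gp per department
stats = df.groupby('Department')['gp'].describe().round(2)
print(stats)
```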


4.1.2. GP Descriptive Stats 2020
Department count mean std min 25% 50% 75% max
B.A Economics 49.0 56.23 22.00 8.70 38.20 58.80 74.90 87.20
B.A English 40.0 64.24 16.89 30.50 48.00 68.20 79.80 88.70
B.A Hindi 29.0 55.04 22.11 16.33 35.50 48.33 76.17 90.83
B.A History 42.0 50.29 22.92 0.00 36.65 54.25 66.55 89.50
B.A Sociology 43.0 53.02 15.75 15.60 43.95 55.90 62.45 81.70
B.Com Model I 51.0 75.01 16.00 27.67 67.96 79.58 87.21 93.33
B.Sc Botany 23.0 64.07 22.31 15.50 54.92 73.92 78.46 87.08
B.Sc Chemistry 26.0 75.69 16.32 37.00 71.19 82.87 86.73 92.83
B.Sc Mathematics 25.0 78.30 12.02 50.58 75.67 81.33 86.50 90.67
B.Sc Physics 29.0 76.24 12.36 38.58 69.33 80.25 85.25 93.17
B.Sc Statistics 27.0 76.47 16.57 29.25 71.58 81.58 88.88 92.25
B.Sc Zoology 29.0 63.07 16.63 28.92 50.33 65.92 78.42 89.08

Table 1.2. shows the department-wise descriptive statistics of Gross Percentage in the year 2020.


4.2. A Rather Unfortunate Mishap in 2020
  ID   Program         gp
1020   B.A Economics   67.70
1038   B.A Economics   65.10
1066   B.A English     30.50
1099   B.A Hindi       43.67
1109   B.A Hindi       18.50
1133   B.A History     47.90

Table 1.3. shows the six students who were not admitted to any of their six choice preferences.


Department Top Allotted [2019] Top Choice [2019] Top Allotted [2020] Top Choice [2020]
B.A Economics CO ST CO CO
B.A English SO SO SO PE
B.A Hindi EN SO BO HI
B.A History EC EC EN SO
B.A Sociology HI HI EC HI
B.Com Model I ST ST ST ST
B.Sc Botany CH CH ZO ZO
B.Sc Chemistry MA MA MA MA
B.Sc Mathematics ST ST CO CO
B.Sc Physics CO CO MA MA
B.Sc Statistics MA MA CO CO
B.Sc Zoology CH CH CH CH

Table 2.0. shows the top choice and the top allotted course of each parent department in both years.


Distribution of Students in Parent Departments [2019-2020]

fig 1.0. shows the proportion and count of students in various departments.


Average Department-Wise GP [2019-2020]

fig 2.0. shows the department-wise average GP scores of students.


Change in GP Score from (2019 - 2020)

fig 2.1. shows the change in the average GP score of departments from 2019 to 2020.


Boxplots - GP Scores (2019-2020)

fig 2.2. shows the boxplots of GP scores in both years.


Lineplot of GP Scores (2019-2020)

fig 2.3. shows the line plot of department-wise GP scores.


Probability Distribution Plot - GP Scores (2019-2020)

fig 2.4. shows the layered probability density plot of GP scores in both years.


Popular Choices of Students
fig 3.0. shows the total count of top-3 choices received by each open course subject.


Top Choice Allotted Proportion (2019-2020)

fig 4.0. shows the proportion of students allotted to their first choice.


Heatmap - Allotment Proportion of Parent Departments (2019-2020)

fig 5.0. Heatmap showing the proportion of students from each parent department allotted to the open course subjects.


Heatmap - Change in Allotment Proportion of Parent Departments (2019-2020)

fig 5.1. Heatmap showing the change in the proportion of students from each parent department allotted to the open course subjects.


Top Parent Department Allotments (2019-2020)

fig 5.2. Network diagram showing the top course allotted to students from each parent department.


Heatmap - Allotment Proportion in Open Course Subjects (2019-2020)

Heatmap


Heatmap - Change in Allotment Proportion in Open Course Subjects (2019-2020)

fig 5.3. shows the change in allotment proportion trends in the open course subjects.


Top Open Course Allotments (2019-2020)

fig 5.4. shows the top parent program received by each open course subject.


Heatmap - First Choice (2019-2020)

fig 5.5. shows the first-choice proportions of students.


Heatmap - Change in First Choice (2019-2020)

fig 5.6. shows the change in first-choice proportions of students.


Top Choices (2019-2020)

fig 5.7. Network diagram representing the top elective course preferred by each department.


5. K-Means Clustering 💭

K-Means Clustering is a simple but effective unsupervised learning algorithm used to group data points into clusters. It is a centroid-based algorithm: it identifies clusters by finding their centroids (centers), and each point is assigned to the cluster whose centroid is nearest, typically measured by Euclidean distance.

Steps in K-Means Clustering :

  1. Define the number of clusters (k) : This is the most crucial step, as it determines the granularity of the clusters. The optimal number of clusters can be determined using various techniques, such as the elbow method or silhouette analysis. For this project, the elbow method was used to find the optimal number of clusters.

  2. Initialize the centroids : The algorithm starts by randomly placing k centroids within the data space. These centroids represent the center of each cluster.

  3. Assign data points to clusters : Each data point is assigned to the closest centroid based on a distance metric, such as Euclidean distance.

  4. Recalculate the centroids : After all data points are assigned, the algorithm recalculates the centroids by taking the average of all data points within each cluster.

  5. Repeat steps 3 and 4 until convergence : The algorithm iterates between steps 3 and 4 until the centroids no longer move significantly. This indicates that the clusters have converged and are stable.
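The steps above, including the elbow method used in this project, can be sketched with scikit-learn. Synthetic data stands in here for the actual allotment features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D data with three well-separated groups
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(30, 2)),
])

# Elbow method: fit k-means over a range of k and record the inertia
# (within-cluster sum of squares); the "elbow" in the curve suggests k
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Final model at the chosen k
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = model.labels_
```

Plotting `inertias` against k and picking the point where the curve flattens gives the k used for the cluster plots below.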


Allotment Clusters of Parent Department (2019)

fig 6.0 a. : shows the elbow method used to find the optimal number of clusters (here, 3).

fig 6.0 b. : shows the cluster plot of parent departments, visualized using PCA in a custom plot.

fig 6.0 c. : shows the mean of each cluster against its behaviour in the respective open course subjects.
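The PCA-based cluster plot (fig 6.0 b.) can be sketched as follows; the 12x13 proportion matrix here is synthetic, standing in for the real department-by-course allotment proportions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the department x course proportion matrix
props = rng.random((12, 13))
props = props / props.sum(axis=1, keepdims=True)  # each row sums to 1

# Cluster the departments on their full 13-dimensional profiles
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(props)

# Project the 13-dimensional rows onto 2 principal components for plotting
coords = PCA(n_components=2).fit_transform(props)
# coords[:, 0] / coords[:, 1] give the x/y positions; color points by `labels`
```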

Insights :


Allotment Clusters of Parent Department (2020)

fig 6.1 a, b & c show the elbow method, the cluster plot and the multiple bar plot of parent department allotment behaviours.

Insights :


Allotment Clusters of Open Course Subjects (2019)

fig 6.2 a, b & c show the allotment behaviours of the open course subjects in the year 2019.

Insights :


Allotment Clusters of Open Course Subjects (2020)

fig 6.3 a, b & c show the allotment behaviours of the open course subjects in the year 2020.

Insights :


Parent Department Clusters of First Choice (2019)

fig 6.4 a, b & c show the scree plot, the cluster plot and the multiple barplot of the cluster formation.

Insights :


Parent Department Clusters of First Choice (2020)

fig 6.5 a, b & c show the scree plot, the cluster plot and the multiple barplot of the cluster formation.

Insights :

6. Multinomial Logistic Regression 🎯

Multinomial Logistic Regression extends binary logistic regression to handle more than two classes. It is a classification algorithm that is particularly useful when the target variable has several categories: it models the probability of each category relative to one reference category.
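In the standard formulation (not specific to this project's implementation), with K categories and category K taken as the reference, the probability of category j given the feature vector x is:

\[P(y = j \mid x) = \frac{e^{\beta_j^{T} x}}{1 + \sum_{k=1}^{K-1} e^{\beta_k^{T} x}}, \qquad j = 1, \dots, K-1\]

\[P(y = K \mid x) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\beta_k^{T} x}}\]

Each non-reference category has its own coefficient vector, and all coefficients are estimated jointly by maximum likelihood.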

Introduction

In this project, we employed a multinomial logistic regression model to predict the allotted courses for students based on their numerical and categorical attributes. We trained the model on the 2019 dataset and subsequently tested it on both the 2019 and 2020 datasets to evaluate its performance and generalizability.

Methodology

The dataset includes various features, with one numerical column and several categorical columns. The target variable is the ‘Allotted Course’. The categorical features were one-hot encoded to convert them into a numerical format suitable for the logistic regression model.

Steps involved were:

Results and Interpretation

2019 Results
Confusion Matrix:

The confusion matrix for the 2019 test data shows how well the model predicted each class. Each row represents the actual class, and each column represents the predicted class. Diagonal values indicate correct predictions, while off-diagonal values indicate misclassifications.

Classification Report:

The classification report provides precision, recall, and F1-score for each class:

CODE :

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from joblib import dump, load

# Split into features and target
numerical_column = df1.iloc[:, 2]       # the numerical column (gp)
categorical_columns = df1.iloc[:, 3:9]  # the categorical columns

# One-hot encode only the categorical variables
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = pd.DataFrame(
    encoder.fit_transform(categorical_columns),
    columns=encoder.get_feature_names_out(categorical_columns.columns),
    index=df1.index,
)

# Combine the numerical column with the encoded categorical columns
X = pd.concat([numerical_column, encoded], axis=1)
y = df1['Allotted Course']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Save the trained model
dump(model, 'Model/MNLogReg.joblib')
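The evaluation step that produced the confusion matrices and classification reports is not shown in the snippet above; a minimal sketch on synthetic data (since df1 itself is not public) would be:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and allotted-course labels
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = rng.choice(['BO', 'CH', 'CO'], size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```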

Classification Report:

          precision    recall  f1-score   support

      BO       1.00      0.25      0.40         4
      CH       0.75      1.00      0.86         6
      CO       0.78      1.00      0.88         7
      EC       0.80      1.00      0.89         4
      EN       0.88      0.88      0.88         8
      HI       0.80      1.00      0.89         4
      HN       1.00      0.00      0.00         2
      MA       1.00      0.75      0.86         8
      ...
weighted avg   0.86      0.82      0.80        74

Overall Metrics (2019):

2020 Results:

Confusion Matrix:


Classification Report:

            precision    recall  f1-score   support

      BO       0.80      0.40      0.53        10
      CH       0.67      0.67      0.67         3
      CO       1.00      0.91      0.95        11
      EC       1.00      1.00      1.00         6
      EN       0.75      0.60      0.67         5
      HI       0.83      0.91      0.87        11
      HN       1.00      0.50      0.67         4
      MA       0.40      1.00      0.57         4
      ...
weighted avg   0.81      0.75      0.74        83

Overall Metrics (2020):

Interpretation

Conclusion:

The multinomial logistic regression model demonstrated reasonable accuracy in predicting course allotments for both the 2019 and 2020 datasets. However, performance varied across classes, with some classes consistently showing high predictive power and others indicating potential areas for improvement. The results suggest that while the model generalizes well, there is room for refinement, particularly in addressing the variability and potential changes in the underlying data between years.