Naïve Bayes Models

 

Expt No: 3                                           Naïve Bayes Models

Date:

 

Aim: To write a program to demonstrate Naïve Bayes models using the Iris dataset.

 

Program

import pandas as pd

 

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
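
# Each variant targets a different feature type: GaussianNB assumes

# continuous, roughly Gaussian features; MultinomialNB and ComplementNB

# expect non-negative counts (Complement is geared to imbalanced classes);

# BernoulliNB expects binary features; CategoricalNB expects

# integer-encoded categories.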

from sklearn.metrics import accuracy_score

 

# Load the Iris dataset from a CSV file

iris_df = pd.read_csv("Iris.csv")

iris_df.drop(columns=["Id"], inplace=True)

 

# Display the dataset characteristics

print("Iris Dataset Characteristics:")

print("Number of samples:", iris_df.shape[0])

print("Number of features:", iris_df.shape[1] - 1) 

print("Classes:", iris_df["Species"].unique())

 

# Summary statistics for each feature

summary_stats = iris_df.describe()

 

# Display summary statistics for each feature

print("Summary Statistics for each feature:")

print(summary_stats)

 

# Box plots for each feature grouped by the target variable "Species"

plt.figure(figsize=(12, 8))

for i, column in enumerate(iris_df.columns[:-1]):

    plt.subplot(2, 2, i+1)

    sns.boxplot(x="Species", y=column, data=iris_df)

    plt.title(f"Box plot - {column}")

    plt.xlabel("Species")

    plt.ylabel(column)

plt.suptitle("Box Plots of Features by Species")

plt.tight_layout()

plt.show()

 

# Class distribution

class_distribution = iris_df["Species"].value_counts()

 

print("\nClass Distribution:")

print(class_distribution)

 

# Naive Bayes models

# Separate the features (X) and the target variable (y)

X = iris_df.drop(columns=["Species"]).values

y = iris_df["Species"].values

 

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
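
# (Optionally, passing stratify=y to train_test_split preserves the

# balanced 50/50/50 class proportions in both splits.)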

 

# Gaussian Naive Bayes

gnb = GaussianNB()

gnb.fit(X_train, y_train)

gnb_pred = gnb.predict(X_test)

gnb_accuracy = accuracy_score(y_test, gnb_pred)

 

# Multinomial Naive Bayes

mnb = MultinomialNB()

mnb.fit(X_train, y_train)

mnb_pred = mnb.predict(X_test)

mnb_accuracy = accuracy_score(y_test, mnb_pred)

 

# Convert the continuous features to binary for Bernoulli Naive Bayes

binarization_threshold = 0.5

X_train_binary = (X_train > binarization_threshold).astype(int)

X_test_binary = (X_test > binarization_threshold).astype(int)
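
# Note: with a fixed threshold of 0.5, nearly every Iris measurement maps

# to 1 (only some petal widths fall below it), so the binary features carry

# little signal; this explains the low Bernoulli accuracy in the output.

# A per-feature median threshold is a common alternative (sketch, left

# commented out so the sample output stays reproducible; needs numpy as np):

# thresholds = np.median(X_train, axis=0)

# X_train_binary = (X_train > thresholds).astype(int)

# X_test_binary = (X_test > thresholds).astype(int)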

 

# Bernoulli Naive Bayes with adjusted binary features

bnb = BernoulliNB()

bnb.fit(X_train_binary, y_train)

bnb_pred = bnb.predict(X_test_binary)

bnb_accuracy = accuracy_score(y_test, bnb_pred)

 

# Complement Naive Bayes

cnb = ComplementNB()

cnb.fit(X_train, y_train)

cnb_pred = cnb.predict(X_test)

cnb_accuracy = accuracy_score(y_test, cnb_pred)

 

 

# Categorical Naive Bayes

catnb = CategoricalNB()

catnb.fit(X_train, y_train)

catnb_pred = catnb.predict(X_test)

catnb_accuracy = accuracy_score(y_test, catnb_pred)
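
# Note: CategoricalNB expects non-negative integer-encoded categories, so

# scikit-learn coerces these continuous values to integers here, which acts

# as a crude truncation-based binning. An explicit discretizer is cleaner

# (hedged sketch, commented out to keep the sample output unchanged):

# from sklearn.preprocessing import KBinsDiscretizer

# kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

# X_train_cat = kbd.fit_transform(X_train).astype(int)

# X_test_cat = kbd.transform(X_test).astype(int)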

 

# Accuracy of various Naïve Bayes models

print("Accuracy of various Naive Bayes models for Iris datatset. ")

print("Gaussian Naive Bayes:", format(gnb_accuracy, '.4f'))

print("Multinomial Naive Bayes:", format(mnb_accuracy, '.4f'))

print("Bernoulli Naive Bayes:", format(bnb_accuracy, '.4f'))

print("Complement Naive Bayes:", format(cnb_accuracy, '.4f'))

print("Categorical Naive Bayes:", format(catnb_accuracy, '.4f'))

 

 




Result: Thus the program to demonstrate Naïve Bayes models was written and executed.


Sample Output:

 

Iris Dataset Characteristics:

Number of samples: 150

Number of features: 4

Classes: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

 

Summary Statistics for each feature:

            SepalLength  SepalWidth  PetalLength  PetalWidth
count        150.000000  150.000000   150.000000  150.000000
mean           5.843333    3.054000     3.758667    1.198667
std            0.828066    0.433594     1.764420    0.763161
min            4.300000    2.000000     1.000000    0.100000
25%            5.100000    2.800000     1.600000    0.300000
50%            5.800000    3.000000     4.350000    1.300000
75%            6.400000    3.300000     5.100000    1.800000
max            7.900000    4.400000     6.900000    2.500000

 


 

 

Class Distribution:

Iris-versicolor    50

Iris-virginica     50

Iris-setosa        50

Name: Species, dtype: int64

 

Accuracy of various Naive Bayes models for the Iris dataset:

Gaussian Naive Bayes: 1.0000

Multinomial Naive Bayes: 0.9000

Bernoulli Naive Bayes: 0.6333

Complement Naive Bayes: 0.7000

Categorical Naive Bayes: 0.9667



 


Regression models

 

Expt No: 5                                           Regression models

Date:

 

Aim: To write a program to demonstrate various Regression models.

 

Program

 

# Linear Regression, Bayesian Linear Regression and Polynomial Regression

 

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, BayesianRidge

from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import mean_squared_error

from sklearn.impute import SimpleImputer

import numpy as np

import matplotlib.pyplot as plt

 

# Load the dataset

df = pd.read_csv('HousingData.csv')

 

# Assume 'MEDV' as the dependent variable and the rest as independent variables

X = df.drop('MEDV', axis=1)

y = df['MEDV']

 

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

# Handle missing values using simple imputation with mean

imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

X_test_imputed = imputer.transform(X_test)
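
# Note: the imputer is fitted on the training split only and its means are

# reused on the test split, which avoids leaking test-set statistics.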

 

# Linear Regression

lin_reg = LinearRegression()

lin_reg.fit(X_train_imputed, y_train)

lin_reg_train_pred = lin_reg.predict(X_train_imputed)

lin_reg_test_pred = lin_reg.predict(X_test_imputed)

 


 

# Bayesian Linear Regression

bayesian_reg = BayesianRidge()

bayesian_reg.fit(X_train_imputed, y_train)

bayesian_reg_train_pred = bayesian_reg.predict(X_train_imputed)

bayesian_reg_test_pred = bayesian_reg.predict(X_test_imputed)

 

# Polynomial Regression (degree=2)

poly_reg = PolynomialFeatures(degree=2)

X_train_poly = poly_reg.fit_transform(X_train_imputed)

X_test_poly = poly_reg.transform(X_test_imputed)

poly_lin_reg = LinearRegression()

poly_lin_reg.fit(X_train_poly, y_train)

poly_lin_reg_train_pred = poly_lin_reg.predict(X_train_poly)

poly_lin_reg_test_pred = poly_lin_reg.predict(X_test_poly)
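
# Note: the degree-2 expansion grows the feature set combinatorially

# (with the usual 13 predictors of this housing dataset, 105 columns), so

# some overfitting is expected; compare the train and test MSE below.

# A ridge-regularized fit is one common remedy (hedged sketch, commented out):

# from sklearn.linear_model import Ridge

# ridge_poly = Ridge(alpha=1.0).fit(X_train_poly, y_train)

# ridge_test_mse = mean_squared_error(y_test, ridge_poly.predict(X_test_poly))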

 

# Calculate mean squared error

lin_reg_train_mse = mean_squared_error(y_train, lin_reg_train_pred)

lin_reg_test_mse = mean_squared_error(y_test, lin_reg_test_pred)

bayesian_reg_train_mse = mean_squared_error(y_train, bayesian_reg_train_pred)

bayesian_reg_test_mse = mean_squared_error(y_test, bayesian_reg_test_pred)

poly_lin_reg_train_mse = mean_squared_error(y_train, poly_lin_reg_train_pred)

poly_lin_reg_test_mse = mean_squared_error(y_test, poly_lin_reg_test_pred)

 

print("Linear Regression:")

print(f"  Train MSE: {lin_reg_train_mse:.2f}")

print(f"  Test MSE: {lin_reg_test_mse:.2f}")

 

print("Bayesian Linear Regression:")

print(f"  Train MSE: {bayesian_reg_train_mse:.2f}")

print(f"  Test MSE: {bayesian_reg_test_mse:.2f}")

 

print("Polynomial Regression (degree=2):")

print(f"  Train MSE: {poly_lin_reg_train_mse:.2f}")

print(f"  Test MSE: {poly_lin_reg_test_mse:.2f}")

 


 

# Plot actual vs predicted prices

plt.figure(figsize=(12, 6))

plt.scatter(y_test, lin_reg_test_pred, color='blue', label='Linear Regression')

plt.scatter(y_test, bayesian_reg_test_pred, color='green', label='Bayesian Linear Regression')

plt.scatter(y_test, poly_lin_reg_test_pred, color='red', label='Polynomial Regression (degree=2)')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices (Regression)')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices with the fitted line for linear regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, lin_reg_test_pred, color='blue', label='Linear Regression')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Linear Regression')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices with the fitted line for polynomial regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, poly_lin_reg_test_pred, color='red', label='Polynomial Regression (degree=2)')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Polynomial Regression')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices for Bayesian Linear Regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, bayesian_reg_test_pred, color='green', label='Bayesian Linear Regression')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Bayesian Linear Regression')

plt.legend()

plt.show()



# Logistic Regression

# Single variable

 

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

 

# Generate synthetic dataset with multiple features

np.random.seed(42)

n_samples = 1000

 

# Generate features: transaction amount, transaction time, and transaction type

transaction_amount = np.random.normal(loc=50, scale=20, size=n_samples)

transaction_time = np.random.uniform(low=0, high=24, size=n_samples)  # Transaction time in hours

transaction_type = np.random.choice(['Online', 'In-person'], size=n_samples)

 

# Generate target variable: is_fraudulent

# Assume transactions made between 1:00 AM and 6:00 AM, online transactions,

# and high transaction amounts have a higher probability of being fraudulent

is_fraudulent = (((transaction_time >= 1) & (transaction_time <= 6)) |

                 (transaction_type == 'Online') |

                 (transaction_amount > 70)).astype(int)
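
# Because the label is a deterministic OR of these three rules, each

# feature on its own carries part of the signal; the single-variable

# models below measure how much.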

 

# Create DataFrame

df = pd.DataFrame({

    'TransactionAmount': transaction_amount,

    'TransactionTime': transaction_time,

    'TransactionType': transaction_type,

    'IsFraudulent': is_fraudulent

})

 

# One-hot encode the 'TransactionType' feature

df = pd.get_dummies(df, columns=['TransactionType'])
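
# get_dummies replaces 'TransactionType' with indicator columns

# 'TransactionType_Online' and 'TransactionType_In-person'; the type-only

# dataset below is built by dropping the other columns.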

 


 

# Separate datasets based on each feature for logistic regression

datasets = [('Transaction Amount', df[['TransactionAmount']]),

            ('Transaction Time', df[['TransactionTime']]),

            ('Transaction Type', df.drop(['TransactionAmount', 'TransactionTime', 'IsFraudulent'], axis=1))]

 

# Perform logistic regression for each feature

for feature_name, X_feature in datasets:

    X_train, X_test, y_train, y_test = train_test_split(X_feature, df['IsFraudulent'], test_size=0.2, random_state=42)

    model = LogisticRegression(solver='liblinear')

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    print(f"Accuracy based on {feature_name} only : {accuracy:.2f}")

    print()

 


 

# Logistic Regression

# Multiple variables

 

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

 

# Generate synthetic dataset with multiple features

np.random.seed(42)

n_samples = 1000

 

# Generate features: transaction amount, transaction time, and transaction type

transaction_amount = np.random.normal(loc=50, scale=20, size=n_samples)

transaction_time = np.random.uniform(low=0, high=24, size=n_samples)  # Transaction time in hours

transaction_type = np.random.choice(['Online', 'In-person'], size=n_samples)

 

# Generate target variable: is_fraudulent

# Assume transactions made between 1:00 AM and 6:00 AM, online transactions,

# and high transaction amounts have a higher probability of being fraudulent

is_fraudulent = (((transaction_time >= 1) & (transaction_time <= 6)) |

                 (transaction_type == 'Online') |

                 (transaction_amount > 70)).astype(int)

 

# Create DataFrame

df = pd.DataFrame({

    'TransactionAmount': transaction_amount,

    'TransactionTime': transaction_time,

    'TransactionType': transaction_type,

    'IsFraudulent': is_fraudulent

})

 

# One-hot encode the 'TransactionType' feature

df = pd.get_dummies(df, columns=['TransactionType'])

 

# Split the data into training and testing sets

X = df.drop('IsFraudulent', axis=1)

y = df['IsFraudulent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

# Fit logistic regression model

model = LogisticRegression(solver='liblinear')

model.fit(X_train, y_train)

 

# Predict the classes for the test set

y_pred = model.predict(X_test)

 

# Calculate accuracy

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy involving all three variables : {accuracy:.2f}")

 

 

 

 

Result: Thus the programs to demonstrate Regression models were written and executed.



Sample Output:

Linear Regression, Bayesian Linear Regression and Polynomial Regression:

 

Linear Regression:

  Train MSE: 22.40

  Test MSE: 25.00

 

Bayesian Linear Regression:

  Train MSE: 23.09

  Test MSE: 25.30

 

Polynomial Regression (degree=2):

  Train MSE: 6.53

  Test MSE: 16.46










 

# Logistic Regression

# Single variable

Accuracy based on Transaction Amount only : 0.67

Accuracy based on Transaction Time only : 0.69

Accuracy based on Transaction Type only : 0.86

 

# Logistic Regression

# Multiple variables

Accuracy involving all three variables : 0.91