Naïve Bayes Models

 

Expt No: 3                                           Naïve Bayes Models

Date:

 

Aim: To write a program to demonstrate Naïve Bayes models using the Iris dataset.

 

Program

import pandas as pd

 

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
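
# Each variant targets a different feature type: GaussianNB assumes

# continuous, roughly Gaussian features; MultinomialNB and ComplementNB

# expect non-negative counts (Complement is geared to imbalanced classes);

# BernoulliNB expects binary features; CategoricalNB expects

# integer-encoded categories.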

from sklearn.metrics import accuracy_score

 

# Load the Iris dataset from a CSV file

iris_df = pd.read_csv("Iris.csv")

iris_df.drop(columns=["Id"], inplace=True)

 

# Display the dataset characteristics

print("Iris Dataset Characteristics:")

print("Number of samples:", iris_df.shape[0])

print("Number of features:", iris_df.shape[1] - 1) 

print("Classes:", iris_df["Species"].unique())

 

# Summary statistics for each feature

summary_stats = iris_df.describe()

 

# Display summary statistics for each feature

print("Summary Statistics for each feature:")

print(summary_stats)

 

# Box plots for each feature grouped by the target variable "Species"

plt.figure(figsize=(12, 8))

for i, column in enumerate(iris_df.columns[:-1]):

    plt.subplot(2, 2, i+1)

    sns.boxplot(x="Species", y=column, data=iris_df)

    plt.title(f"Box plot - {column}")

    plt.xlabel("Species")

    plt.ylabel(column)

plt.suptitle("Box Plots of Features by Species")

plt.tight_layout()

plt.show()

 

# Class distribution

class_distribution = iris_df["Species"].value_counts()

 

print("\nClass Distribution:")

print(class_distribution)

 

# Naive Bayes models

# Separate the features (X) and the target variable (y)

X = iris_df.drop(columns=["Species"]).values

y = iris_df["Species"].values

 

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
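
# (Optionally, passing stratify=y to train_test_split preserves the

# balanced 50/50/50 class proportions in both splits.)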

 

# Gaussian Naive Bayes

gnb = GaussianNB()

gnb.fit(X_train, y_train)

gnb_pred = gnb.predict(X_test)

gnb_accuracy = accuracy_score(y_test, gnb_pred)

 

# Multinomial Naive Bayes

mnb = MultinomialNB()

mnb.fit(X_train, y_train)

mnb_pred = mnb.predict(X_test)

mnb_accuracy = accuracy_score(y_test, mnb_pred)

 

# Convert the continuous features to binary for Bernoulli Naive Bayes

binarization_threshold = 0.5

X_train_binary = (X_train > binarization_threshold).astype(int)

X_test_binary = (X_test > binarization_threshold).astype(int)
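
# Note: with a fixed threshold of 0.5, nearly every Iris measurement maps

# to 1 (only some petal widths fall below it), so the binary features carry

# little signal; this explains the low Bernoulli accuracy in the output.

# A per-feature median threshold is a common alternative (sketch, left

# commented out so the sample output stays reproducible; needs numpy as np):

# thresholds = np.median(X_train, axis=0)

# X_train_binary = (X_train > thresholds).astype(int)

# X_test_binary = (X_test > thresholds).astype(int)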

 

# Bernoulli Naive Bayes with adjusted binary features

bnb = BernoulliNB()

bnb.fit(X_train_binary, y_train)

bnb_pred = bnb.predict(X_test_binary)

bnb_accuracy = accuracy_score(y_test, bnb_pred)

 

# Complement Naive Bayes

cnb = ComplementNB()

cnb.fit(X_train, y_train)

cnb_pred = cnb.predict(X_test)

cnb_accuracy = accuracy_score(y_test, cnb_pred)

 

 

# Categorical Naive Bayes

catnb = CategoricalNB()

catnb.fit(X_train, y_train)

catnb_pred = catnb.predict(X_test)

catnb_accuracy = accuracy_score(y_test, catnb_pred)
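
# Note: CategoricalNB expects non-negative integer-encoded categories, so

# scikit-learn coerces these continuous values to integers here, which acts

# as a crude truncation-based binning. An explicit discretizer is cleaner

# (hedged sketch, commented out to keep the sample output unchanged):

# from sklearn.preprocessing import KBinsDiscretizer

# kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

# X_train_cat = kbd.fit_transform(X_train).astype(int)

# X_test_cat = kbd.transform(X_test).astype(int)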

 

# Accuracy of various Naïve Bayes models

print("Accuracy of various Naive Bayes models for Iris datatset. ")

print("Gaussian Naive Bayes:", format(gnb_accuracy, '.4f'))

print("Multinomial Naive Bayes:", format(mnb_accuracy, '.4f'))

print("Bernoulli Naive Bayes:", format(bnb_accuracy, '.4f'))

print("Complement Naive Bayes:", format(cnb_accuracy, '.4f'))

print("Categorical Naive Bayes:", format(catnb_accuracy, '.4f'))

 

 




Result: Thus the program to demonstrate Naïve Bayes models was written and executed.


Sample Output:

 

Iris Dataset Characteristics:

Number of samples: 150

Number of features: 4

Classes: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

 

Summary Statistics for each feature:

            SepalLength  SepalWidth  PetalLength  PetalWidth
count        150.000000  150.000000   150.000000  150.000000
mean           5.843333    3.054000     3.758667    1.198667
std            0.828066    0.433594     1.764420    0.763161
min            4.300000    2.000000     1.000000    0.100000
25%            5.100000    2.800000     1.600000    0.300000
50%            5.800000    3.000000     4.350000    1.300000
75%            6.400000    3.300000     5.100000    1.800000
max            7.900000    4.400000     6.900000    2.500000

 


 

 

Class Distribution:

Iris-versicolor    50

Iris-virginica     50

Iris-setosa        50

Name: Species, dtype: int64

 

Accuracy of various Naive Bayes models for the Iris dataset:

Gaussian Naive Bayes: 1.0000

Multinomial Naive Bayes: 0.9000

Bernoulli Naive Bayes: 0.6333

Complement Naive Bayes: 0.7000

Categorical Naive Bayes: 0.9667



 


Regression models

 

Expt No: 5                                           Regression models

Date:

 

Aim: To write a program to demonstrate various Regression models.

 

Program

 

# Linear Regression, Bayesian Linear Regression and Polynomial Regression

 

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, BayesianRidge

from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import mean_squared_error

from sklearn.impute import SimpleImputer

import numpy as np

import matplotlib.pyplot as plt

 

# Load the dataset

df = pd.read_csv('HousingData.csv')

 

# Assume 'MEDV' as the dependent variable and the rest as independent variables

X = df.drop('MEDV', axis=1)

y = df['MEDV']

 

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

# Handle missing values using simple imputation with mean

imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

X_test_imputed = imputer.transform(X_test)
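
# Note: the imputer is fitted on the training split only and its means are

# reused on the test split, which avoids leaking test-set statistics.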

 

# Linear Regression

lin_reg = LinearRegression()

lin_reg.fit(X_train_imputed, y_train)

lin_reg_train_pred = lin_reg.predict(X_train_imputed)

lin_reg_test_pred = lin_reg.predict(X_test_imputed)

 


 

# Bayesian Linear Regression

bayesian_reg = BayesianRidge()

bayesian_reg.fit(X_train_imputed, y_train)

bayesian_reg_train_pred = bayesian_reg.predict(X_train_imputed)

bayesian_reg_test_pred = bayesian_reg.predict(X_test_imputed)

 

# Polynomial Regression (degree=2)

poly_reg = PolynomialFeatures(degree=2)

X_train_poly = poly_reg.fit_transform(X_train_imputed)

X_test_poly = poly_reg.transform(X_test_imputed)

poly_lin_reg = LinearRegression()

poly_lin_reg.fit(X_train_poly, y_train)

poly_lin_reg_train_pred = poly_lin_reg.predict(X_train_poly)

poly_lin_reg_test_pred = poly_lin_reg.predict(X_test_poly)
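
# Note: the degree-2 expansion grows the feature set combinatorially

# (with the usual 13 predictors of this housing dataset, 105 columns), so

# some overfitting is expected; compare the train and test MSE below.

# A ridge-regularized fit is one common remedy (hedged sketch, commented out):

# from sklearn.linear_model import Ridge

# ridge_poly = Ridge(alpha=1.0).fit(X_train_poly, y_train)

# ridge_test_mse = mean_squared_error(y_test, ridge_poly.predict(X_test_poly))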

 

# Calculate mean squared error

lin_reg_train_mse = mean_squared_error(y_train, lin_reg_train_pred)

lin_reg_test_mse = mean_squared_error(y_test, lin_reg_test_pred)

bayesian_reg_train_mse = mean_squared_error(y_train, bayesian_reg_train_pred)

bayesian_reg_test_mse = mean_squared_error(y_test, bayesian_reg_test_pred)

poly_lin_reg_train_mse = mean_squared_error(y_train, poly_lin_reg_train_pred)

poly_lin_reg_test_mse = mean_squared_error(y_test, poly_lin_reg_test_pred)

 

print("Linear Regression:")

print(f"  Train MSE: {lin_reg_train_mse:.2f}")

print(f"  Test MSE: {lin_reg_test_mse:.2f}")

 

print("Bayesian Linear Regression:")

print(f"  Train MSE: {bayesian_reg_train_mse:.2f}")

print(f"  Test MSE: {bayesian_reg_test_mse:.2f}")

 

print("Polynomial Regression (degree=2):")

print(f"  Train MSE: {poly_lin_reg_train_mse:.2f}")

print(f"  Test MSE: {poly_lin_reg_test_mse:.2f}")

 


 

# Plot actual vs predicted prices

plt.figure(figsize=(12, 6))

plt.scatter(y_test, lin_reg_test_pred, color='blue', label='Linear Regression')

plt.scatter(y_test, bayesian_reg_test_pred, color='green', label='Bayesian Linear Regression')

plt.scatter(y_test, poly_lin_reg_test_pred, color='red', label='Polynomial Regression (degree=2)')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices (Regression)')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices with the fitted line for linear regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, lin_reg_test_pred, color='blue', label='Linear Regression')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Linear Regression')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices with the fitted line for polynomial regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, poly_lin_reg_test_pred, color='red', label='Polynomial Regression (degree=2)')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Polynomial Regression')

plt.legend()

plt.show()

 

# Plot actual vs predicted prices for Bayesian Linear Regression

plt.figure(figsize=(12, 6))

plt.scatter(y_test, bayesian_reg_test_pred, color='green', label='Bayesian Linear Regression')

plt.xlabel('Actual Price')

plt.ylabel('Predicted Price')

plt.title('Actual vs Predicted Prices for Bayesian Linear Regression')

plt.legend()

plt.show()



# Logistic Regression

# Single variable

 

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

 

# Generate synthetic dataset with multiple features

np.random.seed(42)

n_samples = 1000

 

# Generate features: transaction amount, transaction time, and transaction type

transaction_amount = np.random.normal(loc=50, scale=20, size=n_samples)

transaction_time = np.random.uniform(low=0, high=24, size=n_samples)  # Transaction time in hours

transaction_type = np.random.choice(['Online', 'In-person'], size=n_samples)

 

# Generate target variable: is_fraudulent

# Assume transactions made between 1:00 AM and 6:00 AM, online transactions,

# and high transaction amounts have a higher probability of being fraudulent

is_fraudulent = (((transaction_time >= 1) & (transaction_time <= 6)) |

                 (transaction_type == 'Online') |

                 (transaction_amount > 70)).astype(int)
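
# Because the label is a deterministic OR of these three rules, each

# feature on its own carries part of the signal; the single-variable

# models below measure how much.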

 

# Create DataFrame

df = pd.DataFrame({

    'TransactionAmount': transaction_amount,

    'TransactionTime': transaction_time,

    'TransactionType': transaction_type,

    'IsFraudulent': is_fraudulent

})

 

# One-hot encode the 'TransactionType' feature

df = pd.get_dummies(df, columns=['TransactionType'])
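
# get_dummies replaces 'TransactionType' with indicator columns

# 'TransactionType_Online' and 'TransactionType_In-person'; the type-only

# dataset below is built by dropping the other columns.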

 


 

# Separate datasets based on each feature for logistic regression

datasets = [('Transaction Amount', df[['TransactionAmount']]),

            ('Transaction Time', df[['TransactionTime']]),

            ('Transaction Type', df.drop(['TransactionAmount', 'TransactionTime', 'IsFraudulent'], axis=1))]

 

# Perform logistic regression for each feature

for feature_name, X_feature in datasets:

    X_train, X_test, y_train, y_test = train_test_split(X_feature, df['IsFraudulent'], test_size=0.2, random_state=42)

    model = LogisticRegression(solver='liblinear')

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    print(f"Accuracy based on {feature_name} only : {accuracy:.2f}")

    print()

 


 

# Logistic Regression

# Multiple variables

 

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

 

# Generate synthetic dataset with multiple features

np.random.seed(42)

n_samples = 1000

 

# Generate features: transaction amount, transaction time, and transaction type

transaction_amount = np.random.normal(loc=50, scale=20, size=n_samples)

transaction_time = np.random.uniform(low=0, high=24, size=n_samples)  # Transaction time in hours

transaction_type = np.random.choice(['Online', 'In-person'], size=n_samples)

 

# Generate target variable: is_fraudulent

# Assume transactions made between 1:00 AM and 6:00 AM, online transactions,

# and high transaction amounts have a higher probability of being fraudulent

is_fraudulent = (((transaction_time >= 1) & (transaction_time <= 6)) |

                 (transaction_type == 'Online') |

                 (transaction_amount > 70)).astype(int)

 

# Create DataFrame

df = pd.DataFrame({

    'TransactionAmount': transaction_amount,

    'TransactionTime': transaction_time,

    'TransactionType': transaction_type,

    'IsFraudulent': is_fraudulent

})

 

# One-hot encode the 'TransactionType' feature

df = pd.get_dummies(df, columns=['TransactionType'])

 

# Split the data into training and testing sets

X = df.drop('IsFraudulent', axis=1)

y = df['IsFraudulent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

# Fit logistic regression model

model = LogisticRegression(solver='liblinear')

model.fit(X_train, y_train)

 

# Predict the classes for the test set

y_pred = model.predict(X_test)

 

# Calculate accuracy

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy involving all three variables : {accuracy:.2f}")

 

 

 

 

Result: Thus the programs to demonstrate Regression models were written and executed.



Sample Output:

Linear Regression, Bayesian Linear Regression and Polynomial Regression:

 

Linear Regression:

  Train MSE: 22.40

  Test MSE: 25.00

 

Bayesian Linear Regression:

  Train MSE: 23.09

  Test MSE: 25.30

 

Polynomial Regression (degree=2):

  Train MSE: 6.53

  Test MSE: 16.46










 

# Logistic Regression

# Single variable

Accuracy based on Transaction Amount only : 0.67

Accuracy based on Transaction Time only : 0.69

Accuracy based on Transaction Type only : 0.86

 

# Logistic Regression

# Multiple variables

Accuracy involving all three variables : 0.91