KBC Academy

Topics in Advanced Machine Learning

About the course

- Experience sharing
- discussions are welcome
- conceptual knowledge and practical skills
- not an overly academic style
- designing and applying techniques for modeling complex systems
- significant body of knowledge - most of it is for self-preparation
- focus on specific topics of interest, proposed by the KBC institution
- adaptive to the needs of the audience

- platform-agnostic - understanding the concepts and not on particular software implementations
- Flipped classroom - the learners are expected to be:
 * preparing for lectures in advance (out-of-class activity);
 * participating actively in the discussions during the lectures (in-class activity);
 * Phase 3 is optional time for practicing the new techniques (4 weeks) - self-learning.

Feature engineering for structured data

Angel Marchev, Jr.

Special guest: Alexander Efremov

Agenda

  1. Workflow review
  2. Preliminary remarks
  3. Feature engineering in Panel data
  4. Feature engineering in Time series data
  5. Variable reduction
  6. Regularization
  7. Factor analysis

Workflow 1/3

Workflow 2/3

Workflow 3/3

Preliminary remarks

  • Data uncertainty

    • a larger sample size can help (central limit theorem; a small simulation follows below)
    • but it cannot eliminate the uncertainty completely
    • feature engineering (along with other data prep) can also introduce additional sources of error and bias
    • feature engineering should be "smarter" than plain data mining

Preliminary remarks

  • With panel data, if possible, unpivot the datetime columns into a single time index
    • 1NF normalization
    • makes the full power of time-series analysis available
    • a pandas equivalent of the SQL unpivot below is sketched after this example

SELECT id, date, value
FROM my_table
UNPIVOT (value FOR date IN (' + @cols + ')) AS unpivoted

my_table (wide format):
id  date1       date2       date3
1   2022-01-01  2022-02-01  2022-03-01
2   2022-04-01  2022-05-01  2022-06-01

        ↓ unpivot

unpivoted (long format):
id  date   value
1   date1  2022-01-01
1   date2  2022-02-01
1   date3  2022-03-01
2   date1  2022-04-01
2   date2  2022-05-01
2   date3  2022-06-01
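A minimal pandas sketch of the same unpivot (the wide dataframe below is a stand-in mirroring my_table above; pd.melt plays the role of the SQL UNPIVOT):
In [ ]:
import pandas as pd

# stand-in for the wide my_table above
my_table = pd.DataFrame({'id': [1, 2],
                         'date1': ['2022-01-01', '2022-04-01'],
                         'date2': ['2022-02-01', '2022-05-01'],
                         'date3': ['2022-03-01', '2022-06-01']})

# unpivot the date columns into a single (id, date, value) long format
unpivoted = my_table.melt(id_vars='id', var_name='date', value_name='value')
print(unpivoted.sort_values(['id', 'date']).reset_index(drop=True))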

Feature engineering in Panel data

Process of selecting and transforming raw data into features that can be used to train machine learning models.

Most common operations

  • outliers detection & manipulation
  • reduce low varying features
  • derive domain/task specific features
  • re-scaling
  • non-linear transformations
  • encoding
  • binning

Outliers detection & manipulation

  • Z-score
    • assumes normal distribution
    • may not work well if the distribution deviates significantly from normal (e.g., when the outliers distort the mean and standard deviation)
In [164]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# generate some random data
np.random.seed(42)
data = np.random.normal(0, 1, 100)

# introduce 15 outliers with value > 3
num_outliers = 15
outlier_indices = np.random.choice(100, num_outliers, replace=False)
outliers_gen = np.random.normal(0, 1, num_outliers) + np.random.normal(3, 1, num_outliers)
data[outlier_indices] = outliers_gen

# convert data to a Pandas dataframe
df = pd.DataFrame(data, columns=['Value'])
In [165]:
# detect outliers using z-score
z_scores = (df - df.mean()) / df.std()
threshold = 3
outliers = np.abs(z_scores) > threshold
detected_out=df[outliers]

# remove outliers from the data
clean_data = df[~outliers]

# replace outliers with mean value
mean_value = df.mean()
df[outliers] = np.tile(mean_value.values, (len(df[outliers].index), 1))
In [166]:
fig, ax = plt.subplots()
# plot histogram of clean data in blue
ax.hist(clean_data, bins=50, color='blue', alpha=0.5, label='Clean data')
# plot histogram of detected outliers in red
ax.hist(detected_out, bins=50, color='red', alpha=0.5, label='Detected outliers')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of Data with Outliers Detected by Z-score')
ax.legend()
plt.show()

Outliers detection & manipulation

  • Interquartile range (IQR)
    • more robust to non-normal distributions
    • defined as the difference between the 75th and 25th percentiles of the data
    • IQR is less sensitive to extreme values than the standard deviation
In [167]:
%reset -f
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# generate some random data
np.random.seed(42)
data = np.random.normal(0, 0.5, 100)

# introduce 15 outliers with value > |2|
num_outliers = 15
outlier_indices = np.random.choice(100, num_outliers, replace=False)
outliers_gen = np.random.normal(1.5, 0.5, num_outliers)
for i in range(num_outliers//2):
    outliers_gen[i] *= -1
data[outlier_indices] = outliers_gen

# convert data to a Pandas dataframe
df = pd.DataFrame(data, columns=['Value'])
In [168]:
# detect outliers using IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
outliers = (df['Value']<Q1 - 1.5 * IQR) | (df['Value']>Q3 + 1.5 * IQR)
detected_out=df[outliers]

# remove outliers from the data
clean_data = df[~outliers]

# replace outliers with mean value + noise
mean_value = df.mean()
noise = np.random.normal(0, 0.1, len(df[outliers]))
mean_value_with_noise = noise + mean_value.item()

# tile mean value with noise to replace outliers
df[outliers] = pd.DataFrame(mean_value_with_noise, columns=['Value'])
In [169]:
# plot histogram with 50 bins
plt.hist(df['Value'], bins=50, alpha=0.5)
# plot detected outliers in red
plt.hist(detected_out['Value'], bins=50, color='red', alpha=0.5)
# set plot labels and legend
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend(['Clean Data', 'Detected Outliers'])
# show plot
plt.show()

Reduce low varying features

  • features with low variance
  • do not contribute much to the prediction (can even degrade the performance)
  • identify them by calculating the variance (or coefficient of variation) of each feature and removing the features below a chosen threshold
In [170]:
import pandas as pd
import numpy as np
# create sample dataframe
data = {'A': [1, 2, 3, 4, 5],
        'B': [1, 1, 1, 1, 1],
        'C': [1, 1, 2, 1, 1],
        'D': [1, 2, 3, 2, 1],
        'E': [1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# calculate coefficient of variation for each column
cv = df.std() / df.mean()
# define threshold for coefficient of variation
threshold = 0.1

# identify low varying variables
low_var = cv < threshold
# remove low varying variables from the dataframe
clean_data = df.drop(columns=df.columns[low_var])
In [171]:
# print the results
print('Original data:\n', df)
print('Cleaned data:\n', clean_data)
Original data:
    A  B  C  D  E
0  1  1  1  1  1
1  2  1  1  2  1
2  3  1  2  3  1
3  4  1  1  2  1
4  5  1  1  1  1
Cleaned data:
    A  C  D
0  1  1  1
1  2  1  2
2  3  2  3
3  4  1  2
4  5  1  1

Derive domain/task specific features

  • creating features that are specific to the problem domain
  • need for domain knowledge
In [242]:
import pandas as pd

# create sample dataframe
data = {'Company': ['A', 'B', 'C', 'D'],
        'Revenue': [100000, 50000, 75000, 125000],
        'COGS': [70000, 25000, 50000, 80000],
        'EBITDA': [30000, 25000, 25000, 45000]}
df = pd.DataFrame(data)

# calculate financial ratios
df['Gross Profit'] = df['Revenue'] - df['COGS']
df['Gross Margin'] = df['Gross Profit'] / df['Revenue']
df['EBITDA Margin']= df['EBITDA'] / df['Revenue']
df['Net Income'] = df['EBITDA'] - (df['COGS'] - (df['COGS'] / df['Revenue']) * df['EBITDA'])
df['Net Margin'] = df['Net Income'] / df['Revenue']
df['ROE'] = df['Net Income'] / (df['COGS'] + df['EBITDA'])
df['ROA'] = df['Net Income'] / df['Revenue']

df[['Gross Margin','EBITDA Margin', 'Net Income','Net Margin','ROE', 'ROA']] = df[['Gross Margin','EBITDA Margin', 'Net Income','Net Margin','ROE', 'ROA']].round(2)
In [243]:
# print the results
df
Out[243]:
Company Revenue COGS EBITDA Gross Profit Gross Margin EBITDA Margin Net Income Net Margin ROE ROA
0 A 100000 70000 30000 30000 0.30 0.30 -19000.00 -0.19 -0.19 -0.19
1 B 50000 25000 25000 25000 0.50 0.50 12500.00 0.25 0.25 0.25
2 C 75000 50000 25000 25000 0.33 0.33 -8333.33 -0.11 -0.11 -0.11
3 D 125000 80000 45000 45000 0.36 0.36 -6200.00 -0.05 -0.05 -0.05

Re-scaling

  • Scaling
    • bring all variables to a similar scale, which can improve the performance of certain algorithms.
    • helps to reduce the effect of outliers
    • can improve the interpretability of the data
    • Could mean several different operations
  • Normalization

    • scaling the data so that all values are between 0 and 1.
    • useful for datasets with varying ranges
    • when it is important to maintain the relative differences between the values.
  • Standardization

    • scaling the data so that the mean is 0 and the standard deviation is 1
    • useful for datasets with normally distributed variables
    • when it is important to maintain the shape of the distribution.
  • Centering

    • shifting the data so that the mean is 0
    • useful for datasets where the absolute value of the data is not important, but the relative differences between the values are (a minimal sketch follows below)
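Centering has no dedicated cell below, so here is a minimal sketch (the small dataframe is an assumed example, separate from the one used in the scaling cells):
In [ ]:
import pandas as pd

# assumed example data for the centering sketch
df_c = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                     'B': [100, 200, 300, 400, 500]})

# subtract each column's mean so the centered features have mean 0 (the scale is unchanged)
centered = df_c - df_c.mean()
print(centered)
print(centered.mean())  # approximately 0 for every column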
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# create sample dataframe
data = {'A': [1, 2, 3, 4, 5],
        'B': [100, 200, 300, 400, 500],
        'C': [1.0, 2.0, 2.5, 3.0, 4.0]}
df = pd.DataFrame(data)
In [14]:
# apply different scaling techniques
scalers = {'Min-Max': MinMaxScaler(),
           'Standard': StandardScaler()}
In [15]:
fig, axs = plt.subplots(1, len(scalers)*2, figsize=(12, 3))

for i, (scaler_name, scaler) in enumerate(scalers.items()):
    # fit and transform the data using the scaler
    scaled_data = scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

    # plot the original data
    axs[i*2].scatter(df['A'], df['B'])
    axs[i*2].set_title('Original data')

    # plot the scaled data
    axs[i*2+1].scatter(scaled_df['A'], scaled_df['B'])
    axs[i*2+1].set_title(scaler_name + ' scaling')
In [179]:
# apply normalization within range
range_scaler = lambda x: (x - x.min()) / (x.max() - x.min())
scaled_data = df.apply(range_scaler)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

# plot the results
fig, axs = plt.subplots(1, 2, figsize=(6, 3))
axs[0].scatter(df['A'], df['C'])
axs[0].set_title('Original data')
axs[1].scatter(scaled_df['A'], scaled_df['C'])
axs[1].set_title('Normalization within range')
plt.show()

Non-linear transformations

- **transforms based on linearity in the parameters** (a fitting sketch via linearizing substitution follows below)

  • linear (lin): $$Y=a_0+a_1.X$$
  • logarithmic (lgn): $$Y=a_0+a_1.\ln{X}$$ with substitution $$Z=\ln{X},\quad Y=a_0+a_1.Z$$
  • exponential (xpy): $$Y=e^{a_0+a_1.X}$$ with substitution $$Q=\ln{Y},\quad Q=a_0+a_1.X$$
  • power (pow): $$Y=a_0.X^{a_1}$$ with substitutions $$Z=\ln{X},\quad Q=\ln{Y},\quad Q=a_0+a_1.Z$$
  • reciprocal (rex): $$Y=a_0+\frac{a_1}{X}$$ with substitution $$Z=\frac{1}{X},\ X\neq{0},\quad Y=a_0+a_1.Z$$
  • reverse reciprocal (rey): $$Y=\frac{1}{a_0+a_1.X}$$ with substitution $$Q=\frac{1}{Y},\ Y\neq{0},\quad Q=a_0+a_1.X$$
  • quadratic sum (sqr): $$Y={(a_0+a_1.X)^2}$$ with substitution $$Q=\sqrt{Y},\quad Q=|a_0+a_1.X|$$
  • sine (snx): $$Y=a_0+a_1.\sin{X}$$ with substitution $$Z=\sin{X},\quad Y=a_0+a_1.Z$$
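A hedged sketch of fitting via linearizing substitution, using the power model $$Y=a_0.X^{a_1}$$ from the list above (the synthetic data and the use of np.polyfit are assumptions of this example, not part of the original cells):
In [ ]:
import numpy as np

# synthetic data roughly following Y = 2 * X^1.5 with multiplicative noise
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 50)
Y = 2.0 * X**1.5 * np.exp(rng.normal(0, 0.05, size=X.size))

# linearize: Q = ln Y = ln a0 + a1 * ln X, i.e. a straight line in Z = ln X
Z, Q = np.log(X), np.log(Y)
a1, ln_a0 = np.polyfit(Z, Q, 1)  # slope and intercept of the linearized model

print("a0 ≈", np.exp(ln_a0), "  a1 ≈", a1)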
In [180]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate a random dataset of 20 floats
np.random.seed(42)
data = pd.DataFrame(np.random.rand(20)*10, columns=['Feature'])
In [181]:
# Calculate the natural logarithm of the feature
data['Log_Feature'] = np.log(data['Feature'])

# Calculate the reciprocal of the feature
data['Reciprocal_Feature'] = 1 / data['Feature']

# Calculate the square root of the feature
data['Sqrt_Feature'] = np.sqrt(data['Feature'])

# Calculate the sine of the feature
data['Sin_Feature'] = np.sin(data['Feature'])
In [182]:
# Plot all features as lines over the index
fig, ax = plt.subplots()
ax.plot(data.index, data['Feature'], label='Feature')
ax.plot(data.index, data['Log_Feature'], label='Log_Feature')
ax.plot(data.index, data['Reciprocal_Feature'], label='Reciprocal_Feature')
ax.plot(data.index, data['Sqrt_Feature'], label='Sqrt_Feature')
ax.plot(data.index, data['Sin_Feature'], label='Sin_Feature')
ax.legend()
plt.show()

Encoding

Convert categorical features into numerical features

  • One-hot encoding
    • creates a new column for each category of the categorical variable, setting the value to 1 if the category is present and 0 otherwise
    • could be modified for a more economical representation (e.g., dropping one redundant category per variable; see the sketch after this example)
In [261]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# create sample dataframe
data = {'color': ['red', 'blue', 'green', 'red', 'green', 'blue'],
        'size': ['small', 'medium', 'small', 'large', 'medium', 'large']}
df = pd.DataFrame(data)
In [262]:
# one-hot encode the categorical variables
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df[['color', 'size']])
onehot_encoded = onehot_encoded.astype(int)  # convert to integers
onehot_df = pd.DataFrame(onehot_encoded.toarray(), columns=onehot_encoder.get_feature_names_out(['color', 'size']))

# concatenate the one-hot encoded data to the original dataframe
df = pd.concat([df, onehot_df], axis=1)
In [263]:
# drop the original categorical variables
df.drop(['color', 'size'], axis=1, inplace=True)
# drop redundant categories
# df.drop(['color_red', 'size_small'], axis=1, inplace=True)
df
Out[263]:
color_blue color_green color_red size_large size_medium size_small
0 0 0 1 0 0 1
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 0 1 1 0 0
4 0 1 0 0 1 0
5 1 0 0 1 0 0
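For the more economical representation mentioned above, OneHotEncoder accepts drop='first', which removes one (redundant) category per variable; a minimal sketch reusing the same sample data:
In [ ]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'color': ['red', 'blue', 'green', 'red', 'green', 'blue'],
        'size': ['small', 'medium', 'small', 'large', 'medium', 'large']}
df = pd.DataFrame(data)

# drop the first category of each variable to avoid perfectly collinear columns
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(df[['color', 'size']]).toarray().astype(int)
pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['color', 'size']))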

Encoding

  • Encoding with dependent mean
    • can be useful in situations where there is a clear relationship between the categorical variable and the target variable
    • it can introduce bias in the data if the relationship between the categorical variable and the target variable is not well understood.
In [264]:
import pandas as pd

# create sample dataframe
data = {'color': ['red', 'blue', 'green', 'red', 'green', 'blue'],
        'size': ['small', 'medium', 'small', 'large', 'medium', 'large'],
        'price': [10, 20, 15, 25, 18, 22]}
df = pd.DataFrame(data)
In [265]:
# calculate mean price for each color
color_mean = df.groupby('color')['price'].mean()

# create new column with encoded values
df['color_mean'] = df['color'].map(color_mean)

# drop the original categorical variable
df.drop(['color'], axis=1, inplace=True)

print(df)
     size  price  color_mean
0   small     10        17.5
1  medium     20        21.0
2   small     15        16.5
3   large     25        17.5
4  medium     18        16.5
5   large     22        21.0

Encoding

  • Weight of Evidence
    • can be useful in situations where there is a non-linear relationship between the categorical variable and the target variable
    • it can introduce bias in the data if the relationship between the categorical variable and the target variable is not well understood.
In [187]:
import pandas as pd
import numpy as np

# create sample dataframe
data = {'color': ['red', 'blue', 'green', 'red', 'green', 'blue'],
        'size': ['small', 'medium', 'small', 'large', 'medium', 'large'],
        'target': [10, 20, 31, 22, 11, 5]}
df = pd.DataFrame(data)
In [ ]:
# calculate total counts and target counts for each category
total_counts = df.groupby('color')['target'].count()
target_counts = df.groupby('color')['target'].sum()
# calculate percentages and WoE values for each category
total_perc = total_counts / total_counts.sum()
target_perc = target_counts / target_counts.sum()
nontarget_perc = (total_counts - target_counts) / (total_counts.sum() - target_counts.sum())
woe_values = np.log(target_perc / nontarget_perc)
In [ ]:
# create new column with WoE values
df['color_woe'] = df['color'].map(woe_values)

# drop the original categorical variable
df.drop(['color'], axis=1, inplace=True)
In [188]:
print(df)
     size  target  color_woe
0   small      10   0.002018
1  medium      20   0.020861
2   small      31  -0.013730
3   large      22   0.002018
4  medium      11  -0.013730
5   large       5   0.020861

Binning

Convert numerical/categorical features into categorical features

  • Unsupervised

    • useful in situations where the data does not have a clear relationship with the target variable
    • can lead to biased results if the bin sizes are not chosen carefully
  • Supervised

    • useful when there is a clear relationship between the numerical variable and the target variable
    • requires labeled data to train the classifier
In [238]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# create sample dataframe with 100 observations
np.random.seed(40)
age = np.random.normal(loc=50, scale=15, size=100)
income = np.random.normal(loc=50000, scale=10000, size=100)
education = np.random.randint(1,4, size=100)
gender = np.random.randint(1,3, size=100)
df = pd.DataFrame({'age': age.astype(int),
                   'income': income.astype(int),
                   'education': education,
                   'gender': gender
                  })
In [237]:
# unsupervised binning using equal width intervals
num_bins = 3
bin_labels = ['low', 'medium', 'high']
df['age_bins_unsupervised'] = pd.cut(df['age'], num_bins, labels=bin_labels)

# supervised binning using decision tree
tree = DecisionTreeClassifier(max_depth=1)
y = df['income']
X = df[df.columns.difference(['income','age_bins_unsupervised'])]
tree.fit(X, y)
df['age_bins_supervised'] = tree.predict(X)

ct = pd.crosstab(df['age_bins_supervised'], df['age_bins_unsupervised'])
ct
Out[237]:
age_bins_unsupervised low medium high
age_bins_supervised
28800 14 50 0
53039 0 8 28

Feature engineering in Time series

Most common operations

  • lags
  • rolling window statistics
  • datetime
  • outliers low frequency filter
  • Harmonic decomposition

Deriving lagged variables

Variables with a time delay relative to the others, i.e., a variable shifted back in time.

  • used in time series analysis to model the relationships between variables over time
  • used to analyze the relationship between a variable and its past values

Methods

  • shift function in pandas

  • Hankel matrix - a strongly recommended, universal method

In [ ]:
import pandas as pd

# create a sample dataframe
data = {'value': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
In [200]:
# create a lagged variable with a time shift of 1 day
df['lagged'] = df['value'].shift(1)

print(df)
   value  lagged
0      1     NaN
1      2     1.0
2      3     2.0
3      4     3.0
4      5     4.0
In [268]:
import numpy as np

# Generate random time series data with 20 observations
data = np.random.rand(20)

# Define the maximum lag we want to include in our lagged features
max_lag = 5

# Create a Hankel matrix with lagged features
hankel_matrix = np.zeros((len(data), max_lag+1))

for i in range(max_lag+1):
    hankel_matrix[i:len(data), i] = data[0:len(data)-i]
hankel_matrix = hankel_matrix.round(3)
    
In [269]:
# Print the Hankel matrix
print(hankel_matrix)
[[0.74  0.    0.    0.    0.    0.   ]
 [0.497 0.74  0.    0.    0.    0.   ]
 [0.586 0.497 0.74  0.    0.    0.   ]
 [0.061 0.586 0.497 0.74  0.    0.   ]
 [0.617 0.061 0.586 0.497 0.74  0.   ]
 [0.657 0.617 0.061 0.586 0.497 0.74 ]
 [0.859 0.657 0.617 0.061 0.586 0.497]
 [0.569 0.859 0.657 0.617 0.061 0.586]
 [0.905 0.569 0.859 0.657 0.617 0.061]
 [0.834 0.905 0.569 0.859 0.657 0.617]
 [0.568 0.834 0.905 0.569 0.859 0.657]
 [0.847 0.568 0.834 0.905 0.569 0.859]
 [0.026 0.847 0.568 0.834 0.905 0.569]
 [0.818 0.026 0.847 0.568 0.834 0.905]
 [0.961 0.818 0.026 0.847 0.568 0.834]
 [0.207 0.961 0.818 0.026 0.847 0.568]
 [0.57  0.207 0.961 0.818 0.026 0.847]
 [0.954 0.57  0.207 0.961 0.818 0.026]
 [0.237 0.954 0.57  0.207 0.961 0.818]
 [0.474 0.237 0.954 0.57  0.207 0.961]]

Rolling window statistics

Sample windows

  • used in time series analysis to reduce the dimensionality of the data
  • capture relevant patterns over a specific time interval

Method

  • defining a fixed-length sample window
  • extract a set of features from each window
  • size of the sample window is an important hyperparameter
  • it should be chosen based on the characteristics of the time series data and the specific prediction problem at hand.
In [ ]:
import numpy as np
import pandas as pd

# Generate random time series data with 20 observations
data = np.random.rand(20)

# Convert the data to a Pandas Series
series = pd.Series(data)
In [271]:
# Define the window size for the rolling statistics
window_size = 3
# Calculate rolling mean, standard deviation, and maximum
rolling_mean = series.rolling(window_size).mean()
rolling_std = series.rolling(window_size).std()
rolling_max = series.rolling(window_size).max()
In [ ]:
# Print the rolling statistics
df=pd.DataFrame({"Original data": data,
                 "Rolling mean": rolling_mean.tolist(),
                 "Rolling standard deviation": rolling_std.tolist(),
                 "Rolling maximum": rolling_max.tolist()
                })
In [272]:
df
Out[272]:
Original data Rolling mean Rolling standard deviation Rolling maximum
0 0.076313 NaN NaN NaN
1 0.264040 NaN NaN NaN
2 0.675782 0.338712 0.306631 0.675782
3 0.068876 0.336233 0.309826 0.675782
4 0.806467 0.517042 0.393585 0.806467
5 0.705469 0.526937 0.399894 0.806467
6 0.756620 0.756185 0.050500 0.806467
7 0.018057 0.493382 0.412437 0.756620
8 0.089027 0.287901 0.407471 0.756620
9 0.579511 0.228865 0.305734 0.579511
10 0.527292 0.398610 0.269375 0.579511
11 0.970188 0.692330 0.242044 0.970188
12 0.485930 0.661137 0.268444 0.970188
13 0.957106 0.804408 0.275888 0.970188
14 0.128065 0.523700 0.415809 0.957106
15 0.372937 0.486036 0.425935 0.957106
16 0.881238 0.460746 0.384188 0.881238
17 0.867801 0.707325 0.289667 0.881238
18 0.971188 0.906742 0.056215 0.971188
19 0.874550 0.904513 0.057841 0.971188

Datetime index operations

Re-scaling

  • manipulating the index of DataFrame to a new scale of dates
In [25]:
import pandas as pd
import numpy as np

# create a DataFrame with a datetime index
date_rng = pd.date_range(start='1/1/2020', end='1/20/2020', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))

# change the frequency to weekly and take the mean of each group
df = df.set_index('date')
weekly_df = df.resample('W').mean()
weekly_df
Out[25]:
data
date
2020-01-05 59.600000
2020-01-12 70.857143
2020-01-19 42.857143
2020-01-26 95.000000

Datetime index operations

Re-framing

  • fill in the missing dates with some specified fill value.
In [ ]:
import pandas as pd
import numpy as np

# create a DataFrame with a datetime index and missing dates
date_rng = pd.date_range(start='1/1/2020', end='1/7/2020', freq='D')
date_rng = date_rng.drop(date_rng[3])
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
In [31]:
# fill in the missing dates with NaN values
df = df.set_index('date')
df_new = df.asfreq('D')
df_new
Out[31]:
data
date
2020-01-01 54.0
2020-01-02 67.0
2020-01-03 42.0
2020-01-04 NaN
2020-01-05 60.0
2020-01-06 22.0
2020-01-07 99.0

Datetime index operations

Extracting datetime features

  • breaking down the full datetime string into separate calendar/time features
In [ ]:
import numpy as np
import pandas as pd

# Generate random time series data with 20 observations
data = np.random.rand(20)

# Generate a DatetimeIndex with hourly frequency starting from 2022-01-01
date_range = pd.date_range(start='2022-01-01', periods=len(data), freq='H')
In [207]:
# Convert the data to a Pandas Series with DatetimeIndex
series = pd.Series(data, index=date_range)

# Extract calendar and time base features from the index
year = series.index.year
month = series.index.month
day = series.index.day
hour = series.index.hour
minute = series.index.minute
In [ ]:
# Print the calendar and time base features
df=pd.DataFrame({"Date":date_range,
                 "Data": data,
                 "Year": year,
                 "Month": month,
                 "Day": day,
                 "Hour": hour,
                 "Minute": minute
                })
In [208]:
df
Out[208]:
Date Data Year Month Day Hour Minute
0 2022-01-01 00:00:00 0.114295 2022 1 1 0 0
1 2022-01-01 01:00:00 0.499400 2022 1 1 1 0
2 2022-01-01 02:00:00 0.316746 2022 1 1 2 0
3 2022-01-01 03:00:00 0.901192 2022 1 1 3 0
4 2022-01-01 04:00:00 0.531030 2022 1 1 4 0
5 2022-01-01 05:00:00 0.792617 2022 1 1 5 0
6 2022-01-01 06:00:00 0.100412 2022 1 1 6 0
7 2022-01-01 07:00:00 0.187317 2022 1 1 7 0
8 2022-01-01 08:00:00 0.786790 2022 1 1 8 0
9 2022-01-01 09:00:00 0.497147 2022 1 1 9 0
10 2022-01-01 10:00:00 0.138009 2022 1 1 10 0
11 2022-01-01 11:00:00 0.681217 2022 1 1 11 0
12 2022-01-01 12:00:00 0.251439 2022 1 1 12 0
13 2022-01-01 13:00:00 0.518483 2022 1 1 13 0
14 2022-01-01 14:00:00 0.559895 2022 1 1 14 0
15 2022-01-01 15:00:00 0.542000 2022 1 1 15 0
16 2022-01-01 16:00:00 0.265538 2022 1 1 16 0
17 2022-01-01 17:00:00 0.068318 2022 1 1 17 0
18 2022-01-01 18:00:00 0.548399 2022 1 1 18 0
19 2022-01-01 19:00:00 0.909004 2022 1 1 19 0

Outliers low frequency filter

  • Similar to panel data case
  • but it could be implemented to be a streaming process
  • IQR
In [209]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate random time series data with 20 observations and 3 outliers
data = np.random.rand(20)
data[3] = -2
data[10] = 3.5
data[17] = 2.6
In [210]:
# Convert the data to a Pandas Series
series = pd.Series(data)

# Calculate the first and third quartiles
q1 = series.quantile(0.25)
q3 = series.quantile(0.75)

# Define the filter based on the interquartile range (IQR)
iqr = q3 - q1
filter = (series >= q1 - 1.5*iqr) & (series <= q3 + 1.5*iqr)

# Filter the data
filtered_data = series[filter]
In [211]:
# Plot the original data and the filtered data
fig, ax = plt.subplots()
ax.plot(series.index, series.values, label='Original data')
ax.plot(filtered_data.index, filtered_data.values, label='Filtered data')
ax.set_xlabel('Index')
ax.set_ylabel('Value')
ax.legend()
plt.show()

Harmonics decomposition

Extract seasonality from a time series by decomposing it into trend, seasonal, and residual components.

Fourier

  • Decompose a signal into its frequency components
  • based on the Fourier series
  • any periodic function can be represented as a sum of sine and cosine waves of different frequencies, phases, and amplitudes
  • the time series data is first transformed into the frequency domain using a Fourier transform
  • the amplitudes and phases of the component waves are then read off from the transform (or, equivalently, estimated by least-squares harmonic regression)
In [1]:
from IPython.display import IFrame

# define the URL of the webpage to embed
url = "https://www.jezzamon.com/fourier/"

# create an IFrame object with the specified URL and dimensions
iframe = IFrame(url, width=900, height=500)
In [2]:
display(iframe)
In [96]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate random time series data with 20 observations
data = np.random.rand(20)

# Convert the data to a Pandas Series
series = pd.Series(data)
In [97]:
# Calculate the Fourier coefficients for each harmonic separately
num_harmonics = 3
all_coeffs = np.fft.fft(series)
coeffs = []
for i in range(1, num_harmonics+1):
    coeffs.append(np.zeros(len(all_coeffs), dtype=complex))
    coeffs[-1][i] = all_coeffs[i]
    coeffs[-1][-i] = all_coeffs[-i]

# Reconstruct the signal using the first 3 harmonics
reconstructed_coeffs = np.zeros(len(all_coeffs), dtype=complex)
for i in range(num_harmonics):
    reconstructed_coeffs += coeffs[i]
reconstructed_signal = np.fft.ifft(reconstructed_coeffs).real
reconstructed_signal += series.mean()
In [98]:
# Plot the original signal and the reconstructed signals for each harmonic
fig, ax = plt.subplots()
ax.plot(series.index, series.values, label='Original signal')
for i in range(num_harmonics):
    smoothed_signal = pd.Series(np.fft.ifft(coeffs[i]).real, index=series.index).rolling(window=5, center=True).mean()
    ax.plot(series.index, smoothed_signal.values, label='Harmonic ' + str(i+1))
ax.plot(series.index, reconstructed_signal, label='Reconstructed signal', linewidth=2, linestyle='--')
ax.set_xlabel('Index')
ax.set_ylabel('Value')
ax.legend()
Out[98]:
<matplotlib.legend.Legend at 0x7f6036d384f0>

Harmonics decomposition

Seasonality analysis

  • uses the classical time series decomposition method based on moving averages
In [17]:
%reset -f

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate random time series data with 20 observations
data = np.random.rand(20)

# Convert the data to a Pandas Series
series = pd.Series(data)
In [35]:
# Perform the decomposition
decomposition = sm.tsa.seasonal_decompose(series, model='additive', period=4)

fig=decomposition.plot();
fig.set_size_inches((8, 3.5));
fig.tight_layout();

Variables selection/reduction

Selecting the most relevant features for the model, which can improve performance and reduce the risk of overfitting

X,X Correlation

  • Look for overlapping features
  • Reduction method
In [218]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools


# Set random seed for reproducibility
np.random.seed(123)

# Generate random data with 10 variables
data = np.random.rand(100, 10)

# Add three highly correlated variables
data[:, 2] = 0.8 * data[:, 0] + 0.2 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 4] = 0.6 * data[:, 0] + 0.4 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 6] = 0.4 * data[:, 0] + 0.6 * data[:, 1] + 0.1 * np.random.randn(100)

# Convert the data to a Pandas DataFrame
df = pd.DataFrame(data, columns=['Var'+str(i+1) for i in range(10)])
In [219]:
def hyperoptimize(corr_matrix, exclude_features):
    n_features = corr_matrix.shape[0]
    min_high_corr = float('inf')
    best_n_reduce = 0
    best_exclude = exclude_features.copy()

    for n_reduce in range(1, n_features):
        for k in range(n_features):
            for exclude in itertools.combinations(exclude_features, k):
                exclude = list(exclude)
                reduce_features = [i for i in range(n_features) if i not in exclude and i != n_reduce]
                reduced_corr_matrix = corr_matrix.iloc[reduce_features, reduce_features]
                n_high_corr = (reduced_corr_matrix.abs() > 0.5).sum().sum()
                if n_high_corr < min_high_corr:
                    min_high_corr = n_high_corr
                    best_n_reduce = n_reduce
                    best_exclude = exclude

    return best_n_reduce, best_exclude
In [220]:
# Compute the correlation matrix
corr_matrix = df.corr()

# Store the initial correlation matrix
corr_matrix_old = df.corr()

# Set the threshold for the number of high-correlation values
threshold = 7

# Set the initial list of excluded features
exclude_features = []

# Iterate until the number of high-correlation values is below the threshold
while True:
    # Hyperoptimize the number of features to reduce and the excluded features
    n_reduce, exclude_features = hyperoptimize(corr_matrix, exclude_features)

    # Reduce the specified features
    reduce_features = [i for i in range(corr_matrix.shape[0]) if i not in exclude_features and i != n_reduce]
    corr_matrix = corr_matrix.iloc[reduce_features, reduce_features]

    # Count the number of high-correlation values
    n_high_corr = (corr_matrix.abs() > 0.5).sum().sum()

    # If the number of high-correlation values is below the threshold, break the loop
    if n_high_corr <= threshold:
        break
In [221]:
# Create a figure with two axes side by side
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

# Plot the first heatmap on the left axis
sns.heatmap(corr_matrix_old, annot=True, cmap='coolwarm', center=0, square=True, ax=ax1)
ax1.set_title('Old correlation matrix')

# Plot the second heatmap on the right axis
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, ax=ax2)
ax2.set_title('Updated correlation matrix')

# Show the plot
plt.show()

Variables selection/reduction

X,Y Correlation

  • Look for features with higher importance
  • Selection method
In [ ]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(123)

# Generate random data with 10 independent variables and 1 dependent variable
data = np.random.rand(100, 11)

# Set the 3rd, 5th, and 7th independent variables to be strongly correlated with the dependent variable
data[:, 2] = 0.8 * data[:, -1] + 0.1 * np.random.randn(100)
data[:, 4] = 0.6 * data[:, -1] + 0.1 * np.random.randn(100)
data[:, 6] = 0.4 * data[:, -1] + 0.1 * np.random.randn(100)

# Convert the data to a Pandas DataFrame
df = pd.DataFrame(data, columns=['Indep'+str(i+1) for i in range(10)] + ['Dep'])
In [86]:
# Compute the correlation coefficients between the dependent variable and the independent variables
corr_matrix = df.corr().iloc[-1, :-1]
# Filter out the independent variables with low correlation coefficients
selected_vars = corr_matrix[corr_matrix.abs() >= 0.5].index
In [ ]:
# Convert the correlation matrix to a DataFrame with two columns: "var" and "corr"
corr_df = corr_matrix.reset_index().rename(columns={'index': 'var', 'Dep': 'corr'})

# Create a new column "selected" and initialize all values to "no"
corr_df['selected'] = 'no'

# Set the "selected" value to "yes" for the selected variables
corr_df.loc[corr_df['var'].isin(selected_vars), 'selected'] = 'yes'

# Display the resulting DataFrame
In [95]:
print(corr_df)
       var      corr selected
0   Indep1  0.256760       no
1   Indep2  0.047934       no
2   Indep3  0.922601      yes
3   Indep4  0.147780       no
4   Indep5  0.891848      yes
5   Indep6  0.006403       no
6   Indep7  0.781090      yes
7   Indep8  0.066287       no
8   Indep9  0.013465       no
9  Indep10 -0.076730       no

Multicollinearity

Two or more predictor variables in a regression model are highly correlated, meaning they are linearly dependent on each other

  • can reduce the statistical significance of individual predictor variables
  • lead to unstable and unreliable estimates of the regression coefficients
  • makes it difficult to identify which predictor variables are actually contributing
  • To address multicollinearity, variables can be removed or combined with other variables
In [ ]:
from IPython.display import IFrame

# define the URL of the webpage to embed
url = "https://mattbirch.shinyapps.io/Multicollinearity_Visualization_App/"

# create an IFrame object with the specified URL and dimensions
iframe = IFrame(url, width=900, height=400)
In [41]:
# display the IFrame object in the notebook output
display(iframe)

Multicollinearity

Variance Inflation Factor (VIF)

  • a measure of multicollinearity in a multiple regression analysis
  • the ratio of the variance of a coefficient estimated in the full model to the variance it would have if that predictor were uncorrelated with the other predictors; equivalently $$VIF_i = \frac{1}{1-R_i^2}$$ where $R_i^2$ is obtained by regressing the i-th predictor on all the other predictors
  • always greater than or equal to 1; a value of 1 indicates that there is no correlation between the i-th variable and the other predictors in the model
  • the further above 1, the higher the degree of correlation with one or more of the other predictors
  • a value of 5 or more is often used as a threshold for identifying variables with high multicollinearity
In [224]:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Set random seed for reproducibility
np.random.seed(123)

# Generate random data with 10 independent variables
data = np.random.rand(100, 10)

# Add multicollinearity among some of the variables
data[:, 2] = 0.8 * data[:, 0] + 0.2 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 4] = 0.6 * data[:, 0] + 0.4 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 6] = 0.4 * data[:, 0] + 0.6 * data[:, 1] + 0.1 * np.random.randn(100)

# Convert the data to a Pandas DataFrame
df = pd.DataFrame(data, columns=['Var'+str(i+1) for i in range(10)])
In [ ]:
# Compute the variance inflation factor for each variable
vif = pd.Series([variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
                index=df.columns)

# Select the variables with VIF less than 5
selected_vars = vif[vif < 5].index
In [225]:
# Print the VIF and the selected variables
print('Variance Inflation Factor:')
print(vif)
print('Selected variables:')
print(selected_vars)
Variance Inflation Factor:
Var1     39.607395
Var2     27.014129
Var3     31.176229
Var4      4.314400
Var5     32.245597
Var6      4.181100
Var7     40.196595
Var8      3.633429
Var9      4.434751
Var10     3.954685
dtype: float64
Selected variables:
Index(['Var4', 'Var6', 'Var8', 'Var9', 'Var10'], dtype='object')

Regularization

Regularization is a technique to prevent overfitting and improve the performance of predictive models by adding a penalty term to the model.

  • shrinking the magnitude of the model parameters
  • making them less sensitive to small changes in the input data
  • encourages the model to produce simpler and more generalizable solutions
  • advisable to use in combination with some cross-validation, for easier hyperparameter optimization of the strength of the penalty term (a cross-validation sketch follows below)
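As a hedged illustration of that advice, scikit-learn's LassoCV selects the penalty strength by cross-validation; a minimal sketch on the same California housing data used in the cells below (the 5-fold setting and alpha grid are assumptions of this example):
In [ ]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# same California housing data as in the regularization cells below
dataset = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42)

# LassoCV searches a grid of alphas with 5-fold cross-validation on the training set
lasso_cv = LassoCV(alphas=np.logspace(-3, 0, 50), cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)

print("Selected alpha:", lasso_cv.alpha_)
print("Test R^2:", lasso_cv.score(X_test, y_test))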

L1 Regularization

Least Absolute Shrinkage and Selection Operator adds a penalty term to the sum of squared errors (SSE) in the linear regression equation, using a tuning parameter to control the amount of regularization.

  • shrinks the magnitude of the coefficients towards zero
  • in effect it also performs variable selection (a more interpretable and efficient model)
  • could solve the "large p, small n" problem
  • could deal with multi-collinearity
  • an optimization algorithm is used to minimize the objective function
$$SSE_{L1\,norm} = \min\left(SSE + \lambda \cdot \sum_{i}|\beta_{i}|\right)$$

where:

  • SSE is the sum of squared errors
  • λ is the tuning parameter
  • βi are the coefficients of the predictor variables
In [275]:
%reset -f
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
dataset = fetch_california_housing()
In [276]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In [277]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)

# Create a Lasso regression object with alpha=0.5
lasso_reg = Lasso(alpha=0.5)

# Fit the Lasso regression model to the training data
lasso_reg.fit(X_train, y_train)

# Use the trained model to make predictions on the testing data
y_pred = lasso_reg.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
In [278]:
# Extract the coefficients for each feature
coefficients = lasso_reg.coef_

# Extract the intercept term
intercept = lasso_reg.intercept_

# Print the coefficients, intercept, alpha and mse
df = pd.DataFrame(coefficients.reshape(1,-1))
df['icpt']=intercept
In [279]:
print("Mean Squared Error: ", mse)
df.round(2)
Mean Squared Error:  0.7263312822033787
Out[279]:
0 1 2 3 4 5 6 7 icpt
0 0.29 0.01 0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.59
In [100]:
import numpy as np
import matplotlib.pyplot as plt

# Create a range of alpha values to test
alphas = np.logspace(-3, 0, 100)

# Initialize the Lasso regression model
lasso_reg = Lasso()

# Initialize lists to store the coefficients and mean squared errors
coefficients = []
mse_values = []

# Iterate over the range of alpha values and fit the Lasso regression model
for alpha in alphas:
    lasso_reg.set_params(alpha=alpha)
    lasso_reg.fit(X_train, y_train)
    coefficients.append(lasso_reg.coef_)
    mse_values.append(np.mean((lasso_reg.predict(X_test) - y_test)**2))
    
# Find the alpha value with the minimum mean squared error
min_idx = np.argmin(mse_values)
min_alpha = alphas[min_idx]
min_mse = mse_values[min_idx]

# Plot the coefficients against the alpha values
def plot_lasso_coefs():
    plt.figure(figsize=(10, 6))
    plt.plot(alphas, coefficients)
    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Coefficients')
    plt.title('Lasso Regression Coefficients')
    plt.legend(dataset.feature_names)
    
# Plot the mean squared error against the alpha values
def plot_lasso_mse():
    plt.figure(figsize=(10, 6))
    plt.plot(alphas, mse_values)
    plt.plot(min_alpha, min_mse, 'ro')
    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Mean Squared Error')
    plt.title('Lasso Regression Mean Squared Error')
In [101]:
plot_lasso_coefs()
In [104]:
plot_lasso_mse()

L2 Regularization

Ridge regression adds a penalty term - proportional to the squared values of the model parameters.

  • encourages to produce solutions with smaller parameter values
  • often used to reduce the effects of collinearity by shrinking the coefficients of correlated variables towards each other
  • can help prevent overfitting
  • also shrinks the magnitude of the coefficients towards zero, but unlike lasso regression, it does not force any of the coefficients to be exactly zero
  • can improve the stability and reduce the variance of the model without sacrificing too much in terms of bias.
$$SSE_{L2\,norm} = \min\left(SSE + \lambda \cdot \sum_{i}\beta_{i}^2\right)$$

where

  • SSE is the sum of squared errors
  • λ is the tuning parameter
  • βi are the coefficients of the predictor variables
In [73]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In [74]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
dataset = fetch_california_housing()
In [75]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)

# Create a Ridge regression object with alpha=0.5
ridge_reg = Ridge(alpha=0.5)

# Fit the Ridge regression model to the training data
ridge_reg.fit(X_train, y_train)

# Use the trained model to make predictions on the testing data
y_pred = ridge_reg.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
In [88]:
# Extract the coefficients for each feature
coefficients = ridge_reg.coef_

# Extract the intercept term
intercept = ridge_reg.intercept_

# Print the coefficients, intercept, alpha and mse
df = pd.DataFrame(coefficients.reshape(1,-1))
df['icpt']=intercept
In [89]:
print("Mean Squared Error: ", mse)
df.round(2)
Mean Squared Error:  0.5635171842142302
Out[89]:
0 1 2 3 4 5 6 7 icpt
0 0.13 0.02 0.03 -0.01 -0.0 -0.01 -0.03 -0.01 2.07
In [139]:
# Scale the features to have zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # apply the same scaling to the test set used below

# Create a range of alpha values to test
alphas = np.logspace(0, 4.5, 100)

# Initialize the Ridge regression model
ridge_reg = Ridge()

# Initialize lists to store the coefficients and mean squared errors
coefs = []
mse_values = []

# Iterate over the range of alpha values and fit the Ridge regression model
for alpha in alphas:
    ridge_reg.set_params(alpha=alpha)
    ridge_reg.fit(X_train, y_train)
    coefs.append(ridge_reg.coef_)
    mse_values.append(np.mean((ridge_reg.predict(X_test) - y_test)**2))
    

# Find the alpha value with the minimum mean squared error
min_idx = np.argmin(mse_values)
min_alpha = alphas[min_idx]
min_mse = mse_values[min_idx]

# Plot the coefficients against the alpha values
def plot_ridge_coefs():
    plt.figure(figsize=(10, 6))
    ax = plt.gca()
    ax.plot(alphas, coefs)
    ax.set_xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Coefficients')
    plt.title('Ridge Regression Coefficients')
    plt.legend(dataset.feature_names)
    plt.axis('tight')
In [140]:
plot_ridge_coefs()
In [141]:
# Plot the mean squared error against the alpha values
def plot_ridge_mse():
    plt.figure(figsize=(10, 6))
    plt.plot(alphas, mse_values)
    plt.plot(min_alpha, min_mse, 'ro')
    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Mean Squared Error')
    plt.title('Ridge Regression Mean Squared Error')
In [142]:
plot_ridge_mse()
In [90]:
# Load the California housing dataset
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()

# Extract the features and target variable
X = dataset.data
y = dataset.target

# Add a column of ones for the intercept
X = np.concatenate([np.ones((len(y), 1)), X], axis=1)

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the tuning parameter
alpha = 0.5

# Minimal closed-form ridge solver (assumed helper - it is not defined elsewhere in this notebook):
# beta = (X'X + alpha*I)^(-1) X'y, leaving the intercept column unpenalized
def ridge_regression(X, y, alpha):
    penalty = alpha * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # do not penalize the intercept term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Fit the Ridge regression model to the training data
beta = ridge_regression(X_train, y_train, alpha)

# Use the trained model to make predictions on the testing data
y_pred = X_test @ beta

# Calculate the mean squared error of the predictions
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)

# Print the coefficients, intercept, alpha and mse
df = pd.DataFrame(beta.reshape(1,-1))

L1 & L2 Regularization

Elastic net is a regularization technique combining the L1 and L2 penalties used in Lasso and Ridge regression.

  • designed to overcome some of the limitations of Lasso and Ridge regression by leveraging their respective strengths
  • L1 penalty tends to produce sparse solutions by setting some coefficients to zero
  • L2 penalty tends to shrink the magnitude of the coefficients towards zero
  • combines the strengths of Lasso and Ridge regression by adding a combination of the L1 and L2 penalties, shrinking the magnitude of the coefficients towards zero while also setting some of them to exactly zero.
$$SSE_{L1L2\,norm} = \min\left(SSE + \lambda_1 \cdot \sum_{i}|\beta_{i}| + \lambda_2 \cdot \sum_{i}\beta_{i}^2\right)$$

where:

  • SSE is the sum of squared errors
  • λ1 and λ2 are the tuning parameters
  • βi are the coefficients of the predictor variables
In [172]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In [173]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
dataset = fetch_california_housing()
In [174]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)

# Create an Elastic Net object with alpha=0.5, l1_ratio=0.5
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)

# Fit the Elastic Net model to the training data
enet.fit(X_train, y_train)

# Use the trained model to make predictions on the testing data
y_pred = enet.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
In [175]:
# Extract the coefficients for each feature
coefficients = enet.coef_

# Extract the intercept term
intercept = enet.intercept_

# Print the coefficients, intercept, alpha and mse
df = pd.DataFrame(coefficients.reshape(1,-1))
df['icpt']=intercept
In [176]:
print("Mean Squared Error: ", mse)
df.round(2)
Mean Squared Error:  0.6868730783041608
Out[176]:
0 1 2 3 4 5 6 7 icpt
0 0.34 0.01 -0.0 0.0 0.0 -0.0 -0.0 -0.0 0.31
In [206]:
import matplotlib.pyplot as plt

# Create a range of alpha values to test
alphas = np.logspace(-3, 0, 100)
l1_ratios = np.linspace(0, 1, 100)

# Initialize the Elastic Net regression model
enet_reg = ElasticNet()

# Initialize lists to store the coefficients and mean squared errors
coefficientsa = []
coefficientsl1 = []
mse_valuesa = []
mse_valuesl1 = []

# Iterate over the range of alpha values and fit the Elastic Net model
for alpha in alphas:
    enet_reg.set_params(alpha=alpha)
    enet_reg.fit(X_train, y_train)
    coefficientsa.append(enet_reg.coef_)
    mse_valuesa.append(np.mean((enet_reg.predict(X_test) - y_test)**2))

# Iterate over the range of l1_ratio values and fit the Elastic Net model
for l1_ratio in l1_ratios:
    enet_reg.set_params(l1_ratio=l1_ratio)
    enet_reg.fit(X_train, y_train)
    coefficientsl1.append(enet_reg.coef_)
    mse_valuesl1.append(np.mean((enet_reg.predict(X_test) - y_test)**2))
    
# Find the alpha value with the minimum mean squared error
min_idx = np.argmin(mse_valuesa)
min_alpha = alphas[min_idx]
min_msea = mse_valuesa[min_idx]

# Find the l1_ratio value with the minimum mean squared error
min_idx = np.argmin(mse_valuesl1)
min_l1ratio = l1_ratios[min_idx]
min_msel1 = mse_valuesl1[min_idx]

# Plot the coefficients against the alpha values
def plot_enet_coef_alpha():
    plt.figure(figsize=(10, 6))
    plt.plot(alphas, coefficientsa)
    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Coefficients')
    plt.title('ElasticNet Regression Coefficients')
    plt.legend(dataset.feature_names)
    
# Plot the mean squared error against the alpha values
def plot_enet_mse_alpha():
    plt.figure(figsize=(10, 6))
    plt.plot(alphas, mse_valuesa)
    plt.plot(min_alpha, min_msea, 'ro')
    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('Mean Squared Error')
    plt.title('ElasticNet Regression Mean Squared Error')   
    
# Plot the coefficients against the l1_ratio values
def plot_enet_coef_l1ratio():
    plt.figure(figsize=(10, 6))
    plt.plot(l1_ratios, coefficientsl1)
    plt.xscale('linear')
    plt.xlabel('L1 ratio')
    plt.ylabel('Coefficients')
    plt.title('ElasticNet Regression Coefficients')
    plt.legend(dataset.feature_names)

# Plot the mean squared error against the l1_ratio values
def plot_enet_mse_l1ratio():
    plt.figure(figsize=(10, 6))
    plt.plot(l1_ratios, mse_valuesl1)
    plt.plot(min_l1ratio, min_msel1, 'ro')
    plt.xscale('linear')
    plt.xlabel('L1 ratio')
    plt.ylabel('Mean Squared Error')
    plt.title('ElasticNet Regression Mean Squared Error')
/home/junior/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.284e+03, tolerance: 2.207e+00 Linear regression models with null weight for the l1 regularization term are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead.
  model = cd_fast.enet_coordinate_descent(
In [207]:
plot_enet_coef_alpha()
In [208]:
plot_enet_mse_alpha()
In [209]:
 plot_enet_coef_l1ratio()
In [210]:
plot_enet_mse_l1ratio()

Factor analysis

Feature extraction: transforming raw data into a new set of features that better capture the underlying patterns in the data. This can include techniques such as principal component analysis (PCA), singular value decomposition (SVD), Linear Discriminant Analysis (LDA), etc. (a minimal PCA sketch follows below).
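As a closing illustration, a minimal PCA sketch on synthetic correlated data (the 10-variable setup mirrors the correlation examples above and is an assumption of this example, not material from the original deck):
In [ ]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic data with 10 variables, three of them driven by the first two (as in the correlation examples)
np.random.seed(123)
data = np.random.rand(100, 10)
data[:, 2] = 0.8 * data[:, 0] + 0.2 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 4] = 0.6 * data[:, 0] + 0.4 * data[:, 1] + 0.1 * np.random.randn(100)
data[:, 6] = 0.4 * data[:, 0] + 0.6 * data[:, 1] + 0.1 * np.random.randn(100)
df = pd.DataFrame(data, columns=['Var' + str(i + 1) for i in range(10)])

# standardize, then keep enough principal components to explain ~90% of the variance
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=0.9)
factors = pca.fit_transform(X_scaled)

print("Number of components kept:", pca.n_components_)
print("Explained variance ratios:", pca.explained_variance_ratio_.round(2))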