Correlation among features is one of the most discussed topics in the Machine Learning space. While it is very common to find a blog explaining the correlation among Numerical features, it is quite rare to find one explaining Correlation among other types of Features.

In this post, we will discuss Correlation among different types of Features and their Python code.

We will learn -

- What are the other types of Correlation *(other than Pearson's coefficient)*
- Intuition to understand the underlying correlation strength
- Python code for each individual Correlation method
- Other methods *(not discussed in this post)*

Let's quickly review the dataset that we will use and its respective Features. It's a Life expectancy dataset for different countries.

Due to its nature, it has all the different types of Features *i.e. Numerical, Categorical, etc.* We have also created a new Feature, *population_rank*, so that we have two ordinal Features too.

```
# Download the data from Kaggle (notebook environment; "!" runs a shell command)
import os, pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
os.environ['KAGGLE_USERNAME'] = "10xAI"
os.environ['KAGGLE_KEY'] = "<<Use your Kaggle API Key>>"
import kaggle
!kaggle datasets download kumarajarshi/life-expectancy-who
dataset = pd.read_csv("/content/life-expectancy-who.zip")
# Select 6 columns and 1000 rows
sample = dataset.iloc[:,0:6].sample(1000).reset_index(drop=True)
# This is just to create another Ordinal Feature, so that we have all the combinations of feature types
population = dataset.loc[dataset.Year==2014,['Country','Population']].fillna(0).sort_values(by=['Population']).reset_index(drop=True)
sample['population_rank'] = sample['Country'].apply(lambda x: population[population.Country==x].index.values)
# Use rank 0 when the country has no 2014 record (explicit length check instead of array truthiness)
sample['population_rank'] = sample['population_rank'].apply(lambda x: 0 if len(x) == 0 else x[0])
# Fill any NaN in the numerical columns with the column mean
sample.fillna(value=sample.mean(numeric_only=True), inplace=True)
sample.iloc[:5,:]
```

^{Sample 5 records from the sample DataFrame}

Below is the mapping of our columns to their respective types:

- Year and population_rank as "Ordinal"
- Status and Country as "Categorical/Nominal"
- Remaining three Features as "Numerical/Continuous"

Let's follow a "Top-Down" approach. So we are listing the approaches that we will follow to calculate the correlation between any two types of Features.

In the subsequent section, we will learn the details of all of these and also code the same.

We have kept only the upper triangle of the table to avoid redundancy, because the lower triangle is a replica of the upper triangle *e.g. Numerical-Categorical* is the same as *Categorical-Numerical*.

Let's understand each of the used approaches

**Cramér's V**, the method we use for the Nominal to Nominal case, is based on the Chi-square value of two nominal features.

The **Chi-square** value is calculated using the contingency table and the deviation of every observed value from its expected value. If the expected values are similar to the observed values, then we can safely assume little or no correlation.

Let's understand this with two examples and a manual calculation.

| | Obesity | No Obesity | Total |
|---|---|---|---|
| No Gym | 50 (50) | 50 (50) | 100 |
| Gym | 50 (50) | 50 (50) | 100 |
| Total | 100 | 100 | 200 |

We have sample data for 200 individuals and its contingency table for Gym goers and those having Obesity.

A **contingency table** is simply the cross-tabulated count of the different combinations of the two Features' values, e.g. in the above table there are 50 data points where *Feature#01 has "Gym=Yes" and Feature#02 has "Obesity=Yes".* This is our observed data.

We are trying to answer a simple question *" What is the effect of Gym in preventing Obesity"*

Let's calculate the expected values using the individual totals and the grand total. The expected value is the value for each cell in the contingency table if we assume no relation between the two Features. This can be calculated using a simple formula

$$Expected\ value = (row\ total \times column\ total) / grand\ total$$

Expected value for "No Gym" - Obesity vs "No Obesity": ^{There are 100 non-Gym goers. Since 100 of the 200 individuals have Obesity, assuming no relation between Gym and Obesity implies that 50% of the 100 non-Gym goers will have Obesity and 50% will not.}

With the above logic, we will have 50 each for Obesity and non-Obesity. Coincidentally, this value is the same for all the cells *i.e. 50, the values in the parentheses.*

In **chi-square** logic, we simply sum the squared deviation of each observed value from its expected value, scaled by the expected value, i.e.

$$chi\text{-}square = \Sigma (observed - expected)^2 / expected$$

Hence, **chi_square = 0** for the above scenario, which implies no correlation. This was expected, since the data was created for this purpose.

*There is no difference between the expected values and the observed values, which implies no correlation between going to the Gym and not being obese.*

In other words, had there been such a correlation, the observed value of *"Gym, Obese"* would be much lower than expected, and that of *"Gym, Not Obese"* much higher.

Let's see another sample data for the same.

| | Obesity | No Obesity | Total |
|---|---|---|---|
| No Gym | 75 (50) | 25 (50) | 100 |
| Gym | 25 (50) | 75 (50) | 100 |
| Total | 100 | 100 | 200 |

Let's do the calculation again.

Expected values will remain 50 for each case. We will directly calculate the **chi_square** value.

$$chi\text{-}square = \frac{(75-50)^2}{50} + \frac{(25-50)^2}{50} + \frac{(25-50)^2}{50} + \frac{(75-50)^2}{50}$$

*= 12.5 + 12.5 + 12.5 + 12.5 = 50*
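The manual calculation above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `expected_counts` and `chi_square` are made up for illustration, and the tables are the two Gym/Obesity examples from this post.

```python
def expected_counts(observed):
    """Expected cell counts if the row and column Features were unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # expected = row total * column total / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: No Gym, Gym. Columns: Obesity, No Obesity.
no_relation = [[50, 50], [50, 50]]
strong_relation = [[75, 25], [25, 75]]

print(expected_counts(strong_relation))  # [[50.0, 50.0], [50.0, 50.0]]
print(chi_square(no_relation))           # 0.0
print(chi_square(strong_relation))       # 50.0
```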

But you might have noticed two issues with this value,

- Since all the terms are positive, having too many unique values in each Feature can inflate the value, as every new cell adds to the total.
- It is difficult to qualify the value *i.e. whether the correlation is little or large*

With **Cramér's V**, we fix these issues by standardizing the value w.r.t. *the sample size and the R x C dimensions of the table.*

$$\text{Cramér's V} = \sqrt{\frac{chi\text{-}square}{n \times min(r-1,\ c-1)}}$$

**n** = sample size, **r** = row count, **c** = column count

Cramér's V for the above scenario *= SQRT( 50/(200 x 1) ) = SQRT( 0.25 )*

*= 0.5, which shows a decent correlation*
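To see what the standardization buys us, here is a small sketch in plain Python. The function name `cramers_v` and the ten-times-larger table are illustrative assumptions; the first table is the example above.

```python
import math


def cramers_v(observed):
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # chi-square: squared deviation from the expected count, scaled by it
    chi_sq = sum(
        (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    r, c = len(row_totals), len(col_totals)
    return math.sqrt(chi_sq / (n * min(r - 1, c - 1)))


table = [[75, 25], [25, 75]]
ten_times_larger = [[750, 250], [250, 750]]   # same proportions, 10x the sample

print(cramers_v(table))             # 0.5
print(cramers_v(ten_times_larger))  # 0.5, although chi-square grows from 50 to 500
```

The chi-square value scales with the sample size, while Cramér's V stays the same for the same proportions, which is what makes it easier to qualify.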

Let's write the Python code for it.

```
def crammer(s1, s2): #1
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency
    n = len(s1)
    r,c = s1.nunique(), s2.nunique()
    matrix = pd.crosstab(s1,s2).values
    # correction=False switches off Yates' continuity correction, which scipy applies
    # to 2x2 tables by default, so the result matches the manual calculation above
    chi_sq = chi2_contingency(matrix, correction=False)
    cramm_V = np.sqrt(chi_sq[0]/(n*min(r-1,c-1)))
    return cramm_V
```

Code-explanation^{#1 - The code is self-explanatory: take the two Features as Pandas Series, calculate the cross-tab matrix, and let the scipy.stats module calculate the chi_square. Finally, calculate Cramér's V using NumPy functions.}

We are done with the Nominal to Nominal case. Let's move to the Nominal to Continuous and Nominal to Ordinal cases. In both these cases, we will use a similar approach.

This approach is quite simple and can be used in most cases. We define one of the variables as the independent feature and the other as the dependent feature.

Using the two features, we fit a Linear/Logistic Regression model and then calculate the **r-square score.** The underlying philosophy is that if the two Features have little or no correlation, the model's score will reflect the same.

The **r-square score** of the model is used to get the strength. The better the **r-square score**, the stronger the relationship. You can read [here] about the **r-square score**.

It's the share of variability explained by the model relative to a line passing through the mean. In our case, it tells how well one feature can explain the other.

There is a limitation with the **r-square score** i.e. its value never decreases with an additional feature. So with too many garbage features, its value will increase and reflect an incorrect relationship.

Although in our case it's just one Feature, we will still use the **adjusted r-square score.**
It takes the number of features into consideration to offset the increase in the **r-square score** caused by adding features.
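As a quick illustration of the adjustment, the sketch below applies the usual adjusted r-square formula to the same raw score with different feature counts. The function name `adjusted_r2` and the numbers are made up for the example.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Penalize the r-square score for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - 1 - n_features)


# Same raw score of 0.50 on 100 samples, with 1 feature vs 20 features
print(round(adjusted_r2(0.50, n_samples=100, n_features=1), 3))   # 0.495
print(round(adjusted_r2(0.50, n_samples=100, n_features=20), 3))  # 0.373
```

With a single feature the penalty is tiny, and it grows as more features are added for the same raw score.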

Let's proceed to the work -

**Categorical - Numerical** - Treat the Numerical feature as dependent, convert the Categorical feature with one-hot encoding (OHE), fit a Linear Regression and get the **r-square score**

**Categorical - Ordinal** - In this case, there will be an additional step *i.e. to convert the Ordinal feature into a Numerical feature using the Rank*. So, basically, we are assuming that if there is a correlation, the prediction will be in a specific direction *i.e. either from a smaller rank towards the larger rank or vice-versa.*

To create a mental picture for the above explanation, observe the relation in the image below, first two tables show a Correlation while the 3rd table depicts a random relation.

Now, since we are clear with the concept, let's do the coding.

```
def reg_r2score(s1, s2):
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    x, y = (s2,s1) if s2.dtype == object else (s1,s2) #1
    y = y.rank() # Rank the dependent feature (needed for the Ordinal case)
    x = pd.get_dummies(x) # One-hot encode the Nominal feature
    reg = LinearRegression().fit(x, y)
    r2score = reg.score(x, y)
    n = len(y)
    k = len(x.columns)
    adj_r_sq = 1 - (1 - r2score)*(n-1)/(n-1-k)
    return adj_r_sq
```

Code-explanation^{#1 - In this line of code, we are simply deciding that the Nominal feature will be x i.e. the predictor. All the remaining lines are self-explanatory}
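To make the idea concrete without any dataset, here is a toy sketch using only NumPy. It swaps scikit-learn's LinearRegression for `np.linalg.lstsq` on the one-hot columns (no intercept), which gives the same fitted values; the data, the variable names and the skipped ranking step are all assumptions for illustration.

```python
import numpy as np

status = np.array(["A", "A", "A", "B", "B", "B"])     # nominal feature
target = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])     # numerical feature

# One-hot encode the nominal feature: one 0/1 column per category
categories = np.unique(status)
x = (status[:, None] == categories[None, :]).astype(float)

# Least-squares fit; with one-hot columns each coefficient is that category's mean
coef, *_ = np.linalg.lstsq(x, target, rcond=None)
pred = x @ coef

ss_res = np.sum((target - pred) ** 2)            # variability left unexplained
ss_tot = np.sum((target - target.mean()) ** 2)   # variability around the mean
r2 = 1 - ss_res / ss_tot

n, k = len(target), x.shape[1]   # k counts the one-hot columns, as in reg_r2score
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - k)
print(round(r2, 3), round(adj_r2, 3))   # 0.931 0.885
```

Since the two categories have clearly separated values, knowing the category explains most of the variability, and the score is close to 1.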

We are done with the cases of *Nominal to Continuous* and *Nominal to Ordinal*.

Let's move to the cases of *Ordinal to Ordinal* and *Ordinal to Continuous*. In both these cases, we will use a similar approach i.e. **Spearman's rank-order coefficient** or **Spearman's rho.**

It has a simple approach: we rank the values of each feature and then measure the difference between the respective ranks.

It means that if the ranking is similar for both features, we can assume a high correlation. **e.g. Academics vs Sports**, if the people who are top-ranked in Academics are consistently either top-ranked or bottom-ranked in Sports, we will assume a very high (positive or negative) correlation between the two.

Below is the formula,

$$spearman\ rho = 1 - \frac{6 \times \Sigma{d_i}^2}{n(n^2-1)}$$

**d** is the difference between the respective ranks of the two Features and **n** is the number of data points.

The sum of squares will always increase with additional data points, so the number of data points has been factored in to balance it.

Let's consider a simple example of *"Drug Intake" (Numerical)* and *"Rank in Championship" (Ordinal).*

| drugs_intake (in grams) | rank in championship | rank of drug intake (since it's numerical) | difference (d) |
|---|---|---|---|
| 5 | 2 | 4 | 2 |
| 10 | 3 | 3 | 0 |
| 15 | 4 | 2 | 2 |
| 20 | 1 | 1 | 0 |

$$spearman\ rho = 1 - \frac{6 \times (4 + 0 + 4 + 0)}{4 \times (16-1)} = 1 - 0.8 = 0.2$$
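The table above can be reproduced step by step in plain Python. This is a sketch of the tie-free formula only; `rank_descending` and `spearman_rho` are illustrative names, and ranking the largest intake as 1 follows the table.

```python
def rank_descending(values):
    """Rank 1 for the largest value (assumes there are no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rho(ranks_a, ranks_b):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)) for two lists of ranks."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


drug_intake = [5, 10, 15, 20]        # numerical, so it is ranked first
championship_rank = [2, 3, 4, 1]     # already a rank

intake_rank = rank_descending(drug_intake)
print(intake_rank)                                              # [4, 3, 2, 1]
print(round(spearman_rho(championship_rank, intake_rank), 2))   # 0.2
```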

Let's do the coding part. In this case, it is very simple *i.e. using a function from scipy.* The Pandas *corr()* function can also calculate this when the appropriate method is passed.

```
def spearman(s1, s2):
    from scipy.stats import spearmanr
    corr, _ = spearmanr(s1,s2) #1
    return corr
```

Code-explanation^{#1 - We are assuming that both the Features are ordinal. Had one been Continuous we would have used the rank function to get the rank as we did in the previous code.}

^{Note - You might have observed that we converted Ordinal into Rank and used regression approach for Ordinal-Nominal and did same for Ordinal-Continuous but calculated spearman's rho. So the whole idea is to measure the movement of one feature w.r.t to the other. If you understand the underlying logic, you can easily have more flexibility.}

**Pearson's correlation coefficient** *is the de facto correlation approach*, since most of the time you deal with Continuous data points and it is meant for Continuous features.

We will not explain much of this. You can read about it [Here].

It gives us the **linear relationship** between the two Features *i.e. how one moves linearly when the other changes.*

For a pair of variables, the R-square score of a simple linear regression is the square of **Pearson's correlation coefficient**; in other words, the coefficient is the signed square root of the R-square score.

Below is the function to calculate it using **scipy**.

```
def pearson(s1, s2):
    from scipy.stats import pearsonr
    corr, _ = pearsonr(s1,s2)
    return corr
```
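The link between Pearson's coefficient and the R-square score of a one-feature linear fit can be checked numerically. The sketch below uses only NumPy (`np.corrcoef` and `np.polyfit`) on made-up numbers.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# R-square score of a straight-line fit of y on x
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# The two printed values agree up to floating point error
print(r ** 2, r2)
```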

Just be mindful of the fact that all these techniques are based on different approaches, so you can't compare the output of one with another *i.e. 0.5 from Cramér's V might not be the same as 0.5 from the Pearson correlation coefficient.*

You may create a function that accepts all the Features and their type and create a consolidated Table or a heatmap. As shown below.

**Heatmap**

Generating the Heatmap from the Correlation matrix is simple stuff. You can directly call the Heatmap method of Seaborn.

```
fig, ax = plt.subplots(1, 1, figsize=(10,5))
sns.heatmap(corr_matrix, ax=ax, annot = True)
```
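One possible shape for such a consolidated function is sketched below. It is only an illustration under several assumptions: the dispatch table `METHODS`, the helper names and the toy DataFrame are made up, Spearman is computed as Pearson on ranks with pandas alone, and the Nominal vs Ordinal/Continuous pairs (the regression approach) are left as NaN to keep the sketch short.

```python
import numpy as np
import pandas as pd


def cramers_v(s1, s2):
    table = pd.crosstab(s1, s2).to_numpy().astype(float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi_sq = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi_sq / (n * min(table.shape[0] - 1, table.shape[1] - 1)))


def spearman(s1, s2):
    return s1.rank().corr(s2.rank())   # Pearson on the ranks


def pearson(s1, s2):
    return s1.corr(s2)


# Hypothetical dispatch table: pair of feature types -> method
METHODS = {
    frozenset(["nominal"]): cramers_v,
    frozenset(["ordinal"]): spearman,
    frozenset(["ordinal", "continuous"]): spearman,
    frozenset(["continuous"]): pearson,
}


def correlation_matrix(df, feature_types):
    cols = list(feature_types)
    matrix = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        matrix.loc[a, a] = 1.0
        for b in cols[i + 1:]:   # upper triangle only
            func = METHODS.get(frozenset([feature_types[a], feature_types[b]]))
            if func is not None:   # pairs without a method here stay NaN
                matrix.loc[a, b] = func(df[a], df[b])
    return matrix


toy = pd.DataFrame({
    "status": ["dev", "dev", "rich", "rich", "dev", "rich"],
    "region": ["x", "x", "y", "y", "x", "y"],
    "year": [2000, 2001, 2002, 2003, 2004, 2005],
    "life_expectancy": [60.0, 61.5, 70.2, 71.0, 63.3, 74.8],
})
types = {"status": "nominal", "region": "nominal",
         "year": "ordinal", "life_expectancy": "continuous"}

corr_matrix = correlation_matrix(toy, types)
print(corr_matrix.round(2))
```

The resulting `corr_matrix` is a plain DataFrame, so it can be handed to a Seaborn heatmap call like the one above.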

In this post, we will dive into the DateTime related modules of Python and Pandas. Handling DateTime is often a boring part of any programming language. Many times we can achieve most of our requirements without delving much into these modules. But if we understand them structurally, it is not that boring. It will make your life pretty easy when handling a Timeseries dataset.

We will try to develop a mindmap along with this post. We will cover,

- Datetime objects in Python
- Operations and Arithmetic on Python Datetime object
- Read DateTime from String and format back to String
- Datetime objects in Pandas
- Learning to operate TimeSeries data based on Datetime Index
- Understanding and applying Delta, Offsets, Timezone

We will cover the above-mentioned modules in this post. We will dive deep into the datetime module of Python and all the shown modules of Pandas. These are enough for all our DateTime needs.

`time` is the first package that we will discuss. You may not need it often because the datetime module covers everything that is available in this module.

There are 3 ways we can input the information for a time

- **epoch** - Seconds since a reference instant, known as the epoch. Midnight, UTC, of January 1, 1970, is a popular epoch used on both Unix and Windows platforms.
- **As a tuple** - An alternative to seconds since the epoch, a time instant can be represented by a tuple of nine integers, called a timetuple, as shown below
  `tm_year=2005, tm_mon=8, tm_mday=7, tm_hour=23, tm_min=21, tm_sec=29, tm_wday=6, tm_yday=219, tm_isdst=0`
  This is an intuitive approach since we have the option to input all the relevant values with a keyword argument. This approach is common across different modules but with different names of the underlying Class. `struct_time` is the name for the Class in the `time` module
- **From String** - We can also read from strings like '2020-11-18 23:59:59'

Let's see the functions that are required to achieve the above methods.

```
import time
tm = time.gmtime(1123456889.5) # epoch --> time.struct_time object (in UTC)
time.mktime(tm) # time.struct_time object --> epoch (mktime reads tm as local time; calendar.timegm is the UTC inverse of gmtime)
time.struct_time((2005, 8, 7, 23, 21, 29, 6, 219, 0)) # Create struct_time explicitly
time.time() # Current time in epoch
# Get the individual attributes
print(tm.tm_year, tm.tm_mon, tm.tm_mday,tm.tm_hour, tm.tm_min, tm.tm_sec)
```

Code-explanation^{We have simply used 3 functions of the time module i.e. gmtime, mktime and time}^{All other parts of the code are quite trivial and self-explanatory.}
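A small round trip makes the epoch and tuple representations easier to trust. The sketch below uses only the standard library and pairs `time.gmtime` with `calendar.timegm`, which treats the tuple as UTC, so the result does not depend on the machine's timezone.

```python
import calendar
import time

epoch = 1123456889
tm = time.gmtime(epoch)                # epoch -> struct_time in UTC
print(tm.tm_year, tm.tm_mon, tm.tm_mday, tm.tm_hour, tm.tm_min, tm.tm_sec)  # 2005 8 7 23 21 29
print(tm.tm_yday)                      # 219, the day of the year

back = calendar.timegm(tm)             # struct_time in UTC -> epoch
print(back == epoch)                   # True
```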

With the above code snippet, we are equipped to read and save time data. Let's read from a String and format back to a String.

```
read_time = time.strptime("2018-04-02 23:59:50", '%Y-%m-%d %H:%M:%S')
str_time = time.strftime('%d-%b-%Y %H:%M:%S', read_time)
```

Code-explanation^{We have two functions at our service - strptime and strftime.}^{The meaning of each format code can be checked Here}

The `datetime` module has all the functionality of the time module and has many APIs on top of it. So, you might ignore the time module.

The `datetime` module has Classes for - *Date, Time, and Datetime*. The first two are for Date and Time respectively and the last one is the superset of the two. Hence the last one i.e. the `datetime` Class is sufficient for all of our tasks.

_{Why we need datetime when we have the time module}_{The high-level reason is that the time module handles time as a Float. It is not designed keeping humans in mind. datetime has all the required APIs needed to handle date and time by a Human. Check this Reddit answer.}

Let's check the `datetime` module with the required code. Be mindful that the Object which stores the values will be of the `datetime` Class. Also, take note that the name of the top-level module is also `datetime`.

```
from datetime import datetime # Both are named datetime
dtm = datetime(2000, 5, 23, hour=0, minute=0,second=0, microsecond=0,tzinfo=None) # Time tuple
dtm = datetime.fromtimestamp(1123456889.5) # epoch --> datetime in local time. Similar to time.localtime
datetime.now() # Current time
# Read from String
datetime.strptime("2018-04-02 23:59:50", '%Y-%m-%d %H:%M:%S') # string--> datetime
# Back to String
d = datetime.strptime("2018-04-02 23:59:50", '%Y-%m-%d %H:%M:%S')
datetime.strftime(d, '%d-%b-%Y %H:%M:%S') # datetime --> string
# Individual attributes of datetime
d.year, d.month, d.day, d.minute, d.second
# Weekday names are not directly available as an attribute
d.strftime("%A"), d.strftime("%a")
```

Code-explanation^{Code is quite intuitive to understand. In addition, now we have an option for timezone(tzinfo parameter). We will use it later}

We now know the approach to input, format, and print formatted datetime. So, let's learn how to do Arithmetic with datetime.
`timedelta` is the class to create and manage the difference between two datetimes. We can also calculate a future date if the delta is known.
*Instances of the timedelta class represent time intervals with three read-only integer attributes days, seconds, and microseconds.*

Let's check the `timedelta` class with the required code.

```
from datetime import timedelta, datetime
d1 = datetime.strptime("2018-04-02 23:59:50", '%Y-%m-%d %H:%M:%S')
d2 = datetime.strptime("2019-05-03 23:57:12", '%Y-%m-%d %H:%M:%S')
d2 - d1 # >>> datetime.timedelta(days=395, seconds=86242) # this is timedelta object
delta = timedelta(days=395, seconds=86242) # this is timedelta object
delta.days,delta.seconds # Check attributes
# Add the delta to d1
datetime.strftime(d1+delta, '%d-%b-%Y %H:%M:%S') # Same as d2
str(d1+delta) # str() of the resulting datetime object
```

Code-explanation^{As mentioned above, timedelta is expressed with only 3 attributes i.e. days, seconds and microseconds.}^{The difference of two datetime objects is a timedelta object}
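Because only days, seconds and microseconds are stored, any other unit passed to the constructor gets normalized into these three. A short standard-library sketch with made-up values:

```python
from datetime import datetime, timedelta

delta = timedelta(hours=25, minutes=30)
print(delta.days, delta.seconds)       # 1 5400  (25.5 hours = 1 day + 5400 seconds)
print(delta.total_seconds())           # 91800.0

d1 = datetime(2018, 4, 2, 23, 59, 50)
print(d1 + timedelta(weeks=1))         # 2018-04-09 23:59:50
print(d1 - timedelta(days=2) < d1)     # True
```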

`pytz` is a third-party module to handle timezone-related manipulations. Timezone handling can be prone to bugs and issues. Here are the words of wisdom from "*Python in a Nutshell*"

_{The best way to program around the traps and pitfalls of time zones is to always use the UTC time zone internally, converting from other time zones on input, and to other time zones only for display purposes.}

Let's check a quick code snippet to handle timezone with datetime.

```
!pip install pytz
import pytz
# Get the list of all available timezones
pytz.common_timezones #1
# Timezone for a particular country # Use the ISO format of country code
pytz.country_timezones('IN') # >>> ['Asia/Kolkata'] #2
# With pytz, attach the timezone with localize(); passing it as tzinfo= picks up the historical local mean time offset
inp_ny = pytz.timezone('America/New_York').localize(datetime(2021,11,11)) # Return datetime with New York time #3
# use the astimezone method of datetime object
out_ind = inp_ny.astimezone( pytz.timezone('Asia/Kolkata')) #4
```

Code-explanation^{#1 - Fetch the list of all available timezones}^{#2 - Fetch the list of all timezones for a country [India here]}^{#3 - Attach the timezone with the localize method of the pytz timezone}^{#4 - Convert to the desired timezone}

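When we create a datetime without a `tzinfo`, it's a naive `datetime` *i.e. just a datetime without any timezone attached*. When we attach a timezone, the datetime becomes an aware datetime for that timezone. With `pytz`, attach it with the timezone's `localize` method; passing a pytz timezone directly to the `tzinfo` parameter of the constructor uses the zone's historical local mean time offset instead of the current one.

Let's do a small exercise and create two datetimes with the same values but pinned to different timezones. Then calculate the timedelta of the two.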

```
time_1 = pytz.timezone('America/New_York').localize(datetime(2021,11,11))
time_2 = pytz.timezone('Pacific/Auckland').localize(datetime(2021,11,11))
time_1 - time_2 # >>>datetime.timedelta(seconds=64800) | Equivalent to 18 Hours
```
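As a cross-check that does not depend on any timezone database, the same subtraction can be sketched with the standard library's fixed-offset `timezone` class. The offsets below (UTC-5 for New York and UTC+13 for Auckland on 11-Nov-2021) are hard-coded assumptions for this one date, not something the library looks up, so daylight saving changes are not handled.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for 11-Nov-2021: New York at UTC-5, Auckland at UTC+13
new_york = timezone(timedelta(hours=-5))
auckland = timezone(timedelta(hours=13))

time_1 = datetime(2021, 11, 11, tzinfo=new_york)
time_2 = datetime(2021, 11, 11, tzinfo=auckland)

# Converting both to UTC first, as the quote above recommends
print(time_1.astimezone(timezone.utc))  # 2021-11-11 05:00:00+00:00
print(time_2.astimezone(timezone.utc))  # 2021-11-10 11:00:00+00:00
print(time_1 - time_2)                  # 18:00:00
```

Midnight arrives in Auckland 18 hours before it arrives in New York, so the two "same" wall-clock values are 18 hours apart.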

This was all for this post. If you keep these few snippets in mind, datetime will never haunt you. We will continue this post and add on the Pandas library. That post will not just focus on the core of pandas datetime objects but also on Timeseries data.

You may try,

- **The dateutil module** - a third-party package that offers modules to manipulate dates. *[Link]*
- **The calendar module** - supplies calendar-related functions
- **The arrow library** - It offers a sensible and human-friendly approach to creating, manipulating, formatting and converting dates, times and timestamps. *[Link]*

Hello everyone. In this post, we will dive into the string module of Pandas. We will learn that although we can achieve all the required tasks even without the module, using it gives us much-needed programming robustness and elegance.

We will learn,

- Pandas Category vs String
- Different operation with Pandas str module
- Performance comparison with a simple approach

Let's jump to the code

By default, the string data will be of the object type. We may explicitly define the dtype to string.
With an explicit definition in place, we can select all the string columns with `select_dtypes`

```
data_obj = pd.Series(["test"])
data_str = pd.Series(["test"], dtype='string') # Created with string dtype
data_obj.dtype, data_str.dtype
data_obj = pd.DataFrame(data = [{"A":"A1", "B":"B1"},{"A":"A2", "B":"B2"}])
data_obj.select_dtypes('object') # Output both the columns
data_obj.B = data_obj.B.astype('string')
data_obj.select_dtypes('string') # Output only column B
```

Code-explanation^{All parts of the code are quite trivial and self-explanatory.}

If the string column is not a free text column, or when its unique values are limited *i.e. a nominal data type e.g. age*, then using the category `dtype` has a performance benefit.

^{ Excerpt from the official document If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.}

Let's check a code,

```
dataset = pd.read_csv("/content/california-housing-prices.zip") #1
%timeit dataset.ocean_proximity.map(lambda x :x.replace('A', 'B')) #2
dataset.ocean_proximity = dataset.ocean_proximity.astype('category') #3
%timeit dataset.ocean_proximity.map(lambda x :x.replace('A', 'B')) #4
```

Code-explanation^{#1 - We have loaded the famous California dataset which has a text column i.e. ocean_proximity}^{#2 - Got the average execution time on the text col for a simple operation}^{#3 -Changed the dtype of ocean_proximity to category}^{#4 -Got the new execution time for the same operation}

Result^{Output#1 - 100 loops, best of 5: 5.75 ms per loop}^{Output#2 - 1000 loops, best of 5: 374 µs per loop}

We can see that the performance gain is almost **x15 on 20K samples**.
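The reason for the gain can be seen on a small synthetic Series. The values below are made up to mimic a column with few unique values; no timings are claimed here, the sketch only shows that the category version has just three values to work on and still gives the same result.

```python
import pandas as pd

# 20K rows but only 3 unique values
proximity = pd.Series(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"] * 5000)
as_category = proximity.astype("category")

print(len(proximity), len(as_category.cat.categories))   # 20000 3

# The same operation on both versions
replaced_obj = proximity.map(lambda x: x.replace("A", "B"))
replaced_cat = as_category.map(lambda x: x.replace("A", "B"))

print(replaced_obj.tolist() == replaced_cat.tolist())     # True
```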

In this section, we will list the code for multiple very frequently used string operations, so you may keep these code snippets at your fingertips.

To call the string functions on a Pandas column, we use the str attribute.

Let's call the lower, len and strip functions

```
data = pd.DataFrame(data = [{"A":"A1", "B":"B1"},{"A":"A2", "B":"B2"}])
# Lower
data.A.str.lower()
# Chaining of functions
data.A.str.lower().str.len()
# Strip
data.A.str.strip()
```

It has all the methods that are available for Python string. Here is the link for the List

We can use the split function in many ways. The simplest is to split on a delimiter to create a DataFrame on the fly from a single column/series.

```
# Create a dummy data
data = pd.Series(["a_b_c", "c_d_e", "f_g_h"], dtype="string")
# Call the split function
data.str.split("_")
# Select the individual split with get or []
data.str.split("_").str.get(1)
data.str.split("_").str[1]
# You may create a DataFrame using the expand parameter
df = data.str.split("_", expand=True)
df.columns=['fname','mname','lname']
df
```

Code-explanation^{All the parts of the code are quite trivial and self-explanatory, but you must take note of the last snippet. It can shorten multiple lines of code into just one line.}

`replace` is another very useful function. With `str.replace`, you can not only replace a simple string but also use regex to replace, which makes it very powerful

```
# Replace
dataset = pd.read_csv("/content/california-housing-prices.zip")
# Simple replace
dataset.ocean_proximity.str.replace(" ", "_")
# Replace with regex - Mask the vowels
dataset.ocean_proximity.str.replace("[aeiou]", "X", regex=True, case=False) #1
```

Code-explanation^{#1 - Set regex=True to use regex and use the case flag to ignore case}^{All other parts of the code are quite trivial and self-explanatory.}

^{ Important excerpt from the official document Warning: Some caution must be taken when dealing with regular expressions! The current behaviour is to treat single-character patterns as literal strings, even when regex is set to True. This behaviour is deprecated and will be removed in a future version so that the regex keyword is always respected.}
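Here is the same pair of replacements on a tiny made-up Series, so the output can be read at a glance. Passing `regex=False` explicitly for the literal replacement is a defensive choice in this sketch rather than something the dataset example above does.

```python
import pandas as pd

proximity = pd.Series(["NEAR BAY", "INLAND"])

# Literal replacement: regex=False states the intent explicitly
print(proximity.str.replace(" ", "_", regex=False).tolist())
# ['NEAR_BAY', 'INLAND']

# Regex replacement: mask every vowel, ignoring case
print(proximity.str.replace("[aeiou]", "X", regex=True, case=False).tolist())
# ['NXXR BXY', 'XNLXND']
```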

You can simply slice the column just the way we do on a simple string. We can do that either using the slice method or the [ ]

```
dataset = pd.read_csv("/content/california-housing-prices.zip")
# Remove the first and last char
dataset.ocean_proximity.str.slice(1,-1)
dataset.ocean_proximity.str[1:-1]
# Get a char
dataset.ocean_proximity.str[0]
```

The `str` attribute also contains all the necessary functions to get a boolean flag out of a String *e.g. if it is all Alphanumeric or contains other chars*

```
# Startswith, Endswith
dataset.ocean_proximity.str.startswith('B')
dataset.ocean_proximity.str.startswith('Y')
# isalnum(), isalpha(), isspace(), islower(), isupper(), etc.
dataset.ocean_proximity.str.isalnum() # Check if all chars are Alphabet or Num
dataset.ocean_proximity.str.isalpha() # Check if all chars are Alphabet only
dataset.ocean_proximity.str.isspace() # Check if its space only
dataset.ocean_proximity.str.islower() # Check if all chars in Lower case
dataset.ocean_proximity.str.isupper() # Check if all chars in Upper case
```

All these str functionalities are available on a DataFrame index too. Columns are also an index.

It is very common to get a dataset whose column names have spaces and mixed case. We can fix these issues with a simple one-liner

```
# Columns
!kaggle datasets download kumarajarshi/life-expectancy-who
dataset = pd.read_csv("/content/life-expectancy-who.zip")
# Remove left/right spaces, convert to lower case, Underscore instead of mid space
dataset.columns = dataset.columns.str.strip().str.lower().str.replace(" ", '_')
```

Code-explanation^{All other parts of the code is quite trivial and self-explanatory.}

Let's quickly build a DataFrame out of the raw IMDB movie dataset

```
# Data load i.e. Raw data
data = pd.read_csv("/content/review.csv")
data.review = data.review.astype('string')
data.sentiment = data.sentiment.astype('category')
data.review[0]
```

Result

We will convert this text into a DataFrame with a column for each word (token)

```
# Replace HTML tags such as <br /> with a blank
data.review = data.review.str.replace('<.*?>',' ', regex=True)
# Remove non-alphabets
data.review = data.review.str.replace('[^a-zA-Z]',' ', regex=True)
# Convert all to lower case
data.review = data.review.str.lower()
# Replace multiple blanks with a single blank and trim the ends
data.review = data.review.str.replace('[ ]+',' ', regex=True).str.strip()
import itertools
flatten = itertools.chain.from_iterable
# Drop the empty token explicitly; a set has no order, so dropping "the first column" is not reliable
corpus = sorted(set(flatten(data.review.str.split(" ").tolist())) - {''}) #1
len(corpus)
data_feat = pd.DataFrame(np.zeros(shape=(data.shape[0],len(corpus)))) #2
data_feat.columns = corpus
# \b matches whole words only, so "the" is not counted inside "there"
data_feat = data_feat.apply(lambda x: data.review.str.contains(rf"\b{x.name}\b", regex=True), axis=0) #3
data_feat = data_feat.astype(int) #4
```

Code-explanation^{#1 - Get the unique list of all words(Tokens)}^{#2 - Create a blank DataFrame of size (rows,corpus length)}^{#3 - Map each col to True/False depending on whether the word is present in the review}^{#4 - Replace True/False with 1/0}

Result

We have the OneHotEncoded dataset ready to train a model. We need few more steps *e.g. stop_words removal* to make it optimum for any model but those steps are not in the scope of this post.
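For small experiments, pandas also ships a shortcut for this kind of encoding: `Series.str.get_dummies`. The sketch below uses three made-up reviews and is a swapped-in alternative, not the approach used above; it assumes the text is already cleaned and separated by single blanks.

```python
import pandas as pd

reviews = pd.Series(["good movie", "bad movie", "good good plot"])

# One column per token, 1 if the token occurs in the review, else 0
tokens_onehot = reviews.str.get_dummies(sep=" ")

print(tokens_onehot.columns.tolist())   # ['bad', 'good', 'movie', 'plot']
print(tokens_onehot.values.tolist())    # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
```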

This was all for this post. With these tools in your hand, you can make your text related code robust and simple. Try to implement all the functions available in Pandas doc.

The alternate approach to achieve the same results without the `.str` attribute is using `lambda` and `map` i.e. `dataframe.text_col.map(lambda x : x.lower())`

Hello AI Enthusiasts!!!
This is the 5th post in our series on Natural Language processing. In this post, we will understand and learn Language modelling. In a very simple explanation, Language modelling is the task of finding the next best word/character given a sequence of words or characters. Let's check the definition from the book "*Deep Learning with Python by Francois Chollet"*

_{Any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure.}

It means, based on a Corpus e.g. a Wikipedia dump, and given a sequence e.g. "*Natural language*", what is the probability of different words being the next word *i.e. "Processing":0.5, "Engine":0.3, "Speaker":0.2* etc.

We can achieve this in two ways

- Classical Statistics and probability-based approach *i.e. N-gram model*
- Deep Learning-based sequence model

Let's say we have to predict the next word for *"Transfer learning has been recently shown to drastically increase the <...>"*

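What we want is,

$$P(w\ |\ transfer\ learning\ has\ been\ recently\ shown\ to\ drastically\ increase\ the)$$

Which is equal to,

$$\frac{Count(transfer\ learning\ has\ been\ recently\ shown\ to\ drastically\ increase\ the\ w)}{Count(transfer\ learning\ has\ been\ recently\ shown\ to\ drastically\ increase\ the)}$$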

It's quite obvious that even with a very large corpus we can't get many repetitions of such a long sequence, and eventually we will hardly get more than one word to predict.

To solve the above challenge, we will make an assumption that each word is only dependent on the past "N" words. "N" will be the parameter that we can change, and accordingly the model will look further into the past or not.

In the above example, let's assume N = 3, then the next word is only dependent on "*drastically increase the*".

**N-gram**

Let's see the formal definition from Wikipedia.

^{In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The n-grams typically are collected from a text or speech corpus. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. }

e.g. for our small corpus *i.e. "transfer learning has been recently shown to drastically increase the performance"*

_{1-gram = ["transfer", "learning", "has", "been", "recently", "shown", "to", ...]}

_{2-gram = ["transfer learning", "learning has", "has been", "been recently", ...]}

_{3-gram = ["transfer learning has", "learning has been", "has been recently", ...]}

So, now the steps are very simple

- Decide an N for the n-gram. Remember, the smaller the "n", the lesser will be the contextual understanding in the prediction. It will only depend on the last few words.
- Count each unique n-gram and the unique words following it
- Predict the next word as the word having the highest count (*of being the next word*) for the n-gram
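The three steps above can be sketched on a toy sentence with the standard library only. The function names `build_ngram_counts` and `predict_next`, the toy text and the choice of n=2 are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, n):
    """Map each n-gram (tuple of n tokens) to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent follower of the n-gram, or None if it was never seen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


text = "the cat sat on the mat the cat sat on the sofa the cat ran"
tokens = text.split()
counts = build_ngram_counts(tokens, n=2)

print(counts[("the", "cat")])                 # "sat" was seen twice, "ran" once
print(predict_next(counts, ["the", "cat"]))   # sat
print(predict_next(counts, ["a", "dog"]))     # None
```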

**What if** the n-gram combination is new in the test data?

There is a good discussion of this in a beautiful book on NLP i.e. *Speech and Language Processing*. We will follow a very simple approach *i.e. putting a random prediction for such cases.*

**Dataset**

We will use a news corpus dataset. It has small chunks of news description from different publishing houses.

```
# Load the dataset
import re
import pandas as pd
dataframe = pd.read_csv("/content/ag_news_csv/train.csv", names=['publisher','description'])
data = dataframe['description']
# Replace every non-alphabetical char with a space and lowercase everything
regex = re.compile('[^a-zA-Z]')
data = data.map(lambda x: regex.sub(' ', x.lower()))
# Collapse multiple spaces into one, then split into a list of tokens
data = data.map(lambda x: re.sub(' +', ' ', x).split())
```

Code-explanation^{All other parts of the code is quite trivial and self-explanatory.}

```
n_gram_data = {} # Shared dict - format will be {k1:[w1,w2,w1,w3], k2:[...],...}
def make_gram(list_token, n=4):
    for i in range(len(list_token)-n): #1
        key = '_'.join(list_token[i:i+n]) #2
        val = list_token[i+n]
        if key not in n_gram_data: #3
            n_gram_data[key] = [val]
        else:
            n_gram_data[key].append(val)
corpus = []
for id, rows in data.items(): #4 (iteritems() was removed in pandas 2.0)
    make_gram(rows, 3)
    corpus.extend(rows)
```

Code-explanation^{#1 - Loop on the list of all the tokens}^{#2 - Create key for each n token i.e. for each n-gram}^{#3 - Append if key exists or insert new value in the dict}^{#4 - Create the n-gram dict for the news dataset with n=3. Corpus is a copy of all tokens}

```
import numpy as np
string = 'it will not' # Seed n-gram
input = '_'.join(string.split()) # Convert it to a key
length = 50 # Number of words to predict
best_of = 2
corpus_len = len(corpus)
print(string, end=' ') # Print each without new line
while length>0:
    vals = n_gram_data.get(input,'NA') # This is a list
    # Smoothening #1
    if vals=='NA':
        vals = [corpus[np.random.randint(0,corpus_len,1)[0]]]
    # Smoothening Ends
    prob_dict = dict([(elem,vals.count(elem)/len(vals)) for elem in vals]) #2
    pred = sorted(prob_dict, key=prob_dict.get, reverse=True)[:best_of] #3
    top_k = min(len(pred), best_of) #4
    next_word = pred[np.random.randint(0,top_k,1)[0]]
    print(next_word, end=' ')
    input = input.split('_') + [next_word]
    input = '_'.join(input[1:])
    length-=1
```

Code-explanation^{#1 - If a key is not available, pick the next word at random}^{#2 - Calculate the probability of each value (following word) for the current key}^{#3 - Sort prob_dict on its values [not keys]}^{#4 - We don't always want to pick the word having the highest probability, so we pick one at random out of the top K [see the "best_of" parameter]. This adds newness to the generated text.}^{All other parts of the code are quite trivial and self-explanatory.}
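The "probability, then top K, then random pick" step can be seen in isolation with a tiny sketch. The list of followers below is made-up toy data, and the helper `pick_next_word` is a hypothetical name used only for this illustration.

```python
import random
from collections import Counter

def pick_next_word(followers, best_of=2, rng=random):
    """Pick one of the best_of most frequent followers at random."""
    counts = Counter(followers)
    total = sum(counts.values())
    probs = {word: count / total for word, count in counts.items()}
    # Highest probability first, then keep only the top best_of words
    top_k = sorted(probs, key=probs.get, reverse=True)[:best_of]
    return rng.choice(top_k), probs

# Toy data: words that followed one particular n-gram in a corpus
followers = ["be", "be", "impose", "be", "allow", "impose"]
word, probs = pick_next_word(followers, best_of=2, rng=random.Random(0))
print(word, probs)
```

With `best_of=1` this reduces to always picking the most frequent follower, so the parameter is a simple knob between repetitive and more varied text.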

Result^{it will not impose fuel surcharges on domestic and international air fares will rise by and respectively because of the trademarked keyterms that companies bid for in the adwords keyword advertising program lt p gt ottawa reuters nortel networks corp investors predicted tuesday the telecom equipment giant the subject of how miguel angel }

Although the sentence lacks coherent meaning, we have got a decent output considering the simplicity of the method used.

Two key limitations of the previous approach are,

- Need to fix the n-gram in the beginning
- Need of smoothening

With a Deep Learning model,

- We can input a sequence of variable length
- We will get an output in every scenario

We will build our Deep Learning model on chars as tokens, unlike the words used in the previous section. With chars, we will have only *27 tokens i.e. 26 letters and one for space*. We want to avoid embeddings here to keep the explanation focused on a single idea.

It's a simple Recurrent Classification model *i.e. predicting the next char.*

```
import numpy as np
corpus = []
for id, row in data.items(): #1 (iteritems() was removed in pandas 2.0)
    for elem in row:
        corpus.append(' ')
        corpus.extend(list(elem)) # Remember extend vs append
corpus = np.array(corpus)
corpus = corpus.reshape(-1,1)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
corpus = ohe.fit_transform(corpus).toarray() #2
corpus = corpus.astype('float16')
```

Code-explanation^{#1 - Created the corpus of each token (char here)}^{#2 - OneHotEncoded all the chars}^{All other parts of the code are quite trivial and self-explanatory.}

```
x = []; y = [] ; seq_len = 10 #1
total_seq_len = 500000
for i, char in enumerate(corpus[:total_seq_len]): #2
    x.append(corpus[i:i+seq_len]) #3
    y.append(corpus[i+seq_len])
x = np.array(x)
y = np.array(y)
```

Code-explanation^{#1 - Defining the sequence length (seq_len) and the number of sequences to build (total_seq_len)}^{#2 - Using the corpus only up to total_seq_len}^{#3 - Defining x as a sequence of seq_len chars and y as the next char after those seq_len chars}^{All other parts of the code are quite trivial and self-explanatory.}
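To see what the sliding window produces, here is a small sketch on a toy string instead of the one-hot encoded corpus. The string and the variable names `x_toy` and `y_toy` are assumptions made only for this illustration.

```python
import numpy as np

text = "the company said"          # toy stand-in for the char corpus
seq_len = 10
chars = np.array(list(text))

# Each x is a window of seq_len chars, each y is the char right after it
n_windows = len(chars) - seq_len
x_toy = np.array([chars[i:i + seq_len] for i in range(n_windows)])
y_toy = np.array([chars[i + seq_len] for i in range(n_windows)])

print(x_toy.shape, y_toy.shape)
print("".join(x_toy[0]), "->", y_toy[0])
```

In the real code every char is a 27-dimensional one-hot vector, so x gets one extra trailing dimension of size 27.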

```
def data_gen(batch_size): #1
    while True:
        for i in range(x.shape[0]//batch_size): #2
            yield x[i*batch_size:i*batch_size+batch_size], y[i*batch_size:i*batch_size+batch_size]
```

Code-explanation^{#1 - Defined a generator that yields batches of batch_size}^{#2 - Looped over x and yielded x, y of length batch_size}

```
import tensorflow as tf
from tensorflow import keras
dropout = 0.1; neurons = 128
model = keras.models.Sequential([
    keras.layers.Input(shape=[None, x.shape[2]]), # variable-length sequence of one-hot chars
    keras.layers.Bidirectional(keras.layers.LSTM(neurons, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(neurons, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(neurons)),
    keras.layers.Dense(250, activation='relu'),
    keras.layers.Dropout(rate=dropout),
    keras.layers.Dense(27, activation='softmax')
]) #1
optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
batch_size=256
history = model.fit(data_gen(batch_size), epochs=75, steps_per_epoch=total_seq_len//batch_size) #2
```

Code-explanation^{#1 - Defined a simple Recurrent Neural Network}^{#2 - Since we have defined a custom generator, steps_per_epoch is needed, otherwise the 1st epoch will never end}

```
string = 'the company'[:seq_len]
length = 125; i= 0
print(string, end=' ')
while length>0:
    input = ohe.transform(np.array(list(string)).reshape(-1,1)).toarray().reshape(1,len(string),-1)
    vals = model.predict(input)
    argmax = np.argmax(vals)
    vals[:] = 0
    vals[0,argmax] = 1
    char = ohe.inverse_transform(vals)
    print(char[0][0], end='')
    string = string[1:]+char[0][0]
    length-=1
```

Code-explanation^{A trivial piece of code to predict the next char in a running-window manner}

Result^{the compan y s credit picture industry group said on wednesday after the web search engine has slashed the price range to beevised the b}

Our sentences are not very coherent since we didn't use one homogeneous corpus, e.g. a book. It was more of a collection of news pieces. Still, the spelling of the words was very accurate. You can try,

- The same exercise on a different corpus
- A Deep Learning based model with word-level tokens

This was all for this post. In the next post of the NLP series, we will learn and code a Language Translator *i.e. German to English etc.*

Hello AI Enthusiast!!!

In this post we will learn to code to extract the embedding of an image and use that embedding to search the matching images for a particular test image.

Embeddings are an equivalent representation of specific data in new feature dimensions. The new dimensions may not be comprehensible to humans, but they work great at separating distinct images and grouping similar images together.

You may check our previous blogs on Embeddings i.e. *Here* and *Here*. Though these blogs were written for word embeddings, the concept remains the same for image/visual embeddings too.

In very simple terms, we are grouping similar images together.

A similar idea was discussed in the Deep Learning course by Stanford *Here*. Below is an image from the course. Images were plotted using their embeddings.

We will not plot the images as done in the above course, but we will use a similar approach to find the best matching "N" images.

How can we achieve this task?

**Simply getting the euclidean distance between pixels**

This approach will not work because we need something that is invariant to translation and, if possible, to rotation too. A pixel-to-pixel comparison expects the same pixel to represent a similar feature in each image. Another reason is that a raw image has a very high dimensionality, which hurts the Euclidean distance.

**Our approach**

This is a standard approach for such a task. The steps go as,

- Get the feature vector of all the training images using a pre-trained CNN model
- These are the visual embedding for the images
- Get the visual embedding of our search image
- Find the nearest "N" images from the training dataset
- These are the search results
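The search step of the list above can be sketched with NumPy alone. The embeddings here are random toy vectors (an assumption made for illustration), not real CNN features, but the nearest-"N" logic is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the visual embeddings of 100 training images (8 dims each)
train_embeddings = rng.normal(size=(100, 8))

# Toy search image: a slightly perturbed copy of training image 42
query = train_embeddings[42] + 0.01 * rng.normal(size=8)

# Euclidean distance from the query to every training embedding
distances = np.linalg.norm(train_embeddings - query, axis=1)

# Indices of the 5 nearest training images, closest first
nearest = np.argsort(distances)[:5]
print(nearest)
```

Because the query is almost identical to image 42, that index comes out first. With real images, the pre-trained CNN is what makes "close in embedding space" mean "visually similar".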

**Download the dataset**

We will use Caltech_101 dataset. Below is the description for the dataset from the official website Link.

^{Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc 'Aurelio Ranzato. The size of each image is roughly 300 x 200 pixels}

```
%%bash
# One time code
tar_path='/content/<<full_path>>/101_ObjectCategories.tar.gz'
tar -xvzf "$tar_path" -C '/content'
ls -l '/content/101_ObjectCategories' | head -5
```

Code-explanation^{%%bash - We are doing all our code in Google Colab. It also supports magic cell concept. With %%bash, we can use Linux command directly in the Colab cell.}^{All other parts of the code is quite trivial and self-explanatory.}

```
import shutil, os, sys
path = "/content/101_ObjectCategories"
dest = "/content/all_images"
os.makedirs(dest, exist_ok=True) # shutil.move needs the target dir to exist
for root, dirs, files in os.walk(path): #1
    for dir in dirs:
        for filename in os.listdir(root+"/"+dir):
            shutil.move(root+"/"+dir+"/"+filename, dest+"/"+dir+"_"+filename)
all_images = ["/content/all_images/"+elem for elem in os.listdir(dest)] #2
```

Code-explanation^{#1 - The images are placed in 101 separate sub-dirs. We are looping on each and placing all the images in a common dir i.e. dest}^{#2 - This code is saving the full path of all the images. We will need this later}^{All other parts of the code is quite trivial and self-explanatory.}

```
# Load as numpy array
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
dataset = np.empty(shape=(len(all_images),128,128,3)) #1
for i,path in enumerate(all_images): #2
    img = load_img(path, target_size=(128,128))
    img_arr = img_to_array(img)
    dataset[i] = img_arr
np.random.shuffle(dataset) #3
dataset.shape
```

Code-explanation^{#1 - Creating a blank numpy array with one slot per image}^{#2 - Loading all the images and saving them as numpy arrays}^{#3 - Since we read the files sequentially from each category folder, they are not shuffled by default. So we shuffled them explicitly.}

```
# View some images
import matplotlib.pyplot as plt
rand_id = np.random.randint(0,len(all_images),5)
_,ax = plt.subplots(1,5,figsize=(15,4))
for i,id in enumerate(rand_id):
    img = dataset[id]/255.
    ax[i].imshow(img)
# Keep the first 8000 images for training and the rest for search
x_train = dataset[:8000]
x_test = dataset[8000:]
```

Code-explanation^{We have randomly picked 5 images and displayed it}

Result

```
# Build the Feature extractor model
from tensorflow import keras
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
input = keras.layers.Input(shape=(128,128,3))
model = keras.layers.Lambda(lambda x: preprocess_input(x))(input) #1
model = ResNet50(weights='imagenet', include_top=False)(model) #2
model = keras.Model(inputs=input, outputs=model)
model.summary()
```

Code-explanation^{#1 - A lambda layer to pass each image through ResNet50 pre-processing function}^{#2 - Loading the ResNet50 without the top Dense layers. We only need the features. So, top is excluded}

```
# Extract the embeddings
train_embedding = model.predict(x_train) #1
train_embedding = np.average(train_embedding, axis=(1,2)) #2
# Pick a random test image #3
rand_id = np.random.randint(0,x_test.shape[0],1)
data = x_test[rand_id]
embedding = np.average(model.predict(data), axis=(1,2))
plt.imshow(data[0]/255.)
```

Code-explanation^{#1 - We got the Features for all images with the predict function of the model}^{#2 - Each feature has a 4x4 shape. We average these 16 values for each features}^{#3 - Pick a random test image(Our search image). Calculate its embedding. Display the image}
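The averaging in #2 is what is usually called global average pooling. A minimal sketch on a tiny toy array (2 images, 4x4 spatial grid, 3 feature maps instead of 2048, all assumed for illustration) shows how the shape changes.

```python
import numpy as np

# Toy feature maps: (images, height, width, channels) = (2, 4, 4, 3)
feature_maps = np.arange(2 * 4 * 4 * 3, dtype="float32").reshape(2, 4, 4, 3)

# Average the 4x4 = 16 spatial values of every feature map
embeddings = feature_maps.mean(axis=(1, 2))

print(feature_maps.shape, "->", embeddings.shape)
print(embeddings[0])
```

Each image ends up as one vector with one value per feature map, which is why the real embedding has 2048 dimensions.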

Result

```
# Calculate the Euclidean distance for the test embedding from all the train embeddings
euclidean_distance = np.array([np.linalg.norm(embedding[0]-elem) for elem in train_embedding]) #1
# Get the nearest 10, these are our search results
top_10 = euclidean_distance.argsort()[:10] #2
search = x_train[top_10] #3
# View search results #4
_,ax = plt.subplots(2,5,figsize=(15,8))
for i,arr in enumerate(search):
    img = arr/255.
    ax[i//5][i%5].imshow(img)
```

Code-explanation^{#1 - Calculate the Euclidean distance of the search image with all the train image}^{#2 - Get the index of 10 lowest values i.e. nearest images. We will use this index to get the image from train dataset}^{#3 - Get the matching images from the train set using the above index}^{#4 - Display the matching images}

Result

**Search Image**
**Search results**

We observed that very simple code was able to identify very close matches, *esp. in the second example where faces of the same person were identified.*

We must give due credit to the power of the pre-trained model. *The ResNet50 pre-trained model was the backbone of our code.*

*Another aspect is the Euclidean distance*. Be mindful that we need to calculate the distance of the test image from all the train images. To speed this up, we may define the centres of different clusters among the train images, quickly find the nearest centre for the test image and pick the results from that cluster only.
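A rough sketch of the cluster-centre idea is below. It uses synthetic 2-D points and assumes the cluster label of every train embedding is already known (e.g. from a clustering algorithm such as k-means). None of this is part of the code above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic train embeddings: 3 well-separated clusters of 50 points each
true_centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
labels = np.repeat(np.arange(3), 50)          # assumed to be known already
train = true_centres[labels] + rng.normal(scale=0.5, size=(150, 2))

# Step 1: one centre per cluster (the mean of its members)
centres = np.stack([train[labels == k].mean(axis=0) for k in range(3)])

# Step 2: find the centre nearest to the query (3 distances instead of 150)
query = np.array([9.5, 10.2])
nearest_cluster = np.argmin(np.linalg.norm(centres - query, axis=1))

# Step 3: search only inside that cluster
candidates = np.flatnonzero(labels == nearest_cluster)
dist = np.linalg.norm(train[candidates] - query, axis=1)
top_5 = candidates[np.argsort(dist)[:5]]
print(nearest_cluster, top_5)
```

The trade-off is that a true nearest neighbour sitting just across a cluster boundary can be missed, so this is an approximate search.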

Taking it forward,

- You can try this on another dataset e.g. ImageNet
- Try reducing the dimension of the extracted features using PCA and see the results. *The current dimension is 2048, which is the feature map count of ResNet50.*
- Try adding a fancy UI to send an image and display the responses. This can be a simple demonstration project.
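For the PCA suggestion, a minimal NumPy-only sketch could look like the following. The helper `pca_reduce` is a hypothetical name, and the random 64-dimensional vectors only stand in for the real 2048-dimensional embeddings.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project embeddings onto their top principal components (via SVD)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, strongest first
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(200, 64))   # toy stand-in for train embeddings

reduced, mean, components = pca_reduce(toy_embeddings, 16)

# A search image must be projected with the SAME mean and components
query_reduced = (toy_embeddings[:1] - mean) @ components.T
print(reduced.shape, query_reduced.shape)
```

The Euclidean search then runs on the reduced vectors, which is cheaper and sometimes less noisy, at the cost of discarding some information.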

Hello AI enthusiasts, we are here with another post. This post is focused on a few of the best books available on AI/ML. In recent times, books have taken a bit of a back seat and video-based education has taken centre stage, but books have their own relevance.

Books need your consistent attention while reading. That might not be required in video-based learning, which sometimes creates an illusion of completion.

Ranking the best books is always a subjective matter, so we will avoid it except for highlighting the best book *i.e. Rank#1*. All the other books will be discussed without attributing any specific rank.

We will discuss the following points for each,

- Does the book cover Machine Learning, Deep Learning or both
- Does the book cover just the theory or both theory and code
- What are the missing topics
- Salient features *i.e. Beginner/Advanced etc*.

So, let's begin.

** Author **: Aurélien Géron,

This is the best book on ML/DL at this particular moment. I will humbly clarify, with due respect to all the other books, that it is the best when we look at all the aspects of a technical book.

It covers both topics, i.e. classic ML and Deep Learning.

All the important ML algorithms have been explained.

Let's see a few excerpts,

- Very balanced coverage of classic ML and Deep Learning
- A lot of focus on TensorFlow. If you are looking to work with TensorFlow, then it can be your go-to book
- Lots of programming examples covering different scenarios
- No chapters on *Feature engineering, Recommendation systems*
- Needs basic Python and a little idea of ML. It can't be your very first book if you are coming from any other area *e.g. web programming*

** Author **: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani,

This book should be your very first book if you have just started your ML journey.

This book was published after "

^{One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details.}

**Coverage**

The book covers only the ML portion and not Deep Learning. It should be obvious from the name itself. Programming examples are there but in R, *i.e. not in Python*, which is the de-facto programming language of the ML community. Still, R has a decent following.

The book has 10 chapters and is around 400 pages long. It starts with a very basic explanation of Machine Learning and moves on to explain all the ML areas *i.e. Regression, Classification, Ensembling, Unsupervised.*

In comparison to the previous book in our list, this book has a statistical touch, while Hands-On-ML is focused on Machine Learning and quick implementation with Scikit-Learn/TensorFlow *e.g. performance limitations are discussed in line with each model, but aspects like the assumptions of LinearRegression are not discussed.*

**Salient features -**

- Starts from the very basics and uses very intuitive examples
- You can get a free PDF copy from the official site
- Not much help if you are looking for programming examples
- Quick and smaller coverage of Ensembling models
- Many intuitive examples across the book *e.g. Linear vs Tree*

** Author **: Trevor Hastie, Robert Tibshirani and Jerome Friedman,

Two of the authors also appear in the author list of the previous book, and this book is from the same group.

You may even skip this book if you do not intend to dive deep into the working of every model.

The book has 18 chapters spread across 750 pages. The book covers only the ML portion and not the Deep Learning portion. It should be obvious from the name itself. It is a purely theoretical book, so it offers no help in programming. On the brighter side, the book thoroughly covers almost every portion of classic ML.

**Salient features -**

- In-depth coverage of every ML topic
- You can get a free PDF copy from the official site
- Ensembling techniques get special attention, spread across *~4 chapters*
- A few topics that are missing in most books are covered *e.g. Kernel Smoothing Methods*
- Mathematics focussed and can help you answer many "Why"s of ML e.g.

** Author **: Ian Goodfellow, Yoshua Bengio and Aaron Courville,

This is the best book on the subject of Deep Learning if we ignore the need for programming examples. It has a few initial chapters on the ideas of classic ML, but primarily the book is dedicated to Deep Learning.

You may even skip this book if you do not intend to dive deep into the working of Neural Networks and Deep Learning. If you want to gain that knowledge, this is the book for you. This book is one document equivalent to multiple Deep Learning papers. You may call this book the Deep Learning alternative of "

The book has 20 chapters spread across 700 pages. The book covers only the Deep Learning portion and not classic Machine Learning. It should be obvious from the name itself. It is a purely theoretical book, so it offers no help in programming. On the brighter side, the book covers almost every aspect of Deep Learning meticulously.

Since Deep Learning is an evolving field, a few of the latest concepts are not covered.

**Salient features -**

- Simple and not very Mathematics heavy
- You can read it free (online) on the official page
- Very thorough discussion on "*Optimization techniques*" i.e. the pillar of Neural Networks
- A full chapter on "*Practical Methodology*" and "*Troubleshooting*". This is one of the most sought-after portions of Deep Learning
- Missing key updates of the last 4-5 years

** Author **: Max Kuhn and Kjell Johnson,

The previously listed books were directly on the subject of ML or DL models. But Feature engineering is a very important subject in classic ML and it is ignored by most books. In fact, it is practically impossible for one book to cover it along with the models. So it deserved a dedicated book and this is that book.

You must read this book even if you have a decent understanding of Machine Learning. It will not just teach you Feature Engineering but also help you understand many concepts

The book has 12 chapters spread across 300 pages. The book has all the relevant topics of Feature engineering

A full chapter is dedicated to the Interaction Effect, which is an important part of Interpretable ML and hence not covered in detail in any other book.

**Salient features -**

- Overall elegant writing and visualization
- Exploratory Data Analysis is explained in a beginner-friendly manner
- You can read it free (online) on the official page
- Every "Why" has been properly explained *e.g. Why we don't see full OHE with LinearRegression*
- An excerpt explaining Interaction Effect,
- Only Tabular data is covered *i.e. no discussion on Text/Image/Time-Series etc.*
- The datasets and R code are available in the GitHub repository

** Author **: Christoph Molnar,

All the above-listed books try to explain the concepts and/or programming of Machine Learning and Deep Learning. As we know, Machine Learning itself became a black box when we started working with large datasets in recent times.

The situation worsens with Deep Learning. Lots of research is going on in this field and currently, this book is one of the best to teach you the concepts of Interpretability.

You may skip this book if you are not interested in learning about the topic. Before that, you should look at the other positive takeaways of this book

The book has 8 chapters, starting with the importance of Interpretability in ML, then moving to ML models and then into Deep Learning. The Deep Learning aspect is not very comprehensive, though. You have to rely on research papers for that.

**Salient features -**

- Starts with the very basics i.e. beginner friendly
- A full chapter on "Interpretable Models" *i.e. LogisticRegression/DecisionTree etc.*
- You can read it free (online) on the official page
- Covers Feature Importance, SHAP, PDP etc.
- The author has used R for the plots/analysis

This was all for this post. I know many other good books are missing here. This list was based on a goal to provide complete coverage of ML/DL in terms of balancing all the different aspects *i.e. Theory/Code, ML/DL, Basic/Expert*

For a comprehensive list of other books/posts, please check our Blog ML in 3 Months

With that, you will have an idea of all the good books in this area. Still, a few books/blogs remain in the area of specialized topics in Deep Learning *i.e. Computer Vision and Natural Language Processing.*

Hello friends, in this post we will continue the last topic i.e. Word Embeddings. Here we will rewrite our Sentiment Analysis code with pre-trained word embeddings. You can check the last code Here. In that post, we learnt to program our own embedding using Backpropagation.

In this post we will train with,

- Word2Vec embedding
- GloVe Embedding
- FastText Embedding

Let's start with GloVe first.

We will use the code of Post#02 of this series(Check Here) and take that forward. Below is a summarized depiction of our previous approach and the current approach,

We will not rewrite the code of the previous post. So please check that out first.

Let's code the Pre-trained embedding part,

```
!wget http://nlp.stanford.edu/data/glove.6B.zip #1
file_name = "/content/glove.6B.zip" #2
from zipfile import ZipFile #3
with ZipFile(file_name, 'r') as zip:
    zip.extractall()
```

Code-explanation^{#1 - This is the URL for the GloVe embedding file}^{#2 - Local path of the downloaded zip file}^{#3 -Unzipped the file}

`! cat "/content/glove.6B.50d.txt" | head -1000 | tail -1`

Result^{attention -0.084238 0.53053 0.12816 -0.28886 -0.18125 -0.205 -0.096976 0.070474.....}

Code-explanation^{We have viewed one line of the file to see its structure. It's a simple space-separated value of all the dimension. The "Word" is the first token of each line}

Let's read the file line by line and create a dictionary mapping for each word to its embeddings.

```
import numpy as np
word_to_embeddings_map = {}
file_path = "/content/glove.6B.100d.txt"
with open(file_path, encoding="utf-8") as file: #1
    for line in file:
        values = line.split()
        word = values[0] #2
        embed = np.array(values[1:], dtype='float32') #3
        word_to_embeddings_map[word] = embed
```

Code-explanation^{#1 - Open the file and read it in for-loop}^{#2 - Split it on "space". First token is the "word" rest all is the embeddings}^{#3 -Convert the embedding into a numpy array and assign it to the dictionary}

Now we will get all the words of IMDB movie review dataset and then map its word to GloVe embedding using the above dictionary.

```
embedding_dim = 100
word_index = keras.datasets.imdb.get_word_index() #1
# vocab (the vocabulary size) is assumed to come from the previous post's code
embedding_matrix = np.zeros((vocab+1, embedding_dim)) #2
for word, i in word_index.items():
    embedding_vector = word_to_embeddings_map.get(word) #3
    if i <= vocab and embedding_vector is not None: # skip words outside our vocab
        embedding_matrix[i] = embedding_vector
```

Code-explanation^{#1 - Get the word to index map of the IMDB review dataset}^{#2 - Create a matrix sized by our vocab and the embedding dimension}^{#3 - Loop on each word of the IMDB index, get its GloVe mapping and place it in the above matrix}
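The same lookup can be traced end to end on toy data. Everything below (the 3-dimensional vectors, the 4-word index and the vocab size of 3) is made up purely for illustration.

```python
import numpy as np

# Toy "pre-trained" embeddings with 3 dimensions
word_to_embeddings_map = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
    "rare": np.array([0.9, 0.9, 0.9], dtype="float32"),
}
# Toy word index: "zzz" has no pre-trained vector, "rare" is outside our vocab
word_index = {"good": 1, "bad": 2, "zzz": 3, "rare": 4}
vocab = 3
embedding_dim = 3

# Row 0 is kept for padding; rows without a pre-trained vector stay all zeros
embedding_matrix = np.zeros((vocab + 1, embedding_dim), dtype="float32")
for word, i in word_index.items():
    embedding_vector = word_to_embeddings_map.get(word)
    if i <= vocab and embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(embedding_matrix)
```

Row `i` of the matrix is the vector the Embedding layer returns for token id `i`, so the row order must match the ids used when the reviews were encoded.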

Now the last step becomes very simple. Use the Keras word embedding layer and replace its weight with the above embedding and make it non-trainable.

```
dropout = 0.5
neurons = 128
model = keras.models.Sequential([
    keras.layers.Embedding(embedding_matrix.shape[0], embedding_dim, input_shape=[None],mask_zero=True),
    keras.layers.Bidirectional(keras.layers.GRU(neurons, return_sequences=True, dropout=dropout)),
    keras.layers.Bidirectional(keras.layers.GRU(neurons, return_sequences=True, dropout=dropout)),
    keras.layers.Bidirectional(keras.layers.GRU(neurons, dropout=dropout)),
    keras.layers.Dense(250 , activation='relu'),
    keras.layers.Dropout(rate=dropout),
    keras.layers.Dense(1, activation="sigmoid")
])
model.layers[0].set_weights([embedding_matrix]) #1
model.layers[0].trainable = False #2
optimizer = keras.optimizers.Adam(learning_rate=0.002)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(x_train_ohe,y_train, epochs=15, validation_data=(x_test_ohe,y_test), batch_size=256)
```

Code-explanation^{#1 - Set the weights with the word embeddings}^{#2 - Freeze the Layer to make it non-trainable}

Now we can simply repeat the above steps for any other embedding. So, we will not place the full code for word2vec. We will only add the code to load the embedding and build the matrix.

```
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")
embedding_dim = 250
word_index = keras.datasets.imdb.get_word_index()
embedding_matrix = np.zeros((vocab+1, embedding_dim))
for word, i in word_index.items():
    if i <= vocab: # skip words outside our vocab
        embedding_matrix[i] = embed([word]).numpy()[0]
```

Code-explanation^{All other parts of the code is quite trivial and self-explanatory.}

Now everything should be quite simple. You can repeat the same for fastText too. Below is the downloadable URL. Fasttext

You might have observed that the test accuracy is around 80%. Since pre-trained embeddings are trained on a large and generic corpus, they will not necessarily fit our dataset exactly.

But we can fine-tune the pre-trained embedding with a very small *Learning rate*, just like we do with a pre-trained CNN model.

This will improve the test accuracy to ~85% very easily.

```
model.layers[0].trainable = True
optimizer = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(x_train_ohe,y_train, epochs=5, validation_data=(x_test_ohe,y_test), batch_size=256)
```

This was all we have for applying any pre-trained word embedding to your tokens.

You can extend it to -

- Any other dataset *e.g. Fakenews dataset.*
- Try bigger dimensions of the pre-trained models

The dimension of the embedding is like any other hyper-parameter i.e. start with the smallest and try a bigger one till you are satisfied with the computation/score trade-off.

In the next part of the series, we will understand and code "Language Modelling".

Hello AI Enthusiasts, in this post we will understand the basics of the GPU. Then we will discuss the libraries that we can use for classical ML algorithms e.g. LinearRegression. Why classical only? Because Deep Learning libraries e.g. TensorFlow support GPUs out of the box. The traditional ML library, i.e. scikit-learn, doesn't have GPU support, nor is it in the pipeline. Remember this thread from Twitter,

Before checking the Library, we will try to understand the need and working of the GPU.

Before understanding the GPU, let's quickly understand the CPU in a few lines.

- The CPU is a general-purpose processor i.e. it can run a Game, a Browser, Video etc.
- It contains one ALU [Arithmetic Logic Unit] per core, which does all the calculations
- The control unit instructs the ALU/Memory sequentially (per core)
- It is not designed for any specific type of task

Since it is so general-purpose, it has no idea what the next instruction can be. So it reads the instruction every time from the program. This is the point that is traded off in the GPU and even further in the TPU.

The GPU,

- Contains 2000-5000 ALUs to do the calculations in parallel
- But its communication with the CPU is expensive *i.e. adds to latency*

So, the bigger the calculation we send to the GPU, the better the Throughput/Latency ratio.
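A back-of-the-envelope model makes this trade-off visible. All the numbers below (a fixed 1 ms transfer overhead and the two operations-per-second rates) are invented purely for illustration and are not measurements of any real hardware.

```python
def cpu_time(n_ops, cpu_ops_per_s=1e10):
    """Hypothetical CPU: no transfer cost, lower throughput."""
    return n_ops / cpu_ops_per_s

def gpu_time(n_ops, transfer_overhead_s=1e-3, gpu_ops_per_s=1e12):
    """Hypothetical GPU: fixed CPU<->GPU transfer cost, higher throughput."""
    return transfer_overhead_s + n_ops / gpu_ops_per_s

for n_ops in (1e4, 1e7, 1e10):
    print(f"{n_ops:.0e} ops  CPU {cpu_time(n_ops):.6f}s  GPU {gpu_time(n_ops):.6f}s")
```

Under these assumed numbers, the small job is faster on the CPU because the fixed transfer cost dominates, while the large job is far faster on the GPU because the overhead is spread over many operations.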

It gives a huge improvement over the CPU for Neural Networks by computing millions of Tensor multiplications/additions in parallel.

So, in summary, a CPU can do relatively fewer calculations in parallel (throughput) but it can handle complex control logic *i.e. branching/switching.*

On the other hand, a GPU can do a massive amount of calculation in parallel but it is not good at handling complex logic. Just like the CPU, the GPU too has its own memory. This helps the GPU to improve latency (*i.e. fewer calls to the CPU for data*).

But a point worth mentioning is that someone needs to think of the logic to parallelize the work to utilize the GPU cores. In the case of Deep Learning, it is a bit simpler because it is mainly a task of Tensor multiplication and addition.

Unlike Deep Learning, in classic Machine Learning every algorithm has a different technique to fit the data and find the solution. So, in this case, we need case-specific code/logic to enable parallel operations and utilize the GPU.

For example, in Gradient Boosting one of the tasks that is parallelized is *sorting each feature while finding the best splits.*

That's why we will use the available modules which support GPUs for ML.

NVIDIA doesn't need an introduction. It has become the leader in the GPU space by tapping the market in time to reap the benefits of the rise of Deep Learning.

CUDA was created by NVIDIA as a general-purpose parallel computing platform and programming model for its GPUs.

In simple terms, it lets us interact with the GPU cores and utilize them to the max.

RAPIDS AI, an excerpt from the official docs

In simple terms, RAPIDS builds a simple programming wrapper to utilize CUDA. Check this depiction from the official page.

Since we are interested in Python, let's discuss that. Again, I will put an excerpt from the official page:

> The RAPIDS suite of open-source software libraries and APIs give you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

So, cuDF and cuML are the GPU counterparts of Pandas and Scikit-Learn. The best part is that their APIs are designed to closely mirror the two. So, cuDF and cuML have almost no learning curve if you have used *Pandas and Scikit-Learn.*

Let's jump to the code.

We will use the Fashion-MNIST dataset. It is a toy dataset for Deep Learning but adds decent complexity for simple ML algorithms. Its training set has 60K low-resolution images of fashion apparel.

Each instance is 28x28 pixels, i.e. 784 dimensions, which is a lot for a simple ML algorithm on a CPU.

```
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# Flatten each 28x28 image to 784 values and scale to the range [0, 1]
x_train = (x_train.reshape(-1, 784) / 255.0).astype('float32')
x_test = (x_test.reshape(-1, 784) / 255.0).astype('float32')
# Show 5 random training images
_, ax = plt.subplots(1, 5, figsize=(15, 3))
indices = np.random.randint(0, x_train.shape[0], size=5)
for i, idx in enumerate(indices):
    ax[i].imshow(x_train[idx].reshape(28, 28))
```

Code-explanation^{All other parts of the code are quite trivial and self-explanatory.}

Result

Let's train a LogisticRegression with RAPIDS cuML and Scikit-Learn.

```
import numpy as np
# cuML (GPU)
from cuml.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', C=1)
%timeit -n 1 model.fit(x_train, y_train)
%timeit -n 1 model.predict(x_test)
# %timeit discards variables assigned inside the timed statement, so predict again to keep y_pred
y_pred = model.predict(x_test)
acc = np.sum(y_test == y_pred) / len(y_test)
# Scikit-Learn (CPU)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', C=1, solver='liblinear')
%timeit -n 1 model.fit(x_train, y_train)
%timeit -n 1 model.predict(x_test)
y_pred = model.predict(x_test)
acc = np.sum(y_test == y_pred) / len(y_test)
```

Code-explanation^{All other parts of the code are quite trivial and self-explanatory.}

Result

| Model | Fit time | Predict time | Accuracy |
| --- | --- | --- | --- |
| Scikit-Learn | 914s | 51.9ms | 84.2% |
| cuML | 3.61s | 41.5ms | 84.52% |

Look at the difference in the fit time, i.e. cuML is roughly 250 times faster. Put another way, you don't have to wait almost 15 minutes for the training to complete. You can try multiple hyper-parameters in that length of time with cuML.
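The arithmetic behind these statements is easy to check from the fit times in the table. The variable names below are only illustrative.

```python
# Fit times taken from the table above (in seconds)
sklearn_fit_s = 914.0
cuml_fit_s = 3.61

speedup = sklearn_fit_s / cuml_fit_s         # how many times faster cuML was
sklearn_minutes = sklearn_fit_s / 60         # Scikit-Learn fit time in minutes
cuml_runs = int(sklearn_fit_s // cuml_fit_s) # cuML fits that finish in the same time

print(f"speed-up: {speedup:.0f}x")
print(f"Scikit-Learn fit time: {sklearn_minutes:.1f} minutes")
print(f"cuML fits in that time: {cuml_runs}")
```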

We have only shown the code for one algorithm. You may follow the same pattern for other algorithms, e.g. SVM, KNN etc.

Check the official document and follow. [*Link*]

Here is the summary of other algorithms. Instead of the total time, we have mentioned the ratio of the time taken by Scikit-Learn to the time taken by cuML, i.e. how many times faster cuML was.

| Model | Fit ratio | Predict ratio |
| --- | --- | --- |
| SVC | 16 | 42 |
| RF^{#1} | 14 | 13 |
| KNN | 235 | 2500^{#2} |
| TSNE | TOO LARGE^{#3} | NA |
| LR | 253 | 1 |

Result-explanation^{#1 - Depth was kept at max_depth=10}^{#2 - In case of KNN, predict time is relevant since fit doesn't do anything}^{#3 - It was more than 2 hours for Scikit-Learn and just 2.36s for cuML}

RandomForest was trained only with `max_depth=10` because the model may run out of memory during training if very deep trees are built. This happens because each tree resides in GPU memory until training is completed, and deep trees need more memory.

Here is an excerpt from the official docs,

> Very deep / very wide models may exhaust available GPU memory. Future versions of cuML will provide an alternative algorithm to reduce memory consumption.
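A rough way to see why depth matters: the number of nodes a binary tree can hold grows exponentially with its depth. The sketch below only computes that upper bound (a full binary tree); it says nothing about how cuML actually lays out trees in memory, and the function name is made up for illustration.

```python
def max_nodes(depth):
    """Upper bound on nodes in a binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1

for depth in (5, 10, 20):
    print(f"max_depth={depth}: up to {max_nodes(depth):,} nodes per tree")
```

Going from `max_depth=10` to `max_depth=20` raises the bound from about 2 thousand to about 2 million nodes per tree, and a forest holds many such trees.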

Gradient Boosting is another very famous algorithm, mainly due to its proven record on Kaggle, and much of that is thanks to XGBoost (*one of the State of the Art implementations*).

Another SoTA implementation is LightGBM (by Microsoft). It has recently gained a lot of traction.

The point I want to add here is that both of these implementations support GPUs. That may be the reason cuML doesn't have a GradientBoosting implementation (*I am not sure of the reason*).

Scikit-Learn's HistGradientBoosting failed after consuming all the RAM(25GB). Below is the result of XGBoost.

| Model | Fit time | Predict time | Accuracy |
| --- | --- | --- | --- |
| XGBoost | 32.2s | 255ms | 77.2% |

This was all for this blog. Now you are equipped with another handy weapon to quickly try things on a larger Dataset on complex models e.g. TSNE. The best part is that we have decent GPU availability on Google Colab.

You can take this forward and try different models on even larger Datasets.

Check this list for different datasets. *Benchmarking nearest neighbours*

One of the downsides is that RAPIDS is not natively supported in Colab and it takes almost 20 minutes to do all the setup.

Follow this Notebook to set it up in Google Colab. Link

Hello AI Enthusiasts, in this post we will learn how to apply a 3-D CNN with Keras. For the dataset, we will use the 3-D MNIST dataset available at Kaggle. Link.

Before that let's try to understand the working of 3-D CNN,

^{Image Credit - Arxiv paper - 3D-CNN for heterogeneous material homogenization}

In 3D-CNN, we simply switch from 2-D to 3-D. So, now our kernel is 3D, pooling is 3-D and the resulting FeatureMaps are also 3-D.

Another change is in the convolution steps: the filter convolves over the input in 3-D space instead of just a 2-D plane. The first step of the 3-D convolution is depicted in the above image. You can extrapolate it in a 3-D space.

In a 2-D CNN, each FeatureMap is 2-D and stacking the channels (*i.e. the result of each Kernel*) gives a 3-D output. In a 3-D CNN, each individual FeatureMap is 3-D, so stacking the channels gives a 4-D output.
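As a rough illustration of the kernel sliding through a volume, here is a naive pure-Python 3-D convolution with no padding and a stride of 1. The function name and the toy input are assumptions for this sketch; real layers such as Keras `Conv3D` are far more optimised and also handle channels, padding and strides.

```python
def conv3d_valid(volume, kernel):
    """Slide a cubic kernel over a 3-D volume (no padding, stride 1).

    Like deep learning frameworks, this computes a cross-correlation,
    i.e. the kernel is not flipped.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k = len(kernel)
    out = []
    for z in range(depth - k + 1):
        plane = []
        for y in range(height - k + 1):
            row = []
            for x in range(width - k + 1):
                acc = 0.0
                # Multiply-and-add over the k x k x k neighbourhood
                for dz in range(k):
                    for dy in range(k):
                        for dx in range(k):
                            acc += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy input: a 4x4x4 volume of ones and a 3x3x3 kernel of ones
volume = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]

feature_map = conv3d_valid(volume, kernel)
print(len(feature_map), len(feature_map[0]), len(feature_map[0][0]))  # shape of the 3-D FeatureMap
print(feature_map[0][0][0])
```

One kernel produces one 3-D FeatureMap; applying several kernels and stacking their outputs gives the 4-D result described above.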

The immediate question that may arise is what's the purpose of 3-D CNN and how it will be better than a 2-D CNN.

A simple answer is, just like the way a 2-D CNN facilitates spatial invariance in a 2-D plane, a 3-D CNN will facilitate spatial invariance in a 3-D space.

Let's check these MNIST digits,
If we observe the highlighted digit (2nd), it resembles a "2" even with a 2-D slice from the front, but this 2-D slice will not work if the image is rotated in a 3-D space (*see other images*). What we need here is to capture the 3-D features.

This was just one use-case, where we added another dimension to a 2-D image and made it 3-D. But there is another use-case of 3-D CNN, *i.e. adding a temporal dimension to a 2-D frame (a video)*.

Check this image,
^{Image Credit - Deep Learning for Computer Vision, University of Michigan}

In the above video, we need to convolve across the time frames to extract the Feature maps which can distinguish "*Running*" from "*Jumping*" by appropriately registering the movement of legs and hands across frames.

We will now jump to the code to implement a 3-D CNN using Keras on MNIST digit dataset.

```
import h5py
from zipfile import ZipFile

path = "/content/......./full_dataset_vectors.h5.zip"
# Extract the archive downloaded from Kaggle
with ZipFile(path, 'r') as zf:
    zf.extractall("/content/")
# Read the train/test arrays from the HDF5 file
with h5py.File("/content/full_dataset_vectors.h5", "r") as hf:
    x_train = hf["X_train"][:]
    y_train = hf["y_train"][:]
    x_test = hf["X_test"][:]
    y_test = hf["y_test"][:]
```

Code-explanation^{We have downloaded the dataset from the Kaggle link, then simply extracted the zip file and read the h5 file into train/test arrays. This code is available on the website}

```
import numpy as np, matplotlib.pyplot as plt
num = np.random.randint(0, 1000, 1)[0] #1
vox = x_train.reshape(-1, 16, 16, 16)[num] #2
vox_1 = np.ceil(vox).swapaxes(0, 2) #3
vox_2 = np.ceil(vox).swapaxes(0, 1)
vox_3 = np.ceil(vox).swapaxes(1, 2)
fig = plt.figure(figsize=(16, 4))
#---- First subplot
ax = fig.add_subplot(1, 4, 1, projection='3d') #4
#---- 2nd subplot
ax1 = fig.add_subplot(1, 4, 2, projection='3d')
#---- 3rd subplot
ax2 = fig.add_subplot(1, 4, 3, projection='3d')
#---- 4th subplot
ax3 = fig.add_subplot(1, 4, 4, projection='3d')
ax.voxels(vox, edgecolor='k')
ax1.voxels(vox_1, edgecolor='k')
ax2.voxels(vox_2, edgecolor='k')
ax3.voxels(vox_3, edgecolor='k')
print(y_train[num]) #5
```

Code-explanation^{#1 - Generating a random number to get one random instance from x_train}^{#2 - x_train is flattened i.e. shape=(10000, 4096). So we have reshaped it into (10000,16,16,16)}^{#3 - Swapped each pair of axes once to view the digit in different orientations}^{#4 - Plotted the different orientations on different axes of the figure}^{#5 - Printed the digit's label}

Result

![Digit_full.PNG](https://cdn.hashnode.com/res/hashnode/image/upload/v1618064739350/G7jvc1YWU.png)

```
import pandas as pd
x_train = x_train.reshape(-1,16,16,16,1) #1
y_train = pd.get_dummies(y_train) #2
x_test = x_test.reshape(-1,16,16,16,1)
y_test = pd.get_dummies(y_test)
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv3D, BatchNormalization, MaxPooling3D
```

Code-explanation^{#1 - Reshaped the images to the Keras channels_last format (channels=1)}^{#2 - One Hot Encoded the labels}

```
filters = 64
dropout = 0.5
model = Sequential([
    Conv3D(filters, 3, padding='same', activation='relu', input_shape=(16, 16, 16, 1)),
    MaxPooling3D(pool_size=2, padding="same"),
    BatchNormalization(),
    Dropout(dropout),
    Conv3D(filters, 3, activation='relu', padding='same'),
    #MaxPooling3D(pool_size=2, padding="same"),
    BatchNormalization(),
    Dropout(dropout),
    Conv3D(filters, 3, activation='relu', padding='same'),
    #MaxPooling3D(pool_size=2, padding="same"),
    BatchNormalization(),
    Dropout(dropout),
    Conv3D(filters, 3, activation='relu', padding='same'),
    BatchNormalization(),
    Dropout(dropout),
    Conv3D(filters, 3, activation='relu', padding='same'),
    BatchNormalization(),
    Dropout(dropout),
    Flatten(),
    Dense(250, activation='relu'),
    Dropout(dropout),
    Dense(100, activation='relu'),
    Dropout(dropout),
    Dense(10, activation='softmax')
])
#model.summary()
adam = Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=250, epochs=500, validation_data=(x_test, y_test))
```

Code-explanation^{All the code above is self-explanatory, the only thing that is changed is the use of 3-D CNN and 3-D pooling.}^{Kernels size, Dropout etc. are hyper-parameters and have been found by tuning approaches. Just like we do in a regular 2-D CNN.}
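To see how the tensor shape changes through the model above, here is a small shape-tracing sketch. The two helper functions are hypothetical and only encode the shape rules for `padding='same'` with the default strides: a convolution keeps the spatial size and changes the channel count, and a pooling layer with `pool_size=2` halves each spatial dimension (rounding up).

```python
import math

def conv3d_same(shape, filters):
    """Shape after a 'same'-padded, stride-1 3-D convolution."""
    d, h, w, _ = shape
    return (d, h, w, filters)

def maxpool3d_same(shape, pool=2):
    """Shape after 'same'-padded 3-D max pooling with stride == pool size."""
    d, h, w, c = shape
    return (math.ceil(d / pool), math.ceil(h / pool), math.ceil(w / pool), c)

shape = (16, 16, 16, 1)           # one input volume, channels last
shape = conv3d_same(shape, 64)    # first Conv3D
shape = maxpool3d_same(shape)     # the only active MaxPooling3D
for _ in range(4):                # remaining four Conv3D layers
    shape = conv3d_same(shape, 64)

flat = math.prod(shape)           # size of the vector produced by Flatten
print(shape, flat)
```

BatchNormalization and Dropout do not change the shape, so they are left out of the trace. The large flattened vector is what feeds the first Dense layer.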

Result

This was all from us for this post. Please try the code in your own setup.

You may try,

- Making a binary classifier by picking only 2 digits
- Try this on any other 3-D dataset
- Go through the slides of University of Michigan

Hello AI enthusiasts, this is a short and spot-on post on a very unique topic. If you are a Data Scientist or an aspiring one, then you must have used Matplotlib to analyse your data using different plots.

In this post, we will go a step further and utilize the inbuilt capability of Matplotlib to animate the plots.
You may download the animation as a video file if required or you can simply use it to demonstrate any iterative learning.

This post is not about the basics of Matplotlib, so you should have a basic idea of Matplotlib.

FuncAnimation is the class that provides everything required to convert a simple plot into an animation. So let's understand its initialization parameters.

`class matplotlib.animation.FuncAnimation(fig, func, frames=None, init_func=None, fargs=None, save_count=None, *, cache_frame_data=True, **kwargs)`

Parameters

^{fig - The figure object used to get needed events, such as draw or resize.}^{func - This is a function which the class will call in every iteration to draw a plot; the class collates all the plots to build the animation. The first argument will be the next value in frames. Any additional positional arguments can be supplied via the fargs parameter.}^{frames - This is basically an iterable whose individual values will be passed to the function i.e. the previous parameter. We have to use it to draw the plot in each iteration}^{init_func - A function used to draw a clear frame.}^{fargs - Additional arguments to pass to each call to func.}^{interval - Delay between frames in milliseconds (passed via kwargs).}^{repeat_delay - The delay in milliseconds between consecutive animation runs, if repeat is True (passed via kwargs).}^{repeat - Whether the animation repeats when the sequence of frames is completed (passed via kwargs).}
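The relationship between `func`, `frames` and `fargs` can be pictured with a tiny stand-in written in plain Python. The helper `run_frames` is hypothetical and is not how Matplotlib is implemented (the real class is driven by a timer and handles drawing and caching); it only mirrors the calling convention described above.

```python
def run_frames(func, frames, fargs=(), init_func=None):
    """Hypothetical stand-in: call func once per frame value."""
    if init_func is not None:
        init_func()  # draw a clear frame first
    # The frame value is always the first argument, fargs follow it
    return [func(frame, *fargs) for frame in frames]

def draw(colour, marker):
    # A real func would draw on the axes; here we just describe the frame
    return f"{marker}:{colour}"

results = run_frames(draw, frames=['r', 'b', 'g', 'k'], fargs=('o',))
print(results)
```

Four frame values lead to four calls of `draw`, each receiving one colour plus the extra positional argument from `fargs`.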

Let's program a simple example,

```
import numpy as np, matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from matplotlib import animation, rc
from IPython.display import HTML, Image
# Render the animation as an HTML5 video inside the notebook
rc('animation', html='html5')
fig = plt.figure(figsize=(8,6))
ax = plt.axes()
def func_plot(w): #1
    ax.clear()
    x = np.random.normal(loc=0.0, scale=50.0, size=50)
    y = np.random.normal(loc=0.0, scale=50.0, size=50)
    plot = ax.scatter(x, y, c=w)
    return plot
anim = animation.FuncAnimation(fig, func_plot, frames=['r','b','g','k'], repeat=True) #2
anim
```

Code-explanation^{#1 - This is our function that will be passed to the Animation class. It plots a Scatterplot and returns it}^{#2 - In the frames argument, we have passed a list of 4 strings. Each one is passed to the parameter of func_plot, which uses it as the colour for the Scatterplot. It also means that the function will be called 4 times}

Result

So, we are done with a basic example. Let's now move to a better example where we will plot the convergence of a simple linear regression trained via Gradient Descent.

We will use dummy data X, Y and fit a Linear Regression on it. We will save the learnt parameters (slope and intercept) after each iteration and then plot the resulting line in an animation.
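The idea of collecting one (slope, intercept) pair per iteration can be sketched without any library. This version uses plain batch gradient descent on made-up data, which is a simplification and not the SGDRegressor used in this post; the learning rate and data are assumptions.

```python
# Toy data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mse(m, c):
    """Mean squared error of the line y = m*x + c on the toy data."""
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

m, c = 0.0, 0.0        # start from a flat line
learning_rate = 0.05   # assumed value, small enough to converge here
params = []            # one (m, c) pair per iteration, i.e. one animation frame

for _ in range(25):
    n = len(xs)
    # Gradients of the mean squared error with respect to m and c
    grad_m = (2 / n) * sum((m * x + c - y) * x for x, y in zip(xs, ys))
    grad_c = (2 / n) * sum((m * x + c - y) for x, y in zip(xs, ys))
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c
    params.append((m, c))

print("first frame:", params[0], "error:", mse(*params[0]))
print("last frame: ", params[-1], "error:", mse(*params[-1]))
```

A list like `params` is exactly what gets passed as `frames`, so each frame draws the line for one step of training.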

```
# Linear Regression
import matplotlib.pyplot as plt
from matplotlib import animation, rc
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from IPython.display import HTML, Image
rc('animation', html='html5')
x, y = make_regression(n_samples=1000, n_features=1, n_informative=1, noise=150)
model = SGDRegressor(warm_start=True, eta0=2.5, random_state=0)
parms = []
# One pass over the data per iteration; store the slope and intercept each time
for i in range(25):
    model.partial_fit(x, y)
    m, c = model.coef_[0], model.intercept_[0]
    parms.append((m, c))
fig = plt.figure(figsize=(8,6))
ax = plt.axes()
def func_plot(w):
    ax.clear()
    # End points of the line y = m*x + c for the current parameters w = (m, c)
    x1, y1, x2, y2 = x.min(), w[0]*x.min()+w[1], x.max(), w[0]*x.max()+w[1]
    ax.scatter(x, y)
    plot = ax.plot([x1, x2], [y1, y2], color='b')
    return plot
anim = animation.FuncAnimation(fig, func_plot, frames=parms, repeat=True, interval=500)
anim
```

Code-explanation^{Code is quite trivial and self-explanatory. We just fit an SGDRegressor one pass at a time and use the learnt parameters to plot the line.}

Result

In a similar manner, you can extend this code for a 3-D plot. The only thing that will change is the addition of matplotlib 3-D plots. Code related to animation will remain as it is.

Let's plot a plane whose extent grows with each frame, along with a random scatterplot.

```
# Main plane
from mpl_toolkits import mplot3d
# All other imports from the previous code
fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')
N = 1
x = 500*np.random.random(N)
y = 500*np.random.random(N)
z = 500*np.random.random(N)
def func_plot(w):
    global x, y, z
    ax.clear()
    # Random-walk the scatter points (the first call broadcasts them to 10 points)
    x = x + np.random.normal(loc=0.0, scale=50.0, size=10)
    y = y + np.random.normal(loc=0.0, scale=50.0, size=10)
    z = z + np.random.normal(loc=0.0, scale=50.0, size=10)
    # Plane Z = 2X + 5Y - 7.5 over a square that grows with the frame value w.
    # 50 points per axis are enough for a plane and avoid building 10000x10000 grids.
    X = np.linspace(-w*10, w*10, 50)
    Y = np.linspace(-w*10, w*10, 50)
    X, Y = np.meshgrid(X, Y)
    Z = 2*X + 5*Y - 7.5
    ax.set_xlim3d(-500.0, 500.0)
    ax.set_ylim3d(-500.0, 500.0)
    ax.set_zlim3d(-500.0, 500.0)
    ax.scatter3D(x, y, z, c='r')
    plot = ax.plot_surface(X, Y, Z, color='r')
    return plot
anim = animation.FuncAnimation(fig, func_plot, frames=range(1,10,1))
anim
```

Code-explanation^{Code is quite trivial and self-explanatory. }

**Result**

This was all for this post. You may extend the idea to,

- KMeans Clustering
- Neural Network Learning of Decision Boundary

Hello friends, in this post we will learn about one of the most versatile ideas in the Artificial Intelligence field *i.e. Word-Embedding.* The other high-leverage ideas are *Gradient Descent, Pre-trained models* and, recently, the *Transformer architecture*.

The idea started with the thought of representing text as a vector. Representing text using an *n-dimensional* vector is something we have done from the very first lesson of ML *i.e. One-Hot-Encoding*. Formally, OHE too is a vector representing the underlying words.

Using OHE, we have modelled our data and achieved a decent result.

We can use some other techniques to achieve this task *e.g. term-frequency, tf-idf etc.*, the way we did in the first post of this series. Check that Here

While simple encoding techniques are good for basic tasks, they are sparse and result in a very large number of dimensions for a large word corpus.

This will cause two major problems for any ML algorithm,

- It will create high dimensional sparse (a lot of zeroes) data
- It loses the semantic information of the underlying text *i.e. the distance between related, similar words is the same as between unrelated words.*
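The second problem is easy to verify with a few lines of Python. In the sketch below the three-word vocabulary and the 2-D dense vectors are made-up values, chosen only to illustrate the contrast.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["king", "queen", "apple"]

# One-hot vectors: a single 1 at the word's index
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense vectors: hand-picked, hypothetical 2-D embeddings
dense = {
    "king": [0.90, 0.80],
    "queen": [0.85, 0.75],
    "apple": [-0.70, 0.10],
}

for name, vectors in (("one-hot", one_hot), ("dense", dense)):
    related = euclidean(vectors["king"], vectors["queen"])
    unrelated = euclidean(vectors["king"], vectors["apple"])
    print(f"{name}: king-queen={related:.3f}, king-apple={unrelated:.3f}")
```

With one-hot vectors every pair of distinct words is equally far apart, while the dense vectors can place related words close together.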

Check this simple depiction for the two scenarios,

Now we understand the idea and the need for such an encoding approach i.e. Dense vector. Let's understand the way we can achieve it.

Let's check the depiction shown below,

We follow these steps to learn the embeddings of the words,

- Decide the dimension (N) of the embedding. This is a Hyper-parameter
- Treat the first layer as the embedding *i.e. it should have MxN weights*. M is the number of unique words.
- Pass the input as the OHE of the words. In short, only the weights of the words present in the input will be updated in the backpropagation.
- Once the training is complete, the weight matrix will represent the Embedding Matrix.
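The steps above rely on one fact: multiplying a one-hot vector by the MxN weight matrix simply selects one row of that matrix. A minimal sketch, with a made-up vocabulary (M=3) and made-up weights (N=2):

```python
vocab = ["the", "cat", "sat"]   # M = 3 unique words
embedding_matrix = [            # M x N weights, N = 2 (values are made up)
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(index, size):
    """One-hot vector with a 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector times matrix, written out with plain loops."""
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(len(matrix[0]))
    ]

idx = vocab.index("cat")
via_matmul = vec_times_matrix(one_hot(idx, len(vocab)), embedding_matrix)
via_lookup = embedding_matrix[idx]

print(via_matmul, via_lookup)
```

Because only the selected row takes part in the output, only that row receives a gradient, which is why an embedding layer can be implemented as a table lookup instead of a full matrix multiplication.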

But we don't need to do all these setups explicitly. We can simply use the Keras embedding layer and it will take care of all these inner workings.

Please check the previous post of this series *Here* where we have used the Keras embedding layer and explained the parameters in detail.

With that, we have our words encoded with the appropriate context provided by the dataset. It will smoothen the training process because the model now learns from dense vectors.

The only thing to note here is that we can't do much with the embedding that we got in the end. We were actually interested in the model's accuracy, which we got at the completion of the training, and learning the embedding took the whole training cycle too.

What I mean is that reusability was missing.

Another aspect is that it was trained in a purely supervised manner, and you can't always expect to have a text corpus with some type of labels available.

The good news is that if we have a similar context, we can reuse the final embedding that we got from the previous training. For example, we trained our model on sentiment analysis, which means we may use it for another similar system *e.g. Product reviews.*

Even better news is that there is a better solution. There are multiple word-embeddings available that are trained on huge datasets and can be used in text analysis tasks.

While you can train your own model to get the embedding every time, this has two obvious challenges -

- You may not have a sufficiently large corpus to train the model to learn the embedding
- Even if you have a large enough corpus, you will not want to pay such a computational cost every time.

A pretrained word embedding is the solution to both these problems, obviously with a trade-off.

Pretrained word embedding models are models that have been trained on a very large corpus *e.g. Wikipedia* to generate the embedding of each word.

While such an approach gives us a sufficiently generic embedding of each word, it lacks domain-specific context *e.g. Medical text. This is the trade-off.*

Another useful aspect of pre-trained embeddings is that they are trained with an end goal in mind *i.e. creating embeddings*, so these models are optimised for computation.

Although the high-level approach is similar to the depiction we saw in the previous section, these models use alternative approaches instead of a Deep Neural Network to avoid the huge computational cost. A Deep Neural Network is not a very computationally efficient solution.

Let's quickly review a few of the famous Pre-trained models,

**word2vec** - The word2vec model became one of the most famous pre-trained embedding models. It learns the relationships between words by looking at them in small sliding windows and then trains the model with a supervised approach in which the labels come from the text itself. This approach (*i.e. using the vocab as both data and labels*) will be common to all the pre-trained models.

Let's check this depiction. It has two different algorithms available as options to the user *i.e. Continuous BoW (Bag of Words) and Continuous Skip-Gram.*

In CBoW, the model is fed with past "K" words and future "K" words and trained on the prediction of the centre word. K is a hyperparameter. In the image above, K=3.

Bigger K will have better context but will add to the computational overhead.

In the case of continuous skip-gram, it is done the other way round. The model tries to predict the contextual past and future words using the centre word.
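The difference between the two options shows up clearly in the training pairs they generate from the same sliding window. The helpers below are a minimal sketch with hypothetical names; they only build (input, label) pairs and do not train any model.

```python
def cbow_pairs(tokens, k):
    """CBoW: up to k past and k future words are the input, the centre word is the label."""
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        if context:
            pairs.append((context, centre))
    return pairs

def skipgram_pairs(tokens, k):
    """Skip-gram: the centre word is the input, each context word is a label."""
    return [
        (centre, ctx)
        for context, centre in cbow_pairs(tokens, k)
        for ctx in context
    ]

tokens = "the quick brown fox jumps".split()

print(cbow_pairs(tokens, 1)[2])        # context words -> centre word
print(skipgram_pairs(tokens, 1)[:3])   # centre word -> one context word each
```

A larger `k` gives each centre word more context words, which also means more pairs to process. That is the computational overhead mentioned above.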

If you want to dive deep into the calculation, check out this paper, which was written just to explain word2vec since the original paper is not very easy to comprehend.

**GloVe** - GloVe stands for Global Vectors. It was developed as an open-source project at Stanford. The way it differs from word2vec is that it tries to learn a global pattern of the corpus, unlike word2vec, which focuses on local windows.

It uses the co-occurrence count of each word with respect to other words. It calculates the probability of each word co-occurring with any other word. Check this table from the official website.

It then tries to learn vectors such that a word is near another word with a high probability of co-occurrence but distant from a word with a low probability of co-occurrence. In the above depiction, *"Solid" is closer to "Ice" than to "Steam".*
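The counting step can be sketched in a few lines. The toy sentences, the window size and the helper names are assumptions for illustration. The actual GloVe model goes further (for example, it fits word vectors to these statistics), which this sketch does not attempt.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each word appears within `window` words of another."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cooc_prob(counts, word, other):
    """Share of `word`'s co-occurrences that involve `other`."""
    total = sum(counts[word].values())
    return counts[word][other] / total if total else 0.0

# Toy corpus (made up)
sentences = ["ice is solid", "steam is gas", "ice is cold solid water"]
counts = cooccurrence(sentences, window=2)

print("P(solid | ice)   =", cooc_prob(counts, "ice", "solid"))
print("P(solid | steam) =", cooc_prob(counts, "steam", "solid"))
```

In this toy corpus "solid" co-occurs with "ice" but never with "steam", which is the kind of signal the learnt vectors are meant to reflect.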

If you want to dive deep into the calculation, check out the original paper

Embeddings align the word in an N-dimensional space according to the context learnt from the training corpus. We can easily visualize the same in 2-D space by either dimension reduction or using the appropriate visualization technique. We will do this exercise in the next post.

Let's see this example, which shows not just the neighbourhood of words but also a temporal sense *i.e. the neighbourhood changes*. This can happen if the contemporary corpus *e.g. Newspapers, Social Media* changes the way it interprets particular words. This is a type of concept drift.
^{Image Credit - Sebastian Ruder's Blog}

Also, check this embedding created on a corpus of Game of Thrones. It shows the top five words nearest to the words King and Queen.
^{Image Credit - Lilian Weng's Blog}

On the negative side, the model can learn a lot of social biases. A few examples are listed below -

- If asked, *man is to computer programmer as woman is to X*, the system answers X = homemaker [*Reference*]
- The same paper has other similar examples *e.g. "father is to a doctor as a mother is to a nurse"*
- People in the United States have been shown to associate African-American names with unpleasant words (more than European-American names), male names more with mathematics and female names with the arts. [*Reference*]

This was all for this post. We will continue word-embedding in the next post too. But there we will do the hands-on.

Although we discussed all the important points regarding Word-embeddings, still there are a lot of things that you should look into to gain a deeper insight. For that, go through the references provided in the post.

You can also check few additional blogs on the topic,

Hello friends, in this post we will learn to serve the input request with a simple CNN model.
Let's check the depiction below,
^{Image credit - AI and Machine Learning for Coders, by Laurence Moroney}

It is a simple flow diagram of an ML model's lifecycle till the serving stage.

Just like any other software product/app, ML models too have an operations stage that begins after the model starts serving real data *i.e. Production data*. This stage is known as MLOps, which we will not touch in this post.

Before the serving stage, we have all the trivial initial stages of an ML training cycle. This part is shown in the Translucent frame in the image above. We will skip that part too since we assume we all are fairly aware of all these steps.

Let's check our Request-Response flow. If you have never worked on a web API and don't know what a Request and a Response are, then let me explain. The query a user sends to the server from their device/browser is a Request and the answer from the server is the Response.

Check this simple flow diagram,

We will not build any fancy web app, so we will mimic the request-sending process using the Postman client on the Desktop. Postman is a mature app that is used to test web APIs. You can get that from Here

Our server will be hosted on Google Colab, which is a cloud-based Notebook as a service.

Our model will be a CNN model trained on the MNIST digits dataset. Request data will be an image made on a digital whiteboard with a Hand-written digit using a stylus.

Our server code will have 4 key components,

- **The model** - This is the trained model ready to serve *i.e. respond to the predict call*
- **Pre-processing function** - You must be aware of the pre-processing steps used while training the model. Now we have to apply the same set of steps to the new data. The most important part of this step is to keep the statistics used in training *e.g. mean/std etc.*, because you can't recalculate these statistics every time for the *training+new_test data.* We will also place any generic utility here *e.g. MNIST will be trained on b/w data but we may receive a coloured image in the request.*
- **Web API** - This is our Flask based API. The responsibility of the API is to accept the request, perform the basic data validation, call the pre-processing/predict function and finally send the Response back to the Client.
- **ngrok overhead** - This component is not one of our required components, but we need it as a workaround to get a Public API out of Google Colab. *So you can safely skip understanding this part and simply use it as-is i.e. copy-paste.*
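The point about reusing the training statistics can be illustrated with a small sketch. The pixel values below are made up; the training minimum and maximum (0 and 255) are the ones used for MNIST-style images in this post.

```python
X_TRAIN_MIN = 0.0     # statistics stored from training
X_TRAIN_MAX = 255.0

def min_max_scale(pixels, lo=X_TRAIN_MIN, hi=X_TRAIN_MAX):
    """Scale pixel values with the given min/max (training stats by default)."""
    return [(p - lo) / (hi - lo) for p in pixels]

# A new request image that happens to be darker than the training images
new_image = [0, 64, 128]

consistent = min_max_scale(new_image)
# Wrong: recomputing the statistics from the request itself
inconsistent = min_max_scale(new_image, min(new_image), max(new_image))

print("with training stats:", consistent)
print("with request stats: ", inconsistent)
```

With the request's own statistics, the brightest pixel is stretched to 1.0, so the model would see inputs on a different scale from the one it was trained on.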

```
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import load_model
# Load the model
model = load_model('/content/drive/MyDrive/Colab Notebooks/Blogs/10xAI_Blog_0022_ML-Serving/CNN_MNIST.h5')
# Statistics used for scaling during training
x_train_max = 255.
x_train_min = 0.
path = "/content"
def pre_process(img_path):
    # load, B/W, Resize image
    img = keras.preprocessing.image.load_img(img_path, color_mode='grayscale', target_size=(28, 28))
    img_arr = keras.preprocessing.image.img_to_array(img)
    # scale
    img_arr = (img_arr - x_train_min)/(x_train_max - x_train_min)
    # add the batch dimension: (28, 28, 1) -> (1, 28, 28, 1)
    img_arr = img_arr[np.newaxis,:,:]
    # predict and return
    return np.argmax(model.predict(img_arr), axis=-1)[0]
```

Code-explanation^{The image is loaded from the path in Grayscale format with a size of 28x28. All other parts of the code are quite trivial and self-explanatory.}

With the above code snippet, we have created the training parameters, loaded the model and created the pre-processing function.

Let's build the Flask API

```
import json, os, time
from flask import Flask, request, Response
app = Flask(__name__)
# Create a method for /
@app.route("/")
def home():
    return "<h1>Running Flask on Google Colab!</h1>"
# Post method for Predict
@app.route('/predict', methods=['POST'])
def predict_():
    # Get request param
    uploaded_file = request.files['file'] #1
    # Check it is a valid image file
    # Do it yourself
    # Save to DB/Disk
    img_path = os.path.join(path, uploaded_file.filename + time.strftime("%Y%m%d-%H%M%S"))
    uploaded_file.save(img_path) #2
    #Pre-processing and prediction
    digit_class = pre_process(img_path) #3
    # Prepare output
    res = {"Digit": int(digit_class)} #4
    return Response(response=json.dumps(res), status=200, mimetype='application/json')
```

Code-explanation^{#1 - Till this line, we have used the standard Flask code. Check the official docs Here}^{#2 - We should save our input for further analysis. For demo purposes we have just saved it to the disk, with the timestamp concatenated to the filename}^{#3 - Called the pre_process function}^{#4 - Built the simple response JSON}
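For the validation step that is left as "Do it yourself" in the code, one simple option is to look at the first bytes of the upload and compare them with the well-known PNG and JPEG file signatures. This is only a minimal sketch: the function name is hypothetical, only two formats are covered, and a matching signature does not prove the rest of the file is a valid image.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"       # first 3 bytes of a JPEG file

def looks_like_image(header: bytes) -> bool:
    """Return True if the leading bytes match a PNG or JPEG signature."""
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)

print(looks_like_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(looks_like_image(b"<html>not an image</html>"))
```

Inside the Flask handler, one way to use it (an assumption, not part of the original code) would be to read the first 8 bytes from `uploaded_file.stream`, rewind the stream with `seek(0)`, and return a 400 response when the check fails.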

Now we are left with creating a tunnel using `ngrok` to make this API available publicly. We have used `pyngrok` for this. After that, we will start the Flask server.

```
# You need not know the details of this code
# It binds localhost:80 to an internet address and returns the address
# Use that address to call the API from anywhere
!pip install pyngrok --quiet
from pyngrok import ngrok
public_url = ngrok.connect(port="80", proto="http", options={"bind_tls": True},)
print("Public URL:", public_url)
# Start the server
app.run(host='0.0.0.0', port=80, debug=False)
```

Result

Great!! We are ready with our server up and running.

Let's go to the Postman client and call the API. Here is a screenshot from the Postman client. Red texts in the Green box are the explanation text.

^{Select the method as "POST"}^{Put the URL that we got from the ngrok code. See the previous image}^{Click the Body tab and create a key with the name "file" and upload the image using the browse button}^{Click send and you will receive the Response.}

This concludes our end-to-end code for ML serving. The best part of it is that you can do this in Google Colab. You can use this code and extend it to a very different use case e.g. Tabular data, or you can try this for the CAT/DOG dataset with your own model.

Just be mindful of the fact that a real-life scenario will not work with such simplistic handling *e.g. the image must be saved on a file-hosting server and its metadata in a Database*. Secondly, nothing special has been done to get top-tier performance.

Hello Friends, the title of the post may sound counter-intuitive to many of you since you might be used to hearing a lot on the internet about how to master ML in 21 days etc.

It might look simple if you just skim it quickly, but if you sit down with the intention of solving a real problem, participate in Kaggle or even try to answer questions on Stackexchange, you will realize the gap even though you consider yourself done with the required courses.

In case you have not tried any course/tutorial, then the very first challenge in front of you is to assimilate such a large volume of courses and content.

On top of all this, if you see a message similar to this tweet, confusion raises to the next level.

Let's see another stat that adds further to the confusion. This is an excerpt from the annual State of AI report.

_{Source - State of AI Report 2020, Slide#80}

The fact that the demand is not being fulfilled is connected to the quality of the supply. We will come back to these images in the conclusion part of the post. Let's move to the next section to figure out why it is so.

The very first challenge anyone can face with Machine Learning is the task of grasping so many disciplines.

**Mathematics**is mixed throughout the ML space*i.e. the working of Models.*You will need a basic understanding of*Linear Algebra, Probability and Calculus(Deep Learning)*.**Statistics**is the subject on which machine learning is based on. Many of the basic models are based on pure statistics and almost all of the models have their root in Statistics. If you know the underlying Statistics, then you feel very confident about the learning and enjoy the whole process. Though you may move into Modelling with a little of Maths and Statistics.**Programming**- There are many views out there that you might not need programming to be a Data Scientist but that's will be true in a limited scope. There is no one-fit-all model or approach. What it means is that you need a lot of engineering with ML API and Data analysis API i.e. Numpy. and this is without considering the post Modelling work i.e. Deployment, MLOps etc. You will need a programming hand for all these activitis.**ML Algorithms**- As we said earlier, there is no one-fit-all model or approach, so you need to understand the pros and cons of different models and accordingly you can try it. This particular part will require the above-mentioned knowledge of Mathematics and Statistics. Other than understanding the models you will need the art of Feature engineering, which itself is a very abstract subject and can be learnt mostly by doing.

Imagine studying all these from bottom to top. It's almost equivalent to 2 years of engineering. The good news is that you can do things in a shorter time if you curate the learning properly, especially Mathematics, Programming and Statistics. **So, the summary of the section is to find and follow a fine-grained course that has sufficient focus on programming**.

The domain is the Industry domain to which the data belongs. Data will be generated only through a process *e.g. Movie review, Twitter, Medical images, Sales data etc.* Each of these datasets belongs to a specific domain *e.g. Finance, Pharma, Manufacturing etc.* If we have expertise in the domain, then it might ease the modelling part e.g. *you may easily figure out why a particular group of data behaves differently with the model*.

But we cannot do much about it, as it is completely dependent on your work experience and also your inner interest.

Data has other aspects too i.e. Type and Size. Data can be *Tabular, Image, Text, Sequence etc.* All of these will bring a new type of learning requirement and, most of the time, a new modelling strategy *e.g. a simple Neural Network will become a Convolutional Neural Network for Image data and a Recurrent Neural Network for text data*. On top of that, you will be introduced to concepts such as Transfer-Learning, Word-embedding etc.

Data size brings another complexity to the problem i.e. the merging of two different entities, a Large Dataset (*I am avoiding the term Big Data*) and ML modelling. You can't keep these as two separate functions i.e. one team doing all the data part and then the ML team doing the modelling. It will not work because the ML person will have to experiment *i.e. try different models, Feature engineering, Tuning etc.* on this large dataset, and a step which should take 1 minute will start consuming 2 hours. So, you have to figure out a new set of *Library implementations, Software engineering practices etc.*

**Bonus point -** ML research is moving very fast *esp. for Text and Images.* So, keeping in sync with the research becomes an additional task.

We have a detailed post on the list of books and tutorials to follow. Check it *here*

In this post, we will just summarize the key steps -

- Start early, I believe from the second semester of your bachelor. You will have an ample amount of time.
- If you are already in a job, you don't have a similar choice. So focus on your will-power.
- Commit at least 10 hours each week. Don't expect any magic before one year.
- Learn Python and NumPy to a decent level. Check the suggested post for book references.
- Balance concept and code. Check the suggested post for book references.
- Participate in Kaggle and read the good solutions to pick up novel ideas
- Follow Industry leaders on Twitter

^{Skipping Statistics/Mathematics - You might be tempted to skip these two portions and may move ahead to a decent level even without them. But this will hamper your growth in terms of scalability of learning e.g. you will start skipping great sources of knowledge that rely heavily on them.}^{Programming - You may receive slightly different suggestions on this aspect too. But you should only avoid it when you are sure of what you are doing i.e. there are fields that might not use a very large amount of data, or that work purely in the domain of Statistical testing.}

With the last section, we are done with this post. The idea was never to intimidate you but to make you aware. Learning only the very basic pieces is good for generating interest, but it will not land you in the place you might be aspiring for.

There are many positive pieces too,

- AI is going to stay here, with new forms and Architectures every few years.
- Whatever the technology, it will have to knock on the door of AI after a certain level of Maturity e.g. "*Using AI for X*"
- Investment is intact in this space, check this excerpt from the AI Index 2021 survey.

Now, you can connect the above stats with the stats shown at the beginning *i.e. about the jobs not getting filled*. Companies are expecting people to go beyond the "Hello world" ML stuff *i.e. very simple Datasets and scenarios*.

On the first image, Andriy is correct in his claim but his expectation of Scikit-Learn is not appropriate. Please read the full tweet thread *Here*.

Hello AI Enthusiasts, this is the second part of our series on Natural Language Processing(NLP). You can check the first post *[Here]*.
In the previous post, we used the One-Hot-Encoded data making it similar to Tabular data and used the SVM ML model.

In this post, we will keep the sequence of words intact and use a Recurrent based Neural Network.
We will not deep dive into the working of a Recurrent Neural Network, but we will put a quick and intuitive summary in the next section. It will help you to move forward throughout this series.

By the end of this post, you will understand the data setup and working of an RNN(*Recurrent based Neural Network*)

Let's check this depiction similar to a simple Neural Network except for an additional Recurrent weight.

Our data is in the form of a sequence of Words encoded into Integers. Batch has the same meaning as in the case of a simple Neural Network.

- Each sequence element is passed one by one
- The output of each Neuron is looped back via a weight to the input. This facilitates a simple Historical memory. Now the Network has a basic system in place to learn the sequence as a whole rather than each Feature (Word) independently.
- After each Word, the Neuron will output a value. We may pass this output on after each word, or we may pass it only after the whole sequence (*one data point*). This decision depends on the use-case. More on it in the next section.
- Since each sequence is passed separately, the length of each data point can be different.
- We have different variants of Recurrent Neural Network. Three of the most common are *Simple RNN, LSTM and GRU*. Keras has implementations of all of these.
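To make the looped-back weight concrete, here is a minimal NumPy sketch of a simple recurrent layer. It is only an illustration under stated assumptions: the function name `simple_rnn_forward`, the `tanh` activation, the random weights and the toy sequences are all hypothetical choices, not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(sequence, w_in, w_rec, bias, return_sequences=False):
    """Run one sequence through a toy recurrent layer, one element at a time."""
    hidden = np.zeros(w_rec.shape[0])          # the "memory" starts empty
    outputs = []
    for x_t in sequence:                       # each sequence element is passed one by one
        # the previous output is looped back through the recurrent weight
        hidden = np.tanh(x_t @ w_in + hidden @ w_rec + bias)
        outputs.append(hidden)
    outputs = np.stack(outputs)
    # either emit an output after every element, or only after the whole sequence
    return outputs if return_sequences else outputs[-1]

rng = np.random.default_rng(0)
n_features, n_units = 1, 4                     # one feature per word, four neurons (assumed sizes)
w_in = rng.normal(size=(n_features, n_units))
w_rec = rng.normal(size=(n_units, n_units))
bias = np.zeros(n_units)

short_review = rng.normal(size=(3, n_features))   # sequences may have different lengths
long_review = rng.normal(size=(7, n_features))

last_only = simple_rnn_forward(short_review, w_in, w_rec, bias)
every_step = simple_rnn_forward(long_review, w_in, w_rec, bias, return_sequences=True)
print(last_only.shape, every_step.shape)
```

The same weights handle a 3-word and a 7-word sequence, which is why the length of each data point can differ.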

This much information is sufficient for us to move forward. In case you want to dive deep into the working of the different implementation, check the book **D2L** _{CH#08,09}

Check the below depiction on the different combination of input type and output type,

^{Vector is the traditional data point i.e. NxM feature-based data or an OHE output.}^{Sequence is data where the order of the data points has a meaning e.g. Text, Timeseries data, Speech etc.}^{Image credit - Deep Learning for Computer Vision Fall 2020}

Let's quickly understand these different combinations,

- Vector to Vector - *This is typical tabular data i.e. Iris data (Vector) and Iris Class (Vector).*
- Vector to Sequence - *An image captioning system. Image data is a vector.*
- Sequence to Vector - *Classification (Vector) on IMDB reviews (Sequence).*
- Sequence to Sequence - *Language translation i.e. English (Sequence) to Hindi (Sequence).*

^{Let's come back to the blue dot in the first image, we will be interested only in the final output of a full sequence when we are doing the Classification work(case-III) but when we are implementing a translation system(Case-IV), we will need the output after every word. Keep this intuition in mind, we will need it.}

We are good with the basic theory. Let's code it. Most of the portion will remain same as in the previous post of the Series.

```
from tensorflow import keras
import numpy as np, seaborn as sns, pandas as pd
num_words=None; maxlen=1000 ; skip_top=20
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words, maxlen=maxlen, skip_top=skip_top)
```

Code-explanation^{The code is quite trivial and self-explanatory. We don't restrict the vocabulary size or the document count because a Neural Network is an incremental optimization algorithm, unlike the SVM which we used in the previous post. So, it will cope pretty well with such a volume of data.}

```
# Dataset to fixed length by padding #1
clip_length = 250
x_train_ohe = np.zeros((len(x_train), maxlen), dtype='float32')
for i, review in enumerate(x_train):
    j = 0
    for word in review:
        if word not in [0, 1, 2]:
            x_train_ohe[i, j] = word
            j += 1
x_test_ohe = np.zeros((len(x_test), maxlen), dtype='float32')
for i, review in enumerate(x_test):
    j = 0
    for word in review:
        if word not in [0, 1, 2]: #2
            x_test_ohe[i, j] = word
            j += 1
# Clip every review to the first clip_length positions
x_train_ohe = x_train_ohe[:, :clip_length]
x_test_ohe = x_test_ohe[:, :clip_length]
x_train_ohe = x_train_ohe[:, :, np.newaxis] #3
x_test_ohe = x_test_ohe[:, :, np.newaxis]
# Largest word index across train and test, so the Embedding layer covers every index
vocab = int(max(x_train_ohe.max(), x_test_ohe.max())) #4
```

Code-explanation^{#1 - Check this image to understand what is done using the two loops. We have initialized a zero-array to standardize the length, then filled it with all the sequences and, at the end, clipped it to the desired length i.e. the clip_length parameter.}^{#2 - [0, 1, 2] are not sentiment words but reserved indices, so we skipped them.}^{#3 - Added an axis for the Features. We have only one feature per word.}^{#4 - Calculated the largest word index, which bounds the vocabulary size. To be used later.}
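The padding-and-clipping idea can also be sketched as one small helper, shown here on a toy input so the effect is easy to inspect. The helper name `pad_and_clip` and the toy reviews are assumptions made for illustration; it follows the same logic as the loops above (skip the reserved indices, pad with zeros, clip to a fixed length).

```python
import numpy as np

def pad_and_clip(sequences, clip_length, skip=(0, 1, 2)):
    """Return a (n_sequences, clip_length) array, zero-padded on the right."""
    padded = np.zeros((len(sequences), clip_length), dtype="float32")
    for i, seq in enumerate(sequences):
        # drop the reserved indices, then keep at most clip_length words
        kept = [word for word in seq if word not in skip][:clip_length]
        padded[i, :len(kept)] = kept
    return padded

# Hypothetical integer-encoded reviews of different lengths
toy_reviews = [[1, 14, 22, 16], [1, 2, 43], [1, 530, 973, 1622, 1385, 65]]
toy_padded = pad_and_clip(toy_reviews, clip_length=4)
print(toy_padded)
```

Short reviews end in zeros, which is what `mask_zero=True` in an Embedding layer is meant to ignore.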

```
embed_size = 32
model = keras.models.Sequential([
    keras.layers.Embedding(vocab + 1, embed_size, input_shape=[None], mask_zero=True), #1
    keras.layers.GRU(embed_size, return_sequences=True, dropout=0.5), #2
    keras.layers.GRU(embed_size, dropout=0.5),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(x_train_ohe, y_train, epochs=2, validation_data=(x_test_ohe, y_test), batch_size=128)
```

Code-explanation^{#1 - We will explain it in the next section.}^{#2 - We have used a GRU recurrent layer. The parameters are self-explanatory.}^{return_sequences - This parameter controls whether the layer emits an output after each word, as explained with the blue dot at the beginning. Setting it to True means the layer passes its output on word-by-word. We do this for all recurrent layers before the last one. It means both the GRU layers will learn in recurrence on each word of a sequence and, at the end, only the final output of the last GRU is passed to the Dense layer.}^{All other parts of the code are quite trivial and self-explanatory.}

Result^{Epoch 1/2}^{195/195 [==============================] - 22s 79ms/step - loss: 0.6104 - accuracy: 0.6331 - val_loss: 0.3225 - val_accuracy: 0.8656}^{Epoch 2/2}^{195/195 [==============================] - 13s 66ms/step - loss: 0.2166 - accuracy: 0.9200 - val_loss: 0.3108 - val_accuracy: 0.8700}

The first layer we have used in the code snippet was an Embedding layer. This technique has become the backbone of contemporary advancement in the NLP space. We will have a dedicated post on this in the next part of this series.

In simple terms, you may think of this as a Dimensionality reduction technique that tries to represent all the words in N dimensions (32 in our case).

Another benefit is that it converts the sparse OHE encoded data into a continuous space. A Neural Network converges more easily on a continuous dataset.
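As a rough illustration of that idea, the NumPy sketch below treats an embedding as a lookup table and checks that looking up a row is the same as multiplying the one-hot vector by the table. The sizes, the random table and the toy review are assumptions; a real Embedding layer learns the table during training.

```python
import numpy as np

vocab_size, embed_size = 10, 4                 # assumed toy sizes (the post uses 32 dimensions)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_size))  # one dense row per word index

review = np.array([3, 7, 7, 0])                # hypothetical integer-encoded words

# Embedding lookup: pick the row that belongs to each word index
dense = embedding_matrix[review]               # shape (4, embed_size)

# Equivalent view: sparse one-hot vectors multiplied by the same table
one_hot = np.eye(vocab_size)[review]           # shape (4, vocab_size), mostly zeros
dense_via_ohe = one_hot @ embedding_matrix

print(dense.shape, np.allclose(dense, dense_via_ohe))
```

The same word index always maps to the same dense vector, so repeated words share one representation.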

Check this post of 10xAI Link to understand the basics of the Embedding Layer.

With the last snippet, we completed this post covering the Recurrent Neural Network. You may try -

- SimpleRNN and LSTM layers of Keras
- Different embedding size
- Different Neurons count

The improvement over the previous model *i.e. OHE with SVM* is not very large. The primary reason for this is that only a few of the important words are sufficient for Positive/Negative sentiment classification. If you execute the code without a GPU environment, it will take a bit longer. Please use Google Colab for a Free GPU facility.

In the next part of this series, we will dive deep into Word-Embedding and also pre-trained Embeddings.

In this post, we will learn and understand why we should scale the data for a KNN model.
At first glance, it looks like Euclidean distance should not be impacted even with unscaled data *i.e.*

$$distance = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

We can observe that each squared difference uses the same feature. *So if Y1 is at a bigger scale, so is Y2. Hence, this distance metric should remain consistent.* Let's see why this intuition breaks down.

The above plot is depicting two classes based on two features that are properly scaled. One of the points is highlighted in a red dotted circle. If we observe, this point very comfortably looks like a "Blue" class as almost all the neighbours are Blue.

But now try to imagine unscaled data where one of the features is large. What it means is that the space will be very elongated along that axis and squeezed along the other axis.

It will mean that the distance along the smaller axis has no relevance as compared to the larger axis. The distance along the larger axis will define the total distance.
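A tiny numeric sketch can make this concrete. The three points and the scale factor of 100 below are made-up values chosen only for illustration: with comparable scales one point is the nearest neighbour, and after stretching the first axis the other point becomes the nearest.

```python
import numpy as np

def euclid(p, q):
    """Plain Euclidean distance between two points."""
    return np.sqrt(np.sum((p - q) ** 2))

# Hypothetical points: b is close to a horizontally, c is close to a vertically
a = np.array([1.0, 1.0])
b = np.array([1.2, 3.0])
c = np.array([2.0, 1.1])

# Comparable scales: both axes contribute to the distance
d_ab, d_ac = euclid(a, b), euclid(a, c)

# Stretch the first axis by 100 (an assumed factor) to mimic an unscaled feature
stretch = np.array([100.0, 1.0])
d_ab_unscaled = euclid(a * stretch, b * stretch)
d_ac_unscaled = euclid(a * stretch, c * stretch)

print("scaled   : nearest to a is", "b" if d_ab < d_ac else "c")
print("unscaled : nearest to a is", "b" if d_ab_unscaled < d_ac_unscaled else "c")
```

After stretching, the vertical gap between `a` and `b` barely matters, so `b` overtakes `c` as the nearest neighbour.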

What it means for this plot is that the highlighted point now has many "Orange" neighbours, since we can ignore the vertical distance. These new neighbours are all the "Orange" points below the highlighted point in the unscaled data. In the previous view (the scaled view) these looked farther away, hence they were not Neighbours.

Let's see,

We have intentionally flattened the plot along one axis to develop better intuition. Now we can observe that the highlighted point is fairly near to the "Orange" class.

Let's squeeze again,

Now, we can easily see that the highlighted point belongs to the "Orange class".

- Let's fit a KNN model on a scaled dataset and predict the class for the same data.
- Then fit the same model on the unscaled data and predict again.
- Note the points which change their class between the two predictions.

In the below plot, we highlighted the data which changed its Class with a bigger size. Hue is as per the prediction of the scaled model.

Let's see the same plot in squeezed axis and hue on prediction on unscaled data.

It is pretty evident from the above plot which data points change their Class and why they do so. These are the points which have many data points of the other class farther away on the vertical axis but very near on the horizontal axis.

With unscaled data, the Vertical distance lost its relevance and the Class swapped.

Below is the code snippet for all the trials we did in this post.

```
# Create the data
import numpy as np, seaborn as sns, matplotlib.pyplot as plt
sns.set_style("darkgrid")
sns.set_context(font_scale=1.0, rc={"lines.linewidth": 2.0})
from sklearn.datasets import make_classification
x, y = make_classification(n_samples=500, n_features=2, n_classes=2,
n_informative=2, n_redundant=0)
# Simple plotting the points
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x = x[:,0], y = x[:,1], hue=y)
# Flattened the axis and Unscaled the data
fig = plt.figure(figsize=(20,1))
sns.scatterplot(x = x[:,0]*100, y = x[:,1], hue = y)
# Further Flattened the axis and Unscaled the data
fig = plt.figure(figsize=(20,0.5))
sns.scatterplot(x = x[:,0]*100, y = x[:,1], hue = y)
# Created the Model and fit the data
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
y_pred = model.fit(x, y).predict(x)
x[:,0] = x[:,0]*200
y_pred_unscaled = model.fit(x, y).predict(x)
diff = y_pred!=y_pred_unscaled
# Plot for scaled prediction and highlighting the diff data points
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x = x[:,0]/200, y = x[:,1], hue=y_pred, size=(diff+1)*100)
# Plot for unscaled/Elongated axis, prediction and highlighting the diff data points
fig = plt.figure(figsize=(20,3))
fig.gca().set_ylim(-10,10)
sns.scatterplot(x = x[:,0], y = x[:,1], hue=y_pred_unscaled, size=(diff+1)*100)
```

Code-explanation^{All other parts of the code is quite trivial and self-explanatory.}

So, what is meant is that if we use unscaled data, the distance along the Feature having a smaller value will lose its relevance. The issue will impact the boundary points and it will become more severe if the two Features differ on a large scale.

So, it's better to scale even when it may not be strictly needed, because scaling will not create any problem if done unnecessarily.

Similar logic can be applied to Clustering using **KMeans** as that algorithm is also based on the same approach.
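One way to make scaling part of the model itself is a scikit-learn pipeline, sketched below. This is an illustrative setup, not the code used for the plots above: the `random_state` values, the train/test split and the stretch factor of 200 on the first feature are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with the first feature deliberately stretched
x, y = make_classification(n_samples=500, n_features=2, n_classes=2,
                           n_informative=2, n_redundant=0, random_state=0)
x[:, 0] = x[:, 0] * 200
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# KNN on the raw (unscaled) features
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)

# Same KNN, but every feature is standardized before distances are computed
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(x_train, y_train)

print("unscaled accuracy:", raw_knn.score(x_test, y_test))
print("scaled accuracy  :", scaled_knn.score(x_test, y_test))
```

Because the scaler is fitted inside the pipeline, the same transformation is applied automatically at prediction time. `KMeans` could be placed in the same kind of pipeline.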

Hello AI Enthusiasts, with this post we will start a series of posts on Natural Language Processing(NLP). We assume you have a basic understanding of ML and DL around tabular data. You can treat this series as a smooth transition from your current knowledge into the NLP space.

We can't cover every part and bit of such a huge space, so we will summarize that (*topics not covered*) too at the end.
We will also learn the related code. In the journey, we will try to answer many of the confusing questions and concepts.

Below is a depiction of our intention.

As you may observe, we will cover a broad range of topics. The focus is on understanding the Algorithms and Feature engineering techniques. We have skipped the NLU (Natural Language Understanding) zone, which forms the foundation of Chatbots and conversational AI. So, let's start the journey.

The first thing that we will analyse is the structural difference in the dataset as compared to a Tabular dataset.

As shown in the above depiction, features were defined and very well organized in the case of a Tabular dataset. But in the case of a Text dataset,

- Features are not defined
- Hence, they are not segregated
- Data points have different lengths
- Dataset is 100% Nominal (obviously)

Preprocessing is all about handling the above scenario. Text data preprocessing requires more effort as compared to the Tabular counterpart. One aspect that we didn't list is the cleansing of text. We will use a fairly clean dataset, so we will skip this part. We will use the IMDB movie review dataset.

Let's check the different terms and steps -

- **Tokenization** - *Tokenization is the step where each text sequence is split into a list of tokens. A token is a basic unit in the text. In our current case, it is a Word e.g. "exceptional". With this step, we get our Features, but they are still a bit far from usable.*
- **Vocabulary (Bag-Of-Words)** - *This is the key step in finalizing our Features from the sentences. We build a vocabulary list of all the unique words (Tokens) available in our dataset. This is also known as Bag-Of-Words. This is the simplest form of words i.e. 1-gram (more on it later). It must be noted that the Documents are described by word occurrences while completely ignoring the relative position of the words in the document.* *What if a word is new and appears only in the test dataset? This is called* **Out-Of-Vocabulary**. *It is handled by keeping a placeholder word. All new words will be mapped to that word. It will not help much in the prediction. There are some other flavours for this. We will also learn that later.*
- **One-Hot-Encoding** - *With this last step we are ready to create our Features. We will have one dimension for each unique word and simply mark the respective feature value "one" if it is present in the document. In case the word is absent, we will mark it as "zero". This is a simple multi-label One-Hot encoding.*
- **Term-frequency (tf)** - *With a plain multi-label OHE we lose the information about the reoccurrence of a word. It may not impact the Positive-Negative class separation, but it will bring all the documents of the same class close together i.e. not adding much value in terms of class-probability. With term-frequency, we put the count of each Token as a feature value instead of just 0/1.*
- **Normalized Term-frequency** - *Different documents can be of different lengths. One of the many implications of this is that a particular document may get a bigger frequency of a Token only because it is longer than the other e.g. a 10 Token document has "Happy" two times while a 5 Token document has it only once. So, we simply divide the term-frequency by the length of the document.*
- **Term-frequency Inverse Document Frequency (tf-idf)** - *Imagine a word that is present in almost every document e.g. "The". Although this will not impact the Classification boundary, it will add an unnecessary Feature. So, we simply divide each Token's term-frequency by its document-frequency i.e. the number of documents it is present in. This new entity is tf-idf. It has more significance in Search algorithms e.g. Google, and not a lot here, since we can remove such redundant tokens with other techniques too.*
- **Word-embedding** - This is a broad topic in itself. We will have a dedicated post on this.
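The terms above can be traced on a toy corpus with a few lines of plain Python. The three documents are made up for illustration, and the tf-idf here follows the simplified definition used in this post (term-frequency divided by document-frequency); library implementations such as scikit-learn's use a logarithmic variant, so their numbers will differ.

```python
# Hypothetical toy corpus
docs = ["happy happy movie", "sad movie", "happy ending"]

# Tokenization: split every document into word tokens
tokens = [doc.split() for doc in docs]

# Vocabulary (Bag-Of-Words): all unique tokens, word order is ignored
vocab = sorted({token for doc in tokens for token in doc})

# One-Hot-Encoding: 1 if the word is present in the document, else 0
ohe = [[1 if word in doc else 0 for word in vocab] for doc in tokens]

# Term-frequency: how many times each word occurs in the document
tf = [[doc.count(word) for word in vocab] for doc in tokens]

# Normalized term-frequency: divide by the document length
norm_tf = [[count / len(doc) for count in row] for row, doc in zip(tf, tokens)]

# Document-frequency: number of documents that contain each word
doc_freq = [sum(1 for doc in tokens if word in doc) for word in vocab]

# Simplified tf-idf as described above: term-frequency / document-frequency
tf_idf = [[count / df for count, df in zip(row, doc_freq)] for row in tf]

print(vocab)
for name, table in [("ohe", ohe), ("tf", tf), ("norm_tf", norm_tf), ("tf_idf", tf_idf)]:
    print(name, table[0])
```

For the first document, "happy" counts twice in `tf` but only once in `ohe`, and its `tf_idf` value is reduced because "happy" also appears in another document.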

Check this depiction for an overall summary.

^{There are a few things we didn't talk about but will use implicitly e.g. case normalization (all tokens are in lower case). Similarly, Lemmatization is the task of determining that two words have the same root, despite their surface differences e.g. play, playing. There are many dedicated algorithms for this. We will not do this in this post and will rely upon our dataset. Another one is Stop-words, which are those words that are very common in all the documents and carry very little useful information for distinguishing between different classes of documents. Examples of stop-words are is, and, has, the, like.}^{You may go through the beautiful book Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin to dive deeper.}

We are done pretty well with the terminologies. Now let's list the steps and start coding them.

- Load the documents

  We will not do any cleaning activity as we will use a standard dataset from Keras.
- Create features based on OHE, Term-frequency
- Reduce the dimension

  This is a point we didn't discuss in the last section. If we simply keep all the words in our vocabulary, we will end up with way too many Dimensions. We can skip words by removing stop-words or by keeping only the top K features based on tf-idf.
- Train and test with a Model

```
from tensorflow import keras
import numpy as np
num_words=1000; maxlen=300 ; skip_top=100
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words, maxlen=maxlen, skip_top=skip_top) #1
type(x_train)
max_doc_len = max(np.vectorize(lambda x : len(x))(x_train)) # This is the word count for the longest document
```

Code-explanation^{#1 - A function available in keras.datasets. It returns the dataset as NumPy arrays.}^{num_words - Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. Defaults to None, so all words are kept.}^{skip_top - Skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. Defaults to 0, so no words are skipped.}^{maxlen - Maximum sequence length. Any longer sequence will be truncated. Defaults to None, which means no truncation.}

This was the data load part. You might have noticed that we have controlled *irrelevant words i.e. Stop words, and Token count i.e. Feature Dimension* using the 3 parameters that are listed above. The value of `skip_top` should be based on exploring the dataset, but we have picked a guessed number.

Let's create a Multi-labelled OHE for the train and test dataset and fit an SVM model.

```
# One column per word index: indices go up to num_words - 1, so the width must be num_words
x_train_ohe = np.zeros((len(x_train), num_words))
for i, documents in enumerate(x_train):
    for word in documents:
        x_train_ohe[i, word] = 1
x_test_ohe = np.zeros((len(x_test), num_words))
for i, documents in enumerate(x_test):
    for word in documents:
        x_test_ohe[i, word] = 1
```

Code-explanation^{We simply looped over each token and assign a 1 to the respective Index. All the other indexes will remain 0}

```
from sklearn.svm import SVC
model = SVC()
sample = np.random.randint(0, len(x_train), size=(5000))
model.fit(x_train_ohe[sample],y_train[sample])
model.score(x_test_ohe, y_test)
```

Result^{Accuracy - 0.8502827763496144}

We used only the 5000 documents out of the 25K and we got almost 85% accuracy. The reason for this is that the Classification boundary is dependent on only a few keywords for the positive and negative sentiment.

Below is the code snippet on how to get the different statistical format using scikit-learn that we discussed in the previous section.

```
x_train_str = np.vectorize(lambda x : ' '.join([str(elem) for elem in x]))(x_train) #1
# Term Frequency
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
x_train_count = vectorizer.fit_transform(x_train_str)
# Term frequency - Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer()
x_train_tfid = tfid.fit_transform(x_train_str)
```

Code-explanation^{#1 - In this line we have converted each list object into a String. This is a requirement of scikit-learn. The rest of the code is self-explanatory. You may also achieve the same with custom code i.e. using NumPy only, as the steps are quite simple.}

With the last snippet, we completed this post explaining an end-to-end concept-to-code journey to build a Sentiment analyzer based on the IMDB Movie review dataset.

You may try -

- Smaller document length and Training size
- Apply Naive Bayes algorithm instead of SVM

We treated all the individual tokens as independent i.e. we did not account for their proximity or their consecutiveness. For example, "very kool" would have lost its extra sense of emphasis on positiveness.
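One simple way to keep some of that neighbouring-word information is to build features from consecutive token pairs (2-grams) instead of single tokens (1-grams). The helper below is a minimal sketch with a hypothetical name `ngrams` and a made-up review; it is not part of the pipeline used in this post.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined into single strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "the movie was very kool".split()   # hypothetical review

unigrams = ngrams(review, 1)   # what a plain Bag-Of-Words sees
bigrams = ngrams(review, 2)    # keeps pairs such as "very kool" together

print(unigrams)
print(bigrams)
```

With 2-grams, "very kool" becomes a single feature, so its emphasis is no longer split across two independent tokens.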

In the next part of this series, we will treat the reviews as a sequence *i.e. will try to account for the significance of its Token's order.*

In this 10xAI knowledge article, we will understand the Feature Interaction effect in Machine Learning. We will also understand its impact and the techniques to find it. In the learning process, we will also understand a very important Feature vs Target plot *i.e. the Partial Dependence Plot.*

Features are said to have an Interaction effect when the target increases or decreases more (or less) when the two features are present together than the sum of their individual effects on the target would suggest.

$$y = \alpha f1 + \beta f2 + \gamma f1f2 $$

*In the above equation, f1 and f2 are two features and f1f2 term is the interaction effect*

Interaction Effect can be of multiple types *i.e. don't expect it always to be additive*. We will not cover too much theoretical detail. You can read more on this in the beautiful book by *Max Kuhn and Kjell Johnson* [Here]

Two key effects of Feature Interaction are

- Not all models are able to figure out the effect inherently, so such models will miss this important information
- It affects our interpretation of Feature Importance *e.g. in the above case f1, f2 will have higher importance when present together*

The **LinearRegression** model will not consider the interaction term by default. We can add such terms by using Polynomial Features. We can observe in the plot below how the target will move when there is an Interaction effect, and when it is absent.

It's obvious that a LinearRegression will not be able to model the pattern when the effect is present.
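The point can be checked numerically with a small least-squares sketch. The data is synthetic and the coefficients (2, 3 and 5) are arbitrary assumptions; the model is fitted with `numpy.linalg.lstsq` rather than scikit-learn so the example stays dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.uniform(-1, 1, 200)
f2 = rng.uniform(-1, 1, 200)

# Target built as alpha*f1 + beta*f2 + gamma*f1*f2 with assumed alpha=2, beta=3, gamma=5
y = 2 * f1 + 3 * f2 + 5 * f1 * f2

# Linear model WITHOUT the interaction term
X_plain = np.column_stack([f1, f2])
coef_plain, *_ = np.linalg.lstsq(X_plain, y, rcond=None)
mse_plain = np.mean((X_plain @ coef_plain - y) ** 2)

# Linear model WITH the interaction term added as an extra feature
X_inter = np.column_stack([f1, f2, f1 * f2])
coef_inter, *_ = np.linalg.lstsq(X_inter, y, rcond=None)
mse_inter = np.mean((X_inter @ coef_inter - y) ** 2)

print("without interaction term, MSE:", mse_plain)
print("with interaction term,    MSE:", mse_inter)
print("recovered coefficients:", coef_inter)
```

Only the model that is given the `f1 * f2` column can reproduce the target, which is the job that Polynomial Features does for a linear model.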

^{A tree-based model is able to capture the Feature Interaction effect inherently. This looks surprising. Try to guess the reason and post your answer in the comment.}

Let's check a Regression model on the California Housing Data. We will follow these steps.

- Fit a Linear Regression Model on the data set
- Check the **coef** of each feature to see the Importance
- Create polynomial features and repeat the step, but with L1 regularization so that we get rid of the not so useful features
- Observe the new Features which are added to the feature importance ranking. These will be a mix of **Interacted-Features and Degree 2 features**

```
# .....Load and pre-process the data #1
# Define Model and fit data
from sklearn.linear_model import SGDRegressor
model_reg = SGDRegressor(eta0=0.001) # eta0=0.001
model_reg.fit(x_train, y_train)
# Check Coeff
coef = pd.DataFrame(model_reg.coef_.reshape(1, -1), columns = x_train.columns) #2
df = coef.T.sort_values(by=[0], ascending=False, key=lambda col: col.abs()) #3
df
```

Code-explanation^{#1 - California housing data is available in Google Colab's sample data folder. You can simply load and pre-process it.}^{#2,#3 - Get the coefficients and sort them. Remember that importance is decided by Magnitude, not the sign. That's why we passed a function to compare only the absolute values.}

Result^{latitude >longitude >median_income >population> total_bedrooms >households >total_rooms >housing_median_age}

As we can observe, latitude and longitude are the most important features, followed by median_income. We will not discuss feature importance much, as that is worth a full post in itself. Now let's repeat the exercise with polynomial features.

```
# Polynomial Features
def poly_feat(x, deg): #1
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(degree=deg)
    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    x = pd.DataFrame(poly.fit_transform(x), columns=poly.get_feature_names_out(x.columns))
    return x
d = 2
x_train = poly_feat(x_train, d) #2
# Very strong L1 Regularization
from sklearn.linear_model import Lasso
model_reg = Lasso(alpha=253, max_iter=5000) #3
model_reg.fit(x_train, y_train)
# Check Coeff
coef = pd.DataFrame(model_reg.coef_.reshape(1, -1), columns = x_train.columns)
df = coef.T.sort_values(by=[0], ascending=False, key=lambda col: col.abs())
df
```

Code-explanation^{#1 - Function to create polynomial features. It is available in Scikit-Learn.}^{#2,#3 - Calling the above function and fitting to a LASSO model.}

Result^{latitude >longitude> total_bedrooms>median_income>population>latitude median_income>total_rooms>longitude median_income>.........}

We can observe that the Interacted features of **Latitude/Longitude** with **median_income** have come up in the importance ranking. This hints towards a probable interaction effect between the Features.

This was an indirect technique, but an intuitive one to explain the effect. Now, in the next section, let's understand a more direct approach to figure it out.

A partial dependence plot tells us how a Feature impacts the Target assuming all other features as constant. Using this information, we can figure out the Interaction between two features.

Let's understand the general math first.

*Let's assume two features F1, F2 contributes to the output as 50, 50 (a fictitious unit). The contribution due to their interaction is 25.*

*When both the features are used,*
*total contribution = 125*

*When only one feature is used*
*F1, F2 contribution = 50 (because interaction effect will not come into play)*

**Difference** = 125 - [50*(for f1)* + 50*(for f2)* + 0*(for Interaction)*] = **25**

This is a general approach to figure out the interaction effect, meant to develop intuition.
Let's understand the formal way to do it. There may be some other approaches, but we will discuss the approach known as the **partial dependence plot.** We need to do some twisting and turning of the Feature data before we draw the plot. Here are the steps -

- Replace all the values for a particular feature with a constant value, keeping the other features as they are
- Calculate the target and calculate its average over all the samples. This is our first point for one value of the feature [*Check the Image below*]
- Repeat the above steps for all the different values of the feature
- This will give us the Feature-Target mapping with the other features held constant

We can do the same for two features together. Hence,
**partial_dependence(F1)+partial_dependence(F2)** will be compared with **partial_dependence(F1,F2 joint)** to figure out the interaction effect.
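The steps above can be sketched by hand in NumPy. Everything here is an illustrative assumption: the `predict` function is a hypothetical stand-in for a fitted model (built to match the fictitious 50/50/25 contributions used earlier), the data is random, and the grid has only two values per feature.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value            # replace the feature with a constant
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def partial_dependence_2d(predict, X, feat_a, feat_b, grid_a, grid_b):
    """Average prediction when two features are forced to each grid pair."""
    out = np.empty((len(grid_a), len(grid_b)))
    for i, a in enumerate(grid_a):
        for j, b in enumerate(grid_b):
            X_mod = X.copy()
            X_mod[:, feat_a] = a
            X_mod[:, feat_b] = b
            out[i, j] = predict(X_mod).mean()
    return out

def predict(X):
    # Hypothetical model: 50 per feature plus 25 when both are present together
    return 50 * X[:, 0] + 50 * X[:, 1] + 25 * X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
grid = np.array([0.0, 1.0])                  # "feature absent" vs "feature present"

pd_a = partial_dependence_1d(predict, X, 0, grid)
pd_b = partial_dependence_1d(predict, X, 1, grid)
pd_ab = partial_dependence_2d(predict, X, 0, 1, grid, grid)

# Joint effect minus the two individual effects isolates the interaction
interaction = pd_ab[1, 1] - pd_ab[1, 0] - pd_ab[0, 1] + pd_ab[0, 0]
print(interaction)
```

For a model with no interaction term, the same difference would be zero, because the joint surface would be just the sum of the two individual curves.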

We don't need to do it all as **Scikit-learn** has got a module to do so.

```
from sklearn.inspection import partial_dependence
# In scikit-learn >= 0.24 the result is a Bunch; the averaged predictions are under "average"
features = [1] # latitude
par_1 = partial_dependence(model_reg, x_train, features, method='auto', kind='average')
par_1_pred = par_1['average'][0]
features = [7] # median_income
par_7 = partial_dependence(model_reg, x_train, features, method='auto', kind='average')
par_7_pred = par_7['average'][0]
features = [1, 7] # joint partial_dependence
par_1_7 = partial_dependence(model_reg, x_train, features, method='auto', kind='average')
joint_pred = par_1_7['average'][0][:, 0]
```

Code-explanation^{We have simply used the partial_dependence function to calculate the partial dependence for latitude, median_income and their joint value.}

Plot the curve for the sum and joint. In the case of Interaction, the joint plot will move very steeply as compared to the sum of the two features.

```
plt.plot(par_1_pred+par_7_pred)
plt.plot(joint_pred)
```

This shows an interaction effect between Latitude and median_income. Let's do the same exercise for latitude and total_rooms.

In this case, both the lines have a similar slope. This shows an absence of interaction between the two.

The good news is that you don't need to do all this complex exercise. Scikit-Learn provides you with a simple function to plot the two-way **partial_dependence plot.**

```
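# Note: plot_partial_dependence was removed in scikit-learn 1.2; on newer releases
# use PartialDependenceDisplay.from_estimator(model_reg, x_train, features) instead.
from sklearn.inspection import plot_partial_dependence
features = [(1, 7)] # latitude and median_income
plot_partial_dependence(model_reg, x_train, features, n_jobs=-1)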
```

If you read the above plot, at a fixed latitude the target moves exponentially (*the coloured zone*) with **median_income**. This indicates the presence of Interaction.

As you would expect, the partial dependence plot for **latitude** and **total_rooms** will not show a similar pattern.

You might not always need this knowledge, but when the sample and feature volume is not too big and interpretation is a key need, you will need this important concept.

Be mindful that this exercise requires a lot of computation, so this must be considered.

The focus of this post was the interaction effect, so we excluded the single-feature vs. target **partial dependence plot.** You may plot that too using **Scikit-learn** and understand how a single feature affects the target independent of the other features.

**Read these to dive deep**

Hello, AI Enthusiast! In this post, we will learn about another handy but not readily available Machine Learning technique, *i.e. Semi-supervised Learning.*
The foundation of this concept is a family of approaches whose goal is to solve one of the most prevalent challenges of Machine Learning, *i.e. the availability of Labelled data*.
There are multiple terminologies around this idea, *e.g. Semi-supervised, Weakly-supervised, Active Learning etc.* So first of all we will define these.

Different approaches have been depicted in the above image. If we try to infer from the depiction, we can say that *Semi-supervised, Weakly-supervised and Active learning* are approaches to get supervised modelling done with a lesser cost and effort of data labelling.

^{Supervised Learning - This is the de facto ML technique. In this approach, we use 100% accurately labelled data. Of course, labels can be inaccurate, but we assume they are not. A typical ML algorithm, e.g. Regression, is an example of this type of Learning.}

^{Un-supervised Learning - In this approach, we train the model with unlabelled data. ML algorithms such as Clustering and Anomaly detection are examples of this type of Learning.}

**Semi-supervised Learning** - *This is the topic of this post. In this type of learning, we use both Supervised and Un-supervised learning. With the Un-supervised approach, we try to label the unlabelled dataset using its proximity (not limited to just one technique) to the labelled dataset's classes. Then we apply Supervised Learning to the full dataset.*
**The key assumption is** - *Points that are nearby or share the same Cluster belong to the same Class. The data points as shown in the below depiction might not work in a typical Semi-supervised approach.*

^{Weakly-supervised Learning - This itself is a family of approaches. The main idea is not to rely on complete and accurate Labels. So, Labels can be inaccurate too. Semi-supervised is then a special case of this approach. "....Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labelling large amounts of training data in a supervised learning setting..." [ Wikipedia ]. You may read the paper "A brief introduction to weakly supervised learning" to dive deep into the approaches.}

^{Active Learning - This is not a new learning approach. This is a way to boost the Semi-supervised or Weakly-supervised approach using the input of Subject Matter Experts in the most optimized way. The underlying idea is that rather than asking the domain experts to label all the data, we better ask them to label the most difficult (for the Machine) ones. For the human, both cases will take almost the same effort. Check the contrastive examples of Movie-Reviews in this depiction.}

We will use the MNIST digit dataset Link.

We have used this smaller version instead of the full MNIST (784 Features) to avoid the dimensionality reduction and extra computation, but the concept remains the same and will work on the full dataset too.

**Our Approach will be -**

- *Build a Classifier with 75% of the dataset to get a baseline score on the 25% test dataset*
- *Build another Classifier assuming we have only 5% of the data as Labelled and get the score on this. Our goal is to make this score almost equal to the initial baseline using Semi-supervised and Active Learning.*
- *Build a KMeans Cluster with all the data. Assign the CLASS of each Cluster's centre to all the points in that Cluster [Semi-supervised]. Get the new score*
- *Pick the incorrect ~7.5% of the records and get their correct CLASS [Active-Learning]. We already have the class but let's assume a Human will do so.*
- *Get the final score*

```
# skipping the data load part
from sklearn.model_selection import train_test_split
x_train,x_test, y_train,y_test = train_test_split(x, y, test_size=.25, random_state=1)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=6,min_samples_leaf=5)
model.fit(x_train, y_train)
model.score(x_test, y_test)
```

Result^{0.96}^{This is our First baseline score}

Now, get 5% of the training data and assume the other 95% as unlabelled. Calculate the new score. *Remember, we will always predict the score on initial test data which will remain untouched through-out*.

```
x_train_label,x_train_unlabel, y_train_label, y_train_unlabel = \
train_test_split(x_train, y_train, test_size=.95,stratify=y_train, random_state=2)
model = RandomForestClassifier(n_estimators=20, max_depth=5,min_samples_leaf=2)
model.fit(x_train_label, y_train_label)
model.score(x_test, y_test)
```

Result^{0.76}^{This is our baseline with 5% labelled data}

Now, Let's build a KMeans Cluster for all the training data. We will assume we don't have labels for these data.

```
from sklearn.cluster import KMeans

def create_cluster(k=2):
    # Fit KMeans on all the training data, ignoring the labels
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x_train)
    return kmeans

kmeans = create_cluster(k=50)
```

Code-explanation^{K=50 is based on prior knowledge. Otherwise, you will have to create a plot of K vs. silhouette_score (or use the elbow method) to choose K}

Let's do the Semi-supervised part. We will get the centre of each Cluster, then predict the Digit CLASS for each centre and assign that CLASS to all the members of the Cluster.

```
centroid_clusr = kmeans.predict(kmeans.cluster_centers_) #1
centroid_digit = model.predict(kmeans.cluster_centers_) #2
cluster_digit_map = dict(zip(centroid_clusr,centroid_digit)) #3
y_pred_unlabel_clust = kmeans.predict(x_train_unlabel) #4
y_pred_unlabel_digit = list(map(lambda x: cluster_digit_map[x], y_pred_unlabel_clust)) #5
x_train_new = pd.concat([x_train_label,x_train_unlabel], axis=0) #6
y_train_new = np.concatenate([y_train_label, y_pred_unlabel_digit])
model = RandomForestClassifier(max_depth=7, min_samples_leaf=6)
model.fit(x_train_new, y_train_new)
model.score(x_test, y_test)
```

Code-explanation^{#1 - Get the Cluster Id for all the centers}^{#2 - Get the Digit Class for all the Cluster centers}^{#3 - Create a dictionary mapping for above two}^{#4 - Predict the Cluster Id for our unlabelled data i.e the 95%}^{#5 - Get the Digit CLASS for the above using the map. This will be the "Y" of our unlabelled data i.e the 95%}^{#6 - Concatenate the unlabelled to the Labelled}^{Get the score on the above data as our improved score i.e. Semi-Supervised learning.}

Result^{0.8688888888888889 }

Let's do the Active Learning step. We will infuse the correct label for the worst ~100 records of the unlabelled dataset, i.e. the 95%. Ideally, we would get this done by an expert, but since we already have all the labels, we can simply pick them from y_train.

```
unlabel_proba = np.max(model.predict_proba(x_train_unlabel), axis=-1) #1
index_list = []
for idx, elem in enumerate(y_train_unlabel): #2
    if elem != y_pred_unlabel_digit[idx] and unlabel_proba[idx] > 0.425: #3
        index_list.append(idx)
        y_pred_unlabel_digit[idx] = elem #4
x_train_new = pd.concat([x_train_label,x_train_unlabel], axis=0)
y_train_new = np.concatenate([y_train_label, y_pred_unlabel_digit])
model = RandomForestClassifier(max_depth=6, min_samples_leaf=5)
model.fit(x_train_new, y_train_new)
model.score(x_test, y_test)
```

Code-explanation^{#1 - Get the probabilities of unlabelled dataset}^{#2 - Loop on the true unlabelled dataset(y)}^{#3 - Find those which were predicted incorrectly with a high confidence}^{#3 - 0.425 was chosen because we wanted ~100 records}^{#4 - push the correct label into our unlabelled dataset}^{Get the score on the above data as our further improved score i.e. adding Active learning.}

Result^{0.9355555555555556 }

Great, we reached almost 94% with only 5% of initial data and then another 7.5% as part of Active learning infusion. We didn't try too much on parameter tuning.
We must be mindful of the fact that the above percentage, choice of Clustering technique etc. will differ according to the dataset.
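In the exercise above, the records to correct were found by comparing against the true labels, which a real project would not have. A commonly used alternative is uncertainty sampling, i.e. sending the least confident predictions to the expert. The sketch below is only an illustration of that idea, not the method used in this post; the function name `least_confident_indices` and the toy probabilities are hypothetical.

```python
def least_confident_indices(probabilities, k):
    """Return the indices of the k rows whose top class probability is lowest."""
    confidence = [max(row) for row in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]

# Hypothetical class probabilities for four unlabelled records
proba = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
to_label = least_confident_indices(proba, k=2)
print(to_label)
```

With a fitted classifier, `probabilities` would be the output of `predict_proba` on the unlabelled records.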

You may try the exercise,

- On the full MNIST dataset
- Visualize the 3 sets of digits, *i.e. all, labelled, labelled in the last step*, with t-SNE

Artificial Intelligence (AI) and Machine Learning (ML) are among the most important areas in the current Technology space. AI is touching the lives of all of us in one way or the other, *e.g. Search Engine, Conversational AI, Image segmentation.*
If you are a beginner or just curious about ML and what all the related terminologies mean, this is the post for you. In some cases, it will help mid-level practitioners too, because we have met people who are in this space but with an ambiguous understanding of multiple terms.

So, let's start learning. We will start with the most Umbrella term i.e. AI

Artificial Intelligence is a general idea to attain human-level intelligence using a machine. It is the pursuit of humans to match human-level intelligence. We are still very far from the goal.
But be mindful that it is not just about *Driverless cars or Image captioning*. These may be the best contemporary results of our time, but this pursuit is very old.
*It goes back to 1950, when Alan Turing devised the idea of the Turing test.*

^{In the 1940s and 50s, a handful of scientists from a variety of fields (mathematics, psychology, engineering, economics and political science) began to discuss the possibility of creating an artificial brain. The field of artificial intelligence research was founded as an academic discipline in 1956. [Wikipedia]}

The initial approach to build logic was based on Rules. It is still a common approach in traditional programming. A limitation of this approach is that it is not automatic *i.e. can only be done using the knowledge of Domain experts.*

Machine Learning is the approach where data is used to develop the program. We are relying on our Algorithm to learn the underlying pattern of the data and define all the rules(*too complex to call it a rule but we may say so for the sake of simplicity*)

Check the below depiction to see how ML and Traditional programming differ. The second image states that ML learns the pattern from one set of data and can then predict the pattern of future data.

Deep Learning is the recent state-of-the-art (*Wikipedia*) approach using deep neural networks, which enables us to analyse complex data, *e.g. Images/Text*.

A Neural Network is a special kind of Algorithm whose initial motivation was based on the human brain's internal design. It is definitely the current best approach, but we never know what will be the go-to algorithm in the next 5 years. This is a very active area of research.

In a very simple form, data can have two aspects, i.e. the data itself and its labelling, e.g.

- Tweets (Data) and their sentiment, *i.e. Happy, Neutral (Label)*
- Image (Data) and the object in the image, *i.e. Cat/Dog (Label)*
- Credit card transactions (Data) and their categorization, *i.e. Normal/Fraudulent (Label)*

So, supervised learning is the class of ML where we input both the Data and the Label to the Model. In this class, we can predict the Class of an Image, *i.e. Cat/Dog, or may try to predict the expected Sales of the future.*

Unsupervised learning is the other class of ML learning when we only input the Data to the Model. In this class, we can't tell anything specific about the class but we may say that a particular record doesn't look very similar to the rest of the data *e.g. Credit card fraud.*

^{But why look beyond Supervised learning? Supervised learning looks very intuitive and it is definitely easier to implement and infer. But getting a large amount of Labelled data is an expensive task because labelling requires human effort. For example, you may get a large amount of image data from the internet, but labelling it will need a lot of effort.}

Unsupervised learning can be used standalone or it can be blended with Supervised Learning to achieve a variety of tasks. Check our blog on this i.e. Anomaly/Novelty.

Reinforcement Learning is a level ahead of the two. There are scenarios where we can't even have unlabelled data, *e.g. how will we generate the data for road scenarios to train a driverless car, or how do we get data to train a model to play Chess*.

In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that will maximize its expected rewards over time.
*For example, we may create a dummy chessboard(environment) and the agent(RL software) to play the game. It will be rewarded for winning the game per rule and penalized for losing.*

A model is a mathematical representation of the underlying data. But this term is used very casually in this field.

A model has two states

- **When it is untrained**, at this point it is just a mathematical equation without actual values, e.g. $$ salary = \alpha*AGE + \beta*EDUCATION $$
- **When it is trained with the data**, at this point the mathematical equation has learnt the underlying values, e.g. $$ salary = 1.5*AGE + 5*EDUCATION $$

These values of the Model are called parameters. This was a very simple example; not all models are like this. In fact, this is one of the simplest models.
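The move from the untrained to the trained state can be sketched in a few lines. This is only an illustration under stated assumptions: the ages, education years and the helper `fit_two_parameters` are made up, the salaries are generated from the trained equation above, and the fit uses plain least squares without an intercept.

```python
def fit_two_parameters(ages, educations, salaries):
    """Least-squares fit of salary = alpha*AGE + beta*EDUCATION (no intercept)."""
    s_aa = sum(a * a for a in ages)
    s_ee = sum(e * e for e in educations)
    s_ae = sum(a * e for a, e in zip(ages, educations))
    s_ay = sum(a * y for a, y in zip(ages, salaries))
    s_ey = sum(e * y for e, y in zip(educations, salaries))
    det = s_aa * s_ee - s_ae ** 2            # zero only if AGE and EDUCATION are collinear
    alpha = (s_ay * s_ee - s_ae * s_ey) / det
    beta = (s_aa * s_ey - s_ae * s_ay) / det
    return alpha, beta

# Hypothetical data generated from salary = 1.5*AGE + 5*EDUCATION
ages = [25, 30, 35, 40, 45]
educations = [12, 16, 14, 18, 20]
salaries = [1.5 * a + 5 * e for a, e in zip(ages, educations)]

alpha, beta = fit_two_parameters(ages, educations, salaries)
print(alpha, beta)
```

Before the fit, `alpha` and `beta` are just symbols; after the fit they hold the values learnt from the data, i.e. the parameters.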

Data Scientist is a heavily overused term in this field.

If I have to give you an analogy from Cricket, it is like an All-rounder who must be decent in multiple aspects of cricket, *e.g. Batting, Bowling, Fielding*.

Let's see the life-cycle of data.
In the beginning, it needs Data engineering and in the end, it needs ML engineering.
A Data Scientist should be able to own the basic tasks of both of these and, in addition, take complete ownership of Data modelling.

Needless to say, it all depends on the requirement of the Organization. If it's a large project, then it may have a separate associate for each role, but a small project might expect people to own overlapping roles.

This was all for beginners, in the next section of the post we will extend the story. Moving on this story will bring more technical terms *i.e. Parametric models or Non-parametric model.*
We will also try to figure out how a Model learns the underlying data in a layman's term.

This post was written on Quora as an Answer by us, so we thought to make it a formal post here at our Official Channel. *3 Months will be a little less to gain deep understanding but good to complete a full Iteration.*
**A key challenge with ML** is that it touches multiple disciplines. At the same time, you don't need to complete 100% of all of them in a serial approach.
**Resources are freely available** but making a collage out of all of them is the actual skill. This overflow of information can sometimes become a challenge too, as it confuses a new entrant.

But be clear that it's a difficult task at hand. Don't believe what you read in click-bait blogs, *i.e. you can learn in 1-2 months*. So, start with a tough mindset and be ready to show your perseverance.
*Below is the high-level roadmap for the same*

**Objective -** Get your hands wet with Python. Python is at the core of AI and ML. Most of the leading Frameworks are based on Python.

1. Complete 4 chapters from the Python official docs *Link*

With this, we are good with Python fundamentals.

2. Practice in Google Colab *Link*

It's easy and convenient to code in the Cloud. It has a quick start-up guide.

**Objective -** Understand Python Data structures, esp. List.
Then grasp the two most important Libraries, *i.e. Numpy and Pandas*. Make sure you can *Slice arrays, do List comprehension, and do CRUD with a Pandas DataFrame*. Though we will need a bit more than that, you may come back as and when needed.

1. Follow Python for Data Analysis by Wes McKinney *Paid*

This is the best content curated for ML (Python). It ranges from very simple to advanced stuff. Complete chapters #1 to #5. It's ok to skip any heavy stuff if you feel so.

2. Python Data Science Handbook by Jake VanderPlas *Link*

This is a to-the-point book for Data Science. You may also use it as the Free alternative to the above. Complete chapters #1 to #3.

**Objective -** Understand Matplotlib and Seaborn. Skim the needed Statistics.
Use the above two to do basic data exploration. It is a good practice to try to figure out the possible patterns and exceptions of the Dataset. Keep in mind that EDA is more of an art than a science, so you will learn it throughout the journey. Also, none of the books covers it explicitly as a chapter.

1. Python Data Science Handbook by Jake VanderPlas *Link*

Complete chapter #4 for Matplotlib and Seaborn. With this, you should have all the required hands-on Plotting.

2. Quickly review Statistics and EDA concepts [chapter #1 to #4] *Link*

This is more about a theoretical understanding of EDA.

3. Another blog to grasp EDA *Link*

This is more directly related to practical EDA.

4. Code

Most of the EDA-related code is either plotting or simple Statistical functions. You can easily do that using Pandas and Seaborn.

**Objective** - *Now you are ready to learn the Models.
Simply follow this one book and you will be pretty good with theory and Scikit-Learn coding. As far as datasets are concerned, simply follow what the book has used.*

1. Hands-On Machine Learning by Aurélien Géron *Paid*

This is one of the best books for ML and DL. It's not that other books *[See the list at the end]* are not equally good, but the way it has accommodated Machine Learning as well as Deep Learning and explained almost every inner working and trick is amazing. At the same time, it has more than ample coding examples.

Complete Part-I of the book, i.e. chapter #1 to #9.

2. Andrew NG course on ML *Link*

This was a revolutionary course in ML and inspired millions to learn. Follow it as a reference to understand any concept. It will not cover all of the topics, e.g. Ensemble modelling. Also, it is not in Python. But it will be the best resource to grasp the underlying concepts.

*Objective -* Now you are ready to apply your learning to real-world problems.
But don't assume that you will simply start solving everything. There are still concepts that can only be learned by practising, *e.g. quickly manipulating a Dataset using Pandas*. Let's list what we can do on this and also list what still remains :-)

These are the specific cases that are still not covered. But you can learn them in parallel with your practice or after this 3-Month window. By this time you will have a decent experience of the core things, so you can also figure out the needed approach.

a. *Time series data*

This type of data needs some special treatment. Jason Brownlee has many good posts on TS data. You can learn as and when you need it. *Link*

b. *Handling Imbalanced Data*

This is just a special case of data. You can check any good blog or the last link, i.e. Machine Learning Mastery.

c. *Recommendation System*

d. *Feature engineering*

It's not that you have not done FE till now, but as I said earlier, it is more of an Art than a Science. You will learn along the journey. See the reference [#1] at the end for a great read.

e. *Post modeling activities*

This task is a separate learning domain. You can follow a recent book by Andriy Burkov *Link* and in parallel explore more on the same.

f. *Observe and Practice*

Try spending some time with some of the notebooks on Kaggle. Focus on the notebooks which are around EDA, Feature engineering, or the topic you need. Avoid the Notebooks which claim "How I top", "reached top 0.1%", etc. for the time being.

Try to practice on datasets which teach different concepts, e.g. Too many features, Lot of NaNs, Imbalance, High volume, very small samples, etc.

a. UCI ML repository *Link*

Search for a dataset as per your need, *i.e. size, CAT columns, Feature count, Classification, Regression, etc.*

b. Another dataset repository *Link*

Search for a dataset as per your need, *i.e. size, CAT columns, Feature count, Classification, Regression, etc.*

c. Kaggle - Advance regression *Link*

d. Kaggle - Credit card fraud *Link*

e. Kaggle - Craigslist used Car *Link*

As we said at the start, 3 months is not ample to gain all the expertise. So, we are listing some content/books which will quench your thirst for knowledge if you want to go further. Many of these do not appear in other lists that you can find on the internet, but these are the best of the league. Use these to look up any specific topic you are interested in.

1. Feature Engineering and Selection: A Practical Approach for Predictive Models, by Max Kuhn and Kjell Johnson *Link*

If you want to dive deep into the art of Feature Engineering.

2. Support Vector Machine Succinctly by Alexandre Kowalczyk *Link*

If you want to understand how SVM and its maths work.

3. Experimental Design and Analysis, Howard J. Seltman *Link*

This is the complete book link for the reference listed in the EDA portion in the Blog. Read it to understand Probability concepts required for ML

4. STAT 501 online course by PennState Eberly College of Science *Link*

LinearRegression has many things to tell if you are looking for interpretability instead of a quick fit/prediction with Scikit-Learn. This content has all that is needed to comprehend it, i.e. Assumptions, Interpretation.

5. Understanding Imbalanced dataset, Jeremy Jordan *Link*

To quickly gain the needed knowledge into Imbalanced datasets.

6. An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani *Link*

This and the next one are your handy reference books to look for particular concepts. This one is more friendly to start the ML journey.

7. The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, Jerome Friedman *Link*

Use this to go deep into particular concepts, e.g. Tree/Ensemble/Cross-Validation etc.

8. StackExchange sites *i.e. Stats, Datascience and StackOverflow* *Link* *Link* *Link*

While you will use SO whenever you are stuck, the other two can also be treated as great reading content, i.e. some good answers to confusing questions.

9. StatQuest with Josh Starmer, Youtube Channel *Link*

Another gem of a resource if you want to learn a topic like a Kid

10. Introduction to Machine Learning for Coders By Jeremy Howard *Link*

Jeremy is a great teacher, esp. focusing on thinking beyond what you get in textbooks. Good to dive into Ensembles, esp. Random Forest, and Python code.

11. Rules of Machine Learning: Best Practices for ML Engineering by Martin Zinkevich *Link*

Words of wisdom on Best practice in ML. This will help you look into the journey holistically.

12. Interpretable Machine Learning by Christoph Molnar *Link *

Interpretability is still evolving and a big challenge in this field. This book has a good summary of related concepts. It will also teach you multiple concepts about the Model you already know.

13. Official websites and Paper reading

While Blogs/Books are great, scanning the examples given on the official websites is a good approach to learn things quickly, esp. understanding the parameters. Keep checking the Scikit-Learn and Seaborn websites. Seaborn, Scikit-learn

You should also try to read the papers, esp. the papers for RandomForest and Extremely Randomized Trees. This work will add a very different level of knowledge and perspective. Randomforest, Extremely RandomizedTree

- *Don't run for a high score*, think in terms of Robustness and Business-value
- Don't just read/watch, make sure you are coding everything by typing it yourself. The learning that you gain by solving simple errors is unmatchable. If you copy/paste, you will never see those issues, *e.g. CAT Encoding a dataset having mixed types of Features*
- Try to understand the trade-off of every model. *Don't just try to be a RandomForest/GBM Ninja* :-)
- MLOps and Post Modelling work is in itself a completely separate semester. Keep this in mind
- Maintain a balance between understanding the concept and being able to code it.
- Try to commit to a longer period of learning, ideally lifelong, but here at least another 8-9 Months.
- Know how the Dunning-Kruger effect works and be conscious of it when stuck with anything difficult. The key is to keep moving.
- Beware of proponents of short-cuts and baits, as there are millions on the internet in this field. Learning takes some time and effort.

We, at 10xAI Learning, offer Instructor-Led online courses in *Python, Machine Learning, and Deep Learning*.
Check-out our website *https://learning.10xai.co* for the details and Call/WhatsApp us for any needed information.
**A glimpse from our Course**

This is the first post of our series "*Beyond Fit and Predict*". In this series, we will learn multiple concepts that are mostly less discussed but hold very important information about the Model. We will understand all the important Machine Learning algorithms, *i.e. DecisionTree, LogisticRegression, LinearRegression, Support Vector Machines*, from a new perspective.

The prerequisite is that you have a basic understanding of the algorithm as we will not discuss "*how the algorithm works*". In this post, we will learn about **RandomForest(RF)**.

We will understand,

- *Why it is said that RandomForest never overfits*
- *What are Bagging and OOB*
- *Should we always trust the Feature Importance result of the RF algorithm*
- *How to look into individual Trees using Scikit-Learn*
- *Understanding a few of the important parameters of Scikit-Learn, e.g. max_features*

Bootstrap is a sampling approach where we resample with replacement. The advantage of this approach is that all the samples are independent. This leads to high variance among different samples, which is the key requirement when building a Variance reduction Ensembling model, *i.e. Bagging, RandomForest*. Let's see the steps

a. Select a single data point

b. Put it back

c. Select the next data point

d. Repeat

Yes, you are thinking absolutely correctly. This will result in many data points coming out multiple times and a few data points not coming out at all.

The data points which were not picked in a particular sampling round are called the Out-of-Bag sample. This gives us a readily available validation sample when we use Bootstrap. At the same time, the sample on which we train our model has a lesser number of effective data points and hence more Bias. Let's generate the same using Scikit-Learn

```
from sklearn.utils import resample
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Keep the sample size the same as the training data, as is done in Bagging
sample = resample(x_train, replace=True, n_samples=len(x_train))
oob = [elem for elem in x_train if elem not in sample]
sample
```

Result^{ [10, 6, 6, 5, 3, 3, 10, 2, 4, 8] }^{6, 3,10 are repeated. [1, 7, 9 ] are OOB}

In this way, you may generate as many samples as you want and build a RandomForest(*actually Bagging model*) from scratch using a DecisionTree model.

*What is the % of OOB data point*

You might be pondering on this question. It's roughly 37%. Why **37%?**

Read this **great SE thread** for the detailed answer (Hint - *It's a simple probability calculation*)
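As a quick check of the ~37% figure: with n draws with replacement, a given point is missed in every draw with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below compares this closed form with a small simulation; the function names, the number of rounds and the seed are arbitrary choices made for illustration.

```python
import math
import random

def expected_oob_fraction(n):
    """Probability that a given point is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n

def simulated_oob_fraction(n, rounds=200, seed=0):
    """Average share of points left out over several bootstrap rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        drawn = set(rng.choices(range(n), k=n))   # bootstrap sample of size n
        total += (n - len(drawn)) / n             # share of points never drawn
    return total / rounds

for n in (10, 100, 1000):
    print(n, round(expected_oob_fraction(n), 4), round(simulated_oob_fraction(n), 4))
print("limit 1/e =", round(math.exp(-1), 4))
```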

You can make an Ensemble of Trees based on bagged samples as we did in the last section. RandomForest goes one step ahead and adds further Randomization (*lesser correlation among Trees*) by picking a "**Random**" subset of all the Features prior to each split.

An excerpt from "*An Introduction to Statistical Learning*" [*Link*]. A very good read for all who have just started their ML journey.

^{Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split}

`max_features` is the parameter for this task in scikit-learn's `RandomForest`. You may apply these words of wisdom from the Scikit-Learn documentation (Link) for this parameter.

^{The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=None (always considering all features instead of a random subset) for regression problems, and max_features="sqrt" (using a random subset of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data)}

Needless to say, **this Randomization will further increase the Bias**.
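The per-split feature sampling described in the excerpts above can be sketched as follows. This is a simplified illustration and not scikit-learn's internal code; the helper `candidate_features` is hypothetical and only mimics the `"sqrt"`, `None` and integer settings of `max_features`.

```python
import math
import random

def candidate_features(n_features, max_features="sqrt", rng=random):
    """Pick the random subset of feature indices considered at one split."""
    if max_features == "sqrt":
        m = max(1, int(math.sqrt(n_features)))
    elif max_features is None:
        m = n_features                      # consider all the features
    else:
        m = int(max_features)
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(42)
for split in range(3):
    # A fresh subset is drawn for every split
    print(candidate_features(9, "sqrt", rng))
```

A smaller subset makes the Trees less alike (lower variance of the Forest), at the cost of each Tree having fewer good splits to choose from (higher bias).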

Now you have built your RandomForest and want to use its Feature importance capability.

Let us do a quick recap of the concept. Feature importance tells us about the importance of each feature based on the Model's perception while fitting itself on the training data.

It's based on Impurity reduction, *i.e. how clean the split is when a particular feature is picked for the split*. This is aggregated over all the splits and then over all the Trees (*in the case of an Ensembled model*)

Here we will not code the FI calculation but will understand a few of its caveats so that we are better aware while using it.

- **Correlated Features** - This is a common issue for most of the approaches to calculating FI (*i.e. this exists with LinearRegression's coefficients too*). If your data has two correlated Features, then both will actually share the FI and you might get an incorrect perception while reviewing the FI ranking.

  Let's say we have 4 Features and the FI scores are - *0.5, 0.3, 0.1, 0.1*

  To this data, we add a new feature that is highly correlated to the first feature. What it means is that wherever the Training process split on *Feature#1* while building the previous model, now it can split on either *Feature#1* or *Feature#2*, since both the Features are correlated and hence will have similar split quality.

  So, the model can pick *Feature#1* and *Feature#2* with almost equal probability, *i.e. ~50% of the time*, and hence the total share of impurity dip will be shared between both of the features.

  So, the new FI scores will be - *0.25, 0.25, 0.3, 0.1, 0.1*

  It means now *Feature#3 becomes the most important feature*, which might be deceiving in most cases.

- **Biased towards high Cardinality feature** - This is another subtle issue that might deceive your understanding of the FI score, and this issue is not as simple to comprehend as the previous one. The point is that the way the FI score is calculated in the Gini Impurity reduction approach favours features which have a high Cardinality over features which do not.

Check the examples in this Bog post*Beware Default Random Forest Importances*,**Why it happens**^{Features with more potential cut-points are more likely to produce a good criterion value by chance. As a result, variables of higher cardinality are more likely to be chosen for splitting than those of lower cardinality. But this will not happen in initial splits as the Important feature will have cut-offs which will reduce the Impurity significantly not possible for a feature split by Random chance. This tendency dominates when the sample size reduces i.e. after few splits}^{Check this excerpt from Understanding Random Forests: From Theory to Practice - Sec. 7.2}^{".....the origin of the problem stems from the fact that node impurity is misestimated when the size Nt of the node samples is too small. To a large extent, the issue is aggravated by the fact that trees are fully developed by default, making impurity terms collected near the leaves usually unreliable."}
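The cardinality bias can be seen with a small experiment. This is only a sketch on synthetic data (the feature names, sample size and 20% label noise are assumptions made for illustration): two features are pure noise, one with many unique values and one binary, and we compare the impurity-based importance each receives.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)            # genuinely predictive feature
noise_high_card = rng.normal(size=n)        # pure noise, ~n unique values
noise_binary = rng.integers(0, 2, size=n)   # pure noise, only 2 unique values

# Target depends only on the informative feature; 20% of the labels are flipped
# so that fully grown Trees keep splitting on small, noisy nodes.
y = (informative > 0).astype(int)
flip = rng.random(n) < 0.2
y[flip] = 1 - y[flip]

X = np.column_stack([informative, noise_high_card, noise_binary])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

names = ["informative", "noise_high_card", "noise_binary"]
importances = dict(zip(names, model.feature_importances_))
for name, value in importances.items():
    print(f"{name:16s} {value:.3f}")
```

Neither noise feature carries any signal, so any gap between their scores comes from the number of available cut-points rather than from real predictive power.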

The question of whether RandomForest never overfits surfaces quite regularly on discussion forums, since something similar was claimed by its inventor **Leo Breiman**.

This is what the author said in the paper [*Link*]

^{ Random forests are an effective tool in prediction. Because of the Law of Large Numbers, they do not overfit}

**But this is neither the case, nor what the Author meant**. Let's review another point from the paper,

^{For random forests, an upper bound can be derived for the generalization error in terms of two parameters that are measures of how accurate the individual classifiers are and of the dependence between them. The interplay between these two gives the foundation for understanding the workings of random forests }

As we discussed, one of the key pillars is the independence of the different Trees. The lower the correlation, the better the Forest. The second point simply means that if the individual Trees are not doing a great job *e.g. Overfitting*, neither will the Forest.

*Hence, RF will definitely overfit if the individual fitted Trees do so.*

This is a simple exercise that you can easily try with a small piece of code. You should!
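One possible way to try it is sketched below. The synthetic dataset, the 30% label noise (`flip_y=0.3`) and the `max_depth=3` comparison are assumptions chosen for illustration, not a benchmark. With noisy labels, a near-perfect train score can only come from memorising the noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y=0.3 randomly reassigns 30% of the labels, i.e. injects label noise
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Default Forest: every Tree is grown until its leaves are pure
deep = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Same Forest, but each Tree is restricted to a depth of 3
shallow = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0).fit(X_train, y_train)

deep_gap = deep.score(X_train, y_train) - deep.score(X_test, y_test)
shallow_gap = shallow.score(X_train, y_train) - shallow.score(X_test, y_test)
print("train-test gap, fully grown Trees:", round(deep_gap, 3))
print("train-test gap, max_depth=3 Trees:", round(shallow_gap, 3))
```

The gap between train and test accuracy is the quantity to watch: when the individual Trees are allowed to overfit, the Forest built from them shows the gap too.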

We, at 10xAI, believe in balanced learning across the axes of *Theory-Code and Width-Depth*. So, let's review some code-related aspects of RandomForest.

- *Get all the Trees* - If you want, you may individually review all the Trees *i.e. their prediction for individual data points*. Using that, you may be able to figure out the confidence of the RF for a particular prediction. This is discussed in *Jeremy Howard's* book Deep Learning for Coders too. You can get the individual trained Trees using the *model.estimators_* attribute.

```
import numpy as np, pandas as pd
from sklearn.metrics import accuracy_score
# Score - Individual Tree on Full test Data
tree_score = np.array([accuracy_score(y_test, tree.predict(x_test.to_numpy())) for tree in model.estimators_]) #1
# Probability - Individual Tree on Single Test data (data points in the outer loop, Trees in the inner loop)
data_tree_proba = np.array([tree.predict_proba(data.reshape(1, -1)) for data in x_test.to_numpy() for tree in model.estimators_]) #2
# To a DataFrame - one row per data point, one column per Tree, holding the probability of the first class
data_tree_proba = data_tree_proba[:,:,0].reshape(-1,model.n_estimators)
data_tree_proba = pd.DataFrame(data_tree_proba,columns=["col_"+str(i+1) for i in range(model.n_estimators)])
```

**Code-explanation**^{#1 - A simple List comprehension on all the Trees to predict the score on test data}^{#2 - A nested List comprehension on each test data point and all the Trees to predict the class probability}

- *Get OOB score* - The model will calculate the OOB score for the training dataset when we set the flag "oob_score=True" while initializing the Model. Then we can get the value using the "oob_score_" attribute.

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0, oob_score=True)
clf.fit(X, y)
clf.oob_score_
```

- *Build Tree in parallel* - The key advantage of the Bagging based approach over the Boosting based approach is that in the former, Trees are built independently and hence can be built in parallel. In scikit-learn, you just need to set "n_jobs=-1". Below is the code snippet on a 2-Core machine.

```
clf = RandomForestClassifier(n_estimators=500)
%timeit clf.fit(X, y)
# 1 loop, best of 3: 16 s per loop
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
%timeit clf.fit(X, y)
# 1 loop, best of 3: 10.1 s per loop
```

We are sure that you now have an in-depth understanding of RandomForest, not just in terms of how it works but also of many of its inner intricacies. You understand the key parameters and attributes of Scikit-Learn's RandomForest very well.

We suggest that you read the complementary user guide of Scikit-Learn models e.g. [This]. These are quite informative and also contain links to the original papers or the best references.

Now you have all the required knowledge to answer almost any question regarding RandomForest and bagging. *We will end this post with a question :-)*


^{"....Good values for max_features are - 1. "all features" for regression problems, and 2. "sqrt(n_features)" for classification tasks.....". So the question is - Why doesn't the random sampling of Features work for the Regression task? }

This is the second post of the Outlier series. We will learn how to catch the Outliers Or Anomalous data points. Don't worry about the terminology, we have covered that in the first part.

In the process, we will develop a high-level understanding of the 4 key algorithms for the purpose and also understand the working of these algorithms. We will also learn to program an Anomaly detector using these algorithms.

We have already learnt [Check the part-I here ],

- Clarifying the *taxonomies* around Anomaly detection
- Understanding and applying *GaussianMixtureModel*

In this post, we will learn,

- Understanding and applying *IsolationForest*
- Understanding and applying **LocalOutlierFactor**
- Using the reconstruction error of *PCA*

*How it works*

In this model, we build a RandomForest of DecisionTrees. Each DecisionTree is built on Random splits and Random Feature selection.
If we build a tree following the above approach, an Outlier that is sitting very far from the normal data points is likely to be split off early in the full tree. This will not happen for every Tree, but it should average out when a decent number of Trees are used in a RandomForest. So, we can simply count the depth required for a point to get isolated. An outlier will have a very small depth.
*Check the depiction below - *

Be mindful of the fact that this distinction is possible only after averaging the split over many DecisionTree. With a single DecisionTree, it can't be guaranteed.

Excerpts from the Scikit-learn documents^{The IsolationForest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splitting required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.}

*Let's see the Code*

By design, scikit-learn's IsolationForest is for Outlier detection rather than Novelty detection, as it uses a parameter *i.e. contamination* to identify the Threshold for Outliers and its predict method simply outputs the class *i.e. Normal vs Outlier*. But we will use it the way we did with GMM.

```
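from sklearn.ensemble import IsolationForest
# Fit on the Inliers only; contamination sets the share of training points flagged as Outliers
model = IsolationForest(random_state=0, contamination=0.025).fit(x_train) #1
# Normal test points predicted as -1 are False Positives
y_pred = model.predict(x_test_normal) #2
false_positive = sum(y_pred==-1)
# Outlier test points predicted as -1 are True Positives
y_pred = model.predict(x_test_outlier)
true_positive = sum(y_pred==-1)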
```

Code-explanation^{#1 - Kept contamination= 0.025}^{#2 - Predict function results in +1, -1 for Normal and Outlier respectively}

Result^{Outliers - All 2203 out of 2203 identified}^{Normal - 67 out of 2500 incorrectly identified}

*How it works*

This model uses a KNN approach to find out the density of data around each data point. If a data point is Normal then its density will be similar to that of its Neighbours, since Normal points are neighbours to each other. An Outlier, on the other hand, has a much lower density than its Neighbours. *Let's see this depiction -*

In the image,^{If we consider one of the Outlier as the first point then its density and the density of its neighbour i.e. the black solid point is quite different. In the case of normal data points, These densities should be similar.}

Excerpts from the Scikit-learn document^{The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of the density of a given sample with respect to its neighbours. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighbourhood. More precisely, the locality is given by k-nearest neighbours, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbours, one can identify samples that have a substantially lower density than their neighbours. These are considered outliers.}

**Let's see the Code**

```
from sklearn.neighbors import LocalOutlierFactor
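# novelty=True is needed to call predict on new, unseen data
model = LocalOutlierFactor(n_neighbors=10,novelty=True).fit(x_train)
# Normal test points predicted as -1 are False Positives
y_pred = model.predict(x_test_normal)
false_positive = sum(y_pred==-1)
# Outlier test points predicted as -1 are True Positives
y_pred = model.predict(x_test_outlier)
true_positive = sum(y_pred==-1)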
```

Code-explanation^{Very similar to the IsolationForest, but we have to use the parameter "novelty" to get the +1, -1 output from predict on new data. Also, we have not tuned K (n_neighbors) here, but you can try. Lastly, this model is computationally very inefficient}

Result^{Outliers - All 2203 out of 2203 identified}^{Normal - 100 out of 2500 incorrectly identified}

*How it works*

In this approach, we use the reconstruction error of PCA.
If we reduce the dimension of a dataset using PCA and then try to revert it back to the initial dimension, there will be a reconstruction error for all the points. The dataset has lost the information along the dimensions which were removed while reducing the dimension. Since we expect an Outlier to lie far away from the key principal components [*Actually that is the reason it is an Outlier*], it will have a high reconstruction error as compared to Inliers.
*Check the depiction below*

If you observe, the Outlier points have a larger reconstruction error as compared to the average error of the Inlier data points. *This is the information that we can utilize to catch the Outliers.*

*Let's see the Code*

We will not do the full coding for this approach, but here are the required steps.

- Reduce the dimension of the data. This will need some tuning to find the best *n_components*.
- Reconstruct the data points. Use *scikit-learn PCA's inverse_transform* function.
- Calculate the error *e.g. Cosine similarity, Euclidean distance. Use the metrics.pairwise module.* It has all the required functions.
- Define the Threshold *e.g. treat roughly the largest 2.5% of errors as Outliers*.
- Calculate the error for new data and map the Class based on the Threshold.
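The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the full solution for the KDD data: the data is synthetic (Inliers near a line in 3-D), the error is the Euclidean distance computed with NumPy rather than the metrics.pairwise module, and the Threshold is taken as the 97.5th percentile of the training errors, which is one reading of the Threshold step.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
direction = np.array([[1.0, 2.0, 3.0]])

# Synthetic stand-in data (assumption): Inliers lie close to a line in 3-D
x_train = rng.normal(size=(500, 1)) @ direction + rng.normal(scale=0.1, size=(500, 3))
x_new_normal = rng.normal(size=(100, 1)) @ direction + rng.normal(scale=0.1, size=(100, 3))
x_new_outlier = rng.normal(scale=3.0, size=(100, 3))      # scattered off the line

pca = PCA(n_components=1).fit(x_train)                     # Step 1 - reduce the dimension

def reconstruction_error(pca, x):
    x_back = pca.inverse_transform(pca.transform(x))       # Step 2 - reconstruct
    return np.linalg.norm(x - x_back, axis=1)              # Step 3 - Euclidean error per point

# Step 4 - Threshold from the training errors
threshold = np.percentile(reconstruction_error(pca, x_train), 97.5)

def predict(pca, x, threshold):
    # Step 5 - same convention as scikit-learn: -1 for Outlier, +1 for Normal
    return np.where(reconstruction_error(pca, x) > threshold, -1, 1)

false_positive = int((predict(pca, x_new_normal, threshold) == -1).sum())
true_positive = int((predict(pca, x_new_outlier, threshold) == -1).sum())
print("false positives:", false_positive, "true positives:", true_positive)
```

On real data, both `n_components` and the percentile would need tuning, ideally with the help of a few labelled points.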

The dataset was relatively easy but appropriate to explain the concept. Adding more of the KDD cup data could bring in other issues *i.e. a lot of Categorical features* and the post would get diverted.
You must try other variants of the available KDD cup dataset.
Try completing the code for the PCA approach following the mentioned steps.
You may also try these concepts on the Credit-Card-Fraud dataset available on Kaggle.
*When the Outliers and Inliers are quite close* and it is difficult to find a clear boundary, we must study the False-Positives and False-Negatives and try to come up with new Features. The point is that Exploratory Data analysis is always required and we must invest time in it.

*Finding the Threshold* - we followed a completely unsupervised approach, but since we had the Labels for our dataset, we may work in a semi-supervised way to figure out the appropriate Threshold that gives a minimum of False-Positives and False-Negatives.

*Categorical (Nominal)* features are part of almost every dataset. These features need a separate step while pre-processing the data *i.e. Encoding.*

*Key reasons for this encoding are -*

- Almost every Library implementation needs Features in numeric form even if the interpretation is Nominal
- Almost every Algorithm works only with Continuous features *(Tree, Naive Bayes can be a few exceptions)*
- You may achieve a better result and/or a smaller feature set with an appropriate encoding

When you try to encode, you may face any of these situations -

- High Cardinality of the Feature *e.g. Zipcode*
- Feature gets a new value in test data
- Feature values are Ordered *e.g. Size of apparels (XS, S, M, L, XL etc.)*
- Features may be Cyclic *e.g. Days of the Month*

Based on the above need and list of scenarios, we can have multiple ways to deal with all. Let's dive into each one.

Label encoding is the minimum level of encoding that we can perform. It only solves the first problem *i.e. Satisfy the Frameworks requirement*
This is simply mapping the Feature value to numbers *i.e. 1-N*

*e.g. [Cat, Dog, Elephant ]* > *{ cat :* **1**, *dog* : **2**, *elephant :* **3** *}*

Challenge with Label Encoding^{While this has satisfied the Framework, it has created something which will deceive many Models e.g. LinearRegression, Neural Networks. The simple reason is that these models will treat the feature as continuous and use the information that "2 times Cat is a Dog etc.", which is an incorrect pattern to learn.}

This is something that we do not want! So, let's move to *One-Hot Encoding*.

*OneHotEncoding* is a technique where we split the Feature into one binary feature for each of the values of the feature in the main dataset.

*e.g. [Cat, Dog, Elephant ] > { cat : [1, 0, 0], dog : [0, 1, 0], elephant : [0, 0, 1] }*

*One-Hot implies that exactly one of the bits is Hot (1).*

Image explanation^{Each value becomes a new Feature indicating the presence of that value i.e. Is_CAT==1 implies that the value of the original feature is CAT for that data point }

**Interpretation of OHE feature**

Unlike *LabelEncoder*, which doesn't make guaranteed sense to all the Models, *OHE* can guide the Model to deduce some pattern in the direction of the right Target.

If we assume the underlying Model is *LinearRegression,* the Coefficient of *Is_Cat* will indicate the change in the mean output when the Feature changes from *0 to 1 i.e. No_Cat to Cat (see below depiction)*

Challenge with One-Hot Encoding^{1. It will add as many dimensions as the number of unique values i.e. it can become very challenging with a Feature having high Cardinality}^{2. Not very efficient for models like Neural Networks which work better on continuous data}

These techniques are very handy when dealing with a feature with very high Cardinality

**Binary encoding** is the simple binary representation of Label encoded data

*e.g. [Cat, Dog, Elephant, Fox ] > { cat : [0, 0], dog : [0, 1], elephant : [1, 0], fox : [1, 1] }*

**BaseN** is the approach to generalize this to any base *e.g. Octal uses base 8, so each digit can take 8 values*
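A minimal sketch of the idea in plain Python is shown below. The helper `binary_encode` is a hypothetical name written for this illustration (it is not a library function); it Label-encodes the values from 0 and then writes each label as digits in the chosen base.

```python
def binary_encode(values, base=2):
    """Label-encode the values (0..N-1), then write each label as digits in `base`."""
    categories = sorted(set(values))
    labels = {cat: i for i, cat in enumerate(categories)}

    # Number of digits needed to represent the largest label
    width = 1
    while base ** width < len(categories):
        width += 1

    def digits(n):
        out = []
        for _ in range(width):
            out.append(n % base)
            n //= base
        return out[::-1]            # most significant digit first

    return {cat: digits(label) for cat, label in labels.items()}

mapping = binary_encode(["cat", "dog", "elephant", "fox"])
print(mapping)
# {'cat': [0, 0], 'dog': [0, 1], 'elephant': [1, 0], 'fox': [1, 1]}

# BaseN: the same 4 values fit in a single base-8 digit
print(binary_encode(["cat", "dog", "elephant", "fox"], base=8))
# {'cat': [0], 'dog': [1], 'elephant': [2], 'fox': [3]}
```

Four values need only 2 binary columns instead of the 4 columns OHE would create, which is where the saving for high Cardinality features comes from.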

**Hashing encoding** is done by mapping each value to a new value using a hash function.

^{[Hashing function] and related concept are quite prominent in computer science.}

These encodings have similar challenges as that of Label encoding *i.e. absence of a meaningful representation of Feature.*

It can be used with sparse data in a very high dimensional space i.e. Collaborative filtering. Check this paper *[Link].*

In a common ML problem, these techniques are not very helpful

*Count encoding* means that for a given categorical feature, replace the names of the groups with the group counts.

This encoding can work when the Feature count has a good correlation with the Target. *e.g. Country name and sample data of any disease. The more infected people a country has, the more positive samples of that country there are in the dataset.*
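With pandas, count encoding is a two-line operation. The sketch below uses a tiny made-up `country` column purely for illustration.

```python
import pandas as pd

# Hypothetical data: one row per reported case
df = pd.DataFrame({"country": ["IN", "US", "IN", "BR", "IN", "US"]})

counts = df["country"].value_counts()            # how often each value occurs
df["country_count"] = df["country"].map(counts)  # replace the name with its count
print(df)
```

The counts should be computed on the training data only. A value that first appears in the test data will map to NaN and needs a default, for example `df["country"].map(counts).fillna(0)`.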

Supervised vs Unsupervised^{Methods till now were all Unsupervised i.e. didn't involve the Label. But we can have techniques that use the Labels too}

Target encoding is a family of techniques where we use the Target in the encoding process. What it means, we are getting guidance from the Target itself.

The most simple approach can be to use the mean/median of the target for the particular value of the Feature.

The most commonly used approach is as suggested in this paper [ Link ]

It basically tells us to map the estimated probability for the value i.e.

`cat > (count of cat for Y==1) / (Total count of Cat),`

*For a regression setup, the average value should be used. In the case of multi-Class, there will be m-1 encoded features i.e. each representing the encoding for one Y(class)*

But this mapping is not used as-is, because values which have very few samples will be misrepresented in this setting.

`A prior probability is calculated = (Count of Y==1) / (Sample size)`

Then, both the mappings are blended with a parameter i.e. $$ \lambda * posterior\_estimate + (1 -\lambda) * prior\_estimate $$

*Lambda* is between [0-1] and is a function of the count of each value.
*What it means is that for values having a high count the posterior_estimate will dominate and for values having a low count the prior_estimate will dominate.*
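The blending can be sketched as below for a binary Target. The sigmoid form of lambda and its two parameters `k` and `f` are assumptions made for this illustration (one common way to map a count to a value between 0 and 1), and `target_encode` is a hypothetical helper, not a library function.

```python
import numpy as np
import pandas as pd

def target_encode(feature, target, k=20, f=10):
    """Blend each value's own Target mean with the overall Target mean."""
    prior = target.mean()                                    # estimate over all rows
    stats = target.groupby(feature).agg(["mean", "count"])   # per-value estimate and count
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))    # grows from 0 to 1 with the count
    return lam * stats["mean"] + (1.0 - lam) * prior

# Hypothetical data: "cat" is frequent (100 rows, 80% positive), "dog" is rare (2 rows, both positive)
df = pd.DataFrame({
    "animal": ["cat"] * 100 + ["dog"] * 2,
    "y": [1] * 80 + [0] * 20 + [1, 1],
})

prior = df["y"].mean()
encoding = target_encode(df["animal"], df["y"])
df["animal_encoded"] = df["animal"].map(encoding)
print("prior:", round(prior, 3))
print(encoding.round(3))
```

The frequent value keeps an encoding very close to its own mean of 0.8, while the rare value is pulled away from its raw mean of 1.0 towards the prior, which is exactly the protection against small counts described above.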

*Below table depict a target encoding for Countries on Income census dataset* [ Link ]

Target encoding has a few key benefits^{ - It only creates one dimension in the Feature set}^{ - It can work in almost every scenario i.e. Ordinal, High Cardinality, Cyclic etc.}

The risk that is attached^{to this approach is of data leakage, which may result in highly optimistic results on the train set. A suitable cross-validation approach must be used with this encoding scheme.}

Entity embedding is another Supervised technique that utilizes a Neural Network to fit the OHE version of the feature to the target and the weight of the Network is used as the encoding.

*We have a separate blog on this technique, please check it. [ Link ]*

Ordinal features are those where the values have an underlying order, esp. when that order has a similar impact on the Target *e.g. Education level*

Encoding it with a simple approach *i.e. OHE* will do the minimum work, but the Model will miss the information that is carried by the Order of the values.

As in every case, Target encoding can be tried here too.

Another approach for Ordinal encoding is using "**Orthogonal Polynomial Contrast encoding**".

Contrast encoding is a family of techniques to encode categorical features. The key difference from OHE is that it does not use [ 0 ] as a reference value. The issue with using 0 as the reference value is that the Intercept of the fitted Model becomes the mean of the Target for the reference value, and all the parameters are expressed in terms of the deviation from the reference value. Ideally, the deviation should be from the mean of all values. Read here [Link]

Polynomial contrast is the extension to add an underlying trend in the mapping. The trend can be Linear, Quadratic, Cubic Or higher level polynomial. e.g.

So, each poly degree will add one dimension. Though not necessarily all will be useful.

Individual features are orthogonal to each other. This is important so that each must add unique information as a feature. Individual values are not important. All the values just satisfy the required conditions *i.e. Following the polynomial, Orthogonal (i.e. dot product =0), Sum=0.*

The assumption here^{is that the values(factors) are equally spaced. The reason for that is because all the values are equally-spaced consecutive values on the polynomial curve. What it means is that if we encode only these 3 factors which are not consecutive i.e. XS, S, XL, this encoding approach will be incorrect.}
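The three conditions above (following the polynomial, Orthogonal, Sum=0) can be checked numerically. The sketch below is one way to build such contrasts with NumPy, assuming equally spaced levels: it orthonormalises the columns 1, x, x^2, ... with a QR decomposition. `poly_contrasts` is a hypothetical helper, and the signs of the columns may differ from what a statistics package reports.

```python
import numpy as np

def poly_contrasts(n_levels, degree):
    """Orthogonal polynomial contrasts for equally spaced, ordered levels."""
    x = np.arange(n_levels, dtype=float)
    # Columns are 1, x, x^2, ...; QR orthonormalises them from left to right
    vander = np.vander(x, N=degree + 1, increasing=True)
    q, _ = np.linalg.qr(vander)
    return q[:, 1:]                 # drop the constant column

sizes = ["XS", "S", "M", "L", "XL"]
contrasts = poly_contrasts(len(sizes), degree=3)    # linear, quadratic, cubic columns

for size, row in zip(sizes, contrasts):
    print(f"{size:3s}", np.round(row, 3))

print("dot products:\n", np.round(contrasts.T @ contrasts, 6))   # identity => orthogonal
print("column sums:", np.round(contrasts.sum(axis=0), 6))        # all zero
```

Each column adds one dimension per polynomial degree, and a model can then be fitted with only the first one or two columns if the higher degrees do not help.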

When a particular feature has too many values (factors), then using the default OHE can add a lot of dimensions, which is undesirable. You may take these approaches -

- Remove feature value which has a very small frequency
- Try Target encoding
- Try Entity embedding

When the values of the Feature have very little frequency, it means it has little importance to predict the target pattern. In such a scenario, you may,

- Club such feature into a consolidated feature
- Remove a few of these which are below a Threshold count

If a new value is encountered after training, it might break the Model if not properly handled before-hand.

One approach would be to have "Other" as a value to map such values to. Although it will not help in the pattern mapping, it can work as a "Catch" statement. Ideally, you must review such a scenario in the purview of the data domain *i.e. Why a new value arrived.*

To handle this situation in your Train-Test split, you can

- Manually verify that each Categorical feature has at least one record for each factor. If not, add one record after the split and prior to the encoding.
- Use the `categories` parameter of `scikit-learn OHE` to provide the list of all possible values beforehand
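A small sketch of the second option is shown below, with made-up data. In addition to `categories`, it sets `handle_unknown="ignore"`, which is an extra choice made here so that a value outside the provided list is encoded as all zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

all_values = [["cat", "dog", "elephant"]]      # one list of known values per encoded column
encoder = OneHotEncoder(categories=all_values, handle_unknown="ignore")

x_train = [["cat"], ["dog"], ["cat"]]          # "elephant" never appears in the train split
x_test = [["elephant"], ["fox"]]               # "fox" is a brand new value

encoder.fit(x_train)
encoded = encoder.transform(x_test).toarray()
print(encoded)
# [[0. 0. 1.]
#  [0. 0. 0.]]
```

"elephant" gets its own column even though it was absent from the train split, while the unseen "fox" falls back to a row of zeros.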

A cyclic feature is a feature whose values are connected at the end *e.g. Days of a month, here 30 is equally closer to 01 and 29.*

Hence, the encoding technique must have this pattern in the resulting mapping. *Target encoding and Entity embedding can inherently* learn this but other techniques might miss this if not done explicitly.

In this case, we can map the values to a sine and cosine curve (*in an approach similar to the polynomial degree mapping)*.

^{The only reason we need both the curve is that just having one of the two can result in the same encoding for two values e.g. sine(45)=sine(135). Read this SE Answer [ Link ]}

```
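import numpy as np
# day_of_month: array (or pandas Series) holding the day numbers 1..30
feat_sine = np.sin((2*np.pi)/30*day_of_month)
feat_cos = np.cos((2*np.pi)/30*day_of_month)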
```

The above plot is a logical depiction of how the two points will represent each day of a week. In a similar manner, we can create 30 slots for days of Month.

Although this can solve the problem of the values being cyclic, it will not necessarily be the best encoding.
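A quick check with the standard library shows both properties of this mapping, assuming a 30-day month as in the formula above: day 30 ends up equally close to day 1 and day 29, and the sine alone is ambiguous while the sine-cosine pair is not. `encode_day` and `distance` are hypothetical helpers for this illustration.

```python
import math

def encode_day(day, period=30):
    """Map a day number to a point (sine, cosine) on the unit circle."""
    angle = 2 * math.pi * day / period
    return math.sin(angle), math.cos(angle)

def distance(day_a, day_b):
    """Euclidean distance between two encoded days."""
    return math.dist(encode_day(day_a), encode_day(day_b))

# Day 30 is as close to day 1 as it is to day 29, and far from day 15
print(round(distance(30, 1), 3), round(distance(30, 29), 3), round(distance(30, 15), 3))

# Sine alone cannot tell day 5 from day 10; the cosine resolves it
print("sine  :", round(encode_day(5)[0], 3), round(encode_day(10)[0], 3))
print("cosine:", round(encode_day(5)[1], 3), round(encode_day(10)[1], 3))
```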

Using the above explanation you can easily do all the required encoding yourself. But you don't need to do so. The *category_encoders* library from `scikit-learn-contrib` already implements these encoders.

```
!pip install category_encoders # Install the Library
import category_encoders as ce # Import the Library
encoder = ce.TargetEncoder() # Create Class for Target Encoder
x_train = encoder.fit_transform(x_train, y_train) # fit_transform
x_test = encoder.transform(x_test) # transform
```

Other good things about this Library are that^{- It follows the scikit-learn standard convention for fit and transform.}^{- It handles the backend issues that you face with scikit-learn OHE e.g. not allowing float etc. It returns a clean DataFrame with all the columns properly named.}

**i) Do we need encoding for Tree-based Model**

Given the way a Tree-based model works, it does not really need OHE or any other encoding. It simply works with Label encoding. In fact, it will be a bit faster with Label encoded data.

**ii) Do we need to always map (K-1) feature during OHE for K factors**

This is again, a very frequently asked question. Please check my answer at SE. [ Link ]

When we have a very high Cardinality for a Feature or a group of Features, Entity Embedding is an elegant way to get a lower-dimensional representation of the Features.

The most simple example would be the **Latitude/Longitude** representation of a large number of **zip code** though in this case, we know this relation from our general knowledge.

^{Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space reveals the intrinsic properties of the categorical variables. [From the paper]}

Embedding is a very famous concept in text analysis [*Word embedding*], where each text is represented using a set of Numeric features and the encoding should end up learning the underlying relationship *e.g. in the embedded space, Lion and Tiger might be close to each other but a bit apart from Cow and Dog.*

Needless to say that this relationship is based on a contextual setup. The Closeness and difference will depend on the context. e.g. two zip code may be close in the context of Revenue while the same zip code may look far when the context is of reported crime.

You may read this blog on word embedding if you want to dive very deep Link

The key goal is to learn a continuous space for the discrete Feature. This will not only ease the model's training process but also it **can be reused for a similar scenario at some other place.**

The whole process is very simple. We create a Neural Network with an additional layer that acts as a mapping for the "*Feature-to-be-embedded*" to Embedding space.

The output of the embedding layer is concatenated with all the remaining Features. After this Layer, it is just like a simple Neural Network. The embedding weights are initialized with a random value and learned by backpropagation.

Take a look at the depiction in the diagram. We may create separate embeddings for each feature Or may club multiple features into one as shown in the image.

The size of the Embedding *i.e. count of Neurons is a Hyperparameter to tune.*

Now, Let's try to understand the neural network for the task.

In the diagram below, the green rectangle is the Label Encoded values of the Features which we want to embed. The Green Circle is the Embedding Layer. The blue pentagon is the concatenated layer.

The outputs of the Embedding layer will be the embedding for a particular input.

^{Image source - Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson}

Keras Embedding Layer will do all the work needed as per the above explanation. It takes the embedding size and the vocab-length as parameters. Length of Input is the number of words but in our case, every Feature value has a length of 1. This is more useful in a Text Data analysis setup where features values are sentences.

```
tf.keras.layers.Embedding(
input_dim,
output_dim,
.
.
.
input_length=None
)
```

^{input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1. output_dim: Integer. Dimension of the dense embedding. input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream}

```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
embedding_size = 25
embed_input = layers.Input(shape=(1,)) #Green Rectangle
embed_layer = layers.Embedding(max(x_train_embed)+1, embedding_size, input_length=1)(embed_input) #Green Circle (embed_layer) #1
embed_layer = layers.Flatten()(embed_layer) #2
#embed_layer = layers.Reshape(target_shape=(embedding_size,))(embed_layer)
other_input = keras.Input(shape=(x_train_other.shape[1],)) #Orange Rectangle #3
model = layers.concatenate([embed_layer, other_input], axis=-1) #Blue Pentagon
drop_rate = 0.0
hidden_count = 5
# Stack of hidden layers, each followed by Dropout
for i in range(hidden_count):
    model = layers.Dense(embedding_size+20, activation='relu')(model)
    model = layers.Dropout(rate=drop_rate)(model)
# Single linear output for the regression Target
model = layers.Dense(1, activation='linear')(model)
model = keras.Model(inputs=[embed_input, other_input], outputs=model)
model.compile(loss='mean_squared_error', metrics=['mean_squared_error'])
model.fit([x_train_embed, x_train_other], y_train,
          epochs=25, validation_split=0.2)
```

Code-explanation^{#1 - x_train_embed is the label encoded data. embedding_size is 25 for our case}^{#2 - Default output shape: (batch_size, input_length, output_dim). We flattened the 1x25 2-D since we are concatenating with a similar Input. You may use Reshape Layer too. }^{#3 -This is the Input from all other remaining Features}

The Model is trained now. Let's pull the Embedding vectors from the Network.

```
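import numpy as np
model_emb = tf.keras.Sequential()
model_emb.add(keras.layers.Input(shape=(1,)))
model_emb.add(model.layers[1]) #1
#model_emb.summary()
model_emb.predict(np.array([81])) # Embedding of the value that was label encoded as 81
# embed_col is assumed to hold the name of the embedded column; it is used as a prefix for the new columns
df = pd.DataFrame(model_emb.predict(pd.unique(x_train_embed))[:,0,:],
                  columns=[embed_col+str(i) for i in range(embedding_size)])
df_map = pd.concat([df, pd.Series(pd.unique(x_train_embed))], axis=1) #2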
```

Code-explanation^{#1 - We have created a small Network to predict the Embeddings}^{#2 - Predicted for all the values and then concatenated with the unique values to create a ready-to-use map}

It works in a context *i.e. what the Data represents and what the Target is*. The data we get can have too many Categorical values, but it is quite possible that in some unknown space these can be represented by fewer Dimensions i.e. 4 or 5 etc.

A very common example can be data related to geography e.g. Countries. They might be equivalent to their distance from the Equator Or their GDP. So with such a representation, they can simply be encoded in fewer dimensions.

Another example can be, let's say we have a Feature with a Cardinality of 12 representing each month and a Target representing a Tourist count. Definitely, the tourist count will have a smooth transition from Nov-Jan even though the Year changes. The Neural Network must learn a mapping that can represent this relation *e.g. Dec is at a similar distance to both November and January.*

**The task in hand is to find the optimum number of Dimensions that can represent the underlying pattern in the best possible manner**

For that, we rely on the Neural Network Learning process. If the loss is decreasing then the Embedding is moving in the correct direction from the starting point, which was a random value.

Let's review a few of its **limitations** first,

- Very first limitation is to use a Neural Network. Not everyone would like to bring on a new technological idea to solve an encoding challenge.
- Need tuning of the embedding dimensions
- Loss of Interpretability

Despite its limitations, the idea is very versatile. It is the foundation on which text-data analytics has achieved all its successes. In text data, it works like a pre-trained Network. It offers not just dimensionality reduction but also infuses sufficient information into the data. This was all we have for this post. You may use this concept when you have very high Cardinality features. There can be other approaches too. Check out the Categorical encoding post Link

In this blog post, we will learn how to catch the Outliers Or Anomalous data points. Don't worry about the terminology, we will start with the meaning first.

In the process, we will look into 4 key algorithms for the purpose and also understand the working of these algorithms. We will also learn to program a simple Anomaly detector using these algorithms. You may also use these approaches in Exploratory Data Analysis steps to figure out the Outliers.

We will Learn,

- Clarifying the *taxonomies* around Anomaly detection
- Understanding and applying *IsolationForest*
- Understanding and applying *GaussianMixtureModel*
- Understanding and applying **LocalOutlierFactor**
- Using the reconstruction error of *PCA*

Let's start with the definition of important terminologies.

- **Anomaly/Outlier** - We assume a certain pattern in our datasets, and any data points which are far away from the pattern are considered Anomalies. Or we can say, the farther the point, the higher the chance of it being an Anomaly. The normal data points are called **Inliers**. *The task is to figure out an approach to define distance and a threshold to define the boundary for Inliers (Normal) vs Outliers (Anomalous)*
- *Novelty detection* - The problem and the underlying idea remain the same but the approach to apply the solution changes. We take a semi-supervised Machine learning approach here. We use only the Inliers dataset and train the Model. Then we can use a few of the labels to decide a Threshold of distance to identify the Outliers. Treat any new data point which is beyond the Threshold as *New i.e. Novel.* *In the case of Anomaly detection, we can take a Supervised approach and utilize the Outliers to decide the decision boundary between the Outliers and Inliers, but in this case, we have to figure out a Threshold based on expert judgment or a few data points*
- *Class Imbalance* - Class imbalance is when the dataset is not evenly balanced between multiple classes. This problem can become a special case of Anomaly detection when the percentage of the minority class is very low. *We will not discuss Class-Imbalance in this post.*

^{Semi-supervised Learning - This is an approach where we utilize the available Labels for some datasets to figure out the labels of other datasets using Unsupervised learning and then apply Supervised Learning on the whole dataset. We have used this term for a slightly different approach, where we will use the Labels to figure out the Threshold for the Normal-Anomaly boundary}

Let's try to see how *Normal-Outliers* will look in different scenarios and what all approaches we may take to figure out the difference.

We will try to figure out the Outliers of *Scenario-I and II*. For *Scenario-III* we will need a close analysis of the two data points and need to come up with a new Feature for the Model which can distinguish between the Normal and Anomaly

*Possible approaches*

- Get the distance of all the points from all other points. The Outliers will have a higher average distance
- Neighbor of the neighbour of an Outlier is the Outlier itself
- If we build a Clustering model which assign the probability of belongingness to the Cluster to each point, then Outliers will have a smaller probability
- If we build a DecisionTree to segregate each data points, chances are high that an Outlier will be segregated to a Leaf very early in the Tree
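The first idea in the list above (a higher average distance for Outliers) can be sketched in a few lines. This is only an illustrative sketch on hypothetical toy points, not the approach used later in this post; the function name `mean_distances` is made up for the example.

```python
import math

def mean_distances(points):
    """Return, for each point, its mean Euclidean distance to all other points."""
    result = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        result.append(sum(dists) / len(dists))
    return result

# Hypothetical toy data: a tight cluster plus one far-away point
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (5.0, 5.0)]
scores = mean_distances(points)

# The point with the largest mean distance is the most likely Outlier
outlier_index = max(range(len(points)), key=lambda i: scores[i])
print(outlier_index, scores)
```

In practice a threshold on the score is still needed, and the all-pairs computation grows quadratically with the number of points.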

*Our Dataset and approach*

We will use the classic *KDD Cup 99* dataset. *The KDD Cup 99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab.* We will use the "*http structure*" subset of the dataset with 3 Features

In our training, we will only use the Normal data points and build a Novelty detector. We will use the Outlier data point as new data points in the testing stage.

*Let's see the Code*

```
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
dataset = fetch_kddcup99(subset='http', shuffle=True, return_X_y=False)
x = pd.DataFrame(dataset.data).astype(float)
y = pd.Series(dataset.target)
# Let's keep only normal and back.
mask = y.isin([b'normal.', b'back.']) #1
x, y = x[mask], y[mask]
# Encode the labels: 0 = normal (Inlier), 1 = back (Outlier)
y = (y == b'back.').astype(int)
x_train = x.loc[y==0,:] #2
x_test_outlier = x.loc[y==1,:] #3
x_train,x_test_normal = train_test_split(x_train,test_size=2500) #4
```

Code-explanation^{#1 - Filter the dataset to keep only one type of Outlier. The full dataset has multiple types}^{#2, #3 - Filter Inliers as x_train and Outliers as x_test(Positive)}^{#4 - Split another dataset out of the Inliers as a separate x_test(Negative)}^{All other parts of the code are quite trivial and self-explanatory.}

We will build multiple Models for the dataset. Let's start with Gaussian Mixture Model.

*How it works*

A GMM learns a weighted mixture of multiple Gaussian distributions fitted across the dataset. Hence, for each data point we get a probability of belonging to each individual Gaussian component. *Check the depiction below*

We can observe the contours of the mixture of two Gaussian. The probability decreases as the points move farther away from the mean. *So the outlier points will have a small probability for both the Gaussian components.*
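To make the "small probability far from the means" intuition concrete, here is a minimal pure-Python sketch with a hypothetical 1-D mixture of two Gaussians (the weights, means and standard deviations are made-up values). It also shows a detail worth keeping in mind: the per-component membership probabilities (responsibilities) always sum to 1 for a point, so it is the density under the components that becomes small for a far-away point. In scikit-learn, `GaussianMixture.predict_proba` returns the responsibilities, while `score_samples` returns the log of the mixture density.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical mixture: two components with equal weights
weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]

def component_densities(x):
    """Weighted density of x under each component."""
    return [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]

def mixture_density(x):
    return sum(component_densities(x))

def responsibilities(x):
    """Probability of belonging to each component, given that x came from the mixture."""
    parts = component_densities(x)
    total = sum(parts)
    return [p / total for p in parts]

inlier, outlier = 0.2, 12.0
print(mixture_density(inlier), mixture_density(outlier))  # the second value is far smaller
print(responsibilities(outlier))                           # still sums to 1
```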

*Let's see the Code*

```
from sklearn.mixture import GaussianMixture
aic = []
for n in range(1,10):
    gm = GaussianMixture(n_components=n, random_state=0,n_init=10).fit(x_train)
    aic.append(gm.aic(x_train))
sns.lineplot(x=np.arange(1,10),y=aic) # To identify the best n_components
```

Code-explanation^{n_components is the number of Gaussians we want to mix in the Model. Just like we use silhouette_score for KMeans we use the Akaike information criterion[ AIC ] for GMM. In this case we got best n_components=2}
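For reference, the criterion is AIC = 2k - 2 ln(L), where k is the number of free parameters and L is the maximized likelihood; lower is better. The sketch below only illustrates the trade-off. The log-likelihood values are made up, and the parameter count assumes full covariance matrices on 3 features.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def gmm_param_count(n_components, n_features):
    """Free parameters of a GMM, assuming full covariance matrices."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    mixing_weights = n_components - 1
    return means + covariances + mixing_weights

# Hypothetical total log-likelihoods for 1, 2 and 3 components
log_likelihoods = {1: -1250.0, 2: -1100.0, 3: -1095.0}
scores = {n: aic(ll, gmm_param_count(n, 3)) for n, ll in log_likelihoods.items()}
best_n = min(scores, key=scores.get)
print(scores, best_n)
```

Going from 2 to 3 components improves the likelihood only slightly, so the penalty for the extra parameters wins and 2 components are preferred in this made-up example.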

```
gm = GaussianMixture(n_components=2, random_state=0,n_init=10).fit(x_train)
proba_train = gm.predict_proba(x_train) #1
threshold = np.percentile(proba_train, 2.5, axis=0) #2
score_test_normal = gm.predict_proba(x_test_normal)
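# A row is flagged when its probability for any component is below that component's Threshold
x_test_normal.loc[(score_test_normal < threshold).any(axis=1)] #3
score_test_outlier = gm.predict_proba(x_test_outlier)
x_test_outlier.loc[(score_test_outlier < threshold).any(axis=1)] #4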
```

Code-explanation^{#1 - Calculate the probability for each data points for each Gaussian component}^{#2 - Calculate a Threshold of probability considering the 2.5 percentile as Threshold. This is just a common guess here. Should be driven by case}^{#3 - Filter the records which are having a lesser probability than the Threshold from x_test_normal . This will be our "False Positive"}^{#4 - Do the same for x_test_outlier . This will be our "True Positive" }

Result^{Outliers - All 2203 out of 2203 identified}^{Normal - 135 out of 2500 incorrectly identified}

Now we understand different taxonomy around Outliers and Novelty detection. We also learnt one of the approaches to figure out the Outliers.
In the next post [Link ], we will continue with the remaining models *i.e. Isolation Forest, LocalOutlierFactor and PCA*. See you there.

Deep Neural Network has always been a Black Box and it is still so but there are many good techniques that can help us to gain some insights about the black box.

In this 4-blog series, we will understand and code these techniques on Image data *i.e. CNN*. In doing so, we will go through multiple approaches.

This is Part-III of the series. In this post, we will learn to get the Class Activation Map for a CNN even when it has multiple Fully-Connected-Layers. This was one of the challenges with *CAM [Read Part-II here and Part-I here ]*

This approach will require a little understanding of TensorFlow GradientTape.

In the CAM approach, we calculated the heatmap using the below formula $$CAM = \sum_i w_i \cdot FM_i$$

^{wi is the weight of the GAP Layer of ith Feature map of the last Layer connecting to the Softmax }

**Grad Cam** is exactly based on the same approach, the only thing that changes is to think of an alternate entity instead of the weight which will allow the Model to have any number of FC layers after the Conv. layer.

Below is the depiction of the above idea *i.e. a typical convolutional Network.* Let's try to understand the idea.

^{Note - The last Feature map are the Features for the FC Layers and we can assume the FC layer as a complex function over these Features. After the last FC layer, two different sets of weights are connected to the two output neurons, so the final function is different for different output classes i.e. f1 and f2}

Our goal is to get the impact of the last convolutional layer on the outputs. Since the outputs are a function of **FM(x)**, *the derivative of the outputs class w.r.t to the FM will directly represent how the output class will change with a small change in the FM*

So, if we compare to the CAM approach, the multiplier weights are replaced by the derivative and to get the derivative we don't need a constraint on the Architecture. It can be calculated for any number of FC layers.

We will follow these steps -

- Fit a Pre-trained model on the Cats-Dogs dataset
- Get the Class's partial derivative *w.r.t the last convolution layer's output*
- Calculate the mean of the derivative for each FM
- Calculate the weighted result of all the FM and the Derivative *i.e. Grad-CAM*
- Resize the CAM and superimpose on the original image

^{Note - As mentioned earlier, every feature map will be connected to two output Neurons with two different sets of weights, so we have to use the Derivative for that particular Class w.r.t the FM}
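The weighting steps above can be sketched with NumPy on random stand-in arrays, before touching any real model. The shapes mirror the VGG16 case used in this post (14x14x512); the arrays themselves are hypothetical. Note that the published Grad-CAM formulation sums the weighted FM and applies a ReLU, while the snippet in this post takes the mean across the FM, which is what the sketch follows.

```python
import numpy as np

rng = np.random.default_rng(0)
fm = rng.random((14, 14, 512))               # stand-in for the last conv layer output
grads = rng.standard_normal((14, 14, 512))   # stand-in for d(class score)/d(FM)

# One weight per Feature map: the spatial mean of its gradient
alpha = grads.mean(axis=(0, 1))              # shape (512,)

# Weight every Feature map and average across the channel axis
grad_cam = (fm * alpha).mean(axis=-1)        # shape (14, 14)
print(alpha.shape, grad_cam.shape)
```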

Define a pre-trained model *[ VGG16 here]*

```
# Case - II - Train Classification layers
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16
model = VGG16(weights='imagenet')
```

^{Note - Model is loaded as it is i.e. we included the Top FC Layers}

In the below snippet, we calculated the Gradient of the predicted class w.r.t. the output of the last convolution layer.

```
# Get the gradient of the max class w.r.t the output of the conv.(last) layer
conv_layer = model.get_layer("block5_conv3")
joint_model = keras.Model([model.inputs], [conv_layer.output, model.output])
with tf.GradientTape() as gtape:
    conv_output, predictions = joint_model(img)
    y_pred = predictions[:, np.argmax(predictions[0])]
grads = gtape.gradient(y_pred, conv_output)
print(grads.shape)
```

^{Note - This is a typical way to get the Gradient for any Node w.r.t. any parameter list. If you want to understand the concept, check this Blog [ Link ]. For now, you may assume that grads have got the Gradients of the last convolution Layer}

In this code snippet, we calculated the weighted sum(Grad-CAM) of all the FM. Then we resize the CAM and apply a colour-map.

```
grads = tf.reduce_mean(grads, axis=(0,1,2)) #1
grad_cam = tf.math.multiply(conv_output,grads) #2
grad_cam = tf.reduce_mean(grad_cam, axis=-1)[0].numpy() #3 (NumPy array for OpenCV)
# Upsample (resize) it to the size of the image i.e. 224x224
import cv2
grad_cam = cv2.normalize(grad_cam, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
grad_cam = cv2.resize(grad_cam, None, fx=224/14, fy=224/14, interpolation = cv2.INTER_CUBIC)
grad_cam = cv2.applyColorMap(grad_cam, cv2.COLORMAP_JET)[:,:,::-1] # This slicing is to swap R,B channel to align cv2 with Matplotlib
```

Code-explanation^{#1 - Calculated the mean value of the Gradient for each FM (Gradient shape is 1x14x14x512). For the ResNet50 model of Part-II it was 7x7x2048}^{#2, #3 - Multiplied the averaged Gradients with the FM and calculated the mean across all the weighted FM}

The above operation is depicted in the image below.

In the below snippet, we have simply superimposed the Grad-CAM on the original image using OpenCV functions.

```
img = image.load_img(img_path, target_size=(224, 224))
img = image.img_to_array(img)
img = cv2.normalize(img, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
superimposed_img = cv2.addWeighted(img, 0.5, grad_cam, 0.25,0)
_, ax = plt.subplots(1,3,figsize=(15,5))
ax[0].imshow(superimposed_img, cmap='jet')
ax[1].imshow(img, cmap='jet')
ax[2].imshow(grad_cam, cmap='jet')
```

Code-output

This approach solved the problem created by the CAM approach *i.e. we can have any number of FC layers*

This can also be extended to define a bounding box across the image using a relevant thresholding approach. In the paper, there are additional use cases by blending Grad-CAM with few other known approaches. Check the paper here **Link**

**On the down-side,**

- Though not very severe but still, it requires an extra pass of the Image through the Network
- It is still based on upsampling the CAM so the resolution is not very high. But this issue has been solved in the paper with an improved approach

With this post,

- we are fully equipped with all the techniques required to look inside a CNN and also understand the evolution over time.
- We also understand the working of Tensorflow GradientTape

In the next part of this series, we will learn to visualize the weights and FM of convolutional layers. This will help more in understanding how the model is learning in the training.

In this post, we will understand the approach that we need to take if we want the Gradient of any Node with respect to any other Node in a Tensorflow graph.

It's ok if you don't understand Tensorflow Graph. For this post, just keep in mind that a Neural Network in Tensorflow/Keras is a graph consisting of Nodes(Operation) and Edges(Data).

The most common need to get the Gradient of a Node w.r.t to another Node is when we do the Backpropagation step during training. We also used it in our Blog on CNN Visualization using Gradient-CAM [Check here]

Below is a depiction of a typical Neural Network. As mentioned, we need the Gradient of the Loss w.r.t each weight.

There are 3 key ways to calculate the differentiation -

**Finite-Difference** - *This is the traditional way to calculate the derivative at any point for any function using the formula of slope i.e.* $$f'(x) = \lim_{\delta\to 0}\frac{f(x+\delta)-f(x)}{\delta}$$
^{This approach will require a full pass of the neural network every time. Also, the delta must be almost equal to zero, otherwise the result will be inaccurate. But if you keep the delta too small, the result suffers from round-off error because the difference falls below the precision of the float type}

**Symbolic/Analytical** - *We can calculate the required Gradient for any function using the simple chain rule and the known derivatives e.g.* $$f(x) = x\cos(x) \Rightarrow f'(x) = \cos(x) - x\sin(x)$$
^{Again, this technique will be incompatible with a Neural Network, which prefers to work with Tensors of float. Also, this is computationally inefficient considering the fact that every Network is unique and modern networks are too big for this approach. Lastly, this technique doesn't work with loops, branching and recursion}

**Automatic Differentiation** - This is not a new technique in itself, but it puts a wrapper of programming technique over the chain rule of the derivative. This is also called Auto-Diff to differentiate it from manual approaches. It has two different ways to achieve the derivative. Let's understand both for this sequential expression - $$x_1, x_2 \xrightarrow{\log} x_3;\quad x_3, x_4 \xrightarrow{\mathrm{sqrt}} x_5;\quad x_5, x_6 \xrightarrow{\sin} y$$

*Forward-mode Auto-Diff* - In this mode, while moving forward with the calculation, we also calculate the derivative of each individual result and connect them using the chain rule at the end. e.g. in the above expression, let's say we have to calculate the derivative of y w.r.t. x_{1}. So, *we will keep accumulating the intermediate derivatives i.e.* $$\frac{\partial x_3}{\partial x_1} ; \frac{\partial x_5}{\partial x_3} ; \frac{\partial y}{\partial x_5}$$ In the end, we can get the desired result by simple multiplication.
^{With this approach, we can overcome the issues mentioned for the previous approaches, but we need a full pass for each variable i.e. x1. In a typical Neural network, we have millions of weights, so this approach will be very inefficient.}

*Reverse-mode Auto-Diff* - In this mode, we don't target the derivative w.r.t. any specific variable, but we save all the outputs and derivatives for each step in the forward pass. Then in the reverse pass, we use those values to calculate the desired derivative using the chain rule.
^{With this approach, we can calculate the derivative of the output w.r.t. any number of intermediate variables (weights). On the down-side, this approach needs a lot of memory to save (record) all those intermediate values and steps. It also needs appropriate planning so that we only record steps/variables along a particular path instead of all possible paths in the whole network}
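The delta trade-off of the Finite-Difference approach can be seen with a small experiment on the same function used in the Symbolic example, f(x) = x cos(x). This is a minimal sketch; the evaluation point and the three delta values are arbitrary choices for illustration.

```python
import math

def f(x):
    return x * math.cos(x)

def f_prime(x):
    """Analytical derivative of f, used as the reference value."""
    return math.cos(x) - x * math.sin(x)

def forward_difference(func, x, delta):
    """Slope formula with a finite (non-zero) delta."""
    return (func(x + delta) - func(x)) / delta

x0 = 1.0
exact = f_prime(x0)
errors = {d: abs(forward_difference(f, x0, d) - exact) for d in (1e-1, 1e-5, 1e-15)}
for delta, err in errors.items():
    print(f"delta={delta:g}  abs error={err:.3g}")
```

A large delta gives a poor approximation of the slope, a moderate delta does much better, and an extremely small delta gets worse again because the difference f(x+delta) - f(x) is dominated by floating-point round-off.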

^{You can read more on it - [ Wikipedia] and in this [ Paper].}

Tensorflow uses Reverse-mode Automatic differentiation approach.
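To get a feel for what "record in the forward pass, replay in the reverse pass" means, here is a minimal pure-Python sketch of Reverse-mode Auto-Diff for scalars. It is only a toy illustration and not how Tensorflow is implemented internally; the class `Var` and the function `backward` are made-up names. It uses the same toy values as the graph in the next section.

```python
class Var:
    """A scalar that records how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def backward(output):
    """Reverse pass: walk the recorded graph from the output and apply the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)   # parents are appended before the node itself
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Forward pass: every operation is recorded through the parents links
x1, x2 = Var(1.0), Var(2.0)
w1, w2 = Var(0.25), Var(0.15)
y_true = Var(0.75)
out = x1 * w1 + x2 * w2
diff = y_true - out
loss = diff * diff

backward(loss)
print(w1.grad, w2.grad)   # approximately -0.4 and -0.8
```

A single reverse pass fills in the gradient for every recorded variable, which is why this mode suits networks with millions of weights.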

Now we fully understand how it is calculated. So, let's use the Tensorflow API and calculate it for a toy Graph [See below image]

```
import tensorflow as tf
input = tf.constant([[1.0,2.0]])
weights = tf.Variable([[0.25],[0.15]])
output = tf.linalg.matmul(input, weights)
# Loss - Squared Diff
y_true = tf.constant(0.75)
loss = tf.math.squared_difference(y_true, output)
```

The above snippet is a simple implementation of the Graph shown in the previous image. Now, let's add the Tensorflow GradientTape in this code.

```
input = tf.constant([[1.0,2.0]])
weights = tf.Variable([[0.25],[0.15]])
y_true = tf.constant(0.75)
with tf.GradientTape() as tape: #1
    output = tf.linalg.matmul(input, weights)
    # Loss - Squared Diff
    loss = tf.math.squared_difference(y_true, output)
gradients = tape.gradient(loss, weights) #2
gradients
```

Code-output^{<tf.Tensor: shape=(2, 1), dtype=float32, numpy=array([[-0.39999998], [-0.79999995]], dtype=float32)>}

Code-explanation^{#1 - We need to initiate a "with" context with GradientTape. It makes Tensorflow record the operations executed within the "with" context so that it can calculate the Gradients for the nodes used there. This is the point we discussed in the previous section, that we need a way to control which steps to record and which ones to ignore. We are achieving that using the "with" context}^{#2 - Then, we can simply get the Gradient of any output node w.r.t. an input Node, since TF GradientTape has recorded and saved all intermediate outputs and derivatives.}

- *By default,* GradientTape only watches a Variable, not a Constant. But there is a provision to force this
- The tape is automatically erased immediately after we call its gradient() method, so we will get an exception if we try to call it again. Here also, we can force it to act otherwise
- We can't directly calculate the Gradient of a vector/matrix w.r.t. another vector/matrix. The target should be a Scalar value, taken w.r.t. a vector/matrix. If the target is not a Scalar, the result is the sum of the gradients of each target element
- The tape can't record the gradient path if the calculation exits TensorFlow *e.g. using Numpy*
- Try to keep the code within the "with" context as lean as possible

Let's see another snippet for these points.

```
input = tf.constant([[1.0,2.0]])
# Will not work if "tape.watch" is not added
weights = tf.constant([[0.25],[0.15]]) #1
y_true = tf.constant(0.75)
with tf.GradientTape(persistent=True) as tape: #2
    tape.watch(weights) #3
    output = tf.linalg.matmul(input, weights)
    # Loss - Squared Diff
    loss = tf.math.squared_difference(y_true, output)
gradients = tape.gradient(loss, weights)
gradients_copy = tape.gradient(loss, weights) #4
gradients_copy
```

Code-explanation^{#1 - We defined the weights as a Constant. The tape will not return the Gradient if this Constant is not placed under tape.watch. You may try it by commenting out #3}^{#2 - Added persistent=True. This forces the tape to not clear its content after the first call to tape.gradient}^{#3 - Put the weights under tape.watch( )}^{#4 - Called tape.gradient( ) again. This works because the tape was initialized with persistent=True}

Since we have got the Gradient of the Loss w.r.t. the weights, we can simply train the Model using the very basic concept of Gradient Descent i.e.

$$w = w - learning\_rate * Gradient$$

Let's take 3 data points and run a loop on them.

```
data = tf.data.Dataset.from_tensor_slices([[5.0,2.0], [3.0,3.0], [4.0,1.5]])
weights = tf.Variable([[0.55],[0.75]],shape=[2,1])
y_true = tf.data.Dataset.from_tensor_slices([0.75,0.5,0.4])
lr = 0.01 # Learning rate
dataset = tf.data.Dataset.zip((data,y_true)) # Zipped the two tensor to loop
for data, y_true in dataset:
    with tf.GradientTape() as tape:
        output = tf.linalg.matmul(tf.reshape(data,[1,2]), weights)
        loss = tf.math.squared_difference(y_true, output)
    gradients = tape.gradient(loss, weights)
    weights.assign_sub(tf.multiply(lr,gradients)) # Applied the Gradient Descent
weights # Final trained weights
```

Code-output^{<tf.Variable 'Variable:0' shape=(2, 1) dtype=float32, numpy= array([[0.02995202], [0.47385702]], dtype=float32)>}
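The same three updates can be cross-checked by hand without Tensorflow. The sketch below assumes the loss is the squared difference, so the gradient w.r.t. each weight is 2 * (output - y_true) * x; it is only a sanity check of the arithmetic, not a replacement for GradientTape.

```python
samples = [([5.0, 2.0], 0.75), ([3.0, 3.0], 0.5), ([4.0, 1.5], 0.4)]
weights = [0.55, 0.75]
lr = 0.01  # Learning rate

for features, target in samples:
    output = sum(w * x for w, x in zip(weights, features))
    # d(loss)/d(w_i) for loss = (target - output)^2
    gradients = [2.0 * (output - target) * x for x in features]
    # w = w - learning_rate * Gradient
    weights = [w - lr * g for w, g in zip(weights, gradients)]

print(weights)
```

The values agree with the Tensorflow output above up to float32 precision.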

The above code snippet is self-explanatory. If you are not aware of TensorFlow Data API, then you may take a quick look into the official guide [Here]

For more insights on GradientTape, check the official guide for it, [Here]

For advanced concepts *i.e. Higher-Order derivative, Derivative for tensor targets/tensor Source, etc*. check this official guide [Here]

With this post, we understood different ways to calculate the Derivative. Then we dived into the working of Auto-diff.
Then we learnt the Tensorflow implementation *i.e. GradientTape* to get the derivative of the output w.r.t. any input weights. Try to go through the references mentioned in different sections.

With all this knowledge and code, you should be comfortable to build a custom neural network for any purpose.

We will use this knowledge in our blog to "Visualise a convolutional neural network"[Link].

Deep Neural Network has always been a Black Box and it is still so but there are many good techniques that can help us to gain some insights about the black box.

In this 4-blog series, we will understand and code these techniques on Image data *i.e. CNN*. In doing so, we will go through multiple approaches.

This is Part-II of the series. In this post, we will visualize the image to find out the most important pixels using a technique called **Class Activation Map** as described in this **Paper**. This too is a straightforward technique, but everything is complex until it is made simple, *so all the credit goes to the Researchers.*

Let's assume a CNN model with Conv. layers connected to a single fully connected layer *via. a Global Average Pool layer*. This is a very common CNN Architecture.

So, the image will be scanned with all the convolution Layer and the last Layer will have the key Features in the form of multiple Feature maps. Then using these feature maps the Fully connected layer will decide what the Category is *e.g. Cat/Dog.*

What it means is that the combined effect of the Feature map and the respective weight of the fully connected layer is the value that the model has for every image.

So, if we create a weighted sum of the weight and the Feature maps, it should represent the effective spatial importance *(More on it later)* for the image. This will represent the **Class Activation Map** of that Class *(use weights that are connected to the particular Class Neuron of the Softmax).*

Then the task that remains is to resize this map to the size of the image.

Below is the depiction of the above idea *[Image is taken directly from the paper - Arxiv link]*

^{The Global Average Pool layer is a direct representation of the last Feature maps, so the weight connecting the Global Average Pool to the Softmax can be considered equivalent to a weight that is directly mapped from the FM to the Softmax}
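The equivalence described in the note can be checked numerically. The sketch below uses random stand-in arrays with the ResNet50 shapes from this post (7x7x2048 Feature maps, a Dense layer with 2 outputs) and ignores the bias; the arrays and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
fm = rng.random((7, 7, 2048))          # stand-in for the last conv layer output
w = rng.standard_normal((2048, 2))     # stand-in for the Dense weights after the GAP layer

class_idx = 0
# CAM: weighted sum of the Feature maps for the chosen class
cam = (fm * w[:, class_idx]).sum(axis=-1)      # shape (7, 7)

# Path taken by the model: Global Average Pool first, then the Dense weights
gap = fm.mean(axis=(0, 1))                     # shape (2048,)
logit = gap @ w[:, class_idx]

# The class score (before bias and Softmax) is the spatial mean of the CAM
print(cam.mean(), logit)
```

So each cell of the CAM is the contribution of that spatial location to the class score, which is why it can be read as a heatmap.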

We will follow these steps

- Fit a Pre-trained model on the Cats-Dogs dataset
- Get the last convolution layer's output for the Original Image
- Get the respective weight of the feature map(FM) connecting to the Softmax
- Calculate the weighted result of all the FM which is the *Class Activation Map(CAM)*
- Resize the CAM and superimpose on the original image

^{Note - As mentioned earlier, every feature map will be connected to two output Neurons with two weights, so we have to use the weight for that particular Class which was predicted by the Model}

Define and train a pre-trained model *[ ResNet50 here]*

```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications.resnet50 import ResNet50
base_model = ResNet50(weights='imagenet', include_top=False)
model = keras.Sequential()
model.add(base_model)
model.add(keras.layers.GlobalAveragePooling2D())
model.add(keras.layers.Dense(2, activation="softmax"))
#Freeze the layers of Pre-trained models
for layer in base_model.layers:
    layer.trainable = False
optimizer = keras.optimizers.Adam(learning_rate=0.02)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
#No need to train for a very long time
history = model.fit(traindata, epochs=1, validation_data=testdata)
```

In the below snippet, we pick a random image from the folder and pass it through an intermediate model to get the Feature maps of the last convolution layer.

```
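from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
img_path = '/content/train/' + str(img_list[np.random.randint(0,len(img_list))])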
img = image.load_img(img_path, target_size=(224, 224))
img = image.img_to_array(img)
y_class = img_path.split(sep="/")[-1][:3]
img = np.expand_dims(img, axis=0)
img = preprocess_input(img) #1
# Create a model with Conv block (only)
model_b = keras.Sequential()
model_b.add(base_model)
op = model_b.predict(img)[0] # This is 2048 FM of size 7x7
# Weights from the last Layer [ This is from original Model]
weights = model.get_weights()[-2] #2
weights.shape
```

Code-explanation^{#1 - This is the preprocess function of the ResNet model [Keras version]}^{#2 - Get the Weights(not bias) from the last layer. The index is to fetch that. [-1] will fetch the Biases}

In this code snippet, we calculated the weighted sum(CAM) of all the FM. Then we resize the CAM and apply a colour-map.

```
# Weights will depend on the actual class of the Image
if y_class=='cat':
    cam = op*weights[:,0].reshape(1,1,-1)
else:
    cam = op*weights[:,1].reshape(1,1,-1)
cam = cam.sum(axis=-1)
# Upsample (resize) it to the size of the image i.e. 224x224 [ResNet]
import cv2
cam = cv2.normalize(cam, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
cam = cv2.resize(cam, None, fx=224/7, fy=224/7, interpolation = cv2.INTER_CUBIC)
cam = cv2.applyColorMap(cam, cv2.COLORMAP_JET)[:,:,::-1] # This slicing is to swap R,B channel to align cv2 with Matplotlib
```

In the below snippet, we have simply superimposed the CAM on the original image using OpenCV functions.

```
# Load the same image again
img = image.load_img(img_path, target_size=(224, 224))
img = image.img_to_array(img)
img = cv2.normalize(img, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
superimposed_img = cv2.addWeighted(img, 0.5, cam, 0.25,0)
# Displaying with Matplotlib
_, ax = plt.subplots(1,3,figsize=(15,5))
ax[0].imshow(superimposed_img, cmap='jet')
ax[1].imshow(img, cmap='jet')
ax[2].imshow(cam, cmap='jet')
```

Code-output

This approach was quite simple and also no special computation was required like the occlusion method.

We can also extend this to define a bounding box across the image using a relevant thresholding approach. *Though the Box would be more on the CAM rather than the full object.*

**On the down-side,**

- CAM cannot be applied to networks that use multiple fully-connected layers before the output layer, so fully-connected layers are replaced with convolutional ones and the network is re-trained. *It expects a single fully connected layer after the global average pool layer*.
- It is still based on upsampling the CAM so the resolution is not very high

So, we need an approach that doesn't need these restrictions. That is where we apply the concept of **Grad-CAM** *Arxiv link*

We will learn this technique in the next part of this blog series. Check it Here *Link*

Deep Neural Network has always been a Black Box and it is still so but there are many useful techniques that can help us to gain relevant insights about the black box.

In this 4-blog series, we will understand and code these techniques on Image data *i.e. CNN*. In doing so, we will go through multiple approaches

Ways using which we may look into CNN,

- Looking at the Filters
- Looking at the Convolution feature map
- Class activation map

This is a family of techniques. The key goal of each is to find out which part of the image is most responsible for the output. An added goal is to do this efficiently; without that constraint, the task is easy.

Class Activation map (heat map)^{A heatmap across the image's pixels according to the importance of the pixel for a particular output. Image courtesy - Matthew D. Zeiler, Arxiv link}

Key approaches for getting the Class Activation map (heat map) are

- Occluding parts of the image as described in this paper Arxiv link
- Simple CAM as described in this paper Arxiv link
- Grad-CAM as described in this paper Arxiv link

In this post, we will look into the first approach. This is very simple and will give us the optimum value of thrill and effort :-)

This is a very simple technique. What we do here is occlude a part/pixel of the input image and observe the Class probability.

So, *when a more important part is occluded, the dip will be higher*. Then using all the probability dip, we can make a simple heat-map. This technique is very similar to the approach of calculating Feature Importance using the feature-permutation method.

We will follow these steps

- Fit a Pre-trained model on Cats-Dogs dataset [ Link ]
- Select an image and create its occluded copies. The number of copies will depend upon the size of the occlusion, as shown in this depiction
- Get the last convolution layer's output for the Original Image
- Get the same output for all the occluded images and note the average dip across the occluded region.
- The dip will represent the importance(Heatmap) of the region *i.e. all the pixels*
- Resize the heatmap and superimpose on the original image

Note-^{We have changed the strategy a bit as compared to the original paper. Paper has suggested to do it for the Strongest Filter but this will work equally well and reduce the overhead to find the Strongest filter}
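Before applying the steps to a real CNN, the idea can be sketched with NumPy on a toy "model". The function `toy_score` below is a hypothetical stand-in that only responds to the top-left region of the image, so we know in advance where the heatmap should light up.

```python
import numpy as np

def toy_score(img):
    """Hypothetical stand-in for a model: responds only to the top-left 4x4 region."""
    return float(img[:4, :4].sum())

size, patch = 8, 4
img = np.ones((size, size))
base = toy_score(img)          # score for the un-occluded image

heatmap = np.zeros((size // patch, size // patch))
for r in range(0, size, patch):
    for c in range(0, size, patch):
        occluded = img.copy()
        occluded[r:r + patch, c:c + patch] = 0.0    # occlude one patch
        heatmap[r // patch, c // patch] = base - toy_score(occluded)  # dip in the score

print(heatmap)
```

Only the patch that covers the region the toy model cares about produces a dip, which is exactly the behaviour we rely on with the real network.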

In the below snippet, we have defined the parameters for image size and patch. Then using a nested loop we have created the Occluded images.

```
# Defining parameters
size = 224
patch_size = 32
num_images = int((size/patch_size)**2)
import cv2
img = cv2.normalize(img, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
plt.imshow(img)
# Creating different image occluded at different position
x = np.zeros((num_images, size, size, 3),dtype=int) #1
x[:] = img #2
i=j=k=0
while i < size:
    while j < size:
        x[k,i:i+patch_size,j:j+patch_size,:] = 0. #3
        j+=patch_size
        k+=1
    i+=patch_size
    j=0
```

Code-explanation^{#1 - Blank array to hold all the occluded images. num_images will depend on the patch size: the smaller the patch, the bigger the count}^{#2 - Filling all the image slots with the original image. Then in the next step, we will put a patch on a different location}^{#3 - Slicing a portion and occluding it by filling with zero. Loop to move across images and pixels.}

In the below snippet, we have preprocessed the whole image with ResNet pre-process function. Then calculated the dip on the prediction for the Original image and Occluded image

```
x = preprocess_input(x)
# Creating a Model to get the prediction for Conv Layer
model_b = keras.Sequential()
model_b.add(base_model) # base_model is ResNet pre-trained conv block
op = model_b.predict(x) # Output of Occluded Image
img_full = np.expand_dims(img, axis=0).copy()
img_full = preprocess_input(img_full)
img_full_op = model_b.predict(img_full) # Output of the Original Image
# Calculate the dip; averaging over each Feature map smooths out the Spikes
act_dip = img_full_op - op
act_max_fm = act_dip.mean(axis=1).mean(axis=1).max(axis=1) #1, #2
heatmap = act_max_fm.reshape(int(size/patch_size),int(size/patch_size)) #3
```

Code-explanation^{#1 - act_dip is of shape (49, 7, 7, 2048). 49 for the Occluded images. 7x7 is the Feature map size of the ResNet last Conv. layer. 2048 is the count of Feature maps. We calculated the mean for each feature map and then the max of the 2048 such outputs.}^{#2 - So the assumption is that the FM which was most important will have the biggest dip and should come out in the max operation.}^{#3 - act_max_fm is a 1-D array, so we reshaped it to 2-D i.e. 7x7 from (49,)}

In the below snippet, we have simply added the colour map to the heatmap and Superimposed it to the original image using OpenCV functions.

```
# Normalizing and applying color map to the Heatmap
import cv2
hm = heatmap.copy()
hm = cv2.normalize(hm, None, alpha = 0, beta = 255, norm_type = cv2.NORM_MINMAX, dtype = cv2.CV_8UC3)
hm = cv2.resize(hm, None, fx=patch_size, fy=patch_size, interpolation = cv2.INTER_CUBIC)
hm = cv2.applyColorMap(hm, cv2.COLORMAP_JET)[:,:,::-1] # Flipped BGR to RGB
superimposed_img = cv2.addWeighted(img, 0.5, hm, 0.25, 0,)
_, ax = plt.subplots(1,3,figsize=(15,5))
ax[0].imshow(superimposed_img, cmap='jet')
ax[1].imshow(img, cmap='jet')
ax[2].imshow(hm, cmap='jet')
```

This approach is very intuitive *i.e. even if you don't understand the nitty-gritty of a CNN, you can understand that this should work*.

**On the down-side,**

- It requires a lot of extra computation in terms of creating and predicting all the images through the Model. To get an estimate for this issue, *for an image of size 224x224 and occlusion size of 4x4, we need 3136 images which are equivalent to 3136\*3\*224\*224\*8 Bytes ~ 4 GBs of memory*
- Heat map fluctuates a lot based on the size of the patch and the method to pick the dip in Feature maps *i.e. Mean, max, etc.*
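The memory estimate above is simple arithmetic and can be reproduced in a few lines. The 8 bytes per value assumes a 64-bit dtype (the earlier snippet creates the array with `dtype=int`, which is 64-bit on most platforms).

```python
size, patch, channels, bytes_per_value = 224, 4, 3, 8  # 8 bytes assumes a 64-bit dtype

num_images = (size // patch) ** 2          # one occluded copy per patch position
total_bytes = num_images * channels * size * size * bytes_per_value
print(num_images, round(total_bytes / 1e9, 2), "GB")
```

Using a smaller dtype such as `uint8` for the occluded copies would cut this estimate by a factor of 8.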

So, we need an approach where we don't have to pass the CNN multiple times. That is where we apply the concept of **Class Activation Map** *Paper*

We will learn this technique in the next part of this blog series. Check it *Here*

In this blog, we will learn many important Python features that might be new to you, or that you should keep at your fingertips.

List comprehension is the go-to way to iterate in Python, you should seldom use for-Loop. Here, we will try to develop an analogy of list-comprehension with for-loop *(because that syntax is the default in our mind)*

With the for-loop analogy, you can apply the list comprehension to a deeper level of nesting.
Check the Image *i.e. how the numeric indicator is mapped between for-loop and list-comprehension*

```
simple_list = ['This', 'is', '10xAI', 'Learning', '!!!']
# ---> [<<<Your expression>>> for elem in simple_list]
[elem.upper() for elem in simple_list]
```

Output^{['THIS', 'IS', '10XAI', 'LEARNING', '!!!']}

With the analogy, now it is a cakewalk.

```
nested_list = [ ['This', 'is'], ['10xAI', 'Learning'], ['It\'s', 'all'], ['about','AI'], ['!','!'] ]
# ---> [<<<Your expression>>> for inner_list in outer_list for elem in inner_list]
[elem for inner_list in nested_list for elem in inner_list]
```

Output^{['This', 'is', '10xAI', 'Learning', "It's", 'all', 'about', 'AI', '!', '!']}
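
To make the analogy concrete, here is a small sketch that writes the same flattening as a nested for-loop. The names `flat_loop` and `flat_comp` are only illustrative.

```python
nested_list = [['This', 'is'], ['10xAI', 'Learning'], ["It's", 'all'], ['about', 'AI'], ['!', '!']]

# for-loop version: the loops appear in the same left-to-right order
# as the `for` clauses of the comprehension
flat_loop = []
for inner_list in nested_list:      # 1st `for` clause
    for elem in inner_list:         # 2nd `for` clause
        flat_loop.append(elem)      # the leading expression

# list-comprehension version
flat_comp = [elem for inner_list in nested_list for elem in inner_list]

print(flat_loop == flat_comp)   # True
```

The expression that sits at the front of the comprehension is whatever you would append in the innermost loop.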

Zip is a handy tool to pack the respective elements of two sequences into pairs. Enumerate doesn't need any introduction. So, here is just a simple example.

```
models = ['LR', 'RandomForest', 'DecisionTree', 'SVM', 'KNN']
score = [0.7, 0.9, 0.8, 0.75, 0.72]
# Zipping respective pair
[[model, s] for model, s in zip(models, score)]
# Enumerate with Dict-Comprehension (idx instead of id, which is a built-in name)
{idx: {model: s} for idx, (model, s) in enumerate(zip(models, score))}
```
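
Two more patterns that often come up with the same pair of lists, shown here as a small sketch. The names `score_by_model` and `ranked` are illustrative.

```python
models = ['LR', 'RandomForest', 'DecisionTree', 'SVM', 'KNN']
score = [0.7, 0.9, 0.8, 0.75, 0.72]

# zip pairs items by position; dict() turns the pairs into a lookup table
score_by_model = dict(zip(models, score))
print(score_by_model['SVM'])   # 0.75

# enumerate yields (index, item); `start` changes the first index
ranked = sorted(score_by_model.items(), key=lambda pair: pair[1], reverse=True)
for rank, (model, s) in enumerate(ranked, start=1):
    print(rank, model, s)
# 1 RandomForest 0.9
# 2 DecisionTree 0.8
# 3 SVM 0.75
# 4 KNN 0.72
# 5 LR 0.7
```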

Functions are objects too. You can assign them to another name and pass them around like any other object using their identifier.

```
def g(x) : return 10*x
def f(g_x, x): return 1/g_x(x) # Accepts a function as one of the parameters
simple_list = [0.1, 0.2, 0.4, 0.5, 1]
[f(g , i) for i in simple_list] # Passing function as argument
# Can assign another identifier
func_var = g # No parenthesis
[f(func_var , i) for i in simple_list]
```

Output^{[1.0, 0.5, 0.25, 0.2, 0.1]}

Know and use these two parameters of the print function.

`print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)`

`end` defines what will be appended at the end of a print statement. The default is a new line, that's why the next print starts on the next line.

`sep` defines how two objects will be separated. The default is a single space.

You can change both these defaults.

```
models = ['LR', 'RandomForest', 'DecisionTree', 'SVM', 'KNN']
score = [0.7, 0.9, 0.8, 0.75, 0.72]
# Defaults
for i in range(5):
    print(models[i], score[i])
# using sep
for i in range(5):
    print(models[i], score[i], sep="==")
# using end
for i in range(5):
    print(models[i], score[i], sep="=", end=" | ")
```

Output^{LR=0.7 | RandomForest=0.9 | DecisionTree=0.8 | SVM=0.75 | KNN=0.72 |}
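
If you want to see exactly which characters `sep` and `end` produce, one option is to send the output to an in-memory buffer through the `file` parameter. This is just a sketch for inspection; `buffer` is an illustrative name.

```python
import io

buffer = io.StringIO()
# `file` redirects the output; `sep` joins the objects; `end` replaces the newline
print('LR', 0.7, sep='=', end=' | ', file=buffer)
print('KNN', 0.72, sep='=', end='\n', file=buffer)

print(repr(buffer.getvalue()))   # 'LR=0.7 | KNN=0.72\n'
```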

Packing and unpacking of variables is one of the cool things in Python. Let's quickly learn some of its variations and extensions. Using a `*` on a name on the left-hand side of an assignment, you can collect the items which are not required and ignore them.

```
many_vars = [100, "aqz", 0.002, "ignore"]
# We just need the 100
var, *_ = many_vars # _ is the throw-away variable
print(var)
# If we want to keep all others just for future ref
var, *my_bin = many_vars
print(my_bin) # my_bin is a list
```

A `*` before a sequence will unpack it.

```
many_vars = [ [100, "aqz"] , [0.002, "ignore"] ]
# Default way, we need the 100
var, *my_bin = many_vars
print(var) # var is the list [100, 'aqz'], as only the 1st level is unpacked
# Using *, we need the 100
var, *my_bin = (*many_vars[0],*many_vars[1])
print(var) # var is now 100, as * unpacked the inner lists
```
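
Two more variations, sketched under the assumption that the list has at least two items. The function `describe` is a hypothetical helper used only for illustration.

```python
values = [100, "aqz", 0.002, "ignore"]

# The starred name can sit anywhere; it always collects a list
first, *middle, last = values
print(first, middle, last)   # 100 ['aqz', 0.002] ignore

# In a call, * spreads a sequence into positional arguments
def describe(a, b, c, d):
    return f"{a}-{b}-{c}-{d}"

print(describe(*values))     # 100-aqz-0.002-ignore
```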

Whenever you need to get the unique values in sorted order, use this.

```
text = "a quick brown fox jumps over the lazy dog"
sorted(set(text))
# Use join for making it String
'_'.join(sorted(set(text)))
# Reversed it
list(reversed(sorted(set(text)))) # Reverse returns an Iterator, so list func applied
```

Output^{' _a_b_c_d_e_f_g_h_i_j_k_l_m_n_o_p_q_r_s_t_u_v_w_x_y_z'}

First things first - Python is 0-indexed.

A slice has 3 elements **i.e. start, stop, step**. The last item is indexed as -1, the second last as -2, and so on. The step can be negative. The stop index gets excluded.

Let's apply this information.

```
simple_list = [1, 7, 8, 0, 5, 23]
# with negative Index
simple_list[:-2 ] # From start to second-last element(-2 excluded)
# Trimming
simple_list[1:-1 ] # Very often needed with String
# Alternate Even-Indexed
simple_list[::2 ]
# Alternate Odd-Indexed
simple_list[1::2 ]
# Reversing the list
simple_list[::-1 ] # One of the coolest Python stuff
```
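
For reference, here is what each of those slices returns, plus the same idea written with the built-in `slice` object. The name `every_other` is illustrative.

```python
simple_list = [1, 7, 8, 0, 5, 23]

print(simple_list[:-2])    # [1, 7, 8, 0]
print(simple_list[1:-1])   # [7, 8, 0, 5]
print(simple_list[::2])    # [1, 8, 5]
print(simple_list[1::2])   # [7, 0, 23]
print(simple_list[::-1])   # [23, 5, 0, 8, 7, 1]

# A slice is an object too: slice(start, stop, step)
every_other = slice(None, None, 2)
print(simple_list[every_other])   # [1, 8, 5]
```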

Join is your friend when you want to build a string from a sequence by concatenating all its elements with a specific separator. You can also concatenate with a blank.

```
simple_list = ['This', 'is', '10xAI', 'Learning', '!!!']
' '.join(simple_list)
# With List-comprehension
' '.join([elem.upper() for elem in simple_list])
```

Output^{THIS IS 10XAI LEARNING !!!}
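
One thing to keep in mind is that `join` accepts only strings, so other types need a conversion first. A small sketch with illustrative data:

```python
scores = [0.7, 0.9, 0.8]

# join accepts only strings, so convert the numbers first
print(', '.join(str(s) for s in scores))   # 0.7, 0.9, 0.8

# An empty separator concatenates without any character in between
print(''.join(['10x', 'AI']))              # 10xAI
```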

The *map* is another very useful built-in function. It takes a function and one or more Iterables and applies the function to the elements of the Iterables. The function must accept as many arguments as there are Iterables, since it receives one element from each Iterable on every call.

```
simple_list = [1, 2, 3, 4, 5 ]
def func(x) : return x**2
# Get the square
list(map(func, simple_list)) # Will return an Iterator, so use list
# Use different exponent for different values
exponent_list = [5, 4, 3, 2, 1 ]
def func(x, y) : return x**y
list(map(func, simple_list,exponent_list)) # passing two list, hence function should accept 2 parameters
```
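
When the Iterables have different lengths, `map` stops at the shortest one. A minimal sketch using the built-in `pow`, with a deliberately shorter second list:

```python
bases = [1, 2, 3, 4, 5]
exponents = [5, 4, 3]          # deliberately shorter than bases

# pow receives two separate arguments per step: (1, 5), (2, 4), (3, 3)
print(list(map(pow, bases, exponents)))   # [1, 16, 27]
```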

The *filter* is another very handy built-in function. It takes a function and an Iterable and returns the elements of the Iterable for which the function evaluates to True.

```
simple_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
def odd(x) : return x%2 != 0 # The comparison already evaluates to True or False
# Get the Odd elements
list(filter(odd, simple_list)) # Will return an Iterator, so use list
# This is the same as this :-)
[elem for elem in simple_list if elem%2!=0]
```
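
A related shortcut: when `None` is passed in place of the function, `filter` keeps only the truthy elements. The sample values below are illustrative.

```python
values = [0, 1, '', 'AI', None, [], [10]]

# Passing None as the function keeps only the truthy elements
print(list(filter(None, values)))   # [1, 'AI', [10]]
```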

**Itertools** is a handy module in Python when you quickly want to generate the permutations, the combinations, or the product of sequences.

```
import itertools
comb = itertools.combinations([1,2,3,4], 3) # all combination(i.e. 2,3 and 3,2 are same ) of length 3
list(comb)
perm = itertools.permutations([1,2,3,4], 3) # all permutation (i.e. 2,3 and 3,2 are different ) of length 3
list(perm)
prod = itertools.product(['A','B'],[1,2,3,4]) # all Cartesian product
list(prod)
```
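
A quick way to check your understanding is to count the results and compare them with the usual formulas. A small sketch; the names `combs`, `perms` and `prods` are illustrative.

```python
import itertools

items = [1, 2, 3, 4]

combs = list(itertools.combinations(items, 3))
perms = list(itertools.permutations(items, 3))
prods = list(itertools.product(['A', 'B'], items))

print(len(combs))   # 4  -> 4! / (3! * 1!)
print(len(perms))   # 24 -> 4! / 1!
print(len(prods))   # 8  -> 2 * 4
print(combs[0], perms[1], prods[0])   # (1, 2, 3) (1, 2, 4) ('A', 1)
```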

We can use a Generator Expression *i.e. genexps* as a simple way to write an inline generator.

It works exactly like a List Comprehension except that it,

- Uses ( ) instead of [ ]
- Does not generate a List but an iterator yielding one item at a time
*i.e. saves memory*

**Use it when you want to pass an Iterator and do not intend to save the result**

```
gen = (x*x for x in range(10))
next(gen) # 0
next(gen) # 1
next(gen) # 4
list(gen) # Make a list of the remaining items i.e. [9, 16, 25, 36, 49, 64, 81]
# Just like List Comprehension without saving a List in memory
sum(x*x for x in range(10)) # Don't need the extra brackets when it is the only argument of a function
```
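
The sketch below illustrates the two practical differences from a list: the memory footprint and the fact that a generator can be consumed only once. The exact sizes reported by `sys.getsizeof` depend on the Python version, so only the comparison is shown.

```python
import sys

squares_list = [x * x for x in range(100_000)]
squares_gen = (x * x for x in range(100_000))

# The list stores every element; the generator stores only its current state
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))   # True

# A generator can be consumed only once
first_total = sum(squares_gen)
second_total = sum(squares_gen)
print(first_total == sum(squares_list))   # True
print(second_total)                       # 0 (already exhausted)
```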

The difference between an Iterable and an Iterator is quite confusing sometimes, though you may sail through without understanding these concepts.
Something is Iterable means we can loop over its elements.
This is the collection we use in a for loop i.e. `for elem in Iterable:`

*An Iterator is more of the background implementation that makes the loop happen*

**Technically -**

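An Iterable, when passed to the built-in **iter()** function, returns an Iterator.
An Iterator has a `__next__()` method, which the built-in **next()** function calls to fetch the next element.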

```
s = 'cat'
t = iter(s) # Made an Iterator
next(t) # call the next built-in function
# Output 'c'
next(t)
# Output 'a'
```
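
To see both roles side by side, here is a minimal sketch of a hand-written Iterable and its Iterator. The classes `Countdown` and `CountdownIterator` are hypothetical and exist only for illustration; in practice a generator function is usually the shorter way to get the same behaviour.

```python
class Countdown:
    """Iterable: hands out a fresh iterator each time iter() is called on it."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """Iterator: remembers the position and produces one item per next() call."""
    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self                  # an iterator is its own iterator

    def __next__(self):
        if self.current <= 0:
            raise StopIteration      # tells the for-loop to stop
        self.current -= 1
        return self.current + 1


print(list(Countdown(3)))   # [3, 2, 1]

it = iter(Countdown(2))
print(next(it), next(it))   # 2 1
```

A for-loop does the same thing behind the scenes: it calls `iter()` once and then keeps calling `next()` until `StopIteration` is raised.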


Read this Answer at SO^{https://stackoverflow.com/a/9884501/2187625}