AI/ML/Deep Learning Cheat Sheet
Dependent Variable
- The variable a neural network is trained to predict.
Independent Variable
- Variables used as training input to predict the Dependent Variable.
Continuous Variables
- Variables that are not discrete and can take any value within a range. They can be used natively in multiplication, e.g. price, age, height, weight.
Categorical Variables
- Discrete variables used to classify or label an object. They cannot easily be used in multiplication, or have no meaning in it, e.g. a color, a random user id, a country name.
Embedding
- An Embedding is a collection of numbers used to represent a Categorical Variable. Since Categorical Variables have no inherent numerical value, Embeddings stand in for them during the training process.
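To make the idea of an Embedding concrete, here is a minimal sketch using NumPy. The category names, the vector size, and the `embed` helper are all assumptions chosen for illustration, and the numbers are random; in a real network they would be learned during training.

```python
import numpy as np

# Hypothetical categorical values; any discrete labels would work.
colors = ["red", "green", "blue"]

# Map each category to an integer row index.
color_to_index = {color: i for i, color in enumerate(colors)}

# One row of numbers per category. Random here; a network would
# adjust these values during training.
embedding_dim = 4  # assumed size, picked only for this example
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(colors), embedding_dim))


def embed(color):
    """Return the vector that represents the given category."""
    return embedding_matrix[color_to_index[color]]


green_vector = embed("green")
print(green_vector.shape)  # (4,)
```

Unlike the raw label, the resulting vector can be multiplied and added like any other numeric input.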
Pandas
Loading a csv
import pandas as pd
df = pd.read_csv("./decision_tree_sample.csv")
Writing a data frame to csv
df.to_csv('output.csv')
- If you don't need the row index written to the file
df.to_csv('output.csv', index=False)
DataFrame Columns
df.columns
Iterating through column values
for column in df.columns:
    for value in df[column]:
        print(value)
Iterating over rows
for idx, row in df.iterrows():
    print(row)
Mutation through a Vectorized Approach
The Vectorized Approach is a more efficient pattern for manipulating row entries: it processes a column's data all at once instead of looping through each row and calculating and updating per row.
The example below calculates all the lag values for a column and then updates the whole column in one assignment. Note that it still uses iterrows to compute the values, so only the final update is column-wise.
def add_lag_features(df, num_weeks):
    # Create the lag feature columns in the dataframe.
    for i in range(1, 4):
        if f"LAG_{i}" not in df.columns:
            df[f"LAG_{i}"] = 0
    # setup_buckets is a helper that is not shown here. Judging by the
    # lookups below, it returns {week: {name: {"UNITS": quantity}}}.
    buckets = setup_buckets(df, num_weeks)
    # Build each lag column as a list, then assign the whole column at once.
    # LAG_1, LAG_2, LAG_3
    lag_nums = [1, 2, 3]
    for num in lag_nums:
        lag_col = f"LAG_{num}"
        lag_values = []
        for idx, row in df.iterrows():
            week = row["WEEK"]
            lag_idx = week - num
            if lag_idx > 0:  # Valid lag index
                name = row["NAME"]
                lag_quantity = buckets.get(lag_idx, {}).get(name, {}).get("UNITS", 0)
            else:
                lag_quantity = 0  # Default value for invalid lag
            lag_values.append(lag_quantity)
        df[lag_col] = lag_values  # Assign the calculated lag values
    return df
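For comparison, here is a sketch of the same lag calculation with no per-row loop at all. It is not the approach used above: it swaps the buckets lookup for a pandas merge, and it assumes each (NAME, WEEK) pair appears at most once and that the columns are named NAME, WEEK and UNITS. The function name `add_lag_features_vectorized` and the sample data are hypothetical.

```python
import pandas as pd


def add_lag_features_vectorized(df, lags=(1, 2, 3)):
    """Add LAG_n columns holding UNITS from n weeks earlier for the same NAME.

    Assumes each (NAME, WEEK) pair is unique. Returns a new data frame
    with a fresh 0..n-1 index; the input is left unchanged.
    """
    result = df.copy()
    for num in lags:
        lag_col = f"LAG_{num}"
        # Copy the units and move them forward by `num` weeks, so that
        # week w in the lookup carries the units sold in week w - num.
        lookup = df[["NAME", "WEEK", "UNITS"]].rename(columns={"UNITS": lag_col})
        lookup["WEEK"] = lookup["WEEK"] + num
        # Replace any existing lag column, then join on NAME and WEEK.
        result = result.drop(columns=[lag_col], errors="ignore")
        result = result.merge(lookup, on=["NAME", "WEEK"], how="left")
        # Rows with no earlier week get 0, matching the default above.
        result[lag_col] = result[lag_col].fillna(0)
    return result


# Made-up sample data for illustration.
sales = pd.DataFrame({
    "NAME": ["a", "a", "a", "b", "b"],
    "WEEK": [1, 2, 3, 1, 2],
    "UNITS": [10, 20, 30, 5, 7],
})
lagged = add_lag_features_vectorized(sales)
print(lagged)
```

Because the merge introduces missing values before they are filled, the lag columns come back as floats. Cast them with astype(int) if integers are needed.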
Drop Columns
df = df.drop(columns=["COLUMN_A", "COLUMN_B"])
Get row by index
df.iloc[idx]
Filtering rows based on column value
- We can get a subset of the data frame by filtering on a column value
week_1 = df.loc[df["WEEK"] == 1]
Splitting data frames
- We can split dataframes on a condition
condition = df["Age"] < 50
sample_split = df[condition]
Boolean Mask Condition
We can also invert a condition by applying the ~ operator to its boolean mask.
This example splits the data frame into two subsets using a boolean mask.
condition = df["Age"] < 50
group_1 = df[condition]
group_2 = df[~condition]
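A small self-contained run of the split above, using a made-up data frame (the names and ages are hypothetical), shows that the mask and its inverse together cover every row exactly once:

```python
import pandas as pd

# Made-up data for illustration.
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [34, 61, 50]})

# The condition is a boolean Series with one True/False per row.
condition = people["Age"] < 50

group_1 = people[condition]   # rows where Age < 50
group_2 = people[~condition]  # every other row (Age >= 50 here)

print(group_1["Name"].tolist())  # ['Ann']
print(group_2["Name"].tolist())  # ['Bob', 'Cy']
```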
Getting the last item
Use tail to get the last row of a data frame. The result is a one-row data frame.
last_item = df.tail(1)
Update all rows in a column
df["WEEK"] = new_week
Mutate the original dataframe
- loc tells pandas to mutate the original data frame, NOT a copy
- apply applies a lambda function to every value in the column
df.loc[:, "WEEK"] = df["WEEK"].apply(lambda x: num_weeks + 1 - x)
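For simple arithmetic like this, the lambda can usually be replaced by an expression on the whole column, which ties in with the vectorized approach described earlier. A minimal sketch, with a made-up data frame and `num_weeks` value:

```python
import pandas as pd

num_weeks = 4  # assumed value for illustration
df = pd.DataFrame({"WEEK": [1, 2, 3, 4]})

# Per-value version using apply, as in the line above.
with_apply = df["WEEK"].apply(lambda x: num_weeks + 1 - x)

# Column-wise version: the arithmetic is applied to every value at once.
with_arithmetic = num_weeks + 1 - df["WEEK"]

# Both produce the same reversed week numbers.
assert with_apply.equals(with_arithmetic)

df.loc[:, "WEEK"] = with_arithmetic
print(df["WEEK"].tolist())  # [4, 3, 2, 1]
```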
Reset indexes
- Sometimes when combining data frames, there might be duplicate indexes. Resetting the index prevents any collisions.
new_df = pd.DataFrame(rows).reset_index(drop=True)
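The collision is easy to reproduce by concatenating two data frames that each start at index 0. This sketch uses made-up frames to show the index before and after the reset:

```python
import pandas as pd

# Two made-up frames, each with the default index 0, 1.
first = pd.DataFrame({"WEEK": [1, 2]})
second = pd.DataFrame({"WEEK": [3, 4]})

combined = pd.concat([first, second])
print(combined.index.tolist())  # [0, 1, 0, 1]

# drop=True discards the old index instead of keeping it as a column.
clean = combined.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```

When the frames are combined with pd.concat, passing ignore_index=True gives the same clean index in one step.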