AI/ML/Deep Learning Cheat Sheet

Nov 19, 2024 · Christopher Coverdale · 3 min read

Cheat Sheet

  • Dependent Variable - The variable a Neural Network is trained to predict.

  • Independent Variable - A variable used as input during training to predict the Dependent Variable.

  • Continuous Variables - Numeric variables that can take any value within a range. They can be used directly in multiplication, e.g. price, age, height, weight.

  • Categorical Variables - Discrete variables used to classify or label an object. Multiplying them is either impractical or meaningless, e.g. a color, a random user id, a country name.

  • Embedding - A collection of numbers (a vector) used to represent a value of a Categorical Variable. Since Categorical Variables have no inherent numerical value, Embeddings stand in for them during training.
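As a minimal sketch of the Embedding idea, the snippet below maps each category to a row in a small table of numbers. The category names, the embedding size of 3 and the random values are all illustrative assumptions. In a real network the values would start out random and be adjusted during training.

```python
import random

# Hypothetical categorical variable: a colour name.
categories = ["red", "green", "blue"]

# Map each category to an integer id (its row in the embedding table).
category_to_id = {name: i for i, name in enumerate(categories)}

# Embedding table: one row of `embedding_dim` numbers per category.
# The values are random here; during training they would be learned.
embedding_dim = 3
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in categories
]


def embed(name):
    """Look up the vector that represents a category."""
    return embedding_table[category_to_id[name]]


blue_vector = embed("blue")
print(blue_vector)
```

Unlike the raw label, the looked-up vector is made of plain numbers, so it can take part in the multiplications a network performs.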

Pandas

Loading a csv

import pandas as pd

df = pd.read_csv("./decision_tree_sample.csv")

Writing a data frame to csv

df.to_csv('output.csv')
  • If you don't need the row index written to the file
df.to_csv('output.csv', index=False)
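To see what index=False changes, this small sketch writes the same data frame twice into in-memory buffers and compares the header lines. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"NAME": ["a", "b"], "UNITS": [10, 20]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,NAME,UNITS
print(header_without_index)  # NAME,UNITS
```

Leaving the index in is what produces the extra "Unnamed: 0" column when the file is read back with read_csv.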

DataFrame Columns

df.columns

Iterating through column values

for column in df.columns:
    for value in df[column]:
        print(value)

Iterating over rows

for idx, row in df.iterrows():
    print(row)

Mutation through a Vectorized Approach

The Vectorized Approach is a more efficient pattern for manipulating row entries: it processes a whole column at once instead of looping through each row and calculating and updating one row at a time.

The example below calculates all the lag values for a column first and then assigns the whole column in a single step, rather than writing to the data frame row by row. Note that it still loops over the rows with iterrows to build the values, so it is only partly vectorized.

def add_lag_features(df, num_weeks):
    lag_nums = [1, 2, 3]  # LAG_1, LAG_2, LAG_3

    # Create the lag feature columns in the dataframe.
    for num in lag_nums:
        if f"LAG_{num}" not in df.columns:
            df[f"LAG_{num}"] = 0

    # setup_buckets is a helper defined elsewhere (not shown here). From its
    # use below, it returns a nested mapping: week -> name -> {"UNITS": ...}.
    buckets = setup_buckets(df, num_weeks)

    # Build each lag column as a list, then assign it in one step.
    for num in lag_nums:
        lag_col = f"LAG_{num}"
        lag_values = []

        for idx, row in df.iterrows():
            week = row["WEEK"]
            lag_idx = week - num

            if lag_idx > 0:  # Valid lag index
                name = row["NAME"]
                lag_quantity = buckets.get(lag_idx, {}).get(name, {}).get("UNITS", 0)
            else:
                lag_quantity = 0  # Default value for invalid lag

            lag_values.append(lag_quantity)

        df[lag_col] = lag_values  # Assign the whole column at once

    return df
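For comparison, here is a sketch of how the same lag columns could be computed with no row loop at all, using a shifted copy of the data and a merge. This is not the original implementation. It assumes exactly one row per (NAME, WEEK) pair and a UNITS column, and the function name and sample data are hypothetical.

```python
import pandas as pd


def add_lag_features_vectorized(df, lags=(1, 2, 3)):
    """Add LAG_n columns without looping over rows.

    Assumes one row per (NAME, WEEK) pair and a UNITS column.
    """
    out = df.copy()
    for num in lags:
        lag_col = f"LAG_{num}"

        # Move each week forward by `num`, so the units sold in week W
        # line up with the row for week W + num.
        shifted = df[["NAME", "WEEK", "UNITS"]].copy()
        shifted["WEEK"] = shifted["WEEK"] + num
        shifted = shifted.rename(columns={"UNITS": lag_col})

        # A left merge keeps every original row; rows with no earlier
        # week get NaN, which is replaced by the default of 0.
        out = out.drop(columns=[lag_col], errors="ignore")
        out = out.merge(shifted, on=["NAME", "WEEK"], how="left")
        out[lag_col] = out[lag_col].fillna(0)
    return out


# Hypothetical sample data.
sales = pd.DataFrame({
    "NAME": ["apple", "apple", "apple", "pear", "pear"],
    "WEEK": [1, 2, 3, 1, 2],
    "UNITS": [5, 7, 9, 2, 4],
})
result = add_lag_features_vectorized(sales)
print(result)
```

The merge does the per-row lookup for the whole column at once, which is where the speed-up over iterrows comes from on larger data frames.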

Drop Columns

df = df.drop(columns=["COLUMN_A", "COLUMN_B"])

Get row by index

df.iloc[idx]
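Note that iloc selects by position, while loc selects by index label. The two only coincide when the index is the default 0, 1, 2, and so on. A small sketch with made-up data and a non-default index:

```python
import pandas as pd

# Hypothetical data with index labels that differ from row positions.
df = pd.DataFrame({"NAME": ["a", "b", "c"]}, index=[10, 20, 30])

by_position = df.iloc[0]  # first row, whatever its label is
by_label = df.loc[10]     # row whose index label is 10
last_row = df.iloc[-1]    # negative positions count from the end

print(by_position["NAME"], by_label["NAME"], last_row["NAME"])
```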

Filtering rows based on column value

  • We can get a subset of the data frame by filtering on a column value
week_1 = df.loc[df["WEEK"] == 1]

Splitting data frames

  • We can split dataframes on a condition
condition = df["Age"] < 50
sample_split = df[condition]

Boolean Mask Condition

  • We can also use a boolean mask to invert a condition with the ~ operator.

  • This example splits the data frame into two subsets using a boolean mask

condition = df["Age"] < 50

group_1 = df[condition]
group_2 = df[~condition]
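A runnable version of the split, using made-up names and ages, shows that the mask and its inverse place every row in exactly one group:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Age": [34, 61, 50, 19],
})

condition = df["Age"] < 50  # boolean Series, one True/False per row

group_1 = df[condition]   # rows where Age < 50
group_2 = df[~condition]  # all remaining rows (Age >= 50)

print(len(group_1), len(group_2))
```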

Getting the last item

Use tail to get the last row of a data frame. The result is a one-row data frame.

last_item = df.tail(1)

Update all rows in a column

df["WEEK"] = new_week

Mutate the original dataframe

  • Assigning through loc writes to the original data frame, NOT to a copy
  • apply calls the lambda function on every value in the column
df.loc[:, "WEEK"] = df["WEEK"].apply(lambda x: num_weeks + 1 - x)
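For simple arithmetic like this, the same update can also be written without apply, as plain column arithmetic. The sketch below uses a made-up num_weeks and sample column and checks that both forms give the same result.

```python
import pandas as pd

# Hypothetical values for illustration.
num_weeks = 4
df = pd.DataFrame({"WEEK": [1, 2, 3, 4]})

# Element by element, through a Python lambda.
with_apply = df["WEEK"].apply(lambda x: num_weeks + 1 - x)

# The whole column at once, as vectorized arithmetic.
with_arithmetic = num_weeks + 1 - df["WEEK"]

same_result = with_apply.equals(with_arithmetic)
df["WEEK"] = with_arithmetic
print(same_result, df["WEEK"].tolist())
```

The arithmetic form avoids calling a Python function per value, so it tends to be faster on large columns.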

Reset indexes

  • When data frames are combined, the result can contain duplicate index labels. Resetting the index prevents collisions.
new_df = pd.DataFrame(rows).reset_index(drop=True)
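A short sketch of how duplicate labels arise and how reset_index clears them. It uses pd.concat on two small made-up data frames.

```python
import pandas as pd

# Two hypothetical data frames, each with the default index 0, 1.
first = pd.DataFrame({"WEEK": [1, 2]})
second = pd.DataFrame({"WEEK": [3, 4]})

# concat keeps the original labels, so 0 and 1 each appear twice.
combined = pd.concat([first, second])
duplicate_labels = combined.index.tolist()

# drop=True discards the old labels instead of keeping them as a column.
clean = combined.reset_index(drop=True)
clean_labels = clean.index.tolist()

print(duplicate_labels, clean_labels)
```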
