Cheat Sheet

Dependent Variable - The variable a Neural Network is trained to predict.
Independent Variable - Variables that are used as training input to predict the Dependent Variable.
Continuous Variables - Variables that are not discrete and can take any value within a range. They can be used natively in multiplication, e.g. price, age, height, weight.
Categorical Variables - Discrete variables that are used to classify or label an object. They cannot be used meaningfully in multiplication, e.g. a color, a random user id, a country name.
Embedding - A collection of numbers used to represent a Categorical Variable. Since Categorical Variables do not have an inherent numerical value, Embeddings stand in for them during the training process.
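To make the Embedding definition concrete, here is a minimal sketch of a lookup table that maps each category to a row of numbers. The country names, the embedding size of 4, and the `embed` helper are all made up for illustration, and the values are only randomly initialised; in a real network they would be adjusted during training.

```python
import numpy as np

# Hypothetical categorical values, chosen only for illustration.
countries = ["Canada", "Japan", "Brazil"]

# Map each category to an integer index.
country_to_idx = {name: i for i, name in enumerate(countries)}

# One row of numbers per category. These are random starting values;
# training would normally update them.
embedding_dim = 4
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(countries), embedding_dim))


def embed(name):
    """Return the vector that currently represents the given category."""
    return embedding_table[country_to_idx[name]]


japan_vector = embed("Japan")
print(japan_vector.shape)
```

Unlike the raw category label, the resulting vector can be multiplied and added like any Continuous Variable.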

Pandas

Loading a csv

import pandas as pd

df = pd.read_csv("./decision_tree_sample.csv")

Writing a data frame to csv

df.to_csv('output.csv')

If you don’t need the row index in the output:

df.to_csv('output.csv', index=False)
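The difference between the two calls is easiest to see in the header line. The sketch below writes to in-memory buffers instead of files so it can run anywhere; the sample columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame({"NAME": ["a", "b"], "UNITS": [10, 20]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,NAME,UNITS
print(header_without_index)  # NAME,UNITS
```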

DataFrame Columns

df.columns

Iterating through column values

for column in df.columns:
    for value in df[column]:
        print(value)

Iterating over rows

for idx, row in df.iterrows():
    print(row)

Mutation through a Vectorized Approach

The Vectorized Approach is a more efficient pattern for manipulating row entries: it processes a whole column at once instead of looping through each row and doing the calculation and update per row.

The example below computes every lag value for a column first and then assigns the whole column in one operation. Note that it still iterates over the rows to look up each value; only the final assignment is column-wise.

def add_lag_features(df, num_weeks):
    # Create the lag feature columns in the dataframe.
    for i in range(1, 4):
        if f"LAG_{i}" not in df.columns:
            df[f"LAG_{i}"] = 0

    buckets = setup_buckets(df, num_weeks)

    # Build each lag column (LAG_1, LAG_2, LAG_3) as a list of values,
    # then assign it to the dataframe in a single step.
    lag_nums = [1, 2, 3]
    for num in lag_nums:
        lag_col = f"LAG_{num}"
        lag_values = []

        for idx, row in df.iterrows():
            week = row["WEEK"]
            lag_idx = week - num

            if lag_idx > 0:  # Valid lag index
                name = row["NAME"]
                lag_quantity = buckets.get(lag_idx, {}).get(name, {}).get("UNITS", 0)
            else:
                lag_quantity = 0  # Default value for invalid lag

            lag_values.append(lag_quantity)

        df[lag_col] = lag_values  # Assign the calculated lag values

    return df
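For comparison, here is a sketch of the same idea without any row loop, using a merge. It is not the function above rewritten exactly: `setup_buckets` is not shown here, so this version assumes the lag value is simply the UNITS of the same NAME a given number of weeks earlier, and that each (NAME, WEEK) pair appears at most once. The function name and sample data are hypothetical.

```python
import pandas as pd


def add_lag_features_merge(df, lags=(1, 2, 3)):
    """Add LAG_n columns holding UNITS for the same NAME n weeks earlier."""
    # Drop any existing lag columns so the merge does not create suffixed names.
    result = df.drop(columns=[f"LAG_{n}" for n in lags], errors="ignore")
    for n in lags:
        # Move every row n weeks forward so it lines up with the week that needs it.
        shifted = df[["NAME", "WEEK", "UNITS"]].copy()
        shifted["WEEK"] = shifted["WEEK"] + n
        shifted = shifted.rename(columns={"UNITS": f"LAG_{n}"})
        # A left merge keeps every original row; missing history becomes NaN, then 0.
        result = result.merge(shifted, on=["NAME", "WEEK"], how="left")
        result[f"LAG_{n}"] = result[f"LAG_{n}"].fillna(0)
    return result


# Hypothetical sample data.
sales = pd.DataFrame({
    "NAME": ["a", "a", "a", "b", "b"],
    "WEEK": [1, 2, 3, 1, 3],
    "UNITS": [10, 20, 30, 5, 7],
})
with_lags = add_lag_features_merge(sales)
print(with_lags)
```

Here pandas does the lookup for the whole column at once, which is where the Vectorized Approach gets its speed on larger data frames.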

Drop Columns

df = df.drop(columns=["COLUMN_A", "COLUMN_B"])

Get row by index

df.iloc[idx]

Filtering rows based on column value

We can select a subset of the data frame's rows by filtering on a column value.

week_1 = df.loc[df["WEEK"] == 1]

Splitting data frames

We can split data frames on a condition.

condition = df["Age"] < 50
sample_split = df[condition]

Boolean Mask Condition

We can also use a boolean mask to invert a condition.
This example splits the data frame into two subsets using a boolean mask.

condition = df["Age"] < 50

group_1 = df[condition]
group_2 = df[~condition]
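A small runnable sketch of the split above, with made-up names and ages, shows that the mask is just a column of True/False values and that the two groups together cover every row.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Age": [34, 61, 50, 19],
})

# The mask holds one boolean per row.
condition = people["Age"] < 50
print(condition.tolist())  # [True, False, False, True]

group_1 = people[condition]   # rows where Age < 50
group_2 = people[~condition]  # every other row (Age >= 50)

print(group_1["Name"].tolist())  # ['Ann', 'Di']
print(group_2["Name"].tolist())  # ['Bob', 'Cy']
```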

Getting the last item

Use tail to get the last row of a data frame.

last_item = df.tail(1)
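One detail worth knowing: `tail(1)` returns a one-row data frame, while `iloc[-1]` returns the same row as a Series. The sketch below uses hypothetical sample data to show the difference.

```python
import pandas as pd

# Hypothetical sample data.
weeks = pd.DataFrame({"WEEK": [1, 2, 3], "UNITS": [10, 20, 30]})

last_as_frame = weeks.tail(1)    # DataFrame containing only the last row
last_as_series = weeks.iloc[-1]  # the last row as a Series

print(type(last_as_frame).__name__)   # DataFrame
print(type(last_as_series).__name__)  # Series
print(last_as_series["UNITS"])        # 30
```

Which one is more convenient depends on whether the next step expects a table or a single record.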

Update all rows in a column

df["WEEK"] = new_week

Mutate the original dataframe

loc assigns into the original data frame, not a copy.
apply calls a lambda function on every value in the column.

df.loc[:, "WEEK"] = df["WEEK"].apply(lambda x: num_weeks + 1 - x)
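For simple arithmetic like this, the lambda is optional: the same calculation can be written directly on the column. The sketch below assumes `num_weeks` is 4 and uses a small made-up data frame to check that both forms agree.

```python
import pandas as pd

num_weeks = 4  # hypothetical value for illustration
schedule = pd.DataFrame({"WEEK": [1, 2, 3, 4]})

# Per-value version, as in the snippet above.
with_apply = schedule["WEEK"].apply(lambda x: num_weeks + 1 - x)

# Column-wise version: the arithmetic is applied to the whole column at once.
with_arithmetic = num_weeks + 1 - schedule["WEEK"]

print(with_apply.tolist())       # [4, 3, 2, 1]
print(with_arithmetic.tolist())  # [4, 3, 2, 1]

# Write the result back into the original data frame.
schedule.loc[:, "WEEK"] = with_arithmetic
```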

Reset indexes

Combining data frames can produce duplicate index values. Resetting the index prevents collisions.

new_df = pd.DataFrame(rows).reset_index(drop=True)
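The duplicate index problem is easy to reproduce with `pd.concat`. This sketch uses two small made-up data frames; each starts its own index at 0, so the combined frame repeats index values until it is reset.

```python
import pandas as pd

# Hypothetical sample data.
first = pd.DataFrame({"WEEK": [1, 2]})
second = pd.DataFrame({"WEEK": [3, 4]})

combined = pd.concat([first, second])
print(combined.index.tolist())  # [0, 1, 0, 1]

# drop=True discards the old index instead of keeping it as a column.
clean = combined.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```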
