Pandas: Creating Calculated Column
There are several ways to create a calculated column in a pandas data frame. A loop can iterate over each row and compute a value for it. This is fine on a small dataset, but it becomes slow on large data. Below are two examples: one using a for loop on a small slice of a dataset, and another using the apply method. The for loop is shown for comparison, but the apply method is the better choice because it is more concise and typically faster.
For Loop Solution
Using a bit of Python and pandas, we can create calculated values for a new column in a data frame. As an example, I’m using a Netflix dataset. After narrowing that data down to the TV-MA Netflix movies made in 2021, we can add a calculated column holding the length of each film’s title.
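The subsetting step mentioned above might look something like the sketch below. The rows are made-up stand-ins, and the column names (show_id, type, title, release_year, rating) are assumptions about how the dataset is laid out, so adjust them to match the real file.

```python
import pandas as pd

# Hypothetical stand-in rows; the column names are assumptions about
# the dataset's layout, not taken from the real file.
netflix = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3", "s4", "s5"],
        "type": ["Movie", "TV Show", "Movie", "Movie", "Movie"],
        "title": ["Example One", "Example Two", "A Longer Example Title",
                  "Old Example", "Family Example"],
        "release_year": [2021, 2021, 2021, 2019, 2021],
        "rating": ["TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG-13"],
    }
).set_index("show_id")

# Combine three conditions into one boolean mask.
mask = (
    (netflix["type"] == "Movie")
    & (netflix["rating"] == "TV-MA")
    & (netflix["release_year"] == 2021)
)

# .copy() makes an independent frame, so new columns can be added to it
# without pandas warning about writing to a slice of the original.
tvma_2021 = netflix[mask].copy()
print(tvma_2021)
```

Setting show_id as the index is also an assumption here; it is what lets each row be addressed by its movie ID.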
The idea below is to loop through the data frame and, for each row, set the value of a new column called “title_length” to the length of that row’s title. The length is computed with Python’s built-in len() function:
for label, row in tvma_2021.iterrows():
    # .loc writes a single cell, addressed by row label and column name
    tvma_2021.loc[label, "title_length"] = len(row["title"])
print(tvma_2021)
S37, S87, and S94 are movie IDs, and title_length is the new column, holding the values calculated from each title’s length.
Apply Method Solution
No loop is required with this approach. Instead, we name the new column using the syntax dataframe[“new_column_name”] and set it equal to the column we want to process, followed by a .apply() method call. Inside .apply(), we supply the function to run on each value.
tvma_2021["len"] = tvma_2021["title"].apply(len)
print(tvma_2021[["title", "len"]])
In the results above, only the title and the new “len” column are output (using the syntax dataframe[[“column1”, “column2”]]). The new column holds the length of each title, computed by passing Python’s built-in len() function to the apply method.
The code can be read as: create a column “len” and set each row’s value to the length of that row’s title.
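To check the speed difference between the two approaches, a rough comparison along these lines could be run. This is only a sketch: the titles are generated stand-ins, the helper names with_loop and with_apply are hypothetical, and timings will vary by machine and pandas version. It also includes the .str.len() string accessor, a third option not covered above, purely as a point of reference.

```python
import timeit

import pandas as pd

# Hypothetical data: 1,000 made-up titles of varying length.
titles = pd.DataFrame(
    {"title": [f"Title {'x' * (i % 40)}" for i in range(1_000)]}
)


def with_loop(df):
    """Row-by-row version, mirroring the for loop approach."""
    out = df.copy()
    for label, row in out.iterrows():
        out.loc[label, "title_length"] = len(row["title"])
    return out["title_length"].astype(int)


def with_apply(df):
    """Same calculation using apply."""
    return df["title"].apply(len)


loop_result = with_loop(titles)
apply_result = with_apply(titles)
accessor_result = titles["title"].str.len()

# All three approaches should give identical values.
same = (
    loop_result.tolist() == apply_result.tolist() == accessor_result.tolist()
)

loop_time = timeit.timeit(lambda: with_loop(titles), number=3)
apply_time = timeit.timeit(lambda: with_apply(titles), number=3)
print(f"same values: {same}")
print(f"iterrows loop: {loop_time:.4f}s, apply: {apply_time:.4f}s")
```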
Now that the column exists, we can subset the data frame to find titles that are exceptionally long (over 80 characters). Metrics can then be run to see whether long titles draw less interest than shorter ones.
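A minimal sketch of that subsetting step, using made-up rows in place of the real dataset (the index labels and titles here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in rows; real titles would come from the dataset.
tvma_2021 = pd.DataFrame(
    {"title": ["Short Title", "A" * 85, "Another Short One", "B" * 120]},
    index=["s1", "s2", "s3", "s4"],
)
tvma_2021["len"] = tvma_2021["title"].apply(len)

# Boolean mask: True where the title is longer than 80 characters.
long_titles = tvma_2021[tvma_2021["len"] > 80]
print(long_titles[["title", "len"]])
```

The comparison tvma_2021["len"] > 80 produces a True/False value per row, and indexing the data frame with it keeps only the rows marked True.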