joins

sparkcraft.joins.add_salt_column(df, skew_factor)[source]

Adds a salt column to a DataFrame. We will be using this salt column when we are trying to perform join, groupBy, etc. operations into a skewed DataFrame. The idea is to add a random column and use the original keys + this salted key to perform the operations, so that we can avoid data skewness and possibly, OOM errors.

Parameters:

df (DataFrame) – A PySpark DataFrame
skew_factor (int) – The skew factor. For example, if we set this value to 3, then the salted column will be populated by the elements 0, 1 and 2, extracted from a uniform probability distribution.

Returns:

The original DataFrame with a salt_id column.

Return type:

DataFrame

sparkcraft.joins.optimal_cross_join(df_bc, df)[source]

A simple trick to solve the problem we have with CrossJoins between two dataframes, when the resulting partitions will be a multiplication of the initial partitions (e.g. if we make the cross join between df1 and df2 both with 100 partitions, then the resulting partitions will be 10_000).

Parameters:

df_bc (DataFrame) – The DataFrame to be broadcasted
df (DataFrame) – The DataFrame

Returns:

The cross joined DataFrame

Return type:

DataFrame