partitions
- sparkcraft.partitions.add_partition_id_col(df, partition_id_colname='partition_id')[source]
Adds a column named partition_id to the input DataFrame which represents the partition id as output by pyspark.sql.functions.spark_partition_id method.
- Parameters:
df (DataFrame) – A PySpark DataFrame
partition_id_colname (str) – The name of the column containing the partition id
- Returns:
The input DataFrame with an additional column (partition_id) which represents the partition id
- sparkcraft.partitions.get_optimal_number_of_partitions(df, partition_cols=None, df_sample_perc=None, target_size_in_bytes=134217728, estimate_biggest_key_probability=0.95, estimate_biggest_key_relative_error=0.0)[source]
This method calculated the optimal number of partitions for the input PySpark DataFrame df.
- Parameters:
df (DataFrame) – A PySpark DataFrame
partition_cols (str | List[str] | None) – The columns, if provided, to partition the DataFrame by
df_sample_perc (float | None) – If provided, the sampling percentage for approximate size estimation
target_size_in_bytes (int) – The target size of each partition (~128MB)
estimate_biggest_key_probability (float) – In order to estimate the biggest key (that is, the partition cols key that contains the highest number of elements inside to estimate the size of the partitions).
estimate_biggest_key_relative_error (float) – The relative error of the estimate_biggest_key_probability estimation. Defaults to 0. (to obtain exact quantiles), but be careful with this since operation may be very expensive.
- Returns:
The optimal number of partition for the given DataFrame
- sparkcraft.partitions.get_partition_count_df(df)[source]
Generates a DataFrame containing the number of elements for each partition. This method may be handy when trying to determine if data is skewed.
- Parameters:
df (DataFrame) – A PySpark DataFrame
- Returns:
A DataFrame containing partition_id and count columns
- Return type:
DataFrame
- sparkcraft.partitions.get_partition_count_distribution(df, probabilities, relative_error=0.0)[source]
Generates a DataFrame containing
- Parameters:
df (DataFrame) – A PySpark DataFrame
probabilities (List[float]) – The list of probabilities to be shown in the output DataFrame
relative_error (float) – The relative target precision. For more information, check PySpark’s documentation about approxQuantile (https://spark.apache.org/docs/latest/api/python/ reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html). Defaults to 0. to obtain exact quantiles, but if the operation is too expensive you can increase this value (although the quantile precision will diminish)
- Returns:
A list containing the value for each probability.
- Return type:
List[float]
- sparkcraft.partitions.remove_empty_partitions(df)[source]
This method will remove empty partitions from a DataFrame. It is useful after a filter, for example, when a great number of partitions may contain zero registers.
Note: This functionality may be useless if you are using Adaptive Query Execution from Spark 3.0
- Parameters:
df (DataFrame) – A pyspark DataFrame
- Returns:
A DataFrame with all empty partitions removed