partitions

sparkcraft.partitions.add_partition_id_col(df, partition_id_colname='partition_id')

Adds a column (named partition_id by default) to the input DataFrame containing the partition ID of each row, as returned by pyspark.sql.functions.spark_partition_id.

Parameters:
  • df (DataFrame) – A PySpark DataFrame

  • partition_id_colname (str) – The name of the column containing the partition id

Returns:

The input DataFrame with an additional column (named by partition_id_colname, partition_id by default) containing the partition ID

sparkcraft.partitions.get_optimal_number_of_partitions(df, partition_cols=None, df_sample_perc=None, target_size_in_bytes=134217728, estimate_biggest_key_probability=0.95, estimate_biggest_key_relative_error=0.0)

Calculates the optimal number of partitions for the input PySpark DataFrame df.

Parameters:
  • df (DataFrame) – A PySpark DataFrame

  • partition_cols (str | List[str] | None) – The columns, if provided, to partition the DataFrame by

  • df_sample_perc (float | None) – If provided, the sampling percentage for approximate size estimation

  • target_size_in_bytes (int) – The target size of each partition, in bytes. Defaults to 134217728 (128 MiB)

  • estimate_biggest_key_probability (float) – The probability (quantile) used to estimate the biggest key, that is, the partition_cols key that contains the highest number of elements. This estimate is used to estimate the size of the partitions. Defaults to 0.95.

  • estimate_biggest_key_relative_error (float) – The relative error allowed when estimating the quantile given by estimate_biggest_key_probability. Defaults to 0.0, which yields exact quantiles but may be very expensive to compute.

Returns:

The optimal number of partitions for the given DataFrame
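
The role of target_size_in_bytes can be illustrated with a small calculation. The sketch below assumes that the partition count is simply the estimated DataFrame size divided by the target size, rounded up. The helper estimate_partition_count and its estimated_size_in_bytes argument are hypothetical; the sketch does not show how sparkcraft estimates sizes, samples the DataFrame, or accounts for partition_cols.

```python
DEFAULT_TARGET_SIZE_IN_BYTES = 134217728  # 128 MiB, the documented default


def estimate_partition_count(estimated_size_in_bytes,
                             target_size_in_bytes=DEFAULT_TARGET_SIZE_IN_BYTES):
    """Hypothetical helper: partitions needed so each holds at most the target size."""
    if target_size_in_bytes <= 0:
        raise ValueError("target_size_in_bytes must be positive")
    if estimated_size_in_bytes < 0:
        raise ValueError("estimated_size_in_bytes must not be negative")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    count = (estimated_size_in_bytes + target_size_in_bytes - 1) // target_size_in_bytes
    # Keep at least one partition, even for an empty DataFrame.
    return max(1, count)


# A DataFrame estimated at 1 GiB, split into 128 MiB partitions.
print(estimate_partition_count(1024 ** 3))  # 8
```

Under this assumption, a size just one byte above the target already requires a second partition, which is why the result is rounded up rather than to the nearest integer.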

sparkcraft.partitions.get_partition_count_df(df)

Generates a DataFrame containing the number of elements for each partition. This method may be handy when trying to determine if data is skewed.

Parameters:

df (DataFrame) – A PySpark DataFrame

Returns:

A DataFrame containing partition_id and count columns

Return type:

DataFrame
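
One possible way to read these counts when checking for skew is to compare the largest partition with the average one. The sketch below works on plain (partition_id, count) pairs, as they might look after collecting the output DataFrame. The skew_ratio helper and the sample numbers are hypothetical and are not part of sparkcraft.

```python
def skew_ratio(partition_counts):
    """Hypothetical helper: ratio of the largest partition count to the mean count.

    A value close to 1.0 suggests evenly sized partitions; larger values
    suggest that a few partitions hold most of the rows.
    """
    counts = [count for _, count in partition_counts]
    if not counts or sum(counts) == 0:
        raise ValueError("at least one non-empty partition is required")
    mean_count = sum(counts) / len(counts)
    return max(counts) / mean_count


# Illustrative (partition_id, count) pairs, not real output.
rows = [(0, 100), (1, 100), (2, 100), (3, 700)]
print(skew_ratio(rows))  # 2.8
```

Here the largest partition holds 2.8 times the average number of rows. Which ratio counts as problematic depends on the workload.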

sparkcraft.partitions.get_partition_count_distribution(df, probabilities, relative_error=0.0)

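Computes the distribution of the number of elements per partition, returning the quantile of the partition counts for each requested probability.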

Parameters:
  • df (DataFrame) – A PySpark DataFrame

  • probabilities (List[float]) – The list of probabilities for which the quantiles of the partition counts are returned

  • relative_error (float) – The relative target precision. For more information, see PySpark’s documentation on approxQuantile (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html). Defaults to 0.0, which yields exact quantiles; if the operation is too expensive, increase this value at the cost of quantile precision.

Returns:

A list containing the value for each probability.

Return type:

List[float]
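
To make the meaning of "the value for each probability" concrete, the sketch below computes exact quantiles over a plain list of per-partition counts. It uses a nearest-rank definition, which is an assumption made for illustration; Spark's approxQuantile may select a neighbouring element, and the nearest_rank_quantiles helper is hypothetical rather than part of sparkcraft.

```python
import math


def nearest_rank_quantiles(counts, probabilities):
    """Hypothetical helper: nearest-rank quantile of counts for each probability."""
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    result = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
        # Nearest rank: the smallest value with at least p * N values at or below it.
        rank = max(1, math.ceil(p * len(ordered)))
        result.append(float(ordered[rank - 1]))
    return result


# Illustrative per-partition counts, not real output.
counts = [100, 100, 100, 700]
print(nearest_rank_quantiles(counts, [0.0, 0.5, 1.0]))  # [100.0, 100.0, 700.0]
```

A large gap between the median and the maximum, as in this example, is the kind of pattern that points to skewed partitions.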

sparkcraft.partitions.remove_empty_partitions(df)

Removes empty partitions from a DataFrame. This is useful, for example, after a filter that leaves a large number of partitions with zero records.

Note: This function may be unnecessary if you are using Adaptive Query Execution (Spark 3.0 and later)

Parameters:

df (DataFrame) – A PySpark DataFrame

Returns:

A DataFrame with all empty partitions removed