utils

sparkcraft.utils.session.get_log4j_logger(spark)[source]

Gets a log4j logger for writing messages to the Spark application log.

Parameters:

spark (SparkSession) – A Spark session

Returns:

A log4j logger for Spark logging
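
A common way to obtain a log4j logger from PySpark is through the JVM gateway. The sketch below assumes that approach; it is not taken from the sparkcraft source, and the name `log4j_logger_sketch` is hypothetical. It relies on the internal `_jvm` attribute of the Spark context, so it only applies to classic (non Spark Connect) sessions.

```python
def log4j_logger_sketch(spark):
    """Return a log4j logger named after the Spark application.

    Assumption: the logger is looked up through the Py4J gateway that
    PySpark exposes as `sparkContext._jvm` (an internal attribute).
    """
    sc = spark.sparkContext
    log4j = sc._jvm.org.apache.log4j
    # LogManager.getLogger(String) returns (or creates) the named logger.
    return log4j.LogManager.getLogger(sc.appName)
```

With a live session, the returned object exposes the usual log4j methods, for example `logger.info("Job started")`.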

sparkcraft.utils.session.get_spark_session(app_name=None)[source]

Gets the existing Spark session or creates a new one.

Parameters:

app_name (str | None) – The Spark application name. This parameter is optional.

Returns:

A Spark session

Return type:

SparkSession
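
The get-or-create behaviour described here matches PySpark's `SparkSession.builder.getOrCreate()`. The following is a minimal sketch of that pattern, not the sparkcraft implementation. The function name and the `builder` parameter are hypothetical; `builder` exists only so the logic can be exercised without a running Spark installation.

```python
def get_spark_session_sketch(app_name=None, builder=None):
    """Get the active Spark session or create one.

    `builder` is a hypothetical parameter added for this sketch; when it
    is omitted, the real `SparkSession.builder` is used.
    """
    if builder is None:
        # Imported lazily so the module loads without PySpark installed.
        from pyspark.sql import SparkSession
        builder = SparkSession.builder
    if app_name is not None:
        # Only set a name when the caller supplied one.
        builder = builder.appName(app_name)
    # getOrCreate() reuses an existing session when one is active.
    return builder.getOrCreate()
```

Calling `get_spark_session_sketch("nightly-etl")` in an environment with PySpark would return a session whose application name is `nightly-etl`, unless a session already exists, in which case that session is reused.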

sparkcraft.utils.size_estimation.df_size_in_bytes_approximate(df, sample_perc=0.05)[source]

Takes a sample of the input DataFrame (the fraction given by sample_perc), applies df_size_in_bytes_exact to that sample, and extrapolates the result to estimate the total size of the DataFrame.

Parameters:
  • df (DataFrame) – A PySpark DataFrame

  • sample_perc (float) – The fraction of the DataFrame to sample, greater than 0 and at most 1. Defaults to 0.05 (5%).

Raises:

ValueError – If sample_perc is less than or equal to 0 or if it’s greater than 1.

Returns:

The approximate size in bytes
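
The extrapolation step can be illustrated with plain arithmetic. The sketch below assumes the total is obtained by dividing the measured sample size by the requested fraction; whether sparkcraft scales by the requested fraction or by the actual sampled row ratio is not stated here. The helper name `extrapolate_size` is hypothetical.

```python
def extrapolate_size(sample_bytes, sample_perc=0.05):
    """Scale the measured size of a sample up to the whole DataFrame."""
    if sample_perc <= 0 or sample_perc > 1:
        raise ValueError("sample_perc must be greater than 0 and at most 1")
    # If the sample holds a fraction p of the rows, total ~ sample / p.
    return int(round(sample_bytes / sample_perc))
```

Because Spark's `DataFrame.sample` does not guarantee an exact row count for a given fraction, an estimate of this kind carries sampling error, which tends to be larger for small fractions and skewed row sizes.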

sparkcraft.utils.size_estimation.df_size_in_bytes_exact(df)[source]

Calculates the exact in-memory size of a DataFrame by caching it and reading the size from the optimized plan.

Note: This function caches the entire DataFrame. If your DataFrame is too large to cache, use df_size_in_bytes_approximate instead.

Parameters:

df (DataFrame) – A PySpark DataFrame

Returns:

The exact size in bytes
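
The "cache, then read the optimized plan" approach can be sketched as follows. This is an illustration under stated assumptions, not the sparkcraft source: it reaches the JVM plan statistics through the internal `_jdf` handle of a classic PySpark DataFrame, which is not a public or stable API. The function name is hypothetical, and releasing the cache afterwards is a choice made for this sketch.

```python
def df_size_in_bytes_exact_sketch(df):
    """Cache `df`, then read its size from the optimized plan statistics.

    Assumption: uses the internal `_jdf` handle of a classic PySpark
    DataFrame; this is not a public, stable API.
    """
    df.cache().count()  # count() forces the cache to be materialized
    try:
        stats = df._jdf.queryExecution().optimizedPlan().stats()
        # sizeInBytes() is a Scala BigInt on the JVM side; convert via str.
        return int(str(stats.sizeInBytes()))
    finally:
        df.unpersist()  # release the cached data again
```

Materializing the cache means the whole DataFrame is computed and stored, which is why this approach is only practical for data that fits comfortably in the cluster's storage memory.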