utils
- sparkcraft.utils.session.get_log4j_logger(spark)
Gets a log4j logger for writing log messages through Spark.
- Parameters:
spark (SparkSession) – A Spark session
- Returns:
A log4j logger for Spark logging
- sparkcraft.utils.session.get_spark_session(app_name=None)
Gets an existing Spark session or creates a new one.
- Parameters:
app_name (str | None) – The Spark application name. This parameter is optional.
- Returns:
A Spark session
- Return type:
SparkSession
- sparkcraft.utils.size_estimation.df_size_in_bytes_approximate(df, sample_perc=0.05)
Takes a sample of the input DataFrame (the fraction given by sample_perc), applies df_size_in_bytes_exact to that sample, and extrapolates the total size from the sample's exact size.
- Parameters:
df (DataFrame) – A PySpark DataFrame
sample_perc (float) – The fraction of the DataFrame to sample, greater than 0 and at most 1. Defaults to 0.05 (5%).
- Raises:
ValueError – If sample_perc is less than or equal to 0 or if it’s greater than 1.
- Returns:
The approximate size in bytes
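The extrapolation step can be illustrated with a minimal sketch. This is not sparkcraft's actual implementation: extrapolate_size is a hypothetical helper, and it assumes the sample is representative, so that size scales linearly with the sampled fraction. It mirrors the documented bounds on sample_perc.

```python
def extrapolate_size(sample_size_bytes: int, sample_perc: float = 0.05) -> int:
    """Scale the measured size of a sample up to the full DataFrame.

    Hypothetical helper for illustration only; not part of sparkcraft.
    """
    # Same bounds as documented above: the fraction must be > 0 and <= 1.
    if sample_perc <= 0 or sample_perc > 1:
        raise ValueError(f"sample_perc must be in (0, 1], got {sample_perc}")
    # Assumption: the sample is representative, so the total size is the
    # sample size divided by the sampled fraction.
    return round(sample_size_bytes / sample_perc)


if __name__ == "__main__":
    # A 5% sample measured at 1,000,000 bytes suggests about 20 MB in total.
    print(extrapolate_size(1_000_000, 0.05))
```

Because the result depends on which rows land in the sample, repeated calls on the same DataFrame can return different estimates; a larger sample_perc trades speed for a tighter estimate.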
- sparkcraft.utils.size_estimation.df_size_in_bytes_exact(df)
Calculates the exact in-memory size of a DataFrame by caching it and reading the size from the optimized plan.
Note: Be careful with this function, because it caches the entire DataFrame. If your DataFrame is too big, use df_size_in_bytes_approximate instead.
- Parameters:
df (DataFrame) – A PySpark DataFrame
- Returns:
The exact size in bytes