Spark descriptive name for cached dataframes

Concise names for cached tables.

Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to? Usually no explicit name is set: when you call df.cache, Spark auto-generates the name from a snippet of the query plan. This is not very descriptive, especially when there are many cached tables or the Spark cluster is shared by several users.

However, there is a better way:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// Cache a DataFrame under an explicit, human-readable name so that it is
// easy to find in the web UI's Storage tab.
// Note: CacheManager.cacheQuery is an internal Spark API and may change
// between versions.
def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}

One can simply pass an explicit name, and that name then shows up in the web UI. This greatly simplifies debugging for me.
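As a quick usage sketch (the object name, example data, and the "users_by_id" cache name are illustrative assumptions, not from the original post):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

object NamedCacheExample {
  // Same helper as above: caches the DataFrame under an explicit name.
  // CacheManager.cacheQuery is an internal Spark API.
  def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
      df: DataFrame): DataFrame = {
    df.sparkSession.sharedState.cacheManager
      .cacheQuery(df, Some(name), storageLevel)
    df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-cache-example")
      .getOrCreate()
    import spark.implicits._

    val users = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // The curried signature plays nicely with Dataset.transform,
    // so the helper can be chained into a query pipeline.
    val cached = users.transform(namedCache("users_by_id"))

    // Trigger an action so the cache is actually materialized and
    // appears in the web UI's Storage tab under "users_by_id".
    cached.count()

    spark.stop()
  }
}
```

The curried parameter list is a deliberate design choice: `namedCache("users_by_id")` is a `DataFrame => DataFrame`, which slots directly into `df.transform(...)`.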

Georg Heiler
Researcher & data scientist

My research interests include large-scale geo-spatial time series and network data analytics.