Have you ever wondered where the cryptic names of cached DataFrames and RDDs in Spark’s web UI come from? Usually no specific name is set. When you call df.cache, Spark auto-generates the name from a snippet of the query plan. That is not very descriptive, especially when there are many cached tables or when the Spark cluster is shared by several users.

However, there is a better way:

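import org.apache.spark.sql.DataFrame, org.apache.spark.storage.StorageLevel

def namedCache(name: String, storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK)(
      df: DataFrame): DataFrame = {
  // Assumed receiver: Spark's internal CacheManager (unstable API, may change between versions)
  df.sparkSession.sharedState.cacheManager.cacheQuery(df, Some(name), storageLevel)
  df // return the DataFrame so the call can be chained
}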

With this helper, you can pass a name explicitly, and the cached DataFrame then shows up under that name in the web UI. This greatly simplifies debugging for me.
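Because the helper takes the DataFrame in a second parameter list, it fits nicely into df.transform. Here is a minimal sketch of how I would call it. The object name, the sample data, and the cache name "events_raw" are made up for illustration, and the helper relies on Spark's internal CacheManager, so the exact call may differ depending on your Spark version:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object NamedCacheExample {
  // Same helper as above. It uses Spark's internal, unstable CacheManager API,
  // so treat the exact call as an assumption for your Spark version.
  def namedCache(name: String, storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK)(
      df: DataFrame): DataFrame = {
    df.sparkSession.sharedState.cacheManager.cacheQuery(df, Some(name), storageLevel)
    df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-cache-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, only here to have something to cache
    val events = Seq((1, "click"), (2, "view")).toDF("id", "action")

    // The curried signature lets the helper be used directly with transform
    val cached = events.transform(namedCache("events_raw"))

    // Caching is lazy: an action is needed before the entry is materialized
    cached.count()

    spark.stop()
  }
}
```

If you want to inspect the Storage tab of the web UI, keep the session alive instead of calling spark.stop() right away.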

