Dynamically select columns by type
Feb 22, 2019 · 1 min read
Dr. Georg Heiler
In pandas, it is really easy to select only the columns matching a certain data type:
df.select_dtypes(include=['float64'])
In Spark, such a function is not included by default. However, it is easy to code by hand:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}

// toDF needs the implicit conversions of an active SparkSession
// (available as `spark` in spark-shell)
import spark.implicits._

val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

// keep only the columns whose data type matches colType
def selectByType(colType: DataType, df: DataFrame): DataFrame = {
  val cols = df.schema.toList
    .filter(_.dataType == colType)
    .map(f => col(f.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
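
As a quick check, res now holds only the integer columns id and count. A natural extension is to accept several data types at once; the sketch below assumes the imports and the DataFrame from the snippet above, and selectByTypes is a hypothetical helper, not part of Spark's API:

res.show()
// +---+-----+
// | id|count|
// +---+-----+
// |  1|    2|
// +---+-----+

// hypothetical variant: select columns matching any of several types
import org.apache.spark.sql.types.StringType
def selectByTypes(colTypes: Set[DataType], df: DataFrame): DataFrame = {
  val cols = df.schema
    .filter(f => colTypes.contains(f.dataType))
    .map(f => col(f.name))
  df.select(cols: _*)
}

val intAndString = selectByTypes(Set(IntegerType, StringType), df)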

Authors
Georg Heiler, senior data expert
Georg is a senior data expert at Magenta and an ML-ops engineer at ASCII. He solves challenges with data; his interests include geospatial graphs and time series. Georg is transitioning Magenta's data platform to the cloud and handles large-scale multi-modal ML-ops challenges at ASCII.