Arrow 2.0.0 - structs in pandas
Nested types for pandas!
Struct-type columns, i.e. nested columns that contain an array of nested attributes or a map, are very useful for data modeling and analytics use cases.
However, until recently, Arrow did not support them, and thus our beloved pandas failed to properly read such columns when working with PySpark. In the past, a workaround (i.e. conversion to lists) was required (see https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python). Finally, this is solved!
With PyArrow 2.0.0, structs are fully supported, as https://issues.apache.org/jira/browse/ARROW-1644?src=confmacro is resolved!
So far, Apache Spark does not officially support the latest pyarrow release. You can still force it to use 2.x using conda! In my tests it worked well, albeit with some warnings displayed.