Arrow 2.0.0 - structs in pandas

Nested types for pandas!

Struct-type columns, i.e. nested columns that contain an array of nested attributes or a map, are very useful for data modeling and analytics use cases.

However, until recently, Arrow did not support them, and thus beloved pandas failed to properly read such columns when working with pyspark. In the past, a workaround (i.e. conversion to lists) was required (see https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python). Finally, this is solved!

With pyarrow 2.0.0, structs are fully supported now that https://issues.apache.org/jira/browse/ARROW-1644?src=confmacro is resolved!

So far, Apache Spark does not officially support the latest pyarrow release. You can still force it to use 2.x using conda! In my tests it worked well, albeit with warnings displayed.

Keep in mind that I did observe a difference in timestamp handling: when the file is read from disk with pandas, the timestamp arrives as an integer inside a dictionary (the struct), whereas after calling collect a proper datetime object is present. In that case, however, the struct is stored as a tuple rather than a dictionary.
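If both code paths need to be handled the same way, a small helper can bring the two shapes into one form. The following is only a sketch under assumptions: the field names, the helper name `struct_to_dict`, and the idea that the integer is microseconds since the Unix epoch in UTC are all mine, so check the actual unit and timezone of your data before relying on it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def struct_to_dict(value, field_names, ts_field):
    """Return the struct as a dict whose timestamp field is a datetime.

    Hypothetical helper: accepts either a dict (as seen when reading the
    file with pandas) or a tuple-like row (as seen after collect).
    """
    if isinstance(value, dict):
        record = dict(value)
    else:
        # Tuple-like rows are paired with the field names positionally.
        record = dict(zip(field_names, value))
    ts = record[ts_field]
    if isinstance(ts, int):
        # Assumption: integer timestamps are microseconds since the epoch (UTC).
        record[ts_field] = EPOCH + timedelta(microseconds=ts)
    return record


fields = ["id", "event_time"]
from_disk = {"id": 1, "event_time": 1_600_000_000_000_000}
from_collect = (1, datetime(2020, 9, 13, 12, 26, 40, tzinfo=timezone.utc))

normalized_disk = struct_to_dict(from_disk, fields, "event_time")
normalized_collect = struct_to_dict(from_collect, fields, "event_time")
print(normalized_disk == normalized_collect)
```

Datetimes returned by collect may be timezone-naive, in which case they would need a timezone attached before such a comparison makes sense.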
Georg Heiler
PhD candidate & data scientist

My research interests include large geo-spatial time and network data analytics.
