Apache-Spark

Cloud arbitrage for spark pipelines

Spark-based data PaaS solutions are convenient, but they come with their own challenges, such as high vendor lock-in and obscured costs. We show how a dedicated orchestrator (dagster-pipes) can turn Databricks into an implementation detail while also saving cost and improving developer productivity. It allows you to take back control.

Jun 21, 2024

Georg Heiler, Hernan Picatto

AI-based root cause analysis of CPD interference sources in DOCSIS networks

Good quality network connectivity is ever more important. For hybrid fiber coaxial (HFC) networks, searching for upstream high noise has in the past been cumbersome and time-consuming. Even with machine learning, the task remains challenging due to the heterogeneity of the network and its topological structure. We present the automation of a simple business rule (largest change of a specific value), compare its performance with state-of-the-art machine-learning methods, and conclude that precision@1 can be improved by a factor of 2.3. As it is best when a fault does not occur in the first place, we also evaluate multiple approaches to forecasting network faults, which would allow predictive maintenance on the network.

May 10, 2022 to May 12, 2022

Georg Heiler

Comparing SQL-based streaming approaches

Comparing established and up-and-coming streaming approaches for an integrated real-time data model

Georg Heiler

Apr 1, 2022 28 min read

Identifying the root cause of cable network problems with machine learning

Good quality network connectivity is ever more important. For hybrid fiber coaxial (HFC) networks, searching for upstream high noise in …

Georg Heiler, Thassilo Gadermaier, Thomas Haider, Allan Hanbury, Peter Filzmoser

Scalable data pipelines from dagster with pyspark

Getting started with simple dagster pipelines.

Georg Heiler, Sandy Ryza

Mar 4, 2022 5 min read

Scalable sparse matrix multiplication

Using Apache Spark for sparse matrix multiplication

Georg Heiler

Aug 6, 2021 6 min read

Exact percentiles in Spark

Combining the power of Scala and Python to make the calculation of percentiles in Spark easy and fast

Georg Heiler

Nov 21, 2020 7 min read

Arrow 2.0.0 - structs in pandas

Finally, nested types in Arrow.

Georg Heiler

Nov 20, 2020 1 min read

Data preparation using spark without ACID tables

Georg Heiler

Nov 19, 2020 4 min read

Run the latest version of spark

Execute the latest version of spark on HDP.

Georg Heiler

Aug 31, 2020 4 min read
