Hello!
I am trying to ingest the JSON file below into two tables in a Lakehouse.
I was trying to test Pipelines and PySpark notebooks on this task.
This file is less than 50 MB, so it is fairly small.
1) The Pipeline cannot handle (preview) this file as a source.
I have attached two screenshots showing the error.
2) This file is fairly simple; however, its data is nested. It has countries (Dimension table), and for each country, it has daily covid cases (Fact table). This means I can attempt to load a "Dim Country" and a "Fact Covid" table using PySpark. However, due to the structure of the JSON file, it does not fit nicely into a Spark DataFrame: each country code appears as a column instead of a row. I am looking for a way to get two DataFrames, one for "Dim Country" and another for "Fact Covid", and save them as Delta tables in the Lakehouse (a rough sketch of what I am attempting is below).
I have added two screenshots.
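To illustrate what I am after, here is a rough sketch of the kind of notebook code I have been attempting. It assumes the JSON is keyed by country code at the top level, with each entry holding country attributes plus a nested array of daily records; the file path and the field names (location, continent, population, date, new_cases, total_cases) are placeholders, not necessarily the actual schema of my file:

```python
import json
from pyspark.sql import Row

# The file is small (< 50 MB), so read it whole from the Lakehouse Files area
# and parse it with the standard json library (path is a placeholder).
raw_text = spark.read.text("Files/covid.json", wholetext=True).first()[0]
doc = json.loads(raw_text)

dim_rows, fact_rows = [], []
for country_code, payload in doc.items():
    # Country-level attributes -> "Dim Country" (one row per country)
    dim_rows.append(Row(country_code=country_code,
                        location=payload.get("location"),
                        continent=payload.get("continent"),
                        population=payload.get("population")))
    # Nested daily records -> "Fact Covid" (one row per country per day)
    for daily in payload.get("data", []):
        fact_rows.append(Row(country_code=country_code,
                             date=daily.get("date"),
                             new_cases=daily.get("new_cases"),
                             total_cases=daily.get("total_cases")))

dim_country = spark.createDataFrame(dim_rows)
fact_covid = spark.createDataFrame(fact_rows)

# Save both DataFrames as Delta tables in the attached Lakehouse
dim_country.write.mode("overwrite").format("delta").saveAsTable("dim_country")
fact_covid.write.mode("overwrite").format("delta").saveAsTable("fact_covid")
```

This is only the general shape of what I want to end up with, so if there is a cleaner way (for example, doing the unpivot/explode directly in Spark rather than via the json library), I am open to it.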
I am keen to hear feedback from other users, and if someone could try to load this file and point me in the right direction, I would be very grateful.