
Memberships

Learn Microsoft Fabric

Public • 5.5k • Free

12 contributions to Learn Microsoft Fabric
Ingest nested JSON to Lakehouse tables
Hello! I am trying to ingest the JSON file below into two tables in a Lakehouse: https://www.kaggle.com/datasets/aditeloo/the-world-dataset-of-covid19?resource=download&select=owid-covid-data.json

I have been testing Pipelines and PySpark notebooks on this task. The file is less than 50 MB, so it is fairly small.

1) A Pipeline cannot handle (preview) this file as a source. I have attached two screenshots showing the error.

2) The file is fairly simple, but its data is nested: it has countries (dimension table) and, for each country, daily covid cases (fact table). That means I can try to load "Dim Country" and "Fact Covid" tables using PySpark. However, because of the structure of the JSON file, it does not fit nicely into a Spark DataFrame: each country code appears as a column instead of a row.

I am looking for a way to get two DataFrames, one for "Dim Country" and one for "Fact Covid", saved as Delta tables in the Lakehouse. I have added two screenshots. I am keen to hear feedback from other users, and if someone can try to load this file and point me in the right direction, I would be very grateful.
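A minimal sketch of one way to reshape this file in a PySpark notebook, assuming the JSON has been uploaded to the Lakehouse Files area and the default Lakehouse is mounted in the notebook; the path, the selected columns, and the table names are illustrative, not from the original post. Because the file is small, plain Python json can flatten the country-keyed structure before handing it to Spark:

import json

# Assumed location: owid-covid-data.json uploaded to the Lakehouse Files area,
# with the default Lakehouse mounted in the Fabric notebook.
with open("/lakehouse/default/Files/owid-covid-data.json") as f:
    raw = json.load(f)  # dict keyed by ISO country code

to_float = lambda x: float(x) if x is not None else None

dim_rows, fact_rows = [], []
for iso_code, country in raw.items():
    # One row per country for the dimension table.
    dim_rows.append((iso_code, country.get("location"), country.get("continent"),
                     to_float(country.get("population"))))
    # One row per country per day for the fact table.
    for day in country.get("data", []):
        fact_rows.append((iso_code, day.get("date"),
                          to_float(day.get("new_cases")), to_float(day.get("total_cases"))))

# spark is the session Fabric pre-creates in a PySpark notebook.
dim_country = spark.createDataFrame(
    dim_rows, "iso_code string, location string, continent string, population double")
fact_covid = spark.createDataFrame(
    fact_rows, "iso_code string, date string, new_cases double, total_cases double")

dim_country.write.format("delta").mode("overwrite").saveAsTable("dim_country")
fact_covid.write.format("delta").mode("overwrite").saveAsTable("fact_covid")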
1 like • Aug 17
@Surm Man We have an approx. 500 MB deeply nested XML which we convert into JSON using native Python standard packages; the size came down to 280 MB. We then use the duckdb package to read the JSON into memory and unpack it. We found this to be the most efficient way of handling small to large JSON files, whether in bulk or as a single large file.
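A minimal sketch of that standard-library conversion step, assuming a generic XML layout; the file names and the element-to-dict mapping are illustrative, not the poster's actual code:

import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    # Simplified recursive mapping: attributes become keys, repeated child
    # tags become lists, and leaf text becomes a plain value.
    node = dict(elem.attrib)
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        return text if text and not node else ({**node, "text": text} if text else node)
    for child in children:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node

root = ET.parse("source.xml").getroot()  # hypothetical input file
with open("converted.json", "w") as f:
    json.dump({root.tag: element_to_dict(root)}, f)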
1 like • Aug 17
There are possibly many other packages, including pandas and Spark, that handle JSON files. But if you need to ingest multiple deeply nested JSON files, or very large ones, duckdb proved more optimal and handled them consistently. It has native support for JSON, and it can unpack everything in memory.
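A minimal sketch of that duckdb pattern, assuming a hypothetical customers.json where each record carries a nested orders array; the file and column names are placeholders:

import duckdb

con = duckdb.connect()  # in-memory database

# read_json_auto infers the schema; unnest(..., recursive := true) flattens
# the nested structs/lists into ordinary columns.
flattened = con.execute("""
    SELECT customer_id,
           unnest(orders, recursive := true)
    FROM read_json_auto('customers.json')
""").df()

print(flattened.head())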
Recent Notebook enhancements (including Python Notebooks)
An update on the recent improvements to the Notebook in Microsoft Fabric, including the ribbon update, the environment folder in the Resource tab, and upcoming features like the Python Notebook. You can watch it here: https://www.youtube.com/watch?v=JmBoV4ujfkc
1 like • Aug 7
The single-node cluster is probably the best feature. A Spark cluster takes a while to spin up, and with libraries like duckdb and polars pre-installed, Python notebooks would become heavily used.
Inconsistent Connections to Azure SQL Managed Instance with Fabric Notebook
I'm trying to connect to an Azure SQL Managed Instance from a Microsoft Fabric Notebook using PySpark code. When I attempt to connect, I encounter the following error:

Py4JJavaError: An error occurred while calling o4722.load. : com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host db-sql-mi.public.e789gjshb.database.windows.net, port 3342 has failed. Error: "connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall."

However, if I try connecting around 15 times, it occasionally succeeds and returns the output DataFrame for the specified table. On the other hand, my teammate, who is located in the US (where the SQL instance is also located), can connect and get the output within 2 seconds every time. He is the architect of this project.

I am not very familiar with networking, so I am wondering if location-specific access or network speed is affecting my connection (I am based in India). My architect has whitelisted my public IP address (which I found via a Google search), and I worked the entire day on that specific Wi-Fi connection, but I still experienced issues. It did connect once, but then failed to connect again for the next two hours, though it worked one out of 15 times after that. If there were an access issue, would it even be possible to connect those 2 times out of 15?

CODE:

from pyspark.sql import SparkSession

# spark = SparkSession.builder.appName("Python Spark SQL Server Example").getOrCreate()
username = 'username'
password = 'password'
server_type = 'sqlserver'
jdbc = 'db-sql-mi.public.e789gjshb.database.windows.net:3342'
dbname = 'db-01'
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
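For reference, a minimal sketch of the JDBC read this setup appears to lead to; the table name is a placeholder, and the loginTimeout connection property is only a suggested tweak for the timeout symptom, not part of the original code:

df = (
    spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://{jdbc};databaseName={dbname};loginTimeout=60")
    .option("dbtable", "dbo.some_table")  # placeholder table
    .option("user", username)
    .option("password", password)
    .option("driver", driver)
    .load()
)
df.show()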
1 like • Jul 18
@Avinash Sekar check your network settings for the Azure SQL server: does it allow public access, or does it have a private endpoint? If access is via a private endpoint, you would need to create a managed private endpoint in the workspace settings. If public access is granted, then you should be able to connect using credentials. With a private endpoint and no public access, you can connect from a notebook only via a managed private endpoint; a data pipeline won't be possible for now.
Where to write DAX - Semantic Model or Report
Where do you typically write your DAX code? With the separation of the Semantic Model and the Power BI Report in Microsoft Fabric, you now have two choices:

1. 🧊 Semantic Model: You have the option to input measures under a table within the Semantic Model. These measures will then be accessible in all reports linked to that Semantic Model.

2. 📊 Report: Alternatively, you can input measures directly into your report. However, these measures will be exclusive to that specific report.

While it appears that the first option (Semantic Model) might be more practical, I'm curious if anyone has found any disadvantages to using the first option. Are there any benefits to the second option? Can anyone share their experiences?
Poll
10 members have voted
1 like • Jun 19
It depends, I would say. The semantic model makes sense if the measure needs to be accessible to all users, is subject to governance, should be centrally managed, or offers an opportunity to address performance. The report side makes sense if the measure is bespoke and requires complex DAX. Ideally, keep an 80-20 rule so the semantic model stays simple and can serve most of the audience; putting complex DAX in it can be counterproductive. Of course, there can be many more factors.
Creating a surrogate key with an identity column in Delta Lake
Just wanted to know how others are dealing with creating surrogate keys with auto-increment in a Delta table. This seems to be supported in Databricks, but not in Fabric.

In Databricks:

CREATE OR REPLACE TABLE demo (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  product_type STRING,
  sales BIGINT
);

Going forward, the identity column titled "id" will auto-increment whenever you insert new records into the table. You can then insert new data like so:

INSERT INTO demo (product_type, sales) VALUES ("Batteries", 150000);
1 like • Jun 19
You could use a window function like row_number and offset it by the max value from the target table.
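A minimal sketch of that pattern in PySpark, reusing the demo table from the post; the new_rows data and the ordering column are placeholders, and the demo Delta table is assumed to already exist in the Lakehouse:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# New records to load, same shape as the demo table minus the id column.
new_rows = spark.createDataFrame(
    [("Batteries", 150000), ("Chargers", 98000)],
    ["product_type", "sales"],
)

# Current high-water mark of the surrogate key in the target table.
max_id = spark.read.table("demo").agg(
    F.coalesce(F.max("id"), F.lit(0)).alias("m")
).first()["m"]

# row_number over any stable ordering of the new rows, offset by the existing max.
w = Window.orderBy("product_type")
keyed = new_rows.withColumn("id", F.row_number().over(w) + F.lit(max_id))

keyed.select("id", "product_type", "sales").write.format("delta").mode("append").saveAsTable("demo")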
Naresh Mepani
@naresh-mepani-9580
Head of Data Architecture specialising in data architecture, data modelling, and Microsoft Fabric end-to-end solution architecture

Active 15d ago
Joined Mar 23, 2024