
Memberships

Learn Microsoft Fabric

Public • 5.5k • Free

12 contributions to Learn Microsoft Fabric
Ingest nested JSON to Lakehouse tables
Hello! I am trying to ingest the JSON file below into two tables in a Lakehouse: https://www.kaggle.com/datasets/aditeloo/the-world-dataset-of-covid19?resource=download&select=owid-covid-data.json

I have been testing Pipelines and PySpark notebooks on this task. The file is less than 50 MB, so it is fairly small.

1) A Pipeline cannot handle (preview) this file as a source. I have attached two screenshots showing the error.

2) The file is fairly simple, but its data is nested: it has countries (dimension table) and, for each country, daily covid cases (fact table). That means I can try to load "Dim Country" and "Fact Covid" tables using PySpark. However, because of the structure of the JSON file, it does not fit nicely into a Spark DataFrame: each country code appears as a column instead of a row.

I am looking for a way to get two DataFrames, one for "Dim Country" and one for "Fact Covid", saved as Delta tables in the Lakehouse. I have added two screenshots. I am keen to hear feedback from other users, and if someone can try to load this file and point me in the right direction, I would be very grateful.
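A minimal sketch of one way to reshape this file in a PySpark notebook, assuming the JSON has been uploaded to the Lakehouse Files area and the default Lakehouse is mounted in the notebook; the path, the selected columns, and the table names are illustrative, not from the original post. Because the file is small, plain Python json can flatten the country-keyed structure before handing it to Spark:

import json

# Assumed location: owid-covid-data.json uploaded to the Lakehouse Files area,
# with the default Lakehouse mounted in the Fabric notebook.
with open("/lakehouse/default/Files/owid-covid-data.json") as f:
    raw = json.load(f)  # dict keyed by ISO country code

to_float = lambda x: float(x) if x is not None else None

dim_rows, fact_rows = [], []
for iso_code, country in raw.items():
    # One row per country for the dimension table.
    dim_rows.append((iso_code, country.get("location"), country.get("continent"),
                     to_float(country.get("population"))))
    # One row per country per day for the fact table.
    for day in country.get("data", []):
        fact_rows.append((iso_code, day.get("date"),
                          to_float(day.get("new_cases")), to_float(day.get("total_cases"))))

# spark is the session Fabric pre-creates in a PySpark notebook.
dim_country = spark.createDataFrame(
    dim_rows, "iso_code string, location string, continent string, population double")
fact_covid = spark.createDataFrame(
    fact_rows, "iso_code string, date string, new_cases double, total_cases double")

dim_country.write.format("delta").mode("overwrite").saveAsTable("dim_country")
fact_covid.write.format("delta").mode("overwrite").saveAsTable("fact_covid")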
1 like • Aug 17
@Surm Man We have an approx. 500 MB deeply nested XML which we convert into JSON using native Python standard packages; the size came down to 280 MB. We then use the duckdb package to read the JSON into memory and unpack it. We found this to be the most efficient way of handling small to large JSON files, whether in bulk or as a single large file.
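A minimal sketch of that standard-library conversion step, assuming a generic XML layout; the file names and the element-to-dict mapping are illustrative, not the poster's actual code:

import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    # Simplified recursive mapping: attributes become keys, repeated child
    # tags become lists, and leaf text becomes a plain value.
    node = dict(elem.attrib)
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        return text if text and not node else ({**node, "text": text} if text else node)
    for child in children:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node

root = ET.parse("source.xml").getroot()  # hypothetical input file
with open("converted.json", "w") as f:
    json.dump({root.tag: element_to_dict(root)}, f)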
1 like • Aug 17
There are possibly many other packages, including pandas and Spark, that handle JSON files. But if you need to ingest multiple deeply nested JSON files, or very large ones, duckdb proved more optimal and handled them consistently. It has native support for JSON, and it can unpack everything in memory.
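A minimal sketch of that duckdb pattern, assuming a hypothetical customers.json where each record carries a nested orders array; the file and column names are placeholders:

import duckdb

con = duckdb.connect()  # in-memory database

# read_json_auto infers the schema; unnest(..., recursive := true) flattens
# the nested structs/lists into ordinary columns.
flattened = con.execute("""
    SELECT customer_id,
           unnest(orders, recursive := true)
    FROM read_json_auto('customers.json')
""").df()

print(flattened.head())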
Recent Notebook enhancements (including Python Notebooks)
An update on the recent improvements to the Notebook in Microsoft Fabric, including the ribbon update, the environment folder in the Resource tab, and upcoming features like the Python Notebook. You can watch it here: https://www.youtube.com/watch?v=JmBoV4ujfkc
1 like • Aug 7
The single-node cluster is probably the best feature. A Spark cluster takes a while to spin up, and with libraries like duckdb and polars pre-installed, Python notebooks would become heavily used.
Inconsistent Connections to Azure SQL Managed Instance with Fabric Notebook
I'm trying to connect to an Azure SQL Managed Instance from a Microsoft Fabric Notebook using PySpark code. When I attempt to connect, I encounter the following error:

Py4JJavaError: An error occurred while calling o4722.load. : com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host db-sql-mi.public.e789gjshb.database.windows.net, port 3342 has failed. Error: "connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall."

However, if I try connecting around 15 times, it occasionally succeeds and returns the output DataFrame for the specified table. On the other hand, my teammate, who is located in the US (where the SQL instance is also located), can connect and get the output within 2 seconds every time. He is the architect of this project.

I am not very familiar with networking, so I am wondering if location-specific access or network speed is affecting my connection (I am based in India). My architect has whitelisted my public IP address (which I found via a Google search), and I worked the entire day on that specific Wi-Fi connection, but I still experienced issues. It did connect once, but then failed to connect again for the next two hours, though it worked one out of 15 times after that. If there were an access issue, would it even be possible to connect those 2 times out of 15?

CODE:

from pyspark.sql import SparkSession

# spark = SparkSession.builder.appName("Python Spark SQL Server Example").getOrCreate()
username = 'username'
password = 'password'
server_type = 'sqlserver'
jdbc = 'db-sql-mi.public.e789gjshb.database.windows.net:3342'
dbname = 'db-01'
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
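For reference, a minimal sketch of the JDBC read this setup appears to lead to; the table name is a placeholder, and the loginTimeout connection property is only a suggested tweak for the timeout symptom, not part of the original code:

df = (
    spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://{jdbc};databaseName={dbname};loginTimeout=60")
    .option("dbtable", "dbo.some_table")  # placeholder table
    .option("user", username)
    .option("password", password)
    .option("driver", driver)
    .load()
)
df.show()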
1 like • Jul 18
@Avinash Sekar check your network settings for the Azure SQL server: does it allow public access, or does it have a private endpoint? If access is via a private endpoint, you would need to create a managed private endpoint in the workspace settings. If public access is granted, then you should be able to connect using credentials. With a private endpoint and no public access, you can connect from a notebook only via a managed private endpoint; a data pipeline won't be possible for now.
Where to write DAX - Semantic Model or Report
Where do you typically write your DAX code? With the separation of the Semantic Model and the Power BI Report in Microsoft Fabric, you now have two choices:

1. 🧊 Semantic Model: You have the option to input measures under a table within the Semantic Model. These measures will then be accessible in all reports linked to that Semantic Model.

2. 📊 Report: Alternatively, you can input measures directly into your report. However, these measures will be exclusive to that specific report.

While it appears that the first option (Semantic Model) might be more practical, I'm curious if anyone has found any disadvantages to using the first option. Are there any benefits to the second option? Can anyone share their experiences?
Poll
10 members have voted
1 like • Jun 19
It depends, I would say. The semantic model makes sense if the measure needs to be accessible to all users, is subject to governance, should be centrally managed, or offers an opportunity to address performance. The report side makes sense if the measure is bespoke and requires complex DAX. Ideally, keep an 80-20 rule so the semantic model stays simple and can serve most of the audience; putting complex DAX in it can be counterproductive. Of course, there can be many more factors.
Creating a surrogate key with an identity column in Delta Lake
Just wanted to know how others are dealing with creating surrogate keys with auto-increment in a Delta table. This seems to be supported in Databricks, but not in Fabric.

In Databricks:

CREATE OR REPLACE TABLE demo (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  product_type STRING,
  sales BIGINT
);

Going forward, the identity column titled "id" will auto-increment whenever you insert new records into the table. You can then insert new data like so:

INSERT INTO demo (product_type, sales) VALUES ("Batteries", 150000);
1 like • Jun 19
You could use a window function like row_number and offset it by the max value from the target table.
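A minimal sketch of that pattern in PySpark, reusing the demo table from the post; the new_rows data and the ordering column are placeholders, and the demo Delta table is assumed to already exist in the Lakehouse:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# New records to load, same shape as the demo table minus the id column.
new_rows = spark.createDataFrame(
    [("Batteries", 150000), ("Chargers", 98000)],
    ["product_type", "sales"],
)

# Current high-water mark of the surrogate key in the target table.
max_id = spark.read.table("demo").agg(
    F.coalesce(F.max("id"), F.lit(0)).alias("m")
).first()["m"]

# row_number over any stable ordering of the new rows, offset by the existing max.
w = Window.orderBy("product_type")
keyed = new_rows.withColumn("id", F.row_number().over(w) + F.lit(max_id))

keyed.select("id", "product_type", "sales").write.format("delta").mode("append").saveAsTable("demo")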
Naresh Mepani
@naresh-mepani-9580
Head of Data Architecture specialising in data architecture, data modelling, and Microsoft Fabric end-to-end solution architecture

Active 15d ago
Joined Mar 23, 2024