Say I have a data source (ADLS Gen2) with the following contents.
CompanyData/Employee/emp1.json
CompanyData/Employee/emp2.json
CompanyData/Company/comp1.json
The emp* and comp* have different JSON fields/structures (can also have nested fields).
Ultimately, I want these JSONs' data converted to lakehouse Tables with columns inferred from the json data structure, at-least for the top level fields?
Tables
-- CompanyData
----Employee (Structure inferred from JSON structure?)
-------Columns: FullName, Age, etc. Rows from emp1.json and emp2.json
----Company
-------Columns: CompanyName, Address, etc. Rows from comp1.json
So, the questions are :-
- I believe the shortcut has to be created under "Files", as the json in the source, is not in delta parquet format for onelake to automatically recognize it as table. Is this right?
- Can I achieve what I want automatically out of the box? or do I have to write a spark job or something to transform the data from files into Tables?
- The source can keep getting modified (addition of files, deletions, etc). How can I keep lakehouse "tables" in sync with the shortcut in "Files" - the source can potentially have 10s and 100s of millions of json files. Periodically running spark job to transform all those millions of files may not be a good idea.
Thanks!!