
Memberships

Data Innovators Exchange

Public • 179 • Free

3 contributions to Data Innovators Exchange
Modelling Address Data
The question arose as to the correct way to model address data. Michael Olschimke explained his point of view and shared his idea on how to proceed in the latest session of Data Vault Friday. You can find the video either here in the classroom or below. How would you model address data? Would you do it differently to Michael?
9
3
New comment 11d ago
Modelling Address Data
3 likes • 11d
Usually, I prefer to put address attributes in a Sat. However, I did have this same case in a healthcare warehouse because there were multiple specialty clinics housed at the same address. The BK was very similar to the example, and it really was the best way to identify a building owned by the agency; it was also how the business thought of it. We used the full set of attributes for the BK and hashed them according to DV 2.0 best practices. I would never recommend using a surrogate key for the address BK (or any Hub, actually), as that really will not work unless you have only one source system for addresses over time. Very unlikely, hence the use of true BK attributes to future-proof your design.

I do like the idea of using a JSON document with all the attributes as the BK and then hashing that for the Hub PK. I did not have that as an option at the time. With the VARIANT data type in Snowflake, that is now a valid option to consider. However, as Michael points out, if the keys within the JSON get reordered, you might end up loading technical duplicates because they will hash to different values. In that case you may need a Same-As Link to align them in your Business Vault.
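To make the hashing point concrete, here is a minimal sketch of hashing the full, standardized attribute set as the Hub key. The table and column names (stg_address, street, etc.) are illustrative assumptions, not the actual design from the warehouse described above:

-- Hash the full set of standardized address attributes as the Hub business key,
-- so the same address always produces the same hash key regardless of source formatting.
SELECT
    MD5(
        COALESCE(UPPER(TRIM(street)),       '') || '|' ||
        COALESCE(UPPER(TRIM(house_number)), '') || '|' ||
        COALESCE(UPPER(TRIM(postal_code)),  '') || '|' ||
        COALESCE(UPPER(TRIM(city)),         '') || '|' ||
        COALESCE(UPPER(TRIM(country)),      '')
    ) AS hub_address_hk,
    street, house_number, postal_code, city, country
FROM stg_address;

If you go the JSON/VARIANT route instead, build the document in a fixed attribute order (and consistent casing) before hashing, so reordered keys don't hash to different values and load as technical duplicates.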
My 5 Tips when working with Snowflake
Of course there are dozens of tips available for Snowflake, but let me share the ones that came to mind very quickly:

1) Understand how Snowflake stores the data! It uses micro-partitions, organized in a columnar way. Micro-partitions store statistics like distinct values and value ranges for each column. Your goal should always be to prune as much as possible from both when querying data. For example: only select the columns you really need, and apply filters on columns whose values mostly do not overlap across multiple micro-partitions. Also think about re-clustering your data if necessary, or creating your own pattern-based value to cluster your data on (usually only necessary for huge amounts of data in one table).

2) When data is spilled to local storage while querying, that is a good indicator that a bigger warehouse makes sense. I assume here that the query itself is already optimized and we are just dealing with a lot of data and maybe complex logic. But keep in mind: increasing the size of the Snowflake virtual warehouse by one step (e.g. M -> L) doubles the cost for the same runtime (calculated per cluster). So, when the query time drops below 50% of the original, we achieve a win-win: a faster and cheaper result! If the runtime could not be reduced by 50% or more, then you have to decide whether the quicker response is worth the money you now spend.

3) Snowflake's zero-copy clones allow you to test features and fixes against your production data in a very easy and fast way. They should be part of your deployment pipelines.

4) Insert-only loading reduces the number of versions Snowflake has to create for the micro-partitions. Updates and deletes cause this versioning of already existing micro-partitions, which costs time and additional storage. That also means that Data Vault, with its insert-only approach, fits the scalability characteristics of Snowflake!

5) The QUALIFY clause improved code writing a lot. It uses the result of a window function as a filter, which means you don't have to write nested sub-queries with self-joins (see the sketch below).
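To illustrate tips 3 and 5, here is a minimal sketch. The object names (sat_customer, hub_customer_hk, load_date, prod_db, dev_hotfix) are assumptions for the example, not from the post:

-- Tip 5: keep only the latest satellite row per hub key,
-- using QUALIFY instead of a nested sub-query with a self-join.
SELECT *
FROM sat_customer
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY hub_customer_hk
    ORDER BY load_date DESC
) = 1;

-- Tip 3 in one line: a zero-copy clone of production for testing a feature or fix.
CREATE DATABASE dev_hotfix CLONE prod_db;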
11
2
New comment 11d ago
5 likes • 11d
Always be sure that auto-suspend and auto-resume are enabled for your virtual warehouse clusters so they shut off automatically when your query or load process completes. The trick is figuring out the auto-suspend timing. If you have lots of processes using the same VWH, you don't want it to shut down just to restart immediately for another process, as there can be a little lag for the really large clusters. So know your processes and how often they run to optimize the timing and minimize your costs. Another trick is to use multiple VWHs if you have a lot of different things happening in parallel. For example, querying your info mart and loading your raw vault can easily be separated. Then you can size them as needed and also more easily track the cost of each type of process.
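A minimal sketch of that setup; the warehouse names, sizes, and suspend timings are illustrative assumptions, not a recommendation for any specific workload:

-- Separate warehouses for loading the Raw Vault and querying the info mart,
-- each with auto-resume and an auto-suspend window tuned to its workload,
-- so they only run (and cost money) while work is actually happening.
CREATE WAREHOUSE IF NOT EXISTS raw_vault_load_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND   = 60      -- seconds of idle time before suspending
    AUTO_RESUME    = TRUE;

CREATE WAREHOUSE IF NOT EXISTS info_mart_query_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND   = 300     -- longer idle window for interactive query bursts
    AUTO_RESUME    = TRUE;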
Data Vault by example
Are you aware of publicly accessible examples of Data Vault 2.0 implementations? I published a Data Vault example with Oracle for training purposes here: https://danielfritzler.de/data-vault-oracle.html and then there is the Willibald implementation with datavault4dbt: https://www.linkedin.com/pulse/willibald-data-vault-warehouse-dbt-andreas-haas/ Do you have more?
6
3
New comment Aug 3
Data Vault by example
3 likes • Aug 3
We built one when I was at Snowflake to show a near real-time application: https://quickstarts.snowflake.com/guide/vhol_data_vault/index.html?index=..%2F..index#0
Kent Graziano
@kent-graziano-4146
Semi-retired Snowflake and Data Vault evangelist. Author, speaker, advisor.

Active 9d ago
Joined Jul 25, 2024