5 Common Questions You’re Asking About Snowflake
- November 14, 2022
The emergence of cloud technology is a beacon in the world of data and analytics. As with any new technology, the cloud is opening the door to exciting new opportunities. At the same time, new technology brings its share of uncertainty and ambiguity. Our purpose here at NTT DATA is to build confidence through knowledge of the technology, so those fears can be put to rest.
To achieve these goals, we are proud to work with Snowflake as an Elite partner. Snowflake is a leader in the cloud industry, and the Snowflake Data Cloud gives companies the opportunity to access, build, collaborate on, and monetize their data. NTT DATA has trained 500 professionals on Snowflake's capabilities and has implemented the platform for global clients and projects in the United States, India, Japan, and Europe.
Over the course of this article, our goal is to facilitate a conversation around all things Snowflake, including Snowflake's features and capabilities. In addition, you'll learn some of the best practices and recommendations that NTT DATA's Snowflake experts have discovered across their many Snowflake implementations.
Here are just a few of Snowflake’s capabilities:
- Prepping Snowflake for data load
- Loading structured data via different-sized warehouses
- Running analytical queries
- Utilizing result set caching
- Cloning tables
- Loading semi-structured data
- Creating views on raw semi-structured data
- Building queries joining structured and semi-structured data
- Understanding and using time travel
We know how imperative it is to make the right choice when it comes to company data. With that in mind, we want to help our clients build a better overall understanding of Snowflake, the Data Cloud, and its features. So let's check out the answers to five questions our clients frequently ask.
When it comes to preventing unauthorized access and protecting sensitive data such as personally identifiable information (PII), does Snowflake provide a way to selectively hide the column data of a table?
Yes — Snowflake's Column-level Security provides features to protect sensitive data from unauthorized access at the column level in a table or view by means of Dynamic Data Masking or External Tokenization.
With Dynamic Data Masking, data stored in Snowflake is secured by masking policies that obfuscate it at query time. When a user executes a query, Snowflake evaluates the masking policy conditions to determine whether that user sees the plain-text value or a masked, partially masked, obfuscated, or tokenized value.
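As a minimal sketch of how this looks in practice (table, column, and role names here are hypothetical), the following creates a masking policy that reveals email addresses only to an authorized role and attaches it to a column:

```sql
-- Reveal the real value only to an authorized role; everyone else sees a masked value.
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    ELSE '*** MASKED ***'
  END;

-- Attach the policy to the column it should protect.
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```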
With External Tokenization, data is tokenized before being loaded into Snowflake and then detokenized at query runtime. Tokenization is the process of removing sensitive data by replacing it with an undecipherable token. Masking policies that call external functions perform the detokenization, so authorized users can dynamically see the actual contents. This approach is more secure than Dynamic Data Masking because the data remains in tokenized form at rest. Additionally, you can consider using a view, or a series of views, instead of or in conjunction with masking policies to further filter columns and provide more meaningful column aliases.
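A comparable sketch for External Tokenization, assuming the tokenized data is already loaded and an external function (here called detokenize_ssn, a hypothetical name) wraps the tokenization provider's detokenization API:

```sql
-- Authorized roles get detokenized values via the external function;
-- all other roles see the stored token.
CREATE MASKING POLICY ssn_detokenize AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN detokenize_ssn(val)
    ELSE val
  END;

ALTER TABLE customers MODIFY COLUMN ssn SET MASKING POLICY ssn_detokenize;
```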
What are micro-partitions in Snowflake? How can micro-partitions and data clustering improve DML operations and speed up query execution?
"Micro-partition" is the term for the blocks of 50–500 MB of data (uncompressed) into which Snowflake divides a table as DML operations (inserts, updates, deletes) run. Within each micro-partition, columns are stored independently in what is often referred to as columnar storage. Columns are compressed at rest, and this storage method allows individual columns to be scanned quickly and efficiently. Clustering metadata is collected for each micro-partition as it is created and is used for granular pruning of the data during query processing.
Additionally, micro-partitions are never modified: DML operations create new micro-partitions, mark previously used micro-partitions for removal, or both. For example, if an update statement changes 20 values in a micro-partition, a new micro-partition containing the original data plus the 20 updated values is created and referenced as the "current" micro-partition. The old micro-partition is then marked for removal once the Time Travel and Fail-safe retention periods expire.
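To illustrate, the sketch below (with hypothetical table and column names) defines a clustering key so related rows co-locate in micro-partitions, checks the clustering state, and runs a query that can prune micro-partitions using the collected metadata:

```sql
-- Cluster the table on the column most often used in filters.
ALTER TABLE sales CLUSTER BY (sale_date);

-- Inspect how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');

-- Filtering on the clustering key lets Snowflake skip (prune) micro-partitions
-- whose metadata shows they cannot contain matching rows.
SELECT SUM(amount) FROM sales WHERE sale_date = '2022-11-01';
```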
What is the cloning feature in Snowflake? Can it help me to save time and money on storage costs?
Cloning, or “zero-copy” cloning, is a feature provided by Snowflake where a logical copy of a database object is created from a specific snapshot of another database object without duplicating storage.
Snowflake achieves this through logical separation of micro-partition metadata. One copy of the data remains in storage, but multiple database objects hold micro-partition references to that single copy. Because a clone shares its source's micro-partitions, it consumes no additional data storage until its data is modified, which saves money on storage costs.
Snowflake can clone at both the container level (a database or schema) and the object level (a table), and cloning a container-level object automatically clones the objects it contains.
It's also important to keep in mind that a cloned object is writable and completely independent of its source. DML applied to the source object or to the clone affects only that copy of the data and creates new micro-partitions for that branch only. Therefore, it's recommended that you review the considerations for how cloned objects are handled and created within a database, schema, or table.
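As a quick sketch (object names are hypothetical), cloning works at the table, schema, and database level, and can be combined with Time Travel to clone from an earlier point in time:

```sql
-- Zero-copy clones at the object and container level.
CREATE TABLE orders_dev CLONE orders;
CREATE SCHEMA analytics_dev CLONE analytics;
CREATE DATABASE prod_copy CLONE prod;

-- Clone the table as it looked 24 hours (86,400 seconds) ago, using Time Travel.
CREATE TABLE orders_yesterday CLONE orders AT (OFFSET => -86400);
```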
What is Snowflake Time Travel?
At a basic level, Snowflake's Time Travel feature enables historical data to be accessed after it has been modified or deleted, at any point within a defined retention period. Time Travel was designed to help restore data-related objects (tables, schemas, and databases) that may have had DML accidentally applied, and it can also clone data objects at, or before, specific points in time. Likewise, data can be queried at, or before, specific points in time, even if it has since been modified.
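For example, the queries below (with a hypothetical table name and a placeholder query ID) show how data can be read as of an earlier point in time or restored after an accidental drop:

```sql
-- Query the table as it looked one hour ago.
SELECT * FROM orders AT (OFFSET => -3600);

-- Query the table as it looked just before a specific statement ran.
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');

-- Restore a dropped table while it is still within the retention period.
UNDROP TABLE orders;
```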
Additionally, Time Travel gives visibility into overall data usage and manipulation over different time periods specified by the user. One important item to note is that Time Travel retention periods vary depending on the Snowflake license. The default retention period is one day, and it's automatically enabled for all accounts. However, users on the Enterprise Edition (or higher) of Snowflake can set the retention period to any value from zero to 90 days for all permanent databases, schemas, and tables.
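Setting the retention period is a simple parameter change, sketched below with hypothetical object names (the 90-day value requires Enterprise Edition or higher):

```sql
-- Extend Time Travel retention for a single table.
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Or set a default retention for everything in a database.
ALTER DATABASE prod SET DATA_RETENTION_TIME_IN_DAYS = 30;
```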
What is the best use case for Snowpipe? How can it be used to provide real-time insights into data and overcome the barriers of legacy solutions?
In its simplest form, Snowpipe is a data ingestion service offered by Snowflake that enables files to be loaded into Snowflake via two mechanisms: automated and REST endpoints.
The automated mechanism relies on cloud messaging. Using this method, the cloud provider notifies Snowflake when a file becomes available in a Snowflake external stage (cloud storage).
On the other hand, the REST endpoint mechanism functions similarly to the automated mechanism.
What makes it different? With this method, the client application calls a public REST endpoint with the name of a pipe object and a list of data filenames to be ingested into the target table. If new data files matching that list are discovered in the stage referenced by the pipe object, they are queued for loading. At that point, Snowflake-provided compute resources load data from the queue into a Snowflake table according to the parameters specified by the pipe object.
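Both mechanisms center on a pipe object. The sketch below (with hypothetical stage, table, and pipe names, and assuming cloud event notifications are already configured for auto-ingest) creates a pipe that loads JSON files as they land in an external stage:

```sql
-- AUTO_INGEST = TRUE relies on cloud messaging; set it to FALSE when files
-- will instead be submitted through the Snowpipe REST endpoint.
CREATE PIPE sales_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO sales_raw
  FROM @sales_stage
  FILE_FORMAT = (TYPE = 'JSON');
```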
Snowpipe was designed to be a more automated and cost-effective alternative to the bulk copy method using a designated warehouse. Therefore, it's the better choice for small, frequently delivered files that need to be processed within a short window of their arrival or as a stream.
Snowpipe is another tool in the data integration tool bag, and it differs from other data integration solutions such as Fivetran, Matillion, and Talend/Stitch. Snowpipe imports data from Snowflake external stages, such as AWS S3, Azure Blob Storage/ADLS Gen2/general-purpose v2 accounts, or Google Cloud Storage. This is what sets it apart: other data integration tools connect directly to a multitude of source systems to extract data and load it directly into Snowflake (or other target systems). When evaluating options for data pipelines, the actual data integration use case should be at the forefront of any decision.
Wrapping Up
We've reached the end of our five questions, but these answers are just the tip of the iceberg when it comes to expanding your knowledge of Snowflake and its capabilities. Whether you need to optimize for security, cost, and performance in Snowflake, review the overall ecosystem, or explore data pipelining, analytics, and automation, we can help you on the journey to cross the finish line. Connect with us to learn more.