Salesforce Wave Analytics dataflow performance: a practitioner's guide for 2017

By mid-2017, organizations that adopted Salesforce Wave Analytics in 2014 or 2015 are hitting significant performance bottlenecks in their dataflow pipelines. These teams initially embraced Wave Analytics as a fast way to build dashboards and reports; they now face dataflows that take 45 to 90 minutes to run on datasets exceeding 5 million rows. The root causes are not new: they stem from design decisions made in early implementations and from the lack of scalable practices in the platform's native tooling.

In 2017, the platform is still maturing and is in the middle of a name change: Wave Analytics was rebranded as Einstein Analytics during the year, and the Tableau CRM name did not arrive until 2020, so most teams are still working with Wave-era tooling and documentation. The dataflow performance issues are particularly acute for sales orgs that rely heavily on Salesforce data for forecasting, pipeline tracking, and customer insights. The Spring 2017 release introduced breaking changes that further complicated dataflow optimization. This guide outlines the key patterns and anti-patterns we've observed in our engagements with 300+ Salesforce orgs, including extraction with sfdcDigest, watermarking, and the dataset registration trap.

The dataset registration trap

One of the most common performance pitfalls in early Wave Analytics implementations is the way sfdcRegister is used in dataflows. sfdcRegister is the transformation that registers a dataset so it can be queried, and it overwrites the dataset on every run. It often fails silently when datasets exceed a certain size or complexity. In 2017, teams registering datasets over 10 million rows often ran into failures or incomplete data loads.

Here's a typical example of a problematic dataflow:

{
  "Extract_Accounts": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "CreatedDate" }
      ]
    }
  },
  "Register_Accounts": {
    "action": "sfdcRegister",
    "parameters": {
      "alias": "Account",
      "name": "Account",
      "source": "Extract_Accounts"
    }
  }
}

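This pattern degrades at scale because the whole object is extracted and the whole dataset is rewritten on every run, so the dataflow often times out or returns incomplete results. Note that sfdcRegister is the only transformation the dataflow reference documents for registering a dataset; there is no sfdcWrite transformation to replace it with. The fix sits upstream of the register node: extract only the fields and rows the dashboards need, and explicitly define the dataset schema.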
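Because incomplete loads can pass without an error, it helps to reconcile row counts after each run. The following is a minimal sketch, not platform functionality: the function name `reconcile_row_counts` and the default tolerance are our own, and the two counts are assumed to have been fetched separately (for example, a count query against the source object and one against the registered dataset).

```python
def reconcile_row_counts(source_rows, dataset_rows, tolerance=0.001):
    """Compare the source row count with the registered dataset's row count.

    Returns (ok, shortfall), where shortfall is the fraction of source rows
    missing from the dataset (negative if the dataset has extra rows).
    """
    if source_rows < 0 or dataset_rows < 0:
        raise ValueError("row counts must be non-negative")
    if source_rows == 0:
        # Nothing to load: the run is only consistent if the dataset is empty too.
        return dataset_rows == 0, 0.0
    shortfall = (source_rows - dataset_rows) / source_rows
    return abs(shortfall) <= tolerance, shortfall


# Hypothetical counts: 10 million accounts in the org, 9.2 million registered.
ok, shortfall = reconcile_row_counts(10_000_000, 9_200_000)
print(f"load complete: {ok}, shortfall: {shortfall:.1%}")
```

A non-zero tolerance allows for records created between the two counts; set it to zero if the counts are taken against a frozen source.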

Watermarking with sfdcDigest

In 2017, teams often tried to make dataflows incremental by adding a watermark filter to the sfdcDigest extract. The intent was to reduce processing time by only pulling new or updated records. However, the implementation often went wrong.

Here's a flawed approach:

{
  "Extract_Accounts": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "LastModifiedDate" }
      ],
      "complexFilterConditions": "LastModifiedDate > 2017-01-01T00:00:00Z"
    }
  },
  "Register_Accounts": {
    "action": "sfdcRegister",
    "parameters": {
      "alias": "Account",
      "name": "Account",
      "source": "Extract_Accounts"
    }
  }
}

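This approach fails for reasons that have little to do with sfdcDigest itself. The cutoff is hard-coded, so the extract window grows with every run instead of moving forward. The cutoff is a UTC timestamp, so a value derived from local time shifts the window by the local UTC offset. And a filtered extract yields only the filtered records, so registering its output directly replaces the full account list with the recent changes. On datasets over 5 million rows, teams often ended up with dataflows that took 60+ minutes to complete.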
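The time zone pitfall is easiest to avoid by building the cutoff in code rather than typing it by hand. The sketch below is an illustration under our own naming (`soql_datetime` and `modified_since_filter` are hypothetical helpers): it converts a timezone-aware datetime to UTC and emits it as a dateTime literal, which SOQL expects without quotes.

```python
from datetime import datetime, timedelta, timezone


def soql_datetime(moment):
    """Format a timezone-aware datetime as an unquoted dateTime literal in UTC."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before formatting")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def modified_since_filter(moment):
    """Build a LastModifiedDate filter string for the given cutoff."""
    return "LastModifiedDate > " + soql_datetime(moment)


# Midnight in a UTC+2 office is 22:00 UTC on the previous day.
local_midnight = datetime(2017, 1, 1, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(modified_since_filter(local_midnight))
# prints: LastModifiedDate > 2016-12-31T22:00:00Z
```

Rejecting naive datetimes is deliberate: silently assuming a time zone is the failure this helper exists to prevent.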

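A better approach is to let the extract track its own watermark rather than hard-coding one, as part of a more robust data ingestion strategy. A minimal version, assuming incremental extraction is available in your org, sets the incremental parameter on sfdcDigest and leaves registration unchanged:

{
  "Extract_Accounts": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "incremental": true,
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "LastModifiedDate" }
      ]
    }
  },
  "Register_Accounts": {
    "action": "sfdcRegister",
    "parameters": {
      "alias": "Account",
      "name": "Account",
      "source": "Extract_Accounts"
    }
  }
}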
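Where the watermark is managed outside the platform, the rule for advancing it matters as much as the filter. The sketch below shows one conservative rule under our own assumptions: `next_watermark` is a hypothetical helper, the five-minute overlap is an arbitrary default, and all datetimes are timezone-aware UTC values.

```python
from datetime import datetime, timedelta, timezone


def next_watermark(previous, run_started, succeeded, overlap=timedelta(minutes=5)):
    """Return the watermark to store after a dataflow run.

    On failure the previous watermark is kept, so no changes are skipped.
    On success the watermark moves to the run's start time minus a small
    overlap, so records modified while the extract was running are picked
    up again on the next run.
    """
    if not succeeded:
        return previous
    candidate = run_started - overlap
    # Never move the watermark backwards.
    return max(previous, candidate)


previous = datetime(2017, 6, 1, 0, 0, tzinfo=timezone.utc)
started = datetime(2017, 6, 2, 1, 0, tzinfo=timezone.utc)
print(next_watermark(previous, started, succeeded=True))
print(next_watermark(previous, started, succeeded=False))
```

The overlap means a few records are extracted twice, so whatever consumes the extract has to tolerate or remove duplicates.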

The augment vs append dilemma

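In 2017, teams often used the augment transformation to append new data to existing datasets, as a way of handling incremental updates. However, augment is a lookup join, not an append: it adds columns from a right-hand source to the rows of a left-hand source and never adds rows. Used this way on large datasets, it often causes dataflows to fail or return inconsistent results.

Here's an example of a problematic dataflow:

{
  "Extract_New_Accounts": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [{ "name": "Id" }, { "name": "Name" }, { "name": "CreatedDate" }],
      "complexFilterConditions": "CreatedDate > 2017-01-01T00:00:00Z"
    }
  },
  "Load_Existing_Accounts": {
    "action": "edgemart",
    "parameters": { "alias": "Account" }
  },
  "Augment_Accounts": {
    "action": "augment",
    "parameters": {
      "left": "Load_Existing_Accounts",
      "left_key": ["Id"],
      "right": "Extract_New_Accounts",
      "right_key": ["Id"],
      "right_select": ["Name", "CreatedDate"],
      "relationship": "New"
    }
  },
  "Register_Accounts": {
    "action": "sfdcRegister",
    "parameters": { "alias": "Account", "name": "Account", "source": "Augment_Accounts" }
  }
}

This approach fails because the output has exactly the rows of the existing dataset, so new accounts never appear in it. It also doesn't handle large data volumes efficiently, and it runs into issues with schema mismatches and data type conflicts.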

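The solution is to use append instead, which stacks the rows of several sources into one and is more robust for large datasets. Here the existing dataset is loaded with edgemart and the new records are appended to it; {watermark} is a placeholder that your deployment tooling substitutes before upload, not dataflow syntax:

{
  "Extract_New_Accounts": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [{ "name": "Id" }, { "name": "Name" }, { "name": "CreatedDate" }],
      "complexFilterConditions": "CreatedDate > {watermark}"
    }
  },
  "Load_Existing_Accounts": {
    "action": "edgemart",
    "parameters": { "alias": "Account" }
  },
  "Append_Accounts": {
    "action": "append",
    "parameters": {
      "sources": ["Load_Existing_Accounts", "Extract_New_Accounts"]
    }
  },
  "Register_Accounts": {
    "action": "sfdcRegister",
    "parameters": { "alias": "Account", "name": "Account", "source": "Append_Accounts" }
  }
}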
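The difference between the two operations is easier to see on a handful of rows. The sketch below is a simplified in-memory model written for illustration, not the platform's implementation: `augment` here mimics a single-value lookup join that prefixes looked-up fields with the relationship name, and `append` is a plain union. The sample records are made up.

```python
def augment(left_rows, right_rows, left_key, right_key, right_select, relationship):
    """Lookup join: keep every left row and add the selected right-hand fields."""
    lookup = {}
    for row in right_rows:
        lookup.setdefault(row[right_key], row)  # first match wins in this model
    result = []
    for row in left_rows:
        match = lookup.get(row[left_key], {})
        enriched = dict(row)
        for field in right_select:
            enriched[relationship + "." + field] = match.get(field)
        result.append(enriched)
    return result


def append(*sources):
    """Union: stack the rows of every source, without de-duplicating."""
    combined = []
    for rows in sources:
        combined.extend(dict(row) for row in rows)
    return combined


existing = [{"Id": "001A", "Name": "Acme"}]
new = [{"Id": "001B", "Name": "Globex"}]

print(len(augment(existing, new, "Id", "Id", ["Name"], "New")))  # prints: 1
print(len(append(existing, new)))  # prints: 2
```

The augmented output still has one row, with an empty looked-up field, while the appended output has both accounts. Because the union does not de-duplicate, it suits newly created records; changes to existing records need a different mechanism.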

Nightly full-refresh dataflows in Sales Wave App

Sales Wave App dataflows in 2017 often run full refreshes every night. For organizations with datasets larger than 30 million rows, these dataflows can take 60 - 90 minutes to complete. This is a major bottleneck for timely reporting and analytics.

Here's a typical dataflow for the Sales Wave App:

{
  "Extract_Opportunities": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "Amount" },
        { "name": "StageName" },
        { "name": "CloseDate" },
        { "name": "AccountId" }
      ]
    }
  },
  "Register_Opportunities": {
    "action": "sfdcRegister",
    "parameters": {
      "alias": "Opportunity",
      "name": "Opportunity",
      "source": "Extract_Opportunities"
    }
  }
}

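This pattern is inefficient because it re-extracts and rewrites every row each night, with no incremental updates or partitioning. The solution is a hybrid approach: incremental runs on most nights, built from an sfdcDigest extract limited to new records and an append onto the existing dataset, plus a periodic full refresh to pick up the updates and deletions that incremental runs cannot capture.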
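The scheduling side of a hybrid approach can be as small as a rule that decides each night which kind of run to trigger. The sketch below is one possible rule under our own assumptions: `refresh_mode` is a hypothetical helper, and the seven-day interval between full refreshes is an arbitrary default rather than a platform setting.

```python
from datetime import date


def refresh_mode(run_date, last_full_refresh, max_days_between_full=7):
    """Pick 'full' or 'incremental' for a nightly run.

    A full refresh is forced when none has ever run, or when the last one
    is at least max_days_between_full days old.
    """
    if last_full_refresh is None:
        return "full"
    if (run_date - last_full_refresh).days >= max_days_between_full:
        return "full"
    return "incremental"


print(refresh_mode(date(2017, 6, 8), date(2017, 6, 5)))   # prints: incremental
print(refresh_mode(date(2017, 6, 12), date(2017, 6, 5)))  # prints: full
```

Keying the decision on the age of the last full refresh, rather than on a fixed weekday, means a failed or skipped full run is retried the following night.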

Impact of Spring 2017 release changes

The Spring 2017 release introduced breaking changes that affected many dataflow patterns. Specifically, the way sfdcRegister and sfdcDigest handle large datasets was altered. These changes caused many existing dataflows to fail or run much slower.

Teams that had previously relied on default behavior now had to refactor their dataflows: trimming what reaches sfdcRegister and explicitly defining dataset schemas. sfdcDigest itself was not deprecated and remains the documented extract transformation; what teams moved away from was hand-rolled watermark filters, in favor of more robust watermarking strategies.

Performance benchmarks and real-world examples

Across 300+ Salesforce orgs in 2017, we found that dataflows with datasets over 5 million rows typically ran 3x slower than those under 1 million rows. Organizations with datasets exceeding 30 million rows saw average run times of 60 - 90 minutes for full refreshes, compared to a target of 12 - 18 minutes.
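To put the gap between observed and target run times in concrete terms, the figures can be turned into throughput. The sketch below uses 30 million rows, the lower bound of the largest bracket, and pairs the best observed time with the best target and the worst with the worst; that pairing is our assumption for illustration.

```python
def rows_per_minute(rows, minutes):
    """Average throughput of a dataflow run."""
    return rows / minutes


def required_speedup(observed_minutes, target_minutes):
    """How many times faster a run must get to hit its target."""
    return observed_minutes / target_minutes


ROWS = 30_000_000  # lower bound of the "exceeding 30 million rows" bracket

for observed, target in ((60, 12), (90, 18)):
    print(
        f"{observed} min -> {target} min: "
        f"{rows_per_minute(ROWS, observed):,.0f} -> "
        f"{rows_per_minute(ROWS, target):,.0f} rows/min, "
        f"{required_speedup(observed, target):.1f}x faster"
    )
```

At both ends of the range the target implies a fivefold speedup, which is more than trimming a few fields is likely to deliver and is why the sections above focus on avoiding full extracts.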

One financial services client saw a 40% reduction in dataflow run times after reworking the steps feeding sfdcRegister and implementing a custom watermarking approach. Another client that had been using augment to add rows saw a 50% improvement in dataflow reliability after switching to append.

Best practices for 2017

In 2017, the best practices for Wave Analytics dataflow performance include:

  • Keep what reaches sfdcRegister lean: trim fields and filter rows before registering datasets over 1 million rows
  • Use append instead of augment for incremental updates
  • Implement watermarking in the sfdcDigest extract, with incremental extraction or a managed cutoff rather than a hard-coded date
  • Partition large datasets into smaller chunks for processing
  • Refactor dataflows to avoid full-refresh patterns where possible
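For the partitioning item, one common way to chunk a large object is by calendar month on a date field. The sketch below generates one filter string per month; the helper names are hypothetical, CreatedDate is only an example field, and the filters are plain strings to be placed wherever your extract accepts a date condition.

```python
from datetime import date


def month_partitions(start, end):
    """Yield (first_day, first_day_of_next_month) pairs for every calendar
    month from the month containing start up to, but excluding, a month
    that begins on or after end."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


def partition_filters(start, end, field="CreatedDate"):
    """Build one half-open date-range filter per monthly partition."""
    return [
        f"{field} >= {lo.isoformat()}T00:00:00Z AND {field} < {hi.isoformat()}T00:00:00Z"
        for lo, hi in month_partitions(start, end)
    ]


for condition in partition_filters(date(2016, 11, 15), date(2017, 2, 1)):
    print(condition)
```

Half-open ranges (inclusive start, exclusive end) keep adjacent partitions from overlapping, so no record lands in two chunks.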

Implications for your organization

If your organization is running dataflows on datasets over 5 million rows, you're likely experiencing performance issues. The patterns outlined here are critical for maintaining scalability and performance in 2017. Teams that implement these changes early will see significant improvements in dataflow run times and reliability.

Organizations using Salesforce Wave Analytics in 2017 should prioritize refactoring their dataflows to avoid the pitfalls of oversized sfdcRegister loads, augment used in place of append, and full-refresh patterns. These changes will help ensure that analytics and reporting remain fast and accurate.

FAQ

Q: What are the most common causes of slow dataflows in Wave Analytics? A: Slow dataflows are most often caused by full-object extracts feeding sfdcRegister, augment used where append is needed, and full-refresh patterns on large datasets. None of these patterns scales well to high-volume data processing.

Q: Can I still use sfdcDigest for watermarking in 2017? A: Yes. sfdcDigest is the standard extract transformation and remains available; what we don't recommend for datasets over 5 million rows is a hard-coded watermark filter. Use incremental extraction or an externally managed watermark instead.

Q: What's the recommended replacement for sfdcRegister? A: There isn't one: sfdcRegister is the documented way to register a dataset. Reduce the rows and fields that reach it and explicitly define dataset schemas. This improves both performance and reliability.

Engage CRMA Labs for a fixed-fee audit, sprint, or retainer at https://crmalabs.com