Data Strategy
Why proprietary data is your only sustainable AI advantage, and how to build the flywheel that compounds it.
Your Biggest Competitor Isn't Who You Think
Right now, your most dangerous competitor isn't the company across town or the well-funded startup in your space. It's the data you throw away every single day while your product runs.
Any company on Earth can call Claude's API. Any company can license GPT. The models are commodities. But only you own what your users do inside your product — the clicks, the corrections, the things they ignore, the paths they take. That's your unfair advantage. And most companies throw the bulk of it away.
Let's fix that.
Think About It Like a Loyalty Card
Imagine you walk into two grocery stores.
Store A has had your loyalty card for 5 years. It knows you buy oat milk every Tuesday, that you switched from regular pasta to gluten-free last March, and that you always grab dark chocolate before a long weekend.
Store B just opened. Beautiful store. Same products. Same prices. But it has zero idea who you are.
Which store gives you better coupons? Which store feels like it "gets" you?
That's the data advantage. Store A didn't build a better coupon-printing machine. It just remembered more about you. And a new competitor — no matter how much money it spends on fancy coupon technology — can't buy five years of your grocery habits.
Your AI works the same way.
"There Are No Dumb Questions"
Q: Wait — if everyone can use the same AI models, how can data be an advantage?

A: The model is like a brilliant chef. Your data is the pantry. Two restaurants can hire the same chef, but the one with rare, locally-sourced ingredients that took years to cultivate makes the meal nobody else can replicate. The chef (model) is interchangeable. The pantry (data) is not.

Q: My company already collects tons of data. Aren't we fine?

A: Collecting data is like having groceries delivered but never opening the bags. The question isn't whether you have data — it's whether you've closed the loop: collecting it, cleaning it, feeding it back into your AI, and using the AI's outputs to generate even better data. That's the flywheel.

Q: What if we're a small company? Don't we need millions of users for this to work?

A: Nope. A 500-person B2B SaaS with deep usage data on how procurement teams negotiate contracts has a data advantage that a trillion-dollar company without that specific signal can never buy. Depth beats breadth.
The Data Flywheel: Your Compounding Machine
Here's the engine that separates companies that use AI from companies that dominate with AI — a four-stage loop: users use the product, usage generates proprietary data, that data makes the AI better, and the better AI attracts more users.
Each revolution of this flywheel makes your AI harder to catch. It's not a one-time advantage — it compounds. Let's break down each stage:
| Stage | What Happens | Example | Where Most Companies Fail |
|---|---|---|---|
| 1. Users use product | Real humans interact with your AI features | A sales rep uses your AI to draft an email | Not embedding AI into daily workflows — it stays a side feature nobody opens |
| 2. Proprietary data generated | Every interaction creates a signal — clicks, edits, ignores, corrections | The rep rewrites the AI's subject line from "Following Up" to "Quick question about your Q3 budget" | Not capturing these signals — the rewrite happens but nobody stores what changed or why |
| 3. Better AI | Captured signals retrain or augment the model via fine-tuning or RAG | Next time, the AI drafts subject lines that sound like this rep's style | Not closing the loop — data sits in a warehouse but never flows back into the model |
| 4. More users | Better experience drives adoption and retention | Other reps hear "the AI actually writes good subject lines now" and start using it | Not measuring the flywheel — nobody tracks whether data improvements actually drive adoption |
The LinkedIn vs. Startup Smackdown
Let's make this real with a story that should keep you up at night — or inspire you, depending on which side you're on.
The Setup: It's 2017. A well-funded startup — let's call them TalentAI — raises $80 million. They license the exact same foundation model LinkedIn uses for job recommendations. They hire brilliant ML engineers from Google and Meta. Their pitch deck says: "We'll out-engineer LinkedIn's AI."
LinkedIn's Secret Weapon: LinkedIn now has over 1 billion members (announced October 2023); in 2017, it already had hundreds of millions of members generating daily interactions at that scale — job applications, profile views, connection requests, content engagement, and search queries. On top of that sat nearly fifteen years of professional graph data (LinkedIn launched in 2003): who worked where, who knows whom, which career paths lead where.
But here's the part most people miss — LinkedIn almost blew it.
Before 2018, LinkedIn's recommendation AI was... fine. Average. Industry-benchmark click-through rates. They had this ocean of data and were barely sipping from it. They were focused on fancier models, not better data pipelines.
The Turning Point: In 2018, LinkedIn's AI team made a strategic bet. Rather than chasing better model architectures, the team redirected focus toward data infrastructure — specifically, capturing the implicit signals users were already generating but LinkedIn was ignoring:
- Profile views where the person didn't send a connection request (what made them bounce?)
- Job listings browsed but not applied to (what turned them off?)
- Search queries abandoned mid-session (what were they really looking for?)
- Time spent reading a post before scrolling past (interest without engagement)
These aren't clicks. They're the ghosts of intent — the things people almost did. And they turned out to be more valuable than the things people actually did. (Based on publicly available descriptions of LinkedIn's recommendation system work; specific internal strategy details are illustrative of the general approach.)
The Result: Within two years of closing that data loop, job recommendation engagement improved significantly. Meanwhile, TalentAI — spending millions on newer, fancier models — saw marginal gains. They could buy better compute. They could not buy fifteen years of professional behavior data.
TalentAI quietly shut down in 2021 (a composite example). They lost to a spreadsheet of implicit signals, not to a better algorithm.
"There Are No Dumb Questions"
Q: So LinkedIn won because they're big? Small companies can't compete?

A: LinkedIn won because they captured signals others ignored. Size helped, but the decision to instrument implicit behavior was available to anyone. TalentAI had $80 million and chose to spend it on model architecture instead of data capture. That was the mistake, not their size.

Q: What are "implicit signals" exactly?

A: Explicit signals are things users intentionally tell you — clicking "like," submitting a form, writing a review. Implicit signals are things users reveal through behavior without meaning to — hovering over a button, reading for 30 seconds then leaving, searching for something and giving up. Implicit signals are usually 10-100x more abundant than explicit ones.
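The explicit/implicit distinction becomes concrete once you instrument it. Here's a minimal sketch of an event logger that tags each signal by kind — the event names, fields, and in-memory store are illustrative assumptions, not a standard schema or a production pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Signal:
    user_id: str
    event: str    # e.g. "clicked_like", "abandoned_search" (illustrative names)
    kind: str     # "explicit" or "implicit"
    context: dict = field(default_factory=dict)
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class SignalLog:
    """Append-only store for behavioral signals (in-memory stand-in for a real pipeline)."""
    def __init__(self):
        self.events = []

    def explicit(self, user_id, event, **context):
        # The user deliberately told us something.
        self.events.append(Signal(user_id, event, "explicit", context))

    def implicit(self, user_id, event, **context):
        # Behavior revealed intent without the user meaning to tell us.
        self.events.append(Signal(user_id, event, "implicit", context))

log = SignalLog()
log.explicit("u42", "clicked_like", item="post_981")
log.implicit("u42", "abandoned_search", query="project management tem", dwell_seconds=9)
log.implicit("u42", "hovered_cta", button="upgrade", hover_ms=2300)

implicit_count = sum(1 for s in log.events if s.kind == "implicit")
print(implicit_count)  # 2 implicit signals vs 1 explicit — and that ratio only grows
```

Note that the two implicit events here — the abandoned search and the long hover — are exactly the "ghosts of intent" most products never write down.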
The Duolingo Story: 500 Billion Reasons You Can't Catch Up
Let's visit the language-learning world, where Duolingo is running one of the most terrifying data flywheels on the planet.
Duolingo launched in 2011 — roughly a decade and a half of operation. In that time, it has publicly reported accumulating an enormous volume of learning interactions across its platform (see Duolingo's annual reports for current figures). That's an almost incomprehensible number of data points. Every answer. Every hesitation. Every retry. Every streak break. Every time someone got "ser" and "estar" confused at 11pm on a Tuesday.
Now imagine you're a brilliant founder. You just launched a competing AI language tutor using the same Claude API Duolingo could use. Your app looks gorgeous. Your marketing is on point. You even have a celebrity endorsement.
Here's what happens when a learner makes a mistake:
| | Your Startup | Duolingo |
|---|---|---|
| What the AI says | "Good try! The correct answer is X." | "You've made this same mistake with ser/estar 14 times in the last 3 weeks, specifically when the subject is an emotion. Here's a pattern to remember." |
| Why | You have zero history on this learner | Duolingo has this learner's complete error log across 500+ sessions |
| Data behind the response | Generic language rules from training data | This specific learner's personal confusion patterns, mapped against billions of interactions from millions of other learners who had the same confusion (illustrative figure — verify against Duolingo's current annual report) |
| What happens next | Learner gets the same generic hint next time | Duolingo spaces the next ser/estar quiz to appear in exactly 3 days (optimized by spaced repetition data from millions of similar learners) |
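The "quiz again in exactly 3 days" step in the table is a spaced-repetition schedule. Here's a minimal sketch of the idea — a simplified SM-2-style interval update where the constants are illustrative assumptions, not Duolingo's actual parameters:

```python
def next_interval_days(prev_interval, got_it_right, ease=2.5):
    """Simplified spaced-repetition update: grow the review interval on
    success, reset it on failure. Constants are illustrative."""
    if not got_it_right:
        return 1            # missed it: review again tomorrow
    if prev_interval <= 1:
        return 3 if prev_interval == 1 else 1  # the "3 days" from the table
    return round(prev_interval * ease)         # keep stretching the gap

# A learner who keeps confusing ser/estar: right, right, wrong, right.
interval = 0
for correct in [True, True, False, True]:
    interval = next_interval_days(interval, correct)
print(interval)  # 3 — the mistake reset the schedule back to a short gap
```

What makes Duolingo's version a moat isn't this formula — anyone can implement it — but the billions of interactions used to tune the constants per concept, per learner profile.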
The startup cannot buy this advantage. It cannot fine-tune its way to this advantage. It cannot hire enough engineers to build this advantage. The only way to get 500 billion learning interactions is to have 500 billion learning interactions. Time is the ingredient.
And every day Duolingo runs, the gap gets wider, not narrower.
How to Build Your Flywheel (Starting Monday)
Stop reading this as theory. Here's the exact playbook:
Step 1: Identify your top three behavioral signals. What do your users do inside your product that reveals intent, preference, or expertise? Not what they click — what they mean. Look for corrections, rejections, hesitations, and search refinements.
Step 2: Design the capture pipeline. A data pipeline is the automated plumbing that captures, stores, and organizes those signals. If your pipeline doesn't exist, your signals evaporate. Every correction a user makes to an AI suggestion? That's labelled training data — data already tagged with the right answer — the moment you capture it.
Step 3: Close the feedback loop. Use the AI's outputs as training signal for its next version. When a user corrects an AI suggestion, ignores a recommendation, or spends extra time on a result, that interaction must flow back into the model. If it doesn't, you have a data lake, not a data flywheel.
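The three steps above can be sketched end to end. This is a minimal in-memory illustration, assuming a hypothetical correction-capture hook — a sketch of the shape of the loop, not a production pipeline:

```python
class FeedbackLoop:
    """Steps 2 and 3 in miniature: capture user corrections as labelled
    examples and queue them for the next training run."""
    def __init__(self):
        self.training_queue = []  # labelled examples awaiting retraining

    def record_correction(self, prompt, ai_output, user_final):
        # The user's final version IS the label -- no annotation step needed.
        if user_final != ai_output:
            self.training_queue.append({
                "input": prompt,
                "rejected": ai_output,   # what the model drafted
                "chosen": user_final,    # what the user actually shipped
            })

loop = FeedbackLoop()
loop.record_correction(
    prompt="Draft a follow-up subject line for Acme Corp",
    ai_output="Following Up",
    user_final="Quick question about your Q3 budget",
)
print(len(loop.training_queue))  # 1 labelled training example, captured for free
```

The rejected/chosen pairing is deliberate: that format feeds preference-based fine-tuning directly, so the queue is retraining-ready the moment it fills up.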
| Action | What It Means | What You Do With It |
|---|---|---|
| User corrects AI suggestion | "The AI was wrong in this specific way" | Labelled training data — retrain on the correction |
| User ignores recommendation | "This wasn't relevant to this user in this context" | Negative signal — deprioritize similar recommendations |
| User spends extra time on a result | "This was interesting but maybe not actionable" | Engagement signal — surface similar content more often |
| User edits AI output before using it | "The AI was close but not quite right here" | The diff between AI output and user edit is pure gold — it shows exactly where the model needs improvement |
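The last row calls the diff between AI output and user edit "pure gold." Extracting that diff takes a few lines of standard-library code; a sketch, assuming the edit text has already been captured:

```python
import difflib

def edit_diff(ai_output: str, user_edit: str):
    """Return the word-level operations the user applied to the AI's draft.
    Each non-equal opcode pinpoints where the model fell short."""
    ai_words, user_words = ai_output.split(), user_edit.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
        a=ai_words, b=user_words
    ).get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(ai_words[i1:i2]), " ".join(user_words[j1:j2])))
    return ops

ops = edit_diff(
    "Thanks for your time today",
    "Thanks for your time yesterday",
)
for tag, before, after in ops:
    print(tag, repr(before), "->", repr(after))  # replace 'today' -> 'yesterday'
```

Aggregate these diffs across thousands of users and patterns emerge — the specific phrases, tones, and facts your model consistently gets wrong.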
The Scoreboard: Are You Winning or Losing?
Here's a quick self-assessment. Score yourself honestly:
| Question | Yes = +1 | No = 0 |
|---|---|---|
| Do you capture user corrections to AI outputs? | ||
| Do those corrections flow back into model training? | ||
| Can you name your top 3 proprietary data assets? | ||
| Do you track implicit signals (not just clicks)? | ||
| Is there an executive data owner with performance accountability? | | |
- 0-1 points: Your competitors are building the moat right now. You're handing them training data by not collecting your own.
- 2-3 points: You've started, but the loop isn't closed. Data without a feedback loop is just storage costs.
- 4-5 points: You're in flywheel territory. Focus on acceleration — how do you spin it faster?
Back to Your Biggest Competitor
That competitor across town can call the same Claude API you call. They can hire ML engineers from the same talent pool. What they cannot do is replicate five years of your users' behavioral signals — the corrections, the ignored recommendations, the searches that ended in a different search. LinkedIn almost threw this advantage away before 2018, when they were sipping from a data ocean while chasing fancier model architectures. The moment they redirected toward capturing implicit signals — the profile views that didn't convert, the job listings browsed but not applied to — their recommendation engagement improved significantly. TalentAI had $80 million and better models. They lost because they were building with purchased tomatoes while LinkedIn was cultivating a farm that took nearly two decades to grow. Your flywheel doesn't start when you hit a million users. It starts the moment you instrument the first implicit signal and close the loop back into the model.
Key Takeaways
- Your data is your moat, not your model. Any company can call the same API. Only you own the behavioral signals your users generate inside your product. Start capturing them before your competitor does.
- Corrections are gold. Every time a user edits, overrides, or ignores an AI output, that's labelled training data — pre-tagged with the right answer. Every such interaction you discard is a gift to your competitor.
- The flywheel compounds. Better data makes better AI. Better AI attracts more users. More users generate more data. This loop doesn't add cost — it seeds returns that no later budget increase can replicate. The best time to start was five years ago. The second best time is Monday.
Knowledge Check
1. Your CDO says the organization's data isn't ready for AI at scale. What is your first move as an executive?
2. What is the most important distinction between a data lake and a data warehouse from an AI readiness perspective?
3. A competitor acquires a data company with 50 million consumer records. How do you evaluate whether this represents a meaningful AI moat?
4. Gartner research has consistently found that a majority of AI projects fail due to data quality issues — with estimates suggesting rates as high as 60% of projects abandoned. Which governance structure most directly prevents this outcome?