Data Strategy
Why proprietary data is your only sustainable AI advantage, and how to build the flywheel that compounds it.
Your Biggest Competitor Isn't Who You Think
Right now, your most dangerous competitor isn't the company across town or the well-funded startup in your space. It's the data you throw away every single day while your product runs.
Any company on Earth can call Claude's API. Any company can license GPT. The models are commodities. But only you own what your users do inside your product — the clicks, the corrections, the things they ignore, the paths they take. That's your unfair advantage. And most companies throw the bulk of it away.
Let's fix that.
Think About It Like a Loyalty Card
Imagine you walk into two grocery stores.
Store A has had your loyalty card for 5 years. It knows you buy oat milk every Tuesday, that you switched from regular pasta to gluten-free last March, and that you always grab dark chocolate before a long weekend.
Store B just opened. Beautiful store. Same products. Same prices. But it has zero idea who you are.
Which store gives you better coupons? Which store feels like it "gets" you?
That's the data advantage. Store A didn't build a better coupon-printing machine. It just remembered more about you. And a new competitor — no matter how much money it spends on fancy coupon technology — can't buy five years of your grocery habits.
Your AI works the same way.
"There Are No Dumb Questions"
Q: Wait — if everyone can use the same AI models, how can data be an advantage?

A: The model is like a brilliant chef. Your data is the pantry. Two restaurants can hire the same chef, but the one with rare, locally-sourced ingredients that took years to cultivate makes the meal nobody else can replicate. The chef (model) is interchangeable. The pantry (data) is not.

Q: My company already collects tons of data. Aren't we fine?

A: Collecting data is like having groceries delivered but never opening the bags. The question isn't whether you have data — it's whether you've closed the loop: collecting it, cleaning it, feeding it back into your AI, and using the AI's outputs to generate even better data. That's the flywheel.

Q: What if we're a small company? Don't we need millions of users for this to work?

A: Nope. A 500-person B2B SaaS with deep usage data on how procurement teams negotiate contracts has a data advantage that a trillion-dollar company without that specific signal can never buy. Depth beats breadth.
The Data Flywheel: Your Compounding Machine
Here's the engine that separates companies that use AI from companies that dominate with AI — a four-stage loop: users use the product, usage generates proprietary data, that data makes the AI better, and the better AI attracts more users.
Each revolution of this flywheel makes your AI harder to catch. It's not a one-time advantage — it compounds. Let's break down each stage:
| Stage | What Happens | Example | Where Most Companies Fail |
|---|---|---|---|
| 1. Users use product | Real humans interact with your AI features | A sales rep uses your AI to draft an email | Not embedding AI into daily workflows — it stays a side feature nobody opens |
| 2. Proprietary data generated | Every interaction creates a signal — clicks, edits, ignores, corrections | The rep rewrites the AI's subject line from "Following Up" to "Quick question about your Q3 budget" | Not capturing these signals — the rewrite happens but nobody stores what changed or why |
| 3. Better AI | Captured signals retrain or augment the model via fine-tuning or RAG | Next time, the AI drafts subject lines that sound like this rep's style | Not closing the loop — data sits in a warehouse but never flows back into the model |
| 4. More users | Better experience drives adoption and retention | Other reps hear "the AI actually writes good subject lines now" and start using it | Not measuring the flywheel — nobody tracks whether data improvements actually drive adoption |
The LinkedIn vs. Startup Smackdown
Let's make this real with a story that should keep you up at night — or inspire you, depending on which side you're on.
The Setup: It's 2017. A well-funded startup — let's call them TalentAI — raises $80 million. They license the exact same foundation model LinkedIn uses for job recommendations. They hire brilliant ML engineers from Google and Meta. Their pitch deck says: "We'll out-engineer LinkedIn's AI."
LinkedIn's Secret Weapon: LinkedIn now has over 1 billion members (announced October 2023); in 2017, it already had hundreds of millions of members generating daily interactions at that scale — job applications, profile views, connection requests, content engagement, and search queries. On top of that sat nearly fifteen years of professional graph data (LinkedIn launched in 2003): who worked where, who knows whom, which career paths lead where.
But here's the part most people miss — LinkedIn almost blew it.
Before 2018, LinkedIn's recommendation AI was... fine. Average. Industry-benchmark click-through rates. They had this ocean of data and were barely sipping from it. They were focused on fancier models, not better data pipelines.
The Turning Point: In 2018, LinkedIn's AI team made a strategic bet. Rather than chasing better model architectures, the team redirected focus toward data infrastructure — specifically, capturing the implicit signals users were already generating but LinkedIn was ignoring:
- Profile views where the person didn't send a connection request (what made them bounce?)
- Job listings browsed but not applied to (what turned them off?)
- Search queries abandoned mid-session (what were they really looking for?)
- Time spent reading a post before scrolling past (interest without engagement)
These aren't clicks. They're the ghosts of intent — the things people almost did. And they turned out to be more valuable than the things people actually did. (Based on publicly available descriptions of LinkedIn's recommendation system work; specific internal strategy details are illustrative of the general approach.)
The Result: Within two years of closing that data loop, job recommendation engagement improved significantly. Meanwhile, TalentAI — spending millions on newer, fancier models — saw marginal gains. They could buy better compute. They could not buy fifteen years of professional behavior data.
TalentAI quietly shut down in 2021 (a composite example). They lost to a spreadsheet of implicit signals, not to a better algorithm.
"There Are No Dumb Questions"
Q: So LinkedIn won because they're big? Small companies can't compete?

A: LinkedIn won because they captured signals others ignored. Size helped, but the decision to instrument implicit behavior was available to anyone. TalentAI had $80 million and chose to spend it on model architecture instead of data capture. That was the mistake, not their size.

Q: What are "implicit signals" exactly?

A: Explicit signals are things users intentionally tell you — clicking "like," submitting a form, writing a review. Implicit signals are things users reveal through behavior without meaning to — hovering over a button, reading for 30 seconds then leaving, searching for something and giving up. Implicit signals are usually 10-100x more abundant than explicit ones.
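The explicit/implicit distinction becomes concrete once you instrument it. Here's a minimal sketch of an event logger that tags each signal by kind — the event names, fields, and in-memory store are illustrative assumptions, not a standard schema or a production pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Signal:
    user_id: str
    event: str    # e.g. "clicked_like", "abandoned_search" (illustrative names)
    kind: str     # "explicit" or "implicit"
    context: dict = field(default_factory=dict)
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class SignalLog:
    """Append-only store for behavioral signals (in-memory stand-in for a real pipeline)."""
    def __init__(self):
        self.events = []

    def explicit(self, user_id, event, **context):
        # The user deliberately told us something.
        self.events.append(Signal(user_id, event, "explicit", context))

    def implicit(self, user_id, event, **context):
        # Behavior revealed intent without the user meaning to tell us.
        self.events.append(Signal(user_id, event, "implicit", context))

log = SignalLog()
log.explicit("u42", "clicked_like", item="post_981")
log.implicit("u42", "abandoned_search", query="project management tem", dwell_seconds=9)
log.implicit("u42", "hovered_cta", button="upgrade", hover_ms=2300)

implicit_count = sum(1 for s in log.events if s.kind == "implicit")
print(implicit_count)  # 2 implicit signals vs 1 explicit — and that ratio only grows
```

Note that the two implicit events here — the abandoned search and the long hover — are exactly the "ghosts of intent" most products never write down.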
The Duolingo Story: 500 Billion Reasons You Can't Catch Up
Let's visit the language-learning world, where Duolingo is running one of the most terrifying data flywheels on the planet.
Duolingo launched in 2011 — roughly a decade and a half of operation. In that time, it has publicly reported accumulating an enormous volume of learning interactions across its platform (see Duolingo's annual reports for current figures). That's an almost incomprehensible number of data points. Every answer. Every hesitation. Every retry. Every streak break. Every time someone got "ser" and "estar" confused at 11pm on a Tuesday.
Now imagine you're a brilliant founder. You just launched a competing AI language tutor using the same Claude API Duolingo could use. Your app looks gorgeous. Your marketing is on point. You even have a celebrity endorsement.
Here's what happens when a learner makes a mistake:
| | Your Startup | Duolingo |
|---|---|---|
| What the AI says | "Good try! The correct answer is X." | "You've made this same mistake with ser/estar 14 times in the last 3 weeks, specifically when the subject is an emotion. Here's a pattern to remember." |
| Why | You have zero history on this learner | Duolingo has this learner's complete error log across 500+ sessions |
| Data behind the response | Generic language rules from training data | This specific learner's personal confusion patterns, mapped against billions of interactions from millions of other learners who had the same confusion (illustrative figure — verify against Duolingo's current annual report) |
| What happens next | Learner gets the same generic hint next time | Duolingo spaces the next ser/estar quiz to appear in exactly 3 days (optimized by spaced repetition data from millions of similar learners) |
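The "quiz again in exactly 3 days" step in the table is a spaced-repetition schedule. Here's a minimal sketch of the idea — a simplified SM-2-style interval update where the constants are illustrative assumptions, not Duolingo's actual parameters:

```python
def next_interval_days(prev_interval, got_it_right, ease=2.5):
    """Simplified spaced-repetition update: grow the review interval on
    success, reset it on failure. Constants are illustrative."""
    if not got_it_right:
        return 1            # missed it: review again tomorrow
    if prev_interval <= 1:
        return 3 if prev_interval == 1 else 1  # the "3 days" from the table
    return round(prev_interval * ease)         # keep stretching the gap

# A learner who keeps confusing ser/estar: right, right, wrong, right.
interval = 0
for correct in [True, True, False, True]:
    interval = next_interval_days(interval, correct)
print(interval)  # 3 — the mistake reset the schedule back to a short gap
```

What makes Duolingo's version a moat isn't this formula — anyone can implement it — but the billions of interactions used to tune the constants per concept, per learner profile.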
The startup cannot buy this advantage. It cannot fine-tune its way to this advantage. It cannot hire enough engineers to build this advantage. The only way to get 500 billion learning interactions is to have 500 billion learning interactions. Time is the ingredient.
And every day Duolingo runs, the gap gets wider, not narrower.
How to Build Your Flywheel (Starting Monday)
Stop reading this as theory. Here's the exact playbook:
Step 1: Identify your top three behavioral signals. What do your users do inside your product that reveals intent, preference, or expertise? Not what they click — what they mean. Look for corrections, rejections, hesitations, and search refinements.
Step 2: Design the capture pipeline. A data pipeline is the automated plumbing that captures, stores, and organizes those signals. If your pipeline doesn't exist, your signals evaporate. Every correction a user makes to an AI suggestion? That's labelled training data — data already tagged with the right answer — the moment you capture it.
Step 3: Close the feedback loop. Use the AI's outputs as training signal for its next version. When a user corrects an AI suggestion, ignores a recommendation, or spends extra time on a result, that interaction must flow back into the model. If it doesn't, you have a data lake, not a data flywheel.
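The three steps above can be sketched end to end. This is a minimal in-memory illustration, assuming a hypothetical correction-capture hook — a sketch of the shape of the loop, not a production pipeline:

```python
class FeedbackLoop:
    """Steps 2 and 3 in miniature: capture user corrections as labelled
    examples and queue them for the next training run."""
    def __init__(self):
        self.training_queue = []  # labelled examples awaiting retraining

    def record_correction(self, prompt, ai_output, user_final):
        # The user's final version IS the label -- no annotation step needed.
        if user_final != ai_output:
            self.training_queue.append({
                "input": prompt,
                "rejected": ai_output,   # what the model drafted
                "chosen": user_final,    # what the user actually shipped
            })

loop = FeedbackLoop()
loop.record_correction(
    prompt="Draft a follow-up subject line for Acme Corp",
    ai_output="Following Up",
    user_final="Quick question about your Q3 budget",
)
print(len(loop.training_queue))  # 1 labelled training example, captured for free
```

The rejected/chosen pairing is deliberate: that format feeds preference-based fine-tuning directly, so the queue is retraining-ready the moment it fills up.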
| Action | What It Means | What You Do With It |
|---|---|---|
| User corrects AI suggestion | "The AI was wrong in this specific way" | Labelled training data — retrain on the correction |
| User ignores recommendation | "This wasn't relevant to this user in this context" | Negative signal — deprioritize similar recommendations |
| User spends extra time on a result | "This was interesting but maybe not actionable" | Engagement signal — surface similar content more often |
| User edits AI output before using it | "The AI was close but not quite right here" | The diff between AI output and user edit is pure gold — it shows exactly where the model needs improvement |
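The last row calls the diff between AI output and user edit "pure gold." Extracting that diff takes a few lines of standard-library code; a sketch, assuming the edit text has already been captured:

```python
import difflib

def edit_diff(ai_output: str, user_edit: str):
    """Return the word-level operations the user applied to the AI's draft.
    Each non-equal opcode pinpoints where the model fell short."""
    ai_words, user_words = ai_output.split(), user_edit.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
        a=ai_words, b=user_words
    ).get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(ai_words[i1:i2]), " ".join(user_words[j1:j2])))
    return ops

ops = edit_diff(
    "Thanks for your time today",
    "Thanks for your time yesterday",
)
for tag, before, after in ops:
    print(tag, repr(before), "->", repr(after))  # replace 'today' -> 'yesterday'
```

Aggregate these diffs across thousands of users and patterns emerge — the specific phrases, tones, and facts your model consistently gets wrong.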
The Scoreboard: Are You Winning or Losing?
Here's a quick self-assessment. Score yourself honestly:
| Question | Yes = +1 | No = 0 |
|---|---|---|
| Do you capture user corrections to AI outputs? | ||
| Do those corrections flow back into model training? | ||
| Can you name your top 3 proprietary data assets? | ||
| Do you track implicit signals (not just clicks)? | ||
| Is there an executive data owner with performance accountability? | | |
- 0-1 points: Your competitors are building the moat right now. You're handing them training data by not collecting your own.
- 2-3 points: You've started, but the loop isn't closed. Data without a feedback loop is just storage costs.
- 4-5 points: You're in flywheel territory. Focus on acceleration — how do you spin it faster?
Back to Your Biggest Competitor
That competitor across town can call the same Claude API you call. They can hire ML engineers from the same talent pool. What they cannot do is replicate five years of your users' behavioral signals — the corrections, the ignored recommendations, the searches that ended in a different search. LinkedIn almost threw this advantage away before 2018, when they were sipping from a data ocean while chasing fancier model architectures. The moment they redirected toward capturing implicit signals — the profile views that didn't convert, the job listings browsed but not applied to — their recommendation engagement improved significantly. TalentAI had $80 million and better models. They lost because they were building with purchased tomatoes while LinkedIn was cultivating a farm that took nearly two decades to grow. Your flywheel doesn't start when you hit a million users. It starts the moment you instrument the first implicit signal and close the loop back into the model.
Key Takeaways
- Your data is your moat, not your model. Any company can call the same API. Only you own the behavioral signals your users generate inside your product. Start capturing them before your competitor does.
- Corrections are gold. Every time a user edits, overrides, or ignores an AI output, that's labelled training data — pre-tagged with the right answer. Every such interaction you discard is a gift to your competitor.
- The flywheel compounds. Better data makes better AI. Better AI attracts more users. More users generate more data. This loop doesn't add cost — it seeds returns that no later budget increase can replicate. The best time to start was five years ago. The second best time is Monday.
Knowledge Check
1. Your CDO says the organization's data isn't ready for AI at scale. What is your first move as an executive?
2. What is the most important distinction between a data lake and a data warehouse from an AI readiness perspective?
3. A competitor acquires a data company with 50 million consumer records. How do you evaluate whether this represents a meaningful AI moat?
4. Gartner research has consistently found that a majority of AI projects fail due to data quality issues — with estimates suggesting rates as high as 60% of projects abandoned. Which governance structure most directly prevents this outcome?