How Valve Ran One of the Smartest A/B Tests in Gaming History — and Never Told Anyone

22 months. 128 heroes. Millions of players. Billions of matches. One giant experiment disguised as a gameplay feature.


By Bill Ni

Mostly written with Claude


On March 25, 2026, buried inside 20,000 words of patch notes for DotA 2's Patch 7.41, Valve dropped a single line that sent shockwaves through the competitive scene: "Facets removed from the game."

The community reaction was split. Some celebrated the removal of a system they saw as bloated and impossible to balance. Others mourned the loss of strategic depth. Most coverage framed it the same way: Valve tried something, it didn't work out, they rolled it back.

I don't think that is the case.

I believe Valve knew exactly what they were doing from the start. Facets were never just a gameplay feature. They were a data collection mechanism — a massive, live A/B test dressed up as a hero customization system, run across the entire player base for nearly two years. And when Valve had the data they needed, they shut the experiment down, harvested the results, and used them to permanently reshape the game.


For the Uninitiated: What Is DotA 2?

If you're reading this and you've never touched a MOBA, here's what you need to know.

DotA 2 is a free-to-play competitive game developed by Valve, the company behind Steam, Half-Life, and Counter-Strike. Two teams of five players each select from a roster of over 120 unique characters called heroes, then compete on a mostly symmetrical map to destroy each other's base. Games last anywhere from 20 minutes to over an hour. The game is free, enormously popular, and legendarily complex.

Each hero has a unique set of abilities. Items purchased during the game modify those abilities and stats. Team composition, drafting strategy, in-game resource management, and split-second mechanical execution all matter. DotA 2 sits at the extreme end of competitive depth — professional teams compete for millions of dollars at events like The International, and the skill ceiling is so high that players can grind for thousands of hours and still feel like beginners.

This complexity matters for our story because it means that every design decision Valve makes about a hero doesn't exist in isolation. Change one hero, and you change how that hero interacts with every other hero, every item, every strategy. The design space is enormous, and testing changes in a vacuum is practically impossible. The only lab that matters is the live game itself.

What Were Facets?

In May 2024, Valve released Patch 7.36, which introduced two new mechanics: Innate Abilities (passive traits baked into each hero) and Facets.

Facets worked like this: every hero had at least two Facets. During the strategy phase before a match, each player selected one Facet for their chosen hero. Once locked in, the choice was permanent for the duration of that game. Your opponents couldn't see which Facet you'd picked until the match started.

What Facets actually did varied enormously from hero to hero. Some were subtle stat tweaks. Others fundamentally altered how a hero played. A few examples:

Morphling had two Facets called Ebb and Flow. Ebb made Morphling an Agility hero — a traditional right-click carry that scales into the late game by hitting harder and faster. Flow turned him into a Strength hero — tankier, more suited to mid lane or even support play. This wasn't a minor buff. Choosing a Facet determined Morphling's entire identity for that game.

Axe had a Facet called One Man Army that converted a portion of his armor into bonus Strength when no allied heroes were nearby. It was devastatingly effective, at times maintaining win rates above 61% — among the highest of any Facet over the system's lifespan.

Magnus had Reverse Reverse Polarity, which pushed enemies away during his ultimate ability instead of pulling them in. It hovered around a 2.4% pick rate. Players almost universally considered it the worst Facet in the game.

Invoker, one of DotA 2's most mechanically complex heroes, eventually received three Facets — one centered on each of his three orb abilities (Quas, Wex, Exort). Each Facet granted a bonus level to its corresponding orb, permanently improved a specific spell, and came with unique Aghanim's Scepter and Shard upgrades. Three Facets effectively turned one hero into three distinct sub-characters.

Sand King had a Facet called Dust Devil that was so dominant at The International 2024 (the game's biggest annual tournament) that it single-handedly warped the professional meta around the hero.

Valve's stated design intent was to give players meaningful pre-game customization — a way to tailor a hero's playstyle to their preference or to the specific matchup they were facing. On the surface, it was a feature. Underneath, it was something else entirely.

The "Problem" With Facets (That Wasn't Really a Problem)

Almost immediately after launch, a pattern emerged that the community treated as a failure: for most heroes, one Facet was simply better than the other. Not situationally better. Categorically better.

Some Facets had pick rates above 95%. Across all skill levels and all matchups, nearly every player selected the same option. The "choice" was, in practical terms, an illusion. Community members and analysts pointed this out constantly. Reddit threads lamented the imbalance. Patch after patch, Valve tweaked Facets — buffing the weak ones, nerfing the dominant ones — in what looked like a losing battle to make the system work.

And that's where most people stopped their analysis: Facets are imbalanced, Valve can't fix them, the system is flawed.

But consider it from Valve's perspective. If you're trying to decide which of two design directions is better for a hero, is it a problem that 95% of players are telling you the answer? Or is that exactly the signal you were looking for?

The "imbalance" wasn't a bug. It was the result.

What Facets Gave Valve: A Dataset No Playtest Could Match

Let's think about what Valve was actually collecting over those 22 months.

For every hero in the game, Valve had access to the pick rate of each Facet (which design players preferred), its win rate (how that design actually performed), and how both varied across skill brackets, matchup combinations, and time.

This isn't the kind of data you get from an internal playtest with 50 testers over two weeks. This is data from millions of self-selecting participants making decisions under genuine competitive pressure, with outcomes measured in wins and losses, aggregated over nearly two years. And Valve didn't have to recruit a single participant. They just shipped the feature and watched.

Now compare this to how hero design decisions are traditionally made. A designer proposes a change. It gets debated internally. Maybe it goes through a limited playtest. Maybe it hits a public test server where a small, non-representative subset of players tries it out. Then it ships, and the team watches metrics and community reaction to see if it was the right call.

With Facets, Valve could test two (or three) versions of a hero simultaneously, in the real game, at full scale, for as long as they wanted. Every match was a trial. Every player was a participant. Every outcome was a data point.
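To make that concrete, here's a minimal sketch of the aggregation involved: turning raw per-match records into per-Facet pick and win rates. The records, Facet names, and outcomes below are illustrative stand-ins — nothing here reflects Valve's actual telemetry or data pipeline.

```python
from collections import defaultdict

# Hypothetical per-match records: (hero, facet, won). In reality these
# would be millions of rows per week; five rows suffice for the shape.
matches = [
    ("Morphling", "Ebb", True),
    ("Morphling", "Ebb", True),
    ("Morphling", "Flow", False),
    ("Morphling", "Ebb", False),
    ("Morphling", "Flow", True),
]

def facet_stats(matches):
    """Aggregate raw match outcomes into per-Facet pick and win rates."""
    picks = defaultdict(int)      # (hero, facet) -> times picked
    wins = defaultdict(int)       # (hero, facet) -> games won
    hero_games = defaultdict(int) # hero -> total games observed
    for hero, facet, won in matches:
        picks[(hero, facet)] += 1
        hero_games[hero] += 1
        if won:
            wins[(hero, facet)] += 1
    return {
        key: {
            "pick_rate": picks[key] / hero_games[key[0]],
            "win_rate": wins[key] / picks[key],
        }
        for key in picks
    }

stats = facet_stats(matches)
```

With real data, the same two numbers per (hero, Facet) pair are exactly what third-party sites surface publicly — and exactly the signal the article argues Valve was harvesting.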

Patch 7.41: Harvesting the Results

If Facets were just a failed feature, you'd expect Valve to remove them and move on. Revert heroes to their pre-Facet state. Maybe keep a few of the popular ones as permanent additions. A simple rollback.

That's not what happened.

Patch 7.41 is meticulous. It's surgical. Across the entire roster of 128+ heroes, Valve took the Facet data and folded it back into the game through three distinct strategies — each one revealing a different way they used the experiment's results.

Strategy 1: Bake the Winner into the Base Kit

For many heroes, Valve took the dominant Facet and permanently integrated it into the hero's default abilities or Innate Ability. No more choice. The experiment determined the answer, and the answer became permanent.

Heroes like Centaur Warrunner, Crystal Maiden, and Lina had their favored Facets merged directly into their base kits. These heroes are now the version that players overwhelmingly chose.

Leshrac is a particularly revealing case. His popular Facet, Misanthropy, had increased the explosion frequency of his Diabolic Edict ability but prevented it from damaging buildings. In 7.41, Valve kept the explosion frequency buff but removed the building damage restriction. They didn't just copy the popular Facet — they improved on it, using the data to understand that players valued the faster explosions (the upside) and merely tolerated the building restriction (the tradeoff). The result is a hero that's better than either Facet alone — a synthesis that's only possible when you understand exactly what each design direction contributed.

Strategy 2: Gate the Choice Behind an Item Purchase

For heroes where the Facet choice had genuine strategic value — where both options represented legitimately different playstyles — Valve didn't eliminate the choice. They moved it behind an in-game item purchase, preserving the decision while shifting it to a point in the match where it could be better balanced.

Invoker is the flagship example. His three Facets became three options presented when purchasing Aghanim's Scepter or Aghanim's Shard. The items now sit inert in your inventory until you manually activate them and choose an upgrade. The choice is permanent for the game, just like Facets were. But it's gated behind a significant gold investment (4,200 gold for Scepter), which means it hits at mid-game rather than being free from minute zero. This matters for balance: a free power spike at level one is much harder to tune than one gated behind an item timer.

Shadow Shaman and Viper received similar treatment. The Facet data told Valve that these heroes genuinely benefited from having options — but the options needed to exist within the game's economy rather than outside it.

Strategy 3: Overrule the Popular Choice

This is the category that most strongly supports the intentional-test thesis. For some heroes, Valve did not keep the popular Facet. They kept the one they believed was better for the game.

Clockwerk is the clearest example. His more popular Facet was Expanded Armature. Players picked it most of the time. But in 7.41, Valve retained Armor Power instead — the less-picked option that let Clockwerk consume Chainmail items for permanent armor bonuses.

If Facets were just a feature that Valve was cleaning up, they'd have kept the popular one. But if Facets were a test, and Valve was analyzing the data with designer judgment, this choice makes perfect sense. Popularity is one signal. Win rate is another. Long-term design coherence is a third. Role health across the meta is a fourth. Valve had all of these signals, and for Clockwerk, the less-popular option was the better answer.

Some heroes lost both Facets entirely without a clear replacement, and their win rates cratered. Timbersaw dropped 7%. Templar Assassin fell 10%. These weren't accidents — they were deliberate editorial decisions. Valve looked at the data from both Facets, decided neither version was the right direction for the hero, and chose to rebuild from scratch. The test told them what not to do, which is just as valuable.

Why I Believe This Was Intentional

You could argue that Valve simply got lucky — that Facets were a genuine feature that happened to produce useful data, and Valve was smart enough to use that data when removing them. That's the charitable-accident interpretation, and it's not unreasonable.

But several details make me think this was deliberate from the beginning.

The structure was too clean. If you were designing a system purely for gameplay, you'd optimize for the player experience. You'd make both Facets equally viable. You'd iterate aggressively until the choice felt meaningful every game. Instead, Valve let massive imbalances persist for months at a time. They adjusted Facets, but never with the urgency you'd expect if the primary goal were a polished player-facing feature. The pace of iteration looked much more like a team that was collecting data than a team that was trying to ship a balanced system.

New heroes shipped without Facets. If Facets were a core feature that Valve was committed to, every new hero would have launched with them. Instead, multiple heroes released without Facets, receiving them later or not at all. This makes no sense if Facets are a feature. It makes perfect sense if Facets are a test — you don't need to run the test on heroes that haven't accumulated enough match data yet.

The removal was too sophisticated. If Valve were simply reverting a failed experiment, the patch would be much simpler: remove Facets, restore pre-7.36 hero states. Instead, 7.41 is one of the largest and most detailed hero rework patches in DotA 2 history. Every hero was individually addressed. Facet effects were routed through three different integration strategies based on what the data suggested for each specific case. This isn't cleanup. This is the payoff.

Valve has done this before. Valve is a company that famously runs its entire business on data. Steam's recommendation algorithms, CS2's matchmaking systems, DotA 2's own behavior score and reporting infrastructure — Valve has a deep institutional comfort with large-scale data collection disguised as product features. The idea that they'd use a gameplay mechanic as a design research tool isn't a stretch. It's consistent with how they operate.

The timing was deliberate. Valve dropped Patch 7.41 in the middle of ESL One Birmingham 2026, one of the year's first major LAN tournaments. Sixteen of the world's best teams had to adapt on the fly. If you're Valve, and you're confident that your data-driven reworks are correct, this is the perfect field test: force the best players in the world to immediately play the post-experiment version of the game, under competitive pressure, on a global stage. That's not the behavior of a company cleaning up a mess. That's the behavior of a company stress-testing a new product release.

The A/B Test Framework

Let's formalize what Valve did in the language of experimental design, because the parallel is almost exact:

Hypothesis: For each hero, there exist multiple viable design directions. Some will be better than others in terms of balance, player satisfaction, and role health.

Test Design: Present each player with a forced choice between design variants (Facets) before every match. The choice is visible in aggregate through public APIs that third-party sites like Dotabuff, STRATZ, and Dota2ProTracker surface as pick rates and win rates.

Sample Size: The entire DotA 2 player base, running millions of matches per week across all skill brackets and regions.

Duration: 22 months (May 2024 – March 2026), spanning multiple balance patches that functioned as iterative adjustments to the test parameters.

Metrics: Pick rate (revealed preference), win rate (effectiveness), skill-bracket distribution (complexity signal), meta evolution over time (robustness signal).

Analysis and Implementation: Close the test. For each hero, integrate the design variant supported by the data — adjusting for designer judgment where raw popularity conflicts with long-term game health. Deploy results as permanent changes.

This is textbook A/B testing methodology. The only unusual thing about it is the scale, the stakes, and the fact that the test subjects were playing a video game instead of clicking a checkout button.
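The scale is what makes the results decisive. As an illustration, here's a standard two-proportion z-test — a generic statistics sketch that anyone could run on published pick and win numbers, not anything Valve has described — with invented match counts at roughly Dota-scale sample sizes.

```python
import math

def two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test: do two Facets have genuinely different win rates?"""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    # Pooled win rate under the null hypothesis that both Facets are equal.
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative numbers: 52.2% vs 49.5% win rate over 500k games each.
z, p = two_proportion_ztest(261_000, 500_000, 247_500, 500_000)
```

At half a million games per arm, a win-rate gap of under three percentage points yields a z-score in the high twenties — far beyond any conventional significance threshold. Differences that an internal playtest could never detect are unambiguous at this scale.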

The Broader Lesson: Ship the Experiment

There's a principle here that extends far beyond gaming.

Most teams — in software, in product design, in any field — agonize over design decisions in conference rooms. They debate. They prototype. They run small tests on subsets of users. And then they ship a single version, hoping they made the right call.

What Valve did was different. They shipped the debate. They said: "We don't know whether Morphling should be an Agility carry or a Strength hero. Let's ship both and let 15 million players tell us." They said: "We don't know whether Axe's fantasy is a solo berserker or a team fighter. Let's find out." They said: "We don't know which version of Invoker is the most interesting. Let's build three and run the experiment for two years."

The cost was real. For 22 months, DotA 2's balance suffered. New players were confused. The game was harder to learn. Competitive integrity was compromised by imbalanced Facets that warped the meta. Valve accepted that cost — or more likely, decided it was worth paying for the data.

And now the experiment is over, and the game is reshaped by its results. Every hero in DotA 2 has been pressure-tested against alternative versions of itself, and the surviving designs carry the weight of billions of data points behind them.

No playtest, no focus group, no internal review could have produced that.

What Comes Next

Patch 7.41 dropped mid-tournament, and the professional scene is still adjusting. Win rates are volatile. Some heroes who relied on strong Facets are suddenly weaker. Others are liberated. The meta is in flux — which is exactly where Valve wants it, because a shifting meta generates engagement, and engagement generates data for the next round of iteration.

For players and fans, this is one of the most exciting periods in DotA 2's recent history. Not because something was added, but because the game was refined by the largest live design experiment it's ever undergone.

And for anyone who thinks about product design, A/B testing, or decision-making under uncertainty: pay attention. Valve just showed the world what it looks like when you stop testing in a lab and start testing in production — at scale, for real, with the courage to commit to the results.

They just never told anyone that was the plan.


Patch data and statistics referenced in this post are drawn from Dotabuff, Liquipedia, Dota2ProTracker, STRATZ, and reporting from Insider Gaming, The Gamer, GosuGamers, Dot Esports, CyberScore, Strafe, and esports.gg.