
Bird’s Eye View: The Limits of Twitter’s Algorithm Release

It was one year ago, on the very day that Twitter’s board voted unanimously to accept Elon Musk’s $44 billion acquisition offer, that an anonymous Twitter employee — perhaps a disgruntled engineer or Musk acolyte — posted a code repository on Twitter’s official GitHub page simply named “the-algorithm.” The repository, likely a joke in response to Musk’s promise to open source Twitter’s recommendation algorithm, was empty and quickly deleted. But nearly a year later, a repository with the exact same name appeared again on Twitter’s GitHub, this time filled with hundreds of thousands of lines of code.

Musk, it seemed, had fulfilled his promise of transparency. Or had he?

At CDT, we have called many times for social media companies to be transparent about their recommendation algorithms. We see transparency not as an end in and of itself, but as a means of building a better internet: improving accountability to users, shedding light on opaque content moderation systems, and giving academics a way to research the larger societal effects of recommendation algorithms.

Yet “the-algorithm” fails to achieve these ends. While the parts of Twitter’s recommendation algorithm that the company has shared provide interesting insight, Twitter has ended or eviscerated the transparency tools that would allow civil society and researchers to understand the algorithm’s real-world impact: its Moderation Research Consortium, its transparency reports for government takedown requests, and, effectively, its API (by raising the price from zero to half a million dollars per year). Twitter’s “algorithm” is, in practice, a sort of twisted gift of the Magi — after shaving the public’s head, we are now being given a comb.

To understand why “the-algorithm” is not the transparency holy grail some hoped it would be, it is important to understand what it is and isn’t. Twitter has not exactly “open sourced” its algorithm, as some have framed it. The code is heavily redacted and missing several configuration files, meaning that it is essentially impossible for an independent researcher to run the algorithm on sample inputs or otherwise test it. The published code is also only a snapshot of Twitter’s recommendation system and is not actually connected to the live code running on its servers. That means Twitter can make changes to its production code without including them in the public repository, or make changes to the public repository that are not reflected in its production code.

So what has Twitter shared about its “For You” algorithm? First, it has shared new information about the architecture of the system. Like many recommendation algorithms, Twitter’s “For You” algorithm is broken into three stages (sketched in code after the list below):

  1. Candidate generation. Twitter has billions of potential tweets it can serve a user at any given time. The candidate generation system uses a giant neural network to select 1,500 tweets a given user will likely be interested in. This system also predicts the likelihood that the user will engage in certain actions with each candidate tweet, such as retweeting and liking. 
  2. Ranking. Once the 1,500 possible tweets to potentially serve are selected, they are scored based on the likelihood of those engagement actions, with some actions weighted more heavily than others. Higher scoring tweets will generally appear closer to the top of a user’s feed.
  3. Filtering. Tweets are not ordered strictly by their score. Heuristics and filters are applied to, for instance, avoid showing multiple tweets by the same author or to downrank tweets by authors the user has previously reported for violating site policy.
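
To make the shape of that pipeline concrete, here is a minimal sketch in Python. Everything in it (the names, data structures, and stand-in candidate selection) is our own illustration of the three stages described above, not Twitter’s actual code, which is far larger and written mostly in Scala.

```python
# Hypothetical, simplified sketch of the three-stage "For You" pipeline
# described above. Names and structure are illustrative, not Twitter's code.
from dataclasses import dataclass

@dataclass
class Candidate:
    tweet_id: int
    author_id: int
    engagement_probs: dict  # e.g. {"favorite": 0.04, "retweet": 0.01, ...}
    score: float = 0.0

def generate_candidates(user_id, corpus, limit=1500):
    # 1. Candidate generation: pick ~1,500 tweets the user is likely to care
    #    about out of billions (here we just take the first `limit` as a stand-in).
    return corpus[:limit]

def rank(candidates, weights):
    # 2. Ranking: score = sum over actions of (predicted probability * weight).
    for c in candidates:
        c.score = sum(p * weights.get(action, 0.0)
                      for action, p in c.engagement_probs.items())
    return sorted(candidates, key=lambda c: c.score, reverse=True)

def filter_feed(ranked):
    # 3. Filtering: apply heuristics, e.g. avoid showing consecutive tweets
    #    from the same author (a crude stand-in for author-diversity rules).
    feed, last_author = [], None
    for c in ranked:
        if c.author_id == last_author:
            continue
        feed.append(c)
        last_author = c.author_id
    return feed
```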

Within these stages, Twitter has shared the most about its ranking process, specifically how much it weights each predicted action in scoring (see the table below). The final score is calculated by adding up the probability of each action multiplied by its weight — so (probability of favoriting) * 0.5 + (probability of retweeting) * 1.0, and so on. It is difficult to know the true impact of each action, though, since the weights likely partially account for how rare or common a given action is. For instance, while the likelihood that someone replies to a tweet may be weighted 27 times as strongly as the likelihood that someone favorites a tweet, replying is a far less common action than favoriting. Without knowledge of the baseline likelihood of each action, it is impossible to determine how strongly each one factors into what a user sees in their recommendations.

| Predicted probability that a user will… | Score weight |
| --- | --- |
| Favorite a Tweet | 0.5 |
| Retweet a Tweet | 1.0 |
| Reply to a Tweet | 13.5 |
| Open the Tweet author’s profile and like or reply to a Tweet | 12.0 |
| Watch at least half of a video in a Tweet | 0.005 |
| Reply to the Tweet, and then that reply is engaged with by the Tweet’s author | 75.0 |
| Click into the conversation of the Tweet and reply or like it | 11.0 |
| Click into the conversation of the Tweet and stay there for at least two minutes | 10.0 |
| Negatively react to the Tweet (hit “show less often”, block, or mute the Tweet author) | -74.0 |
| Report the Tweet | -369.0 |
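
To see how these weights combine into a single score, here is a small worked example using the published weights from the table. The engagement probabilities are made-up numbers, since the real probabilities come from Twitter’s (unreleased) model outputs.

```python
# The weights below are the published ranking weights from the table above.
# The probabilities are invented for illustration only.
WEIGHTS = {
    "favorite": 0.5,
    "retweet": 1.0,
    "reply": 13.5,
    "open_profile_and_engage": 12.0,
    "watch_half_of_video": 0.005,
    "reply_engaged_by_author": 75.0,
    "click_conversation_and_engage": 11.0,
    "click_conversation_dwell_2min": 10.0,
    "negative_feedback": -74.0,
    "report": -369.0,
}

# Hypothetical predicted probabilities for one candidate tweet.
predicted = {
    "favorite": 0.04,
    "retweet": 0.01,
    "reply": 0.002,
    "report": 0.0001,
}

score = sum(predicted.get(action, 0.0) * weight
            for action, weight in WEIGHTS.items())
print(round(score, 4))
# 0.04*0.5 + 0.01*1.0 + 0.002*13.5 + 0.0001*(-369.0) = 0.0201
```

Note how, in this made-up example, the rare but heavily weighted reply (0.2% likely, contributing 0.027) adds more to the score than the far more likely favorite (4% likely, contributing 0.02) — the baseline-rate problem described above.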

Twitter also revealed some information about what other factors it considers besides the total ranking score of a tweet. For instance, Twitter tries to balance recommendations between people you do and don’t follow, avoid recommending too many consecutive tweets from the same author, and promote tweets from users subscribed to Twitter Blue. (This last factor was not mentioned in the blog post but was found in the code by multiple people, including Igor Brigadir and Vicki Boykis.)
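
As a rough sense of how such adjustments might be applied on top of an already-ranked list, here is a hypothetical sketch. The 1.5 boost factor, the dictionary keys, and the simple alternating mix are invented for illustration; they are not the actual values or logic in Twitter’s code.

```python
# Hypothetical post-ranking adjustments like those described above.
# All numbers and rules here are invented; the real ones are not public.
def adjust(feed):
    adjusted = []
    for tweet in feed:
        score = tweet["score"]
        if tweet.get("author_is_twitter_blue"):
            score *= 1.5  # made-up boost factor for Twitter Blue subscribers
        adjusted.append({**tweet, "score": score})
    adjusted.sort(key=lambda t: t["score"], reverse=True)

    # Interleave in-network and out-of-network tweets so neither dominates.
    in_net = [t for t in adjusted if t["in_network"]]
    out_net = [t for t in adjusted if not t["in_network"]]
    mixed = []
    while in_net or out_net:
        if in_net:
            mixed.append(in_net.pop(0))
        if out_net:
            mixed.append(out_net.pop(0))
    return mixed
```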

There is also a lot of code that Twitter did not share. It did not release much about the algorithm for generating candidate tweets to be ranked, including its parameters and training data, beyond noting that it tries to balance in-network and out-of-network tweets (i.e., from people you do and do not follow). Without this information, researchers have no way to run or otherwise independently test these systems. Twitter also explicitly did not share its trust and safety algorithms for detecting things such as abuse, toxicity, or adult content, in order to prevent people from finding workarounds, though it did release some of the categories of content it flags.

Nevertheless, if Twitter still had its former research tools, in particular its academic API, this disclosure would be a high-water mark for social media transparency and a promising jumping-off point for new research. How might Twitter’s downranking of URLs that lead users off of Twitter affect political discourse? With the increased reach of Twitter Blue users, who now has less reach? What’s the deal with the peach emoji? The Twitter API once gave researchers the data they needed to wrestle with these kinds of questions.

Algorithmic transparency can be useful if it allows users to understand their own experiences online, researchers to learn more about algorithms’ larger societal effects, and civil society to advocate for improvements. Twitter’s current approach to transparency unfortunately achieves none of these ends. While disclosure of parts of “the-algorithm” has given the public a better sense of how Twitter’s recommendation algorithm operates, the company has neutered its own efforts by removing the tools researchers use to understand how the algorithm affects the real world and cutting off the lines of communication with civil society necessary to receive and incorporate feedback. In other words, “the-algorithm” isn’t useful without a bird’s eye view.