
Not All Branded Mentions are Treated the Same

Ahrefs published a fascinating study showing that brand mentions have the strongest correlation with getting recommended in AI search.

But here’s a question that almost no AI SEOs are asking: “Are all brand mentions treated the same?”

The short answer is “no”. 

LLMs like ChatGPT are trained on huge amounts of content drawn from multiple datasets. OpenAI has confirmed that earlier models used a separate dataset (WebText2) made up entirely of web pages that had been linked from Reddit posts with at least 3 Karma.

This dataset was sampled more often during the training process, which meant a brand mention in one of those articles carried more weight than a mention in a low-quality guest post.

So no, not all brand mentions are equal. Getting brand mentions in heavily weighted datasets is more impactful.

Read on to learn how LLMs like ChatGPT are trained so that you can better understand how some brand mentions are more impactful than others.

How OpenAI Trained GPT-2

An AI assistant like ChatGPT is only as good as the data it’s trained on. Garbage in, garbage out.

So while working on GPT-2, OpenAI had to make sure they were training the model on high-quality datasets in order to get the best model possible.

The most readily available dataset comes from Common Crawl, which provides a completely free archive of the web (around 2 billion pages per crawl).

But here’s the issue. The internet is filled with spam, broken pages, low-effort content, factual inaccuracies, and sites that exist only to sell ads. 

If you trained an AI model on every piece of content across the web, you could end up with a pretty useless chatbot that parrots low-quality and inaccurate information.

So OpenAI needed a way to give high-quality web pages more weight than lower-quality articles.

Their solution? Create a dataset made up only of articles that had been linked from Reddit posts with 3+ Karma, and train on those instead.

Here’s a direct quote from the GPT-2 paper:

“…we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 Karma.”

Essentially, they took every outbound link shared on Reddit and kept only the ones that came from posts with at least 3 Karma. If a handful of Reddit users thought a link was worth upvoting, the page behind it was probably better than a random web page.
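To make that filter concrete, here’s a minimal sketch of what it could look like, assuming you have a dump of Reddit submissions where each record carries a score (Karma) and an outbound URL. The file name and field names are illustrative assumptions, not OpenAI’s actual scraping pipeline.

```python
# Minimal sketch of a WebText-style filter. Assumes a JSONL dump of Reddit
# submissions with "score" (Karma) and "url" fields; these names and the
# input file are illustrative, not OpenAI's actual setup.
import json

MIN_KARMA = 3

def collect_outbound_links(path: str) -> set[str]:
    """Keep only URLs from submissions that earned at least MIN_KARMA."""
    keep = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            url = post.get("url", "")
            # Skip self-posts and links that point back to Reddit itself
            if post.get("score", 0) >= MIN_KARMA and url and "reddit.com" not in url:
                keep.add(url)
    return keep

if __name__ == "__main__":
    links = collect_outbound_links("reddit_submissions.jsonl")
    print(f"{len(links)} candidate pages to scrape")
```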

This dataset was called “WebText”, and it was the entire training corpus for GPT-2. No Wikipedia. No books. No Reddit comments. Just the WebText dataset: roughly 8 billion words from about 8 million web pages.

GPT-3 Training Data

Fast forward to GPT-3, the foundation model family that ChatGPT was later built on. OpenAI wanted to expand its training data to build a more knowledgeable large language model.

There were a total of 5 datasets used to train GPT-3.

  • WebText2
  • Common Crawl (filtered)
  • Books1
  • Books2
  • Wikipedia

WebText2

First, OpenAI included an expanded version of the WebText dataset called “WebText2” by scraping links from Reddit over a longer period of time.

“The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 Karma. In the second version, WebText2, we added outbound Reddit links from the period of January to October 2018, also with a minimum of 3 Karma.”

Source: Scaling Laws for Neural Language Models

As you can see, WebText2 used the same process: only web pages linked from Reddit posts with at least 3 Karma made the cut.

Common Crawl (filtered)

Next, they wanted to use more web pages from Common Crawl, but just those that were higher quality. Like we discussed earlier, the web is filled with lots of junk.

So this time, instead of only keeping articles linked from high-Karma Reddit posts, they built a system to automatically classify articles as “good” or “bad”.

“Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality.”

Source: Language Models are Few-Shot Learners

Basically, OpenAI used the WebText dataset as an example of “good” articles. Then they trained an AI classifier to look for patterns and characteristics that the “good” articles had in common. For example, these could be things like…

  • Written with clear, coherent sentences that flow naturally
  • Stays on topic throughout the article
  • Proper grammar and punctuation
  • Includes things like well-reasoned arguments, storytelling, explanations
  • Includes quotes from experts
  • Claims backed up with sources and citations

They could then run this AI classifier across the entire Common Crawl index of the web to identify even more high quality articles that had the same patterns and characteristics as the initial WebText dataset.

This process, using WebText as the example of “good” content and pulling similar high-quality pages out of the Common Crawl index, produced the filtered Common Crawl dataset.
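To give a rough idea of how a classifier like this can be built, here’s a simplified sketch using scikit-learn. The GPT-3 paper describes a logistic-regression classifier over hashed token features, but the tiny placeholder corpora, feature settings, and simple 0.5 score cutoff below are illustrative assumptions, not OpenAI’s exact setup.

```python
# Rough sketch of a WebText-vs-raw-crawl quality classifier using scikit-learn.
# The placeholder documents, feature settings, and 0.5 cutoff are illustrative.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Label 1 = pages from the curated WebText set, label 0 = raw Common Crawl pages
webtext_docs = [
    "A clearly written article that stays on topic and cites its sources.",
    "An explainer with well-reasoned arguments and quotes from experts.",
]
common_crawl_docs = [
    "buy cheap pills best price click here discount offer now",
    "lorem ipsum page under construction welcome to my homepage",
]

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(webtext_docs + common_crawl_docs)
y = [1] * len(webtext_docs) + [0] * len(common_crawl_docs)

clf = LogisticRegression(max_iter=1000).fit(X, y)

def quality_score(doc: str) -> float:
    """Probability that a page 'looks like' WebText-quality content."""
    return clf.predict_proba(vectorizer.transform([doc]))[0, 1]

# In practice you would score every page in the full Common Crawl index and
# keep (or re-sample) the pages the classifier rates as higher quality.
keep = [doc for doc in common_crawl_docs if quality_score(doc) > 0.5]
```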

Books1 & Books2

OpenAI used two datasets that consisted almost entirely of books.

  • Books1 – widely believed to be BookCorpus (also called the Toronto Book Corpus), which contains over 7,000 unpublished novels scraped from the web, primarily from the self-publishing site Smashwords.
  • Books2 – a bit more mysterious since OpenAI has never officially disclosed its source. The leading theory is that it’s derived from a copyrighted library of 4-5 million books.

Wikipedia

Lastly, OpenAI used the entire English-language Wikipedia as the fifth and final dataset for training GPT-3.

Weighting the Datasets

So GPT-3 now had 5 datasets to use in the training process.

  • WebText2
  • Common Crawl (filtered)
  • Books1
  • Books2
  • Wikipedia

Okay, time for the important part: these datasets were not treated equally. Some were sampled far more heavily during training than others.

In the world of AI there’s a term called “epoch”, which is essentially the number of times the model reads through a dataset during training.

1 epoch = the AI read through all the content in the dataset 1 time

The more times the dataset is exposed to the AI model, the more ingrained that dataset will be in the LLM’s knowledge. 

Our brains work the same way. If you read an article once, you’ll have a decent understanding of it. The more times you reread it, the more knowledgeable you’ll become and the easier it will be to recite.

So in this case, the WebText2 dataset (articles linked from Reddit posts with 3+ Karma) was read by the LLM nearly 3 times, whereas only about 44% of the filtered Common Crawl dataset was ever seen by the LLM at all.
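To make the weighting concrete, here’s a small sketch of how a sampling mix translates into epochs. The dataset sizes and mix weights are approximate figures from the GPT-3 paper’s dataset table, and the arithmetic is a simplification of how training batches were actually assembled.

```python
# Approximate dataset sizes and sampling weights from the GPT-3 paper
# ("Language Models are Few-Shot Learners"); treat the numbers as rough.
TOTAL_TRAINING_TOKENS = 300e9  # tokens processed during GPT-3 training

datasets = {
    # name:                   (tokens in dataset, share of training mix)
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2":                (19e9,  0.22),
    "Books1":                  (12e9,  0.08),
    "Books2":                  (55e9,  0.08),
    "Wikipedia":               (3e9,   0.03),
}

for name, (size, weight) in datasets.items():
    tokens_seen = TOTAL_TRAINING_TOKENS * weight
    epochs = tokens_seen / size  # how many times the dataset is read end to end
    print(f"{name:<24} ~{epochs:.2f} epochs")

# Roughly: the filtered Common Crawl is only partially seen (~0.4 epochs),
# while the much smaller WebText2 and Wikipedia sets are read ~3 times each.
```

In other words, a page in WebText2 or Wikipedia is shown to the model far more often, word for word, than a page that only exists in the raw crawl.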

Takeaways

The datasets and weights outlined in GPT-3’s training documentation make it clear that LLMs prioritize some datasets over others. Getting positively mentioned in a heavily weighted dataset will have a bigger impact on AI recommending your brand than a mention in a lightly weighted one.

When building brand mentions, focus on quality as well as quantity. Here are some examples of the more impactful places to get your brand mentioned.

  • Wikipedia – Relative to its size, the most heavily sampled dataset in GPT-3’s training mix, and likely still heavily weighted for the models behind ChatGPT.
  • OpenAI Partners – OpenAI has established many partnerships with online publications and media companies. Content from these publishers is likely weighted more heavily than the average blog, so being mentioned there is likely more impactful.
  • Other high-quality articles – It’s not completely clear how OpenAI’s quality classifier works, but a mention in a quality article on a reputable website will likely be more impactful than one in a low-quality article on a spammy site.