In part one of this series we looked at some of the nuances related to automatically producing coherent casino reviews using language models, specifically GPT-2. We learned a little bit of jargon, some foundational concepts and several practical considerations when using transformer models to produce coherent casino articles. If you’ve not yet read part one, you can find it online at igbaffiliate.com.
I’m delighted to say that the big-picture implications of GPT language models weren’t lost on some of you. Several readers raised concerns, in particular about the following paragraph:
“The model is also capable of generating fully formed HTML or Markdown. What’s more, by training/tuning your model using scraped content from the dominant casino affiliate in the space, it’s possible to use some simple pre-processing to learn casino reviews, including the internal and external link structures. Yes, you read that right… no more guessing what the optimal cross-linking strategy looks like, simply train the GPT-2 model to learn where to put the links.”
Before we cover any more specifics on the practical steps required to automatically produce casino reviews, let’s look at the ethical issues in more detail.
ETHICS AND THE AI PARADIGM
Artificial intelligence is a game changer on many levels. As it becomes more pervasive, it raises some curious questions about copyright and ownership of content. In particular, it disrupts the notion of original content.
Take text content, for example. When we write an original piece of text, we base it on what we’ve learned from experience, on the topics we’ve researched, and even on how we learned to form coherent language in the first place. From the structure of the article to the concepts and topics it covers, everything, in all cases with zero exceptions, found its way into your brain via your sensory inputs, whether from educators, books, YouTube, websites or personal experience. For concepts to reach our minds, they must first pass through our biological senses, the biological process by which we ingest information.
All of which raises the question: who owns the copyright to AI-generated content? The naive argument would be to claim that the copyright belongs to the owner of the training data, but that argument quickly collapses when you take into account that the published pre-trained GPT-2 models were trained on a significant subset of the internet.
The original OpenAI blog post from February 2019, ‘Better Language Models and their Implications’, explains how the training data for GPT-2 was curated:
“[We] used outbound links from Reddit which received at least three karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as Common Crawl.”
To add further ambiguity, it’s worth noting this, from CommonCrawl.org:
“The Common Crawl corpus contains petabytes of data collected over eight years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.”
If your casino affiliate site has ever earned a three-karma-points-or-higher Reddit link, it’s safe to assume that all the pre-trained GPT-2 and GPT-3 models have already been ‘digitally inspired’ by your site.
If you use any of the smart grammar tools, including but not limited to those in Grammarly, Gmail and LinkedIn, you’re already using a derivative of one or more web-scale datasets.
While scraping a focused casino industry dataset for fine-tuning GPT-2, I did encounter one leading casino affiliate who had configured their Cloudflare DNS protection such that the site became too challenging to crawl from a single IP, so I eventually moved on to lower-hanging fruit.
My point here isn’t to advocate content theft, but rather to present the new paradigm for consideration. NLP language models are not going away; they will ultimately replace humans.
When we publish content on the web, we put it out for the world to read, for search engines to crawl (scrape), index and rank, and for Facebook, Twitter and others to crawl, extract and cache elements such as the title, description and image. Who knows what Facebook’s AI research team does with that text?
TRAINING VS TUNING – PRIVATE AND PUBLIC DATA SETS
GPT-2 was a huge leap in terms of model size, which makes training such a model from scratch cost-prohibitive, since it requires tens of GPU years. To put this in perspective, one GPU year is the work of a single GPU running non-stop for a year or, equivalently, 365 GPUs running for 24 hours.
As such, only well-funded research labs such as OpenAI, Google Research, Facebook AI Research and Microsoft Research are equipped to produce such state-of-the-art models.
In part one, I briefly mentioned that the most advanced model right now is GPT-3, which is roughly two orders of magnitude larger still.
GPT-3, however, is not practical for most business users. Training the largest GPT-3 model, with 175 billion parameters, has been estimated to take 355 GPU years at a cost of $4.6m, even with the lowest-priced GPU cloud on the market.
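To make the GPU-year arithmetic concrete, here is a minimal sketch that converts the figures quoted above into GPU hours and an implied hourly price. The function names are illustrative only, and the hourly rate is simply derived from the two quoted numbers, not taken from any price list.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def gpu_years_to_gpu_hours(gpu_years: float) -> float:
    """Convert GPU years into GPU hours (1 GPU year = 365 * 24 GPU hours)."""
    return gpu_years * DAYS_PER_YEAR * HOURS_PER_DAY


def implied_hourly_rate(total_cost_usd: float, gpu_years: float) -> float:
    """Price per GPU hour implied by a total cost and a GPU-year budget."""
    return total_cost_usd / gpu_years_to_gpu_hours(gpu_years)


def wall_clock_days(gpu_years: float, num_gpus: int) -> float:
    """Days of wall-clock time if the work is spread evenly over num_gpus."""
    return gpu_years * DAYS_PER_YEAR / num_gpus


# Figures quoted in the article for the largest GPT-3 model.
gpt3_gpu_years = 355
gpt3_cost_usd = 4_600_000

print(f"GPU hours: {gpu_years_to_gpu_hours(gpt3_gpu_years):,.0f}")
print(f"Implied USD per GPU hour: {implied_hourly_rate(gpt3_cost_usd, gpt3_gpu_years):.2f}")
print(f"Days on 1,000 GPUs: {wall_clock_days(gpt3_gpu_years, 1000):.1f}")
```

The last line assumes perfectly even scaling across GPUs, which real training runs do not achieve, so treat it as a lower bound on wall-clock time.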
Decoding and producing text with a pre-trained GPT-3 model requires a cluster, since the model is too large to hold in memory on a single machine. In testing, I found that GPT-2 was sufficient to achieve my goals and manageable on a CPU-only, 32-core Xeon server.
It’s still worth keeping this step change from GPU years to GPU centuries in mind for three reasons.
1. Price performance for compute and storage continues to follow a double exponential rate of improvement, as per Kurzweil’s ‘Law of Accelerating Returns’.
2. The AI economy is essentially owned and operated by a Big Tech oligarchy.
3. In practical terms, a modest-sized GPT-2 implementation works well and I was able to retune the medium-sized (355 million parameter) model on a 32-core server in around a week.
I hope that’s helped reframe some of the ethical questions around this paradigm. I’m now skipping back to the practical training process for the remainder of this part of the article.
Even after only 24 hours of tuning with casino reviews, the sample output from each training batch was quickly adapting to the casino domain, but it would run into problems with games and software, referencing Xbox and PlayStation along with their respective titles rather than game developers and slot titles.
After 72 hours of training, most of these issues had been ironed out. However, when it came to outputting text, it would still run into the occasional loop or repetition, as illustrated in this example:
— sample start —
Restricted Countries and Territories
Players from the United States, United Kingdom, France, Denmark, Spain, Spain, Italy, Belgium, Netherlands, Hungary, Turkey, Hungary, Austria, Turkey, Hungary, Austria, Turkey, Hungary, Austria and Ukraine are not permitted to play at the Casino.
— sample end —
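Loops like the one in the sample above are easy to flag automatically before a review is published. The following is a minimal sketch rather than the approach used in this project: it assumes the list is separated by commas and the word "and", and the function name repeated_items is hypothetical.

```python
import re
from collections import Counter
from typing import Dict


def repeated_items(sentence: str) -> Dict[str, int]:
    """Return every list item that appears more than once, with its count."""
    # Split on commas and on the standalone word "and" (so "Netherlands" is safe).
    parts = re.split(r",|\band\b", sentence)
    items = [part.strip() for part in parts if part.strip()]
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n > 1}


sample = (
    "Players from the United States, United Kingdom, France, Denmark, Spain, "
    "Spain, Italy, Belgium, Netherlands, Hungary, Turkey, Hungary, Austria, "
    "Turkey, Hungary, Austria, Turkey, Hungary, Austria and Ukraine are not "
    "permitted to play at the Casino."
)

# Any non-empty result suggests the paragraph should be regenerated or edited.
print(repeated_items(sample))
```

A check like this only catches exact repeats within a single list, so it complements longer training rather than replacing it.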
While occurrences of this problem largely disappeared after seven days of training, another obstacle refused to budge. Even though a fine-tuned 355 million parameter GPT-2 model can produce very coherent text, there remains a big problem: facts tend to be wrong.
Below is an example of a paragraph produced after 96 hours of training:
— sample start —
Live Casino Games
Players can play at (NAME REDACTED) which is an online casino where players can play slots, table games and live casino games. The live casino features live games hosted by friendly dealers which include Live Roulette, Live Blackjack and Live Baccarat.
— sample end —
The problem here, for those who are still paying attention, is that (NAME REDACTED) is a games developer rather than a casino brand. Furthermore, GPT-2 has assumed this to be a live casino brand offering live roulette, live blackjack and live baccarat.
This problem required me to rethink the training process. By training on pre-processed paragraphs with the brand names and games replaced by tags (the brand names with a tag; live games with , , ), we get the benefit of coherent natural language, with the flexibility of reinserting the correct facts back into the text as a post-processing step. In this way the pre-processed training data appears as follows:
— sample start —
Live Casino Games
Players can play at Casino which is an online casino where players can play slots, table games and live casino games. The live casino features live games hosted by friendly dealers which include , and .
— sample end —
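The pre-processing step described above could be sketched as follows. This is an illustration under stated assumptions, not the exact pipeline used here: the placeholder tags [BRAND] and [LIVE_GAME], the function name mask_entities and the brand "Acme Casino" are all hypothetical.

```python
import re
from typing import Iterable


def mask_entities(
    text: str,
    brand: str,
    live_games: Iterable[str],
    brand_tag: str = "[BRAND]",   # hypothetical placeholder tag
    game_tag: str = "[LIVE_GAME]",  # hypothetical placeholder tag
) -> str:
    """Replace a known brand name and known live game names with tags."""
    # A callable replacement avoids any special handling of backslashes in the tag.
    masked = re.sub(re.escape(brand), lambda _m: brand_tag, text, flags=re.IGNORECASE)
    # Replace longer names first so a short name never clips a longer one.
    for game in sorted(live_games, key=len, reverse=True):
        masked = re.sub(re.escape(game), lambda _m: game_tag, masked, flags=re.IGNORECASE)
    return masked


paragraph = (
    "Players can play at Acme Casino, which features Live Roulette, "
    "Live Blackjack and Live Baccarat."
)
print(mask_entities(paragraph, "Acme Casino",
                    ["Live Roulette", "Live Blackjack", "Live Baccarat"]))
```

Matching against known names from a database keeps the masking predictable; names that are not in the database will simply pass through untouched and need a separate check.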
Once tuned on training data following this pattern, we’re able to replace the tags as a post-processing step and insert the correct brand name, along with three of the live games from a database populated with accurate casino data comprising brand name, website address, customer support contact details, software providers, withdrawal and deposit methods, etc.
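The post-processing step can be sketched in the same spirit. The example below assumes hypothetical placeholder tags [BRAND] and [LIVE_GAME] and a database record represented as a plain dictionary with "brand" and "live_games" keys; none of these names come from the original project.

```python
import random
from typing import Dict, Optional


def fill_template(
    text: str,
    record: Dict,
    rng: Optional[random.Random] = None,
    brand_tag: str = "[BRAND]",   # hypothetical placeholder tag
    game_tag: str = "[LIVE_GAME]",  # hypothetical placeholder tag
) -> str:
    """Insert the brand and distinct live games from a record into tagged text."""
    rng = rng or random.Random()
    slots = text.count(game_tag)
    if slots > len(record["live_games"]):
        raise ValueError("record has fewer live games than the text has slots")
    # Pick distinct games so the same title is never inserted twice.
    chosen = rng.sample(record["live_games"], slots)
    filled = text.replace(brand_tag, record["brand"])
    for game in chosen:
        filled = filled.replace(game_tag, game, 1)  # fill one slot at a time
    return filled


record = {
    "brand": "Acme Casino",  # illustrative data only
    "live_games": ["Live Roulette", "Live Blackjack", "Live Baccarat"],
}
generated = "Players can play at [BRAND], which offers [LIVE_GAME], [LIVE_GAME] and [LIVE_GAME]."
print(fill_template(generated, record, rng=random.Random(0)))
```

Raising an error when the record has too few games is a deliberate choice: it is safer to reject a paragraph than to publish one with an unfilled tag or a repeated title.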
I’ll cover this in more detail in the final part, but I wanted to illustrate the flexibility of GPT and some simple workarounds.
In the next iGB Affiliate I’ll present the final part of this three-part series. I’ll look at pre-processing, post-processing and some excellent learning resources which I’ve found most useful along the way.