In the second part of his series on using GPT-2 models to generate online casino reviews, Paul Reilly explores some of the practicalities of the training process. He also responds to ethical concerns raised by some readers regarding comments he made in part one
In part one of this series we looked at some of the nuances related to automatically producing coherent casino reviews using language models, specifically GPT-2. We learned a little bit of jargon, some foundational concepts and several practical considerations when using transformer models to produce coherent casino articles. If you’ve not yet read part one, you can find it online at igbaffiliate.com.

I’m delighted to say that the big picture implications of GPT language models weren’t lost on some of you. Several readers highlighted concerns, in particular about the following paragraph:

“The model is also capable of generating fully formed HTML or Markdown. What’s more, by training/tuning your model using scraped content from the dominant casino affiliate in the space, it’s possible to use some simple pre-processing to learn casino reviews, including the internal and external link structures. Yes, you read that right… no more guessing what the optimal cross-linking strategy looks like, simply train the GPT-2 model to learn where to put the links.”

Before we cover any more specifics on the practical steps required to automatically produce casino reviews, let’s look at the ethical issues in more detail.
ETHICS AND THE AI PARADIGM
Artificial intelligence is a game changer on many levels, and as it becomes more pervasive it raises some curious questions about copyright and ownership of content. What this means, of course, is that the very notion of original content is being disrupted.
Take text content, for example: when we write an original piece of text, we base it on what we’ve learned – from experience, from researched topics, even from the formation of coherent language itself. From the structure of the article to the concepts and topics it covers, everything – in all cases, with zero exceptions – found its way into your brain via your sensory inputs: educators, books, YouTube, websites, personal experience. For concepts to reach our minds, they must first pass through our biological senses; ingesting information is, at its root, a biological process.
All of which raises the question: who owns the copyright to AI-generated content? The naive argument would be to claim that the copyright belongs to the owner of the training data – an argument which quickly collapses when you take into account that the published pre-trained GPT-2 models are trained on a significant subset of the internet.
The original OpenAI blog post from February 2019, ‘Better Language Models and Their Implications’, explains how the training data for GPT-2 was curated:
“[We] used outbound links from Reddit which received at least three karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as Common Crawl.”
To add further ambiguity, it’s worth noting this, from CommonCrawl.org:
“The Common Crawl corpus contains petabytes of data collected over eight years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.”
If your casino affiliate site has ever earned a three-karma-points-or-higher Reddit link, it’s safe to assume that all the pre-trained GPT-2 and GPT-3 models have already been ‘digitally inspired’ by your site.
If you use any of the smart grammar and writing tools – Grammarly, Gmail, LinkedIn and the like – you’re already using a derivative of one or more web-scale datasets.
While scraping a focused casino industry dataset for fine-tuning the GPT model, I did encounter one leading casino affiliate who had configured their Cloudflare protection such that the site became too challenging to crawl from a single IP, so I eventually moved on to lower-hanging fruit.
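As an aside on the mechanics: the ‘simple pre-processing’ mentioned in the quoted paragraph earlier is not especially exotic. Below is a minimal sketch of one way scraped HTML pages could be flattened into Markdown training text with their link structure intact – purely illustrative, and assuming the markdownify library, a local folder of saved pages and a train.txt output file, rather than describing the exact pipeline I used.

```python
# Sketch only: flatten scraped HTML reviews into Markdown training text,
# keeping <a href="...">anchor</a> links as [anchor](url) so the model can
# learn where links tend to appear. markdownify, the folder layout and the
# train.txt output are illustrative assumptions.
from pathlib import Path

from markdownify import ATX, markdownify as md


def html_to_training_text(html: str) -> str:
    # Convert HTML to Markdown; headings become '#'-style, links are preserved.
    return md(html, heading_style=ATX)


documents = []
for page in sorted(Path("scraped_reviews").glob("*.html")):
    documents.append(html_to_training_text(page.read_text(encoding="utf-8")))

# Separate documents with GPT-2's end-of-text token so the model learns
# where one review ends and the next begins.
Path("train.txt").write_text("\n<|endoftext|>\n".join(documents), encoding="utf-8")
```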
My point here isn’t to advocate content theft, but rather to present the new paradigm for consideration. NLP language models are not going away; they will ultimately replace humans.
When we publish content on the web, we put it out for the world to read, for search engines to crawl (scrape), index and rank, and for Facebook, Twitter and others to crawl, extract and cache elements such as title, description and image. Who knows what Facebook’s AI research team does with that text.
TRAINING VS TUNING – PRIVATE AND PUBLIC DATA SETS
GPT-2 was a huge leap in terms of model size, which makes training such a model from scratch cost-prohibitive: it requires tens of GPU years. To put this in perspective, one GPU year is the equivalent of a single GPU running continuously for a year – or 365 GPUs running for 24 hours. As such, only well-funded research labs such as OpenAI, Google Research, Facebook AI Research and Microsoft Research are equipped to produce such state-of-the-art models.
In part one, I briefly mentioned that the most advanced model right now is GPT-3, which is larger still by around two orders of magnitude.
So, to recap: GPT-3 is not practical for most business users. The largest GPT-3 model, with 175 billion parameters, took an estimated 355 GPU years to train, at a cost of around $4.6m even with the lowest-priced GPU cloud on the market.
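If you want to sanity-check those figures, the back-of-the-envelope arithmetic is straightforward; the per-GPU-hour rate below is simply what the two quoted numbers imply, not a published price.

```python
# Rough arithmetic behind the quoted GPT-3 training figures.
gpu_years = 355          # reported training compute for the 175bn-parameter model
total_cost_usd = 4.6e6   # reported low-end cloud cost estimate

gpu_hours = gpu_years * 365 * 24              # one GPU year = one GPU for a full year
print(f"{gpu_hours:,.0f} GPU hours")          # ~3.1 million GPU hours
print(f"${total_cost_usd / gpu_hours:.2f} per GPU hour (implied)")  # ~$1.48
```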
Even generating (decoding) text with a pre-trained GPT-3 model requires a cluster, since the model is too large to hold in memory on a single machine. In testing, I found that GPT-2 was sufficient to achieve my goals and manageable on a 32-core Xeon server (CPU only).
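For a sense of what CPU-only generation with GPT-2 involves, here is a minimal sketch using the Hugging Face transformers library. The library choice, model size, prompt and sampling settings are illustrative assumptions rather than my exact setup.

```python
# Sketch: CPU-only text generation with a pre-trained GPT-2 model via the
# Hugging Face transformers library (an assumed tooling choice).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")  # 355m-parameter model
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")    # loads on CPU by default

prompt = "Casino X welcomes new players with"             # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,                        # sample rather than decode greedily
    top_p=0.9,                             # nucleus sampling
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,   # avoids a missing-pad-token warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```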
It’s still worth keeping this step change from GPU years to GPU centuries in mind, for three reasons:
1. Price performance for compute and storage continues to follow a double exponential rate of improvement, as per Kurzweil’s ‘Law of Accelerating Returns’.
2. The AI economy is essentially owned and operated by a Big Tech oligarchy.
3. In practical terms, a modest-sized GPT-2 implementation works well: I was able to retune the medium-sized (355 million parameter) model on a 32-core server in around a week. A sketch of that retuning step follows below.
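The retuning step itself needs surprisingly little code. The sketch below uses gpt-2-simple, a popular open-source wrapper around the original GPT-2 release, as one plausible way to fine-tune the 355M model on a prepared text file; the library, filename and step count are assumptions for illustration rather than my exact toolchain.

```python
# Sketch: fine-tuning ("retuning") the 355M GPT-2 model on a prepared corpus.
# gpt-2-simple is an assumed tool choice; train.txt and the step count are illustrative.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="355M")   # fetch the pre-trained 355M checkpoint

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="train.txt",                # corpus produced by the pre-processing step
    model_name="355M",
    steps=2000,                         # illustrative; stop early if the loss plateaus
    sample_every=200,                   # print a sample periodically to eyeball progress
    save_every=500,                     # write checkpoints as training runs
)

# Generate from the fine-tuned checkpoint.
print(gpt2.generate(sess, prefix="Casino X review:", length=300, return_as_list=True)[0])
```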
I hope the discussion above has helped reframe some of the ethical questions around this paradigm. For the remainder of this part of the article, let’s return to the practical training process.