


TRANSFORMER

1. Introduction

  • Machine learning (ML), a subfield of artificial intelligence, teaches computers to solve tasks involving structured data, language, audio, or images by providing examples of inputs and the desired outputs.
  • This is different from traditional computer programming, where programmers write a sequence of specific instructions.
  • Here, the ML model learns to generate desirable outputs by adjusting its many knobs, often numbering in the millions.
  • ML has a history of developing methods with handcrafted features that may work only for specific, narrow problems.
  • There are several such examples. In text, classifying a document as scientific or literary may be solved by counting the number of times certain words appear.
  • In audio, spoken text is recognised by converting the audio into a time-frequency representation.
  • In images, a car may be found by checking for the existence of specific car-like edge and shape patterns.
  • Such handcrafted features are combined with simple, or shallow, learning classifiers that typically have up to tens of thousands of knobs. In technical parlance, these knobs are called parameters.
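The word-counting example above can be sketched as a tiny program. This is a minimal illustration of the classic pipeline of handcrafted features plus a shallow classifier; the feature words, weights, and example documents are invented for illustration, and a real shallow classifier would learn its weights from labelled examples rather than have them hand-set.

```python
# Handcrafted features: counts of a few hand-picked words.
FEATURE_WORDS = ["experiment", "hypothesis", "poem", "metaphor"]

# One knob (parameter) per feature word, hand-set here for illustration.
WEIGHTS = {"experiment": 1.0, "hypothesis": 1.0, "poem": -1.0, "metaphor": -1.0}

def extract_features(document: str) -> dict:
    """Count how often each feature word appears in the document."""
    tokens = document.lower().split()
    return {w: tokens.count(w) for w in FEATURE_WORDS}

def classify(document: str) -> str:
    """Weighted sum of counts; positive score -> scientific, else literary."""
    counts = extract_features(document)
    score = sum(WEIGHTS[w] * counts[w] for w in FEATURE_WORDS)
    return "scientific" if score > 0 else "literary"

print(classify("the experiment tested a bold hypothesis"))  # scientific
print(classify("a poem built on a single metaphor"))        # literary
```

Note how narrow this is: the classifier only works for this one problem, and a new task would need a new set of handcrafted features.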

2. Deep neural networks

  • In the first half of the 2010s, deep neural networks (DNNs) took ML by storm, replacing the classic pipeline of handcrafted features and simple classifiers.
  • DNNs ingest a complete document or image and generate a final output, without the need to specify a particular way of extracting features.
  • While such deep and large models had existed in the past, their large size and millions of parameters hindered their use.
  • The resurgence of DNNs in the 2010s is attributed to the availability of large-scale data and fast parallel computing chips called graphics processing units.
  • Furthermore, the models used for text and images were still different: recurrent neural networks were popular in language understanding, while convolutional neural networks (CNNs) were popular in computer vision, that is, machine understanding of the visual world.

3. Attention Is All You Need

  • In a pioneering paper entitled "Attention Is All You Need" that appeared in 2017, a team at Google proposed transformers, a DNN architecture that has today gained popularity across all modalities (image, audio, and language).
  • The original paper proposed transformers for the task of translating a sentence from one language to another, similar to what Google Translate does when converting a sentence from, say, English to Hindi.
  • A transformer is a two-part neural network. The first part is an "encoder" that ingests the input sentence in the source language (English), and the second part is a "decoder" that generates the translated sentence in the target language (Hindi).
  • The encoder converts each word in the source sentence to an abstract numerical form that captures the meaning of the word within the context of the sentence and stores it in a memory bank.
  • Just as a person would write or speak, the decoder generates one word at a time, referring to what has been generated so far and looking back at the memory bank to find the appropriate word.
  • Both these processes use a mechanism called "attention", hence the name of the paper.
  • A key improvement over previous methods is the ability of a transformer to translate long sentences or paragraphs correctly.
  • The adoption of transformers subsequently exploded. The capital "T" in ChatGPT, for example, stands for "transformer".
  • Transformers have also become popular in computer vision, as they simply cut an image into small square patches and line them up, just like words in a sentence.
  • By doing so, and after training on large amounts of data, a transformer can provide better results than CNNs.
  • Today, transformer models constitute the best approach for image classification, object detection and segmentation, action recognition and a host of other tasks.
  • Transformers' ability to ingest anything has been exploited to create joint vision-and-language models, which allow users to search for an image, describe one, and even answer questions about the image.
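The "cut an image into patches and line them up" idea above can be shown with a toy example. This is a sketch only: real vision transformers work on pixel arrays with learned projections, while the grid, patch size, and values here are invented for illustration.

```python
def image_to_patches(image, patch):
    """Split an H x W grid (list of lists) into patch x patch blocks,
    reading left to right, top to bottom, and flatten each block into
    one 'token', just like a word in a sentence."""
    h, w = len(image), len(image[0])
    sequence = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[top + r][left + c]
                     for r in range(patch) for c in range(patch)]
            sequence.append(block)
    return sequence

# A 4x4 "image" split into 2x2 patches becomes a sequence of 4 tokens.
image = [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]
patches = image_to_patches(image, 2)
print(len(patches))   # 4
print(patches[0])     # [0, 1, 4, 5]
```

Once the image is a sequence of tokens, the very same transformer machinery used for sentences can process it.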

4. Attention

  • Attention in ML allows a model to learn how much importance should be given to different inputs.
  • In the translation example, attention allows the model to select or weigh words from the memory bank when deciding which word to generate next.
  • While describing an image, attention allows models to look at the relevant parts of the image when generating the next word.
  • A fascinating aspect of attention-based models is their capacity for self-discovery by parsing large amounts of data.
  • In the translation case, the model is never told that the word "dog" in English means "kutta" in Hindi.
  • Instead, it finds this association by seeing several training sentence pairs where "dog" and "kutta" appear together.
  • A similar observation applies to image captioning. For an image of a "bird flying above water", the model is never told which region of the image corresponds to "bird" and which to "water".
  • Instead, by training on several image-caption pairs containing the word "bird", it discovers common patterns in the images and associates the flying object with "bird".
  • Transformers are attention models on steroids. They feature several attention layers: within the encoder, to provide meaningful context across the input sentence or image, and from the decoder to the encoder, when generating a translated sentence or describing an image.
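The weighting described above can be sketched in a few lines. This is a minimal illustration of dot-product attention with softmax normalisation, one common formulation; the query and memory-bank vectors are invented toy values, not learned representations.

```python
import math

def attention_weights(query, memory_bank):
    """Score each memory-bank vector by its dot product with the query,
    then softmax-normalise the scores into weights that sum to 1."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in memory_bank]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory_bank):
    """The attended 'context': a weighted sum of memory-bank vectors."""
    weights = attention_weights(query, memory_bank)
    dim = len(memory_bank[0])
    return [sum(w * vec[i] for w, vec in zip(weights, memory_bank))
            for i in range(dim)]

# Memory bank for a 3-word source sentence; the query is most similar
# to the second entry, so that word gets the largest weight.
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]
weights = attention_weights(query, memory)
print(max(range(3), key=lambda i: weights[i]))  # 1
```

The key point is that nothing tells the model which entry matters: the weights fall out of similarity scores, and training shapes the vectors so that useful entries score highly.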

5. The billion and trillion scale

  • In the last few years, transformer models have become larger and are trained on more data than before.
  • When these colossal models train on written text, they are called large language models (LLMs).
  • ChatGPT uses hundreds of billions of parameters, while GPT-4 is reported to be even larger.
  • While these models are trained on simple tasks, such as filling in blanks or predicting the next word, they are very good at answering questions, creating stories, summarising documents, writing code, and even solving mathematical word problems step by step.
  • Transformers are also the bedrock of generative models that create realistic images and audio.
  • Their utility in diverse domains makes transformers a very powerful and universal model.
  • However, there are some concerns. The scientific community is yet to figure out how to evaluate these models rigorously.
  • There are also instances of "hallucination", whereby models make confident but wrong claims. We must urgently address societal concerns, such as data privacy and attribution for creative work, that arise as a result of their use.
  • At the same time, given the tremendous progress, ongoing efforts to create guardrails guiding their use, and work on leveraging these models for positive outcomes (for example, in health care, education, and agriculture), optimism would not be misplaced.
For Prelims: Machine learning, deep neural networks, convolutional neural networks, ChatGPT, large language models
For Mains: 
1. What is a transformer? Discuss the machine learning model that powers ChatGPT. (250 Words)
 
 
Previous Year Questions
 
1. With the present state of development, Artificial Intelligence can effectively do which of the following?  (UPSC 2020) 
1. Bring down electricity consumption in industrial units.
2. Create meaningful short stories and songs.
3. Disease diagnosis.
4. Text-to-Speech Conversion.
5. Wireless transmission of electrical energy.
Select the correct answer using the code given below:
A. 1, 2, 3 and 5 only         B. 1, 3 and 4 only          C. 2, 4 and 5 only     D.  1, 2, 3, 4 and 5
 
Answer: D
 
 Source: The Hindu
 
 
