Promptarea avansată

Până în acest punct, ar trebui să fie evident că ajută să îmbunătățești prompturile pentru a obține rezultate mai bune pe sarcini diferite. Aceasta este întreaga idee din spatele ingineriei prompturilor.

Deși acele exemple au fost distractive, să acoperim câteva concepte mai formal înainte de a sări în concepte mai avansate.

Subiecte:

Promptarea Zero-shot
Promptarea cu câteva exemple
Promptarea Chain-of-Thought
Zero-shot CoT
Auto-consistența
Promptarea cu cunoștințe generate
Inginerul automat de prompturi

Promptarea Zero-shot

LLM-urile de astăzi antrenate pe cantități mari de date și ajustate să urmeze instrucțiuni, sunt capabile să execute sarcini zero-shot. Am încercat câteva exemple zero-shot în secțiunea anterioară. Iată unul dintre exemplele pe care le-am folosit:

Prompt:

Clasifică textul în neutru, negativ sau pozitiv.

Text: Cred că vacanța este în regulă.
Sentiment:

Ieșire:

Neutru

Reține că în promptul de mai sus nu am furnizat modelului niciun exemplu -- aceasta sunt capacitățile zero-shot la lucru. Când zero-shot nu funcționează, se recomandă să furnizezi demonstrații sau exemple în prompt. Mai jos discutăm abordarea cunoscută ca promptarea cu câteva exemple.

Promptarea cu câteva exemple

Deși modelele de limbaj mari demonstrează deja capacități remarcabile zero-shot, ele încă nu reușesc pe sarcini mai complexe când folosesc setarea zero-shot. Pentru a îmbunătăți aceasta, promptarea cu câteva exemple este folosită ca o tehnică pentru a permite învățarea în context unde furnizăm demonstrații în prompt pentru a ghidă modelul către o performanță mai bună. Demonstrațiile servesc ca condiționare pentru exemplele ulterioare unde am dori ca modelul să genereze un răspuns.

Să demonstrăm promptarea cu câteva exemple prin un exemplu care a fost prezentat de Brown et al. 2020. În exemplu, sarcina este să folosești corect un cuvânt nou într-o propoziție.

Prompt:

Un "whatpu" este un animal mic, păros, nativ din Tanzania. Un exemplu de propoziție care folosește
cuvântul whatpu este:
Călătoream în Africa și am văzut acești whatpu foarte drăguți.
A face un "farduddle" înseamnă să sari în sus și în jos foarte repede. Un exemplu de propoziție care folosește
cuvântul farduddle este:

Ieșire:

Când am câștigat jocul, am început cu toții să farduddle în sărbătoare.

Putem observa că modelul a învățat cumva cum să execute sarcina prin furnizarea unui singur exemplu (adică 1-shot). Pentru sarcini mai dificile, putem experimenta cu creșterea demonstrațiilor (de ex., 3-shot, 5-shot, 10-shot, etc.).

Urmând descoperirile de la Min et al. (2022), iată câteva sfaturi suplimentare despre demonstrații/exemplare când faci promptarea cu câteva exemple:

"spațiul de etichete și distribuția textului de intrare specificată de demonstrații sunt ambele importante (indiferent dacă etichetele sunt corecte pentru intrările individuale)"
formatul pe care îl folosești joacă de asemenea un rol cheie în performanță, chiar dacă folosești doar etichete aleatorii, aceasta este mult mai bună decât să nu ai deloc etichete.
rezultatele suplimentare arată că selectarea etichetelor aleatorii dintr-o distribuție adevărată a etichetelor (în loc de o distribuție uniformă) ajută de asemenea.

Să încercăm câteva exemple. Să încercăm mai întâi un exemplu cu etichete aleatorii (însemnând că etichetele Negative și Positive sunt atribuite aleator intrărilor):

Prompt:

Aceasta este grozavă! // Negative
Aceasta este rea! // Positive
Wow, acel film a fost grozav! // Positive
Ce spectacol oribil! //

Ieșire:

Negative

Încă obținem răspunsul corect, chiar dacă etichetele au fost randomizate. Reține că am păstrat și formatul, ceea ce ajută de asemenea. De fapt, cu experimentarea ulterioară, se pare că modelele GPT mai noi cu care experimentăm devin mai robuste chiar și la formate aleatorii. Exemplu:

Prompt:

Positive Aceasta este grozavă!
Aceasta este rea! Negative
Wow, acel film a fost grozav!
Positive
Ce spectacol oribil! --

Ieșire:

Negative

Nu există consistență în formatul de mai sus dar modelul a prezis încă eticheta corectă. Trebuie să conducem o analiză mai temeinică pentru a confirma dacă aceasta se menține pentru sarcini diferite și mai complexe, inclusiv variații diferite de prompturi.

Limitările promptării cu câteva exemple

Promptarea standard cu câteva exemple funcționează bine pentru multe sarcini dar încă nu este o tehnică perfectă, mai ales când ai de-a face cu sarcini de raționament mai complexe. Să demonstrăm de ce este cazul. Îți amintești exemplul anterior unde am furnizat următoarea sarcină:

Numerele impare din acest grup se adună la un număr par: 15, 32, 5, 13, 82, 7, 1.

A:

Dacă încercăm din nou, modelul produce următoarele:

Da, numerele impare din acest grup se adună la 107, care este un număr par.

Acesta nu este răspunsul corect, ceea ce nu doar evidențiază limitările acestor sisteme dar că există o nevoie pentru o inginerie a prompturilor mai avansată.

Să încercăm să adăugăm câteva exemple pentru a vedea dacă promptarea cu câteva exemple îmbunătățește rezultatele.

Prompt:

Numerele impare din acest grup se adună la un număr par: 4, 8, 9, 15, 12, 2, 1.
A: Răspunsul este Fals.

Numerele impare din acest grup se adună la un număr par: 17, 10, 19, 4, 8, 12, 24.
A: Răspunsul este Adevărat.

Numerele impare din acest grup se adună la un număr par: 16, 11, 14, 4, 8, 13, 24.
A: Răspunsul este Adevărat.

Numerele impare din acest grup se adună la un număr par: 17, 9, 10, 12, 13, 4, 2.
A: Răspunsul este Fals.

Numerele impare din acest grup se adună la un număr par: 15, 32, 5, 13, 82, 7, 1.
A:

Ieșire:

Răspunsul este Adevărat.

Aceasta nu a funcționat. Se pare că promptarea cu câteva exemple nu este suficientă pentru a obține răspunsuri de încredere pentru acest tip de problemă de raționament. Exemplul de mai sus furnizează informații de bază despre sarcină. Dacă te uiți mai atent, tipul de sarcină pe care l-am introdus implică câțiva pași de raționament suplimentari. Cu alte cuvinte, ar putea ajuta dacă împărțim problema în pași și demonstrăm asta modelului. Mai recent, promptarea chain-of-thought (CoT) a fost popularizată pentru a aborda sarcini de raționament aritmetic, de bun simț și simbolic mai complexe.

În general, se pare că furnizarea exemplelor este utilă pentru rezolvarea unor sarcini. Când promptarea zero-shot și promptarea cu câteva exemple nu sunt suficiente, ar putea însemna că orice a fost învățat de model nu este suficient pentru a face bine la sarcină. De aici se recomandă să începi să te gândești la fine-tuning-ul modelelor tale sau să experimentezi cu tehnici de promptare mai avansate. Următorul pas vorbim despre una dintre tehniciile populare de promptare numită promptarea chain-of-thought care a câștigat multă popularitate.

Promptarea Chain-of-Thought

Introdusă în Wei et al. (2022), promptarea chain-of-thought (CoT) permite capacități de raționament complexe prin pași de raționament intermediari. Poți să o combini cu promptarea cu câteva exemple pentru a obține rezultate mai bune pe sarcini mai complexe care necesită raționament înainte de răspuns.

Prompt:

Numerele impare din acest grup se adună la un număr par: 4, 8, 9, 15, 12, 2, 1.
A: Adunând toate numerele impare (9, 15, 1) dă 25. Răspunsul este Fals.

Numerele impare din acest grup se adună la un număr par: 17, 10, 19, 4, 8, 12, 24.
A: Adunând toate numerele impare (17, 19) dă 36. Răspunsul este Adevărat.

Numerele impare din acest grup se adună la un număr par: 16, 11, 14, 4, 8, 13, 24.
A: Adunând toate numerele impare (11, 13) dă 24. Răspunsul este Adevărat.

Numerele impare din acest grup se adună la un număr par: 17, 9, 10, 12, 13, 4, 2.
A: Adunând toate numerele impare (17, 9, 13) dă 39. Răspunsul este Fals.

Numerele impare din acest grup se adună la un număr par: 15, 32, 5, 13, 82, 7, 1.
A:

Ieșire:

Adunând toate numerele impare (15, 5, 13, 7, 1) dă 41. Răspunsul este Fals.

Wow! Putem vedea un rezultat perfect când am furnizat pasul de raționament. Putem rezolva această sarcină prin furnizarea chiar și mai puține exemple, adică doar un exemplu pare suficient:

Prompt:

Numerele impare din acest grup se adună la un număr par: 4, 8, 9, 15, 12, 2, 1.
A: Adunând toate numerele impare (9, 15, 1) dă 25. Răspunsul este Fals.

Numerele impare din acest grup se adună la un număr par: 15, 32, 5, 13, 82, 7, 1.
A:

Ieșire:

Adunând toate numerele impare (15, 5, 13, 7, 1) dă 41. Răspunsul este Fals.

Reține că autorii susțin că aceasta este o abilitate emergentă care apare cu modele de limbaj suficient de mari.

Zero-Shot CoT

O idee recentă care a apărut mai recent este ideea zero-shot CoT (Kojima et al. 2022) care în esență implică adăugarea "Să gândim pas cu pas" la promptul original. Să încercăm o problemă simplă și să vedem cum se descurcă modelul:

Prompt:

Am mers la piață și am cumpărat 10 mere. Am dat 2 mere vecinului și 2 reparatorului. Apoi am mers și am cumpărat încă 5 mere și am mâncat 1. Cu câte mere am rămas?

Ieșire:

11 mere

Răspunsul este incorect! Acum să încercăm cu promptul special.

Prompt:

Am mers la piață și am cumpărat 10 mere. Am dat 2 mere vecinului și 2 reparatorului. Apoi am mers și am cumpărat încă 5 mere și am mâncat 1. Cu câte mere am rămas?

Să gândim pas cu pas.

Ieșire:

Mai întâi, ai început cu 10 mere.
Ai dat 2 mere vecinului și 2 reparatorului, deci ai rămas cu 6 mere.
Apoi ai cumpărat încă 5 mere, deci acum ai 11 mere.
În final, ai mâncat 1 măr, deci ai rămas cu 10 mere.

Este impresionant că acest prompt simplu este eficient la această sarcină. Aceasta este deosebit de utilă unde nu ai prea multe exemple de folosit în prompt.

Auto-consistența

Poate una dintre tehniciile mai avansate pentru ingineria prompturilor este auto-consistența. Propusă de Wang et al. (2022), auto-consistența vizează "să înlocuiască decodarea naivă greedy folosită în promptarea chain-of-thought". Ideea este să eșantionezi multiple căi de raționament diverse prin few-shot CoT, și să folosești generările pentru a selecta răspunsul cel mai consistent. Aceasta ajută la îmbunătățirea performanței promptării CoT pe sarcini care implică raționament aritmetic și de bun simț.

Să încercăm următorul exemplu pentru raționamentul aritmetic:

Prompt:

When I was 6 my sister was half my age. Now
I’m 70 how old is my sister?

Output:

The output is wrong! How may we improve this with self-consistency? Let's try it out. We will use the few-shot exemplars from Wang et al. 2022 (Table 17):

Prompt:

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.

Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5

Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:

Output 1:

When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.

Output 2:

When the narrator was 6, his sister was half his age, which is 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.

Output 3:

When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35.

Computing for the final answer involves a few steps (check out the paper for the details) but for the sake of simplicity, we can see that there is already a majority answer emerging so that would essentially become the final answer.

Generated Knowledge Prompting

LLMs continue to be improved and one popular technique includes the ability to incorporate knowledge or information to help the model make more accurate predictions.

Using a similar idea, can the model also be used to generate knowledge before making a prediction? That's what is attempted in the paper by Liu et al. 2022 -- generate knowledge to be used as part of the prompt. In particular, how helpful is this for tasks such as commonsense reasoning?

Let's try a simple prompt:

Prompt:

Part of golf is trying to get a higher point total than others. Yes or No?

Output:

Yes.

This type of mistake reveals the limitations of LLMs to perform tasks that require more knowledge about the world. How do we improve this with knowledge generation?

First, we generate a few "knowledges":

Prompt:

Input: Greece is larger than mexico.
Knowledge: Greece is approximately 131,957 sq km, while Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than Greece.

Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog. Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.

Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. In many areas, such as memory, their cognitive powers match or exceed those of ’higher’ vertebrates including non-human primates. Fish’s long-term memories help them keep track of complex social relationships.

Input: A common effect of smoking lots of cigarettes in one’s lifetime is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.

Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).

Input: Part of golf is trying to get a higher point total than others.
Knowledge:

Knowledge 1:

The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.

Knowledge 2:

Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

We are using the prompt provided in the paper by Liu et al. 2022.

The next step is to integrate the knowledge and get a prediction. I reformatted the question into QA format to guide the answer format.

Prompt:

Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.

Explain and Answer:

Answer 1 (confidence very high):

No, the objective of golf is not to get a higher point total than others. Rather, the objective is to play a set of holes in the least number of strokes. The total number of strokes is used to determine the winner of the game, not the total number of points.

Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

Explain and Answer:

Answer 2 (confidence is a lot lower):

Yes, part of golf is trying to get a higher point total than others. Each player tries to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

Some really interesting things happened with this example. In the first answer, the model was very confident but in the second not so much. I simplify the process for demonstration purposes but there are a few more details to consider when arriving at the final answer. Check out the paper for more.

Automatic Prompt Engineer (APE)

Zhou et al., (2022) propose automatic prompt engineer (APE) a framework for automatic instruction generation and selection. The instruction generation problem is framed as natural language synthesis addressed as a black-box optimization problem using LLMs to generate and search over candidate solutions.

The first step involves a large language model (as an inference model) that is given output demonstrations to generate instruction candidates for a task. These candidate solutions will guide the search procedure. The instructions are executed using a target model, and then the most appropriate instruction is selected based on computed evaluation scores.

APE discovers a better zero-shot CoT prompt than the human engineered "Let's think step by step" prompt (Kojima et al., 2022).

The prompt "Let's work this out in a step by step way to be sure we have the right answer." elicits chain-of-though reasoning and improves performance on the MultiArith and GSM8K benchmarks:

This paper touches on an important topic related to prompt engineering which is the idea of automatically optimizing prompts. While we don't go deep into this topic in this guide, here are a few key papers if you are interested in the topic:

AutoPrompt - proposes an approach to automatically create prompts for a diverse set of tasks based on gradient-guided search.
Prefix Tuning - a lightweight alternative to fine-tuning that prepends a trainable continuous prefix for NLG tasks.
Prompt Tuning - proposes a mechanism for learning soft prompts through backpropagation.

Previous Section (Basic Prompting)

Next Section (Applications)

Prompt-uri adversariale

Programare

Creativitate

Evaluare

LLM-uri pentru clasificare

Generarea de imagini

Extragerea informațiilor

Descoperiri în cercetarea LLM-urilor

Matematică

Modele

Răspunsuri la întrebări

Raționament

Riscuri și utilizări greșite

Rezumate de text

Adevărul

Promptarea avansată

Promptarea Zero-shot

Promptarea cu câteva exemple

Limitările promptării cu câteva exemple

Promptarea Chain-of-Thought

Zero-Shot CoT

Auto-consistența

Generated Knowledge Prompting

Automatic Prompt Engineer (APE)

Promptarea avansată ​

Promptarea Zero-shot ​

Promptarea cu câteva exemple ​

Limitările promptării cu câteva exemple ​

Promptarea Chain-of-Thought ​

Zero-Shot CoT ​

Auto-consistența ​

Generated Knowledge Prompting ​

Automatic Prompt Engineer (APE) ​

Promptarea avansată

Promptarea Zero-shot

Promptarea cu câteva exemple

Limitările promptării cu câteva exemple

Promptarea Chain-of-Thought

Zero-Shot CoT

Auto-consistența

Generated Knowledge Prompting

Automatic Prompt Engineer (APE)