COMP3230 Principles of Operating Systems
Programming Assignment Two (Total 11 points)
Programming Exercise – Accelerate LLM Inference via Multi-Threading
Objectives
1. An assessment task related to ILO 4 [Practicability] – "demonstrate knowledge in applying system software and tools available in the modern operating system for software development".
2. A learning activity related to ILO 2.
3. The goals of this programming exercise are:
§ to gain direct practice in designing and developing multithreading programs;
§ to learn how to use the POSIX pthreads (and semaphore) libraries to create, manage, and coordinate multiple threads in a shared memory environment;
§ to design and implement synchronization schemes for multithreaded processes using semaphores, or mutex locks and condition variables.

Tasks
Optimize the matrix-vector-multiplication algorithm of GPT by multi-threading. Like other neural networks, GPT and its variants rely on matrix-vector-multiplication, also called the fully-connected/linear layer in deep learning, to apply the learned parameters; this operation accounts for more than 70% of the whole computation. Thus, to accelerate GPT and get faster responses, it is critical to make matrix-vector-multiplication faster, and multi-threading is a standard way to do so.

In this assignment, we will use an open-source variation of GPT, llama2 released by Meta, and we provide a complete pure C implementation of its inference in seq.c as the baseline of your work, along with the model weights. You need to use pthread.h with either semaphores or (mutex locks + condition variables) to implement a multi-threading version of matrix-vector-multiplication. This multi-threading version will significantly accelerate the inference of the Large Language Model.

Acknowledgement: This assignment is based on the open-source project llama2.c by Andrej Karpathy; thanks to the open-source community.

GPT-based Large Language Model
At a high level, GPT is a machine that generates words one by one based on the previous words (also known as the prompt), and Figure 1a illustrates the basic workflow of GPT generating "How are you":

Figure 1. GPT insight. a) GPT generates text one token at a time, and each output token becomes part of the input for the next generation step. b) GPT has four major components: the Tokenizer turns each word (string) into a vector, Softmax + Sample produce the next token, and each layer consists of Attention and an FFN (Feed-Forward Network), both built from many Matrix-Vector-Multiplications.

Figure 1b shows the inference workflow for each word, such as "You" in "How are you": First, words are transformed into tokens using a tokenizer, which is essentially a (Python) dictionary that assigns a unique vector to each word. The embedding vectors then go through multiple layers, each consisting of three steps.
§ The first step is attention, where the model calculates attention scores based on the cosine similarity between the current word's query embedding and the embeddings of previous words (keys). The attention output is a weighted average of the value embeddings, and this process involves learnable parameters in the form of Matrix-Vector-Multiplication (linear layers).
§ The second step is a feed-forward network (FFN) that adds more learnable parameters through Matrix-Vector-Multiplication.
§ The third step is positional embedding, which takes into account the ordering of words in natural language by adding positional information to the attention calculations.
After going through all the layers, the embeddings are classified to generate a specific word as the output. This involves using a softmax function to convert the embeddings into a probability distribution and randomly sampling a word from that distribution.

Understanding GPT is not required for this assignment.
Just remember that an LLM performs a large number of Matrix-Vector-Multiplications to apply its learned parameters, and that is what makes it powerful.

Task: Matrix-Vector-Multiplication

Figure 2. Matrix-Vector-Multiplication algorithm.

As shown in Figure 2, Matrix-Vector-Multiplication can be expressed as two nested iterations:
For each row i
    For each column j, accumulate Matrix[i][j] * Vector[j] into Out[i]

More specifically, a sample C implementation is shown below (also in seq.c):

void mat_vec_mul(float* out, float* vec, float* mat, int col, int row) {
    for (int i = 0; i < row; i++) {
        float val = 0.0f;
        for (int j = 0; j < col; j++) {
            val += mat[i * col + j] * vec[j]; // mat[i * col + j] := mat[i][j]
        }
        out[i] = val;
    }
}

Your task in this assignment is to parallelize the outer iteration (at the 2nd line) by allocating rows to threads. More specifically, for a matrix with d rows and n threads working on the computation, if d is divisible by n, the k-th thread (k = 0, 1, ..., n - 1) handles rows k * (d / n) to (k + 1) * (d / n) - 1. To illustrate, if we have a 6-row matrix with 2 threads, the 0th thread handles rows 0 to 2, while the 1st thread handles rows 3 to 5. If d is not divisible by n, we assign the first n - 1 threads (k = 0, 1, ..., n - 2) ⌈d / n⌉ rows each, while the last thread handles the remaining rows (see the illustrative sketch below). More explanation of this design can be found in Appendix a. Parallel Checking.

Moreover, in order to reduce overhead, you are required to create one set of threads and reuse them for all mat_vec_mul() function calls, instead of creating new threads for each mat_vec_mul() call. One popular synchronization-based design is illustrated in Figure 3.

Figure 3. Reference synchronization workflow, consisting of 3 functions: a) CREATE_MAT_VEC_MUL: create n threads, each of which falls asleep immediately; b) MAT_VEC_MUL: assign new parameters, wake up the threads to work on those parameters, and wait until all threads finish before returning; c) DESTROY_MAT_VEC_MUL: wake up the threads to collect their system usage and exit, then wait for them to exit and collect the usage of each thread.
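The following is a minimal sketch of the row-partitioning rule described above. It is not part of the provided seq.c; the helper name thread_row_range and its out-parameters are illustrative only.

#include <stdio.h>

/* Compute the half-open row range [*start, *end) that thread k handles,
 * given d rows and n threads. When d is not divisible by n, the first
 * n-1 threads each take ceil(d/n) rows and the last thread takes the rest. */
void thread_row_range(int k, int d, int n, int *start, int *end) {
    int rows_per_thread = (d + n - 1) / n;       /* ceil(d / n) */
    *start = k * rows_per_thread;
    *end   = (k == n - 1) ? d : *start + rows_per_thread;
    if (*end > d) *end = d;                      /* guard for very small d */
}

int main(void) {
    /* 6-row matrix, 2 threads: thread 0 gets rows 0-2, thread 1 gets rows 3-5 */
    for (int k = 0; k < 2; k++) {
        int s, e;
        thread_row_range(k, 6, 2, &s, &e);
        printf("thread %d: rows %d to %d\n", k, s, e - 1);
    }
    return 0;
}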
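Below is one possible sketch of the three-function workflow in Figure 3, using mutex locks and condition variables (semaphores would work equally well). It is a simplified illustration under assumptions of our own, not the required interface: the lower-case function names, the global variables, and the omission of per-thread usage collection are all choices made only for this example.

#include <pthread.h>
#include <stdlib.h>

/* Shared state for the reusable worker threads (illustrative names). */
static int              n_threads;
static pthread_t       *workers;
static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   go   = PTHREAD_COND_INITIALIZER;   /* caller -> workers */
static pthread_cond_t   done = PTHREAD_COND_INITIALIZER;   /* workers -> caller */
static int generation = 0;     /* bumped once per mat_vec_mul call               */
static int n_finished = 0;     /* workers done with the current generation       */
static int exiting    = 0;     /* set by destroy_mat_vec_mul                     */

/* Parameters of the current call, published before waking the workers. */
static float *cur_out, *cur_vec, *cur_mat;
static int    cur_col, cur_row;

static void *worker(void *arg) {
    int k = (int)(long)arg;    /* thread index */
    int seen = 0;              /* last generation this worker processed */
    for (;;) {
        pthread_mutex_lock(&lock);
        while (generation == seen && !exiting)
            pthread_cond_wait(&go, &lock);       /* sleep until new work or exit */
        if (exiting) { pthread_mutex_unlock(&lock); break; }
        seen = generation;
        pthread_mutex_unlock(&lock);

        /* Work on this thread's contiguous block of rows. */
        int chunk = (cur_row + n_threads - 1) / n_threads;   /* ceil(row / n) */
        int start = k * chunk;
        int end   = (k == n_threads - 1) ? cur_row : start + chunk;
        if (end > cur_row) end = cur_row;
        for (int i = start; i < end; i++) {
            float val = 0.0f;
            for (int j = 0; j < cur_col; j++)
                val += cur_mat[i * cur_col + j] * cur_vec[j];
            cur_out[i] = val;
        }

        pthread_mutex_lock(&lock);
        if (++n_finished == n_threads)
            pthread_cond_signal(&done);          /* last worker wakes the caller */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

void create_mat_vec_mul(int n) {                 /* Figure 3a */
    n_threads = n;
    workers = malloc(n * sizeof(pthread_t));
    for (long k = 0; k < n; k++)
        pthread_create(&workers[k], NULL, worker, (void *)k);
}

void mat_vec_mul(float *out, float *vec, float *mat, int col, int row) {  /* Figure 3b */
    pthread_mutex_lock(&lock);
    cur_out = out; cur_vec = vec; cur_mat = mat; cur_col = col; cur_row = row;
    n_finished = 0;
    generation++;
    pthread_cond_broadcast(&go);                 /* wake all workers */
    while (n_finished < n_threads)
        pthread_cond_wait(&done, &lock);         /* wait until every worker is done */
    pthread_mutex_unlock(&lock);
}

void destroy_mat_vec_mul(void) {                 /* Figure 3c */
    pthread_mutex_lock(&lock);
    exiting = 1;
    pthread_cond_broadcast(&go);
    pthread_mutex_unlock(&lock);
    for (int k = 0; k < n_threads; k++)
        pthread_join(workers[k], NULL);          /* usage collection omitted here */
    free(workers);
}

In this sketch the workers are created once and sleep on the condition variable; each mat_vec_mul() call publishes the new parameters, broadcasts to wake the workers, and blocks until the last worker signals completion, which matches the reuse requirement stated above.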