
Fast Transformer Inference with Better Transformer

Author: Michael Gschwind, translated by 이진혁

This tutorial introduces Better Transformer (BT) as part of the PyTorch 1.12 release. It shows how to use Better Transformer for production-grade inference with torchtext. Better Transformer is a production-ready fastpath that accelerates deployment of Transformer models, with high performance on both CPU and GPU. The fastpath feature works transparently for models based either directly on PyTorch core nn.module or on torchtext.

Models that can be accelerated by Better Transformer fastpath execution are those using the PyTorch core torch.nn.module classes TransformerEncoder, TransformerEncoderLayer, and MultiHeadAttention. In addition, torchtext has been updated to use these core library modules so that it benefits from fastpath acceleration. (Additional modules may support fastpath execution in the future.)
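As a minimal illustration (separate from the XLM-R model used below), a plain encoder built from these core classes is the kind of model that qualifies for fastpath execution. The hyperparameters here are arbitrary illustration values, not part of the tutorial's model:

```python
import torch
import torch.nn as nn

# A fastpath-eligible stack built from the core classes named above.
# d_model, nhead, and num_layers are arbitrary illustration values.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

encoder.eval()                           # inference mode is one fastpath requirement
with torch.no_grad():                    # no gradient tape is the other
    y = encoder(torch.rand(2, 16, 512))  # (batch, sequence, embedding)
print(y.shape)                           # torch.Size([2, 16, 512])
```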

Better Transformer offers two types of acceleration:

  • A native multihead attention (MHA) implementation for CPU and GPU that improves overall execution efficiency.
  • Exploiting sparsity in NLP inference. Because of variable input lengths, an input batch may contain a large number of padding tokens; processing of these tokens can be skipped, delivering significant speedups.
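To make the sparsity point concrete, here is a back-of-the-envelope sketch (the sequence lengths are made up): when a batch of variable-length inputs is padded to the longest sequence, most of the tokens processed can be pure padding.

```python
# Hypothetical token counts for three variable-length inputs in one batch.
lengths = [3, 5, 40]
max_len = max(lengths)          # the batch is padded to the longest input
total = max_len * len(lengths)  # tokens processed without sparsity support
padding = total - sum(lengths)  # tokens that are pure padding
print(f"padding fraction: {padding / total:.0%}")  # padding fraction: 60%
```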

Fastpath execution is subject to some criteria. Most importantly, the model must be executed in inference mode and operate on input tensors that do not collect gradient tape information (e.g., running with torch.no_grad).
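The no-gradient-tape requirement can be checked directly; this small sketch only shows that torch.no_grad() turns gradient recording off for the enclosed operations:

```python
import torch

print(torch.is_grad_enabled())      # True: gradient tape active, slow path
with torch.no_grad():
    print(torch.is_grad_enabled())  # False: eligible for fastpath execution
```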

To follow this example in Google Colab, click here.

Better Transformer features covered in this tutorial

  • Load a pretrained model (created before PyTorch version 1.12, without Better Transformer)
  • Run and benchmark inference on CPU with and without the BT fastpath (native MHA only)
  • Run and benchmark inference on a (configurable) DEVICE with and without the BT fastpath (native MHA only)
  • Enable sparsity support
  • Run and benchmark inference on a (configurable) DEVICE with and without the BT fastpath (native MHA + sparsity)

Additional Information

Additional information about Better Transformer is available in the PyTorch.Org blog post, A Better Transformer for Fast Transformer Inference.

1. Setup

1.1 Loading pretrained models

We download the XLM-R model from the predefined torchtext models, following the instructions in torchtext.models, and set DEVICE so tests can run on an accelerator. (Enable GPU execution for your environment as appropriate.)

import torch
import torch.nn as nn

print(f"torch version: {torch.__version__}")

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

import torch, torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim = 1024)
model = xlmr_large.get_model(head=classifier_head)
transform = xlmr_large.transform()

1.2 Dataset setup

We set up two types of input: a small input batch, and a big input batch with sparsity.

small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

Next, we select either the small or the big input batch, preprocess the inputs, and test the model.

input_batch=big_input_batch

model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape

Finally, we set the number of benchmark iterations:

ITERATIONS=10

2. Execution

2.1 Run and benchmark inference on CPU with and without the BT fastpath (native MHA only)

We run the model on CPU and collect profile information:

  • The first run uses traditional ('slow path') execution.
  • The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

You should see an improvement when the model is executed on CPU (its magnitude depends on the CPU model). Notice that in the fastpath profile, most of the execution time is spent in aten::_transformer_encoder_layer_fwd, which implements the low-level operations of the native `TransformerEncoderLayer`:

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

2.2 Run and benchmark inference on a (configurable) DEVICE with and without the BT fastpath (native MHA only)

We check the BT sparsity setting:

model.encoder.transformer.layers.enable_nested_tensor

Now we disable BT sparsity:

model.encoder.transformer.layers.enable_nested_tensor=False

We run the model on DEVICE and collect profile information for native MHA execution on DEVICE:

  • The first run uses traditional ('slow path') execution.
  • The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

You should see a significant speedup when executing on a GPU, especially for the small input batch setting.

model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

2.3 Run and benchmark inference on a (configurable) DEVICE with and without the BT fastpath (native MHA + sparsity)

We enable sparsity support:

model.encoder.transformer.layers.enable_nested_tensor = True

We run the model on DEVICE and collect profile information for native MHA plus sparsity-support execution on DEVICE:

  • The first run uses traditional ('slow path') execution.
  • The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

You should see a significant speedup when executing on a GPU, especially for the big input batch setting, which contains sparsity.

model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

Summary

In this tutorial, we have introduced fast transformer inference with Better Transformer fastpath execution in torchtext, using PyTorch core's Better Transformer support for Transformer encoder models. We demonstrated the use of Better Transformer with a model trained before BT fastpath execution became available, and we demonstrated and benchmarked both BT fastpath execution modes: native MHA execution and BT sparsity acceleration.