Re: How does an LLM "solve problems"? @粑粑撒 @老图

zhuxiaozhu|2023-03-25 20:08:29|
Q: Explain in detail the difference between the two self-attention layers in the Transformer decoder.

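A: The Transformer decoder has two attention sub-layers, called here Multi-Head Self-Attention and Encoder-Decoder Self-Attention (strictly speaking, the second is cross-attention rather than self-attention). Both compute attention weights in the same way; they differ in where their queries, keys, and values come from.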

Multi-Head Self-Attention
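Multi-Head Self-Attention is the first sub-layer in each decoder layer, and it operates on the decoder's own input. Here every position of the input sequence supplies a query, a key, and a value, so each position is used to compute attention weights against the other positions (in the decoder, a causal mask limits this to the position itself and the ones before it). Because it is multi-head, the layer projects the sequence into several different representation subspaces and computes a separate set of attention weights in each.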
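To make the self-attention step concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: a single head, no learned query/key/value projections (Q = K = V = the input), and the causal mask used in the original Transformer decoder. The function names `softmax` and `causal_self_attention` are made up for this example and do not come from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head masked self-attention with Q = K = V = x.

    Assumption: no learned projections, so this shows only the
    attention pattern. x is a list of token vectors.
    Returns (outputs, weights).
    """
    d = len(x[0])
    n = len(x)
    outputs, weights = [], []
    for i, q in enumerate(x):
        # Causal mask: position i only scores positions 0..i.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Masked (future) positions get weight exactly 0.
        w = softmax(scores) + [0.0] * (n - i - 1)
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(w[j] * x[j][k] for j in range(n))
                        for k in range(d)])
    return outputs, weights

decoder_input = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(decoder_input)
for row in attn:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first position can only attend to itself, while the last position spreads its weight over all three.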

Encoder-Decoder Self-Attention
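Encoder-Decoder Self-Attention is the connection point between the decoder and the encoder. For each decoder position it computes a weighted sum over the encoder's outputs. Specifically, the queries come from the output of the previous decoder layer, while the keys and values both come from the encoder's output. This lets the decoder bring information about the encoder's input into its own computation, which improves the model's performance.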
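The same idea can be sketched for the encoder-decoder case. Again this is a rough illustration rather than a real implementation: single head, no learned projections, and the names `softmax` and `cross_attention` are hypothetical. The point it shows is that the queries come from the decoder side while the keys and values come from the encoder output, so the two sequences can have different lengths and no causal mask is needed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Single-head encoder-decoder attention.

    Assumption: no learned projections. Queries are the decoder
    states; keys and values are both the encoder outputs.
    Returns (context_vectors, weights).
    """
    d = len(encoder_outputs[0])
    contexts, weights = [], []
    for q in decoder_states:
        # Score this decoder position against every encoder position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in encoder_outputs]
        w = softmax(scores)
        weights.append(w)
        # Context vector: weighted sum of the encoder outputs.
        contexts.append([sum(wj * v[k] for wj, v in zip(w, encoder_outputs))
                         for k in range(d)])
    return contexts, weights

# 4 source positions, 2 target positions (lengths may differ).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
decoder_states = [[1.0, 0.0], [0.0, 2.0]]
ctx, attn = cross_attention(decoder_states, encoder_outputs)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of the weight matrix has one entry per encoder position, and every decoder position can look at the whole source sequence.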

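In short, Multi-Head Self-Attention and Encoder-Decoder Self-Attention are both mechanisms for computing attention weights, but their inputs differ. Multi-Head Self-Attention works on the decoder's own input and relates the positions of that sequence to one another across several representation subspaces, whereas Encoder-Decoder Self-Attention connects the decoder to the encoder and brings the encoder's information into the decoder's computation.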

All I can say is that this is impressive. It's a great tool for learning.

[Barbarossa wrote:]
: Mm-hm, you're absolutely right.
--
FROM 49.7.199.*