I have thoroughly tested MoA (with one layer) on some objective benchmarks (less subjective than MT-Bench), such as GSM8K and HotpotQA.
It seems that MoA stops helping once the LLMs are at the 7B level.
Here is my setting:
the three LLMs in layer one are mistralai/Mistral-7B-Instruct-v0.1/2/3, while the aggregator is meta-llama/Meta-Llama-3.1-8B-Instruct.
(Before the experiment, I tested each model's ability to solve the problems on its own; the strongest one is Llama-3.1-8B.)
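For concreteness, here is a minimal sketch of what my single-layer setup looks like. This is not the official MoA code; the OpenAI-compatible endpoint, the aggregation prompt wording, and the helper names are my own assumptions for illustration.

```python
# Minimal sketch of a single-layer MoA run (assumptions: endpoint, prompt wording).
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1", api_key="...")  # assumed endpoint

PROPOSERS = [
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
AGGREGATOR = "meta-llama/Meta-Llama-3.1-8B-Instruct"

AGG_SYSTEM = (
    "You have been provided with responses from several models to the user query. "
    "Synthesize them into a single, accurate answer."
)

def ask(model, messages):
    # One chat completion call; temperature etc. left at defaults.
    return client.chat.completions.create(model=model, messages=messages).choices[0].message.content

def moa_answer(question, rounds=1):
    if rounds == 0:
        # Baseline: the aggregator answers alone.
        return ask(AGGREGATOR, [{"role": "user", "content": question}])
    # rounds=1: one proposer layer, then the aggregator synthesizes the drafts.
    drafts = [ask(m, [{"role": "user", "content": question}]) for m in PROPOSERS]
    context = "\n\n".join(f"Response {i + 1}: {d}" for i, d in enumerate(drafts))
    return ask(AGGREGATOR, [
        {"role": "system", "content": AGG_SYSTEM + "\n\n" + context},
        {"role": "user", "content": question},
    ])
```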
Then, when applying MoA, I find that performance decreases. For example, on GSM8K accuracy drops from 75.1 to 61.3: Llama-3.1-8B alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.
This finding also applies to HotpotQA.
Has anyone else observed something similar? Any suggestions on how to use 7B-level LLMs with MoA?