Research Track 7：多模态大模型最新的一些论文

发布人

MLLM 
https://arxiv.org/pdf/2410.23262
EMMA: End-to-End Multimodal Model for Autonomous Driving

https://arxiv.org/pdf/2410.21276
GPT-4o system card

https://arxiv.org/pdf/2410.20482
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

LoRA
https://arxiv.org/pdf/2410.23775
In-Context LoRA for Diffusion Transformers

https://arxiv.org/pdf/2410.19694
Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

https://arxiv.org/pdf/2410.20777
KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Generation
https://arxiv.org/pdf/2410.19310
Flow Generator Matching

https://arxiv.org/pdf/2410.23277
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

https://arxiv.org/pdf/2410.20280
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Benchmark 
https://openai.com/index/introducing-simpleqa/
https://cdn.openai.com/papers/simpleqa.pdf
Measuring short-form factuality in large language models

Reasoning 
O1 Replication Journey: A Strategic Progress Report – Part 1
https://arxiv.org/pdf/2410.18982

Survey 
https://arxiv.org/pdf/2410.20011
A Survey of Small Language Models

https://arxiv.org/pdf/2410.19884
A Survey of AI-Generated Video Evaluation

https://arxiv.org/pdf/2410.21169
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

https://arxiv.org/pdf/2410.19878
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

https://arxiv.org/pdf/2410.22217
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

打开封面下载高清视频观看高清视频视频下载器

Research Track 7：多模态大模型最新的一些论文

mono-internvl：一体化的多模态大模型

【NeurIPS2024 Oral】VAR：使用next scale prediction，基于自回归架构的图片生成模型

英伟达发布MM-Embed：融合文本和图像的跨模态信息检索新模型

Ferret-UI 2：拥有跨平台UI理解的多模态大模型

Janus：基于分离视觉编码器的统一理解与生成的多模态大模型

Qwen2-VL：支持任意精度图片以及视频输入的开源大模型系列

OMG-LLaVA：拥有segmentation能力的视觉多模态大模型

Aria：基于MoE架构的原生多模态大模型

Allegro：开源的SOTA视频生成模型

Emu3：统一理解和生成的多模态大模型

BLIP3: 抛弃Q-former的BLIP

phi-3.5：微软大模型系列

CogVLM2：智谱AI新一代多模态大模型系列

MLLM多模态大模型三大奠基模型：VIT/CLIP/BLIP模型原理详解+项目实战，通俗易懂的大模型入门教程！

transfusion：统一transformer和diffusion框架的多模态大模型

LongLLaVA:基于Jamba的多图理解多模态大模型

Research Track 5：多模态大模型最新的一些论文

OLMoE：基于MoE的全开源大模型

mini-Gemini：支持高精度图片输入的多模态大模型

今年的智驾只有一个声音：端到端+大模型

多模态大模型综述: 数据、训练任务、架构分类、大模型实战训练

LLaMA-omni：低延时的语言交互多模态大模型

SHOW-o：统一理解和生成任务的transformer

UnifiedMLLM：多任务多模态大模型

国内首门多模态大模型与自动驾驶实战课程到底讲了啥

【多模态+知识图谱】博士轻松带你从零构建知识图谱！基于知识图谱的六大项目实战—医药问答系统、知识抽取、推荐系统、Neo4j数据库、大模型

如果能重来，我要选Ai

【你知道吗？】Cursor如何索引你的代码库文件?

ChartMoE：使用MoE adapter的Chart理解多模态大模型

【强推！】B站最全！科大讯飞终于把【多模态大模型】给讲通透了！CLIP、blip、blip2三种模型原理一次性学透！全程干货分享无废话！

VITA: 开源版GPT-4o实现

Research Track 6：多模态大模型最新的一些论文

llava-onevision：llava系列集大成者

【共享LLM前沿】假如我从11月1号开始学大模型！9小时学会搭建对话机器人办公助手、大模型预训练微调、四大多模态大模型！

Fluid：使用连续token表示，随机顺序生成的自回归文生图模型

还得是知网，开题报告信手拈来

Cambrian-1：以视觉为中心，基于多个vision encoder的多模态大模型

Research Track 2: 多模态大模型最新的一些论文

AVG-LLaVA：自适应尺度视觉特征选择的多模态大模型

GameNGen：使用diffusion model做的游戏引擎