<!DOCTYPE html>
<html lang="en-us">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
<!-- Font Awesome for social media icons -->
<script src="https://kit.fontawesome.com/791291c78f.js" crossorigin="anonymous"></script>
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<!-- Site Information -->
<title> SAIL@Princeton </title>
<style type="text/css">
.smlinks {
color: black;
}
.smlinks:hover {
color: rgb(7, 107, 255);
}
.paper-item {
margin-bottom: 15px; /* Adjust this value to increase/decrease the space */
}
.badge.badge-secondary {
cursor: pointer;
}
</style>
<!-- Favicon -->
<!-- TODO(ruipan): we could add a favicon of the website here -->
<!-- https://realfavicongenerator.net/ -->
<!-- <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff"> -->
<!-- Functionality for searching papers -->
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script>
$(document).ready(function () {
// Function to get URL parameters
function getQueryParam(name) {
let urlParams = new URLSearchParams(window.location.search);
return urlParams.get(name) || "";
}
// Function to perform search
function filterPapers(query) {
query = query.toLowerCase();
$(".paper-item").each(function () {
let text = $(this).text().toLowerCase();
$(this).closest("li").toggle(text.includes(query));
});
}
// Populate search bar and apply filter if "search" parameter exists
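// e.g., loading publications.html?search=inference pre-fills the search box and filters the list on page load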
let searchQuery = getQueryParam("search");
if (searchQuery) {
$("#search").val(searchQuery);
filterPapers(searchQuery); // Directly apply the filter
}
// Attach event listener for manual searches
$("#search").on("input", function () {
filterPapers($(this).val());
});
// Add click event to badge elements
$(".badge.badge-secondary").on("click", function () {
let keyword = $(this).text().trim();
$("#search").val(keyword).trigger("input"); // Update search bar and trigger filtering
});
// Clear search when "Clear" button is clicked
$("#clear-search").on("click", function () {
$("#search").val("").trigger("input"); // Clear input and reset filter
});
});
</script>
</head>
<body>
<!-- Nav Bar -->
<!-- TODO(ruipan): figure out how to align the nav items to the right rather than the left -->
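<!-- Possible approach for the TODO above (untested sketch, relying on Bootstrap 4's flexbox spacing utilities):
     replacing the list's mr-auto class with ml-auto pushes the nav links to the right edge of the navbar, e.g.
     <ul class="navbar-nav ml-auto"> ... </ul>
     ml-auto sets margin-left: auto, so the flex container places the list against the navbar's right side. -->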
<nav class="navbar navbar-expand-lg navbar-light sticky-top navbar-custom" style="background-color: #f58025">
<a class="navbar-brand" href="index.html">
<img src="./images/princeton_square.jpg" width="30" height="30" class="d-inline-block align-top">
SAIL@Princeton
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<ul class="navbar-nav mr-auto">
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="index.html#projects">Projects</a>
</li>
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="people.html">People</a>
</li>
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="publications.html">Publications</a>
</li>
</ul>
</div>
</nav>
<!-- Jumbotron -->
<div class="jumbotron jumbotron-fluid text-center">
<div class="container">
<div class="row align-items-center">
<div class="col-sm-12">
<h2 class="jumbotron-heading">Publications of SAIL@Princeton</h2>
<p class="lead">Our publications showcase cutting-edge research at the intersection of systems and machine learning,
advancing efficient, scalable, and secure AI/ML systems. From novel models and algorithms to optimized runtime systems for training and inference,
our work pushes the boundaries of next-generation AI infrastructure. Explore our latest contributions to AI/ML and systems research below.</p>
</div>
</div>
</div>
</div>
<!-- Search bar -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<div class="d-flex mb-3">
<input type="text" id="search" class="form-control" placeholder="Search by title, author, or keyword..." style="flex: 1;">
<button id="clear-search" class="btn btn-outline-secondary ml-2">Reset</button>
</div>
</div>
</div>
</div>
<!-- Preprints -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>Preprints</h3>
<ul>
<li class="paper-item">
<h5>How to Train Long-Context Language Models (Effectively)</h5>
Tianyu Gao*, Alexander Wettig*, Howard Yen, Danqi Chen <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2410.02660" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#prolong-abstract" role="button" aria-expanded="false" aria-controls="prolong-abstract">Abstract</a>
</div>
<div class="collapse" id="prolong-abstract">
<div class="card card-body">
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests,
we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities.
Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset,
and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data;
(2) training with a sequence length beyond the evaluation length boosts long-context performance;
(3) for SFT, using only short instruction datasets yields strong performance on long-context tasks.
Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training.
Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
</div>
</div>
</li>
<li class="paper-item">
<h5>Certifiably Robust RAG against Retrieval Corruption</h5>
Chong Xiang*, Tong Wu*, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal <br>
arXiv 2025<br>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2405.15556" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#robustrag-abstract" role="button" aria-expanded="false" aria-controls="robustrag-abstract">Abstract</a>
</div>
<div class="collapse" id="robustrag-abstract">
<div class="card card-body">
Retrieval-augmented generation (RAG) has been shown vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses.
In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks.
The key insight of RobustRAG is an isolate-then-aggregate strategy: we get LLM responses from each passage in isolation and then securely aggregate these isolated responses.
To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses.
Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG can always return accurate responses,
even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.
</div>
</div>
</li>
<li class="paper-item">
<h5>RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation</h5>
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang <br>
arXiv 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2412.10543" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#ragserve-abstract" role="button" aria-expanded="false" aria-controls="ragserve-abstract">Abstract</a>
</div>
<div class="collapse" id="ragserve-abstract">
<div class="card card-body">
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to
generate better responses with external knowledge, but using more external
knowledge often improves generation quality at the expense of response delay.
Prior work either reduces the response delay (through better scheduling of RAG
queries) or strives to maximize quality (which involves tuning the RAG workflow),
but they fall short in optimizing the tradeoff between the delay
and quality of RAG responses. This paper presents RAGServe, the first RAG system
that jointly schedules queries and adapts the key RAG configurations of each
job, such as the number of retrieved text chunks and synthesis methods,
in order to balance quality optimization and response delay reduction.
Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art
RAG scheduling system, RAGServe reduces the generation latency by 1.64--2.54×
without sacrificing generation quality.
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2025 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2025</h3>
<ul>
<li class="paper-item">
<h5>LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity</h5>
Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K Jha, Xiaoliang Dai <br>
CVPR 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2412.09856" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#lingen-abstract" role="button" aria-expanded="false" aria-controls="lingen-abstract">Abstract</a>
<a href="https://lineargen.github.io/" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Website</button>
</a>
</div>
<div class="collapse" id="lingen-abstract">
<div class="card card-body">
Text-to-video generation enhances content creation but
is highly computationally intensive: The computational cost
of Diffusion Transformers (DiTs) scales quadratically in the
number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to
generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number
of pixels. For the first time, LinGen enables high-resolution
minute-length video generation on a single GPU without
compromising quality. It replaces the computationally dominant and quadratic-complexity block, self-attention,
with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch
targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement
method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel
TEmporal Swin Attention block that focuses on temporal
correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that
LinGen outperforms DiT (with a 75.6% win rate) in video
quality with up to 15× (11.5×) FLOPs (latency) reduction.
Furthermore, both automatic metrics and human evaluation
demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1%
win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our
project website: https://lineargen.github.io/.
</div>
</div>
</li>
<li class="paper-item">
<h5>Marconi: Prefix Caching for the Era of Hybrid LLMs</h5>
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali <br>
MLSys 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Sequence Modeling</span>
<span class="badge badge-secondary">State Space Models</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2411.19379" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#marconi-abstract" role="button" aria-expanded="false" aria-controls="marconi-abstract">Abstract</a>
<a href="https://github.com/ruipeterpan/marconi" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="marconi-abstract">
<div class="card card-body">
Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent
layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language
Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency
optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of
in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and
instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most
of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix
caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously
assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a
taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints.
Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617
ms lower TTFT) compared to state-of-the-art prefix caching systems.
</div>
</div>
</li>
<li class="paper-item">
<h5>Mowgli: Passively Learned Rate Control for Real-Time Video</h5>
Neil Agarwal, Rui Pan, Francis Y. Yan, Ravi Netravali <br>
NSDI 2025<br>
<span class="badge badge-secondary">ML for Systems</span>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2410.03339" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#mowgli-abstract" role="button" aria-expanded="false" aria-controls="mowgli-abstract">Abstract</a>
</div>
<div class="collapse" id="mowgli-abstract">
<div class="card card-body">
Rate control algorithms are at the heart of video conferencing platforms,
determining target bitrates that match dynamic network characteristics for high quality.
Recent data-driven strategies have shown promise for this challenging task,
but the performance degradation they introduce during training has been a nonstarter
for many production services, precluding adoption.
This paper aims to bolster the practicality of data-driven rate control by presenting
an alternative avenue for experiential learning:
leveraging purely existing telemetry logs produced by the incumbent algorithm in production.
We observe that these logs often contain effective decisions, although often at the wrong times or in the wrong order.
To realize this approach despite the inherent uncertainty that log-based learning brings
(i.e., lack of feedback for new decisions), our system, Mowgli,
combines a variety of robust learning techniques (i.e., conservatively reasoning
about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise).
Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm,
increasing average video bitrates by 15-39% while reducing freeze rates by 60-100%.
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2024 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2024</h3>
<ul>
<li class="paper-item">
<h5>Catastrophic jailbreak of open-source LLMs via exploiting generation</h5>
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen<br>
ICLR 2024<br>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2310.06987" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#genexploit-abstract" role="button" aria-expanded="false" aria-controls="genexploit-abstract">Abstract</a>
</div>
<div class="collapse" id="genexploit-abstract">
<div class="card card-body">
The rapid progress in open-source large language models (LLMs) is significantly
advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as
“jailbreaks”. These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation
exploitation attack, an extremely simple approach that disrupts model alignment
by only manipulating variations of decoding methods. By exploiting different
generation strategies, including varying decoding hyper-parameters and sampling
methods, we increase the misalignment rate from 0% to more than 95% across
11 language models including LLAMA2, VICUNA, FALCON, and MPT families,
outperforming state-of-the-art attacks with 30× lower computational cost. Finally, we propose an effective alignment method that explores diverse generation
strategies, which can reasonably reduce the misalignment rate under our attack.
Altogether, our study underscores a major failure in current safety evaluation and
alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models.
</div>
</div>
</li>
<li class="paper-item">
<h5>MadEye: Boosting Live Video Analytics Accuracy with Adaptive Camera Configurations</h5>
Mike Wong, Murali Ramanujam, Guha Balakrishnan, Ravi Netravali<br>
NSDI 2024<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a href="https://michaeldwong.github.io/papers/madeye-nsdi24.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#madeye-abstract" role="button" aria-expanded="false" aria-controls="madeye-abstract">Abstract</a>
</div>
<div class="collapse" id="madeye-abstract">
<div class="card card-body">
Camera orientations (i.e., rotation and zoom) govern the
content that a camera captures in a given scene, which in
turn heavily influences the accuracy of live video analytics
pipelines. However, existing analytics approaches leave this
crucial adaptation knob untouched, instead opting to only
alter the way that captured images from fixed orientations
are encoded, streamed, and analyzed. We present MadEye,
a camera-server system that automatically and continually
adapts orientations to maximize accuracy for the workload
and resource constraints at hand. To realize this using commodity pan-tilt-zoom (PTZ) cameras, MadEye embeds (1) a
search algorithm that rapidly explores the massive space of
orientations to identify a fruitful subset at each time, and (2) a
novel knowledge distillation strategy to efficiently (with only
camera resources) select the ones that maximize workload accuracy. Experiments on diverse workloads show that MadEye
boosts accuracy by 2.9-25.7% for the same resource usage, or
achieves the same accuracy with 2-3.7× lower resource costs.
</div>
</div>
</li>
<li class="paper-item">
<h5>ADR-X: ANN-Assisted Wireless Link Rate Adaptation for Compute-Constrained Embedded Gaming Devices</h5>
Hao Yin, Murali Ramanujam, Joe Schaefer, Stan Adermann, Srihari Narlanka, Perry Lea, Ravi Netravali, Krishna Chintalapudi<br>
NSDI 2024<br>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a href="https://www.usenix.org/system/files/nsdi24-yin.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#adrx-abstract" role="button" aria-expanded="false" aria-controls="adrx-abstract">Abstract</a>
</div>
<div class="collapse" id="adrx-abstract">
<div class="card card-body">
The wireless channel between a gaming console and its accessories,
e.g., controllers and headsets, experiences extremely rapid
variations due to abrupt head and hand movements amidst
an exciting game. In the absence of prior studies on wireless
packet losses for console gaming, through extensive evaluations and user studies, we find that state-of-the-art rate adaptation schemes, unable to keep up with these rapid changes,
experience packet loss rates of 2-10% while loss rates that
are 10× lower (0.1-0.5%) are required to ensure a high quality gaming experience. We present ADR-X, an ANN-based
contextual multi-armed bandit rate adaptation technique that
continuously predicts and tracks the channel and picks appropriate data rates. A key challenge for ADR-X is that it must
run on power and compute constrained embedded devices
under realtime constraints. ADR-X addresses this challenge
by meticulously crafting an ANN that leverages existing communication theory results to incorporate domain knowledge.
This allows ADR-X to achieve 10× lower packet losses than
existing schemes while also running 100× faster than state-of-the-art reinforcement learning schemes, making it suitable
for deployment on embedded gaming devices.
</div>
</div>
</li>
<li class="paper-item">
<h5>NetVigil: Robust and Low-Cost Anomaly Detection for East-West Data Center Security</h5>
Kevin Hsieh*, Mike Wong*, Santiago Segarra, Sathiya Kumaran Mani, Trevor Eberl, Anatoliy Panasyuk, Ravi Netravali, Ranveer Chandra, Srikanth Kandula<br>
NSDI 2024<br>
<span class="badge badge-secondary">ML for Systems</span>
<span class="badge badge-secondary">Privacy and Security</span>
<span class="badge badge-secondary">Novel ML Applications</span>
<div class="mt-2">
<a href="https://michaeldwong.github.io/papers/netvigil-nsdi24.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#netvigil-abstract" role="button" aria-expanded="false" aria-controls="netvigil-abstract">Abstract</a>
</div>
<div class="collapse" id="netvigil-abstract">
<div class="card card-body">
The growing number of breaches in data centers
underscores an urgent need for more effective security. Traditional perimeter defense measures and static zero-trust approaches are unable to address the unique challenges that arise
from the scale, complexity, and evolving nature of today’s
data center networks. To tackle these issues, we introduce
NetVigil, a robust and cost-efficient anomaly detection system
specifically designed for east-west traffic within data center
networks. NetVigil adeptly extracts security-focused, graphbased features from network flow logs and employs domainspecific graph neural networks (GNNs) and contrastive learning techniques to strengthen its resilience against normal
traffic variations and adversarial evasion strategies. Our evaluation, over various attack scenarios and traces from real-world
production clusters, shows that NetVigil delivers significant
improvements in accuracy, cost, and detection latency compared to state-of-the-art anomaly detection systems, providing
a practical, supplementary security mechanism to protect the
east-west traffic within data center networks.
</div>
</div>
</li>
<li class="paper-item">
<h5>
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
<img src="images/acm_available_1.1.png" height="25"/><img src="images/acm_functional_1.1.png" height="25"/><img src="images/acm_reproduced_1.1.png" height="25"/>
</h5>
Yinwei Dai*, Rui Pan*, Anand Iyer, Kai Li, Ravi Netravali <br>
SOSP 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://dl.acm.org/doi/pdf/10.1145/3694715.3695963" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#apparate-abstract" role="button" aria-expanded="false" aria-controls="apparate-abstract">Abstract</a>
<a href="https://github.com/dywsjtu/apparate" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="apparate-abstract">
<div class="card card-body">
Machine learning (ML) inference platforms are tasked with balancing two competing goals:
ensuring high throughput given many requests, and delivering low-latency responses to support interactive applications.
Unfortunately, existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension,
and instead only enable users to harshly trade off one property for the other.
This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity
at which inference is performed.
We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models,
whereby certain inputs can exit with results at intermediate layers.
To cope with the time-varying overhead and accuracy challenges that EEs bring,
Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and adaptation strategies.
Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP classification workloads,
and median time-per-token latencies by 70.4-77.9% for generative scenarios,
without affecting throughputs or violating tight accuracy constraints.
</div>
</div>
</li>
<li class="paper-item">
<h5>Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation</h5>
Anand Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, Ravi Netravali <br>
SOSP 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://dl.acm.org/doi/pdf/10.1145/3694715.3695978" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#e3-abstract" role="button" aria-expanded="false" aria-controls="e3-abstract">Abstract</a>
</div>
<div class="collapse" id="e3-abstract">
<div class="card card-body">
Machine learning inference platforms continue to face high request rates and strict latency constraints.
Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations.
This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity:
early exit models, which selectively allow certain inputs to exit a model from an intermediate layer.
Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization)
throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments.
Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution,
all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74×
improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput).
Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).
</div>
</div>
</li>
<li class="paper-item">
<h5>Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers</h5>
Hongjie Wang, Bhishma Dedhia, Niraj K Jha<br>
CVPR 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Zero-TPrune_Zero-Shot_Token_Pruning_through_Leveraging_of_the_Attention_Graph_CVPR_2024_paper.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#zerotprune-abstract" role="button" aria-expanded="false" aria-controls="zero-tprune-abstract">Abstract</a>
</div>
<div class="collapse" id="zero-tprune-abstract">
<div class="card card-body">
Deployment of Transformer models on edge devices is
becoming increasingly challenging due to the exponentially
growing inference cost that scales quadratically with the
number of tokens in the input sequence. Token pruning is an
emerging solution to address this challenge due to its ease
of deployment on various Transformer backbones. However, most token pruning methods require computationally
expensive fine-tuning, which is undesirable in many edge
deployment cases. In this work, we propose Zero-TPrune,
the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for
tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning
for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead, Zero-TPrune can prune
large models at negligible computational cost, switch between different pruning configurations at no computational
cost, and perform hyperparameter tuning efficiently. We
evaluate the performance of Zero-TPrune on vision tasks
by applying it to various vision Transformer backbones and
testing them on ImageNet. Without any fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and
improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so
with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces
accuracy loss by up to 49% with similar FLOPs budgets.
Project webpage: https://jha-lab.github.io/zerotprune.
</div>
</div>
</li>
<li class="paper-item">
<h5>AT-EDM: Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models</h5>
Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu<br>
CVPR 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.05252" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#atedm-abstract" role="button" aria-expanded="false" aria-controls="atedm-abstract">Abstract</a>
<a href="https://atedm.github.io/" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Website</button>
</a>
</div>
<div class="collapse" id="atedm-abstract">
<div class="card card-body">
Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works
mainly adopt a retraining process to enhance DM efficiency.
This is computationally expensive and not very scalable. To
this end, we introduce the Attention-driven Training-free
Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel
ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based
recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware
Pruning (DSAP) approach to adjust the pruning budget
across different denoising timesteps for better generation
quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency
(e.g., 38.8% FLOPs saving and up to 1.53× speed-up over
Stable Diffusion XL) while maintaining nearly the same
FID and CLIP scores as the full model. Project webpage:
https://atedm.github.io.
</div>
</div>
</li>
<li class="paper-item">
<h5>DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling</h5>
Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj Jha, Yilin Shen, Hongxia Jin<br>
NAACL 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://aclanthology.org/2024.naacl-long.182.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#dynamo-abstract" role="button" aria-expanded="false" aria-controls="dynamo-abstract">Abstract</a>
</div>
<div class="collapse" id="dynamo-abstract">
<div class="card card-body">
Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models <em>dynamically</em> predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves same-quality generated text as the baseline (Pythia-6.9B) while achieving 2.57× speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
</div>
</div>
</li>
<li class="paper-item">
<h5>LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference</h5>
Hengrui Zhang, August Ning, Rohan Baskar Prabhakar, and David Wentzlaff<br>
ISCA 2024<br>
<span class="badge badge-secondary">Hardware Design for ML</span>
<div class="mt-2">
<a href="https://parallel.princeton.edu/papers/isca24_llmcompass.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#llmcompass-abstract" role="button" aria-expanded="false" aria-controls="llmcompass-abstract">Abstract</a>
</div>
<div class="collapse" id="llmcompass-abstract">
<div class="card card-body">
The past year has witnessed the increasing popularity of Large Language Models (LLMs). Their unprecedented
scale and associated high hardware cost have impeded their
broader adoption, calling for efficient hardware designs. With the
large hardware needed to simply run LLM inference, evaluating
different hardware designs becomes a new bottleneck.
This work introduces LLMCompass, a hardware evaluation
framework for LLM inference workloads. LLMCompass is fast,
accurate, versatile, and able to describe and evaluate different
hardware designs. LLMCompass includes a mapper to automatically find performance-optimal mapping and scheduling. It also
incorporates an area-based cost model to help architects reason
about their design choices. Compared to real-world hardware,
LLMCompass’ estimated latency achieves an average 10.9% error rate across various operators with various input sizes and an
average 4.1% error rate for LLM inference. With LLMCompass,
simulating a 4-NVIDIA A100 GPU node running GPT-3 175B
inference can be done within 16 minutes on commodity hardware,
including 26,400 rounds of the mapper’s parameter search.
With the aid of LLMCompass, this work draws architectural
implications and explores new cost-effective hardware designs. By
reducing the compute capability or replacing High Bandwidth
Memory (HBM) with traditional DRAM, these new designs
can achieve as much as 3.41x improvement in performance/cost
compared to an NVIDIA A100, making them promising choices
for democratizing LLMs.
</div>
</div>
</li>
<li class="paper-item">
<h5>Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference</h5>
Rohan Baskar Prabhakar, Hengrui Zhang, and David Wentzlaff <br>
NeurIPS 2024<br>
<span class="badge badge-secondary">Hardware Design for ML</span>
<div class="mt-2">
<a href="https://parallel.princeton.edu/papers/Kraken.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#kraken-abstract" role="button" aria-expanded="false" aria-controls="kraken-abstract">Abstract</a>
</div>
<div class="collapse" id="kraken-abstract">
<div class="card card-body">
Large Transformer networks are increasingly used in settings where low inference latency is necessary to enable new applications and improve the end-user
experience. However, autoregressive inference is resource intensive and requires
parallelism for efficiency. Parallelism introduces collective communication that
is both expensive and represents a phase when hardware resources are underutilized. Towards mitigating this, Kraken is an evolution of the standard Transformer
architecture that is designed to complement existing tensor parallelism schemes
for efficient inference on multi-device systems. By introducing a fixed degree of
intra-layer model parallelism, the architecture allows collective operations to be
overlapped with compute, decreasing latency and increasing hardware utilization.
When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers while also preserving their language modeling capabilities as
evaluated on the SuperGLUE benchmark. Importantly, when tested on multi-GPU
systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by
a mean of 35.6% across a range of model sizes, context lengths, and degrees of
tensor parallelism.
</div>
</div>
</li>
<li class="paper-item">
<h5>SimPO: Simple Preference Optimization with a Reference-Free Reward</h5>
Yu Meng*, Mengzhou Xia*, Danqi Chen <br>
NeurIPS 2024<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.14734" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#simpo-abstract" role="button" aria-expanded="false" aria-controls="simpo-abstract">Abstract</a>
</div>
<div class="collapse" id="simpo-abstract">
<div class="card card-body">
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability.
In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward.
This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient.
Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance.
We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3.
We evaluate them on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark.
Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard.
Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard -- making it the strongest 8B open-source model.
</div>
</div>
</li>
<li class="paper-item">
<h5>Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training</h5>
Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis <br>
COLM 2024<br>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2405.03133" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#lory-abstract" role="button" aria-expanded="false" aria-controls="lory-abstract">Abstract</a>
</div>
<div class="collapse" id="lory-abstract">
<div class="card card-body">
Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective.
Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks.
In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models;
(2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances.
We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters.
Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%).
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision.
Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
</div>
</div>
</li>
<li class="paper-item">
<h5>Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality</h5>
Tri Dao, Albert Gu <br>
ICML 2024<br>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.21060" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#transformers-are-ssms-abstract" role="button" aria-expanded="false" aria-controls="transformers-are-ssms-abstract">Abstract</a>
</div>
<div class="collapse" id="transformers-are-ssms-abstract">
<div class="card card-body">
While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
</div>
</div>
</li>
<li class="paper-item">
<h5>FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning</h5>
Tri Dao<br>
ICLR 2024<br>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2307.08691" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#flashattention2-abstract" role="button" aria-expanded="false" aria-controls="flashattention2-abstract">Abstract</a>
</div>
<div class="collapse" id="flashattention2-abstract">
<div class="card card-body">
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs, (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).
</div>
</div>
</li>
<li class="paper-item">
<h5>Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning</h5>
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen<br>
ICLR 2024<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2310.06694" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#sheared-llama-abstract" role="button" aria-expanded="false" aria-controls="sheared-llama-abstract">Abstract</a>
</div>
<div class="collapse" id="sheared-llama-abstract">
<div class="card card-body">
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs.
Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner,
and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters.
Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch.
This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs.
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2023 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2023</h3>
<ul>
<li class="paper-item">
<h5>SCouT: Synthetic Counterfactuals via Spatiotemporal Transformers for Actionable Healthcare</h5>
Bhishma Dedhia, Roshini Balasubramanian, Niraj K. Jha<br>
ACM Transactions on Computing for Healthcare, October 2023<br>
<span class="badge badge-secondary">Novel ML Applications</span>
<div class="mt-2">
<a href="https://dl.acm.org/doi/10.1145/3617180" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#scout-abstract" role="button" aria-expanded="false" aria-controls="scout-abstract">Abstract</a>
</div>
<div class="collapse" id="scout-abstract">
<div class="card card-body">
The synthetic control method has pioneered a class of powerful data-driven techniques to estimate the counterfactual reality of a unit from donor units. At its core, the technique involves a linear model fitted on the pre-intervention period that combines donor outcomes to yield the counterfactual. However, linearly combining spatial information at each time instance using time-agnostic weights fails to capture important inter-unit and intra-unit temporal contexts and complex nonlinear dynamics of real data. We instead propose an approach to use local spatiotemporal information before the onset of the intervention as a promising way to estimate the counterfactual sequence. To this end, we suggest a Transformer model that leverages particular positional embeddings, a modified decoder attention mask, and a novel pre-training task to perform spatiotemporal sequence-to-sequence modeling. Our experiments on synthetic data demonstrate the efficacy of our method in the typical small donor pool setting and its robustness against noise. We also generate actionable healthcare insights at the population and patient levels by simulating a state-wide public health policy to evaluate its effectiveness, an in silico trial for asthma medications to support randomized controlled trials, and a medical intervention for patients with Friedreich’s ataxia to improve clinical decision making and promote personalized therapy (code is available at https://github.com/JHA-Lab/scout).
</div>
</div>
</li>
<li class="paper-item">
<h5>EdgeTran: Device-Aware Co-Search of Transformers for Efficient Inference on Mobile Edge Platforms</h5>
Shikhar Tuli, Niraj K. Jha<br>
IEEE Transactions on Mobile Computing 2023<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a href="https://ieeexplore.ieee.org/abstract/document/10301516" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#edgetran-abstract" role="button" aria-expanded="false" aria-controls="edgetran-abstract">Abstract</a>
</div>
<div class="collapse" id="edgetran-abstract">
<div class="card card-body">
Automated design of efficient transformer models has recently attracted significant attention from industry and academia. However, most works only focus on certain metrics while searching for the best-performing transformer architecture. Furthermore, running traditional, complex, and large transformer models on low-compute edge platforms is a challenging problem. In this work, we propose a framework, called ProTran, to profile the hardware performance measures for a design space of transformer architectures and a diverse set of edge devices. We use this profiler in conjunction with the proposed co-search technique to obtain the best-performing models that have high accuracy on the given task and minimize latency, energy consumption, and peak power draw to enable edge deployment. We refer to our framework for co-optimizing accuracy and hardware performance measures as EdgeTran. It searches for the best transformer model and edge device pair. Finally, we propose GPTran, a multi-stage block-level grow-and-prune post-processing step that further improves accuracy in a hardware-aware manner. The obtained transformer model is 2.8× smaller and has a 0.8% higher GLUE score than the baseline (BERT-Base). Inference with it on the selected edge device enables 15.0% lower latency, 10.0× lower energy, and 10.8× lower peak power draw compared to an off-the-shelf GPU.
</div>
</div>
</li>
<li class="paper-item">
<h5>Privacy Implications of Retrieval-Based Language Models</h5>
Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen<br>
EMNLP 2023<br>
<span class="badge badge-secondary">Compound AI Systems</span>
<span class="badge badge-secondary">Privacy and Security</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2305.14888" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#retrievalprivacy-abstract" role="button" aria-expanded="false" aria-controls="retrievalprivacy-abstract">Abstract</a>
</div>
<div class="collapse" id="retrievalprivacy-abstract">
<div class="card card-body">
Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that kNN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs.
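<p class="mt-2"><em>A minimal sketch (not from the paper) of the two pieces the abstract refers to: the kNN-LM next-token distribution, which interpolates the parametric LM with a distribution induced by neighbors retrieved from the datastore, and a simple targeted sanitization step applied before texts enter the datastore. The SSN-shaped regular expression and the interpolation weight are illustrative assumptions.</em></p>
<pre><code>import re
import numpy as np

def sanitize(texts, private_pattern=r"\b\d{3}-\d{2}-\d{4}\b"):
    """Redact a targeted, easily detected kind of private information
    (here: strings shaped like US Social Security numbers) before the
    texts are encoded into the retrieval datastore."""
    return [re.sub(private_pattern, "[REDACTED]", t) for t in texts]

def knn_lm_next_token(p_lm, p_knn, lam=0.25):
    """kNN-LM interpolation: lam * p_kNN + (1 - lam) * p_LM."""
    return lam * np.asarray(p_knn) + (1.0 - lam) * np.asarray(p_lm)

print(sanitize(["call me at 123-45-6789", "the weather is nice"]))
print(knn_lm_next_token(p_lm=[0.7, 0.2, 0.1], p_knn=[0.1, 0.8, 0.1]))
</code></pre>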
</div>
</div>
</li>
<li class="paper-item">
<h5>Marvolo: Programmatic Data Augmentation for Deep Malware Detection</h5>
Mike Wong, Edward Raff, James Holt, Ravi Netravali<br>
ECML PKDD 2023<br>
<span class="badge badge-secondary">ML for Systems</span>
<span class="badge badge-secondary">Privacy and Security</span>
<div class="mt-2">
<a href="https://michaeldwong.github.io/papers/marvolo.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#marvolo-abstract" role="button" aria-expanded="false" aria-controls="marvolo-abstract">Abstract</a>
</div>
<div class="collapse" id="marvolo-abstract">
<div class="card card-body">
Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government) may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics-preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly-generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10%, respectively, while boosting efficiency by 79x by avoiding redundant computation.
</div>
</div>
</li>
<li class="paper-item">
<h5>MUX-PLMs: Data multiplexing for high-throughput language models</h5>
Vishvak Murahari, Ameet Deshpande, Carlos E Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan <br>
EMNLP 2023<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2302.12441" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#muxplms-abstract" role="button" aria-expanded="false" aria-controls="muxplms-abstract">Abstract</a>
<a href="https://github.com/state-spaces/mamba/" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="muxplms-abstract">
<div class="card card-body">
The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes, coupled with hardware shortages, has limited affordable access and poses a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms such as data multiplexing offer a promising solution with a many-fold increase in throughput by performing inference for multiple inputs at the cost of a single input. Yet these approaches are not currently performant enough to be deployed in modern systems. We change that by developing MUX-PLMs, a class of high-throughput pre-trained language models (PLMs) trained with data multiplexing, that can be fine-tuned for any downstream task to yield high throughput and high performance. Our novel multiplexing and demultiplexing modules proficiently entangle and disentangle inputs, and enable high-performance, high-throughput MUX-PLMs that are competitive with vanilla PLMs while achieving 2x/5x inference speedup with only a 1-4% drop on a broad suite of tasks.
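<p class="mt-2"><em>A toy stand-in (not the paper's learned modules) for the multiplex/demultiplex idea above: N inputs are entangled into a single sequence using per-input random orthogonal keys, one forward pass serves all of them, and per-input output streams are recovered with the matching keys. The key construction and all names are illustrative assumptions.</em></p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def multiplex(inputs, keys):
    """Entangle N input embeddings into one sequence: project each input
    with its own fixed key, then average across inputs."""
    projected = np.einsum("nsd,nde->nse", inputs, keys)
    return projected.mean(axis=0)              # (seq, dim): one forward pass serves N inputs

def demultiplex(hidden, keys):
    """Recover one output stream per input by applying the transpose of
    the matching key to the shared hidden states."""
    return np.einsum("sd,ned->nse", hidden, keys)

N, seq, dim = 4, 8, 16
keys = np.stack([np.linalg.qr(rng.normal(size=(dim, dim)))[0] for _ in range(N)])
inputs = rng.normal(size=(N, seq, dim))
mux = multiplex(inputs, keys)                  # fed to the shared PLM once
streams = demultiplex(mux, keys)               # per-input output streams
print(mux.shape, streams.shape)                # (8, 16) (4, 8, 16)
</code></pre>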
</div>
</div>
</li>
<li class="paper-item">
<h5>Mamba: Linear-Time Sequence Modeling with Selective State Spaces</h5>
Albert Gu*, Tri Dao* <br>
COLM 2024<br>
<span class="badge badge-secondary">State Space Models</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<span class="badge badge-secondary">Sequence Modeling</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2312.00752" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#mamba-abstract" role="button" aria-expanded="false" aria-controls="mamba-abstract">Abstract</a>
<a href="https://github.com/state-spaces/mamba/" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="mamba-abstract">
<div class="card card-body">
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
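<p class="mt-2"><em>A single-channel, purely sequential reference (not the paper's hardware-aware parallel scan) of the selective-SSM recurrence the abstract describes: the step size delta and the projections B and C vary per token, so the state can selectively keep or forget information. All shapes and names are illustrative assumptions.</em></p>
<pre><code>import numpy as np

def selective_scan(x, A, B, C, delta):
    """Reference recurrence of a diagonal selective SSM:
       h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t,   y_t = C_t . h_t
    x: (T,) inputs, A: (N,) diagonal state matrix,
    B, C: (T, N) input-dependent projections, delta: (T,) input-dependent step sizes.
    """
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        a_bar = np.exp(delta[t] * A)        # input-dependent discretization of A
        h = a_bar * h + delta[t] * B[t] * x[t]
        y[t] = C[t] @ h
    return y

rng = np.random.default_rng(0)
T, N = 10, 4
A = -np.abs(rng.normal(size=N))             # negative entries keep the state stable
print(selective_scan(rng.normal(size=T), A,
                     rng.normal(size=(T, N)), rng.normal(size=(T, N)),
                     np.abs(rng.normal(size=T))))
</code></pre>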
</div>
</div>
</li>
<li class="paper-item">
<h5>Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning</h5>
Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella <br>
NSDI 2023<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://www.usenix.org/system/files/nsdi23-zheng.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#shockwave-abstract" role="button" aria-expanded="false" aria-controls="shockwave-abstract">Abstract</a>
<a href="https://github.com/uw-mad-dash/shockwave" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="shockwave-abstract">
<div class="card card-body">
Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training: Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy. However, existing ML cluster schedulers are not designed to handle dynamic adaptation.