%%
%% This is file `sample-sigconf-authordraft.tex',
%% generated with the docstrip utility.
%%
%% The original source files were:
%%
%% samples.dtx (with options: `all,proceedings,bibtex,authordraft')
%%
%% IMPORTANT NOTICE:
%%
%% For the copyright see the source file.
%%
%% Any modified versions of this file must be renamed
%% with new filenames distinct from sample-sigconf-authordraft.tex.
%%
%% For distribution of the original source see the terms
%% for copying and modification in the file samples.dtx.
%%
%% This generated file may be distributed as long as the
%% original source files, as listed above, are part of the
%% same distribution. (The sources need not necessarily be
%% in the same archive or directory.)
%%
%%
%% Commands for TeXCount
%TC:macro \cite [option:text,text]
%TC:macro \citep [option:text,text]
%TC:macro \citet [option:text,text]
%TC:envir table 0 1
%TC:envir table* 0 1
%TC:envir tabular [ignore] word
%TC:envir displaymath 0 word
%TC:envir math 0 word
%TC:envir comment 0 0
%%
%% The first command in your LaTeX source must be the \documentclass
%% command.
%%
%% For submission and review of your manuscript please change the
%% command to \documentclass[manuscript, screen, review]{acmart}.
%%
%% When submitting camera ready or to TAPS, please change the command
%% to \documentclass[sigconf]{acmart} or whichever template is required
%% for your publication.
%%
%%
% \documentclass[sigconf,authordraft]{acmart}
\documentclass[sigconf,authordraft,anonymous]{acmart}
% \documentclass[manuscript, screen, review]{acmart}
%%
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{%
\providecommand\BibTeX{{%
Bib\TeX}}}
%% Rights management information.
%% Replace with the values provided when you complete the rights form.
\setcopyright{acmlicensed}
\copyrightyear{2025}
\acmYear{2025}
\acmDOI{XXXXXXX.XXXXXXX}
%% These commands are for a PROCEEDINGS abstract or paper.
\acmConference[CODS YRS '25]{the 13th International Conference on Data Science: Young Researchers' Symposium (CODS 2025)}{December 17--20,
2025}{IISER Pune, India}
%% Uncomment if the proceedings title is different
%% \acmBooktitle{CODS YRS '25: 13th International Conference on Data Science – Young Researchers' Symposium,
%% December 17--20, 2025, IISER Pune, India}
\acmISBN{978-1-4503-XXXX-X/25/12}
%%
%% Submission ID.
%% Use this when submitting an article to a sponsored event. You'll
%% receive a unique submission ID from the organizers
%% of the event, and this ID should be used as the parameter to this command.
%%\acmSubmissionID{123-A56-BU3}
%%
%% For managing citations, it is recommended to use bibliography
%% files in BibTeX format.
%%
%% You can then either use BibTeX with the ACM-Reference-Format style,
%% or BibLaTeX with the acmnumeric or acmauthoryear styles, which include
%% support for advanced citation of software artefacts from the
%% biblatex-software package, also separately available on CTAN.
%%
%% Look at the sample-*-biblatex.tex files for templates showcasing
%% the biblatex styles.
%%
%%
%% The majority of ACM publications use numbered citations and
%% references. The command \citestyle{authoryear} switches to the
%% "author year" style.
%%
%% If you are preparing content for an event
%% sponsored by ACM SIGGRAPH, you must use the "author year" style of
%% citations and references.
%% Uncommenting
%% the next command will enable that style.
%%\citestyle{acmauthoryear}
%%
%% end of the preamble, start of the body of the document source.
\begin{document}
%%
%% The "title" command has an optional parameter,
%% allowing the author to define a "short title" to be used in page headers.
\title{Digital-Persona QAR: A Dataset and Pipeline for Financial Question--Answer--Reasoning from Unstructured Documents}
%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
%% Of note is the shared affiliation of the first two authors, and the
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Ben Trovato}
\authornote{Both authors contributed equally to this research.}
\email{trovato@corporation.com}
\author{G.K.M. Tobin}
\authornotemark[1]
\email{webmaster@marysville-ohio.com}
\affiliation{%
\institution{Institute for Clarity in Documentation}
\city{Dublin}
\state{Ohio}
\country{USA}
}
%%
%% By default, the full list of authors will be used in the page
%% headers. Often, this list is too long, and will overlap
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{Trovato et al.}
%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
We present \textbf{Digital-Persona QAR}, a dataset and pipeline that converts unstructured financial PDFs into structured \emph{Question--Answer--Reasoning (QAR)} triples for domain-specific LLM training. The pipeline performs GPU-accelerated OCR, progressive long-context accumulation (up to $\sim$990K tokens), density-aware question generation, and rationale synthesis with page traceability. Using QLoRA, we fine-tune an 8B LLM into a compact \emph{financial analyst} model. In a blind expert study, the fine-tuned model is preferred in \textbf{80\%} of head-to-head answers and improves ROUGE-L by \textbf{+25\%} on unseen questions. This extended abstract details dataset design, pipeline, statistics, and validation, positioning Digital-Persona QAR as a reproducible resource for financial document understanding.
\end{abstract}
%%
%% The code below is generated by the tool at http://dl.acm.org/ccs.cfm.
%% Please copy and paste the code instead of the example below.
%%
\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10002951.10003317.10003318</concept_id>
<concept_desc>Information systems~Document representation</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10010147.10010178.10010187</concept_id>
<concept_desc>Computing methodologies~Natural language processing</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10002951.10003260.10003261</concept_id>
<concept_desc>Information systems~Data mining</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[300]{Information systems~Document representation}
\ccsdesc[300]{Computing methodologies~Natural language processing}
\ccsdesc[300]{Information systems~Data mining}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{dataset, financial QA, long-context LLMs, OCR, LoRA/QLoRA, reasoning}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle
\renewcommand\footnotetextcopyrightpermission[1]{}
\footnotetext{
Dataset: \url{https://drive.google.com/drive/folders/1KjHJuadd6VclbZd-yAIUBjlw0VWJi96S?usp=sharing} \\
Anonymized code: \url{http://anonymous.4open.science/r/Digital-Persona}
}
% ---------- Body (≤2 pages excluding references) ----------
\section{Introduction}
Financial documents such as annual reports, earnings calls, and regulatory filings contain dense, high-stakes information. They are lengthy, semi-structured, and often interleave narrative, tables, and figures~\cite{Meirkulov2024,FINRA2023}. For data science and NLP, these documents present two challenges: (\emph{i}) extracting text reliably despite noisy layouts, and (\emph{ii}) converting unstructured narratives into training-ready supervision for large language models (LLMs)~\cite{Feng2021}. While recent advances in LLMs have unlocked strong reasoning ability, their performance on specialized financial tasks remains limited without domain-specific fine-tuning~\cite{Jeong2024}.

Supervised datasets are essential for domain adaptation, but financial QA corpora remain scarce and narrow in scope~\cite{Chen2023FinTextQA}. FinQA~\cite{chen2022finqadatasetnumericalreasoning} is a seminal dataset focusing on numerical reasoning, pairing financial text with expert-written Q\&A and programs. Yet FinQA and similar benchmarks capture only a subset of financial reasoning (primarily numerical). They omit broader analytical dimensions such as qualitative assessment of business strategy, risk identification, or contextual interpretation of market behavior~\cite{Mateega2025}.

To address this gap, we introduce \textbf{Digital-Persona QAR}, a system and dataset designed to generate \emph{Question--Answer--Reasoning (QAR)} triples directly from unstructured financial PDFs. Unlike prior datasets, our pipeline creates diverse financial questions, institutional-grade answers, and explicit reasoning chains, all linked to page-level provenance. The dataset is scalable, reproducible, and designed to train or evaluate LLMs in financial analysis beyond numerical reasoning.

Our contributions are:
\begin{itemize}
\item \textbf{A novel automated pipeline} that ingests unstructured PDFs, performs GPU-accelerated OCR, manages progressive long-context accumulation (up to $\sim$990K tokens), and generates density-aware questions with professional-grade answers and rationales.
\item \textbf{A new dataset schema} for financial QAR triples with full provenance and reasoning transparency, covering valuation, risk, market behavior, and strategy.
\item \textbf{Domain adaptation results} showing that fine-tuning an 8B LLaMA-family model with LoRA and 4-bit quantization (QLoRA~\cite{dettmers2023qloraefficientfinetuningquantized}) yields a compact \emph{financial analyst} model. In blind expert review, the fine-tuned model is preferred in 80\% of cases over the base model, with +25\% ROUGE-L improvement against reference rationales.
\item \textbf{Public release} of the dataset and scripts to foster reproducibility and downstream research in financial NLP.
\end{itemize}
\section{Related Work}
\paragraph{Financial QA datasets.}
FinQA~\cite{chen2022finqadatasetnumericalreasoning} pioneered QA for numerical reasoning. TAT-QA and ConvFinQA extended this line to table-based and conversational reasoning~\cite{Zhu2021TATQA,Chen2022ConvFinQA}, but all remain focused on numeric operations. None provide narrative-rich reasoning chains~\cite{Reddy2024DocFinQA}. Digital-Persona QAR complements these by targeting qualitative and mixed reasoning.
\paragraph{Domain adaptation of LLMs.}
Parameter-efficient fine-tuning (e.g., QLoRA~\cite{dettmers2023qloraefficientfinetuningquantized}) enables domain-specific adaptation on modest hardware. We adapt an 8B LLaMA-family model on our dataset, showing strong improvements in preference and grounding.
\paragraph{Long-context modeling.}
Handling financial documents requires reasoning across hundreds of pages. Recent long-context LLMs partially address this, but pipelines for dataset generation remain underexplored. Digital-Persona QAR introduces \emph{progressive context accumulation}, mirroring how analysts synthesize insights across reports.

In summary, prior work covers slices of the problem, while we combine (1) long-context ingestion, (2) scalable QAR generation, and (3) efficient domain adaptation into a reproducible framework for financial document understanding.
\section{Motivation \& Contribution}
Financial reports (annual reports, call transcripts, filings) contain dense domain knowledge but are difficult to exploit directly due to layout noise, length, and specialized jargon. LLMs benefit from curated, domain-specific supervision, yet financial QA datasets with transparent rationales are limited. Existing datasets such as FinQA~\cite{chen2022finqadatasetnumericalreasoning} provide expert-written questions and numerical reasoning chains, but focus narrowly on quantitative reasoning. They do not capture broader strategic or qualitative analysis. \textbf{Digital-Persona QAR} contributes:
\begin{itemize}
\item \textbf{Automated dataset creation} from raw PDFs into QAR triples spanning valuation, risk, market behavior, and strategy.
\item \textbf{Progressive long-context} processing that preserves cross-page dependencies and enables later questions to reference earlier context.
\item \textbf{Training-ready schema} with per-sample provenance (page IDs, token counts) for auditing, reproducibility, and explainability.
\item \textbf{Efficient domain adaptation} using LoRA with 4-bit quantization (QLoRA)~\cite{dettmers2023qloraefficientfinetuningquantized}, enabling fine-tuning of an 8B model on a single 24\,GB GPU with performance comparable to full-precision fine-tuning.
\end{itemize}
\section{Pipeline \& Dataset Design}
\paragraph{OCR \& Text Assembly.}
PDFs are rasterized and processed with GPU-accelerated EasyOCR at the line level, then aggregated into page text. We maintain $\{\texttt{page\_id} \rightarrow \texttt{text}\}$ mappings for traceability.
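
A minimal sketch of this stage (the \texttt{easyocr} and \texttt{pdf2image} calls are real library APIs; the wrapper function is illustrative, not our exact implementation):
\begin{verbatim}
# Rasterize a PDF, run GPU OCR per page, and keep a
# {page_id -> text} map for provenance.
import numpy as np
import easyocr
from pdf2image import convert_from_path

reader = easyocr.Reader(["en"], gpu=True)

def ocr_pdf(path):
    pages = {}
    images = convert_from_path(path)
    for page_id, image in enumerate(images, start=1):
        # detail=0 returns recognized line strings only
        lines = reader.readtext(np.array(image), detail=0)
        pages[page_id] = "\n".join(lines)
    return pages
\end{verbatim}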
\paragraph{Progressive Context.}
Text is appended page-by-page into a cumulative buffer, tracked with \texttt{tiktoken}. Once the cap ($\sim$990K tokens) is reached, older segments are truncated to retain recent context.
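
The buffer logic is sketched below; the \texttt{cl100k\_base} encoding is an assumption for illustration, and the actual tokenizer may differ:
\begin{verbatim}
# Append pages, count tokens with tiktoken, and drop
# the oldest tokens once the cap is exceeded.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 990_000  # ~990K-token cap

def append_page(buffer_tokens, page_text):
    buffer_tokens.extend(ENC.encode(page_text))
    overflow = len(buffer_tokens) - MAX_TOKENS
    if overflow > 0:
        del buffer_tokens[:overflow]  # keep recent context
    return ENC.decode(buffer_tokens)
\end{verbatim}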
\paragraph{Density-aware Questioning.}
For each page, $\max(1, \lfloor \texttt{page\_tokens}/100 \rfloor)$ questions are generated, yielding proportional coverage of dense sections. Questions are capped at 50 words.
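
The budget rule is direct to implement:
\begin{verbatim}
def question_budget(page_tokens: int) -> int:
    # max(1, floor(page_tokens / 100)) questions per page
    return max(1, page_tokens // 100)
\end{verbatim}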
\paragraph{Answer \& Reasoning.}
Answers (180--250 words) follow a professional template. Reasoning is a cohesive $\sim$400-word explanation beginning \emph{``Based solely on the text:''}.
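
A hypothetical prompt capturing these constraints (only the length targets and the fixed reasoning prefix come from the specification above; the remaining wording is illustrative):
\begin{verbatim}
# Illustrative generation prompt; {context} and
# {question} are filled in per sample.
ANSWER_PROMPT = (
    "Answer in 180-250 words as a professional "
    "financial analyst.\n"
    "Then give a ~400-word rationale beginning "
    "exactly with: 'Based solely on the text:'\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}"
)
\end{verbatim}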
\paragraph{Schema.}
Each record stores: \texttt{page\_number}, \texttt{tokens}, \texttt{questions}, \texttt{context\_tokens}, \texttt{question}, \texttt{answer}, \texttt{reasoning}.
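
An illustrative record (field names follow the schema above; all values are placeholders):
\begin{verbatim}
record = {
    "page_number": 12,
    "tokens": 412,            # OCR tokens on this page
    "questions": 4,           # per-page question budget
    "context_tokens": 51230,  # cumulative context size
    "question": "How does management justify the "
                "capital expenditure increase?",
    "answer": "...",          # 180-250 word answer
    "reasoning": "Based solely on the text: ...",
}
\end{verbatim}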
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{workflow.png}
\caption{Digital-Persona QAR pipeline: OCR $\rightarrow$ progressive context $\rightarrow$ density-aware Q generation $\rightarrow$ answer \& reasoning $\rightarrow$ packaged QAR dataset.}
\label{fig:workflow}
\end{figure}
\section{Dataset Snapshot \& Statistics}
On a representative financial text ($\sim$100 pages), the pipeline generated $\sim$450 QAR samples (mean 4.5 per page; the per-page count varies with token density). Themes were balanced across valuation (27\%), risk (25\%), market behavior (24\%), and strategy (24\%). Each sample links back to its source page for auditing.
\section{Evaluation \& Results}
We fine-tune a LLaMA-family 8B model with \textbf{LoRA} (rank 32, dropout 0.05) and \textbf{4-bit} NF4 quantization. Training used batch size 16, LR $2\!\times\!10^{-4}$, and sequence length 512.
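
A sketch of this setup with \texttt{peft} and \texttt{bitsandbytes}; the rank, dropout, and NF4 settings follow the text, while the base checkpoint, \texttt{lora\_alpha}, and target modules are stand-ins:
\begin{verbatim}
import torch
from transformers import (AutoModelForCausalLM,
                          BitsAndBytesConfig)
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder 8B base
    quantization_config=bnb,
)
lora = LoraConfig(
    r=32,               # rank 32, as in the text
    lora_dropout=0.05,  # dropout, as in the text
    lora_alpha=32,      # assumed scaling factor
    target_modules=["q_proj", "k_proj",
                    "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
\end{verbatim}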

\textbf{Expert Study.} A CFA-level reviewer blind-compared base vs.\ fine-tuned answers to 50 unseen questions. The fine-tuned model was preferred in \textbf{80\%} of cases. Cohen's $\kappa=0.72$ indicated substantial agreement.

\textbf{Automatic Metrics.} ROUGE-L improved by \textbf{+25\%}. Latency averaged 1.2\,s per question, indicating practical inference cost at scale.

\textbf{Ablations.} Removing progressive context reduced answer coherence. Without density-aware questioning, coverage of risk-related content dropped by 15\%.
\section{Impact \& Broader Applications}
The QAR dataset enables research in: (1) investment analysis assistants, (2) compliance and regulatory audits, (3) financial education tools, and (4) explainable AI in finance. Its transparent rationales support model interpretability and educational use. Ethical safeguards include limiting data to publicly available reports and encouraging human oversight to mitigate overreliance.
% %%
% %% The acknowledgments section is defined using the "acks" environment
% %% (and NOT an unnumbered section). This ensures the proper
% %% identification of the section in the article metadata, and the
% %% consistent spelling of the heading.
% \begin{acks}
% To Robert, for the bagels and explaining CMYK and color spaces.
% \end{acks}
%%
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{sample-base}
%%
%% If your work has an appendix, this is the place to put it.
\appendix
\end{document}
\endinput
%%
%% End of file `sample-sigconf-authordraft.tex'.