A Transformer-Based Semantic Encoding Framework for Quantitative Analysis of Large-Scale Textual Reviews

Stanujkić, Dragiša

Authors: Darjan Karabašević, Aleksandra Vujko, Vuk Mirčetić, Gabrijela Popović, Dragiša Stanujkić

Journal: Axioms

Volume, no.: 15, 3

ISSN: 2075-1680

DOI: 10.3390/axioms15030175

Pages: 1-25

Link: https://www.mdpi.com/2075-1680/15/3/175

Abstract:

Increasing turbulence in contemporary business environments has made the quantitative analysis of unstructured textual data a central methodological challenge for researchers and decision-makers. The growing availability of large-scale textual data has heightened the need for quantitative frameworks that can transform unstructured language into analyzable numerical representations. Transformer-based language models address this need by encoding text into high-dimensional semantic embeddings. Yet these representations are commonly treated as black-box inputs for downstream tasks, with limited examination of their intrinsic numerical and geometric properties. This research addresses that gap by proposing a quantitative framework for analyzing transformer-based semantic embeddings as high-dimensional metric spaces prior to task-specific modeling. The approach examines the dispersion of vector norms to detect concentration of measure, evaluates the distribution of pairwise cosine similarities between vectors, and applies principal component analysis. The dataset comprises 3034 visitor-generated reviews of national park experiences. Textual inputs are deterministically mapped into a normalized 384-dimensional embedding space using a transformer-based encoder. The analysis examines numerical stability through vector norm dispersion, semantic organization via cosine similarity distributions, variance structure using principal component analysis, and internal organization through unsupervised clustering validity metrics. Because successful clustering requires both high separation between clusters and high cohesion within them, the research proposes a single measure that combines separation and cohesion metrics.
The results show almost perfect norm stability, supporting the choice of angular similarity as the appropriate semantic metric. Variance decomposition and clustering results both indicate a continuous high-dimensional semantic structure with no dominant latent components or clearly separable clusters. These results suggest that semantic meaning is best thought of as a continuous metric space rather than as discrete categories, highlighting the need for representational diagnostics before predictive modeling.
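
The diagnostics named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `embedding_diagnostics`, the coefficient of variation as the dispersion statistic, and the random placeholder matrix are all assumptions made here for illustration, and the clustering validity step is left out because the abstract does not give the formula of the combined measure.

```python
import numpy as np


def embedding_diagnostics(X: np.ndarray) -> dict:
    """Summarize norm dispersion, pairwise cosines, and PCA variance shares for the rows of X."""
    # Norm dispersion: a near-zero coefficient of variation means the vectors
    # lie close to a common sphere, so angles carry the semantic signal.
    norms = np.linalg.norm(X, axis=1)
    norm_cv = float(norms.std() / norms.mean())

    # Pairwise cosine similarities over the distinct pairs (upper triangle).
    unit = X / norms[:, None]
    cos = unit @ unit.T
    upper = cos[np.triu_indices(len(X), k=1)]

    # PCA via SVD of the centered matrix: squared singular values are
    # proportional to the variance captured by each principal component.
    centered = X - X.mean(axis=0)
    singular = np.linalg.svd(centered, compute_uv=False)
    explained = singular**2 / np.sum(singular**2)

    return {
        "norm_cv": norm_cv,
        "cos_mean": float(upper.mean()),
        "cos_std": float(upper.std()),
        "explained_variance_ratio": explained,
    }


# Placeholder data only (assumption): random unit vectors stand in for
# real 384-dimensional review embeddings, which are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))
X /= np.linalg.norm(X, axis=1, keepdims=True)

report = embedding_diagnostics(X)
```

With real embeddings, a first entry of `explained_variance_ratio` that stays small would correspond to the "no dominant latent components" reading, and a `norm_cv` near zero to the norm stability the abstract reports.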

Keywords: transformer-based embeddings; semantic encoding; high-dimensional vector spaces; cosine similarity; principal component analysis; clustering validity; quantitative text analysis; metric space analysis

Attached files:

Darjan Karabasevic, Aleksandra Vujko, Vuk Mircetic, Gabrijela Popovic, Dragisa Stanujkic. 2026 [11857].pdf (size: 1.35 MB, views: 43)
