Application of LLMs and digitization in machine learning for thermomagnetic materials

Rignanese, Gian-Marco; Zhang, Hongbin; Park, Theodore
2025-05-14; 2024
https://hdl.handle.net/2078.2/38929

This thesis investigates the use of large language models (LLMs) and machine learning (ML) to predict new thermomagnetic materials. LLMs can locate relevant plots in journals and papers (Polak and Morgan, 2024) and extract the surrounding context, which improves data collection for ML. Combined with automated digitization, this process allows the creation of a database that is then used for machine learning to predict new thermomagnetic materials. Because the methodology and code can be modified for other desired properties, this research also reduces the work and effort required in data collection for machine learning more generally. Several LLMs, such as ChatGPT (2, 4), Mistral, and Llama (2, 3), are tested, with Llama 3 8B Instruct being the most successful. As a preliminary test, 15 papers in Portable Document Format (PDF) published from 2010 to 2024 were tested, using LLMs to find relevant figures and the context surrounding the plots within each paper. To properly test the LLM recognition, 15 of these papers contain thermomagnetic plots, while 5 of them do not. Additionally, 16 documents without any relevant figures were tested. Relevant plots were digitized, exploring different digitization methods with modified digitization software such as code based on Plot2Spectra, ChartDETE and EasyOCR. While the results are promising, further development is required.

Keywords: Large language model; Digitization; Automated data collection; Machine learning
text::thesis::master thesis
thesis:46936