A collection of my personal projects

DataScout AI: A Retrieval-Augmented Generation (RAG) LLM Application for Website and PDF Querying
DataScout AI is a large language model chatbot application that uses Retrieval-Augmented Generation (RAG) to simplify data retrieval and analysis. Built with LangChain and OpenAI's GPT-3.5 Turbo model, it lets users query uploaded documents (PDF, DOCX, TXT) and URLs in natural language. The app extracts content, splits it into manageable chunks, generates embeddings, and stores them in a FAISS vector index for efficient search and retrieval. When a user asks a question, the app retrieves the relevant chunks from the indexed content and uses them to generate accurate, source-backed answers in real time. Deployed on Streamlit Cloud, DataScout AI is a scalable and user-friendly solution for anyone who needs to extract insights from large volumes of text.
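
The chunk, embed, index, and retrieve steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: word counts stand in for real embeddings, a brute-force cosine search stands in for the FAISS index, and the function names and chunk sizes are hypothetical, not taken from the app.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store each chunk next to its vector (stand-in for a FAISS index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


docs = [
    "FAISS stores dense vectors for similarity search",
    "Streamlit renders the chat interface",
    "Rice is grown in the Mekong Delta",
]
print(retrieve(build_index(docs), "Where is rice grown?", k=1))
```

In a full RAG pipeline, the retrieved chunks would then be passed to the language model as context for the answer.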

Advanced Loan Default Classification
Developed a machine learning model to assist lenders in the SBA loan approval process, focusing on key features such as loan term, issuing bank, and business location. Trained on historical data, the model predicts the likelihood of loan repayment, helping lenders make informed decisions that support responsible lending while increasing access to capital for qualified small businesses. Several models were evaluated, including Logistic Regression, GLM, GBM, XGBoost, LightGBM, Random Forest, and Decision Tree, with the Decision Tree model achieving 89% accuracy and an F1 score of 0.886. The project prioritized the F1 score, which balances precision and recall, to address class imbalance in the data. SHAP was used for model explainability, providing insight into feature importance, which helps streamline loan assessment and reduce lending risk.
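
To illustrate why F1 was preferred over accuracy on imbalanced data, here is a minimal sketch that computes precision, recall, and F1 from scratch. The labels are made-up toy values (1 is assumed to mark a default), not results from the project.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced data: one default among ten loans, and a model
# that always predicts "no default".
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(accuracy(y_true, y_pred))             # high despite missing the default
print(precision_recall_f1(y_true, y_pred))  # F1 exposes the failure
```

A model that never flags a default scores 90% accuracy on this toy data, yet its F1 for the default class is zero.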

Satellite-based model for rice crop mapping and monitoring
This project developed a satellite-based model to map and monitor rice crops in An Giang province, Vietnam, using Sentinel-1 radar and Landsat 8-9 imagery. The model processed these datasets to distinguish between rice and non-rice fields across 600 georeferenced locations. It addresses challenges facing Vietnam's rice industry in the Mekong Delta, which is threatened by land degradation, climate change, and rising sea levels. By combining satellite data and machine learning, the project aims to improve precision agriculture, inform agricultural policy, enhance market intelligence, and optimize water resource management, in support of food security and sustainability in the region.

Movie recommendation system
This project developed a personalized movie recommendation system to address the overwhelming number of choices available on streaming platforms. Using the MovieLens 25M dataset, it explores several recommendation techniques, including content-based filtering, collaborative filtering, and item-item recommendations with k-Nearest Neighbors (kNN). The system analyzes user preferences and movie attributes to provide tailored recommendations, so users spend less time searching and more time enjoying great movies.
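
The item-item kNN idea can be sketched as follows: represent each movie by the ratings it received, then rank other movies by cosine similarity. This is a minimal illustration under assumptions, not the project's implementation. The ratings are invented toy values rather than MovieLens data, and the function names are hypothetical.

```python
import math


def item_vectors(ratings):
    """Turn {user: {movie: rating}} into {movie: {user: rating}}."""
    items = {}
    for user, user_ratings in ratings.items():
        for movie, rating in user_ratings.items():
            items.setdefault(movie, {})[user] = rating
    return items


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_items(ratings, movie, k=2):
    """Return the k movies whose rating patterns are closest to `movie`."""
    items = item_vectors(ratings)
    target = items[movie]
    scores = [(other, cosine(target, vec)) for other, vec in items.items() if other != movie]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy ratings (invented for illustration)
ratings = {
    "u1": {"Alien": 5, "Aliens": 5, "Amelie": 1},
    "u2": {"Alien": 4, "Aliens": 5},
    "u3": {"Amelie": 5, "Alien": 1},
}
print(nearest_items(ratings, "Alien", k=1))
```

At the scale of MovieLens 25M, a sparse matrix and a library nearest-neighbor search would be a more practical choice than nested dictionaries.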

COVID-19 SQL data exploration
This project uses SQL to explore COVID-19 trends in a comprehensive global dataset that includes case counts, vaccination rates, mortality rates, and demographic data. The goal is to identify critical patterns, such as the relationship between vaccination coverage and infection rates, the effectiveness of public health interventions, and the geographic and demographic factors influencing the spread of the virus. Through advanced SQL querying and data manipulation, the project aims to generate actionable insights that can inform public health policies, enhance pandemic response strategies, and deepen our understanding of COVID-19’s impact on different populations.
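
As a rough sketch of the kind of query involved, the example below uses Python's built-in sqlite3 module to compute death and infection percentages per location. The table name, column names, and rows are hypothetical placeholders, not the project's actual schema or real figures.

```python
import sqlite3


def death_rate_by_location(rows):
    """Load rows into an in-memory table and rank locations by death percentage."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE covid ("
        "location TEXT, total_cases INTEGER, total_deaths INTEGER, population INTEGER)"
    )
    conn.executemany("INSERT INTO covid VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT location,
               ROUND(100.0 * total_deaths / total_cases, 2) AS death_pct,
               ROUND(100.0 * total_cases / population, 2) AS infected_pct
        FROM covid
        WHERE total_cases > 0
        ORDER BY death_pct DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Placeholder rows (not real data): location, cases, deaths, population
rows = [
    ("Country A", 1000, 20, 100000),
    ("Country B", 500, 25, 50000),
]
for location, death_pct, infected_pct in death_rate_by_location(rows):
    print(location, death_pct, infected_pct)
```

Multiplying by `100.0` forces floating-point division, since dividing two integer columns in SQLite would otherwise truncate the result.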