Vijay Rama Raju

Data Engineer

Building intelligent data solutions with expertise in machine learning, cloud platforms, and big data technologies. Transforming raw data into actionable insights.

About Me

I am a results-driven data engineer with experience designing, developing, and maintaining robust data pipelines and analytics solutions. My expertise lies in using big data technologies and cloud platforms to turn raw data into actionable insights. I thrive in collaborative environments and am always eager to take on new challenges in the world of data.

Work Experience & Research Contributions

Professional Experience

Software Engineer (Data)

The Clorox Company | San Jose, CA

June 2024 – Present

Developed and maintained dashboards in Power BI and Tableau for finance and operations teams using data from AWS S3
Implemented and optimized data models in SQL and Excel to support regular and ad-hoc financial reports
Curated and consolidated finance data for 30 companies and 300+ products with PySpark, Hadoop, Hive, and EMR
Cleaned structured and unstructured finance data to fine-tune Llama 3.3 for a chatbot that generates automated financial reports, leading to 30% cost savings
Built a web application with React, Node.js, and JavaScript for hosting 12+ in-house dashboards

Research Assistant

San Jose State University Research Foundation | San Jose, CA

Feb 2024 – Jun 2024

Raised coronary blockage detection accuracy to 93% (Dice score on 3 GB of OCT data) by implementing U-Net segmentation and OpenCV preprocessing in Python
Cut per-image inference latency by 60% (profiling logs) by porting compute kernels to CUDA and deploying on AWS G4 instances
Collaborated with a team of three to design an automated OCT coronary artery image processing algorithm using computer vision
Designed an algorithm to calculate the amount of blockage, increasing treatment efficiency by 90%
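
For readers unfamiliar with the Dice score cited above, here is a minimal sketch of how it is commonly computed for binary segmentation masks. The function name, the example masks, and the empty-mask convention are illustrative assumptions, not the code used in the research project.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


# Hypothetical 2x3 masks: 3 predicted pixels, 3 true pixels, 2 overlapping.
predicted_mask = [[1, 1, 0], [0, 1, 0]]
reference_mask = [[1, 0, 0], [0, 1, 1]]
score = dice_score(predicted_mask, reference_mask)  # 2*2 / (3+3)
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no pixels.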

Graduate Teaching Assistant

San Jose State University | San Jose, CA

Jan 2024 – May 2024

Mentored 70 students, graded assignments, and taught concepts in Excel, PL/SQL, DDL, MDX, HiveQL, SparkSQL, Scala, MongoDB, and Neo4j
Taught students to use AWS tools such as Redshift, S3, Glue, EMR, Kinesis, Firehose, Lambda, and IAM for team projects
Planned a research project applying machine learning to data stored in NoSQL databases
Held office hours to answer student questions and strengthen their understanding of data engineering practices

Software Engineer

Inn4Smart Solutions | India

Jan 2022 – July 2023

Analyzed user and operational data across modules with Python and R and presented the results in Power BI dashboards, increasing user adoption by 20%
Performed EDA on resident data to refine ad targeting, boosting engagement by 15%
Built a Snowflake data pipeline processing 10,000+ daily records across features
Optimized ETL workflows using PySpark and Apache Airflow, cutting processing time by 30%
Deployed IoT and NLP analytics for real-time community management, achieving 85% accuracy in usage detection

Contributions to Google Research

This section summarizes contributions by @pvrraju to google/Xee, google/weather-tools, and google-research/arco-era5.

google/Xee

Pull Requests

PR #253 – Implement lazy loading to defer metadata RPCs until data access time
- Added a lazy_load=True parameter so metadata RPCs are deferred until actual data access.
- Improved dataset open performance while keeping backward compatibility.
- Refactored integration tests with time.perf_counter() for accurate timing.
PR #254 – Preallocate tiles numpy
- Replaced nested list comprehensions with np.empty() for tile pre-allocation.
- Updated indexing from tiles[i][j][k] to tiles[i, j, k].
- Improved readability, with potential performance gains.
- Reused the lazy-loading commit from #253 as a base.
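
The change described in PR #254 follows a common NumPy pattern, sketched below under assumptions: the tile contents (`tile_id`) and function names are hypothetical, and only `np.empty()` and the switch from `tiles[i][j][k]` to `tiles[i, j, k]` come from the PR description.

```python
import numpy as np


def tile_id(i, j, k):
    """Hypothetical tile payload: an integer label for each grid position."""
    return i * 100 + j * 10 + k


def build_tiles_nested(shape):
    """Nested-list version, filled with chained indexing tiles[i][j][k]."""
    ni, nj, nk = shape
    tiles = [[[0 for _ in range(nk)] for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                tiles[i][j][k] = tile_id(i, j, k)
    return tiles


def build_tiles_preallocated(shape):
    """Preallocated version, filled with tuple indexing tiles[i, j, k]."""
    ni, nj, nk = shape
    # np.empty does not initialize memory, so every slot must be written.
    tiles = np.empty(shape, dtype=np.int64)
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                tiles[i, j, k] = tile_id(i, j, k)
    return tiles


nested_tiles = build_tiles_nested((2, 3, 4))
array_tiles = build_tiles_preallocated((2, 3, 4))
```

Both versions hold the same values; the array version allocates one contiguous block up front and resolves `tiles[i, j, k]` in a single indexing step instead of three chained list lookups.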

google/weather-tools

Pull Requests

PR #513 – Fix #243: Packaging log warnings
- Problem: MANIFEST.in contained prune test_data, which caused unnecessary log warnings.
- Fix: The first commit commented out the line; the second commit replaced it with global-exclude test_data/* to suppress the warnings while still excluding test data from distributions.
- Resolved issue #243.
- PR merged on Sep 10, 2025.

google-research/arco-era5

Pull Requests

PR #111 – Improve walkthrough notebooks
- Added warnings about regridding time and descriptions for weather event date strings.
- Expanded the authentication docs to explain the difference between gcloud auth login and gcloud auth application-default login.
- Clarified variable origins and chunking, and improved plots with better projections and labels.
- Explained dimensions, added a concluding section titled "Looking Ahead", and reordered variables to match Xarray order.
PR #113 – Improve notebooks (superseded)
- Added hooks to open datasets in Google Colab; improved plotting and documentation.
- Closed in favor of PR #114.
PR #114 – Improve notebooks new
- README: Added "Get Started with Colab Notebooks" section.
- Introduced a table linking to Colab notebooks (Surface Reanalysis, Model Levels).
- Clarified dataset update cadences and metadata for ERA5/ERA5T.
- Improved formatting and user onboarding experience.

Technical Skills

Programming & Scripting

Python • SQL • JavaScript • Java • R • Scala

Big Data & Cloud

AWS (S3, EMR, Glue) • PySpark • Hadoop • Apache Airflow • Kafka • Snowflake

Machine Learning & AI

TensorFlow • PyTorch • Scikit-learn • MLflow • LangGraph • OpenAI API • Computer Vision • NLP • Deep Learning • CUDA • U-Net • OpenCV

Data Engineering Tools

Apache Spark • ETL Pipelines • Data Warehousing • dbt • Fivetran • Kinesis • Lambda • Redshift

Software Engineering

React • Node.js • TypeScript • FastAPI • Docker • Kubernetes • CI/CD • Git

Databases & Visualization

MySQL • MongoDB • Neo4j • Power BI • Tableau • Pandas/NumPy

Featured Projects

🍷 WineIQ Predict – AI-Powered Quality Scoring

Python • Scikit-learn • MLflow • Flask • Docker • AWS • CI/CD

End-to-end MLOps pipeline for wine quality prediction with production-ready deployment. Features automated model training, experiment tracking, and containerized deployment.

150ms inference response time for scoring API
15% accuracy increase using ElasticNet model
Five-stage ML pipeline with data validation
Versioned models with automatic deployment

✈️ AI Travel Planner (Full-Stack LLMOps)

Python • LangGraph • FastAPI • Streamlit • OpenAI API • Docker

Agentic application powered by Large Language Models that creates personalized travel itineraries based on user queries with real-time information retrieval.

Sub-second response times for complex itineraries
LangGraph for tool orchestration and agentic workflows
FastAPI backend with optimized endpoints
Integration with multiple APIs for real-time data

🎬 MovieAssist – Intelligent Film Recommendation

Python • FastAPI • Vector Databases • LLM • Docker • React/TypeScript

Movie recommendation system leveraging vector embeddings and large language models for personalized film suggestions and detailed information.

Vector search for semantic movie matching
Knowledge graph integration for context-aware recommendations
FastAPI backend with efficient caching
Modern React frontend with TypeScript

🔧 Data Engineering with dbt

SQL • dbt • Snowflake • Python • Airflow

Implementation of modern data transformation workflows using dbt with Snowflake, demonstrating best practices for data modeling, testing, and documentation.

Modular data transformations with dbt
Automated data quality testing
Integration with CI/CD workflows
Snowflake-optimized SQL patterns

☁️ OnCloud LLMOps – Cloud-Native LLM Deployment

Python • AWS/GCP • Docker • Kubernetes • LangChain

Framework for deploying and managing Large Language Models in cloud environments with focus on scalability, cost efficiency, and performance monitoring.

Serverless LLM inference endpoints
Auto-scaling based on demand patterns
Cost optimization techniques for LLM deployment
Performance monitoring and analytics

📊 Crime Analysis Dashboard

Python • Pandas • Scikit-learn • Matplotlib • Jupyter Notebook

Comprehensive analysis of crime data to identify patterns, trends, and potential predictive factors with statistical modeling and visualization.

Predictive modeling for crime hotspots
Temporal analysis of crime patterns
Demographic correlation studies
Interactive visualization dashboards

Let's Connect

Feel free to reach out for collaborations, opportunities, or just a friendly chat about data and AI!

pvrraju9996@gmail.com

LinkedIn Profile

GitHub Profile