Vijay Rama Raju

Data Engineer

Building intelligent data solutions with expertise in machine learning, cloud platforms, and big data technologies. Transforming raw data into actionable insights.

About Me

I am a results-driven data engineer with experience designing, building, and maintaining robust data pipelines and analytics solutions. My expertise lies in using big data technologies and cloud platforms to turn raw data into actionable insights. I thrive in collaborative environments and am always eager to take on new challenges in the world of data.

Work Experience & Research Contributions

Professional Experience

Software Engineer (Data)

The Clorox Company | San Jose, CA

June 2024 – Present

  • Developed and maintained dashboards in Power BI and Tableau for finance and operations teams using data from AWS S3
  • Implemented and optimized data models in SQL and Excel to support regular and ad-hoc financial reports
  • Curated and consolidated finance data of 30 companies and 300+ products with PySpark, Hadoop, Hive, and EMR
  • Cleaned structured and unstructured finance data to fine-tune Llama 3.3, creating a chatbot that generates automated financial reports, leading to 30% cost savings
  • Built a web application with React, Node.js, and JavaScript for hosting 12+ in-house dashboards

Research Assistant

San Jose State University Research Foundation | San Jose, CA

Feb 2024 – Jun 2024

  • Raised coronary blockage detection accuracy to 93% (Dice score on 3 GB of OCT data) by implementing U-Net segmentation and OpenCV preprocessing in Python
  • Cut per-image inference latency by 60% (profiling logs) by porting compute kernels to CUDA and deploying on AWS G4 instances
  • Collaborated with a team of three to design an automated OCT coronary artery image processing algorithm using computer vision
  • Designed an algorithm to calculate the amount of blockage, increasing treatment efficiency by 90%
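
For readers unfamiliar with the Dice score cited above, here is a minimal sketch of how it is commonly computed for binary segmentation masks, assuming NumPy arrays. The function name and the toy masks are illustrative only and are not taken from the project's code or data.

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient for two binary masks: 2 * |A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks are empty. Returning 1.0 is a convention chosen here,
        # not something stated in the source.
        return 1.0
    return float(2.0 * intersection / total)


# Toy masks (hypothetical): 1 marks a pixel labeled as blockage.
pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
true_mask = np.array([[1, 0, 0],
                      [0, 1, 1]])

score = dice_score(pred_mask, true_mask)
print(f"Dice score: {score:.3f}")
```

A score of 1.0 means the predicted and reference masks overlap completely; 0.0 means they share no positive pixels.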

Graduate Teaching Assistant

San Jose State University | San Jose, CA

Jan 2024 – May 2024

  • Mentored 70 students, graded assignments, and taught concepts in Excel, PL/SQL, DDL, MDX, HiveQL, SparkSQL, Scala, MongoDB, and Neo4j
  • Taught AWS services such as Redshift, S3, Glue, EMR, Kinesis, Firehose, Lambda, and IAM for team projects
  • Planned a research project applying machine learning to data stored in NoSQL databases
  • Held office hours to clarify student doubts, enhancing their understanding of data engineering practices

Software Engineer

Inn4Smart Solutions | India

Jan 2022 – July 2023

  • Leveraged Python and R to analyze user and operational data across modules via Power BI dashboards, increasing user adoption by 20%
  • Performed EDA on resident data to refine ad targeting, boosting engagement by 15%
  • Built a Snowflake data pipeline processing 10,000+ daily records across features
  • Optimized ETL workflows using PySpark and Apache Airflow, cutting processing time by 30%
  • Deployed IoT and NLP analytics for real-time community management, achieving 85% accuracy in usage detection

Contributions to Google Research

This section summarizes contributions by @pvrraju across google/Xee, google/weather-tools, and google-research/arco-era5.

google/Xee

Pull Requests

  • PR #253 – Implement lazy loading to defer metadata RPCs until data access time
    • Added a lazy_load=True parameter so metadata RPCs are deferred until actual data access.
    • Improved dataset open performance while keeping backward compatibility.
    • Refactored integration tests with time.perf_counter() for accurate timing.
  • PR #254 – Preallocate tiles numpy
    • Replaced nested list comprehensions with np.empty() for tile pre-allocation.
    • Updated indexing from tiles[i][j][k] to tiles[i, j, k].
    • Improved readability, with potential performance gains.
    • Reused the lazy-loading commit from #253 as a base.
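
The pre-allocation change described for PR #254 can be illustrated with a small, generic sketch. This is not the Xee code: the shape, the `dtype=object` choice, and the string tile labels are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical tile grid shape (i, j, k); the real dimensions are not given
# in the text above.
shape = (2, 3, 4)

# Before: nested list comprehensions, indexed as tiles[i][j][k].
nested_tiles = [
    [[None for _ in range(shape[2])] for _ in range(shape[1])]
    for _ in range(shape[0])
]

# After: a single pre-allocated array, indexed as tiles[i, j, k].
# dtype=object is an assumption so that each slot can hold any Python value.
tiles = np.empty(shape, dtype=object)

for i in range(shape[0]):
    for j in range(shape[1]):
        for k in range(shape[2]):
            label = f"tile-{i}-{j}-{k}"  # stand-in for real tile data
            nested_tiles[i][j][k] = label
            tiles[i, j, k] = label

print(tiles.shape)
print(tiles[1, 2, 3])
```

Both containers end up holding the same values. The array form replaces three chained lookups with one tuple index and gives the container a fixed, inspectable shape.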

google/weather-tools

Pull Requests

  • PR #513 – Fix #243: Packaging log warnings
    • Problem: MANIFEST.in contained prune test_data, which caused unnecessary log warnings.
    • Fix: First commit commented out the line; second commit replaced it with global-exclude test_data/* to suppress warnings while excluding test data from distributions.
    • Resolved issue #243.
    • PR merged on Sep 10, 2025.

google-research/arco-era5

Pull Requests

  • PR #111 – Improve walkthrough notebooks
    • Added warnings about regridding time and descriptions for weather event date strings.
    • Expanded authentication docs: differences between gcloud auth login and gcloud auth application-default login.
    • Clarified variable origins, chunking, and improved plots with better projections and labels.
    • Explained dimensions, added a "Looking Ahead" conclusion, and reordered variables to match Xarray order.
  • PR #113 – Improve notebooks (superseded)
    • Added hooks to open datasets in Google Colab; improved plotting and documentation.
    • Closed in favor of PR #114.
  • PR #114 – Improve notebooks new
    • README: Added "Get Started with Colab Notebooks" section.
    • Introduced a table linking to Colab notebooks (Surface Reanalysis, Model Levels).
    • Clarified dataset update cadences and metadata for ERA5/ERA5T.
    • Improved formatting and user onboarding experience.

Technical Skills

Programming & Scripting

Python SQL JavaScript Java R Scala

Big Data & Cloud

AWS (S3, EMR, Glue) PySpark Hadoop Apache Airflow Kafka Snowflake

Machine Learning & AI

TensorFlow PyTorch Scikit-learn MLflow LangGraph OpenAI API Computer Vision NLP Deep Learning CUDA U-Net OpenCV

Data Engineering Tools

Apache Spark ETL Pipelines Data Warehousing dbt Fivetran Kinesis Lambda Redshift

Software Engineering

React Node.js TypeScript FastAPI Docker Kubernetes CI/CD Git

Databases & Visualization

MySQL MongoDB Neo4j Power BI Tableau Pandas/NumPy

Featured Projects

🍷 WineIQ Predict – AI-Powered Quality Scoring

Python • Scikit-learn • MLflow • Flask • Docker • AWS • CI/CD

End-to-end MLOps pipeline for wine quality prediction with production-ready deployment. Features automated model training, experiment tracking, and containerized deployment.

  • 150ms inference response time for scoring API
  • 15% accuracy increase using ElasticNet model
  • Five-stage ML pipeline with data validation
  • Versioned models with automatic deployment

✈️ AI Travel Planner (Full-Stack LLMOps)

Python • LangGraph • FastAPI • Streamlit • OpenAI API • Docker

Agentic application powered by Large Language Models that creates personalized travel itineraries based on user queries with real-time information retrieval.

  • Sub-second response times for complex itineraries
  • LangGraph for tool orchestration and agentic workflows
  • FastAPI backend with optimized endpoints
  • Integration with multiple APIs for real-time data

🎬 MovieAssist - Intelligent Film Recommendation

Python • FastAPI • Vector Databases • LLM • Docker • React/TypeScript

Movie recommendation system leveraging vector embeddings and large language models for personalized film suggestions and detailed information.

  • Vector search for semantic movie matching
  • Knowledge graph integration for context-aware recommendations
  • FastAPI backend with efficient caching
  • Modern React frontend with TypeScript
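
As a rough illustration of the vector search idea listed above, the sketch below ranks items by cosine similarity using NumPy. It is not the project's implementation: the function name, the item labels, and the two-dimensional toy vectors are hypothetical, and a real system would use model-generated embeddings stored in a vector database.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, embeddings: np.ndarray, k: int = 2):
    """Return (index, similarity) pairs for the k rows most similar to query."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q  # cosine similarity, since both sides are unit length
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical catalog with hand-made 2-D "embeddings".
labels = ["space drama", "romantic comedy", "space western"]
catalog = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.8, 0.6],
])
query_vec = np.array([1.0, 0.1])  # stands in for an embedded user query

results = top_k_cosine(query_vec, catalog, k=2)
for idx, sim in results:
    print(f"{labels[idx]}: {sim:.3f}")
```

Normalizing both sides first means the dot product equals the cosine similarity, so ranking reduces to a single matrix-vector product.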

🔧 Data Engineering with dbt

SQL • dbt • Snowflake • Python • Airflow

Implementation of modern data transformation workflows using dbt with Snowflake, demonstrating best practices for data modeling, testing, and documentation.

  • Modular data transformations with dbt
  • Automated data quality testing
  • Integration with CI/CD workflows
  • Snowflake-optimized SQL patterns

☁️ OnCloud LLMOps - Cloud-Native LLM Deployment

Python • AWS/GCP • Docker • Kubernetes • LangChain

Framework for deploying and managing Large Language Models in cloud environments with focus on scalability, cost efficiency, and performance monitoring.

  • Serverless LLM inference endpoints
  • Auto-scaling based on demand patterns
  • Cost optimization techniques for LLM deployment
  • Performance monitoring and analytics

📊 Crime Analysis Dashboard

Python • Pandas • Scikit-learn • Matplotlib • Jupyter Notebook

Comprehensive analysis of crime data to identify patterns, trends, and potential predictive factors with statistical modeling and visualization.

  • Predictive modeling for crime hotspots
  • Temporal analysis of crime patterns
  • Demographic correlation studies
  • Interactive visualization dashboards

Let's Connect

Feel free to reach out for collaborations, opportunities, or just a friendly chat about data and AI!

© 2024 Vijay Rama Raju. All Rights Reserved.