Welcome to the A.N.D Data Science and Machine Learning Internship repository! It contains the two Data Science & Machine Learning projects developed during the internship, which apply predictive modeling and fraud detection to real-world problems.
- Project Objectives
- Dataset Information
- Methodology
- Installation & Dependencies
- How to Run
- Results & Visualizations
- Contributing Guidelines
- License
- Contact Information
This internship consists of two major projects aimed at solving practical business problems:
Used Car Price Prediction: Build a machine learning model that predicts used car prices from features such as:
- Car make, model, and year
- Mileage and condition
- Engine specifications
- Location and market demand
Goal: Help buyers and sellers make informed decisions with accurate price estimations.
Credit Card Fraud Detection: Develop a fraud detection system that identifies fraudulent credit card transactions in real time:
- Analyze transaction patterns
- Detect anomalies and suspicious activities
- Minimize false positives while maximizing fraud detection
Goal: Protect customers and financial institutions from fraudulent transactions.
Used Car Price Prediction dataset:
- Source: Kaggle Used Cars Dataset / Custom Dataset
- Format: CSV with numerical and categorical features
- Key Features:
  - Car Brand, Model, Year
  - Mileage, Engine Size, Fuel Type
  - Transmission Type, Location
  - Selling Price (Target Variable)
- Preprocessing:
  - Handling missing values
  - Encoding categorical variables (One-Hot Encoding, Label Encoding)
  - Feature scaling and normalization
  - Outlier detection and treatment
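The preprocessing steps listed above could be combined into a single scikit-learn pipeline. The following is a minimal sketch, not the repository's actual code: the column names (`brand`, `fuel_type`, `mileage`, `engine_size`, `selling_price`) and the four in-memory rows are hypothetical stand-ins for the real CSV, and outlier treatment is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample rows; the real project would load a CSV instead.
cars = pd.DataFrame({
    "brand": ["BrandA", "BrandB", "BrandA", "BrandB"],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "mileage": [45000.0, 82000.0, np.nan, 30000.0],  # one missing value
    "engine_size": [1.2, 1.6, 1.2, 2.0],
    "selling_price": [350000, 420000, 310000, 780000],
})

numeric_cols = ["mileage", "engine_size"]
categorical_cols = ["brand", "fuel_type"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = cars.drop(columns="selling_price")
y = cars["selling_price"]
X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Keeping the transformations inside one `ColumnTransformer` means the same fitted steps can later be applied to unseen data with `preprocess.transform(...)`.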
Credit Card Fraud Detection dataset:
- Source: Kaggle Credit Card Fraud Dataset / Financial Institution Data
- Format: CSV with anonymized transaction features
- Key Features:
  - Transaction Amount
  - Transaction Time
  - Anonymized Features (V1-V28 from PCA transformation)
  - Class (0 = Legitimate, 1 = Fraudulent)
- Preprocessing:
  - Handling class imbalance (SMOTE, undersampling)
  - Feature scaling
  - Temporal feature engineering
  - Train-test split with stratification
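To illustrate two of the preprocessing steps above, here is a small sketch of a stratified train-test split followed by random undersampling. It runs on synthetic data rather than the real transactions, and the `undersample` helper is a hypothetical name written for this example. SMOTE, the other option named above, comes from the imbalanced-learn package and is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the transaction table: 1,000 rows, 2% fraud.
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# Stratification keeps the fraud rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

def undersample(X, y, rng):
    """Randomly drop legitimate rows until both classes have equal size.

    Assumes class 1 (fraud) is the minority class.
    """
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    kept_legit = rng.choice(legit_idx, size=len(fraud_idx), replace=False)
    keep = np.concatenate([fraud_idx, kept_legit])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Resample the training split only; the test split keeps the original ratio.
X_bal, y_bal = undersample(X_train, y_train, rng)
print(y_train.mean(), y_test.mean(), y_bal.mean())
```

Resampling only the training split matters: if the test split were rebalanced too, the reported metrics would no longer reflect the class ratio the model faces in practice.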
Algorithms Implemented (Used Car Price Prediction):
- Linear Regression: Baseline model for price prediction
- Random Forest Regressor: Ensemble method for improved accuracy
- XGBoost/Gradient Boosting: Advanced boosting techniques
- Support Vector Regression (SVR): For non-linear relationships
Evaluation Metrics:
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score
- Mean Absolute Percentage Error (MAPE)
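For reference, the four regression metrics above can be computed directly with NumPy. This is an illustrative sketch on made-up numbers; `regression_metrics` is a hypothetical helper, not a function from the repository.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2 and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))

    # R^2 compares the model's squared error with that of always predicting the mean.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # MAPE is undefined when any true value is zero.
    mape = np.mean(np.abs(errors / y_true)) * 100.0
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Made-up prices for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = regression_metrics(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are in the same unit as the price, while R² and MAPE are unit-free, which is why the two groups are usually reported together.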
Process:
- Exploratory Data Analysis (EDA)
- Feature engineering and selection
- Model training and hyperparameter tuning
- Cross-validation
- Model evaluation and comparison
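The cross-validation and model comparison steps might look like the sketch below. It uses synthetic data from `make_regression` in place of the car dataset and compares only two of the listed algorithms; the hyperparameters are arbitrary choices for the example, not the project's tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the preprocessed car features.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # scikit-learn reports errors as negative scores, so flip the sign.
    neg_mae = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_absolute_error"
    )
    results[name] = -neg_mae.mean()

# List the models from lowest to highest cross-validated error.
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: mean MAE = {mae:.2f}")
```

Reusing the same `KFold` object for every model means each one is scored on identical folds, so differences in the averages come from the models rather than from the split.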
Algorithms Implemented (Credit Card Fraud Detection):
- Logistic Regression: Baseline classification model
- Random Forest Classifier: Robust ensemble method
- XGBoost: Gradient boosting for imbalanced data
- Neural Networks: Deep learning approach for complex patterns
- Isolation Forest/Autoencoders: Anomaly detection techniques
Evaluation Metrics:
- Precision, Recall, F1-Score
- Confusion Matrix
- ROC-AUC Score
- Precision-Recall Curve
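A short worked example shows why the list above relies on precision, recall and F1 rather than plain accuracy. The labels are made up, and `confusion_counts` and `summarize` are hypothetical helpers written for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count confusion-matrix cells, treating label 1 as fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(tp, fp, fn, tn):
    """Derive precision, recall, F1 and accuracy from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# 98 legitimate transactions followed by 2 fraudulent ones.
y_true = [0] * 98 + [1] * 2

# A "model" that never flags anything.
always_legit = [0] * 100
lazy = summarize(*confusion_counts(y_true, always_legit))

# A model that catches one fraud, misses one, and raises one false alarm.
y_pred = [0] * 97 + [1] + [1, 0]
useful = summarize(*confusion_counts(y_true, y_pred))

print(lazy)
print(useful)
```

Both predictors reach the same accuracy here, but only the second one has non-zero recall, which is the difference that matters for fraud detection.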
Process:
- Understanding class imbalance
- Data preprocessing and scaling
- Handling imbalanced data (SMOTE, class weights)
- Model training with cross-validation
- Threshold optimization
- Performance evaluation
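Threshold optimization, one of the steps above, can be sketched as a simple search over candidate cut-offs. The scores and labels here are invented for illustration, `best_f1_threshold` is a hypothetical helper, and choosing F1 as the objective is an assumption; a real deployment might weight false negatives more heavily.

```python
import numpy as np

def best_f1_threshold(y_true, scores, thresholds):
    """Return the (threshold, F1) pair with the highest F1 among the candidates."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        pred = scores >= t  # flag as fraud when the score reaches the threshold
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = float(t), float(f1)
    return best_threshold, best_f1

# Made-up fraud probabilities from some classifier, with the true labels.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

threshold, f1 = best_f1_threshold(y_true, scores, [0.25, 0.50, 0.65, 0.80])
print(f"best threshold = {threshold}, F1 = {f1:.2f}")
```

The threshold should be tuned on a validation split rather than the test split, otherwise the final metrics are optimistically biased.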
Set up the environment with the following steps:
    git clone https://github.com/BVPKARTHIKEYA/A.N.D-DATASCIENCE-AND-ML-INTERNSHIP.git
    cd A.N.D-DATASCIENCE-AND-ML-INTERNSHIP
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
Required Libraries:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- xgboost
- imbalanced-learn
- jupyter
- tensorflow/keras (optional for neural networks)
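After installation, a quick script can confirm that the required libraries import correctly. This is an optional sketch, not part of the repository. It relies on the fact that scikit-learn is imported as `sklearn` and imbalanced-learn as `imblearn`, and it leaves out Jupyter and the optional TensorFlow/Keras dependency.

```python
from importlib.util import find_spec

# Maps the name used with pip to the module name used with import.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "xgboost": "xgboost",
    "imbalanced-learn": "imblearn",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```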
Used Car Price Prediction:
- Navigate to the project directory:
    cd used_cars_price_prediction
- Run the prediction script:
    python predict.py
- Or launch Jupyter Notebook for interactive analysis:
    jupyter notebook used_cars_analysis.ipynb
- Output: Predictions will be saved in the output/ folder.
Credit Card Fraud Detection:
- Navigate to the project directory:
    cd credit_card_fraud_detection
- Run the fraud detection model:
    python fraud_detection.py
- Or launch Jupyter Notebook:
    jupyter notebook fraud_detection_analysis.ipynb
- Output: Model performance metrics and predictions will be saved in the results/ folder.
Model Performance (Used Car Price Prediction):
- Linear Regression:
  - MAE: ₹45,000
  - RMSE: ₹67,000
  - R² Score: 0.82
- Random Forest Regressor:
  - MAE: ₹32,000
  - RMSE: ₹48,000
  - R² Score: 0.89
- XGBoost:
  - MAE: ₹28,500
  - RMSE: ₹42,000
  - R² Score: 0.92
Key Visualizations:
- Feature importance plots
- Actual vs Predicted price scatter plots
- Residual distribution plots
- Price distribution by car brand and year
Model Performance (Credit Card Fraud Detection):
- Logistic Regression:
  - Precision: 0.85
  - Recall: 0.78
  - F1-Score: 0.81
  - ROC-AUC: 0.92
- Random Forest Classifier:
  - Precision: 0.91
  - Recall: 0.84
  - F1-Score: 0.87
  - ROC-AUC: 0.96
- XGBoost:
  - Precision: 0.93
  - Recall: 0.88
  - F1-Score: 0.90
  - ROC-AUC: 0.97
Key Visualizations:
- Confusion matrix heatmap
- ROC curve comparison
- Precision-Recall curve
- Feature importance for fraud detection
- Transaction amount distribution (legitimate vs fraud)
We welcome contributions to enhance this project!
To contribute:
- Fork the repository
- Create a new branch (git checkout -b feature/your-feature)
- Commit your changes (git commit -m 'Add some feature')
- Push to the branch (git push origin feature/your-feature)
- Open a Pull Request
For detailed guidelines, please check our CONTRIBUTING.md.
For any questions, suggestions, or collaboration opportunities, feel free to reach out:
- Name: Boddeda Venkata Pavan Karthikeya
- Email: sunny.penny041@gmail.com
- LinkedIn: Boddeda Venkata Pavan Karthikeya
- GitHub: BVPKARTHIKEYA
Special thanks to A.N.D for providing this incredible learning opportunity and mentorship throughout the internship!
If you find this repository helpful, please consider giving it a star!