Data-projects-with-R-and-GitHub

Laptop Price Analysis

Jingyi Li 2026-05-12

Topic:

Wrangling and Analyzing Laptop Market Data

Introduction

This project is an exploratory data analysis (EDA) of the price determinants of consumer laptops. Using a raw, uncleaned dataset of approximately 1,300 laptop entries, it aims to quantify how hardware specifications (such as CPU performance, RAM capacity, and storage technology) and brand equity affect the market price of portable computers.

Questions & Hypotheses

  1. Price Distribution: What is the overall distribution of laptop prices in the dataset? Are there any significant outliers or clusters in the price data?
  2. Hardware Specifications and Price: How do different hardware specifications (e.g., CPU performance, RAM size, storage type) correlate with laptop prices? Which specifications are the most influential in determining the price?
  3. Brand Influence: Does the brand of the laptop have a significant impact on its price? Are certain brands consistently priced higher than others, and if so, why?

Hypothesis:

I hypothesize that higher-end hardware specifications (e.g., faster CPUs, larger RAM, SSD storage) and well-known brands will be associated with higher laptop prices. I also expect high-resolution display features (e.g., IPS panels and Retina displays) to contribute more to price variance in the Ultrabook segment than in the standard Notebook segment.

Data Source

Original source: https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset

You can download the dataset directly using this link: uncleaned laptop price dataset (CSV)

Dataset Overview

The Uncleaned Laptop Price dataset is a collection of laptop product listings scraped from an online e-commerce website. The dataset includes information about various laptop models, such as their brand, screen size, processor, memory, storage capacity, operating system, and price. However, the dataset is uncleaned, meaning that it contains missing values, inconsistent formatting, and other errors that need to be addressed before the data can be used for analysis.

| Company | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37kg | 71378.68 |
| Apple | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34kg | 47895.52 |
| HP | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86kg | 30636.00 |
| Apple | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | 1.83kg | 135195.34 |
| Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | 1.37kg | 96095.81 |
| Acer | Notebook | 15.6 | 1366x768 | AMD A9-Series 9420 3GHz | 4GB | 500GB HDD | AMD Radeon R5 | Windows 10 | 2.1kg | 21312.00 |

Column Explanations

To ensure the description is self-contained, here is a short explanation of the core columns I will analyze:

Data Manipulation Goals

  1. Handling Blank Rows & Invalid Symbols (?):
    • Remove Empty Rows: The dataset contains exactly 30 completely blank rows.
    • Filter Invalid Strings: Some rows contain hidden non-numeric "?" symbols in place of values; these must be filtered out.
  2. Feature Extraction & Type Conversion (String to Numeric):
    • Ram Column: Strip the “GB” suffix (e.g., converting “8GB” to “8”) and cast the column from character (chr) to integer (int).
    • Weight Column: Strip the “kg” suffix (e.g., converting “1.37kg” to “1.37”) and cast the column to numeric (dbl).
    • Cpu Column: Extract the processor clock speed in GHz as a continuous numeric variable (e.g., parsing 2.3 out of "Intel Core i5 2.3GHz").
  3. Categorical Consolidation & Engineering:
    • ScreenResolution Column: Create binary logical flags (is_IPS and is_Retina) based on text descriptions, and extract pure pixel dimensions (Width and Height) into separate numerical columns.
    • OpSys Column: Group sparse categories into broader groups (e.g., combining different variants like “Windows 10”, “Windows 10 S”, and “Windows 7” into a unified “Windows” label, and grouping “Mac OS X” with “macOS”) to ensure clear and readable visual distributions.
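
The three groups of steps above could be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the project's final pipeline. The toy data frame `raw` uses three of the sample rows, plus one blank row and one "?" value that I added purely for illustration. The names `is_blank`, `Cpu_GHz`, and `OpSys_group` are hypothetical, and turning "?" into `NA` (rather than dropping the row outright) is only one possible way to handle it.

```r
# Toy stand-in for the raw CSV (illustration only): three sample rows,
# one fully blank row, and one "?" placed in Weight on purpose.
raw <- data.frame(
  Company = c("Apple", "HP", "", "Acer"),
  ScreenResolution = c("IPS Panel Retina Display 2560x1600",
                       "Full HD 1920x1080", "", "1366x768"),
  Cpu = c("Intel Core i5 2.3GHz", "Intel Core i5 7200U 2.5GHz", "",
          "AMD A9-Series 9420 3GHz"),
  Ram = c("8GB", "8GB", "", "4GB"),
  OpSys = c("macOS", "No OS", "", "Windows 10"),
  Weight = c("1.37kg", "?", "", "2.1kg"),
  Price = c(71378.68, 30636.00, NA, 21312.00),
  stringsAsFactors = FALSE
)

# 1. Drop rows in which every field is NA or an empty string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
blank_row <- Reduce(`&`, lapply(raw, is_blank))
laptops <- raw[!blank_row, ]

#    Turn stray "?" strings into NA so they cannot break numeric parsing.
chr_cols <- vapply(laptops, is.character, logical(1))
laptops[chr_cols] <- lapply(laptops[chr_cols], function(x) {
  x[trimws(x) %in% "?"] <- NA
  x
})

# 2. Strip unit suffixes and convert to numeric types.
laptops$Ram <- as.integer(sub("GB$", "", laptops$Ram))
laptops$Weight <- as.numeric(sub("kg$", "", laptops$Weight))
#    Clock speed: the number directly before the trailing "GHz".
laptops$Cpu_GHz <- as.numeric(
  sub("^.*[[:space:]]([0-9.]+)GHz$", "\\1", laptops$Cpu)
)

# 3. Flags and pixel dimensions from ScreenResolution.
laptops$is_IPS <- grepl("IPS", laptops$ScreenResolution, fixed = TRUE)
laptops$is_Retina <- grepl("Retina", laptops$ScreenResolution, fixed = TRUE)
res_pattern <- "^.*?([0-9]+)x([0-9]+)$"
laptops$Width <- as.integer(
  sub(res_pattern, "\\1", laptops$ScreenResolution, perl = TRUE)
)
laptops$Height <- as.integer(
  sub(res_pattern, "\\2", laptops$ScreenResolution, perl = TRUE)
)

#    Collapse OpSys variants; labels outside the two groups stay unchanged.
laptops$OpSys_group <- ifelse(
  grepl("^Windows", laptops$OpSys), "Windows",
  ifelse(laptops$OpSys %in% c("macOS", "Mac OS X"), "macOS", laptops$OpSys)
)
```

Rows that still contain `NA` after these steps can be removed later with `laptops[complete.cases(laptops), ]` if strict filtering is preferred.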

Visualization Goals

Setting Price as the main variable, I will investigate the relations between price and other hardware specifications and categorizations through the following visualizations:

  1. Price vs. Numeric Features (Scatter Plot)
    • Goal: Investigate how continuous numeric variables such as Ram, extracted CPU clock speed, and Weight correlate with Price.
    • Axes: Set Ram, CPU clock speed, or Weight (numeric) on the X-axis and Price on the Y-axis.
    • Overplotting: Apply semi-transparency to handle overlapping points, and overlay a shaded region or density contours (using ggdensity or a similar package) to show where the bulk of the market lies.
  2. Price vs. Categorical Features (Distribution Plot)
    • Goal: Observe the price variance across discrete categories like Company (Brand) and TypeName (Laptop Type).
    • Refinement (Violin over Boxplot): Instead of a simple bar chart or box plot, use a violin plot for each brand/type to show the full probability density and any multi-modality of prices.
  3. Price Distribution Overlap (Ridgeline Plot)
    • Goal: Compare the overall price profile across the most common laptop types (Notebook, Ultrabook, Gaming).
    • Specification: Plot a baseline price histogram for the entire dataset, and overlay a Ridgeline Plot split by TypeName right on top, allowing an immediate visual comparison of price peaks between standard notebooks and premium segments.
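
The plots above could be sketched with ggplot2 roughly as follows. This is a sketch under assumptions, not the final figures. The helper names `plot_price_scatter`, `plot_price_violin`, and `plot_price_ridges` are hypothetical. I use ggplot2's built-in `geom_density_2d()` in place of ggdensity for the contours. The ridgeline helper relies on the separate ggridges package and covers only the per-type ridges, not the baseline histogram. The demo data frame holds just the six sample rows with Ram already parsed, so it is far too small for meaningful densities.

```r
library(ggplot2)

# Scatter plot of Price against one numeric feature, with semi-transparent
# points and 2D density contours marking where most laptops sit.
plot_price_scatter <- function(data, feature = "Ram") {
  ggplot(data, aes(x = .data[[feature]], y = Price)) +
    geom_point(alpha = 0.3) +
    geom_density_2d(colour = "steelblue") +
    labs(x = feature, y = "Price")
}

# Violin plot of Price for each level of a categorical column.
plot_price_violin <- function(data, category = "Company") {
  ggplot(data, aes(x = .data[[category]], y = Price)) +
    geom_violin(fill = "grey80", scale = "width") +
    labs(x = category, y = "Price")
}

# Ridgeline plot of Price split by TypeName (requires ggridges).
plot_price_ridges <- function(data) {
  ggplot(data, aes(x = Price, y = TypeName)) +
    ggridges::geom_density_ridges(alpha = 0.6) +
    labs(x = "Price", y = "Laptop type")
}

# Demo input: the six sample rows only, with Ram already converted to integer.
demo <- data.frame(
  Company = c("Apple", "Apple", "HP", "Apple", "Apple", "Acer"),
  TypeName = c("Ultrabook", "Ultrabook", "Notebook",
               "Ultrabook", "Ultrabook", "Notebook"),
  Ram = c(8L, 8L, 8L, 16L, 8L, 4L),
  Price = c(71378.68, 47895.52, 30636.00, 135195.34, 96095.81, 21312.00)
)

# Build the plot objects; nothing is drawn until they are printed.
p_scatter <- plot_price_scatter(demo, "Ram")
p_violin <- plot_price_violin(demo, "Company")
if (requireNamespace("ggridges", quietly = TRUE)) {
  p_ridges <- plot_price_ridges(demo)
}
```

On the full cleaned dataset, passing "Cpu_GHz" or "Weight" as `feature`, or "TypeName" as `category`, would reuse the same helpers, assuming those columns exist after cleaning.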