Getting Started

This is a Python 3 script that uses Selenium to scrape images and metadata from Trendyol.com. An attribute analysis script is also included; it generates Excel and log files that describe the downloaded data. It can also generate .csv files according to a labelmap, so that the downloaded dataset can easily be used for machine learning.

Installation

To install the packages required to run the scripts, run the following command:

pip install -r requirements.txt

Usage

This repository has three scripts. The main script, which does the scraping, is TrendyolScraper.py.

These are the arguments for TrendyolScraper.py:

  • --url    The URL of the Trendyol search page that will be scraped.

  • --urlsPath    The path to a .txt file that contains the list of URLs to scrape, one URL per line.

  • Note: Exactly one of --url or --urlsPath is REQUIRED.

  • --path    The path of the directory that all the image and .meta files will be downloaded into.

  • --max    Maximum number of images that will be downloaded. No limit by default. OPTIONAL

  • --prefix    A prefix that will be put in front of the names of all downloaded files. Use this if you are going to make multiple downloads into the same directory; otherwise files from the first download will be overwritten. No prefix by default. OPTIONAL

  • -n    Do not download the scraped images. In this mode the script still generates the .meta files.

  • -l    Create a .txt file that lists the URLs of the scraped images. You can later pass this .txt file to download_images.py to download the images to another location without having to scrape again.

Example usage

 python TrendyolScraper.py --url "https://www.trendyol.com/erkek-gomlek-x-g2-c75" --path ./Dataset --max 100 --prefix m

Note: Do not pass a --max argument if you want to download as many images as possible.
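To scrape several searches in one run, --urlsPath expects a text file with one URL per line. The following is a minimal sketch of preparing such a file. The helper name format_urls_file and the output file name urls.txt are illustrative assumptions and not part of this repository; the two URLs are the searches used elsewhere in this README.

```python
from pathlib import Path


def format_urls_file(urls):
    """Return file content with one URL per line.

    Blank entries and duplicates are dropped; the original order is kept.
    (Hypothetical helper, not part of the repository.)
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    searches = [
        "https://www.trendyol.com/kadin-t-shirt-x-g1-c73",
        "https://www.trendyol.com/erkek-t-shirt-x-g2-c73",
    ]
    # "urls.txt" is an assumed file name; any path can be passed to --urlsPath.
    Path("urls.txt").write_text(format_urls_file(searches), encoding="utf-8")
```

The resulting file can then be passed to the scraper in place of --url, for example: python TrendyolScraper.py --urlsPath urls.txt --path ./Dataset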

 

The second script is download_images.py, a tool for efficiently downloading images in bulk from the .txt file generated by the first script (TrendyolScraper.py).

It has three arguments:

  • --file    The path of .txt file that has the formatted list of urls to be downloaded. REQUIRED

  • --dir    The directory that the images will be downloaded into. The default is the directory the script is run from. OPTIONAL

  • --prefix    A prefix that will be put in front of the names of all downloaded files. Use this if you are going to make multiple downloads into the same directory; otherwise files from the first download will be overwritten. No prefix by default. OPTIONAL

Example usage

 python download_images.py --file "m-imageUrls.txt" --dir "./images" --prefix m

 

The third script is attribute_analysis.py, which provides a few tools for interpreting the data that you downloaded.

This script has three modes:

  • -xlsx    The script will generate an Excel file that contains all the attribute categories and attributes found in the metadata of the images in the specified directory, along with the number of images labeled with each attribute.

  • -d    The script will create a .txt file that lists in detail which attributes were labeled for every image file in the specified directory.

  • -csv    The script will generate a .csv file that describes the entire dataset found in the specified directory according to a labelmap file. This mode has to be used together with the --labelmap argument. See the description of --labelmap for a detailed explanation of how to use this mode.

The script also has two arguments:

  • --path    The path of the directory that will be scanned for .meta files. REQUIRED

  • --labelmap    The path to the .json file that contains the labelmap for .csv file

An example labelmap:

{
    "Kol Tipi": { //The exact name of the category as found in the .meta files
        "name": "Sleeve Type", //The name of the category that will be written into the .csv file, you can change this as you want
        "attributes": [ //List of attributes that belong to the category
            {
                "name": "Short Sleeve", //The name of the attribute, you can change this as you want. This is not written into .csv file and is here for postprocessing purposes
                "subattributes": [ //List of the exact names of the attributes as found in the .meta files, if you put multiple names they will be merged into this single attribute
                    "Kısa Kol",
                    "Kısa"
                ]
            }
        ]
    },
    "Renk" : {...}
}

Important note: the "exact names" must match the category and attribute names in the .meta files exactly. To see all of them easily, generate an Excel file by running this script in -xlsx mode.

See example_labelmap.json for a complete labelmap written for a dataset downloaded from the links https://www.trendyol.com/kadin-t-shirt-x-g1-c73 and https://www.trendyol.com/erkek-t-shirt-x-g2-c73
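To make the merging of "subattributes" concrete, here is a minimal sketch of how a labelmap could be turned into a lookup table from exact .meta names to attribute indices. It is an illustration under assumptions, not the code used by attribute_analysis.py: the function build_lookup is hypothetical, and the "Long Sleeve" / "Uzun Kol" entry is an invented second attribute added only so that more than one index appears.

```python
def build_lookup(labelmap):
    """Map (exact category, exact attribute) -> (csv column name, attribute index).

    Every name listed under "subattributes" resolves to the index of its
    parent attribute, which is how several raw names merge into one label.
    (Hypothetical helper, not taken from attribute_analysis.py.)
    """
    lookup = {}
    for raw_category, spec in labelmap.items():
        for index, attribute in enumerate(spec["attributes"]):
            for raw_name in attribute["subattributes"]:
                lookup[(raw_category, raw_name)] = (spec["name"], index)
    return lookup


labelmap = {
    "Kol Tipi": {
        "name": "Sleeve Type",
        "attributes": [
            {"name": "Short Sleeve", "subattributes": ["Kısa Kol", "Kısa"]},
            # Assumed entry, for illustration only.
            {"name": "Long Sleeve", "subattributes": ["Uzun Kol"]},
        ],
    }
}

lookup = build_lookup(labelmap)
print(lookup[("Kol Tipi", "Kısa")])
print(lookup[("Kol Tipi", "Kısa Kol")])
```

Both lookups resolve to the same column and index, because "Kısa" and "Kısa Kol" are listed under the same attribute.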

Example usage

python attribute_analysis.py  --path ./Dataset --labelmap example_labelmap.json

The csv file created with this script may look like this:

File Name,Gender,Color,Sleeve type,Collar Type,Pattern,Material Type,Fit,Style
m-1003_2.jpg,0,3,0,0,0,1,3,0

Numbers like 0 and 3 in the csv correspond to the zero-based index of each attribute in the order given in your labelmap. For example, the first 3 in the row above points to the fourth attribute in the "attributes" list of the color category, which was green in my case.
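The index-to-attribute mapping can be reversed with a few lines of Python. This is a sketch under assumptions rather than part of the repository: decode_row is a hypothetical helper, the colour list (including its Turkish names) is invented so that index 3 is green, and the csv column names are assumed to match the "name" fields of the labelmap exactly.

```python
import csv
import io


def decode_row(row, labelmap):
    """Translate the index columns of one csv row back into attribute names.

    Columns that do not correspond to a labelmap category (such as
    "File Name") are skipped. (Hypothetical helper.)
    """
    by_column = {spec["name"]: spec["attributes"] for spec in labelmap.values()}
    decoded = {}
    for column, value in row.items():
        if column in by_column:
            decoded[column] = by_column[column][int(value)]["name"]
    return decoded


# Assumed labelmap fragment, for illustration only.
labelmap = {
    "Renk": {
        "name": "Color",
        "attributes": [
            {"name": "Black", "subattributes": ["Siyah"]},
            {"name": "White", "subattributes": ["Beyaz"]},
            {"name": "Blue", "subattributes": ["Mavi"]},
            {"name": "Green", "subattributes": ["Yeşil"]},
        ],
    }
}

csv_text = "File Name,Color\nm-1003_2.jpg,3\n"
decoded = [decode_row(row, labelmap) for row in csv.DictReader(io.StringIO(csv_text))]
print(decoded)
```

Here the value 3 in the Color column is read as the fourth entry of the "attributes" list, matching the explanation above.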