This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

First and last values of each variable in the space are sampled 2 times less than the others #1172

Open
JordanJar8 opened this issue Jun 7, 2023 · 0 comments

Comments

@JordanJar8

Hi,

I was testing the different samplers when I noticed that, for every variable in my space, regardless of its type or of the number of samples requested, the first and last values (in order) are chosen about half as often as the others when a Sobol or Latin Hypercube sampler is used. The problem does not occur with uniform random sampling.
I've searched the existing issues, the Internet, and the scikit-optimize source code for an explanation but haven't found one. Do you know what causes this?
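
One mechanism that would produce exactly this pattern, offered as an unverified hypothesis rather than a description of scikit-optimize's internals: if a sampler draws points in the unit interval and maps them to discrete values by scaling and rounding, the first and last values each receive a bin half as wide as the interior ones. The sketch below uses only NumPy; the bounds (0, 44) mirror the 'date' variable, and the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

# Assumption for illustration only: points are drawn uniformly in [0, 1],
# scaled to [low, high], then rounded to the nearest integer.
rng = np.random.RandomState(0)
low, high = 0, 44

u = rng.uniform(0.0, 1.0, size=200_000)
values = np.round(low + u * (high - low)).astype(int)

# Count how often each integer in [low, high] was produced.
counts = np.bincount(values - low, minlength=high - low + 1)

# Compare the two end values against the average interior value.
interior_mean = counts[1:-1].mean()
ratio_first = counts[0] / interior_mean
ratio_last = counts[-1] / interior_mean
print(ratio_first, ratio_last)
```

Under this assumption both ratios come out close to one half, because only values in [0, 0.5) round to 0 and only values in (43.5, 44] round to 44, while every interior integer collects a full unit-wide interval.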

Here is the code that reproduces the problem:

import numpy as np
import pandas as pd
import skopt

# Parameters
MAIN_SEED_VALUE = 8
random_generator = np.random.RandomState(seed=MAIN_SEED_VALUE)

N_EXPERIMENTS = 50

EXPERIMENT_SEED_VALUES = random_generator.randint(0, 1000000, N_EXPERIMENTS)

SEARCH_AREA = {
    'city': [
        'amsterdam', 
        'copenhagen', 
        'madrid', 
        'paris', 
        'rome', 
        'sofia', 
        'valletta', 
        'vienna',
        'vilnius',
    ],
    'date': (0, 44),
    'language': [
        'austrian', 
        'belgian', 
        'bulgarian', 
        'croatian', 
        'cypriot', 
        'czech', 
        'danish', 
        'dutch', 
        'estonian', 
        'finnish', 
        'french', 
        'german', 
        'greek', 
        'hungarian', 
        'irish', 
        'italian', 
        'latvian', 
        'lithuanian', 
        'luxembourgish', 
        'maltese', 
        'polish', 
        'portuguese', 
        'romanian', 
        'slovakian', 
        'slovene', 
        'spanish', 
        'swedish',
    ],
}


# Function
def sample(sampling_strategy: str, area: dict, n_requests: int, random_state: int = None) -> list:
    # Returns a list of n_requests points, each a list with one value per variable in `area`.
    
    if sampling_strategy == 'lhs':
        sampler = skopt.sampler.Lhs(lhs_type='classic', criterion='ratio')
    elif sampling_strategy == 'sobol':
        sampler = skopt.sampler.Sobol()
    elif sampling_strategy == 'random':
        pass
    else:
        # `sampler` is not assigned on this branch, so report the strategy name instead.
        raise ValueError(f'Unknown sampler {sampling_strategy}')
    
    space = skopt.Space(list(area.values()))
    
    if sampling_strategy == 'random':
        return space.rvs(n_requests, random_state)
    else:
        return sampler.generate(space.dimensions, n_requests, random_state)


# Main
sampler_params = []

for seed in EXPERIMENT_SEED_VALUES:
    for sampling_strategy in ['random', 'sobol', 'lhs']:
        n_requests = random_generator.randint(0, 1000)
        params = sample(sampling_strategy, SEARCH_AREA, n_requests, seed)
        sampler_params += [[sampling_strategy, seed, n_requests] + p for p in params]
        
sampler_params = pd.DataFrame(sampler_params, columns=['sampling_strategy', 'seed', 'n_requests', 'city', 'date', 'language'])

for var in ['city', 'date', 'language']:
    stats = sampler_params.groupby(['sampling_strategy', 'seed', 'n_requests', var]).size().unstack(fill_value=0)
    print(stats.div(stats.max(axis=1), axis='rows').groupby(level='sampling_strategy').mean())
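
To make the printed tables easier to read: each row of counts is divided by that row's maximum, so the most frequent value gets 1.0 and a value drawn half as often gets 0.5. A minimal sketch of that normalisation on made-up toy data (the values 'a', 'b', 'c' and their counts are invented for illustration and are not output from the script above):

```python
import pandas as pd

# Toy draws: 'a' and 'c' appear once, 'b' appears twice.
draws = pd.DataFrame({
    'sampling_strategy': ['sobol'] * 4,
    'value': ['a', 'b', 'b', 'c'],
})

# One row per strategy, one column per value, cells hold occurrence counts.
counts = draws.groupby(['sampling_strategy', 'value']).size().unstack(fill_value=0)

# Divide every row by its own maximum count.
relative = counts.div(counts.max(axis=1), axis='rows')
print(relative)
```

In the real output, the 'sobol' and 'lhs' rows show roughly 0.5 in the first and last columns, while the 'random' row does not.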

Thanks for your help,

Jordan
