First and last values of each variable in the space are sampled 2 times less than the others #1172

JordanJar8 · 2023-06-07T08:27:08Z

Hi,

I was testing the different samplers when I noticed that, for every variable in my space, regardless of its type and of the number of samples requested, the first and last values (in order) are chosen about half as often as the others when I use a Sobol or Latin Hypercube sampler. The problem does not occur with uniform random sampling.
I've searched the issues, the Internet, and the scikit-optimize source code for an explanation but haven't found one. Do you know why this happens?

Here is the code that reproduces the problem:

import numpy as np
import pandas as pd
import skopt

# Parameters
MAIN_SEED_VALUE = 8
random_generator = np.random.RandomState(seed=MAIN_SEED_VALUE)

N_EXPERIMENTS = 50

EXPERIMENT_SEED_VALUES = random_generator.randint(0, 1000000, N_EXPERIMENTS)

SEARCH_AREA = {
    'city': [
        'amsterdam', 
        'copenhagen', 
        'madrid', 
        'paris', 
        'rome', 
        'sofia', 
        'valletta', 
        'vienna',
        'vilnius',
    ],
    'date': (0, 44),
    'language': [
        'austrian', 
        'belgian', 
        'bulgarian', 
        'croatian', 
        'cypriot', 
        'czech', 
        'danish', 
        'dutch', 
        'estonian', 
        'finnish', 
        'french', 
        'german', 
        'greek', 
        'hungarian', 
        'irish', 
        'italian', 
        'latvian', 
        'lithuanian', 
        'luxembourgish', 
        'maltese', 
        'polish', 
        'portuguese', 
        'romanian', 
        'slovakian', 
        'slovene', 
        'spanish', 
        'swedish',
    ],
}


# Function
def sample(sampling_strategy: str, area: dict, n_requests: int, random_state: int = None) -> list:
    
    if sampling_strategy == 'lhs':
        sampler = skopt.sampler.Lhs(lhs_type='classic', criterion='ratio')
    elif sampling_strategy == 'sobol':
        sampler = skopt.sampler.Sobol()
    elif sampling_strategy == 'random':
        pass
    else:
        raise ValueError(f'Unknown sampler {sampling_strategy}')
    
    space = skopt.Space(list(area.values()))
    
    if sampling_strategy == 'random':
        return space.rvs(n_requests, random_state)
    else:
        return sampler.generate(space.dimensions, n_requests, random_state)


# Main
sampler_params = []

for seed in EXPERIMENT_SEED_VALUES:
    for sampling_strategy in ['random', 'sobol', 'lhs']:
        n_requests = random_generator.randint(1, 1000)  # request at least one sample
        params = sample(sampling_strategy, SEARCH_AREA, n_requests, seed)
        sampler_params += [[sampling_strategy, seed, n_requests] + p for p in params]
        
sampler_params = pd.DataFrame(sampler_params, columns=['sampling_strategy', 'seed', 'n_requests', 'city', 'date', 'language'])

for var in ['city', 'date', 'language']:
    stats = sampler_params.groupby(['sampling_strategy', 'seed', 'n_requests', var]).size().unstack(fill_value=0)
    print(stats.div(stats.max(axis=1), axis='rows').groupby(level='sampling_strategy').mean())
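
One possible explanation, which I have not confirmed in the scikit-optimize source, is that the Sobol and LHS samplers draw points in the unit interval and then map them to discrete values by rounding. Under that assumption, the first and last values each own only half a bin, so they would be hit about half as often. The sketch below uses only the standard library to show the effect; `round_to_grid` and `value_frequencies` are hypothetical helpers and are not part of skopt.

```python
import random
from collections import Counter


def round_to_grid(u: float, low: int, high: int) -> int:
    """Map u in [0, 1] to an integer in [low, high] by rounding.

    ASSUMPTION: this mimics how a normalized sample might be converted
    back to an integer or categorical index; it is not skopt's actual code.
    """
    return int(round(low + u * (high - low)))


def value_frequencies(n_samples: int, low: int, high: int, seed: int = 0) -> Counter:
    """Count how often each integer value is produced from uniform draws."""
    rng = random.Random(seed)
    return Counter(round_to_grid(rng.random(), low, high) for _ in range(n_samples))


# Nine values (0..8), like the 'city' dimension above.
counts = value_frequencies(100_000, 0, 8)

# Average count of the interior values 1..7.
interior_mean = sum(counts[v] for v in range(1, 8)) / 7

# Both ratios should come out close to 0.5 under the rounding assumption.
edge_ratio_low = counts[0] / interior_mean
edge_ratio_high = counts[8] / interior_mean
print(f"first/interior: {edge_ratio_low:.2f}, last/interior: {edge_ratio_high:.2f}")
```

If this is the cause, the interior values each cover an interval of width 1/8 in the unit interval, while 0 and 8 each cover only 1/16, which would match the factor of two I observe.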

Thanks for your help,

Jordan
