
ontolearn/nces_data_generator/generate_data.py #325

Open
Demirrr opened this issue Dec 5, 2023 · 24 comments
Demirrr (Member) commented Dec 5, 2023

This is a standalone script and it shouldn't be part of ontolearn.

We should have a learning-problem Python module (e.g., ontolearn/lp_generator) that allows us to generate learning problems.

Would you like to take care of ontolearn/lp_generator, @Jean-KOUAGOU?

Jean-KOUAGOU (Collaborator) commented Dec 5, 2023 via email

Demirrr (Member, Author) commented Dec 5, 2023

nces_data_generator is not a Python package or module; rather, it is a folder containing two scripts, i.e.,

  1. https://github.com/dice-group/Ontolearn/blob/master/ontolearn/nces_data_generator/generate_data.py
  2. https://github.com/dice-group/Ontolearn/blob/master/ontolearn/nces_data_generator/helper_classes.py

Ideally, we should have a Python module from which one can import a learning problem generator (say, CustomLPGen from ontolearn/lp_generator) to generate learning problems, e.g.,

from ontolearn.lp_generator import CustomLPGen

gen = CustomLPGen(args)
# a list of learning problems
lps = gen.generate()
# generate a list of learning problems and save them locally
gen.generate_and_save(path)

Can you still do that?

Jean-KOUAGOU (Collaborator) commented Dec 5, 2023 via email

Jean-KOUAGOU (Collaborator) commented Dec 5, 2023 via email

Demirrr (Member, Author) commented Dec 5, 2023

No

  1. lp_generator should be a Python module (https://docs.python.org/3/tutorial/modules.html) containing __init__.py and nces_data_generator.py
  2. nces_data_generator.py implements a Python class called CustomLPGen
  3. __init__.py contains the line from .nces_data_generator import CustomLPGen
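A minimal sketch of the layout described above; the class body below is a hypothetical stub standing in for the real generator, not the NCES implementation:

```python
# Hypothetical package layout:
#
#   ontolearn/
#     lp_generator/
#       __init__.py            # contains: from .nces_data_generator import CustomLPGen
#       nces_data_generator.py # defines CustomLPGen
#
# Minimal stub for nces_data_generator.py (illustrative only):

class CustomLPGen:
    """Stub learning-problem generator sketching the intended interface."""

    def __init__(self, kb_path: str):
        # Path to the knowledge base the generator would refine concepts from.
        self.kb_path = kb_path

    def generate(self):
        # The real implementation would produce a list of learning problems
        # by refining concepts over the knowledge base; the stub returns [].
        return []

    def generate_and_save(self, path: str):
        # The real implementation would serialize the generated problems to `path`.
        lps = self.generate()
        return lps
```

With this layout, `from ontolearn.lp_generator import CustomLPGen` works exactly as requested above.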

If it is still not clear, @alkidbaci, could you take care of it, provided my description makes sense to you?

Jean-KOUAGOU (Collaborator) commented Dec 5, 2023 via email

Jean-KOUAGOU (Collaborator) commented Dec 6, 2023

In which branch should we do this?

Demirrr (Member, Author) commented Dec 6, 2023

Please create one; after merging it into dev, we can delete it.

Jean-KOUAGOU (Collaborator) commented Dec 11, 2023

The dev branch has an issue with owlapy; see the error below:

from ontolearn.lp_generator import LPGen
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 from ontolearn.lp_generator import LPGen

File ~/Documents/Ontolearn/ontolearn/lp_generator/__init__.py:1
----> 1 from .generate_data import LPGen
      2 from .helper_classes import RDFTriples, KB2Data

File ~/Documents/Ontolearn/ontolearn/lp_generator/generate_data.py:1
----> 1 from .helper_classes import RDFTriples, KB2Data
      3 class LPGen:
      4     def __init__(self, kb_path, storage_path=None, depth=5, max_child_length=25, refinement_expressivity=0.6,
      5                  downsample_refinements=True, k=10, num_rand_samples=150, min_num_pos_examples=1):

File ~/Documents/Ontolearn/ontolearn/lp_generator/helper_classes.py:4
      2 import random
      3 from rdflib import graph
----> 4 from ontolearn.knowledge_base import KnowledgeBase
      5 from owlapy.render import DLSyntaxObjectRenderer
      6 from ontolearn.refinement_operators import ExpressRefinement

File ~/Documents/Ontolearn/ontolearn/knowledge_base.py:6
      4 import random
      5 from typing import Iterable, Optional, Callable, overload, Union, FrozenSet, Set, Dict
----> 6 from ontolearn.base import OWLOntology_Owlready2, OWLOntologyManager_Owlready2, OWLReasoner_Owlready2
      7 from ontolearn.base.fast_instance_checker import OWLReasoner_FastInstanceChecker
      8 from owlapy.model import OWLOntologyManager, OWLOntology, OWLReasoner, OWLClassExpression, \
      9     OWLNamedIndividual, OWLObjectProperty, OWLClass, OWLDataProperty, IRI, OWLDataRange, OWLObjectSomeValuesFrom, \
     10     OWLObjectAllValuesFrom, OWLDatatype, BooleanOWLDatatype, NUMERIC_DATATYPES, TIME_DATATYPES, OWLThing, \
     11     OWLObjectPropertyExpression, OWLLiteral, OWLDataPropertyExpression

File ~/Documents/Ontolearn/ontolearn/base/__init__.py:2
      1 """Implementations of owlapy abstract classes based on owlready2."""
----> 2 from owlapy._utils import MOVE
      3 from ontolearn.base._base import OWLOntologyManager_Owlready2, OWLReasoner_Owlready2, \
      4     OWLOntology_Owlready2, BaseReasoner_Owlready2
      5 from ontolearn.base.complex_ce_instances import OWLReasoner_Owlready2_ComplexCEInstances

ModuleNotFoundError: No module named 'owlapy'

Jean-KOUAGOU (Collaborator) commented:

I created the learning problem generator module but I encounter the error above. Would it make sense to start from a different branch?

Demirrr (Member, Author) commented Dec 11, 2023

Your branch is not up to date; the error report shows that a dependency is missing. Please first merge dev into your branch.

Jean-KOUAGOU (Collaborator) commented Dec 11, 2023 via email

Jean-KOUAGOU (Collaborator) commented Dec 11, 2023

Thanks. It runs now. Is the following ok? The generator takes the path to the knowledge base and other optional parameters, including the output path. If the output path is not specified, it stores the generated data at the location of the input knowledge base. Below is an example:

from ontolearn.lp_generator import LPGen
* Owlready2 * Warning: optimized Cython parser module 'owlready2_optimized' is not available, defaulting to slower Python implementation

Warning: SQLite3 version 3.40.0 and 3.41.2 have huge performance regressions; please install version 3.41.1 or 3.42!

lp_gen = LPGen(kb_path="./KGs/Family/family-benchmark_rich_background.owl")
lp_gen.generate()

*** Embedding triples exist ***


############################################################
Started generating data on the family-benchmark_rich_background knowledge base
############################################################

Number of individuals in the knowledge base: 202 

|Thing refinements|:  4760
Size of sample:  150
Refining roots...: 100%|██████████| 150/150 [00:03<00:00, 42.03it/s]
Filtering process...: 100%|██████████| 71790/71790 [00:12<00:00, 5779.63it/s] 
Concepts generation done!

Number of atomic concepts:  18
Longest concept length:  12 

Total number of concepts:  9332 

Data generation completed
Sample examples and save data...: 100%|██████████| 9332/9332 [00:03<00:00, 2335.82it/s]
Data saved at ./KGs/Family/LPs/
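Once generation finishes, the saved data can be loaded back. A self-contained sketch of loading an LPs.json file (the file name mirrors the test shown later in this thread; the toy directory and its contents here are made up for illustration):

```python
import json
import os
import tempfile

# Build a toy storage directory mimicking the generator's output layout.
# The concept names and example IDs below are invented placeholders.
storage_dir = tempfile.mkdtemp()
toy_lps = {
    "Sister": {"positive examples": ["F2F14"], "negative examples": ["F2M13"]},
    "Brother": {"positive examples": ["F2M13"], "negative examples": ["F2F14"]},
}
with open(os.path.join(storage_dir, "LPs.json"), "w") as f:
    json.dump(toy_lps, f)

# Load the learning problems back, as a downstream consumer would.
with open(os.path.join(storage_dir, "LPs.json")) as f:
    lps = json.load(f)

print("Number of learning problems:", len(lps))
```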

Demirrr (Member, Author) commented Dec 11, 2023

Thank you. Looks great.
Could you please add two tests for this class?

  1. One fixes the number of concepts and other parameters if any and creates an assertion to check the generated concepts ( list of objects)

  2. Saves concepts into a specified local file and loads it back.

Jean-KOUAGOU (Collaborator) commented:

Sure

Jean-KOUAGOU (Collaborator) commented:

Actually, we cannot fix the number of concepts to be generated; I intentionally did not set it up this way. We only specify the following hyperparameters, and they determine the number of concepts to be generated. The process is also stochastic unless we fix a random seed, which I did not find necessary for NCES.

kb_path, storage_dir=None, depth=5, max_child_length=25, refinement_expressivity=0.6, downsample_refinements=True, sample_fillers_count=10, num_sub_roots=150, min_num_pos_examples=1

Jean-KOUAGOU (Collaborator) commented:

I can add the test about storing the generated learning problems and loading them. But if we really need a specific number (or at least a maximum number) of learning problems, we can enforce this.

Demirrr (Member, Author) commented Dec 18, 2023

--this process is also stochastic unless we fix a random seed--which I did not find necessary for NCES

We have to ensure that the data-generating process is not random. Perhaps we can fix the random seed for the data generation process by introducing a random_seed=1 parameter.

But if we really need a specific number (or at least the maximum number) of learning problems we can enforce this

Yes please do

Jean-KOUAGOU (Collaborator) commented Dec 18, 2023

But setting max_num_lps to a specific number won't necessarily speed up the generation process. The following parameters are the ones that can reduce the amount of data to generate: depth, refinement_expressivity, sample_fillers_count, num_sub_roots.

max_num_lps will only speed up the part concerned with removing redundant concepts. Is this ok?
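As a rough illustration of that point (a hypothetical helper, not the actual NCES code), a max_num_lps cap can short-circuit the redundancy-removal pass without changing how many raw refinements were produced upstream:

```python
def dedup_and_cap(concepts, max_num_lps):
    """Drop duplicate concept strings, stopping once max_num_lps are kept."""
    seen = set()
    kept = []
    for c in concepts:
        if c in seen:
            continue  # redundant concept, skip it
        seen.add(c)
        kept.append(c)
        if len(kept) >= max_num_lps:
            break  # short-circuit: no need to scan the remaining refinements
    return kept

print(dedup_and_cap(["A", "B", "A", "C", "B", "D"], 3))  # → ['A', 'B', 'C']
```

The upstream refinement step still runs in full, which is why the cap speeds up only the deduplication pass.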

Jean-KOUAGOU (Collaborator) commented Dec 18, 2023

If they (depth, refinement_expressivity, sample_fillers_count, num_sub_roots) are set too low, the actual number of generated learning problems will be less than max_num_lps.

Demirrr (Member, Author) commented Dec 18, 2023

But setting max_num_lps to a specific number won't necessarily speed up the generation process

The efficiency in the data generation process is not the concern here.
The main concern is to fix the randomness.
Hence,

  1. With random_seed=1 and num_concepts=10, 10 concepts should be returned deterministically, i.e., no randomness.
  2. With random_seed=2 and num_concepts=5, 5 concepts should be returned deterministically, i.e., no randomness.
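A minimal sketch of the determinism being asked for, using a dedicated random.Random instance instead of the global RNG (the function and pool below are hypothetical):

```python
import random

def sample_concepts(concept_pool, num_concepts, random_seed=1):
    """Deterministically sample num_concepts items: same seed, same result."""
    rng = random.Random(random_seed)  # local RNG, isolated from other callers
    # Sorting first matters: set iteration order can vary across processes
    # (hash randomization), so seeding alone does not guarantee repeatability.
    return rng.sample(sorted(concept_pool), num_concepts)

pool = {"Sister", "Brother", "Parent", "Grandchild", "Daughter"}
run1 = sample_concepts(pool, 3, random_seed=1)
run2 = sample_concepts(pool, 3, random_seed=1)
assert run1 == run2  # deterministic for a fixed seed
```

The sorting step is one plausible reason that seeding alone did not make runs identical in the discussion below: any iteration over an unordered collection must be made order-stable before sampling.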

Jean-KOUAGOU (Collaborator) commented:

We can return a given number of concepts, but they are not the same ones across different runs despite calling random.seed at the top level of the LPGen class call.

Jean-KOUAGOU (Collaborator) commented Dec 18, 2023

This is how I implemented the test

import unittest
import json
from ontolearn.lp_generator import LPGen
from ontolearn.utils import setup_logging
setup_logging("ontolearn/logging_test.conf")

PATH_FAMILY = 'KGs/Family/family-benchmark_rich_background.owl'
STORAGE_DIR = 'KGs/Family/new_dir'
last_10_concepts_to_generate = ['∃ hasParent.(Parent ⊓ (∃ hasChild.(¬Granddaughter)))',
 '∃ hasParent.(PersonWithASibling ⊓ (∀ hasParent.(¬Granddaughter)))',
 '∃ hasParent.(PersonWithASibling ⊓ (∃ hasSibling.(¬Sister)))',
 '∃ hasParent.(Granddaughter ⊓ (∃ hasChild.(¬Sister)))',
 '∃ hasParent.(Granddaughter ⊓ (∀ hasChild.(¬Sister)))',
 'Sister ⊔ (∃ hasParent.(Child ⊓ (¬Brother)))',
 '∃ hasParent.(Daughter ⊓ (∃ hasSibling.(¬Sister)))',
 'Sister ⊔ (∃ hasParent.(∃ hasSibling.(¬Grandparent)))',
 '∃ hasParent.(Grandmother ⊓ (∀ hasParent.(¬Granddaughter)))',
 'Sister ⊔ (∃ hasParent.(Child ⊓ (¬Grandchild)))']

class LPGen_Test(unittest.TestCase):
    def test_generate_load(self):
        lp_gen = LPGen(kb_path=PATH_FAMILY, storage_dir=STORAGE_DIR)
        lp_gen.generate()
        print("Loading generated data...")
        with open(f"{STORAGE_DIR}/triples/train.txt") as file:
            triples_data = file.readlines()
            print("Number of triples:", len(triples_data))
        with open(f"{STORAGE_DIR}/LPs.json") as file:
            lps = json.load(file)
            print("Number of learning problems:", len(lps))
        self.assertGreaterEqual(lp_gen.lp_gen.max_num_lps, len(lps))
        self.assertEqual(list(lps.keys())[-10:], last_10_concepts_to_generate)

if __name__ == '__main__':
    unittest.main()

Jean-KOUAGOU (Collaborator) commented Dec 18, 2023

Everything else passes except

self.assertEqual(list(lps.keys())[-10:], last_10_concepts_to_generate)
