Improve Clustering of Projects #271

Open
Ly0n opened this issue Dec 14, 2023 · 7 comments
Labels
good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments

@Ly0n
Member

Ly0n commented Dec 14, 2023

It has been shown that a finer and more detailed structure of the subject areas helps many users of the database to find suitable projects. Therefore, any clustering of projects by topics is welcome as long as the clusters are not too small.

Ly0n added the "help wanted" and "good first issue" labels on Dec 14, 2023
@lappemic

Hey @Ly0n, I came across the website and its accompanying repo this morning while browsing for open source projects in sustainability. 😄 I struggled with exactly this point in the project classification. I think you have a really valuable directory here, and clustering it appropriately would make it much more accessible. So I let ChatGPT cluster it, and this is what it came up with. I think it is well organized thanks to the additional cluster level. I also let it sort the list in alphabetical order, which in my opinion helps to find what one is looking for faster.

Energy Technologies

  • Energy Optimization and Management
    • Energy Modeling and Optimization
    • Energy Monitoring and Management
    • Energy System Data Access
  • Energy Storage and Distribution
    • Battery Technology
    • Energy Distribution and Grids
    • Hydrogen Storage
  • Renewable Energy Systems
    • Bioenergy
    • Geothermal Energy
    • Hydro Energy
    • Photovoltaics and Solar Energy
    • Wind Energy

Environmental Management and Conservation

  • Biodiversity and Ecosystems
    • Biodiversity and Species Distribution
    • Biomass
    • Conservation and Restoration
    • Forest Observation and Management
    • Plants and Vegetation
    • Terrestrial Animals
    • Wildfire
  • Climate and Atmosphere
    • Atmospheric Chemistry and Aerosol
    • Atmospheric Composition and Dynamics
    • Atmospheric Dispersion and Transport
    • Climate Change
    • Earth and Climate Modeling
    • Radiative Transfer
  • Natural and Marine Resources
    • Coastal and Reefs
    • Freshwater and Hydrology
    • Marine Life and Fishery
    • Ocean Carbon and Temperature
    • Ocean Circulation Models
    • Ocean and Hydrology Data Access
    • Waves and Currents

Knowledge Sharing and Data Access

  • Knowledge and Data Platforms
    • Curated Lists
    • Data Catalogs and Interfaces
    • Environmental Satellites
    • Knowledge Platforms
    • Taxonomy and Ontology
  • Sustainable Goals and Investments
    • Sustainable Development Goals
    • Sustainable Investment

Sustainable Development and Infrastructure

  • Circular Economy and Resource Management
    • Carbon Capture
    • Carbon Intensity and Accounting
    • Carbon Offsets and Trading
    • Circular Economy and Waste
    • Emission Observation and Modeling
    • Industrial Ecology
    • Life Cycle Assessment
  • Natural Hazard and Environmental Policy
    • Climate Data Access and Visualization
    • Climate Data Processing and Analysis
    • Climate Data Standards
    • Climate Downscaling
    • Integrated Assessment and Climate Policy
    • Natural Hazard and Storm
  • Sustainable Cities and Communities
    • Buildings and Heating
    • Computation and Communication
    • Mobility and Transportation
    • Production and Industry

Water, Land, and Air Management

  • Air Quality and Emissions
    • Air Quality
    • Emissions
  • Hydrosphere and Hydrology
    • Glacier and Ice Sheets
    • Sea Ice
    • Snow and Permafrost
  • Land and Soil Management
    • Soil and Land
    • Sustainable Land Management

What do you think? I would be happy to implement this if you think it might be helpful to you.

PS: I think you have already thought about it, but searchable tag management would be really helpful as well.

@Ly0n
Member Author

Ly0n commented Apr 18, 2024

@lappemic Thank you for taking up this topic. It is true that there is a lot of potential in the clustering, the tagging and the presentation of the projects. The clustering that ChatGPT has done is not bad, but some things are strange, such as splitting climate between "Natural Hazard and Environmental Policy" and "Climate and Atmosphere". The "Hydrosphere and Hydrology" group is also wrong. I see your comment more as an impetus for a discussion on how to sort this better in the future.

In the past, the idea was to cluster the projects based on the READMEs of the projects or using the oneliner. This would give the LLM more contextual information. Do you think this is easy to implement? It would be great to have someone to collaborate on this.

@lappemic

Thanks @Ly0n for the feedback and the opportunity to collaborate on this. Yeah, there are indeed some misclassifications. I thought it was a good starting point for discussion but did not put in much time to properly check every entry, sorry about that.

I like the idea of using the README or the oneliner of each project to give it context for better clustering. I just tried the naive method and pasted everything into the chat: obviously the context window is too short 😅 But we could chop the list up and paste it in parts, or do it programmatically via API calls, always sending just one project and updating e.g. a JSON file. I think it would take ~1 hour to do it via the chat interface and ~3 hours to do it programmatically. As I said, I would like to support you here. What would you suggest?
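To make the programmatic option a bit more concrete, here is a rough sketch of the "one project per call, update a JSON file" loop. It is only an illustration under assumptions: the `name` and `oneliner` keys, the function names and `dummy_classify` (a stand-in for whatever LLM API call ends up being used) are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def classify_projects(
    projects: list[dict],
    classify: Callable[[str, str], str],
    out_path: Path,
) -> dict[str, str]:
    """Label projects one at a time, saving progress after each call."""
    # Resume from a previous run so an interrupted job does not start over.
    results: dict[str, str] = {}
    if out_path.exists():
        results = json.loads(out_path.read_text(encoding="utf-8"))

    for project in projects:
        name = project["name"]  # assumed key, not taken from the real data
        if name in results:
            continue  # already labelled in an earlier run
        results[name] = classify(name, project["oneliner"])
        # Write after every project so nothing is lost if a later call fails.
        out_path.write_text(
            json.dumps(results, indent=2, sort_keys=True), encoding="utf-8"
        )
    return results


def dummy_classify(name: str, oneliner: str) -> str:
    """Hypothetical placeholder for the real LLM API call."""
    return "Wind Energy" if "wind" in oneliner.lower() else "Unsorted"
```

Because the results file is reloaded at the start, running the script again only sends the projects that do not have a label yet, which should help with repeating the process from time to time.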

@Ly0n
Member Author

Ly0n commented Apr 18, 2024

Something programmatic is definitely something we are looking for because we are doing science on the metadata. The data processing should be reproducible so that we can repeat it from time to time. You can find a CSV file of all the READMEs here: https://github.com/protontypes/AwesomeCure/blob/main/csv/projects_with_readme.csv

Creating a consistent set of labels / tags or just some statistics about topics would be super awesome.
If you have some initial code snippets just let me know and I can jump in so we can hack together! 👯‍♂️
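As a possible starting point for the statistics, a minimal sketch that counts projects per label from a CSV and lists the groups that are too small. The column name `category` in the usage note is an assumption, since the actual columns of `projects_with_readme.csv` are not listed in this thread, and both function names are made up for illustration.

```python
import csv
from collections import Counter
from typing import Iterable

# README cells can exceed the csv module's default field size limit.
csv.field_size_limit(10_000_000)


def count_by_column(lines: Iterable[str], column: str) -> Counter:
    """Count how many rows share each value of `column`."""
    reader = csv.DictReader(lines)
    counts: Counter = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "(missing)"] += 1
    return counts


def small_groups(counts: Counter, min_size: int) -> list[str]:
    """Return the names of groups with fewer than `min_size` rows, sorted."""
    return sorted(name for name, n in counts.items() if n < min_size)
```

It could be used roughly like this: open the file with `open(path, newline="", encoding="utf-8")`, pass the file object to `count_by_column(f, "category")`, then call `small_groups(counts, 5)` to see which clusters may be too small to keep.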

@lappemic

lappemic commented May 8, 2024

Hey @Ly0n, very sorry for the very late answer! 🙈

Unfortunately I do not have any snippets at hand or a proper idea of how to proceed without a bigger effort at the moment. I will keep this in mind and come back as soon as I have more resources! Or if you have something concrete to develop or want a short sync, just ping me. I am open to suggestions as well! :)

@Ly0n
Member Author

Ly0n commented May 9, 2024

Without deeper NLP experience it is quite difficult to approach this topic programmatically. You could also manually separate different topics like energy systems into more subtopics.

What I can also offer is support in getting started with NLP. Even some very simple statistics about the wording in the projects' READMEs and descriptions could help us a lot.
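Just to illustrate what "very simple statistics" could mean, here is a small sketch that counts in how many texts each word appears, using only the Python standard library. The stop word list and the function names are placeholders for illustration, not an existing part of the project.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real run would need a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in",
    "for", "is", "on", "with", "this", "that",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep ASCII words of three or more letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_frequency(texts: list[str]) -> Counter:
    """Count in how many texts each word appears at least once."""
    counts: Counter = Counter()
    for text in texts:
        # set() so a word repeated within one text is counted only once
        counts.update(set(tokenize(text)))
    return counts
```

Calling `document_frequency(readmes).most_common(50)` on a list of README strings would give a first impression of which terms are shared by many projects and might work as labels.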

Some ideas and code snippets on how to get started can be found here:
#145

If you want to chat about this in person please contact me at tobias.augspurger@protontypes.eu.

@lappemic

Just sent you an email.


2 participants