Should DataFrame really be a NDArray child? #4

Open
dotChris90 opened this issue Nov 5, 2018 · 13 comments
Labels
further discuss (need further discuss to find the best solution)

Comments

@dotChris90
Member

dotChris90 commented Nov 5, 2018

Sorry for already raising an issue while everything is still under construction >.<.

We should not make DataFrame a child of NDArray.
In pandas, DataFrame is a child of a general pandas object and has no inheritance connection to ndarray.
I think we will face the same problems if we inherit from NDArray.

  • A data frame is a collection of multiple NDArrays, so it is not an array itself
  • Its behaviour differs from an array's, especially in indexing
  • Frame indexing works like a dictionary: df['column1'] or df['col2']
  • Honestly, I think we need a dictionary object as a property so we can store the different arrays
  • Indexing into the dataframe must be via strings, like in a dictionary (not sure if that is possible, but I think it must be)
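
The points above can be sketched roughly as follows. This is a minimal illustration in Python (chosen for brevity, standard library only), not code from this project: the class name `ColumnFrame` and its members are hypothetical, and plain lists stand in for NDArrays.

```python
class ColumnFrame:
    """Hypothetical frame that *holds* columns instead of inheriting from an array."""

    def __init__(self, columns):
        # The "dictionary as a property" idea: column name -> column values.
        # Plain lists stand in for NDArrays in this sketch.
        self._columns = {name: list(values) for name, values in columns.items()}
        lengths = {len(values) for values in self._columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    def __getitem__(self, name):
        # String indexing, like a dictionary: frame['column1']
        return self._columns[name]

    def __setitem__(self, name, values):
        # Adding or replacing a column is a dictionary assignment.
        self._columns[name] = list(values)

    @property
    def columns(self):
        # Column names in insertion order.
        return list(self._columns)


frame = ColumnFrame({"column1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})
first_column = frame["column1"]
```

The frame here has no array base class at all; array behaviour stays on the individual columns, and the frame only maps names to them.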
@Oceania2018 added the "further discuss" label (need further discuss to find the best solution) Nov 5, 2018
@Oceania2018
Member

What do you think of inheriting from DynamicObject?

@dotChris90
Member Author

dotChris90 commented Nov 6, 2018

I am not sure. DynamicObject gives a class a lot of flexibility.

Maybe it should not inherit from anything at the moment.

I think in the future it could implement something like IEnumerable, so we get something like index/column pairs.

But for a start, maybe it should not inherit from anything, and we will see while implementing whether we need inheritance.

But if you have already found a reason for inheritance, let me know :)

@dotChris90
Member Author

@Oceania2018 if you do not mind, I will add a branch to play around with the data frame, just because the frame class is the most important class in pandas.

@Oceania2018
Member

@dotChris90 Go ahead. DataFrame is critical. I will take a look at your experiment.

@dotChris90
Member Author

@Oceania2018 haha, ah! Now I get your point about why we need DynamicObject! I just saw that each dataframe object has some properties that are dynamic: if the columns are A, B, C, D, it will have properties A, B, C, D. OK, yes, I totally agree with you now.
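
As a rough illustration of column names becoming properties, here is a small Python sketch. It is only an analogy: Python's `__getattr__` hook plays a role similar to overriding `TryGetMember` on a C# `DynamicObject`, and the class name `DynamicFrame` is hypothetical, not part of this project.

```python
class DynamicFrame:
    """Hypothetical frame whose column names can be read as attributes."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes such as _columns are never routed through here.
        try:
            return self.__dict__["_columns"][name]
        except KeyError:
            raise AttributeError(name) from None


df = DynamicFrame({"A": [1, 2], "B": [3, 4]})
column_a = df.A  # resolved at run time from the column dictionary
```

The name is looked up when the code runs, so a misspelled column name is only detected at that point, not by a compiler or editor.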

@Oceania2018
Member

But the downside is that we won't get strong-typing hints once we inherit from DynamicObject.

@dotChris90
Member Author

@Oceania2018 that's true. In the end I experimented a little bit without DynamicObject. Performance counts, since it is the strongest benefit of NumSharp and the related projects (static types are better and faster). ;)

You can still reach the columns with df['column1']

Oceania2018 added a commit that referenced this issue Nov 7, 2018
#4 Changed Dataframe little bit more Pandas style
@VanyTang

VanyTang commented Nov 7, 2018

Hi,
Great idea and promising project!
It is very meaningful for people who are familiar with the pandas API and want to use C# to do data analysis.
I have been watching this project since I found it, and I hope I can contribute some code in the future.

Here is some advice for this issue:

Actually, a pandas DataFrame can store columns of different types.
So I think it may not be appropriate to define a TData generic type for the DataFrame class, nor to use one whole NDArray as the internal data container of the DataFrame.

There are two libraries that could serve as references:

  1. Deedle only defines <TRowKey, TColumnKey> as generic types, with no TData. It implements most of the common operations for DataFrame and Series.
  2. Machinelearning_DataFrame stores data of different types in different data chunks (a List of IDataColumn in that project; we could use NDArray instead). As far as I know, pandas also stores data like this.
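
The per-column storage idea could look roughly like the sketch below. It is written in Python with only the standard library and is an illustration under assumptions, not the design of Deedle, Machinelearning_DataFrame, or pandas: `TypedColumn` and `MixedFrame` are hypothetical names, and `array.array` stands in for a typed NDArray.

```python
from array import array


class TypedColumn:
    """Hypothetical column: one name, one element type, homogeneous storage."""

    def __init__(self, name, typecode, values):
        self.name = name
        # array.array enforces a single element type per column
        # ('q' = signed 64-bit integer, 'd' = double-precision float).
        self.data = array(typecode, values)


class MixedFrame:
    """Hypothetical frame: each column is typed on its own, the frame is not."""

    def __init__(self, columns):
        self._columns = {column.name: column for column in columns}

    def __getitem__(self, name):
        return self._columns[name].data

    def dtypes(self):
        # Column name -> element type code of that column.
        return {name: column.data.typecode for name, column in self._columns.items()}


frame = MixedFrame([
    TypedColumn("id", "q", [1, 2, 3]),
    TypedColumn("price", "d", [9.5, 7.25, 3.0]),
])
types_by_column = frame.dtypes()
```

Each column stays statically typed internally, while the frame itself carries no single data type parameter.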

@dotChris90
Member Author

@VanyTang thanks for the advice. Is this true? O.o Omg, honestly I had no idea that pandas is so dynamic... this explains why its performance is so bad. Before, I was really hoping that the columns at least all have the same data type. Yes, that is actually quite critical information.

@Oceania2018 maybe we should at least look at Deedle and ML.Dataframe, and also at the pandas source code itself. Honestly, I really hate this pandas "my columns can be anything" approach for performance reasons. But we should at least think about the pros and cons again.

@VanyTang by the way, thanks for the kind words. A numerical stack is really something that is missing in the .NET world. Java and Python have been the key languages in this area, but I think it is time to show that we .NET developers are interested in this too. ;)

@Oceania2018
Member

@dotChris90 Deedle is designed for F#, and there are complaints about its performance. Let's do DataFrame<TIndex, TData>. We might add a new type for the Y (label) column, so think about DataFrame<TIndex, Tx, Ty>.

@VanyTang Thanks for your information, and you are welcome to discuss and contribute.

Our goal is to mimic Python pandas in .NET, so that Python machine learning code can be moved to C# with as little effort as possible.

@dotChris90
Member Author

@VanyTang and one more thing to mention. :) It does not matter whether you share code, ideas, discussions, articles, considerations, links, ...

We welcome everybody to share their knowledge. We are dotnet developers, we are open source nerds, we are all just humans, and if we really want to make our dotnet framework great at machine learning and make our ideas and wishes come true, we need every possible suggestion, hint, etc. from every one of you. So please always feel free to post issues and suggestions.

😊

@Oceania2018
Member

Oceania2018 commented Nov 8, 2018

@dotChris90 I think the dynamic column data type is necessary.

@dotChris90
Member Author

dotChris90 commented Nov 8, 2018

Yeah, probably... But I have no idea how we should handle this in a clean way.

It needs some more investigation.


3 participants