
analysis with CGraph #1

Open
ChunelFeng opened this issue Feb 24, 2024 · 33 comments

Comments

@ChunelFeng

Hi, nice to see this repo.

CGraph is a repo with almost the same functionality as scheduling.

Could you please run a performance test against CGraph too, as you did with taskflow?

Thank you very much.

@ChunelFeng
Author

ChunelFeng commented Feb 24, 2024

I am not very sure whether my code for scheduling is at its best,

but there really is some gap between CGraph and scheduling in the linear case on my computer.

Could you please run some benchmarks on your workspace? Thank you.

[screenshots: benchmark timings for scheduling and CGraph]

#include <chrono>
#include <cstdio>
#include <iostream>

#include "CGraph.h"

using namespace CGraph;

unsigned g_node_cnt = 0;
class TestNode : public CGraph::GNode {
public:
    CStatus run() override {
        g_node_cnt++;
        return CStatus();
    }
};

void test_performance() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->makeSerial();
    pipeline->setAutoCheck(false);
    status += pipeline->init();
    {
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
        std::cout << g_node_cnt << std::endl;
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    auto start_ts = std::chrono::steady_clock::now();
    test_performance();

    std::chrono::duration<double, std::milli> span = std::chrono::steady_clock::now() - start_ts;
    printf("time counter is : [%0.2lf] ms \n", span.count());
    return 0;
}

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Hi! Thanks a lot for the reference to CGraph! I didn't know about CGraph earlier; this is a very interesting repository! Please give me some time to take a deeper look and provide benchmarks, I will be very glad to do this.

@ChunelFeng
Author

> Hi! Thanks a lot for the reference to CGraph! I didn't know about CGraph earlier; this is a very interesting repository! Please give me some time to take a deeper look and provide benchmarks, I will be very glad to do this.

Thank you very much

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Haven't had a chance to look into it in detail yet, but here are some comments which I can make now:

  • When you benchmark Scheduling, as I can see on your screenshot, you benchmark not only running the graph one million times, but also creating the graph one million times. When I do the same on my laptop (the one which was used to run the benchmarks described in the readme), I get around 10 seconds for this test. By contrast, in your code for CGraph, it looks to me like you do the initialization routines once, and then only run the already-created pipeline one million times.
  • At this point in time, Scheduling does not actually support running the same graph several times. This is because the cancelled_ flag is set when the task is executed, and there is currently no way to reset it afterwards. So, if I want to replicate your test, I need to create a graph consisting of one million nodes upfront and then run it all at once (the code is below). If I do this, I get somewhere between 110 and 120 ms on my computer. However, yes, this is super memory-inefficient! I will think about how to optimize this use case and will update the library to provide better support for this situation. I will also be very interested in providing a benchmark comparison for Scheduling and CGraph.
    void test() {
      int counter = 0;
      ThreadPool thread_pool;
      std::vector<Task> v(1'000'000);  // super memory-inefficient, but no other options so far
      v[0] = Task([&]{++counter;});
      for (auto i = v.begin(), j = std::next(v.begin()); j != v.end(); ++i, ++j) {
        *j = Task([&]{++counter;});
        j->Succeed(&*i);
      }
      thread_pool.Submit(&v[0]);
      thread_pool.Wait();
      std::cout << counter << std::endl;
    }

Thanks again for your interest in Scheduling and for the reference to CGraph!

@ChunelFeng
Author

Oh no, there is a mistake in your current demo.
I meant to create a DAG as a line, with 32 nodes in it, and run this DAG 1,000,000 times,
so the result should be 32 × 1,000,000,

not to create one node and run that node 1,000,000 times;
your new demo's result seems to be 1 × 1,000,000.

Like you, I think that creating once and then running several times is a common situation for users,
just like opening a file, writing some lines into it, and finally closing it.

I hope scheduling can support more scenarios. Thanks for your excellent code and sharing!

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Yes, I understand what you mean; unfortunately, this use case is currently not supported by scheduling. I will try to add support for it. Thanks a lot for this great and very useful idea! I will post an update here once it's implemented.

I have added a link to CGraph in my readme, I hope you don't mind. I will be glad to study your excellent library in more detail!

@ChunelFeng
Author

> Yes, I understand what you mean; unfortunately, this use case is currently not supported by scheduling. I will try to add support for it. Thanks a lot for this great and very useful idea! I will post an update here once it's implemented.
>
> I have added a link to CGraph in my readme, I hope you don't mind. I will be glad to study your excellent library in more detail!

It is my honor.

@dpuyda
Owner

dpuyda commented Feb 26, 2024

The honor is all mine; you did a great job writing CGraph, and there's a lot to learn from this lib!

@dpuyda dpuyda reopened this Feb 27, 2024
@dpuyda
Owner

dpuyda commented Feb 27, 2024

I have reopened the issue, if you don't mind, since the comparison with CGraph is not done yet. I would like to implement the use case which you suggested and provide benchmarks, so let's keep it open until it's done. Thank you!

@dpuyda
Owner

dpuyda commented Mar 3, 2024

Hi @ChunelFeng! I hope you are doing well! I have a small question about CGraph which I would be glad to discuss with you here when you have a moment. I tried the code snippet which you suggested above and observed that everything happens on the main thread:

[screenshot: threads view showing all work on the main thread]

After a quick look into your code, I noticed that you define whether a task is sync or async based on the timeout value:

CBool GElement::isAsync() const {
    // if timeout != 0, execute asynchronously
    return this->timeout_ != CGRAPH_DEFAULT_ELEMENT_TIMEOUT;
}

I tried setting the timeout for the tasks manually to make them async like below, but this did not seem to be working:

  for (const auto& task : arr) {
    task->setTimeout(1000);
  }

May I ask you kindly if you have specific logic in your code which controls the number of available threads and thread affinity to avoid context switching? I have updated Scheduling to be able to run the same graph multiple times (not merged yet). When I limit the number of threads to one, I obviously get much better results compared to the multi-threaded version, which suffers very heavily from context switching in this use case...

@ChunelFeng
Author

Two mechanisms together make a DAG that is a single line run this way.

First, we use makeSerial() to check whether this DAG can run well in one single thread:

pipeline->makeSerial();

Second, in the code:

https://github.com/ChunelFeng/CGraph/blob/main/src/GraphCtrl/GraphElement/_GEngine/GDynamicEngine/GDynamicEngine.cpp

we use analysisDagType to make sure the linear nodes run in the same thread (the main thread):

analysisDagType(elements);

These two methods only work when the DAG is a line,
so if you compose other types of DAGs, this one-thread path cannot work well.

Also, when you set a node's timeout value, it cannot run in one thread because of the async timeout.

@ChunelFeng
Author

We can simply check whether the DAG can work in one thread:

CStatus status = pipeline->makeSerial();

If it can, the status is OK; otherwise, it cannot work in one thread.

@dpuyda
Owner

dpuyda commented Mar 3, 2024

So does it automatically determine the minimum number of threads that are necessary to run a graph?

@ChunelFeng
Author

Not automatically.

In other cases, you can set the config just like in this example:

https://github.com/ChunelFeng/CGraph/blob/main/tutorial/T01-Simple.cpp

and you can automatically get the minimum thread number with this function:

[screenshot: function for getting the minimum thread count]

@dpuyda
Owner

dpuyda commented Mar 3, 2024

cool, thanks a lot for the explanations!

@dpuyda
Owner

dpuyda commented Mar 3, 2024

yes, now I am able to break your code in the same way as mine is currently broken 😃

[screenshot]

Let me think a little about it before merging my changes, I will let you know here once I'm ready... 😉

@ChunelFeng
Author

What is the tool in your picture?

It seems very useful.

@dpuyda
Owner

dpuyda commented Mar 3, 2024

@dpuyda
Owner

dpuyda commented Mar 8, 2024

Hi @ChunelFeng!
Thanks again for looking into Scheduling and for the suggestions on how it can be improved; your comments are much valued!
I have updated Scheduling to be able to run the same graph multiple times, see #2

Below are a few observations that I can make:

  • If I limit the number of threads to one, with Scheduling I seem to get timings that are comparable to yours. On my computer, with one execution thread, I get up to 1 second with Scheduling and around 1.5 seconds with CGraph.
  • When using 8 worker threads, with Scheduling I usually get around 1.5 seconds (occasionally more if the run happens to be heavily impacted by context switching). I'm still not sure how to configure the CGraph thread pool to use more than one thread to make it comparable with Scheduling. The difference between the default behavior of CGraph and Scheduling is that CGraph always executes everything on the main thread, while with Scheduling, each chain of 32 tasks is executed on a randomly selected thread managed by the thread pool. I think further improvements can be made by executing each chain on the same thread to avoid context switching, but I'm still trying to find the best way to implement this.

Below is the Scheduling code which I used for experiments (you can also find it in unit tests):

  auto counter = 0;
  ThreadPool thread_pool;
  std::vector<Task> tasks(32);
  tasks[0] = Task([&] { ++counter; });
  for (auto i = tasks.begin(), j = std::next(i); j != tasks.end(); ++i, ++j) {
    *j = Task([&] { ++counter; });
    j->Succeed(&*i);
  }
  for (int i = 0; i < 1'000'000; ++i) {
    thread_pool.Submit(&tasks[0]);
    thread_pool.Wait();
  }

@ChunelFeng
Author

In this case, CGraph runs in only one thread because we use

pipeline->makeSerial();

If you remove this line, CGraph will create 8 threads in the background.

[screenshot]

Here is some code to run CGraph with 8 threads in this case:

[screenshot]

Please replace the test-performance-02.cpp file with the following code.

#include "../_Materials/TestInclude.h"

using namespace CGraph;

void test_performance_02() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestAdd1GNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestAdd1GNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->setAutoCheck(false);
    UThreadPoolConfig config;
    config.max_thread_size_ = 8;
    config.primary_thread_busy_epoch_ = 2;
    config.primary_thread_empty_interval_ = 0;
    pipeline->setUniqueThreadPoolConfig(config);

    status += pipeline->init();
    {
        UTimeCounter counter("test_performance_02");
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
    }

    if (32000000 != g_test_node_cnt) {
        std::cout << "test_performance_02: g_test_node_cnt is not right : " << g_test_node_cnt << std::endl;
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    test_performance_02();
    return 0;
}

@dpuyda
Owner

dpuyda commented Mar 9, 2024

Thanks a lot for the code snippet, @ChunelFeng! Now the code runs 8 background threads as expected. I ran it multiple times; on my computer the test usually takes around 2.8 sec. Below is a sample output:

[2024-03-10 00:00:16.454] [test_performance_02]: time counter is : [2858.44] ms

D:\repo\CGraph\build\test\Performance\RelWithDebInfo\test-performance-02.exe (process 18100) exited with code 0.
Press any key to close this window . . .

To run a similar test for Scheduling, you can use the ThreadPoolTest.ResubmitGraph unit test. Test duration on my computer mostly varies between 1.5 and 1.9 sec. The timing is much less stable compared to CGraph; I guess the variation comes from the random context-switching issue which I described above, and I'm still not sure how to fix it. Below is a sample output:

Running main() from D:\repo\scheduling\build\tests\googletest-src\googletest\src\gtest_main.cc
Note: Google Test filter = ThreadPoolTest.ResubmitGraph
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ThreadPoolTest
[ RUN      ] ThreadPoolTest.ResubmitGraph
[       OK ] ThreadPoolTest.ResubmitGraph (1516 ms)
[----------] 1 test from ThreadPoolTest (1517 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1519 ms total)
[  PASSED  ] 1 test.

D:\repo\scheduling\build\tests\RelWithDebInfo\tests.exe (process 16408) exited with code 0.
Press any key to close this window . . .

@ChunelFeng
Author

By the way, please update CGraph to the newest version; we have optimized this situation.

@dpuyda
Owner

dpuyda commented Mar 10, 2024

Hi @ChunelFeng! Thank you for letting me know about the updates! There are two commits which I didn't have, shown on the screenshot below:

[screenshot: recent CGraph commits]

I think you meant the earlier commit titled "[perf] optimize all parallel situation."; I already had it when I did my experiments. The latest two commits, which I didn't have, are CMake updates and should not affect timings, so I'm getting the same results as earlier.

May I ask you kindly what timings you get on your computer?

@ChunelFeng
Author

Run with the current newest version of scheduling:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <iostream>
#include <vector>

#include "schedule/schedule.h"

using namespace scheduling;

void test_performance () {
    auto counter = 0;
    ThreadPool thread_pool;
    std::vector<Task> tasks(32);
    tasks[0] = Task([&] { ++counter; });
    for (auto i = tasks.begin(), j = std::next(i); j != tasks.end(); ++i, ++j) {
        *j = Task([&] { ++counter; });
        j->Succeed(&*i);
    }
    for (int i = 0; i < 1'000'000; ++i) {
        thread_pool.Submit(&tasks[0]);
        thread_pool.Wait();
    }
}

int main() {
    for (int i = 0; i < 5; i ++) {
        auto start_ts = std::chrono::steady_clock::now();
        test_performance();
        std::chrono::duration<double, std::milli> span = std::chrono::steady_clock::now() - start_ts;
        printf("time counter is : [%0.2lf] ms \n", span.count());
    }
    return 0;
}
[screenshot: timing results]

@ChunelFeng
Author

The newest CGraph version, run with 8 threads:

#include "../_Materials/TestInclude.h"

using namespace CGraph;

static int g_cnt = 0;
class TestAddGNode : public CGraph::GNode {
public:
    CStatus run() override {
        g_cnt++;
        return CStatus();
    }
};

void test_performance_02() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestAddGNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestAddGNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->setAutoCheck(false);
    UThreadPoolConfig config;
    config.max_thread_size_ = 8;
    config.primary_thread_busy_epoch_ = 1;
    config.primary_thread_empty_interval_ = 0;
    pipeline->setUniqueThreadPoolConfig(config);

    status += pipeline->init();
    {
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    for (int i = 0; i < 5; i ++) {
        UTimeCounter counter("test_performance_02");
        test_performance_02();
    }
    return 0;
}
[screenshot: timing results]

@ChunelFeng
Author

And my computer is a MacBook Pro with an M1 chip (2021).

[screenshot: system information]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

This is a very surprising result; I do not have an explanation for why you observe such bad performance with Scheduling. On my computer, I observe the results shown in the video below:

devenv_K45xBPSqK7.mp4

It could be that you are accidentally running Scheduling in debug mode, or that there are some macOS-specific optimizations on your computer, I have no idea... My computer must be much less powerful than yours:

[screenshot: system information]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

I can see you have Release on your screenshots in both cases, so accidentally running it in Debug is not the case. It could be some macOS-specific stuff; honestly, I don't know... 🤔

@dpuyda
Owner

dpuyda commented Mar 10, 2024

May I ask you also what results you get if you force Scheduling to use 8 threads? This can be done as below:

ThreadPool thread_pool(8);

I think a possible explanation could be as follows: By default, Scheduling uses as many threads as are available on your computer (defined by std::thread::hardware_concurrency()). The more threads, the slower it is because of the impact of context switching. On your computer, the value of std::thread::hardware_concurrency() is likely to be greater than 8, while with CGraph we hardcode 8 threads. The fewer threads, the faster it is.

@ChunelFeng
Author

ChunelFeng commented Mar 10, 2024

I think there must be many differences between different OSes and different CPUs,
and I know little about these differences.

[screenshot]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

Thank you for the quick response @ChunelFeng! Hm, okay, it looks like I've run out of ideas for now 😄 But it's clear that I have a lot of work to do on Scheduling 😉 Please give me some time to think about this; this is a very interesting discussion for me, and I would be very glad if we could stay in touch about it. I will post here as soon as I have any updates on this topic.

@dpuyda
Owner

dpuyda commented Mar 10, 2024

I've asked my friend to run the same tests on his computer (more powerful than mine, also on Windows though). He sent me these screenshots:
[screenshots: test results from a second Windows machine]

So I'm still not sure what's going on; we will also try to run it on macOS next week...

@dpuyda
Owner

dpuyda commented Mar 18, 2024

Hi @ChunelFeng!
I hope you are doing great! Regarding what we discussed, I have asked my colleagues to run the benchmarks on macOS. We seem to be getting around 1.1 sec for the test which we discussed, with both CGraph and Scheduling; please find some screenshots below. The number of worker threads was set to 8 in both cases.
There was also one machine on which we got very unstable results for Scheduling (from 3 to 14 seconds), but we have reason to suspect that it was busy with some background activities at that moment which could affect the timings (the machine was misbehaving with other apps too, so Scheduling was not an exception). We did not check CGraph on that machine.
Currently I'm not sure why it's so different on different computers and different OSes; I will definitely think about how to improve the stability and performance of my small app...

[screenshots: cgraph and sched benchmark results]
