
analysis with CGraph #1

Open
ChunelFeng opened this issue Feb 24, 2024 · 33 comments

Comments

@ChunelFeng

Hi, nice to see this repo.

CGraph is a repo with almost the same functionality as scheduling.

Could you please run a performance test against CGraph too, as you did with taskflow?

Thank you very much.

@ChunelFeng
Author

ChunelFeng commented Feb 24, 2024

I am not very sure whether my code for scheduling is at its best,

but there really is some gap between CGraph and scheduling in the linear case on my computer.

Could you please run some benchmarks on your workspace? Thank you.

[screenshots: benchmark timings for scheduling and CGraph]

#include <chrono>
#include <cstdio>
#include <iostream>

#include "CGraph.h"

using namespace CGraph;

unsigned g_node_cnt = 0;
class TestNode : public CGraph::GNode {
public:
    CStatus run() override {
        g_node_cnt++;
        return CStatus();
    }
};

void test_performance() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->makeSerial();
    pipeline->setAutoCheck(false);
    status += pipeline->init();
    {
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
        std::cout << g_node_cnt << std::endl;
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    auto start_ts = std::chrono::steady_clock::now();
    test_performance();

    std::chrono::duration<double, std::milli> span = std::chrono::steady_clock::now() - start_ts;
    printf("time counter is : [%0.2lf] ms \n", span.count());
    return 0;
}

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Hi! Thanks a lot for the reference to CGraph! I didn't know about CGraph earlier; this is a very interesting repository! Please give me some time to take a deeper look and provide benchmarks, I will be very glad to do this.

@ChunelFeng
Author

> Hi! Thanks a lot for the reference to CGraph! I didn't know about CGraph earlier; this is a very interesting repository! Please give me some time to take a deeper look and provide benchmarks, I will be very glad to do this.

Thank you very much

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Haven't had a chance to look into it in detail yet, but here are some comments which I can make now:

  • When you benchmark Scheduling, as I can see on your screenshot, you benchmark not only running the graph one million times, but also creating the graph one million times. When I do the same on my laptop (the one which was used to run the benchmarks described in the readme), I get around 10 seconds for this test. By contrast, in your code for CGraph, it looks to me like you do the initialization routines once, and then only run the already-created pipeline one million times.
  • At this point in time, Scheduling does not actually support running the same graph several times. This is because the cancelled_ flag is set when the task is executed, and there is currently no way to reset it afterwards. So, if I want to replicate your test, I need to create a graph consisting of one million nodes upfront and then run it all at once (the code is below). If I do this, I get somewhere between 110 and 120 ms on my computer. However, yes, this is super memory-inefficient! I will think about how to optimize this use case and will update the library to provide better support for this situation. I will also be very interested in providing a benchmark comparison for Scheduling and CGraph.
    void test() {
      int counter = 0;
      ThreadPool thread_pool;
      std::vector<Task> v(1'000'000);  // super memory-inefficient, but no other options so far
      v[0] = Task([&]{++counter;});
      for (auto i = v.begin(), j = std::next(v.begin()); j != v.end(); ++i, ++j) {
        *j = Task([&]{++counter;});
        j->Succeed(&*i);
      }
      thread_pool.Submit(&v[0]);
      thread_pool.Wait();
      std::cout << counter << std::endl;
    }

Thanks again for your interest in Scheduling and for the reference to CGraph!

@ChunelFeng
Author

Oh no, there is a mistake in your current demo.
I meant to create a DAG as a line, with 32 nodes in it, and run this DAG 1,000,000 times,
so the result should be 32 × 1,000,000,

not to create one node and run that node 1,000,000 times;
your new demo's result seems to be 1 × 1,000,000.

Like you, I think that creating once and then running several times is a common situation for users,
just like opening a file, writing some lines into it, and finally closing it.

I hope scheduling can support more scenarios. Thanks for your excellent code and sharing!

@dpuyda
Owner

dpuyda commented Feb 25, 2024

Yes, I understand what you mean; unfortunately, this use case is currently not supported by scheduling. I will try to add support for it. Thanks a lot for this great and very useful idea! I will post an update here once it's implemented.

I have added a link to CGraph in my readme, I hope you don't mind. I will be glad to study your excellent library in more detail!

@ChunelFeng
Author

> Yes, I understand what you mean; unfortunately, this use case is currently not supported by scheduling. I will try to add support for it. Thanks a lot for this great and very useful idea! I will post an update here once it's implemented.
>
> I have added a link to CGraph in my readme, I hope you don't mind. I will be glad to study your excellent library in more detail!

It is my honor.

@dpuyda
Owner

dpuyda commented Feb 26, 2024

The honor is all mine; you did a great job writing CGraph, and there's a lot to learn from this lib!

@dpuyda dpuyda reopened this Feb 27, 2024
@dpuyda
Owner

dpuyda commented Feb 27, 2024

I have reopened the issue, if you don't mind, since the comparison with CGraph is not done yet. I would like to implement the use case which you suggested and provide benchmarks, so let's keep it open until it's done. Thank you!

@dpuyda
Owner

dpuyda commented Mar 3, 2024

Hi @ChunelFeng! I hope you are doing well! I have a small question about CGraph which I would be glad to discuss with you here when you have a moment. I tried the code snippet which you suggested above and observed that everything happens on the main thread:

[screenshot: threads view showing all work on the main thread]

After a quick look into your code, I noticed that you define whether a task is sync or async based on the timeout value:

CBool GElement::isAsync() const {
    // if timeout != 0, execute asynchronously
    return this->timeout_ != CGRAPH_DEFAULT_ELEMENT_TIMEOUT;
}

I tried setting the timeout for the tasks manually to make them async like below, but this did not seem to be working:

  for (const auto& task : arr) {
    task->setTimeout(1000);
  }

May I ask you kindly if you have specific logic in your code which controls the number of available threads and thread affinity to avoid context switching? I have updated Scheduling to be able to run the same graph multiple times (not merged yet). When I limit the number of threads to one, I obviously get much better results compared to the multi-threaded version, which suffers very heavily from context switching in this use case...

@ChunelFeng
Author

Two mechanisms together make a DAG that is a single line run this way.

First, we use makeSerial() to check whether this DAG can run well in one single thread:

pipeline->makeSerial();

Second, in the code:

https://github.com/ChunelFeng/CGraph/blob/main/src/GraphCtrl/GraphElement/_GEngine/GDynamicEngine/GDynamicEngine.cpp

we use analysisDagType to make sure the linear nodes run in the same thread (the main thread):

analysisDagType(elements);

These two methods only work when the DAG is a line,
so if you compose other types of DAGs, this one-thread path cannot work well.

Also, when you set a node's timeout value, it cannot run in one thread because of the async timeout.

@ChunelFeng
Author

We can simply check whether the DAG can work in one thread:

CStatus status = pipeline->makeSerial();

If it can, the status is OK; otherwise, it cannot work in one thread.

@dpuyda
Owner

dpuyda commented Mar 3, 2024

So does it automatically determine the minimum number of threads that are necessary to run a graph?

@ChunelFeng
Author

Not automatically.

In other cases, you can set the config just like in this example:

https://github.com/ChunelFeng/CGraph/blob/main/tutorial/T01-Simple.cpp

and you can automatically get the minimum thread number with this function:

[screenshot: function for getting the minimum thread count]

@dpuyda
Owner

dpuyda commented Mar 3, 2024

cool, thanks a lot for the explanations!

@dpuyda
Owner

dpuyda commented Mar 3, 2024

yes, now I am able to break your code in the same way as mine is currently broken 😃

[screenshot]

Let me think a little about it before merging my changes, I will let you know here once I'm ready... 😉

@ChunelFeng
Author

What is the tool in your picture?

It seems very useful.

@dpuyda
Owner

dpuyda commented Mar 3, 2024

@dpuyda
Owner

dpuyda commented Mar 8, 2024

Hi @ChunelFeng!
Thanks again for looking into Scheduling and for the suggestions on how it can be improved; your comments are much valued!
I have updated Scheduling to be able to run the same graph multiple times, see #2

Below are a few observations that I can make:

  • If I limit the number of threads to one, with Scheduling I seem to get timings that are comparable to yours. On my computer, with one execution thread, I get up to 1 second with Scheduling and around 1.5 seconds with CGraph.
  • When using 8 worker threads, with Scheduling I usually get around 1.5 seconds (occasionally more if the run happens to be heavily impacted by context switching). I'm still not sure how to configure the CGraph thread pool to use more than one thread to make it comparable with Scheduling. The difference between the default behavior of CGraph and Scheduling is that CGraph always executes everything on the main thread, while with Scheduling, each chain of 32 tasks is executed on a randomly selected thread managed by the thread pool. I think further improvements can be made by executing each chain on the same thread to avoid context switching, but I'm still trying to find the best way to implement this.

Below is the Scheduling code which I used for experiments (you can also find it in unit tests):

  auto counter = 0;
  ThreadPool thread_pool;
  std::vector<Task> tasks(32);
  tasks[0] = Task([&] { ++counter; });
  for (auto i = tasks.begin(), j = std::next(i); j != tasks.end(); ++i, ++j) {
    *j = Task([&] { ++counter; });
    j->Succeed(&*i);
  }
  for (int i = 0; i < 1'000'000; ++i) {
    thread_pool.Submit(&tasks[0]);
    thread_pool.Wait();
  }

@ChunelFeng
Author

In this case, CGraph runs in only one thread because we use

pipeline->makeSerial();

If you remove this line, CGraph will create 8 threads in the background.

[screenshot]

Here is some code to run CGraph with 8 threads in this case:

[screenshot]

Please replace the test-performance-02.cpp file with the following code.

#include "../_Materials/TestInclude.h"

using namespace CGraph;

void test_performance_02() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestAdd1GNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestAdd1GNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->setAutoCheck(false);
    UThreadPoolConfig config;
    config.max_thread_size_ = 8;
    config.primary_thread_busy_epoch_ = 2;
    config.primary_thread_empty_interval_ = 0;
    pipeline->setUniqueThreadPoolConfig(config);

    status += pipeline->init();
    {
        UTimeCounter counter("test_performance_02");
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
    }

    if (32000000 != g_test_node_cnt) {
        std::cout << "test_performance_02: g_test_node_cnt is not right : " << g_test_node_cnt << std::endl;
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    test_performance_02();
    return 0;
}

@dpuyda
Owner

dpuyda commented Mar 9, 2024

Thanks a lot for the code snippet, @ChunelFeng! Now the code runs 8 background threads as expected. I ran it multiple times; on my computer the test usually takes around 2.8 sec. Below is a sample output:

[2024-03-10 00:00:16.454] [test_performance_02]: time counter is : [2858.44] ms

D:\repo\CGraph\build\test\Performance\RelWithDebInfo\test-performance-02.exe (process 18100) exited with code 0.
Press any key to close this window . . .

To run a similar test for Scheduling, you can use the ThreadPoolTest.ResubmitGraph unit test. Test duration on my computer mostly varies between 1.5 and 1.9 sec. The timing is much less stable compared to CGraph; I guess the variation comes from the random context-switching issue which I described above, and I'm still not sure how to fix it. Below is a sample output:

Running main() from D:\repo\scheduling\build\tests\googletest-src\googletest\src\gtest_main.cc
Note: Google Test filter = ThreadPoolTest.ResubmitGraph
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ThreadPoolTest
[ RUN      ] ThreadPoolTest.ResubmitGraph
[       OK ] ThreadPoolTest.ResubmitGraph (1516 ms)
[----------] 1 test from ThreadPoolTest (1517 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1519 ms total)
[  PASSED  ] 1 test.

D:\repo\scheduling\build\tests\RelWithDebInfo\tests.exe (process 16408) exited with code 0.
Press any key to close this window . . .

@ChunelFeng
Author

By the way, please update CGraph to the newest version; we have optimized this situation.

@dpuyda
Owner

dpuyda commented Mar 10, 2024

Hi @ChunelFeng! Thank you for letting me know about the updates! There are two commits which I didn't have, shown on the screenshot below:

[screenshot: recent CGraph commits]

I think you meant the earlier commit titled "[perf] optimize all parallel situation."; I already had it when I did my experiments. The latest two commits, which I didn't have, are CMake updates and should not affect timings, so I'm getting the same results as earlier.

May I ask you kindly what timings you get on your computer?

@ChunelFeng
Author

Run with the current newest version of scheduling:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <iostream>
#include <vector>

#include "schedule/schedule.h"

using namespace scheduling;

void test_performance () {
    auto counter = 0;
    ThreadPool thread_pool;
    std::vector<Task> tasks(32);
    tasks[0] = Task([&] { ++counter; });
    for (auto i = tasks.begin(), j = std::next(i); j != tasks.end(); ++i, ++j) {
        *j = Task([&] { ++counter; });
        j->Succeed(&*i);
    }
    for (int i = 0; i < 1'000'000; ++i) {
        thread_pool.Submit(&tasks[0]);
        thread_pool.Wait();
    }
}

int main() {
    for (int i = 0; i < 5; i ++) {
        auto start_ts = std::chrono::steady_clock::now();
        test_performance();
        std::chrono::duration<double, std::milli> span = std::chrono::steady_clock::now() - start_ts;
        printf("time counter is : [%0.2lf] ms \n", span.count());
    }
    return 0;
}
[screenshot: timing results]

@ChunelFeng
Author

The newest CGraph version, run with 8 threads:

#include "../_Materials/TestInclude.h"

using namespace CGraph;

static int g_cnt = 0;
class TestAddGNode : public CGraph::GNode {
public:
    CStatus run() override {
        g_cnt++;
        return CStatus();
    }
};

void test_performance_02() {
    // run 32 nodes serially; corresponds to the second example: 1 thread, 32 serial nodes, 1,000,000 iterations
    GPipelinePtr pipeline = GPipelineFactory::create();
    CStatus status;
    GElementPtr arr[32];
    pipeline->registerGElement<TestAddGNode>(&arr[0]);
    for (int i = 1; i < 32; i++) {
        pipeline->registerGElement<TestAddGNode>(&arr[i], {arr[i - 1]});
    }
    pipeline->setAutoCheck(false);
    UThreadPoolConfig config;
    config.max_thread_size_ = 8;
    config.primary_thread_busy_epoch_ = 1;
    config.primary_thread_empty_interval_ = 0;
    pipeline->setUniqueThreadPoolConfig(config);

    status += pipeline->init();
    {
        for (int t = 0; t < 1000000; t++) {
            pipeline->run();
        }
    }

    status += pipeline->destroy();
    GPipelineFactory::remove(pipeline);
}

int main() {
    for (int i = 0; i < 5; i ++) {
        UTimeCounter counter("test_performance_02");
        test_performance_02();
    }
    return 0;
}
[screenshot: timing results]

@ChunelFeng
Author

And my computer is a MacBook Pro with an M1 chip (2021).

[screenshot: system information]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

This is a very surprising result; I do not have an explanation for why you observe such bad performance with Scheduling. On my computer, I observe the results shown in the video below:

devenv_K45xBPSqK7.mp4

It could be that you are accidentally running Scheduling in debug mode, or that there are some macOS-specific optimizations on your computer, I have no idea... My computer must be much less powerful than yours:

[screenshot: system information]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

I can see you have Release on your screenshots in both cases, so accidentally running it in Debug is not the case. It could be some macOS-specific stuff; honestly, I don't know... 🤔

@dpuyda
Owner

dpuyda commented Mar 10, 2024

May I ask you also what results you get if you force Scheduling to use 8 threads? This can be done as below:

ThreadPool thread_pool(8);

I think a possible explanation could be as follows: By default, Scheduling uses as many threads as are available on your computer (defined by std::thread::hardware_concurrency()). The more threads, the slower it is because of the impact of context switching. On your computer, the value of std::thread::hardware_concurrency() is likely to be greater than 8, while with CGraph we hardcode 8 threads. The fewer threads, the faster it is.

@ChunelFeng
Author

ChunelFeng commented Mar 10, 2024

I think there must be many differences between different OSes and different CPUs,
and I know little about these differences.

[screenshot]

@dpuyda
Owner

dpuyda commented Mar 10, 2024

Thank you for the quick response @ChunelFeng! Hm, okay, it looks like I've run out of ideas for now 😄 But it's clear that I have a lot of work to do on Scheduling 😉 Please give me some time to think about this; this is a very interesting discussion for me, and I would be very glad if we could stay in touch about it. I will post here as soon as I have any updates on this topic.

@dpuyda
Owner

dpuyda commented Mar 10, 2024

I've asked my friend to run the same tests on his computer (more powerful than mine, also on Windows though). He sent me these screenshots:
[screenshots: test results from a second Windows machine]

So I'm still not sure what's going on; we will also try to run it on macOS next week...

@dpuyda
Owner

dpuyda commented Mar 18, 2024

Hi @ChunelFeng!
I hope you are doing great! Regarding what we discussed, I have asked my colleagues to run the benchmarks on macOS. We seem to be getting around 1.1 sec for the test which we discussed, with both CGraph and Scheduling; please find some screenshots below. The number of worker threads was set to 8 in both cases.
There was also one machine on which we got very unstable results for Scheduling (from 3 to 14 seconds), but we have reason to suspect that it was busy with some background activities at that moment which could affect the timings (the machine was misbehaving with other apps too, so Scheduling was not an exception). We did not check CGraph on that machine.
Currently I'm not sure why it's so different on different computers and different OSes; I will definitely think about how to improve the stability and performance of my small app...

[screenshots: cgraph and sched benchmark results]
