Internal Parallelism

Internal parallelism is multithreading within a system.

The usual multithreading in Bevy is to run each system in parallel when possible (when there is no conflicting data access with other systems). This is called "external parallelism".
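For example, these two systems have no conflicting data access (hypothetical Enemy and Player marker components), so Bevy is free to run them at the same time on different CPU threads:

fn move_enemies(
    mut q_enemies: Query<&mut Transform, With<Enemy>>,
) {
    for mut transform in &mut q_enemies {
        // ... move the enemy ...
        transform.translation.x += 1.0;
    }
}

fn animate_players(
    mut q_players: Query<&mut Sprite, With<Player>>,
) {
    for mut sprite in &mut q_players {
        // ... update the sprite ...
        sprite.flip_x = !sprite.flip_x;
    }
}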

However, sometimes, you need to write a system that has to process a huge number of entities or events. In that case, simple query or event iteration would not scale to make good use of the CPU.

Bevy offers a solution: parallel iteration. Bevy will automatically split all the entities/events into appropriately-sized batches, and iterate each batch on a separate CPU thread for you, calling a function/closure you provide.

If there are only a few entities/events, Bevy will automatically fall back to single-threaded iteration, and it will behave the same way as if you had just iterated normally. With a few entities/events, that is faster than multi-threading.

Even though parallel iteration should automatically make a good decision regardless of the number of entities/events, it is more awkward to use and not always suitable: you have to do everything from inside a closure, and there are other limitations.

Also, if your system is unlikely to ever encounter huge numbers of entities/events, don't bother with it and just iterate your queries and events normally.

Parallel Query Iteration

Queries support parallel iteration to let you process many entities across multiple CPU threads.

fn my_particle_physics(
    mut q_particles: Query<(&mut Transform, &MyParticleState), With<MyParticle>>,
) {
    q_particles.par_iter_mut().for_each(|(mut transform, my_state)| {
        my_state.move_particle(&mut transform);
    });
}

One limitation of parallel iteration is that safe Rust does not allow you to share &mut access across CPU threads. Therefore, it is not possible to mutate any data outside of the current entity's own components.

If you need to mutate shared data, you could use something like Mutex, but beware of the added overhead. It could easily drown out any benefits you get from parallel iteration.
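For illustration, here is a minimal sketch (hypothetical function and threshold, reusing the MyParticle marker from the example above) that counts distant particles using a Mutex-protected counter shared across the parallel iteration:

use std::sync::Mutex;

fn count_distant_particles(
    q_particles: Query<&Transform, With<MyParticle>>,
) {
    // shared data, protected by a Mutex
    let count = Mutex::new(0u32);

    q_particles.par_iter().for_each(|transform| {
        if transform.translation.length() > 1000.0 {
            // every lock here adds synchronization overhead
            *count.lock().unwrap() += 1;
        }
    });

    let count = count.into_inner().unwrap();
    info!("{} particles are far away", count);
}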

Parallel Commands

If you need to use commands, there is the ParallelCommands system parameter. It allows you to get access to Commands from within the parallel iteration closure.

fn my_particle_timers(
    time: Res<Time>,
    mut q_particles: Query<(Entity, &mut MyParticleState), With<MyParticle>>,
    par_commands: ParallelCommands,
) {
    q_particles.par_iter_mut().for_each(|(e_particle, mut my_state)| {
        my_state.timer.tick(time.delta());

        if my_state.timer.finished() {
            par_commands.command_scope(|mut commands| {
                commands.entity(e_particle).despawn();
            })
        }
    });
}

However, generally speaking, commands are an inefficient way to do things in Bevy, and they do not scale well to huge numbers of entities. If you need to spawn/despawn or insert/remove components on huge numbers of entities, you should probably do it from an exclusive system, instead of using commands.

In the above example, we update timers stored across many entities, and use commands to despawn any entities whose time has elapsed. It is a good use of commands, because the timers need to be ticked for all entities, but only a few entities are likely to need despawning at once.
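If you did need to despawn huge numbers of entities at once, a minimal sketch of the exclusive-system alternative (hypothetical function name, reusing the MyParticle marker) could look like this:

fn despawn_all_particles(world: &mut World) {
    // find all the particle entities
    let mut q_particles = world.query_filtered::<Entity, With<MyParticle>>();
    let entities: Vec<Entity> = q_particles.iter(world).collect();

    // despawn them directly, without going through Commands
    for e in entities {
        world.despawn(e);
    }
}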

Parallel Event Iteration

EventReader<T> offers parallel iteration for events, allowing you to process a huge number of events across multiple CPU threads.

fn handle_many_events(
    mut evr: EventReader<MyEvent>,
) {
    evr.par_read().for_each(|ev| {
        // TODO: do something with `ev`
    });
}

However, one downside is that you cannot use it for events that need to be handled in order. With parallel iteration, the order becomes undefined.

However, if you use .for_each_with_id, your closure will also be given an EventId, which contains a sequential index indicating which event you are currently processing. That can help you know where you are in the event queue, even though you are still processing events in an undefined order.
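For example (a hypothetical sketch, reusing the MyEvent type from above):

fn handle_many_events_with_id(
    mut evr: EventReader<MyEvent>,
) {
    evr.par_read().for_each_with_id(|ev, id| {
        // `id.id` is the sequential index of this event
        debug!("Processing event #{}", id.id);
        // TODO: do something with `ev`
    });
}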

Another downside is that typically you need to be able to mutate some data in response to events, but, in safe Rust, it is not possible to share mutable access to anything across CPU threads. Thus, parallel event handling is impossible for most use cases.

If you were to use something like Mutex for shared access to data, the synchronization overhead would probably kill performance, and you'd have been better off with regular single-threaded event iteration.

Controlling the Batch Size

The batch size and number of parallel tasks are chosen automatically using smart algorithms, based on how many entities/events need to be processed, and how Bevy ECS has stored/organized the entity/component data in memory. However, it assumes that the amount of work/computation you do for each entity is roughly the same.

If you find that you want to manually control the batch size, you can specify a minimum and maximum using BatchingStrategy.

fn par_iter_custom_batch_size(
    q: Query<&MyComponent>,
) {
    q.par_iter().batching_strategy(
        BatchingStrategy::new()
            // whatever fine-tuned values you come up with ;)
            .min_batch_size(256)
            .max_batch_size(4096)
    ).for_each(|my_component| {
        // TODO: do some heavy work
    });

    q.par_iter().batching_strategy(
        // fixed batch size
        BatchingStrategy::fixed(1024)
    ).for_each(|my_component| {
        // TODO: do some heavy work
    });
}

Parallel Processing of Arbitrary Data

Internal parallelism isn't limited to just ECS constructs like entities/components or events.

It is also possible to process a slice (or anything that can be referenced as a slice, such as a Vec) in parallel chunks. If you just have a big buffer of arbitrary data, this is for you.

Use .par_splat_map/.par_splat_map_mut to spread the work across a number of parallel tasks. Specify None for the task count to automatically use the total number of CPU threads available.

Use .par_chunk_map/.par_chunk_map_mut to manually specify a specific chunk size.

In both cases, you provide a closure to process each chunk (sub-slice). It will be given the index of its chunk + the reference to its chunk slice. You can return values from the closure, and they will be collected and returned to the call site as a Vec (one element per chunk, in order).

use bevy::tasks::{ComputeTaskPool, ParallelSlice, ParallelSliceMut};

fn parallel_slices(/* ... */) {
    // say we have a big vec with a bunch of data
    let mut my_data = vec![Something; 10000];

    // and we want to process it across the number of
    // available CPU threads, splitting it into equal chunks
    my_data.par_splat_map_mut(ComputeTaskPool::get(), None, |i, data| {
        // `i` is the index of the current chunk
        // `data` is the sub-slice / chunk to process
        for item in data.iter_mut() {
            process_thing(item);
        }
    });

    // Example: we have a bunch of numbers
    let my_values = vec![10; 8192];

    // Example: process it in chunks of 1024
    // to compute the sums of each sequence of 1024 values.
    let sums = my_values.par_chunk_map(ComputeTaskPool::get(), 1024, |_, data| {
        // sum the current chunk of 1024 values
        let sum: u64 = data.iter().sum();
        // return it out of the closure
        sum
    });

    // `sums` is now a `Vec<u64>` containing
    // the returned value from each chunk, in order
}

When you are using this API from within a Bevy system, spawn your tasks on the ComputeTaskPool.

This API can also be useful when you are doing background computation, to get some extra parallelism. In that case, use the AsyncComputeTaskPool instead.
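For instance, a sketch of the same chunked processing running in the background (reusing the hypothetical Something type and process_thing function from the example above):

use bevy::tasks::{AsyncComputeTaskPool, ParallelSliceMut};

fn background_processing(my_data: &mut Vec<Something>) {
    // same API as before, just a different task pool
    my_data.par_splat_map_mut(AsyncComputeTaskPool::get(), None, |_, chunk| {
        for item in chunk.iter_mut() {
            process_thing(item);
        }
    });
}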

Scoped Tasks

Scoped tasks are actually the underlying primitive that all of the above abstractions (parallel iterators and slices) are built on. If the previously-discussed abstractions aren't useful to you, you can implement whatever custom processing flow you want, by spawning scoped tasks yourself.

Scoped tasks let you borrow whatever you want out of the parent function. The Scope will wait until the tasks return, before returning back to the parent function. This ensures your parallel tasks do not outlive the parent function, thus accomplishing "internal parallelism".

To get a performance benefit, make sure each of your tasks has a significant and roughly similar amount of work to do. If your tasks complete very quickly, it is possible that the overhead of parallelism outweighs the gains.

use bevy::tasks::ComputeTaskPool;

fn my_system(/* ... */) {
    // say we have a bunch of variables
    let mut a = Something;
    let mut b = Something;
    let mut more_things = [Something; 5];

    // and we want to process the above things in parallel
    ComputeTaskPool::get().scope(|scope| {
        // spawn our tasks using the scope:
        scope.spawn(async {
            process_thing(&mut a);
        });
        scope.spawn(async {
            process_thing(&mut b);
        });

        // nested spawning is also possible:
        // you can use the scope from within a task,
        // to spawn more tasks
        scope.spawn(async {
            for thing in more_things.iter_mut() {
                scope.spawn(async {
                    process_thing(thing);
                })
            }
            debug!("`more_things` array done processing.");
        });
    });

    // at this point, after the task pool scope returns,
    // all our tasks are done and everything has been processed
}

When you are using this API from within a Bevy system, spawn your tasks on the ComputeTaskPool.

This API can also be useful when you are doing background computation, to dispatch additional tasks for extra parallelism. In that case, use the AsyncComputeTaskPool instead.