Consolidate Dataflow documentation

drewnoakes · drewnoakes · commit c872f283125a · 2023-10-25T14:16:28.000+11:00
diff --git a/doc/Index.md b/doc/Index.md
@@ -23,7 +23,6 @@ VS Project System Documentation
   - [Responsive design](overview/responsive_design.md)
   - [Globbing behavior](overview/globbing_behavior.md)
   - [Dataflow](overview/dataflow.md)
-    - [Dataflow in CPS](overview/dataflow_in_CPS.md)
     - [Dataflow source blocks](extensibility/dataflow_sources.md)
   - Diagnostics
     - [How to examine Visual Studio registry](overview/examine_registry.md)
diff --git a/doc/overview/dataflow.md b/doc/overview/dataflow.md
@@ -229,6 +229,8 @@ CPS has a few subclasses of `DataflowLinkOptions` that you can use in certain ci
 
 # Dataflow in CPS
 
+One of the main goals of CPS is to move the bulk of project system work to background threads while still maintaining data consistency. To accomplish this, CPS uses Dataflow to flow changes through the project system in a versioned, immutable, producer-consumer pattern. Dataflow is not always easy to use correctly, and misuse can lead to corrupt state and deadlocks.
+
 ## Slim blocks
 
 TPL's Dataflow blocks are general purpose and have feautres that aren't used in CPS. Those unused features come with a performance/memory cost. To improve the scalability of CPS in large solutions, we have a replacement set of "slim" blocks that provide the required behaviours of TPL's blocks, but without the overhead associated with the unused features.
@@ -245,7 +247,9 @@ TPL's Dataflow blocks are general purpose and have feautres that aren't used in
 
 Dataflow graphs publish immutable snapshots of data between blocks, where updates are pushed through the graph in an asynchronous fashion. This gives the framework a lot of flexibility to schedule the work, but can make it difficult to know when a given input has made its way through the graph to the outputs.
 
-Another challenge with Dataflow graphs is joining data. Consider the following graph:
+Dataflow is simple when you have a single line of dependencies, but in CPS the graphs are much more complex. It is common for a chained data source to require input from multiple upstream sources, and for those upstream sources to have multiple inputs themselves. This pattern introduces a data consistency problem.
+
+Consider the following Dataflow graph:
 
 ```mermaid
 flowchart LR
@@ -271,7 +275,7 @@ public interface IProjectValueVersions
 }
 ```
 
-And in fact, a versioned value can have _more than one version!_ This makes sense when you consider that a given node in the graph can have more than one source block feeding in to it. Each of those source blocks provides its own versioned value, and as messages are joined, the sets of versions are merged.
+And in fact, a versioned value can have _more than one version!_ This makes sense when you consider that a given node in the graph can have more than one source block feeding into it. Each of those source blocks provides its own versioned value, and as messages are joined, the sets of versions are merged.
 
 ```mermaid
 flowchart LR
@@ -347,6 +351,12 @@ IDisposable link = ProjectDataSources.SyncLinkTo(
 
 The `SyncLinkOptions` extension method allows the data source to be configured. If the source contains rule-based data (discussed [below](#rule-sources)) 
 
+### Allowing inconsistent versions
+
+In special cases that require it, it is possible to allow for inconsistent versions in your Dataflow. This is useful when you depend on multiple upstream sources, one of which is drastically slower at producing values than the others, and you want to produce intermediate values while the slow one is still processing. For example, you might want data quickly from project evaluation as well as the richer data that arrives later via design-time builds.
+
+Unfortunately, there is no built-in support for this scenario. You will have to link to your upstream sources manually and synchronize between them yourself. When producing chained output, you may be able to use `ProjectDataSources.MergeDataSourceVersions` to calculate the data versions to publish.
+
 ## Subscribing to project data
 
 One of the main use cases for Dataflow in CPS is the processing of project data. Unlike the legacy CSPROJ project system where updates were generally applied on a single thread (the main thread), CPS uses Dataflow to schedule updates asyncrhonously on the thread pool.
@@ -476,7 +486,7 @@ CPS provides access to several such `IProjectValueDataSource<T>` instances via `
 
 ### Chained (derived) data sources
 
-Most `IProjectValueDataSource<T>` instances will produce data that was derived from other project value data sources. CPS provides the abstract base class `ChainedProjectValueDataSourceBase<T>`, which makes creating such a derived (chained) source easy.
+Most `IProjectValueDataSource<T>` instances will produce data that was derived from one or more other project value data sources. CPS provides the abstract base class `ChainedProjectValueDataSourceBase<T>`, which makes creating such a derived (chained) source easy.
 
 Let's look at an example of overriding this class to create a new data source that derives its data from one other source:
 
diff --git a/doc/overview/dataflow_in_CPS.md b/doc/overview/dataflow_in_CPS.md
@@ -1,84 +1,3 @@
 # Dataflow in CPS
 
-One of the main goals of CPS is to move the bulk of the project system work to background threads,
-while still maintaining data consistency. To accomplish this, CPS leverages the [TPL.Dataflow](https://learn.microsoft.com/dotnet/standard/parallel-programming/dataflow-task-parallel-library)
-library to produce a versioned, immutable, producer-consumer pattern to flow changes through the
-project system. Dataflow is not always easy, and if used wrong it can quickly lead to corrupt
-project states or deadlocks.
-
-## Types of Dataflow in CPS
-
-Dataflow in CPS comes primarily in two types, an original source or a chained source.
-
-1. Original Source
-   * Depends on an original source of data that is not part of dataflow.
-   * Always has its own version.
-   * IE: a file on disk
-2. Chained Source
-   * Chains into existing dataflow.
-   * Can be one or multiple dataflow blocks that feed into this one.
-   * Very __rarely__ has its own version. Typically if it does, it can
-     be pulled out into an original source.
-   * Carries all the versions of the dataflow it chains to.
-     * More about versioning later
-
-## Data Consistency Problem
-
-Dataflow is simple when you have a single line of dependencies, but in CPS it is much more complex.
-It is common for a chained datasource to require input from multiple upstream sources. It is also
-common for those upstream sources to also have multiple inputs. This pattern introduces a data
-consistency problem. Take a look at the dataflow diagram below (arrows represent dataflow):
-
-```mermaid
-flowchart LR
-    A --> C
-    C --> D
-    B --> C
-    B --> D
-```
-
-In the above layout, `A` and `B` are original sources. `C` listens to both `A` and `B`, but since
-they are _original_ sources `C` can produce a new value when either change. `D` is where it gets
-complex. `D` can only produce values when it has `B` and `C` of the same source version. `D` only
-produces a value when the version of `C` it has was produced from the same version of `B` that
-`D` currently has. To solve this consistency issue CPS versions all dataflow and then synchronizes
-around these published versions.
-
-## Dataflow Versioning
-
-To solve the problem described above, all dataflow in CPS produces types of `IProjectVersionedValue<T>`.
-This type combines `T Value` and `IImmutableDictionary<NamedIdentity, IComparable> DataSourceVersions`.
-
-Then, chained dataflow will cary the versions of its upstream data sources. When a chained source has
-multiple upstream sources its published version becomes the merged value of the its upstream sources.
-This functionality is facilitated via `ProjectDataSources.SyncLinkTo`. When using that method to link
-to multiple upstream sources, a middle dataflow block is created that only publishes to your block when
-all recieved values are in a consistent state. See [this example](../extensibility/dataflow_example.md#chained-data-source-multiple-sources)
-for how to use `SyncLinkTo`.
-
-### Rules to Follow with Versioning
-
-__When you are a...__
-* __Original source__ you have your own `DataSourceKey` and `DataSourceVersion`. The key
-  identifies who you are, and the version must incremenet whenever you produce a new value.
-  The only value in your `DataSourceVersions` published is your own.
-* __Chained source__ you must merge and carry the versions of all dataflow you are chained
-  to in your own `DataSourceVersions`. You very rarely have your own version because your
-  version is just the combined versions that you chained to. If you do need your own version,
-  consider pulling the part that publishes the original data into its own source.
-
-### Allowing Inconsistent Versions
-
-In special cases that require it, it is possible to allow for inconsistent versions in your dataflow.
-This is for when you depend on multiple upstream sources where one is drastically slower at producing
-values than others, but you want to be able to produce intermediate values while the slow one is still
-processing. Unfortunately, there is no CPS base class equivalent to `ProjectValueDataSourceBase` or
-`ChainedProjectValueDataSourceBase` for this scenario. You will have to manually link to your upstream
-sources and synchronizing between multiple sources publishing at once. For calculating the data versions
-to publish, use `ProjectDataSources.MergeDataSourceVersions`.
-
-## Further reading
-
-- [Dataflow Examples](../extensibility/dataflow_example.md)
-- [Dataflow Sources](../extensibility/dataflow_sources.md)
-- [Dataflow Best Practices](../extensibility/dataflow_best_practices.md)
+Moved to [Dataflow in CPS](dataflow.md#dataflow-in-cps).