<h1 id="curse-of-large-dimensionality">Curse of Large Dimensionality<a aria-hidden="true" class="anchor-heading icon-link" href="#curse-of-large-dimensionality"></a></h1>
The curse of large dimensionality...
In theory, adding dimensions adds information to the data and should improve its quality. In practice, it tends to increase noise and redundancy during analysis.
<h1 id="ballpark-estimate-for-number-of-data-points-required">Ballpark estimate for number of data points required<a aria-hidden="true" class="anchor-heading icon-link" href="#ballpark-estimate-for-number-of-data-points-required"></a></h1>
Say an ML model needs 10 data points to learn each distinct combination of feature values.
<ul>
<li>1 binary feature: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mn>2</mn><mn>1</mn></msup><mo>∗</mo><mn>10</mn><mo>=</mo><mn>20</mn></mrow><annotation encoding="application/x-tex">2^{1} * 10 = 20</annotation></semantics></math> data points</li>
<li>2 binary features: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mn>2</mn><mn>2</mn></msup><mo>∗</mo><mn>10</mn><mo>=</mo><mn>40</mn></mrow><annotation encoding="application/x-tex">2^{2} * 10 = 40</annotation></semantics></math> data points</li>
</ul>
...
<ul>
<li>n binary features: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mn>2</mn><mi>n</mi></msup><mo>∗</mo><mn>10</mn></mrow><annotation encoding="application/x-tex">2^{n} * 10</annotation></semantics></math> data points, which grows exponentially with n</li>
</ul>
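The estimate above can be sketched in a few lines of Python. The function name <code>required_points</code> and its parameters are illustrative assumptions, and the figure of 10 points per combination is just the ballpark value used in this note.

```python
def required_points(n_binary_features: int, points_per_combination: int = 10) -> int:
    """Ballpark sample size: each combination of binary values gets a fixed number of points."""
    if n_binary_features < 0:
        raise ValueError("n_binary_features must be non-negative")
    # n binary features span 2**n distinct value combinations.
    return 2 ** n_binary_features * points_per_combination


if __name__ == "__main__":
    for n in (1, 2, 10, 20, 30):
        print(f"{n:>2} binary features: {required_points(n):,} data points")
```

Even at a modest 30 binary features the estimate already exceeds ten billion points, which is why this is only a rough argument for keeping the feature count small.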
<h1 id="recognize-cod">Recognize COD<a aria-hidden="true" class="anchor-heading icon-link" href="#recognize-cod"></a></h1>
<ul>
<li>Overfitting</li>
<li>Sparse features</li>
<li>Computational complexity</li>
</ul>
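The sparsity symptom can be illustrated with a small simulation: keep the sample size fixed, add binary features, and measure what fraction of all possible value combinations the sample actually contains. This is a rough sketch on uniformly random synthetic data; the helper name <code>coverage</code> and the sample size of 1000 are assumptions made for illustration.

```python
import random


def coverage(n_features: int, n_samples: int, seed: int = 0) -> float:
    """Fraction of the 2**n_features binary combinations seen in a random sample."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    rng = random.Random(seed)
    # Each integer with n_features random bits encodes one combination of binary values.
    seen = {rng.getrandbits(n_features) for _ in range(n_samples)}
    return len(seen) / 2 ** n_features


if __name__ == "__main__":
    for n in (1, 5, 10, 15, 20):
        print(f"{n:>2} features, 1000 samples: coverage = {coverage(n, 1000):.6f}")
```

With the sample size held constant, coverage can never exceed the number of samples divided by 2^n, so most of the feature space is left unobserved as n grows.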
<h1 id="techniques-to-avoid-cod">Techniques to avoid COD<a aria-hidden="true" class="anchor-heading icon-link" href="#techniques-to-avoid-cod"></a></h1>
<ul>
<li>Strict forward feature selection (add a feature only when it gives a marginal improvement in the cross-validation (CV) score)</li>
<li>Feature selection using permutation importance</li>
<li>Feature extraction: PCA/t-SNE</li>
<li>Regularization</li>
<li>Model selection, choosing models that are less prone to overfitting</li>
<li>Sample more data - bootstrapping</li>
</ul>
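As an illustration of the first technique, here is a minimal sketch of greedy forward feature selection. It assumes the caller supplies a <code>cv_score</code> function that returns a cross-validation score for a given list of features. The function names, the <code>min_gain</code> threshold, and the toy scoring function with its made-up gains are all hypothetical; in practice <code>cv_score</code> would train and validate a real model.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    cv_score: Callable[[List[str]], float],
    min_gain: float = 0.01,
) -> Tuple[List[str], float]:
    """Greedily add the feature that raises the CV score most; stop when the gain is too small."""
    selected: List[str] = []
    best = cv_score(selected)  # baseline score with no features
    remaining = list(candidates)
    while remaining:
        # Score every candidate when added to the current selection.
        scored = [(cv_score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best < min_gain:
            break  # strict rule: no feature gives enough improvement
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical stand-in for real cross-validation: each feature adds a fixed gain.
TOY_GAINS = {"age": 0.20, "income": 0.10, "noise_1": 0.001, "noise_2": 0.0}


def toy_cv_score(features: List[str]) -> float:
    return 0.5 + sum(TOY_GAINS[f] for f in features)


if __name__ == "__main__":
    chosen, score = forward_select(list(TOY_GAINS), toy_cv_score)
    print(chosen, round(score, 3))
```

The <code>min_gain</code> threshold is what makes the selection strict: the two noise features in the toy example never clear it, so they are left out.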
