{"id":1409,"date":"2013-11-24T17:58:06","date_gmt":"2013-11-24T17:58:06","guid":{"rendered":"http:\/\/www.blopig.com\/blog\/?p=1409"},"modified":"2013-11-25T11:28:21","modified_gmt":"2013-11-25T11:28:21","slug":"how-many-bins","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2013\/11\/how-many-bins\/","title":{"rendered":"How many bins?"},"content":{"rendered":"<p>In non-parametric kernel density estimation, the bandwidth is known to have a large effect on the estimated density, and it is usually the parameter that controls the trade-off between bias and roughness of the estimate (Jones et al. 1996). An analogous problem for histograms is the choice of the bin width; when all bins have equal width, the problem reduces to choosing the number of bins. Knuth (2013) proposed a data-based method for building equal-width histograms, based on the marginal of the joint posterior of the number of bins and the bin heights. To build the histogram, the number of bins is first chosen as the value (<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7BM%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{M} \" class=\"latex\" \/>) that maximises the following posterior distribution for the number of bins:<br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P%28M%7Cd%2CI%29%5C%2C+%5Cpropto+%5C%2C%28M%2FV%29%5EN+%5Cfrac%7B%5CGamma%28M%2F2%29+%5Cprod_%7Bk%3D1%7D%5EM+%5CGamma%28n_k%2B1%2F2%29%7D%7B%5CGamma%281%2F2%29%5EM+%5CGamma%28N%2BM%2F2%29%7D++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P(M|d,I)&#92;, &#92;propto &#92;,(M\/V)^N &#92;frac{&#92;Gamma(M\/2) &#92;prod_{k=1}^M &#92;Gamma(n_k+1\/2)}{&#92;Gamma(1\/2)^M &#92;Gamma(N+M\/2)}  \" class=\"latex\" \/><\/p>\n<p><span style=\"line-height: 1.714285714;font-size: 1rem\">where <img decoding=\"async\" loading=\"lazy\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"M\" class=\"latex\" \/> is the number of bins, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"d\" class=\"latex\" \/> is the data, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=I&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"I\" class=\"latex\" \/> is the prior knowledge about the problem, in particular the use of equal-width bins and the range of the data <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V\" class=\"latex\" \/>, which satisfies <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=V%3DMw&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"V=Mw\" class=\"latex\" \/>, where <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w\" class=\"latex\" \/> is the bin width, <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=N&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"N\" class=\"latex\" \/> is the number of data points, and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n_k\" class=\"latex\" \/> is the number of observations that fall in the <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/>th\u00a0<\/span>bin.<\/p>\n<p>Now, the height (<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h_k\" class=\"latex\" \/>) 
of each bin of the histogram is given by:<br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h_k%3D%5Cfrac%7BM%7D%7BV%7D+%5Cfrac%7Bn_k%2B1%2F2%7D%7BN%2BM%2F2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h_k=&#92;frac{M}{V} &#92;frac{n_k+1\/2}{N+M\/2}\" class=\"latex\" \/>.<\/p>\n<p>For a normal distribution, Knuth suggests that a sample of 150 data points is needed to &#8220;accurately and consistently estimate the shape of the distribution&#8221;.<\/p>\n<p>The following figure shows the relative log-posterior of the number of bins (left) and the estimated histogram for a mixture of three normal samples and a uniform sample on [0,50] (right).<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" class=\"size-medium wp-image-1412 aligncenter\" alt=\"Optimal binning\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?resize=300%2C159&#038;ssl=1\" width=\"300\" height=\"159\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?resize=300%2C159&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?resize=1024%2C544&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?resize=624%2C332&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2013\/11\/Optimal-binning.jpg?w=1045&amp;ssl=1 1045w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<div title=\"Page 5\">\n<p>Knuth, K. H. (2013). Optimal data-based binning for histograms. arXiv preprint\u00a0physics\/0605197. The first version of this paper was published in 2006.<\/p>\n<div title=\"Page 5\">\n<p>Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). 
A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433), 401\u2013407.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In non-parametric kernel density estimation, the bandwidth is known to have a large effect on the estimated density, and it is usually the parameter that controls the trade-off between bias and roughness of the estimate (Jones et al. 1996). An analogous problem for histograms is the choice of the bin width; when all bins have equal width, the problem [&hellip;]<\/p>\n","protected":false},"author":26,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10],"tags":[],"ppma_author":[514],"class_list":["post-1409","post","type-post","status-publish","format-standard","hentry","category-groupmeetings"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":514,"user_id":26,"is_guest":0,"slug":"luis","display_name":"Luis Ospina 
Forero","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/310cef32cd5dac5a383fe35d2e6fa0ed40cb03d0712d2b5a5ef81092db812b3e?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1409","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/26"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=1409"}],"version-history":[{"count":10,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1409\/revisions"}],"predecessor-version":[{"id":1427,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/1409\/revisions\/1427"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=1409"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=1409"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=1409"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1409"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}