<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I have not seen a lot of in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121319#M6186</link>
    <description>&lt;P&gt;I have not seen a lot of definitive statements on this topic.&amp;nbsp;&amp;nbsp; (Opinion: Intel generally likes to keep their descriptions as fuzzy as possible to maximize their options for future implementations....)&lt;/P&gt;

&lt;P&gt;In recent processors I have not seen any evidence of TSC drift across cores on a single socket.&amp;nbsp; My current (tentative) interpretation is that reading the TSC involves retrieving a value from the "master" timer in the uncore and then doing an interpolation on that value. If this is the case, then there is no mechanism to generate drift.&lt;/P&gt;

&lt;P&gt;I have also not seen evidence of TSC drift across sockets in 2-socket systems.&amp;nbsp; These should be receiving the same 100 MHz reference clock, so again there is no obvious mechanism to generate drift.&lt;/P&gt;

&lt;P&gt;It is challenging to design experimental methodologies to say anything about simultaneity in parallel computing systems, but I routinely look at TSC data across tightly synchronized threads and have not seen any systematic variation by core or socket that would lead me to be concerned about drift.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Dec 2016 00:10:26 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2016-12-23T00:10:26Z</dc:date>
    <item>
      <title>TSCs per logical processor? Per socket?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121318#M6185</link>
      <description>&lt;P&gt;Given that one can use WRMSR to adjust the TSC per-logical-processor, this implies each logical processor has at least its own counter. Are the increments of these counters completely synchronized on a single socket? The case I'm worried about is TSC drift (which is small but nonzero) causing the TSCs on separate cores to slowly diverge.&lt;/P&gt;

&lt;P&gt;A related question is if the TSCs across sockets are also always synchronized, assuming no one uses WRMSR to futz with it. I wouldn't expect separate sockets to share an ART, but I might be wrong.&lt;/P&gt;

&lt;P&gt;This is all assuming a recent chip with an invariant TSC.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Dec 2016 09:54:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121318#M6185</guid>
      <dc:creator>Corey_R_</dc:creator>
      <dc:date>2016-12-22T09:54:56Z</dc:date>
    </item>
    <item>
      <title>I have not seen a lot of</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121319#M6186</link>
      <description>&lt;P&gt;I have not seen a lot of definitive statements on this topic.&amp;nbsp;&amp;nbsp; (Opinion: Intel generally likes to keep their descriptions as fuzzy as possible to maximize their options for future implementations....)&lt;/P&gt;

&lt;P&gt;In recent processors I have not seen any evidence of TSC drift across cores on a single socket.&amp;nbsp; My current (tentative) interpretation is that reading the TSC involves retrieving a value from the "master" timer in the uncore and then doing an interpolation on that value. If this is the case, then there is no mechanism to generate drift.&lt;/P&gt;

&lt;P&gt;I have also not seen evidence of TSC drift across sockets in 2-socket systems.&amp;nbsp; These should be receiving the same 100 MHz reference clock, so again there is no obvious mechanism to generate drift.&lt;/P&gt;
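
&lt;P&gt;As a rough illustration of what "no drift" would mean in numbers, here is a minimal sketch that fits a least-squares slope to paired readings of two counters. The function name drift_ppm and the idea of collecting (ticks_a, ticks_b) pairs at nearly the same instant are assumptions made for this example; it says nothing about how the pairs are obtained, and it is not a description of how my own measurements are made.&lt;/P&gt;

```python
def drift_ppm(samples):
    """Estimate the relative drift between two counters, in parts per million.

    samples is a sequence of (ticks_a, ticks_b) integer pairs, each pair read
    at (nearly) the same instant. How the pairs are collected is outside this
    sketch.
    """
    n = len(samples)
    if n in (0, 1):
        raise ValueError("need at least two samples")
    # Subtract the first pair while still in integer arithmetic so that large
    # raw counter values do not lose precision when converted to float.
    a0, b0 = samples[0]
    da = [a - a0 for a, _ in samples]
    db = [b - b0 for _, b in samples]
    mean_a = sum(da) / n
    mean_b = sum(db) / n
    var_a = sum((a - mean_a) ** 2 for a in da)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(da, db))
    if var_a == 0:
        raise ValueError("counter A did not advance between samples")
    # Least-squares slope of B against A; a slope of exactly 1 means no drift.
    return (cov / var_a - 1.0) * 1e6
```

&lt;P&gt;A constant offset between the two counters does not affect the result, since only the slope is used. Noise in how closely each pair is taken in time shows up as scatter around the fit, so longer sampling intervals give a tighter estimate.&lt;/P&gt;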

&lt;P&gt;It is challenging to design experimental methodologies to say anything about simultaneity in parallel computing systems, but I routinely look at TSC data across tightly synchronized threads and have not seen any systematic variation by core or socket that would lead me to be concerned about drift.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Dec 2016 00:10:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121319#M6186</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-23T00:10:26Z</dc:date>
    </item>
    <item>
      <title>Thanks again John! I guess</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121320#M6187</link>
      <description>&lt;P&gt;Thanks again John! I guess that's good enough for me. I did come up with an experiment to measure TSC drift using multiple systems and some stats, but this one was stumping me. If I had anything approximating a test bench I could measure the drift on the BCLK pins as an approximation of the TSC drift (hopefully it wouldn't be *worse*), but alas, I am but a poor software person.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Dec 2016 07:20:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121320#M6187</guid>
      <dc:creator>Corey_R_</dc:creator>
      <dc:date>2016-12-23T07:20:00Z</dc:date>
    </item>
    <item>
      <title>Clock drift across nodes is a</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121321#M6188</link>
      <description>&lt;P&gt;Clock drift across nodes is a completely different topic, with a long history....&amp;nbsp; In the early 1990s, while I was an assistant professor at the University of Delaware, I worked a bit with David Mills (also at U.Del.), who designed the Network Time Protocol.&amp;nbsp; That protocol has to deal with extremely large latency variability, with minimum latencies that are not particularly small.&amp;nbsp;&amp;nbsp; It is a lot easier in a single machine room, where the variability is (usually) modest, and where the minimum latencies can be quite low.&lt;/P&gt;

&lt;P&gt;MPI has an option for "globally synchronized" output from MPI_Wtime().&amp;nbsp; It is not completely clear how well synchronized these timers will be, especially given the typical MPI_Wtime() resolution of 1 microsecond.&amp;nbsp; The MPI 2.2 standard says:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;A collection of clocks is considered synchronized if explicit effort has been taken to synchronize them. The expectation is that the variation in time, as measured by calls to MPI_WTIME, will be less then one half the round-trip time for an MPI message of length zero. If time is measured at a process just before a send and at another process just after a matching receive, the second time should be always higher than the first one.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;One half of the round trip time for modern high-performance fabrics is in the 1 microsecond range -- about the same as the MPI_Wtime() resolution that I typically see.&amp;nbsp; If the target is achieved, then one can treat results from MPI_Wtime() calls as accurate to about one "unit" (1 microsecond), no matter which node made the measurement.&lt;/P&gt;
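
&lt;P&gt;To show where a "one half of the round trip" bound comes from, here is a minimal sketch of a Cristian-style midpoint estimate from a single ping-pong exchange. The function names estimate_offset and best_offset and the three-timestamp layout are assumptions for illustration only; I am not claiming that any particular MPI implementation synchronizes its clocks this way.&lt;/P&gt;

```python
def estimate_offset(t_send, t_remote, t_recv):
    """Estimate the remote clock's offset from one ping-pong exchange.

    t_send and t_recv are local timestamps taken just before sending and just
    after receiving the reply; t_remote is the remote clock's reading carried
    in the reply. Returns (offset, bound): the estimated remote-minus-local
    offset and the worst-case error of that estimate.
    """
    if t_send > t_recv:
        raise ValueError("receive time precedes send time")
    round_trip = t_recv - t_send
    # Without further information, assume the remote reading was taken at the
    # midpoint of the exchange. It cannot lie outside [t_send, t_recv] in
    # local time, so the error is at most half the round trip.
    midpoint = t_send + round_trip / 2
    return t_remote - midpoint, round_trip / 2


def best_offset(exchanges):
    """Pick the estimate from the exchange with the shortest round trip."""
    return min((estimate_offset(*e) for e in exchanges), key=lambda r: r[1])
```

&lt;P&gt;Repeating the exchange and keeping the shortest round trip tightens the bound, which is why a low minimum latency matters more than the average latency for this kind of synchronization.&lt;/P&gt;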

&lt;P&gt;If I recall correctly, the SGI Origin2000 had hardware support for clock synchronization that enabled 1 microsecond consistency across the entire NUMA system, but this was designed into the NUMA interconnect and did not depend on external/third-party interconnects.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Dec 2016 17:57:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121321#M6188</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-12-23T17:57:50Z</dc:date>
    </item>
    <item>
      <title>I haven't been trying to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121322#M6189</link>
      <description>&lt;P&gt;I haven't been trying to synchronize nodes (yet), certainly not at the TSC level. I don't think it's feasible, from software, to synchronize them to better than a few hundred ns. My experiment was merely designed to estimate how much the TSC frequency varies over time across machines. I haven't run it yet, because I haven't written the software it needs.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 03:02:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121322#M6189</guid>
      <dc:creator>Corey_R_</dc:creator>
      <dc:date>2016-12-24T03:02:13Z</dc:date>
    </item>
    <item>
      <title>Not mentioned above is the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121323#M6190</link>
      <description>&lt;P&gt;Not mentioned above is the mechanism for starting the TSC clocks on multiple socket systems.&amp;nbsp; The OS startup should send signals nearly simultaneously to all CPUs to restart their TSC clocks.&amp;nbsp; In my experience with Intel platforms, this appears to keep them within 150 ns of synchronization, but they tell us not to rely on this.&amp;nbsp; On past AMD platforms, there didn't appear to be any such synchronization among sockets.&lt;/P&gt;

&lt;P&gt;Methods for keeping various nodes on a cluster synchronized don't appear to take care of TSC timers.&amp;nbsp; I'm not sure of the mechanism, but within a single node the MPI_Wtime() and omp_get_wtime() timers have appeared in my tests to be synchronized within the resolution of those timers, which didn't appear to be as good as 1 microsecond (not nearly as good on Windows as on Linux).&amp;nbsp; Both current Intel and GNU implementations of omp_get_wtime() for Windows use the QueryPerformanceCounter API, which won't be synchronized across nodes.&amp;nbsp; Windows gfortran also uses that for the system_clock intrinsic.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Dec 2016 20:35:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121323#M6190</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-12-24T20:35:32Z</dc:date>
    </item>
    <item>
      <title>Quote:Tim P. wrote:</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121324#M6191</link>
      <description>&lt;BLOCKQUOTE&gt;Tim P. wrote:&lt;BR /&gt;

&lt;P&gt;Not mentioned above is the mechanism for starting the TSC clocks on multiple socket systems.&amp;nbsp; The OS startup should send signals nearly simultaneously to all CPUs to restart their TSC clocks.&amp;nbsp; In my experience with Intel platforms, this appears to keep them within 150 ns of synchronization, but they tell us not to rely on this.&amp;nbsp; On past AMD platforms, there didn't appear to be any such synchronization among sockets.&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I'm woefully unaware of multisocket issues. Is this the INIT IPI?&lt;/P&gt;</description>
      <pubDate>Mon, 26 Dec 2016 05:57:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/TSCs-per-logical-processor-Per-socket/m-p/1121324#M6191</guid>
      <dc:creator>Corey_R_</dc:creator>
      <dc:date>2016-12-26T05:57:30Z</dc:date>
    </item>
  </channel>
</rss>

