I am having a lot of trouble with VTune on Linux lately. I've been running VTune 2017.1 on Broadwell-EP with different operating systems (Ubuntu 16.04, CentOS 7, Debian 8) and after several days of automated tests with VTune, at some point the file system becomes corrupt.
A year ago I had a similar issue with VTune 2015/2016 and Ubuntu 14.04 on IvyBridge-EP. The solution back then was to run the automated tests with VTune in RHEL7.
Since I got new hardware, a new version of VTune and a new version of Ubuntu, I was hoping that the issue would disappear but this isn't the case. Since the file system becomes completely corrupt, I cannot debug the issue.
I do hotspot analysis with VTune. I get VTune to start the nightly build of the application in test mode and collect some data. Has anybody experienced this?
We are very sorry to hear that your file system was corrupted. We have not heard of such a case before.
When a VTune Amplifier hotspots collection fails, it may corrupt its own trace file, but that should not affect the file system as a whole.
- Did you run Basic or Advanced Hotspots analysis?
- Has your new system ever crashed during a VTune Amplifier 2017 analysis, or for any other reason (power outage, etc.)?
- What is your Broadwell-EP configuration, hard drive and memory chip model, and Ubuntu kernel version?
- Did you submit a Premier Support issue for the 2015/2016 cases?
Are you running VTune collections in parallel during automated testing?
Can you please clarify what a "completely corrupt" file system means in your case: does the system stop booting, does fsck report many errors, do some of your files contain garbage, etc.?
Can I ask you to do an experiment? In your automated VTune runs, please disable all VTune analysis and stop running command-line reports, but keep running the VTune collection as you usually do, adding the "-no-auto-finalize" option, like:
$ amplxe-cl -collect hotspots -no-auto-finalize -result-dir <result> <application>
With your help we'll try to identify if this is caused by VTune collection or analysis stages.
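A minimal sketch of how the nightly job could wrap that collection, assuming a POSIX shell. The names `run_collection`, `AMPLXE_CL` and `LOG` are hypothetical and not part of VTune; only the amplxe-cl options shown above are used. It writes a timestamp before and after the collection step, so the last log entry before a failure shows whether collection was running. The log is assumed to live on a different disk or a network mount, so it survives if the local file system goes bad.

```shell
#!/bin/sh
# Hypothetical wrapper around the collection step of the nightly job.
# AMPLXE_CL and LOG are illustrative variables, not VTune settings.
AMPLXE_CL="${AMPLXE_CL:-amplxe-cl}"
LOG="${LOG:-vtune-nightly.log}"

# Usage: run_collection <result-dir> <application> [application args...]
run_collection() {
    rc_result_dir="$1"
    shift
    rc_status=0
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) start collect $rc_result_dir" >> "$LOG"
    # Collection only: -no-auto-finalize skips finalization of the result.
    "$AMPLXE_CL" -collect hotspots -no-auto-finalize \
        -result-dir "$rc_result_dir" "$@" || rc_status=$?
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) end collect $rc_result_dir status=$rc_status" >> "$LOG"
    return "$rc_status"
}
```

For example, `run_collection r001 ./my_app --test` (both arguments are placeholders) would run one collection into `r001` and leave two log lines behind.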
I am running just one instance of VTune at a time.
Actually, I was wrong about the hotspot analysis. This is the exact command:
/opt/intel/vtune_amplifier_xe_2017/bin64/amplxe-cl -collect-with runsa -knob event-config=FP_ARITH_INST_RETIRED.SCALAR_DOUBLE:sa=2000000,FP_ARITH_INST_RETIRED.SCALAR_SINGLE:sa=2000000,FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE:sa=2000000,FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE:sa=2000000,FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE:sa=2000000,FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE:sa=2000000, -data-limit=4000 -start-paused --result-dir
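For readability, the long event-config knob above can be assembled from a list of event names. This is only a sketch, assuming a POSIX shell; `build_event_config` is a hypothetical helper, not a VTune feature. Note that it emits no trailing comma after the last event, unlike the command as pasted; whether that comma matters to amplxe-cl is not something I have verified.

```shell
#!/bin/sh
# Hypothetical helper: build the "event-config=..." knob value from a
# sample-after value and a list of event names.
# Usage: build_event_config <sample-after> <event> [<event>...]
build_event_config() {
    bec_sa="$1"
    shift
    bec_cfg=""
    for bec_ev in "$@"; do
        # Join entries with commas; no leading or trailing comma.
        bec_cfg="${bec_cfg:+$bec_cfg,}$bec_ev:sa=$bec_sa"
    done
    printf 'event-config=%s\n' "$bec_cfg"
}
```

The result would be passed as `-knob "$(build_event_config 2000000 FP_ARITH_INST_RETIRED.SCALAR_DOUBLE FP_ARITH_INST_RETIRED.SCALAR_SINGLE)"`, with the remaining events appended in the same way.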
I don't know exactly at what point the system goes corrupt. I will experiment with the -no-auto-finalize option to see if it makes any difference but it will take several days since I have to reinstall the system.
I am currently using a single Xeon E5-2623v4 with 4 DIMMs of 2133 MHz ECC memory and two SSDs in RAID1 via a PERC RAID controller. I have also experienced this problem on a dual Xeon E5-2670v2 system with 8 DIMMs of ECC memory and a single HDD (no RAID). I am running with the default UEFI (BIOS) settings.
I always use the latest LTS version of Ubuntu, or Debian, or CentOS with the stock kernel.
When the system goes corrupt, sometimes I can SSH into it and other times I cannot. When I SSH into it, the disk is mounted read-only. Upon restart the system does not boot.
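Since the disk ends up mounted read-only, one thing the test job could do is check for that state between runs and stop before the next reboot, while the kernel log is still available. A rough sketch, assuming Linux and a POSIX shell with awk; `ro_mounts` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the mount points of block-device file systems
# that are currently mounted read-only.
# Reads /proc/mounts by default; another file in the same format can be
# passed as the first argument.
ro_mounts() {
    # Field 1 = device, field 2 = mount point, field 4 = mount options.
    # Match "ro" only as a whole option, so "errors=remount-ro" is ignored.
    awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}
```

If `ro_mounts` prints anything for a file system that is normally writable, the job could save the `dmesg` output to another machine and halt instead of starting the next VTune run.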
I haven't opened a support case but this could be a good idea.
Just to clarify, the corruption typically doesn't occur on first run. It happens after several days or weeks. So you probably can't reproduce the problem with only one test run.
This is indeed very strange, but as Ekaterina said, a VTune analysis should not corrupt your system. From your last message, I infer that SSH stops working after some time and that the system is sometimes unbootable.
This looks to me like a symptom of a hardware failure. What are the odds that you had defective hardware twice? I hope that is not the case, but just out of curiosity, have you run a hardware test on your system?
More precisely, have you done an extensive memory test? I would recommend doing an in-depth multi-pass test using memtest86+.
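Since the machines use ECC memory, the kernel's EDAC counters can complement an offline memtest86+ run while the system is up. A small sketch, assuming a Linux kernel with an EDAC driver loaded for the memory controller (if none is loaded, the directory is absent and the total is 0); `edac_error_total` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: sum the corrected (ce_count) and uncorrected
# (ue_count) memory error counters exposed by the Linux EDAC subsystem.
# The sysfs base directory can be overridden with the first argument.
edac_error_total() {
    edac_base="${1:-/sys/devices/system/edac/mc}"
    edac_total=0
    for edac_f in "$edac_base"/mc*/ce_count "$edac_base"/mc*/ue_count; do
        # Skip unmatched glob patterns and unreadable files.
        [ -r "$edac_f" ] || continue
        edac_n=$(cat "$edac_f")
        edac_total=$((edac_total + edac_n))
    done
    echo "$edac_total"
}
```

A total that grows during the automated runs would point towards memory, but a total of 0 does not rule out a hardware problem.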
What's the verdict?