From: Christian König christian.koenig@amd.com
When we set the valid bit on invalid GART entries they are loaded into the TLB when an adjacent entry is loaded. This poisons the TLB with invalid entries which are sometimes not correctly removed on TLB flush.
For stable inclusion the patch probably needs to be modified a bit.
Signed-off-by: Christian König christian.koenig@amd.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/radeon/rs600.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
index 0a8be63..e0465b2 100644
--- a/drivers/gpu/drm/radeon/rs600.c
+++ b/drivers/gpu/drm/radeon/rs600.c
@@ -634,7 +634,10 @@ int rs600_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
 		return -EINVAL;
 	}
 	addr = addr & 0xFFFFFFFFFFFFF000ULL;
-	addr |= R600_PTE_GART;
+	if (addr == rdev->dummy_page.addr)
+		addr |= R600_PTE_SYSTEM | R600_PTE_SNOOPED;
+	else
+		addr |= R600_PTE_GART;
 	writeq(addr, ptr + (i * 8));
 	return 0;
 }
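For readers following along outside the kernel tree, the entry selection in the patch above can be sketched as a standalone function. This is only an illustration: the flag bit positions and the names `PTE_*` and `make_gart_pte` are assumptions made for this sketch, not copied from radeon.h, and the real code writes the result to the table with writeq().

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed for illustration only; see radeon.h for the real ones. */
#define PTE_VALID     (1ULL << 0)
#define PTE_SYSTEM    (1ULL << 1)
#define PTE_SNOOPED   (1ULL << 2)
#define PTE_READABLE  (1ULL << 5)
#define PTE_WRITEABLE (1ULL << 6)
#define PTE_GART      (PTE_VALID | PTE_SYSTEM | PTE_SNOOPED | \
		       PTE_READABLE | PTE_WRITEABLE)

/* Keeps the page frame bits and drops the low 12 flag bits. */
#define PTE_ADDR_MASK 0xFFFFFFFFFFFFF000ULL

/*
 * Build a 64-bit GART entry (hypothetical helper).
 * Entries pointing at the dummy page are left without the valid bit,
 * so they should not be pulled into the TLB next to a real entry.
 */
static uint64_t make_gart_pte(uint64_t addr, uint64_t dummy_page_addr)
{
	/* The dummy page address is expected to be page aligned. */
	assert((dummy_page_addr & ~PTE_ADDR_MASK) == 0);

	addr &= PTE_ADDR_MASK;
	if (addr == dummy_page_addr)
		return addr | PTE_SYSTEM | PTE_SNOOPED;
	return addr | PTE_GART;
}
```

The point of the sketch is only that unbound slots keep a usable address but never carry the valid bit, while bound pages get the full flag set.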
From: Christian König christian.koenig@amd.com
We never check the return value anyway, and if the index weren't valid we would crash way before calling the functions.
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/radeon/r100.c        |  8 ++------
 drivers/gpu/drm/radeon/r300.c        |  7 ++-----
 drivers/gpu/drm/radeon/radeon.h      |  3 ++-
 drivers/gpu/drm/radeon/radeon_asic.h | 12 ++++++++----
 drivers/gpu/drm/radeon/rs400.c       |  7 +------
 drivers/gpu/drm/radeon/rs600.c       |  6 +-----
 6 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index ad99813..1544efc 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -682,15 +682,11 @@ void r100_pci_gart_disable(struct radeon_device *rdev)
 	WREG32(RADEON_AIC_HI_ADDR, 0);
 }
 
-int r100_pci_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
+void r100_pci_gart_set_page(struct radeon_device *rdev, unsigned i,
+			    uint64_t addr)
 {
 	u32 *gtt = rdev->gart.ptr;
-
-	if (i < 0 || i > rdev->gart.num_gpu_pages) {
-		return -EINVAL;
-	}
 	gtt[i] = cpu_to_le32(lower_32_bits(addr));
-	return 0;
 }
 
 void r100_pci_gart_fini(struct radeon_device *rdev)
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index 206caf9..3c21d77 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -72,13 +72,11 @@ void rv370_pcie_gart_tlb_flush(struct radeon_device *rdev)
 #define R300_PTE_WRITEABLE (1 << 2)
 #define R300_PTE_READABLE (1 << 3)
 
-int rv370_pcie_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
+void rv370_pcie_gart_set_page(struct radeon_device *rdev, unsigned i,
+			      uint64_t addr)
 {
 	void __iomem *ptr = rdev->gart.ptr;
 
-	if (i < 0 || i > rdev->gart.num_gpu_pages) {
-		return -EINVAL;
-	}
 	addr = (lower_32_bits(addr) >> 8) |
 	       ((upper_32_bits(addr) & 0xff) << 24) |
 	       R300_PTE_WRITEABLE | R300_PTE_READABLE;
@@ -86,7 +84,6 @@ int rv370_pcie_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
 	 * on powerpc without HW swappers, it'll get swapped on way
 	 * into VRAM - so no need for cpu_to_le32 on VRAM tables */
 	writel(addr, ((void __iomem *)ptr) + (i * 4));
-	return 0;
 }
 
 int rv370_pcie_gart_init(struct radeon_device *rdev)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 0661a77..c08987c 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -1778,7 +1778,8 @@ struct radeon_asic {
 	/* gart */
 	struct {
 		void (*tlb_flush)(struct radeon_device *rdev);
-		int (*set_page)(struct radeon_device *rdev, int i, uint64_t addr);
+		void (*set_page)(struct radeon_device *rdev, unsigned i,
+				 uint64_t addr);
 	} gart;
 	struct {
 		int (*init)(struct radeon_device *rdev);
diff --git a/drivers/gpu/drm/radeon/radeon_asic.h b/drivers/gpu/drm/radeon/radeon_asic.h
index 0eab015..01e7c0a 100644
--- a/drivers/gpu/drm/radeon/radeon_asic.h
+++ b/drivers/gpu/drm/radeon/radeon_asic.h
@@ -67,7 +67,8 @@ bool r100_gpu_is_lockup(struct radeon_device *rdev, struct radeon_ring *cp);
 int r100_asic_reset(struct radeon_device *rdev);
 u32 r100_get_vblank_counter(struct radeon_device *rdev, int crtc);
 void r100_pci_gart_tlb_flush(struct radeon_device *rdev);
-int r100_pci_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr);
+void r100_pci_gart_set_page(struct radeon_device *rdev, unsigned i,
+			    uint64_t addr);
 void r100_ring_start(struct radeon_device *rdev, struct radeon_ring *ring);
 int r100_irq_set(struct radeon_device *rdev);
 int r100_irq_process(struct radeon_device *rdev);
@@ -171,7 +172,8 @@ extern void r300_fence_ring_emit(struct radeon_device *rdev,
 				struct radeon_fence *fence);
 extern int r300_cs_parse(struct radeon_cs_parser *p);
 extern void rv370_pcie_gart_tlb_flush(struct radeon_device *rdev);
-extern int rv370_pcie_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr);
+extern void rv370_pcie_gart_set_page(struct radeon_device *rdev, unsigned i,
+				     uint64_t addr);
 extern void rv370_set_pcie_lanes(struct radeon_device *rdev, int lanes);
 extern int rv370_get_pcie_lanes(struct radeon_device *rdev);
 extern void r300_set_reg_safe(struct radeon_device *rdev);
@@ -206,7 +208,8 @@ extern void rs400_fini(struct radeon_device *rdev);
 extern int rs400_suspend(struct radeon_device *rdev);
 extern int rs400_resume(struct radeon_device *rdev);
 void rs400_gart_tlb_flush(struct radeon_device *rdev);
-int rs400_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr);
+void rs400_gart_set_page(struct radeon_device *rdev, unsigned i,
+			 uint64_t addr);
 uint32_t rs400_mc_rreg(struct radeon_device *rdev, uint32_t reg);
 void rs400_mc_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v);
 int rs400_gart_init(struct radeon_device *rdev);
@@ -229,7 +232,8 @@ int rs600_irq_process(struct radeon_device *rdev);
 void rs600_irq_disable(struct radeon_device *rdev);
 u32 rs600_get_vblank_counter(struct radeon_device *rdev, int crtc);
 void rs600_gart_tlb_flush(struct radeon_device *rdev);
-int rs600_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr);
+void rs600_gart_set_page(struct radeon_device *rdev, unsigned i,
+			 uint64_t addr);
 uint32_t rs600_mc_rreg(struct radeon_device *rdev, uint32_t reg);
 void rs600_mc_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v);
 void rs600_bandwidth_update(struct radeon_device *rdev);
diff --git a/drivers/gpu/drm/radeon/rs400.c b/drivers/gpu/drm/radeon/rs400.c
index 130d5cc..a0f96de 100644
--- a/drivers/gpu/drm/radeon/rs400.c
+++ b/drivers/gpu/drm/radeon/rs400.c
@@ -212,21 +212,16 @@ void rs400_gart_fini(struct radeon_device *rdev)
 #define RS400_PTE_WRITEABLE (1 << 2)
 #define RS400_PTE_READABLE (1 << 3)
 
-int rs400_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
+void rs400_gart_set_page(struct radeon_device *rdev, unsigned i, uint64_t addr)
 {
 	uint32_t entry;
 	u32 *gtt = rdev->gart.ptr;
 
-	if (i < 0 || i > rdev->gart.num_gpu_pages) {
-		return -EINVAL;
-	}
-
 	entry = (lower_32_bits(addr) & PAGE_MASK) |
 		((upper_32_bits(addr) & 0xff) << 4) |
 		RS400_PTE_WRITEABLE | RS400_PTE_READABLE;
 	entry = cpu_to_le32(entry);
 	gtt[i] = entry;
-	return 0;
 }
 
 int rs400_mc_wait_for_idle(struct radeon_device *rdev)
diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
index e0465b2..d1a35cb 100644
--- a/drivers/gpu/drm/radeon/rs600.c
+++ b/drivers/gpu/drm/radeon/rs600.c
@@ -626,20 +626,16 @@ static void rs600_gart_fini(struct radeon_device *rdev)
 	radeon_gart_table_vram_free(rdev);
 }
 
-int rs600_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
+void rs600_gart_set_page(struct radeon_device *rdev, unsigned i, uint64_t addr)
 {
 	void __iomem *ptr = (void *)rdev->gart.ptr;
 
-	if (i < 0 || i > rdev->gart.num_gpu_pages) {
-		return -EINVAL;
-	}
 	addr = addr & 0xFFFFFFFFFFFFF000ULL;
 	if (addr == rdev->dummy_page.addr)
 		addr |= R600_PTE_SYSTEM | R600_PTE_SNOOPED;
 	else
 		addr |= R600_PTE_GART;
 	writeq(addr, ptr + (i * 8));
-	return 0;
 }
 
 int rs600_irq_set(struct radeon_device *rdev)
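To illustrate the reasoning behind dropping the per-entry checks, here is a minimal standalone sketch in which the range is validated once by the caller and the setter simply writes. All names (`toy_gart`, `toy_set_page`, `toy_bind`) are hypothetical and not part of the driver; the sketch also assumes the table holds exactly `num_pages` entries, in which case a test of the form `i > num_pages` would have let the index `num_pages` itself through.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GART table, not the driver's structure. */
struct toy_gart {
	uint32_t *table;     /* one 32-bit entry per GPU page */
	unsigned num_pages;  /* number of entries in table */
};

/* Per-entry setter: no return value, the index is trusted. */
static void toy_set_page(struct toy_gart *gart, unsigned i, uint64_t addr)
{
	/* Debug-only statement of the caller contract. */
	assert(i < gart->num_pages);
	gart->table[i] = (uint32_t)addr; /* lower 32 bits only */
}

/*
 * Caller-side range check: the whole range is validated once,
 * so the setter does not need to re-check every index.
 * Returns 0 on success, -1 if the range does not fit.
 */
static int toy_bind(struct toy_gart *gart, unsigned first, unsigned count,
		    const uint64_t *addrs)
{
	if (first > gart->num_pages || count > gart->num_pages - first)
		return -1;

	for (unsigned k = 0; k < count; k++)
		toy_set_page(gart, first + k, addrs[k]);
	return 0;
}
```

With an unsigned index the `i < 0` half of the old test can never be true, which is one more reason the check carried little weight.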
From: Christian König christian.koenig@amd.com
The underlying reason for the crashes seems to be fixed now.
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/radeon/radeon_asic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_asic.c b/drivers/gpu/drm/radeon/radeon_asic.c
index 34ea53d..34b9aa9 100644
--- a/drivers/gpu/drm/radeon/radeon_asic.c
+++ b/drivers/gpu/drm/radeon/radeon_asic.c
@@ -2029,8 +2029,8 @@ static struct radeon_asic ci_asic = {
 		.blit_ring_index = RADEON_RING_TYPE_GFX_INDEX,
 		.dma = &cik_copy_dma,
 		.dma_ring_index = R600_RING_TYPE_DMA_INDEX,
-		.copy = &cik_copy_cpdma,
-		.copy_ring_index = RADEON_RING_TYPE_GFX_INDEX,
+		.copy = &cik_copy_dma,
+		.copy_ring_index = R600_RING_TYPE_DMA_INDEX,
 	},
 	.surface = {
 		.set_reg = r600_set_surface_reg,
On Wed, Jun 4, 2014 at 9:29 AM, Christian König deathsimple@vodafone.de wrote:
> From: Christian König christian.koenig@amd.com
> When we set the valid bit on invalid GART entries they are loaded into the TLB when an adjacent entry is loaded. This poisons the TLB with invalid entries which are sometimes not correctly removed on TLB flush.
> For stable inclusion the patch probably needs to be modified a bit.
> Signed-off-by: Christian König christian.koenig@amd.com
> Cc: stable@vger.kernel.org
Series is: Reviewed-by: Alex Deucher alexander.deucher@amd.com
stable cc on patch 2 or 3 as well? I suppose we'd need to modify the patches anyway so that they would apply on older kernels.
Alex
On 04.06.2014 15:46, Alex Deucher wrote:
> On Wed, Jun 4, 2014 at 9:29 AM, Christian König deathsimple@vodafone.de wrote:
>> From: Christian König christian.koenig@amd.com
>> When we set the valid bit on invalid GART entries they are loaded into the TLB when an adjacent entry is loaded. This poisons the TLB with invalid entries which are sometimes not correctly removed on TLB flush.
>> For stable inclusion the patch probably needs to be modified a bit.
>> Signed-off-by: Christian König christian.koenig@amd.com
>> Cc: stable@vger.kernel.org
> Series is: Reviewed-by: Alex Deucher alexander.deucher@amd.com
> stable cc on patch 2 or 3 as well? I suppose we'd need to modify the patches anyway so that they would apply on older kernels.
No, the second patch is just a cleanup that removes unnecessary checks, and I think using CP DMA on stable kernels may still be a good idea.
Christian
> Alex
Sorry to tell you the bad news. This patch doesn't fix the hangs on my machine.
I tested drm-next-3.16 from Alex's tree. I also switched copying from SDMA to CP DMA, which hung too.
I also tried this:
git checkout (the problematic commit):
6d2f294 - drm/radeon: use normal BOs for the page tables v4
git cherry-pick (fixes):
0e97703c - drm/radeon: add define for flags used in R600+ GTT
0986c1a5 - drm/radeon: stop poisoning the GART TLB
4906f689 - drm/radeon: fix page directory update size estimation
4b095566 - drm/radeon: fix buffer placement under memory pressure v2
Then I tested both SDMA and CP DMA copying. Both were unstable.
Testing was done with piglit / quick.tests.
Marek
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Crap, I already wanted to check back with you if that really fixes your problems.
Thanks for the info. This crash also only happens on CIK, doesn't it?
Christian.
On 11.06.2014 01:30, Marek Olšák wrote:
> Sorry to tell you the bad news. This patch doesn't fix the hangs on my machine.
> I tested drm-next-3.16 from Alex's tree. I also switched copying from SDMA to CP DMA, which hung too.
> I also tried this:
> git checkout (the problematic commit):
> 6d2f294 - drm/radeon: use normal BOs for the page tables v4
> git cherry-pick (fixes):
> 0e97703c - drm/radeon: add define for flags used in R600+ GTT
> 0986c1a5 - drm/radeon: stop poisoning the GART TLB
> 4906f689 - drm/radeon: fix page directory update size estimation
> 4b095566 - drm/radeon: fix buffer placement under memory pressure v2
> Then I tested both SDMA and CP DMA copying. Both were unstable.
> Testing was done with piglit / quick.tests.
> Marek
I only tested Bonaire. I can test Cape Verde if needed.
Marek
On Wed, Jun 11, 2014 at 11:29 AM, Christian König deathsimple@vodafone.de wrote:
> Crap, I already wanted to check back with you if that really fixes your problems.
> Thanks for the info. This crash also only happens on CIK, doesn't it?
> Christian.
Please do so, and you might want to try 3.15.0 as well.
I've run piglit multiple times overnight on my Bonaire with 3.15.0 and that seemed to work perfectly fine.
Going to test Alex's drm-next-3.16 a bit more as well.
Christian.
On 11.06.2014 12:56, Marek Olšák wrote:
> I only tested Bonaire. I can test Cape Verde if needed.
> Marek
Hi,
With my "force_gtt" patch, Cape Verde is unstable too, so all GCN chips are affected.
I recommend applying that patch, because it will reproduce the problem faster. Without it, the hangs are very rare and it may take a while before they occur.
Marek
On Thu, Jun 12, 2014 at 1:23 PM, Christian König deathsimple@vodafone.de wrote:
> Please do so, and you might want to try 3.15.0 as well.
> I've run piglit multiple times overnight on my Bonaire with 3.15.0 and that seemed to work perfectly fine.
> Going to test Alex's drm-next-3.16 a bit more as well.
> Christian.
Hi Marek,
ah, yes! Piglit in combination with that patch can indeed crash the box.
Going to investigate now that I can reproduce it.
Thanks, Christian.
On 13.06.2014 15:19, Marek Olšák wrote:
> Hi,
> With my "force_gtt" patch, Cape Verde is unstable too, so all GCN chips are affected.
> I recommend applying that patch, because it will reproduce the problem faster. Without it, the hangs are very rare and it may take a while before they occur.
> Marek
On Fri, Jun 13, 2014 at 11:45 AM, Christian König deathsimple@vodafone.de wrote:
> Hi Marek,
> ah, yes! Piglit in combination with that patch can indeed crash the box.
> Going to investigate now that I can reproduce it.
I wonder if it's a clockgating issue with the MC or BIF? You might try adjusting the rdev->cg_flags (try setting it to 0) in radeon_asic.c or disabling dpm.
Alex
> Thanks, Christian.
On 13.06.2014 23:31, Alex Deucher wrote:
> On Fri, Jun 13, 2014 at 11:45 AM, Christian König deathsimple@vodafone.de wrote:
>> Hi Marek,
>> ah, yes! Piglit in combination with that patch can indeed crash the box.
>> Going to investigate now that I can reproduce it.
> I wonder if it's a clockgating issue with the MC or BIF? You might try adjusting the rdev->cg_flags (try setting it to 0) in radeon_asic.c or disabling dpm.
Unfortunately that was just a false alarm.
I was just on a branch which didn't have the "stop poisoning the GART TLB" patch. After applying it I can again let piglit run for the whole night without a lockup.
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
Christian.
> Alex
>> Thanks, Christian.
Am 13.06.2014 15:19, schrieb Marek Olšák:
Hi,
With my "force_gtt" patch, Cape Verde is unstable too, so all GCN chips are affected.
I recommend applying that patch, because it will reproduce the problem faster. Without it, the hangs are very rare and it may take a while before they occur.
Marek
On Thu, Jun 12, 2014 at 1:23 PM, Christian König deathsimple@vodafone.de wrote:
Please do so, and you might want to try 3.15.0 as well.
I've tested multiple piglit runs over night with my Bonaire and 3.15.0 and that seemed to work perfectly fine.
Going to test Alex drm-next-3.16 a bit more as well.
Christian.
Am 11.06.2014 12:56, schrieb Marek Olšák:
I only tested Bonaire. I can test Cape Verde if needed.
Marek
On Wed, Jun 11, 2014 at 11:29 AM, Christian König deathsimple@vodafone.de wrote:
Crap, I already wanted to check back with you if that really fixes your problems.
Thanks for the info, this crash also only happens on CIK doesn't it?
Christian.
Am 11.06.2014 01:30, schrieb Marek Olšák:
> Sorry to tell you the bad news. This patch doesn't fix the hangs on my
> machine.
>
> I tested drm-next-3.16 from Alex's tree. I also switched copying from
> SDMA to CP DMA, which hung too.
>
> I also tried this:
>
> git checkout (the problematic commit):
> 6d2f294 - drm/radeon: use normal BOs for the page tables v4
>
> git cherry-pick (fixes):
> 0e97703c - drm/radeon: add define for flags used in R600+ GTT
> 0986c1a5 - drm/radeon: stop poisoning the GART TLB
> 4906f689 - drm/radeon: fix page directory update size estimation
> 4b095566 - drm/radeon: fix buffer placement under memory pressure v2
>
> Then I tested both SDMA and CP DMA copying. Both were unstable.
>
> Testing was done with piglit / quick.tests.
>
> Marek
>
> On Wed, Jun 4, 2014 at 3:29 PM, Christian König deathsimple@vodafone.de wrote:
>> From: Christian König christian.koenig@amd.com
>>
>> When we set the valid bit on invalid GART entries they are
>> loaded into the TLB when an adjacent entry is loaded. This
>> poisons the TLB with invalid entries which are sometimes
>> not correctly removed on TLB flush.
>>
>> For stable inclusion the patch probably needs to be modified a bit.
>>
>> Signed-off-by: Christian König christian.koenig@amd.com
>> Cc: stable@vger.kernel.org
>> ---
>>  drivers/gpu/drm/radeon/rs600.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
>> index 0a8be63..e0465b2 100644
>> --- a/drivers/gpu/drm/radeon/rs600.c
>> +++ b/drivers/gpu/drm/radeon/rs600.c
>> @@ -634,7 +634,10 @@ int rs600_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
>>  		return -EINVAL;
>>  	}
>>  	addr = addr & 0xFFFFFFFFFFFFF000ULL;
>> -	addr |= R600_PTE_GART;
>> +	if (addr == rdev->dummy_page.addr)
>> +		addr |= R600_PTE_SYSTEM | R600_PTE_SNOOPED;
>> +	else
>> +		addr |= R600_PTE_GART;
>>  	writeq(addr, ptr + (i * 8));
>>  	return 0;
>>  }
>> --
>> 1.9.1
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
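For readers following along, the effect of the quoted patch on a single GART entry can be sketched in standalone C. This is only an illustration and not the driver code: the PTE_* bit positions and the helper name make_gart_pte() are assumptions made for this sketch, so check the R600_PTE_* definitions in radeon.h of your tree before relying on them. The point it shows is that an entry pointing at the dummy page keeps the SYSTEM and SNOOPED bits but no longer gets the valid bit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only (not copied from radeon.h). */
#define PTE_VALID     (1ULL << 0)
#define PTE_SYSTEM    (1ULL << 1)
#define PTE_SNOOPED   (1ULL << 2)
#define PTE_READABLE  (1ULL << 5)
#define PTE_WRITEABLE (1ULL << 6)
#define PTE_GART      (PTE_VALID | PTE_SYSTEM | PTE_SNOOPED | \
                       PTE_READABLE | PTE_WRITEABLE)

/*
 * Hypothetical helper mirroring the logic of the patch: mask the address
 * down to a 4 KiB page, then leave the valid bit clear if the page is the
 * dummy page, so that the entry is not treated as a valid translation.
 */
static uint64_t make_gart_pte(uint64_t addr, uint64_t dummy_addr)
{
    /* The comparison below only works if the dummy page is page aligned. */
    assert((dummy_addr & 0xFFFULL) == 0);

    addr &= 0xFFFFFFFFFFFFF000ULL;
    if (addr == dummy_addr)
        addr |= PTE_SYSTEM | PTE_SNOOPED;   /* invalid entry: no VALID bit */
    else
        addr |= PTE_GART;                   /* normal, fully valid entry */
    return addr;
}
```

With these assumed values, a dummy page at 0x1000 yields the entry 0x1006, while any other page gets all five flag bits set.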
On 15.06.2014 21:48, Christian König wrote:
On 13.06.2014 23:31, Alex Deucher wrote:
On Fri, Jun 13, 2014 at 11:45 AM, Christian König deathsimple@vodafone.de wrote:
Hi Marek,
ah, yes! Piglit in combination with that patch can indeed crash the box.
Going to investigate now that I can reproduce it.
I wonder if it's a clockgating issue with the MC or BIF? You might try adjusting the rdev->cg_flags (try setting it to 0) in radeon_asic.c or disabling dpm.
Unfortunately that was just a false alarm.
I was just on a branch which didn't have the "stop poisoning the GART TLB" patch. After applying this patch I can again let piglit run for the whole night without a lockup.
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is fine. 3.15 seems stable on Kaveri though, but I haven't tried the force_gtt patch on that yet.
There have also been a number of bug reports about stability regressions in 3.15 on various SI and CIK cards. It seems likely that at least some of those are related to this issue as well.
If we can't figure out the problem soon, we probably need to revert the 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
On 19.06.2014 03:48, Michel Dänzer wrote:
On 15.06.2014 21:48, Christian König wrote:
On 13.06.2014 23:31, Alex Deucher wrote:
On Fri, Jun 13, 2014 at 11:45 AM, Christian König deathsimple@vodafone.de wrote:
Hi Marek,
ah, yes! Piglit in combination with that patch can indeed crash the box.
Going to investigate now that I can reproduce it.
I wonder if it's a clockgating issue with the MC or BIF? You might try adjusting the rdev->cg_flags (try setting it to 0) in radeon_asic.c or disabling dpm.
Unfortunately that was just a false alarm.
I was just on a branch which didn't have the "stop poisoning the GART TLB" patch. After applying this patch I can again let piglit run for the whole night without a lockup.
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is fine. 3.15 seems stable on Kaveri though, but I haven't tried the force_gtt patch on that yet.
Yeah, I think it's just me who has a stable system with 3.15 and that annoys me quite a bit.
No idea what's the difference. What versions of LLVM/Mesa/Piglit are you using for the test?
There have also been a number of bug reports about stability regressions in 3.15 on various SI and CIK cards. It seems likely that at least some of those are related to this issue as well.
If we can't figure out the problem soon, we probably need to revert the 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
I thought about this for the whole 3.15 release cycle, but decided against it. What we could do instead is apply the attached trivial patch; it pins down the page tables and so pretty much reverts to the old behavior.
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
Going to try harder to crash my 3.15 system, Christian.
On 19.06.2014 18:45, Christian König wrote:
On 19.06.2014 03:48, Michel Dänzer wrote:
On 15.06.2014 21:48, Christian König wrote:
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is fine. 3.15 seems stable on Kaveri though, but I haven't tried the force_gtt patch on that yet.
Yeah, I think it's just me who has a stable system with 3.15 and that annoys me quite a bit.
FWIW though, my Kaveri doesn't always survive piglit either, e.g. this morning it didn't once again, then did after a reboot. (That's using SDMA; Kaveri was never switched back to CPDMA)
No idea what's the difference. What versions of LLVM/Mesa/Piglit are you using for the test?
Current Git of everything.
There have also been a number of bug reports about stability regressions in 3.15 on various SI and CIK cards. It seems likely that at least some of those are related to this issue as well.
If we can't figure out the problem soon, we probably need to revert the 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
I thought about this for the whole 3.15 release cycle, but decided against it. What we could do instead is apply the attached trivial patch; it pins down the page tables and so pretty much reverts to the old behavior.
This patch applied on top of 3.15 + stop poisoning the GART TLB doesn't seem to help on my Bonaire, unfortunately.
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
On 19.06.2014 03:48, Michel Dänzer wrote:
On 15.06.2014 21:48, Christian König wrote:
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is fine. 3.15 seems stable on Kaveri though, but I haven't tried the force_gtt patch on that yet.
Yeah, I think it's just me who has a stable system with 3.15 and that annoys me quite a bit.
FWIW though, my Kaveri doesn't always survive piglit either, e.g. this morning it didn't once again, then did after a reboot. (That's using SDMA; Kaveri was never switched back to CPDMA)
No idea what's the difference. What versions of LLVM/Mesa/Piglit are you using for the test?
Current Git of everything.
There have also been a number of bug reports about stability regressions in 3.15 on various SI and CIK cards. It seems likely that at least some of those are related to this issue as well.
If we can't figure out the problem soon, we probably need to revert the 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
I thought about this for the whole 3.15 release cycle, but decided against it. What we could do instead is apply the attached trivial patch; it pins down the page tables and so pretty much reverts to the old behavior.
This patch applied on top of 3.15 + stop poisoning the GART TLB doesn't seem to help on my Bonaire, unfortunately.
That's unfortunately what I already expected. Making the page tables movable isn't really the cause of the problem; it must rather be something more subtle, like an incorrect alignment somewhere.
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Thanks for the help, Christian.
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
On 24.06.2014 08:49, Michel Dänzer wrote:
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
My fault, I incorrectly resolved a merge conflict and then failed to test the right kernel.
BTW: Wasn't there an option to tell GRUB to use the latest installed kernel instead of the one with the highest version number? Can't seem to find that any more.
Please try attached patches instead, Christian.
On 24.06.2014 19:14, Christian König wrote:
On 24.06.2014 08:49, Michel Dänzer wrote:
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
My fault, I incorrectly resolved a merge conflict and then failed to test the right kernel.
BTW: Wasn't there an option to tell GRUB to use the latest installed kernel instead of the one with the highest version number? Can't seem to find that any more.
No idea, unfortunately.
Please try attached patches instead,
With these patches, 3.15 just survived two piglit runs on my Bonaire, one with the GART poisoning fix and one without. It never survived a single run before.
Acked-and-Tested-by: Michel Dänzer michel.daenzer@amd.com
On 25.06.2014 05:59, Michel Dänzer wrote:
On 24.06.2014 19:14, Christian König wrote:
On 24.06.2014 08:49, Michel Dänzer wrote:
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
My fault, I incorrectly resolved a merge conflict and then failed to test the right kernel.
BTW: Wasn't there an option to tell GRUB to use the latest installed kernel instead of the one with the highest version number? Can't seem to find that any more.
Maybe this helps (section 5. Grub 2 Files & Options). http://ubuntuforums.org/showthread.php?t=1195275
GRUB_DEFAULT
Regards, Dieter
On 25.06.2014 12:59, Michel Dänzer wrote:
On 24.06.2014 19:14, Christian König wrote:
On 24.06.2014 08:49, Michel Dänzer wrote:
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
My fault, I incorrectly resolved a merge conflict and then failed to test the right kernel.
[...]
Please try attached patches instead,
With these patches, 3.15 just survived two piglit runs on my Bonaire, one with the GART poisoning fix and one without. It never survived a single run before.
Acked-and-Tested-by: Michel Dänzer michel.daenzer@amd.com
So, are these patches going to 3.16 and 3.15?
On 27.06.2014 04:31, Michel Dänzer wrote:
On 25.06.2014 12:59, Michel Dänzer wrote:
On 24.06.2014 19:14, Christian König wrote:
On 24.06.2014 08:49, Michel Dänzer wrote:
On 23.06.2014 18:56, Christian König wrote:
On 23.06.2014 10:15, Michel Dänzer wrote:
On 19.06.2014 18:45, Christian König wrote:
I think even when we revert to the old code we have a couple of unsolved problems with the VM support or in the driver in general where we should try to understand the underlying reason for it instead of applying more workarounds.
I'm not suggesting applying more workarounds but going back to a known more stable state. It seems like we've maneuvered ourselves to a rather uncomfortable position from there, with no clear way to a better place. But if we basically started from the 3.14 state again, we have a few known hurdles like mine and Marek's Bonaire etc. which we know any further improvements will have to pass before they can be considered for general consumption.
Yeah agree, especially on the uncomfortable position.
Please try with the two attached patches applied on top of 3.15 and retest. They should revert back to the old implementation.
Unfortunately, X fails to start with these, see the attached excerpt from dmesg.
My fault, I incorrectly resolved a merge conflict and then failed to test the right kernel.
[...]
Please try attached patches instead,
With these patches, 3.15 just survived two piglit runs on my Bonaire, one with the GART poisoning fix and one without. It never survived a single run before.
Acked-and-Tested-by: Michel Dänzer michel.daenzer@amd.com
So, are these patches going to 3.16 and 3.15?
We could send them in for 3.15, but for 3.16 we have some new features that depend on the new code.
We could backport them to the old code, but I really want to work on figuring out what's wrong with the new approach instead.
Going to prepare a branch for you to test over the weekend, would be nice if you could give it a try on Monday and see if that fixes the issues as well.
Thanks, Christian.
On 27.06.2014 17:26, Christian König wrote:
On 27.06.2014 04:31, Michel Dänzer wrote:
On 25.06.2014 12:59, Michel Dänzer wrote:
With these patches, 3.15 just survived two piglit runs on my Bonaire, one with the GART poisoning fix and one without. It never survived a single run before.
Acked-and-Tested-by: Michel Dänzer michel.daenzer@amd.com
So, are these patches going to 3.16 and 3.15?
We could send them in for 3.15,
What's the alternative for 3.15?
Looks like e.g. https://bugs.freedesktop.org/show_bug.cgi?id=80141 is confirmed to be this.
but for 3.16 we have some new features that depend on the new code.
We could backport them to the old code, but I really want to work on figuring out what's wrong with the new approach instead.
Going to prepare a branch for you to test over the weekend, would be nice if you could give it a try on Monday and see if that fixes the issues as well.
Sure, will do.
On 27.06.2014 10:59, Michel Dänzer wrote:
On 27.06.2014 17:26, Christian König wrote:
On 27.06.2014 04:31, Michel Dänzer wrote:
On 25.06.2014 12:59, Michel Dänzer wrote:
With these patches, 3.15 just survived two piglit runs on my Bonaire, one with the GART poisoning fix and one without. It never survived a single run before.
Acked-and-Tested-by: Michel Dänzer michel.daenzer@amd.com
So, are these patches going to 3.16 and 3.15?
We could send them in for 3.15,
What's the alternative for 3.15?
Well, figuring out what's the real reason behind those lockups would be a good start :)
Looks like e.g. https://bugs.freedesktop.org/show_bug.cgi?id=80141 is confirmed to be this.
but for 3.16 we have some new features that depend on the new code.
We could backport them to the old code, but I really want to work on figuring out what's wrong with the new approach instead.
Going to prepare a branch for you to test over the weekend, would be nice if you could give it a try on Monday and see if that fixes the issues as well.
Sure, will do.
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
I've disabled the redirection of page faults to the dummy page for now, so the system should lock up on the first page fault it encounters. Apart from that, the page directory and page tables are now completely over-allocated and over-aligned.
Setting the READABLE bit on invalid entries shouldn't have any effect other than making those entries non-zero. So please try to lock up your Bonaire with this branch, and as soon as you encounter the first page fault take a look at VM_CONTEXT1_PROTECTION_FAULT_STATUS and figure out which VMID caused the lockup.
Then use the attached script to dump the complete page directory and page tables of the VMID in question, e.g. "./dump_vm.sh 1" if the lockup was caused by VMID 1. Make sure you've got a radeontool that supports CIK, otherwise it will only return zeros as the page directory address.
Since even the invalid page table entries should now have at least the READABLE bit set, there shouldn't be anything zero in this dump. Look out for anything else suspicious as well (0xdeadbeef etc.).
Thanks for the help, Christian.
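The check described in the mail above (no zero entries, nothing that looks like 0xdeadbeef) can be sketched as a small scan over the dumped entries. This is only an illustration under stated assumptions: it presumes the dump has already been loaded into memory as an array of 64-bit entries, and the names pt_scan_result and scan_page_table() are made up for this sketch. They are not part of dump_vm.sh or radeontool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example poison value mentioned in the mail; other patterns may matter too. */
#define SUSPICIOUS_PATTERN 0xdeadbeefULL

/* Hypothetical summary of one scan over a page table dump. */
struct pt_scan_result {
    size_t zero_entries;    /* entries that are completely zero */
    size_t poison_entries;  /* entries containing the poison pattern */
    size_t first_bad;       /* index of the first bad entry, or n if none */
};

/*
 * Walk n 64-bit entries and count the ones that are zero or that carry
 * the poison pattern in either 32-bit half.
 */
static struct pt_scan_result scan_page_table(const uint64_t *pte, size_t n)
{
    struct pt_scan_result r = { 0, 0, n };

    assert(pte != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int zero = (pte[i] == 0);
        int poison = ((pte[i] & 0xffffffffULL) == SUSPICIOUS_PATTERN) ||
                     ((pte[i] >> 32) == SUSPICIOUS_PATTERN);

        if (zero)
            r.zero_entries++;
        if (poison)
            r.poison_entries++;
        if ((zero || poison) && r.first_bad == n)
            r.first_bad = i;
    }
    return r;
}
```

On the testing branch described above, any non-zero zero_entries count would already be a hint that something overwrote the table, since every entry is expected to have at least the READABLE bit set.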
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
On 30.06.2014 08:10, Michel Dänzer wrote:
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
That's indeed an interesting result. Can you try to figure out which of the patches on the branch did the trick for you?
Thanks, Christian.
On 30.06.2014 16:43, Christian König wrote:
On 30.06.2014 08:10, Michel Dänzer wrote:
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
That's indeed an interesting result. Can you try to figure out which of the patches on the branch did the trick for you?
The winner is 'drm/radeon: completely over allocate PD and PTs'. That patch alone on top of 3.15.2 makes piglit survive on my Bonaire.
On 01.07.2014 08:48, Michel Dänzer wrote:
On 30.06.2014 16:43, Christian König wrote:
On 30.06.2014 08:10, Michel Dänzer wrote:
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
That's indeed an interesting result. Can you try to figure out which of the patches on the branch did the trick for you?
The winner is 'drm/radeon: completely over allocate PD and PTs'. That patch alone on top of 3.15.2 makes piglit survive on my Bonaire.
Sounds like we either need to align the buffers a bit more, are accidentally overwriting parts of them, or indeed messed up their size calculation somewhere.
I've just pushed a new branch testing-3.15-v2 to git://people.freedesktop.org/~deathsimple/linux. It only contains the two patches already submitted for 3.15 inclusion and the "drm/radeon: completely over allocate PD and PTs" patch split into four separate changes.
Please retest, and if it still works, try once more to find out which change fixed it. I'm going to try to purposely un-align the buffers on my Bonaire in the meantime; maybe I can get it to crash as well.
Thanks, Christian.
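To make the "align a bit more" versus "size calculation" distinction concrete, here is a minimal sketch of the two operations involved. It is an assumption-laden illustration and not the radeon code: the helper names are hypothetical, the 8-byte entry size is taken from the writeq(addr, ptr + (i * 8)) line in the GART patch and is only assumed to apply to the VM tables as well, and the alignment values used in the examples are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* True if a is a non-zero power of two. */
static int is_pow2(uint64_t a)
{
    return a != 0 && (a & (a - 1)) == 0;
}

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(is_pow2(a));
    return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a (a must be a power of two). */
static int is_aligned(uint64_t x, uint64_t a)
{
    assert(is_pow2(a));
    return (x & (a - 1)) == 0;
}

/*
 * Bytes needed for a table of num_entries entries of 8 bytes each,
 * padded up to the requested alignment. "Over allocating" in the sense
 * of the mail would mean passing a larger alignment than strictly needed.
 */
static uint64_t table_bytes(uint64_t num_entries, uint64_t align)
{
    return align_up(num_entries * 8, align);
}
```

A table of 513 entries already needs two 4 KiB pages under these assumptions, which is the kind of off-by-one in a size estimate that could let a neighbouring buffer overlap the end of a page table.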
On 01.07.2014 21:16, Christian König wrote:
On 01.07.2014 08:48, Michel Dänzer wrote:
On 30.06.2014 16:43, Christian König wrote:
On 30.06.2014 08:10, Michel Dänzer wrote:
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
That's indeed an interesting result. Can you try to figure out which of the patches on the branch did the trick for you?
The winner is 'drm/radeon: completely over allocate PD and PTs'. That patch alone on top of 3.15.2 makes piglit survive on my Bonaire.
Sounds like we either need to align the buffers a bit more, are accidentally overwriting parts of them, or indeed messed up their size calculation somewhere.
I've just pushed a new branch testing-3.15-v2 to git://people.freedesktop.org/~deathsimple/linux. It only contains the two patches already submitted for 3.15 inclusion and the "drm/radeon: completely over allocate PD and PTs" patch split into four separate changes.
Please retest and if it still works try once more which change fixed it.
It's hard to say, I'm afraid. I had a successful run with only the first two of the split up changes, but then after both of them failing by themselves, another run with both of them failed as well. So it seems like both of those are required, but maybe not sufficient.
FWIW, I've also had successful runs with the first three of the split changes, and with all of them.
FWIW, I've also had successful runs with the first three of the split changes, and with all of them.
Ok I've just pushed a branch testing-3.15-v3 to fdo which moves all page table allocation to the end of VRAM. Please try with this memory layout, it should give us a good idea if it's indeed a memory corruption or something else.
Apart from that, please try to lock up your system with radeon.lockup_timeout=0 on the kernel command line and then try to get a dump of the VM page tables with the script I sent you in one of the mails.
Thanks for the help, Christian.
On 02.07.2014 08:57, Michel Dänzer wrote:
On 01.07.2014 21:16, Christian König wrote:
On 01.07.2014 08:48, Michel Dänzer wrote:
On 30.06.2014 16:43, Christian König wrote:
On 30.06.2014 08:10, Michel Dänzer wrote:
On 29.06.2014 19:34, Christian König wrote:
I've just pushed the branch testing-3.15 to git://people.freedesktop.org/~deathsimple/linux. It's based on 3.15.2 and contains the "stop poisoning the GART TLB" patch backported to 3.15 and a couple of things that I would like to try.
Running that branch, my Bonaire just survived a piglit run without lockup. I hope that's an interesting result. :)
That's indeed an interesting result. Can you try to figure out which of the patches on the branch did the trick for you?
The winner is 'drm/radeon: completely over allocate PD and PTs'. That patch alone on top of 3.15.2 makes piglit survive on my Bonaire.
Sounds like we either need to align the buffers a bit more, are accidentally overwriting parts of them, or indeed messed up their size calculation somewhere.
I've just pushed a new branch testing-3.15-v2 to git://people.freedesktop.org/~deathsimple/linux. It only contains the two patches already submitted for 3.15 inclusion and the "drm/radeon: completely over allocate PD and PTs" patch split into four separate changes.
Please retest and if it still works try once more which change fixed it.
It's hard to say, I'm afraid. I had a successful run with only the first two of the split up changes, but then after both of them failing by themselves, another run with both of them failed as well. So it seems like both of those are required, but maybe not sufficient.
FWIW, I've also had successful runs with the first three of the split changes, and with all of them.
On 03.07.2014 04:31, Christian König wrote:
FWIW, I've also had successful runs with the first three of the split changes, and with all of them.
Ok I've just pushed a branch testing-3.15-v3 to fdo which moves all page table allocation to the end of VRAM. Please try with this memory layout, it should give us a good idea if it's indeed a memory corruption or something else.
That branch just survived piglit as well.
Apart from that, please try to lock up your system with radeon.lockup_timeout=0 on the kernel command line and then try to get a dump of the VM page tables with the script I sent you in one of the mails.
Any preference for which changes of which branch I should try this with? E.g. with the two overalignment changes from testing-3.15-v2?
On 03.07.2014 05:48, Michel Dänzer wrote:
On 03.07.2014 04:31, Christian König wrote:
FWIW, I've also had successful runs with the first three of the split changes, and with all of them.
Ok I've just pushed a branch testing-3.15-v3 to fdo which moves all page table allocation to the end of VRAM. Please try with this memory layout, it should give us a good idea if it's indeed a memory corruption or something else.
That branch just survived piglit as well.
Ok, so it's probably not an alignment issue but indeed a memory corruption (crap, the former would be easier to fix).
Apart from that, please try to lock up your system with radeon.lockup_timeout=0 on the kernel command line and then try to get a dump of the VM page tables with the script I sent you in one of the mails.
Any preference for which changes of which branch I should try this with? E.g. with the two overalignment changes from testing-3.15-v2?
Just a blank 3.15 should be sufficient, I just want to take a look at the hexdump of the page tables to figure out what kind of memory corruption we have here.
Thanks, Christian.
Hi Michel,
3.15 doesn't contain Christian's fix yet, so it should always be broken for everybody. The fix is currently only in 3.16.
Alternatively, you can cherry-pick the fix to 3.15, but it doesn't apply cleanly.
There is a workaround in 3.15 which disables sDMA and uses CP DMA for copying buffers. It seems to help Christian's machine, but not mine.
When I said the kernel driver was broken, I meant that it was broken *with* the fix applied regardless of which engine was used for the copying.
Marek
On Thu, Jun 19, 2014 at 3:48 AM, Michel Dänzer michel@daenzer.net wrote:
On 15.06.2014 21:48, Christian König wrote:
On 13.06.2014 23:31, Alex Deucher wrote:
On Fri, Jun 13, 2014 at 11:45 AM, Christian König deathsimple@vodafone.de wrote:
Hi Marek,
ah, yes! Piglit in combination with that patch can indeed crash the box.
Going to investigate now that I can reproduce it.
I wonder if it's a clockgating issue with the MC or BIF? You might try adjusting the rdev->cg_flags (try setting it to 0) in radeon_asic.c or disabling dpm.
Unfortunately that was just a false alarm.
I was just on a branch which didn't have the "stop poisoning the GART TLB" patch. After applying this patch I can again let piglit run for the whole night without a lockup.
No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop poisoning the GART TLB"+"force_gtt" is rock solid here.
FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is fine. 3.15 seems stable on Kaveri though, but I haven't tried the force_gtt patch on that yet.
There have also been a number of bug reports about stability regressions in 3.15 on various SI and CIK cards. It seems likely that at least some of those are related to this issue as well.
If we can't figure out the problem soon, we probably need to revert the 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Hi Marek,
There is a workaround in 3.15 which disables sDMA and uses CP DMA for copying buffers. It seems to help Christian's machine, but not mine.
By stressing the box with piglit I was able to bring my machine down with CP DMA as well; only cherry-picking the "stop poisoning the GART TLB" patch really fixed that issue.
But I'm pretty sure that even with "stop poisoning the GART TLB" back-ported we still have at least one stability issue I can't reproduce.
Christian.
On 19.06.2014 19:20, Marek Olšák wrote:
Hi Michel,
3.15 doesn't contain Christian's fix yet, so it should be always broken for everybody. The fix is currently only in 3.16.
Alternatively, you can cherry-pick the fix to 3.15, but it doesn't apply cleanly.
That's a good point. Sorry, I should have mentioned I've been testing with the GART poisoning fix backported to 3.15.
There is a workaround in 3.15 which disables sDMA and uses CP DMA for copying buffers. It seems to help Christian's machine, but not mine.
I've been testing with CP DMA on Bonaire FWIW.