Attached is a set of patches that make it possible for drivers using the TTM API (the nouveau and radeon graphics drivers) to work under Xen. The explanation is a bit complex and I am not sure I am explaining it that well, so if something is unclear please do ping me.
Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
 - Cleaned up commit message (forgot to add my SoB).
Short explanation of problem: What we are hitting under Xen is that instead of programming the GART with the physical DMA address of the TTM page, we end up programming the bounce buffer DMA (bus) address!
Long explanation: The reason we end up doing this is that:
1). alloc_page with GFP_DMA32 does not allocate "real" pages (Machine Frame Numbers - MFNs) under 4GB in Xen. That is because if we actually made the pages underneath 4GB available to the Linux page allocator, we would not be able to give those to other guest devices. This would mean that if we tried to pass a USB device through to one guest while running the Xorg server in another, we wouldn't be able to do so, as we would run out of pages under 4GB. So pages that we get from alloc_page have a PFN that is under 4GB, but in reality the real physical address (MFN) is above 4GB. Ugh..
2). The backends for "struct ttm_backend_func" utilize the PCI API. When they get a page allocated via alloc_page, they use 'pci_map_page' and program the DMA (bus) address into the GART - which is correct. But then the calls that kick off the graphics driver to process the pages do not use the pci_page_sync_* calls. If the physical address of the page is the same as the DMA (bus) address returned from pci_map_page, then there is no trouble. But if they are different, i.e. virt_to_phys(page_address(p)) != pci_map_page(p, ..), then the graphics card fetches data from the DMA (bus) address (the value returned from pci_map_page), while the data the user wrote (to the page p) ends up being untouched. You are probably saying: "What? Nonsense, we stitch the page in ttm_bo_vm_fault using the PFN, and even if virt_to_phys(page_address(p)) != pci_map_page(p) the GART ends up with the bus (DMA) address of the PFN!" That is true. But if you combine this with 1), where you end up with a page that is above the dma_mask (even if you allocated it with GFP_DMA32), and then make a call to pci_map_page, you end up with a bounce buffer!
The problem above can be easily reproduced on bare-metal if you pass in "swiotlb=force iommu=soft".
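To make the failure mode concrete, here is a small illustrative helper (hypothetical, not part of the patch set; it assumes a lowmem page for simplicity). It maps the page and compares the returned bus address against the CPU-visible physical address, which is exactly the mismatch described above:

#include <linux/pci.h>
#include <linux/mm.h>
#include <asm/io.h>

/*
 * Illustrative only: returns true if DMA to/from this page would be
 * bounced, i.e. the device would see a different address than the one
 * the CPU writes to.  Under Xen (or with "swiotlb=force iommu=soft")
 * this is what happens for pages whose MFN lies above 4GB.
 */
static bool ttm_page_would_bounce(struct pci_dev *pdev, struct page *p)
{
	dma_addr_t bus = pci_map_page(pdev, p, 0, PAGE_SIZE,
				      PCI_DMA_BIDIRECTIONAL);
	bool bounced;

	if (pci_dma_mapping_error(pdev, bus))
		return true;	/* could not map at all */

	bounced = (bus != virt_to_phys(page_address(p)));
	pci_unmap_page(pdev, bus, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
	return bounced;
}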
There are three ways of fixing this:
1). Use 'dma_alloc_coherent' (or pci_alloc_consistent if there is a struct pci_dev present) instead of alloc_page for GFP_DMA32. 'dma_alloc_coherent' guarantees that the allocated page fits within the device's dma_mask (or uses the default DMA32 mask if no device is passed in). This also guarantees that any subsequent call to the PCI API for this page will return the same DMA (bus) address as the first call (so pci_alloc_consistent followed by pci_map_page will give the same DMA (bus) address).
2). Use the pci_sync_range_* calls after sending a page to the graphics engine (and before the CPU touches it again). If a bounce buffer is used, then we end up copying the pages on every such hand-off (a rough sketch of this follows the list below).
3). This one I really don't want to think about, but I should mention it: alter alloc_page and its backend to know about Xen. The pushback from the MM guys will probably be: "why not use the PCI API.."
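For reference, this is roughly what option 2) would look like. It is not what the patches below implement, and the helper names are made up; the actual kernel calls would be pci_dma_sync_single_for_device()/pci_dma_sync_single_for_cpu() on the bus address that pci_map_page() returned:

#include <linux/pci.h>

/*
 * Sketch of option 2) only; the series below implements option 1)
 * instead.  'bus_addr' is what pci_map_page() returned when the page
 * was put into the GART.
 */
static void gart_page_cpu_done(struct pci_dev *pdev, dma_addr_t bus_addr)
{
	/* CPU finished writing; flush/copy out to the bounce buffer
	 * (if any) before the GPU starts fetching. */
	pci_dma_sync_single_for_device(pdev, bus_addr, PAGE_SIZE,
				       PCI_DMA_BIDIRECTIONAL);
}

static void gart_page_gpu_done(struct pci_dev *pdev, dma_addr_t bus_addr)
{
	/* GPU finished; copy the bounce buffer back (if any) before the
	 * CPU looks at the page again. */
	pci_dma_sync_single_for_cpu(pdev, bus_addr, PAGE_SIZE,
				    PCI_DMA_BIDIRECTIONAL);
}

With swiotlb in play this means an extra copy on every hand-off, which is the cost option 1) avoids.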
So attached is a draft set of patches that use solution #1. I've tested this on baremetal (and Xen) on PCIe Radeon and nouveau cards with success.
One open issue: the 'NULL' passed to dma_alloc_coherent is unsightly. I was toying with modifying the TTM API to pass in a 'struct device' or 'struct pci_dev', but figured it would make sense to first get your input before heading down that route.
This patch-set is also available at:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v3
Konrad Rzeszutek Wilk (5):
      ttm: Introduce a placeholder for DMA (bus) addresses.
      tm: Utilize the dma_addr_t array for pages that are to in DMA32 pool.
      ttm: Expand (*populate) to support an array of DMA addresses.
      radeon/ttm/PCIe: Use dma_addr if TTM has set it.
      nouveau/ttm/PCIe: Use dma_addr if TTM has set it.
And diffstat:
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |   31 +++++++++++++++++++-------
 drivers/gpu/drm/radeon/radeon.h         |    4 ++-
 drivers/gpu/drm/radeon/radeon_gart.c    |   36 ++++++++++++++++++++++--------
 drivers/gpu/drm/radeon/radeon_ttm.c     |    8 +++++-
 drivers/gpu/drm/ttm/ttm_agp_backend.c   |    3 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c    |   35 +++++++++++++++++++++++++-----
 drivers/gpu/drm/ttm/ttm_tt.c            |   12 +++++++--
 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c  |    3 +-
 include/drm/ttm/ttm_bo_driver.h         |    6 ++++-
 include/drm/ttm/ttm_page_alloc.h        |    8 +++++-
 10 files changed, 111 insertions(+), 35 deletions(-)
This is right now limited to only non-pool constructs.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |    9 ++++++---
 drivers/gpu/drm/ttm/ttm_tt.c         |   10 ++++++++--
 include/drm/ttm/ttm_bo_driver.h      |    2 ++
 include/drm/ttm/ttm_page_alloc.h     |    8 ++++++--
 4 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index b1e02ff..6859288 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -38,6 +38,7 @@
 #include <linux/mm.h>
 #include <linux/seq_file.h> /* for seq_printf */
 #include <linux/slab.h>
+#include <linux/dma-mapping.h>

 #include <asm/atomic.h>

@@ -662,7 +663,8 @@ out:
  * cached pages.
  */
 int ttm_get_pages(struct list_head *pages, int flags,
-		  enum ttm_caching_state cstate, unsigned count)
+		  enum ttm_caching_state cstate, unsigned count,
+		  dma_addr_t *dma_address)
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
 	struct page *p = NULL;
@@ -720,7 +722,7 @@ int ttm_get_pages(struct list_head *pages, int flags,
 			printk(KERN_ERR TTM_PFX
 			       "Failed to allocate extra pages "
 			       "for large request.");
-			ttm_put_pages(pages, 0, flags, cstate);
+			ttm_put_pages(pages, 0, flags, cstate, NULL);
 			return r;
 		}
 	}
@@ -731,7 +733,8 @@ int ttm_get_pages(struct list_head *pages, int flags,

 /* Put all pages in pages list to correct pool to wait for reuse */
 void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags,
-		   enum ttm_caching_state cstate)
+		   enum ttm_caching_state cstate,
+		   dma_addr_t *dma_address)
 {
 	unsigned long irq_flags;
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index af789dc..0d39001 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -49,12 +49,16 @@ static int ttm_tt_swapin(struct ttm_tt *ttm);
 static void ttm_tt_alloc_page_directory(struct ttm_tt *ttm)
 {
 	ttm->pages = drm_calloc_large(ttm->num_pages, sizeof(*ttm->pages));
+	ttm->dma_address = drm_calloc_large(ttm->num_pages,
+					    sizeof(*ttm->dma_address));
 }

 static void ttm_tt_free_page_directory(struct ttm_tt *ttm)
 {
 	drm_free_large(ttm->pages);
 	ttm->pages = NULL;
+	drm_free_large(ttm->dma_address);
+	ttm->dma_address = NULL;
 }

 static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
@@ -105,7 +109,8 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)

 	INIT_LIST_HEAD(&h);

-	ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1);
+	ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1,
+			    &ttm->dma_address[index]);

 	if (ret != 0)
 		return NULL;
@@ -298,7 +303,8 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 			count++;
 		}
 	}
-	ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state);
+	ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state,
+		      ttm->dma_address);
 	ttm->state = tt_unpopulated;
 	ttm->first_himem_page = ttm->num_pages;
 	ttm->last_lomem_page = -1;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 8e0c848..6dc4fcc 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -149,6 +149,7 @@ enum ttm_caching_state {
  * @swap_storage: Pointer to shmem struct file for swap storage.
  * @caching_state: The current caching state of the pages.
  * @state: The current binding state of the pages.
+ * @dma_address: The DMA (bus) addresses of the pages (if TTM_PAGE_FLAG_DMA32)
  *
  * This is a structure holding the pages, caching- and aperture binding
  * status for a buffer object that isn't backed by fixed (VRAM / AGP)
@@ -173,6 +174,7 @@ struct ttm_tt {
 		tt_unbound,
 		tt_unpopulated,
 	} state;
+	dma_addr_t *dma_address;
 };

 #define TTM_MEMTYPE_FLAG_FIXED         (1 << 0)	/* Fixed (on-card) PCI memory */
diff --git a/include/drm/ttm/ttm_page_alloc.h b/include/drm/ttm/ttm_page_alloc.h
index 1168214..8062890 100644
--- a/include/drm/ttm/ttm_page_alloc.h
+++ b/include/drm/ttm/ttm_page_alloc.h
@@ -36,11 +36,13 @@
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state for the page.
  * @count: number of pages to allocate.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 int ttm_get_pages(struct list_head *pages,
 		  int flags,
 		  enum ttm_caching_state cstate,
-		  unsigned count);
+		  unsigned count,
+		  dma_addr_t *dma_address);
 /**
  * Put linked list of pages to pool.
  *
@@ -49,11 +51,13 @@ int ttm_get_pages(struct list_head *pages,
  *  count.
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 void ttm_put_pages(struct list_head *pages,
 		   unsigned page_count,
 		   int flags,
-		   enum ttm_caching_state cstate);
+		   enum ttm_caching_state cstate,
+		   dma_addr_t *dma_address);
 /**
  * Initialize pool allocator.
  */
Hi!
Please see the comment below. Otherwise,
Reviewed-by: Thomas Hellstrom <thomas@shipmail.org>
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
This is right now limited to only non-pool constructs.
Signed-off-by: Konrad Rzeszutek Wilk<konrad.wilk at oracle.com>
drivers/gpu/drm/ttm/ttm_page_alloc.c | 9 ++++++--- drivers/gpu/drm/ttm/ttm_tt.c | 10 ++++++++-- include/drm/ttm/ttm_bo_driver.h | 2 ++ include/drm/ttm/ttm_page_alloc.h | 8 ++++++-- 4 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c index b1e02ff..6859288 100644 --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c @@ -38,6 +38,7 @@ #include<linux/mm.h> #include<linux/seq_file.h> /* for seq_printf */ #include<linux/slab.h> +#include<linux/dma-mapping.h>
#include<asm/atomic.h>
@@ -662,7 +663,8 @@ out:
- cached pages.
*/ int ttm_get_pages(struct list_head *pages, int flags,
enum ttm_caching_state cstate, unsigned count)
enum ttm_caching_state cstate, unsigned count,
dma_addr_t *dma_address)
Indentation looks suspicious here and in other function prototypes...
{ struct ttm_page_pool *pool = ttm_get_pool(flags, cstate); struct page *p = NULL; @@ -720,7 +722,7 @@ int ttm_get_pages(struct list_head *pages, int flags, printk(KERN_ERR TTM_PFX "Failed to allocate extra pages " "for large request.");
-			ttm_put_pages(pages, 0, flags, cstate);
+			ttm_put_pages(pages, 0, flags, cstate, NULL);
 			return r;
 		}
 	}
@@ -731,7 +733,8 @@ int ttm_get_pages(struct list_head *pages, int flags,
/* Put all pages in pages list to correct pool to wait for reuse */ void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags,
-		   enum ttm_caching_state cstate)
+		   enum ttm_caching_state cstate,
+		   dma_addr_t *dma_address)
 {
 	unsigned long irq_flags;
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index af789dc..0d39001 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -49,12 +49,16 @@ static int ttm_tt_swapin(struct ttm_tt *ttm); static void ttm_tt_alloc_page_directory(struct ttm_tt *ttm) { ttm->pages = drm_calloc_large(ttm->num_pages, sizeof(*ttm->pages));
ttm->dma_address = drm_calloc_large(ttm->num_pages,
sizeof(*ttm->dma_address));
}
static void ttm_tt_free_page_directory(struct ttm_tt *ttm) { drm_free_large(ttm->pages); ttm->pages = NULL;
drm_free_large(ttm->dma_address);
ttm->dma_address = NULL; }
static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
@@ -105,7 +109,8 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
INIT_LIST_HEAD(&h);
ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1);
ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1,
&ttm->dma_address[index]);
if (ret != 0) return NULL;
@@ -298,7 +303,8 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm) count++; } }
- ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state);
+	ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state,
+		      ttm->dma_address);
 	ttm->state = tt_unpopulated;
 	ttm->first_himem_page = ttm->num_pages;
 	ttm->last_lomem_page = -1;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index 8e0c848..6dc4fcc 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -149,6 +149,7 @@ enum ttm_caching_state {
- @swap_storage: Pointer to shmem struct file for swap storage.
- @caching_state: The current caching state of the pages.
- @state: The current binding state of the pages.
- @dma_address: The DMA (bus) addresses of the pages (if TTM_PAGE_FLAG_DMA32)
- This is a structure holding the pages, caching- and aperture binding
- status for a buffer object that isn't backed by fixed (VRAM / AGP)
@@ -173,6 +174,7 @@ struct ttm_tt { tt_unbound, tt_unpopulated, } state;
dma_addr_t *dma_address; };
#define TTM_MEMTYPE_FLAG_FIXED (1<< 0) /* Fixed (on-card) PCI memory */
diff --git a/include/drm/ttm/ttm_page_alloc.h b/include/drm/ttm/ttm_page_alloc.h
index 1168214..8062890 100644
--- a/include/drm/ttm/ttm_page_alloc.h
+++ b/include/drm/ttm/ttm_page_alloc.h
@@ -36,11 +36,13 @@
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state for the page.
  * @count: number of pages to allocate.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 int ttm_get_pages(struct list_head *pages,
 		  int flags,
 		  enum ttm_caching_state cstate,
-		  unsigned count);
+		  unsigned count,
+		  dma_addr_t *dma_address);
 /**
  * Put linked list of pages to pool.
  *
@@ -49,11 +51,13 @@ int ttm_get_pages(struct list_head *pages,
  *  count.
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 void ttm_put_pages(struct list_head *pages,
 		   unsigned page_count,
 		   int flags,
-		   enum ttm_caching_state cstate);
+		   enum ttm_caching_state cstate,
+		   dma_addr_t *dma_address);
 /**
  * Initialize pool allocator.
  */
We only use the "if (pool == NULL)" path for right now.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |   26 +++++++++++++++++++++++---
 1 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 6859288..5d09677 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -683,14 +683,22 @@ int ttm_get_pages(struct list_head *pages, int flags,
 		gfp_flags |= GFP_HIGHUSER;

 	for (r = 0; r < count; ++r) {
-		p = alloc_page(gfp_flags);
+		if ((flags & TTM_PAGE_FLAG_DMA32) && dma_address) {
+			void *addr;
+			addr = dma_alloc_coherent(NULL, PAGE_SIZE,
+						  &dma_address[r],
+						  gfp_flags);
+			if (addr == NULL)
+				return -ENOMEM;
+			p = virt_to_page(addr);
+		} else
+			p = alloc_page(gfp_flags);
 		if (!p) {

 			printk(KERN_ERR TTM_PFX "Unable to allocate page.");
 			return -ENOMEM;
 		}
-
 		list_add(&p->lru, pages);
 	}
 	return 0;
@@ -739,12 +747,24 @@ void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags,
 	unsigned long irq_flags;
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
 	struct page *p, *tmp;
+	unsigned r;

 	if (pool == NULL) {
 		/* No pool for this memory type so free the pages */

+		r = page_count-1;
 		list_for_each_entry_safe(p, tmp, pages, lru) {
-			__free_page(p);
+			if ((flags & TTM_PAGE_FLAG_DMA32) && dma_address) {
+				void *addr = page_address(p);
+				WARN_ON(!addr || !dma_address[r]);
+				if (addr)
+					dma_free_coherent(NULL, PAGE_SIZE,
+							  addr,
+							  dma_address[r]);
+				dma_address[r] = 0;
+			} else
+				__free_page(p);
+			r--;
 		}
 		/* Make the pages list empty */
 		INIT_LIST_HEAD(pages);
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
We only use the "if (pool == NULL)" path for right now.
Signed-off-by: Konrad Rzeszutek Wilk<konrad.wilk at oracle.com>
The patch name is incorrect; also see a potential indentation issue below. Otherwise, Reviewed-by: Thomas Hellstrom <thomas@shipmail.org>
drivers/gpu/drm/ttm/ttm_page_alloc.c | 26 +++++++++++++++++++++++--- 1 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c index 6859288..5d09677 100644 --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c @@ -683,14 +683,22 @@ int ttm_get_pages(struct list_head *pages, int flags, gfp_flags |= GFP_HIGHUSER;
for (r = 0; r< count; ++r) {
p = alloc_page(gfp_flags);
if ((flags& TTM_PAGE_FLAG_DMA32)&& dma_address) {
void *addr;
addr = dma_alloc_coherent(NULL, PAGE_SIZE,
&dma_address[r],
gfp_flags);
Indentation again
if (addr == NULL)
return -ENOMEM;
p = virt_to_page(addr);
} else
+			p = alloc_page(gfp_flags);
 		if (!p) {
 			printk(KERN_ERR TTM_PFX "Unable to allocate page.");
 			return -ENOMEM;
 		}
 		list_add(&p->lru, pages);
 	}
 	return 0;
@@ -739,12 +747,24 @@ void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags, unsigned long irq_flags; struct ttm_page_pool *pool = ttm_get_pool(flags, cstate); struct page *p, *tmp;
- unsigned r;
 	if (pool == NULL) {
 		/* No pool for this memory type so free the pages */
+		r = page_count-1;
 		list_for_each_entry_safe(p, tmp, pages, lru) {
__free_page(p);
if ((flags& TTM_PAGE_FLAG_DMA32)&& dma_address) {
void *addr = page_address(p);
WARN_ON(!addr || !dma_address[r]);
if (addr)
dma_free_coherent(NULL, PAGE_SIZE,
addr,
dma_address[r]);
dma_address[r] = 0;
+			} else
+				__free_page(p);
+			r--;
 		}
 		/* Make the pages list empty */
 		INIT_LIST_HEAD(pages);
We pass in the array of ttm pages to be populated in the GART/MM of the card (or AGP). The patch titled "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool." uses the DMA API to give those pages proper DMA addresses (in the situation where page_to_phys or virt_to_phys do not give us the DMA (bus) address).
Since we are using the DMA API on those pages, we should pass in the DMA address to this function so it can save it in its proper fields (later patches use it).
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |    3 ++-
 drivers/gpu/drm/radeon/radeon_ttm.c     |    3 ++-
 drivers/gpu/drm/ttm/ttm_agp_backend.c   |    3 ++-
 drivers/gpu/drm/ttm/ttm_tt.c            |    2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c  |    3 ++-
 include/drm/ttm/ttm_bo_driver.h         |    4 +++-
 6 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c index 288baca..edc140a 100644 --- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c +++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c @@ -20,7 +20,8 @@ struct nouveau_sgdma_be {
static int nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages, - struct page **pages, struct page *dummy_read_page) + struct page **pages, struct page *dummy_read_page, + dma_addr_t *dma_addrs) { struct nouveau_sgdma_be *nvbe = (struct nouveau_sgdma_be *)be; struct drm_device *dev = nvbe->dev; diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 01c2c73..6f156e9 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -655,7 +655,8 @@ struct radeon_ttm_backend { static int radeon_ttm_backend_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages, - struct page *dummy_read_page) + struct page *dummy_read_page, + dma_addr_t *dma_addrs) { struct radeon_ttm_backend *gtt;
diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c index f999e36..1c4a72f 100644 --- a/drivers/gpu/drm/ttm/ttm_agp_backend.c +++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c @@ -47,7 +47,8 @@ struct ttm_agp_backend {
static int ttm_agp_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages, - struct page *dummy_read_page) + struct page *dummy_read_page, + dma_addr_t *dma_addrs) { struct ttm_agp_backend *agp_be = container_of(backend, struct ttm_agp_backend, backend); diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 0d39001..86d5b17 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -169,7 +169,7 @@ int ttm_tt_populate(struct ttm_tt *ttm) }
be->func->populate(be, ttm->num_pages, ttm->pages, - ttm->dummy_read_page); + ttm->dummy_read_page, ttm->dma_address); ttm->state = tt_unbound; return 0; } diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c index 80bc37b..87e43e0 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c @@ -102,7 +102,8 @@ struct vmw_ttm_backend {
static int vmw_ttm_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages, - struct page *dummy_read_page) + struct page *dummy_read_page, + dma_addr_t *dma_addrs) { struct vmw_ttm_backend *vmw_be = container_of(backend, struct vmw_ttm_backend, backend); diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index 6dc4fcc..ebcd3dd 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -50,13 +50,15 @@ struct ttm_backend_func { * @pages: Array of pointers to ttm pages. * @dummy_read_page: Page to be used instead of NULL pages in the * array @pages. + * @dma_addrs: Array of DMA (bus) address of the ttm pages. * * Populate the backend with ttm pages. Depending on the backend, * it may or may not copy the @pages array. */ int (*populate) (struct ttm_backend *backend, unsigned long num_pages, struct page **pages, - struct page *dummy_read_page); + struct page *dummy_read_page, + dma_addr_t *dma_addrs); /** * struct ttm_backend_func member clear *
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
Apart from previously mentioned indentation issues, Reviewed-by: Thomas Hellstrom thellstrom@shipmail.org
We pass in the array of ttm pages to be populated in the GART/MM of the card (or AGP). Patch titled: "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool." uses the DMA API to make those pages have a proper DMA addresses (in the situation where page_to_phys or virt_to_phys do not give use the DMA (bus) address).
Since we are using the DMA API on those pages, we should pass in the DMA address to this function so it can save it in its proper fields (later patches use it).
Signed-off-by: Konrad Rzeszutek Wilk<konrad.wilk at oracle.com>
drivers/gpu/drm/nouveau/nouveau_sgdma.c | 3 ++- drivers/gpu/drm/radeon/radeon_ttm.c | 3 ++- drivers/gpu/drm/ttm/ttm_agp_backend.c | 3 ++- drivers/gpu/drm/ttm/ttm_tt.c | 2 +- drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c | 3 ++- include/drm/ttm/ttm_bo_driver.h | 4 +++- 6 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c index 288baca..edc140a 100644 --- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c +++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c @@ -20,7 +20,8 @@ struct nouveau_sgdma_be {
static int nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
struct page **pages, struct page *dummy_read_page)
struct page **pages, struct page *dummy_read_page,
{ struct nouveau_sgdma_be *nvbe = (struct nouveau_sgdma_be *)be; struct drm_device *dev = nvbe->dev;dma_addr_t *dma_addrs)
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 01c2c73..6f156e9 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -655,7 +655,8 @@ struct radeon_ttm_backend { static int radeon_ttm_backend_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
{ struct radeon_ttm_backend *gtt;dma_addr_t *dma_addrs)
diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c index f999e36..1c4a72f 100644 --- a/drivers/gpu/drm/ttm/ttm_agp_backend.c +++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c @@ -47,7 +47,8 @@ struct ttm_agp_backend {
static int ttm_agp_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
{ struct ttm_agp_backend *agp_be = container_of(backend, struct ttm_agp_backend, backend);dma_addr_t *dma_addrs)
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 0d39001..86d5b17 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -169,7 +169,7 @@ int ttm_tt_populate(struct ttm_tt *ttm) }
be->func->populate(be, ttm->num_pages, ttm->pages,
ttm->dummy_read_page);
ttm->state = tt_unbound; return 0; }ttm->dummy_read_page, ttm->dma_address);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c index 80bc37b..87e43e0 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c @@ -102,7 +102,8 @@ struct vmw_ttm_backend {
static int vmw_ttm_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
{ struct vmw_ttm_backend *vmw_be = container_of(backend, struct vmw_ttm_backend, backend);dma_addr_t *dma_addrs)
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index 6dc4fcc..ebcd3dd 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -50,13 +50,15 @@ struct ttm_backend_func { * @pages: Array of pointers to ttm pages. * @dummy_read_page: Page to be used instead of NULL pages in the * array @pages.
* @dma_addrs: Array of DMA (bus) address of the ttm pages.
*/ int (*populate) (struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
- Populate the backend with ttm pages. Depending on the backend,
- it may or may not copy the @pages array.
struct page *dummy_read_page);
struct page *dummy_read_page,
/**dma_addr_t *dma_addrs);
- struct ttm_backend_func member clear
On Thu, Jan 27, 2011 at 10:19:55AM +0100, Thomas Hellstrom wrote:
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
Apart from previously mentioned indentation issues, Reviewed-by: Thomas Hellstrom thellstrom@shipmail.org
Thank you..
.. snip..
nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
struct page **pages, struct page *dummy_read_page)
struct page **pages, struct page *dummy_read_page,
dma_addr_t *dma_addrs)
This is weird, but when I pull this up in 'vim' it looks OK. I think it is just the emailer playing tricks? (attached a screenshot)
checkpatch.pl, which usually throws a fit when it comes to indentation, does not complain here at all.
{ struct nouveau_sgdma_be *nvbe = (struct nouveau_sgdma_be *)be; struct drm_device *dev = nvbe->dev; diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 01c2c73..6f156e9 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -655,7 +655,8 @@ struct radeon_ttm_backend { static int radeon_ttm_backend_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
dma_addr_t *dma_addrs)
Ditto. Looks just fine.
{ struct radeon_ttm_backend *gtt;
diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c index f999e36..1c4a72f 100644 --- a/drivers/gpu/drm/ttm/ttm_agp_backend.c +++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c @@ -47,7 +47,8 @@ struct ttm_agp_backend {
static int ttm_agp_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
dma_addr_t *dma_addrs)
{ struct ttm_agp_backend *agp_be = container_of(backend, struct ttm_agp_backend, backend); diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 0d39001..86d5b17 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -169,7 +169,7 @@ int ttm_tt_populate(struct ttm_tt *ttm) }
be->func->populate(be, ttm->num_pages, ttm->pages,
ttm->dummy_read_page);
ttm->state = tt_unbound; return 0;ttm->dummy_read_page, ttm->dma_address);
} diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c index 80bc37b..87e43e0 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c @@ -102,7 +102,8 @@ struct vmw_ttm_backend {
static int vmw_ttm_populate(struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
struct page *dummy_read_page)
struct page *dummy_read_page,
dma_addr_t *dma_addrs)
Grrrr... same thing - looks OK?
{ struct vmw_ttm_backend *vmw_be = container_of(backend, struct vmw_ttm_backend, backend); diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index 6dc4fcc..ebcd3dd 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -50,13 +50,15 @@ struct ttm_backend_func { * @pages: Array of pointers to ttm pages. * @dummy_read_page: Page to be used instead of NULL pages in the * array @pages.
* @dma_addrs: Array of DMA (bus) address of the ttm pages.
*/ int (*populate) (struct ttm_backend *backend, unsigned long num_pages, struct page **pages,
- Populate the backend with ttm pages. Depending on the backend,
- it may or may not copy the @pages array.
struct page *dummy_read_page);
struct page *dummy_read_page,
/**dma_addr_t *dma_addrs);
- struct ttm_backend_func member clear
If the TTM layer has used the DMA API to set up pages that have TTM_PAGE_FLAG_DMA32 set (see the patch titled "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), let's use it when programming the GART in PCIe-type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if a DMA address was passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue with our old behaviour.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/radeon/radeon.h      |    4 ++-
 drivers/gpu/drm/radeon/radeon_gart.c |   36 ++++++++++++++++++++++++---------
 drivers/gpu/drm/radeon/radeon_ttm.c  |    5 +++-
 3 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 73f600d..c9bbab9 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -317,6 +317,7 @@ struct radeon_gart { union radeon_gart_table table; struct page **pages; dma_addr_t *pages_addr; + bool *ttm_alloced; bool ready; };
@@ -329,7 +330,8 @@ void radeon_gart_fini(struct radeon_device *rdev); void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, int pages); int radeon_gart_bind(struct radeon_device *rdev, unsigned offset, - int pages, struct page **pagelist); + int pages, struct page **pagelist, + dma_addr_t *dma_addr);
/* diff --git a/drivers/gpu/drm/radeon/radeon_gart.c b/drivers/gpu/drm/radeon/radeon_gart.c index e65b903..4a5ac4b 100644 --- a/drivers/gpu/drm/radeon/radeon_gart.c +++ b/drivers/gpu/drm/radeon/radeon_gart.c @@ -149,8 +149,9 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); for (i = 0; i < pages; i++, p++) { if (rdev->gart.pages[p]) { - pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p], - PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); + if (!rdev->gart.ttm_alloced[p]) + pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p], + PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); rdev->gart.pages[p] = NULL; rdev->gart.pages_addr[p] = rdev->dummy_page.addr; page_base = rdev->gart.pages_addr[p]; @@ -165,7 +166,7 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, }
int radeon_gart_bind(struct radeon_device *rdev, unsigned offset, - int pages, struct page **pagelist) + int pages, struct page **pagelist, dma_addr_t *dma_addr) { unsigned t; unsigned p; @@ -180,15 +181,22 @@ int radeon_gart_bind(struct radeon_device *rdev, unsigned offset, p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
for (i = 0; i < pages; i++, p++) { - /* we need to support large memory configurations */ - /* assume that unbind have already been call on the range */ - rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i], + /* On TTM path, we only use the DMA API if TTM_PAGE_FLAG_DMA32 + * is requested. */ + if (dma_addr[i] != DMA_ERROR_CODE) { + rdev->gart.ttm_alloced[p] = true; + rdev->gart.pages_addr[p] = dma_addr[i]; + } else { + /* we need to support large memory configurations */ + /* assume that unbind have already been call on the range */ + rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i], 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); - if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) { - /* FIXME: failed to map page (return -ENOMEM?) */ - radeon_gart_unbind(rdev, offset, pages); - return -ENOMEM; + if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) { + /* FIXME: failed to map page (return -ENOMEM?) */ + radeon_gart_unbind(rdev, offset, pages); + return -ENOMEM; + } } rdev->gart.pages[p] = pagelist[i]; page_base = rdev->gart.pages_addr[p]; @@ -251,6 +259,12 @@ int radeon_gart_init(struct radeon_device *rdev) radeon_gart_fini(rdev); return -ENOMEM; } + rdev->gart.ttm_alloced = kzalloc(sizeof(bool) * + rdev->gart.num_cpu_pages, GFP_KERNEL); + if (rdev->gart.ttm_alloced == NULL) { + radeon_gart_fini(rdev); + return -ENOMEM; + } /* set GART entry to point to the dummy page by default */ for (i = 0; i < rdev->gart.num_cpu_pages; i++) { rdev->gart.pages_addr[i] = rdev->dummy_page.addr; @@ -267,6 +281,8 @@ void radeon_gart_fini(struct radeon_device *rdev) rdev->gart.ready = false; kfree(rdev->gart.pages); kfree(rdev->gart.pages_addr); + kfree(rdev->gart.ttm_alloced); rdev->gart.pages = NULL; rdev->gart.pages_addr = NULL; + rdev->gart.ttm_alloced = NULL; } diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 6f156e9..ca04505 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -647,6 +647,7 @@ struct radeon_ttm_backend { unsigned long num_pages; struct page **pages; struct page *dummy_read_page; + dma_addr_t *dma_addrs; bool populated; bool bound; unsigned offset; @@ -662,6 +663,7 @@ static int radeon_ttm_backend_populate(struct ttm_backend *backend,
gtt = container_of(backend, struct radeon_ttm_backend, backend); gtt->pages = pages; + gtt->dma_addrs = dma_addrs; gtt->num_pages = num_pages; gtt->dummy_read_page = dummy_read_page; gtt->populated = true; @@ -674,6 +676,7 @@ static void radeon_ttm_backend_clear(struct ttm_backend *backend)
gtt = container_of(backend, struct radeon_ttm_backend, backend); gtt->pages = NULL; + gtt->dma_addrs = NULL; gtt->num_pages = 0; gtt->dummy_read_page = NULL; gtt->populated = false; @@ -694,7 +697,7 @@ static int radeon_ttm_backend_bind(struct ttm_backend *backend, gtt->num_pages, bo_mem, backend); } r = radeon_gart_bind(gtt->rdev, gtt->offset, - gtt->num_pages, gtt->pages); + gtt->num_pages, gtt->pages, gtt->dma_addrs); if (r) { DRM_ERROR("failed to bind %lu pages at 0x%08X\n", gtt->num_pages, gtt->offset);
On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
If the TTM layer has used the DMA API to setup pages that are TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), lets use it when programming the GART in the PCIe type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if there is a DMA addresses passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue on with our old behaviour.
Hey Jerome,
I should have CC-ed you earlier but missed that, and instead just CC-ed the mailing list. I was wondering what your thoughts are about this patchset? Thomas took a look at it and he is OK with it, but more eyes never hurt.
FYI, for clarity I haven't made the old calls that got moved (due to the addition of "{") checkpatch.pl compliant. It complains about them being past 80 characters. I can naturally fix that up, but thought that it might detract from the nature of the patch.
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
drivers/gpu/drm/radeon/radeon.h | 4 ++- drivers/gpu/drm/radeon/radeon_gart.c | 36 ++++++++++++++++++++++++--------- drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++- 3 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 73f600d..c9bbab9 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -317,6 +317,7 @@ struct radeon_gart { union radeon_gart_table table; struct page **pages; dma_addr_t *pages_addr;
- bool *ttm_alloced; bool ready;
};
@@ -329,7 +330,8 @@ void radeon_gart_fini(struct radeon_device *rdev); void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, int pages); int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
int pages, struct page **pagelist);
int pages, struct page **pagelist,
dma_addr_t *dma_addr);
/* diff --git a/drivers/gpu/drm/radeon/radeon_gart.c b/drivers/gpu/drm/radeon/radeon_gart.c index e65b903..4a5ac4b 100644 --- a/drivers/gpu/drm/radeon/radeon_gart.c +++ b/drivers/gpu/drm/radeon/radeon_gart.c @@ -149,8 +149,9 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); for (i = 0; i < pages; i++, p++) { if (rdev->gart.pages[p]) {
pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
if (!rdev->gart.ttm_alloced[p])
pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); rdev->gart.pages[p] = NULL; rdev->gart.pages_addr[p] = rdev->dummy_page.addr; page_base = rdev->gart.pages_addr[p];
@@ -165,7 +166,7 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset, }
int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
int pages, struct page **pagelist)
int pages, struct page **pagelist, dma_addr_t *dma_addr)
{ unsigned t; unsigned p; @@ -180,15 +181,22 @@ int radeon_gart_bind(struct radeon_device *rdev, unsigned offset, p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
for (i = 0; i < pages; i++, p++) {
/* we need to support large memory configurations */
/* assume that unbind have already been call on the range */
rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i],
/* On TTM path, we only use the DMA API if TTM_PAGE_FLAG_DMA32
* is requested. */
if (dma_addr[i] != DMA_ERROR_CODE) {
rdev->gart.ttm_alloced[p] = true;
rdev->gart.pages_addr[p] = dma_addr[i];
} else {
/* we need to support large memory configurations */
/* assume that unbind have already been call on the range */
rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i], 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
/* FIXME: failed to map page (return -ENOMEM?) */
radeon_gart_unbind(rdev, offset, pages);
return -ENOMEM;
if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
/* FIXME: failed to map page (return -ENOMEM?) */
radeon_gart_unbind(rdev, offset, pages);
return -ENOMEM;
} rdev->gart.pages[p] = pagelist[i]; page_base = rdev->gart.pages_addr[p];}
@@ -251,6 +259,12 @@ int radeon_gart_init(struct radeon_device *rdev) radeon_gart_fini(rdev); return -ENOMEM; }
- rdev->gart.ttm_alloced = kzalloc(sizeof(bool) *
rdev->gart.num_cpu_pages, GFP_KERNEL);
- if (rdev->gart.ttm_alloced == NULL) {
radeon_gart_fini(rdev);
return -ENOMEM;
- } /* set GART entry to point to the dummy page by default */ for (i = 0; i < rdev->gart.num_cpu_pages; i++) { rdev->gart.pages_addr[i] = rdev->dummy_page.addr;
@@ -267,6 +281,8 @@ void radeon_gart_fini(struct radeon_device *rdev) rdev->gart.ready = false; kfree(rdev->gart.pages); kfree(rdev->gart.pages_addr);
- kfree(rdev->gart.ttm_alloced); rdev->gart.pages = NULL; rdev->gart.pages_addr = NULL;
- rdev->gart.ttm_alloced = NULL;
} diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 6f156e9..ca04505 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -647,6 +647,7 @@ struct radeon_ttm_backend { unsigned long num_pages; struct page **pages; struct page *dummy_read_page;
- dma_addr_t *dma_addrs; bool populated; bool bound; unsigned offset;
@@ -662,6 +663,7 @@ static int radeon_ttm_backend_populate(struct ttm_backend *backend,
gtt = container_of(backend, struct radeon_ttm_backend, backend); gtt->pages = pages;
- gtt->dma_addrs = dma_addrs; gtt->num_pages = num_pages; gtt->dummy_read_page = dummy_read_page; gtt->populated = true;
@@ -674,6 +676,7 @@ static void radeon_ttm_backend_clear(struct ttm_backend *backend)
gtt = container_of(backend, struct radeon_ttm_backend, backend); gtt->pages = NULL;
- gtt->dma_addrs = NULL; gtt->num_pages = 0; gtt->dummy_read_page = NULL; gtt->populated = false;
@@ -694,7 +697,7 @@ static int radeon_ttm_backend_bind(struct ttm_backend *backend, gtt->num_pages, bo_mem, backend); } r = radeon_gart_bind(gtt->rdev, gtt->offset,
gtt->num_pages, gtt->pages);
if (r) { DRM_ERROR("failed to bind %lu pages at 0x%08X\n", gtt->num_pages, gtt->offset);gtt->num_pages, gtt->pages, gtt->dma_addrs);
-- 1.7.1
On Thu, Jan 27, 2011 at 4:20 PM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
If the TTM layer has used the DMA API to setup pages that are TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), lets use it when programming the GART in the PCIe type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if there is a DMA addresses passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue on with our old behaviour.
Hey Jerome,
I should have CC-ed you earlier but missed that and instead just CC-ed the mailing list. I was wondering what your thoughts are about this patchset? Thomas took a look at the patchset and he is OK but more eyes never hurt.
FYI, for clarity I hadn't made the old calls that got moved due to adding of "{" checkpatch.pl compliant. It complains about it being past 80 characters. I can naturally fix that up, but thought that it might detract from the nature of the patch.
What happens when we don't ask TTM for a DMA32 allocation? Will this still work in a virtualized environment?
The code looks good, but GART stuff can be picky; I will try to do full-scale testing in the next couple of weeks.
Cheers, Jerome
On Fri, Jan 28, 2011 at 09:42:48AM -0500, Jerome Glisse wrote:
On Thu, Jan 27, 2011 at 4:20 PM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
If the TTM layer has used the DMA API to setup pages that are TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), lets use it when programming the GART in the PCIe type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if there is a DMA addresses passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue on with our old behaviour.
Hey Jerome,
I should have CC-ed you earlier but missed that and instead just CC-ed the mailing list. I was wondering what your thoughts are about this patchset? Thomas took a look at the patchset and he is OK but more eyes never hurt.
FYI, for clarity I hadn't made the old calls that got moved due to adding of "{" checkpatch.pl compliant. It complains about it being past 80 characters. I can naturally fix that up, but thought that it might detract from the nature of the patch.
What happens when we don't ask TTM for a DMA32 allocation? Will this still work in a virtualized environment?
You mean if we don't pass in TTM_PAGE_FLAG_DMA32? It will go back to using the old mechanism, which is to allocate pages using alloc_page(GFP_HIGHUSER). That should be OK on bare-metal or in a virtualized environment, as long as the driver uses the page for non-DMA type things, or the graphics card is 64-bit capable, at which point it has no trouble reaching it.
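To spell that out, here is a condensed, hypothetical restatement of the allocation dispatch introduced by the ttm_page_alloc.c change earlier in this thread (this is not code from the patches; the header providing TTM_PAGE_FLAG_DMA32 is assumed):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <ttm/ttm_bo_driver.h>	/* TTM_PAGE_FLAG_DMA32 (location assumed) */

/* Simplified helper, for illustration only (not in the patches). */
static struct page *ttm_alloc_one_page(int flags, gfp_t gfp_flags,
				       dma_addr_t *dma_slot)
{
	if ((flags & TTM_PAGE_FLAG_DMA32) && dma_slot) {
		/* DMA32 requested: the DMA API picks a page below the
		 * mask and records the matching bus address. */
		void *addr = dma_alloc_coherent(NULL, PAGE_SIZE, dma_slot,
						gfp_flags);
		return addr ? virt_to_page(addr) : NULL;
	}
	/* Old behaviour: plain (possibly highmem) page. */
	return alloc_page(gfp_flags);
}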
Code looks good but GART stuff can be picky, i will try to do a full scale testing in next couple week.
Thank you! Much appreciated. Would it be OK if I pinged you in two weeks' time, just in case?
On Wed, Feb 16, 2011 at 10:54 AM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
Code looks good but GART stuff can be picky, i will try to do a full scale testing in next couple week.
ping?
Sorry, I haven't had time to do one. But it looks good to me.
Cheers, Jerome
If the TTM layer has used the DMA API to set up pages that have TTM_PAGE_FLAG_DMA32 set (see the patch titled "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), let's use it when programming the GART in PCIe-type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if a DMA address was passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue with our old behaviour.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |   28 +++++++++++++++++++++-------
 1 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
index edc140a..bbdd982 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
@@ -12,6 +12,7 @@ struct nouveau_sgdma_be {
 	struct drm_device *dev;

 	dma_addr_t *pages;
+	bool *ttm_alloced;
 	unsigned nr_pages;

 	unsigned pte_start;
@@ -35,15 +36,25 @@ nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
 	if (!nvbe->pages)
 		return -ENOMEM;

+	nvbe->ttm_alloced = kmalloc(sizeof(bool) * num_pages, GFP_KERNEL);
+	if (!nvbe->ttm_alloced)
+		return -ENOMEM;
+
 	nvbe->nr_pages = 0;
 	while (num_pages--) {
-		nvbe->pages[nvbe->nr_pages] =
-			pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
+		if (dma_addrs[nvbe->nr_pages] != DMA_ERROR_CODE) {
+			nvbe->pages[nvbe->nr_pages] =
+					dma_addrs[nvbe->nr_pages];
+			nvbe->ttm_alloced[nvbe->nr_pages] = true;
+		} else {
+			nvbe->pages[nvbe->nr_pages] =
+				pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
 				     PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-		if (pci_dma_mapping_error(dev->pdev,
-					  nvbe->pages[nvbe->nr_pages])) {
-			be->func->clear(be);
-			return -EFAULT;
+			if (pci_dma_mapping_error(dev->pdev,
+						  nvbe->pages[nvbe->nr_pages])) {
+				be->func->clear(be);
+				return -EFAULT;
+			}
 		}

 		nvbe->nr_pages++;
@@ -66,11 +77,14 @@ nouveau_sgdma_clear(struct ttm_backend *be)
 			be->func->unbind(be);

 		while (nvbe->nr_pages--) {
-			pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
+			if (!nvbe->ttm_alloced[nvbe->nr_pages])
+				pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
 				       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 		}
 		kfree(nvbe->pages);
+		kfree(nvbe->ttm_alloced);
 		nvbe->pages = NULL;
+		nvbe->ttm_alloced = NULL;
 		nvbe->nr_pages = 0;
 	}
 }
On Fri, Jan 07, 2011 at 12:11:44PM -0500, Konrad Rzeszutek Wilk wrote:
If the TTM layer has used the DMA API to setup pages that are TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t array for pages that are to in DMA32 pool."), lets use it when programming the GART in the PCIe type cards.
This patch skips doing the pci_map_page (and pci_unmap_page) if there is a DMA addresses passed in for that page. If the dma_address is zero (or DMA_ERROR_CODE), then we continue on with our old behaviour.
Hey Ben and Jerome,
I should have CC-ed you guys earlier but missed that and instead just CC-ed the mailing list. I was wondering what your thoughts are about this patchset? Thomas took a look at the patchset and he is OK but more eyes never hurt.
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
drivers/gpu/drm/nouveau/nouveau_sgdma.c | 28 +++++++++++++++++++++------- 1 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c index edc140a..bbdd982 100644 --- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c +++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c @@ -12,6 +12,7 @@ struct nouveau_sgdma_be { struct drm_device *dev;
dma_addr_t *pages;
bool *ttm_alloced; unsigned nr_pages;
unsigned pte_start;
@@ -35,15 +36,25 @@ nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages, if (!nvbe->pages) return -ENOMEM;
- nvbe->ttm_alloced = kmalloc(sizeof(bool) * num_pages, GFP_KERNEL);
- if (!nvbe->ttm_alloced)
return -ENOMEM;
- nvbe->nr_pages = 0; while (num_pages--) {
nvbe->pages[nvbe->nr_pages] =
pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
if (dma_addrs[nvbe->nr_pages] != DMA_ERROR_CODE) {
nvbe->pages[nvbe->nr_pages] =
dma_addrs[nvbe->nr_pages];
nvbe->ttm_alloced[nvbe->nr_pages] = true;
} else {
nvbe->pages[nvbe->nr_pages] =
pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
if (pci_dma_mapping_error(dev->pdev,
nvbe->pages[nvbe->nr_pages])) {
be->func->clear(be);
return -EFAULT;
if (pci_dma_mapping_error(dev->pdev,
nvbe->pages[nvbe->nr_pages])) {
be->func->clear(be);
return -EFAULT;
}
}
nvbe->nr_pages++;
@@ -66,11 +77,14 @@ nouveau_sgdma_clear(struct ttm_backend *be) be->func->unbind(be);
while (nvbe->nr_pages--) {
pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
if (!nvbe->ttm_alloced[nvbe->nr_pages])
} kfree(nvbe->pages);pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages], PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
nvbe->pages = NULL;kfree(nvbe->ttm_alloced);
nvbe->nr_pages = 0; }nvbe->ttm_alloced = NULL;
}
1.7.1
On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
Attached is a set of patches that make it possible for drivers using TTM API (nouveau and radeon graphic drivers) to work under Xen. The explanation is a bit complex and I am not sure if I am explaining it that well..so if something is unclear please do ping me.
Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
- Cleaned up commit message (forgot to add my SoB).
v1 was: Tested-by: Ian Campbell <ian.campbell@citrix.com> (actually I tested a backport to the 2.6.32.x Debian Squeeze kernel, which uses 2.6.33.x-era DRM code; others have also reported success with this backport in http://bugs.debian.org/602418)
Ian.
Hi, Konrad!
Just back from vacation. I'll try to review this patch set in the upcoming week!
Thanks, Thomas
Konrad,
Before looking further into the patch series, I need to make sure I've completely understood the problem and why you've chosen this solution: Please see inline.
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
Attached is a set of patches that make it possible for drivers using TTM API (nouveau and radeon graphic drivers) to work under Xen. The explanation is a bit complex and I am not sure if I am explaining it that well..so if something is unclear please do ping me.
Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
- Cleaned up commit message (forgot to add my SoB).
Short explanation of problem: What we are hitting under Xen is that instead of programming the GART with the physical DMA address of the TTM page, we end up programming the bounce buffer DMA (bus) address!
Long explanation: The reason we end up doing this is that:
1). alloc_page with GFP_DMA32 does not allocate "real" (Machine Frame Numbers - MFN) under the 4GB in Xen. That is b/c if actually made the pages underneath 4GB available to the the Linux page allocator we would not be able to give those to other guest devices. This would mean if we tried to pass in a USB device to one guest and in another were running the Xorg server we wouldn't be able to do so as we would run out of pages under 4GB. So pages that we get from alloc_page have a PFN that is under 4GB but in reality the real physical address (MFN) is above 4GB. Ugh..
2). The backends for "struct ttm_backend_func" utilize the PCI API. When they get a page allocated via alloc_page, the use 'pci_map_page' and program the DMA (bus) address in the GART - which is correct. But then the calls that kick off the graphic driver to process the pages do not use the pci_page_sync_* calls. If the physical address of the page is the same as the DMA bus address returned from pci_map_page then there are no trouble. But if they are different: virt_to_phys(page_address(p)) != pci_map_page(p,..) then the graphic card fetches data from the DMA (bus) address (so the value returned from pci_map_page). The data however that the user wrote to (the page p) ends up being untouched. You are probably saying: "What? Nonsense, we stitch the page in ttm_bo_vm_fault using the PFN and .. and even if the virt_to_phys(page_address(p)) != pci_map_page(p) the GART ends up with the bus (DMA) address of the PFN!" That is true. But if you combine this with 1) where you end up with page that is above the dma_mask (even if you called it with GFP_DMA32) and then make a call on pci_map_page you would end up with a bounce buffer!
The problem above can be easily reproduced on bare-metal if you pass in "swiotlb=force iommu=soft".
At a first glance, this would seem to be a driver error since the drivers are not calling pci_page_sync(), however I understand that the TTM infrastructure and desire to avoid bounce buffers add more implications to this...
There are two ways of fixing this:
1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is struct pcidev present), instead of alloc_page for GFP_DMA32. The 'dma_alloc_coherent' guarantees that the allocated page fits within the device dma_mask (or uses the default DMA32 if no device is passed in). This also guarantees that any subsequent call to the PCI API for this page will return the same DMA (bus) address as the first call (so pci_alloc_consistent, and then pci_map_page will give the same DMA bus address).
I guess dma_alloc_coherent() will allocate *real* DMA32 pages? That brings up a few questions:
1) Is it possible to change caching policy on pages allocated using dma_alloc_coherent?
2) What about accounting? In a *non-Xen* environment, will the number of coherent pages be less than the number of DMA32 pages, or will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
3) Same as above, but in a Xen environment: what will stop multiple guests from exhausting the coherent pages? It seems that the TTM accounting mechanisms will no longer be valid unless the number of available coherent pages is split across the guests?
2). Use the pci_sync_range_* after sending a page to the graphics engine. If the bounce buffer is used then we end up copying the pages.
Is the reason for choosing 1) instead of 2) purely a performance concern?
Finally, I wanted to ask why we need to pass / store the dma address of the TTM pages? Isn't it possible to just call into the DMA / PCI api to obtain it, and the coherent allocation will make sure it doesn't change?
Thanks, Thomas
On Mon, Jan 10, 2011 at 03:25:55PM +0100, Thomas Hellstrom wrote:
Konrad,
Before looking further into the patch series, I need to make sure I've completely understood the problem and why you've chosen this solution: Please see inline.
Of course.
.. snip ..
The problem above can be easily reproduced on bare-metal if you pass in "swiotlb=force iommu=soft".
At a first glance, this would seem to be a driver error since the drivers are not calling pci_page_sync(), however I understand that the TTM infrastructure and desire to avoid bounce buffers add more implications to this...
<nods>
There are two ways of fixing this:
1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is struct pcidev present), instead of alloc_page for GFP_DMA32. The 'dma_alloc_coherent' guarantees that the allocated page fits within the device dma_mask (or uses the default DMA32 if no device is passed in). This also guarantees that any subsequent call to the PCI API for this page will return the same DMA (bus) address as the first call (so pci_alloc_consistent, and then pci_map_page will give the same DMA bus address).
I guess dma_alloc_coherent() will allocate *real* DMA32 pages? that brings up a couple of questions:
- Is it possible to change caching policy on pages allocated using
dma_alloc_coherent?
Yes. They are the same "form-factor" as any normal page, except that the IOMMU makes extra efforts to set this page up.
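A hedged sketch of what changing the caching attributes could look like on x86, assuming - as described above - that the coherent allocation comes from the normal page allocator and lives in the kernel linear map (the helper names here are made up; set_memory_wc()/set_memory_wb() are the existing x86 primitives):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>

/* Illustrative only: mark one coherent page write-combined, and restore
 * write-back caching before handing it back to dma_free_coherent(). */
static int example_set_page_wc(void *vaddr)
{
	return set_memory_wc((unsigned long)vaddr, 1);
}

static void example_restore_and_free(struct device *dev, void *vaddr,
				     dma_addr_t dma_addr)
{
	set_memory_wb((unsigned long)vaddr, 1);
	dma_free_coherent(dev, PAGE_SIZE, vaddr, dma_addr);
}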
- What about accounting? In a *non-Xen* environment, will the
number of coherent pages be less than the number of DMA32 pages, or will dma_alloc_coherent just translate into a alloc_page(GFP_DMA32)?
The code in the IOMMUs ends up calling __get_free_pages, which ends up in alloc_pages. So the call does end up in alloc_page(flags).
native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32
For the hardware IOMMUs:
AMD-Vi: if it is in passthrough mode, it calls it with GFP_DMA32. If it is in DMA translation mode (the normal mode) it allocates a page with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately translates the bus address.
The flags change a bit for VT-d: if there is no identity mapping and the PCI device is not one of the special ones (GFX, Azalia), then it will pass it with GFP_DMA32. If it is in identity mapping state, and the device is a GFX or Azalia sound card, then it will mask off ~(__GFP_DMA | GFP_DMA32) and immediately translate the bus address.
However, the interesting thing is that I've passed in 'NULL' as the struct device (not intentionally - I did not want to add more changes to the API) so all of the IOMMUs end up doing GFP_DMA32.
But it does mess up the accounting with AMD-Vi and VT-d as they strip the __GFP_DMA32 flag off. That is a big problem, I presume?
- Same as above, but in a Xen environment, what will stop multiple
guests to exhaust the coherent pages? It seems that the TTM accounting mechanisms will no longer be valid unless the number of available coherent pages are split across the guests?
Say I pass in four ATI Radeon cards (each a 32-bit card) to four guests. Let's also assume that we are doing heavy operations in all of the guests. Since there is no communication between the TTM accounting in each guest, you could end up eating all of the 4GB physical memory that is available to each guest. It could end up that the first guest gets the lion's share of the 4GB memory, while the other ones get much less.
And if one was to do that on baremetal, with four ATI Radeon cards, the TTM accounting mechanism would realize it is nearing the watermark and do.. something, right? What would it do actually?
I think the error path would be the same in both cases?
2). Use the pci_sync_range_* after sending a page to the graphics engine. If the bounce buffer is used then we end up copying the pages.
Is the reason for choosing 1) instead of 2) purely a performance concern?
Yes, and also not understanding where I should insert the pci_sync_range calls in the drivers.
Finally, I wanted to ask why we need to pass / store the dma address of the TTM pages? Isn't it possible to just call into the DMA / PCI api to obtain it, and the coherent allocation will make sure it doesn't change?
It won't change, but you need the dma address during de-allocation: dma_free_coherent..
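In other words, the CPU address and the bus address have to be squirreled away together; a minimal sketch (struct and function names are made up for illustration):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Made-up bookkeeping: both addresses have to be kept around, because
 * dma_free_coherent() wants them back at free time. */
struct example_dma_page {
	struct page	*page;		/* what TTM/the driver works with */
	void		*vaddr;		/* CPU address from dma_alloc_coherent */
	dma_addr_t	dma_addr;	/* bus address, needed again at free time */
};

static void example_free_dma_page(struct device *dev,
				  struct example_dma_page *p)
{
	dma_free_coherent(dev, PAGE_SIZE, p->vaddr, p->dma_addr);
}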
On 01/10/2011 04:21 PM, Konrad Rzeszutek Wilk wrote:
On Mon, Jan 10, 2011 at 03:25:55PM +0100, Thomas Hellstrom wrote:
Konrad,
Before looking further into the patch series, I need to make sure I've completely understood the problem and why you've chosen this solution: Please see inline.
Of course.
.. snip ..
The problem above can be easily reproduced on bare-metal if you pass in "swiotlb=force iommu=soft".
At a first glance, this would seem to be a driver error since the drivers are not calling pci_page_sync(), however I understand that the TTM infrastructure and desire to avoid bounce buffers add more implications to this...
<nods>
There are two ways of fixing this:
1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is struct pcidev present), instead of alloc_page for GFP_DMA32. The 'dma_alloc_coherent' guarantees that the allocated page fits within the device dma_mask (or uses the default DMA32 if no device is passed in). This also guarantees that any subsequent call to the PCI API for this page will return the same DMA (bus) address as the first call (so pci_alloc_consistent, and then pci_map_page will give the same DMA bus address).
I guess dma_alloc_coherent() will allocate *real* DMA32 pages? that brings up a couple of questions:
- Is it possible to change caching policy on pages allocated using
dma_alloc_coherent?
Yes. They are the same "form-factor" as any normal page, except that the IOMMU makes extra efforts to set this page up.
- What about accounting? In a *non-Xen* environment, will the
number of coherent pages be less than the number of DMA32 pages, or will dma_alloc_coherent just translate into a alloc_page(GFP_DMA32)?
The code in the IOMMUs ends up calling __get_free_pages, which ends up in alloc_pages. So the call does end up in alloc_page(flags).
native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32
For the hardware IOMMUs:
AMD-Vi: if it is in passthrough mode, it calls it with GFP_DMA32. If it is in DMA translation mode (the normal mode) it allocates a page with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately translates the bus address.
The flags change a bit for VT-d: if there is no identity mapping and the PCI device is not one of the special ones (GFX, Azalia), then it will pass it with GFP_DMA32. If it is in identity mapping state, and the device is a GFX or Azalia sound card, then it will mask off ~(__GFP_DMA | GFP_DMA32) and immediately translate the bus address.
However, the interesting thing is that I've passed in 'NULL' as the struct device (not intentionally - I did not want to add more changes to the API) so all of the IOMMUs end up doing GFP_DMA32.
But it does mess up the accounting with AMD-Vi and VT-d as they strip the __GFP_DMA32 flag off. That is a big problem, I presume?
Actually, I don't think it's a big problem. TTM allows a small discrepancy between allocated pages and accounted pages to be able to account for the actual allocation result. IIRC, this means that a DMA32 page will always be accounted as such, or at least we can make it behave that way. As long as the device can always handle the page, we should be fine.
- Same as above, but in a Xen environment, what will stop multiple
guests to exhaust the coherent pages? It seems that the TTM accounting mechanisms will no longer be valid unless the number of available coherent pages are split across the guests?
Say I pass in four ATI Radeon cards (each a 32-bit card) to four guests. Let's also assume that we are doing heavy operations in all of the guests. Since there is no communication between the TTM accounting in each guest, you could end up eating all of the 4GB physical memory that is available to each guest. It could end up that the first guest gets the lion's share of the 4GB memory, while the other ones get much less.
And if one was to do that on baremetal, with four ATI Radeon cards, the TTM accounting mechanism would realize it is nearing the watermark and do.. something, right? What would it do actually?
I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to exhaust all GFP_KERNEL memory. Then any application or kernel task might fail with an OOM, so TTM doesn't really allow that to happen *). Within a Xen guest OS using this patch that won't happen either, but TTM itself may receive unexpected allocation failures, since the amount of GFP_DMA32 memory TTM thinks is available is larger than actually available. It is possible to trigger such allocation failures on bare metal as well, but they'd be much less likely. Those errors should result in application OOM errors with a possible application crash. Anyway it's possible to adjust TTM's memory limits using sysfs (even on the fly) so any advanced user should be able to do that.
What *might* be possible, however, is that the GFP_KERNEL memory on the host gets exhausted due to extensive TTM allocations in the guest, but I guess that's a problem for XEN to resolve, not TTM.
2). Use the pci_sync_range_* after sending a page to the graphics engine. If the bounce buffer is used then we end up copying the pages.
Is the reason for choosing 1) instead of 2) purely a performance concern?
Yes, and also not understanding where I should insert the pci_sync_range calls in the drivers.
Finally, I wanted to ask why we need to pass / store the dma address of the TTM pages? Isn't it possible to just call into the DMA / PCI api to obtain it, and the coherent allocation will make sure it doesn't change?
It won't change, but you need the dma address during de-allocation: dma_free_coherent..
Isn't there a quick way to determine the DMA address from the struct page pointer, or would that require an explicit dma_map() operation?
/Thomas
*) I think gem's flink still is vulnerable to this, though, so it affects Nvidia and Radeon.
. snip ..
- What about accounting? In a *non-Xen* environment, will the
number of coherent pages be less than the number of DMA32 pages, or will dma_alloc_coherent just translate into a alloc_page(GFP_DMA32)?
The code in the IOMMUs ends up calling __get_free_pages, which ends up in alloc_pages. So the call does end up in alloc_page(flags).
native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32
For the hardware IOMMUs:
AMD-Vi: if it is in passthrough mode, it calls it with GFP_DMA32. If it is in DMA translation mode (the normal mode) it allocates a page with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately translates the bus address.
The flags change a bit for VT-d: if there is no identity mapping and the PCI device is not one of the special ones (GFX, Azalia), then it will pass it with GFP_DMA32. If it is in identity mapping state, and the device is a GFX or Azalia sound card, then it will mask off ~(__GFP_DMA | GFP_DMA32) and immediately translate the bus address.
However, the interesting thing is that I've passed in 'NULL' as the struct device (not intentionally - I did not want to add more changes to the API) so all of the IOMMUs end up doing GFP_DMA32.
But it does mess up the accounting with AMD-Vi and VT-d as they strip the __GFP_DMA32 flag off. That is a big problem, I presume?
Actually, I don't think it's a big problem. TTM allows a small discrepancy between allocated pages and accounted pages to be able to account on actual allocation result. IIRC, This means that a DMA32 page will always be accounted as such, or at least we can make it behave that way. As long as the device can always handle the page, we should be fine.
Excellent.
- Same as above, but in a Xen environment, what will stop multiple
guests to exhaust the coherent pages? It seems that the TTM accounting mechanisms will no longer be valid unless the number of available coherent pages are split across the guests?
Say I pass in four ATI Radeon cards (each a 32-bit card) to four guests. Let's also assume that we are doing heavy operations in all of the guests. Since there is no communication between the TTM accounting in each guest, you could end up eating all of the 4GB physical memory that is available to each guest. It could end up that the first guest gets the lion's share of the 4GB memory, while the other ones get much less.
And if one was to do that on baremetal, with four ATI Radeon cards, the TTM accounting mechanism would realize it is nearing the watermark and do.. something, right? What would it do actually?
I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to exhaust all GFP_KERNEL memory. Then any application or kernel task
Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then this should be OK?
might fail with an OOM, so TTM doesn't really allow that to happen *). Within a Xen guest OS using this patch that won't happen either, but TTM itself may receive unexpected allocation failures, since the amount of GFP_DMA32 memory TTM thinks is available is larger than actually available.
Ooooh, perfect opportunity to test the error paths then :-)
It is possible to trigger such allocation failures on bare metal as well, but they'd be much less likely. Those errors should result in application OOM errors with a possible application crash. Anyway it's possible to adjust TTM's memory limits using sysfs (even on the fly) so any advanced user should be able to do that.
What *might* be possible, however, is that the GFP_KERNEL memory on the host gets exhausted due to extensive TTM allocations in the guest, but I guess that's a problem for XEN to resolve, not TTM.
Hmm. I think I am missing something here. The GFP_KERNEL is any memory and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start using the PCI-API, what happens underneath (so under Linux) is that "real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark get swizzled in for the guest's PFNs (this is for the PCI devices that have the dma_mask set to 32bit). However, that is a Xen MMU accounting issue.
The GFP_KERNEL memory on the other hand does not get the same treatment, so whichever MFNs were allocated for that memory are still the same.
The amount of memory in the guest remains the same throughout the treatment that the PCI API applies when running under Xen. The PFNs, the zones, etc. are all the same. It is just that when you program the PTEs or pass the (DMA) bus address to the devices, the numbers are different from what a 'virt_to_phys' call would return.
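A quick, illustrative way to observe that difference (on bare metal with "swiotlb=force iommu=soft", or under Xen) is to compare what virt_to_phys() reports against what the PCI API hands back - the helper below is a debug sketch, not part of the series:

#include <linux/pci.h>
#include <linux/mm.h>
#include <asm/io.h>

/* Debug sketch: if the bus address differs from the CPU-visible physical
 * address, the page was bounce-buffered (or remapped), and a GART entry
 * derived from virt_to_phys() would point at the wrong data. */
static void example_check_mapping(struct pci_dev *pdev, struct page *p)
{
	dma_addr_t bus = pci_map_page(pdev, p, 0, PAGE_SIZE,
				      PCI_DMA_BIDIRECTIONAL);

	if (bus != virt_to_phys(page_address(p)))
		dev_warn(&pdev->dev, "phys %llx != bus %llx (bounce/remap)\n",
			 (unsigned long long)virt_to_phys(page_address(p)),
			 (unsigned long long)bus);

	pci_unmap_page(pdev, bus, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
}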
.. snip..
Finally, I wanted to ask why we need to pass / store the dma address of the TTM pages? Isn't it possible to just call into the DMA / PCI api to obtain it, and the coherent allocation will make sure it doesn't change?
It won't change, but you need the dma address during de-allocation: dma_free_coherent..
Isn't there a quick way to determine the DMA address from the struct
Sadly no. You need to squirrel it away.
page pointer, or would that require an explicit dma_map() operation?
<nods> The DMA API only offers two ways to get the (DMA) bus address.
The first is 'dma_alloc_coherent' (or pci_alloc_consistent) and the other is 'dma_map_page' (or pci_map_page). Both calls return the DMA address, and there is no "translate this virtual address to a DMA address, please" API call.
One way to potentially not carry this dma address around is in the de-alloc path do this:
  dma_addr_t _d = pci_map_page(...);
  pci_unmap_page(...);
  dma_free_coherent(_d, ...);
which would be one way of avoiding carrying around the dma_addr_t array.
/Thomas
*) I think gem's flink still is vulnerable to this, though, so it
Is there a good test-case for this?
affects Nvidia and Radeon.
On 01/10/2011 05:45 PM, Konrad Rzeszutek Wilk wrote:
. snip ..
- What about accounting? In a *non-Xen* environment, will the
number of coherent pages be less than the number of DMA32 pages, or will dma_alloc_coherent just translate into a alloc_page(GFP_DMA32)?
The code in the IOMMUs ends up calling __get_free_pages, which ends up in alloc_pages. So the call does end up in alloc_page(flags).
native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32
For the hardware IOMMUs:
AMD-Vi: if it is in passthrough mode, it calls it with GFP_DMA32. If it is in DMA translation mode (the normal mode) it allocates a page with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately translates the bus address.
The flags change a bit for VT-d: if there is no identity mapping and the PCI device is not one of the special ones (GFX, Azalia), then it will pass it with GFP_DMA32. If it is in identity mapping state, and the device is a GFX or Azalia sound card, then it will mask off ~(__GFP_DMA | GFP_DMA32) and immediately translate the bus address.
However, the interesting thing is that I've passed in 'NULL' as the struct device (not intentionally - I did not want to add more changes to the API) so all of the IOMMUs end up doing GFP_DMA32.
But it does mess up the accounting with AMD-Vi and VT-d as they strip the __GFP_DMA32 flag off. That is a big problem, I presume?
Actually, I don't think it's a big problem. TTM allows a small discrepancy between allocated pages and accounted pages to be able to account on actual allocation result. IIRC, This means that a DMA32 page will always be accounted as such, or at least we can make it behave that way. As long as the device can always handle the page, we should be fine.
Excellent.
- Same as above, but in a Xen environment, what will stop multiple
guests to exhaust the coherent pages? It seems that the TTM accounting mechanisms will no longer be valid unless the number of available coherent pages are split across the guests?
Say I pass in four ATI Radeon cards (each a 32-bit card) to four guests. Let's also assume that we are doing heavy operations in all of the guests. Since there is no communication between the TTM accounting in each guest, you could end up eating all of the 4GB physical memory that is available to each guest. It could end up that the first guest gets the lion's share of the 4GB memory, while the other ones get much less.
And if one was to do that on baremetal, with four ATI Radeon cards, the TTM accounting mechanism would realize it is nearing the watermark and do.. something, right? What would it do actually?
I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to exhaust all GFP_KERNEL memory. Then any application or kernel task
Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then this should be OK?
No. Unless I'm missing something, on a machine with 4GB or less, GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?
What *might* be possible, however, is that the GFP_KERNEL memory on the host gets exhausted due to extensive TTM allocations in the guest, but I guess that's a problem for XEN to resolve, not TTM.
Hmm. I think I am missing something here. The GFP_KERNEL is any memory and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start using the PCI-API, what happens underneath (so under Linux) is that "real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark get swizzled in for the guest's PFNs (this is for the PCI devices that have the dma_mask set to 32bit). However, that is a Xen MMU accounting issue.
So I was under the impression that when you allocate coherent memory in the guest, the physical page comes from DMA32 memory in the host. On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
/Thomas
*) I think gem's flink still is vulnerable to this, though, so it
Is there a good test-case for this?
Not one in code. What you can do (for example in an OpenGL app) is to write some code that tries to flink with a guessed bo name until it succeeds. Then repeatedly, from within the app, try to flink the same name until something crashes. I don't think the Linux OOM killer can handle that situation. Should be fairly easy to put together.
/Thomas
. snip ..
I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to exhaust all GFP_KERNEL memory. Then any application or kernel task
Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then this should be OK?
No. Unless I'm missing something, on a machine with 4GB or less, GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?
Yes. Depending on the E820 and where the PCI hole is present. More details below.
What *might* be possible, however, is that the GFP_KERNEL memory on the host gets exhausted due to extensive TTM allocations in the guest, but I guess that's a problem for XEN to resolve, not TTM.
Hmm. I think I am missing something here. The GFP_KERNEL is any memory and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start using the PCI-API, what happens underneath (so under Linux) is that "real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark get swizzled in for the guest's PFNs (this is for the PCI devices that have the dma_mask set to 32bit). However, that is a Xen MMU accounting issue.
So I was under the impression that when you allocate coherent memory in the guest, the physical page comes from DMA32 memory in the host.
No. It comes from the DMA32 zone of the hypervisor pool. Say you have a machine with 24GB; the first guest (Dom0) could allocate memory from 20->24GB (so only 4GB allocated to it). It will then also fetch 64MB from the DMA32 zone for the SWIOTLB. Then the next guest, say a 4GB one (it gets 16GB->20GB), also gets 64MB from DMA32. And so on.
So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from 0->4GB. When you start allocating coherent memory from each guest (and yeah, say we use 2GB each), we end up with the first guest getting the 2GB, the second getting 1.7GB, and then the next two getting zilch.
You still have GFP_KERNEL memory in each guest - the first one has 2GB left, the second 2.3GB, and the next two have 4GB each.
From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so is the 8GB->24GB range, but it still has 4GB->8GB free - so it can launch one more guest (but without PCI passthrough devices).
On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
So 'host' in this case refers to the hypervisor, and it does not care about DMA at all - it does not have any device drivers(*) or such. The first guest (dom0) is the one that deals with the device drivers.
*: It has one: the serial port, but that is not really that important for this discussion.
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device, right? (We are talking about the discrete card GART, not the poor man's AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
However, the PCI space with its BARs eats into the 4GB space, so if you have a 1GB region from 0xC0000->0x100000, then you only have 3GB left in the DMA32 zone.
If I think of this as an accounting exercise, and if the PCI space goes further down (say 0x40000, so 2GB->4GB is an E820 gap, and 0GB->2GB is System RAM with 4GB->6GB being the other System RAM, for a cumulative total of 4GB of memory in the machine), we would only have 2GB of DMA32 zone (the GFP_KERNEL zone is 4GB, while the GFP_DMA32 zone is 2GB).
Then the answer is yes. However, wouldn't such a device be 64-bit? And if it is 64-bit, then the TTM API wouldn't bother to allocate pages from the 32-bit region, right?
/Thomas
*) I think gem's flink still is vulnerable to this, though, so it
Is there a good test-case for this?
Not one in code. What you can do (for example in an OpenGL app) is to write some code that tries to flink with a guessed bo name until it succeeds. Then repeatedly, from within the app, try to flink the same name until something crashes. I don't think the Linux OOM killer can handle that situation. Should be fairly easy to put together.
Uhhh, OK, you just flew over what I know about graphics. Let me research this a bit more.
/Thomas
On Tue, Jan 11, 2011 at 10:55 AM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
. snip ..
I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to exhaust all GFP_KERNEL memory. Then any application or kernel task
Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then this should be OK?
No. Unless I'm missing something, on a machine with 4GB or less, GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?
Yes. Depending on the E820 and where the PCI hole is present. More details below.
What *might* be possible, however, is that the GFP_KERNEL memory on the host gets exhausted due to extensive TTM allocations in the guest, but I guess that's a problem for XEN to resolve, not TTM.
Hmm. I think I am missing something here. The GFP_KERNEL is any memory and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start using the PCI-API, what happens underneath (so under Linux) is that "real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark get swizzled in for the guest's PFNs (this is for the PCI devices that have the dma_mask set to 32bit). However, that is a Xen MMU accounting issue.
So I was under the impression that when you allocate coherent memory in the guest, the physical page comes from DMA32 memory in the host.
No. It comes from the DMA32 zone of the hypervisor pool. Say you have a machine with 24GB; the first guest (Dom0) could allocate memory from 20->24GB (so only 4GB allocated to it). It will then also fetch 64MB from the DMA32 zone for the SWIOTLB. Then the next guest, say a 4GB one (it gets 16GB->20GB), also gets 64MB from DMA32. And so on.
So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from 0->4GB. When you start allocating coherent memory from each guest (and yeah, say we use 2GB each), we end up with the first guest getting the 2GB, the second getting 1.7GB, and then the next two getting zilch.
You still have GFP_KERNEL memory in each guest - the first one has 2GB left, the second 2.3GB, and the next two have 4GB each.
From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so is the 8GB->24GB range, but it still has 4GB->8GB free - so it can launch one more guest (but without PCI passthrough devices).
On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
So host in this case refers to the Hypervisor and it does not care about the DMA or what - it does not have any device drivers(*) or such. The first guest (dom0) is the one that deals with the device drivers.
*: It has one: the serial port, but that is not really that important for this discussion.
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device right? (We are talking about the discrete card GART, not the poor man AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
GART is not a PCI BAR; it's just a remapper for system pages. On radeon GPUs at least there is a memory controller with 3 programmable apertures: vram, internal gart, and agp gart. You can map these resources wherever you want in the GPU's address space and then the memory controller takes care of the translation to off-board resources like gart pages. On-chip memory clients (display controllers, texture blocks, render blocks, etc.) write to internal GPU addresses. The GPU has its own direct connection to vram, so that's not an issue. For AGP, the GPU specifies aperture base and size, and you point it to the bus address of the gart aperture provided by the northbridge's AGP controller. For internal gart, the GPU has a page table stored in either vram or uncached system memory depending on the asic. It provides a contiguous linear aperture to GPU clients and the memory controller translates the transactions to the backing pages via the pagetable.
Alex
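To tie this back to the patch set: the radeon/nouveau changes are essentially about which bus address gets written into that remapper. A rough, hypothetical sketch of the idea - helper names, the PTE format and the zero "not set" sentinel are illustrative, not the actual driver code:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pci.h>

/* Hypothetical bind path: prefer a DMA address TTM already produced
 * (e.g. from dma_alloc_coherent) over a fresh pci_map_page(), which
 * could silently land in a bounce buffer. */
static void example_gart_bind(struct pci_dev *pdev, u32 *gart_table,
			      unsigned long offset, unsigned long num_pages,
			      struct page **pages, dma_addr_t *dma_addrs)
{
	unsigned long i;

	for (i = 0; i < num_pages; i++) {
		dma_addr_t bus;

		if (dma_addrs && dma_addrs[i])	/* 0 == "not set" here */
			bus = dma_addrs[i];
		else
			bus = pci_map_page(pdev, pages[i], 0, PAGE_SIZE,
					   PCI_DMA_BIDIRECTIONAL);

		/* Made-up PTE format: real hardware has its own layout. */
		gart_table[offset + i] = lower_32_bits(bus) | 0x1;
	}
}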
However the PCI space with its BARs eats in the 4GB space, so if you have a 1GB region from 0xC0000->0x100000, then you only have 3GB left of DMA32 zone.
If I think of this as an accounting, and if the PCI space goes further down (say 0x40000, so from 2GB->4GB it is a E820 gap, and 0GB->2GB is System RAM with 4GB->6GB being the other System RAM, for a cumulative number of 4GB of memory in the machine), we would only have 2GB of DMA32 zone (The GFP_KERNEL zone is 4GB, while GFP_DMA32 zone is 2GB).
Then the answer is yes. However, wouldn't such device be 64-bit? And if they are 64-bit, then the TTM API wouldn't bother to allocate pages from the 32-bit region, right?
/Thomas
*) I think gem's flink still is vulnerable to this, though, so it
Is there a good test-case for this?
Not one in code. What you can do (for example in an OpenGL app) is to write some code that tries to flink with a guessed bo name until it succeeds. Then repeatedly, from within the app, try to flink the same name until something crashes. I don't think the Linux OOM killer can handle that situation. Should be fairly easy to put together.
Uhhh, OK, you just flew over what I know about graphics. Let me research this a bit more.
/Thomas
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device right? (We are talking about the discrete card GART, not the poor man AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
GART is not a PCI BAR; it's just a remapper for system pages. On radeon GPUs at least there is a memory controller with 3 programmable apertures: vram, internal gart, and agp gart. You can map these
To access it, ie, to program it, you would need to access the PCIe card MMIO regions, right? So that would be considered in PCI BAR space?
resources wherever you want in the GPU's address space and then the memory controller takes care of the translation to off-board resources like gart pages. On-chip memory clients (display controllers, texture blocks, render blocks, etc.) write to internal GPU addresses. The GPU has its own direct connection to vram, so that's not an issue. For AGP, the GPU specifies aperture base and size, and you point it to the bus address of the gart aperture provided by the northbridge's AGP controller. For internal gart, the GPU has a page table stored in
I think we are just talking about the GART on the GPU, not the old AGP GART.
either vram or uncached system memory depending on the asic. It provides a contiguous linear aperture to GPU clients and the memory controller translates the transactions to the backing pages via the pagetable.
So I think I misunderstood what is meant by 'huge gart'. That sounds like the linear address space provided by the GPU. And hooking up a lot of coherent memory (so System RAM) to that linear address space would be no different than what is currently being done. When you allocate memory using alloc_page(GFP_DMA32) and hook up that memory to the linear space you exhaust the same amount of ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same pool, except that doing it from the PCI API gets you the bus address right away.
On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device right? (We are talking about the discrete card GART, not the poor man AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
GART is not a PCI BAR; it's just a remapper for system pages. On radeon GPUs at least there is a memory controller with 3 programmable apertures: vram, internal gart, and agp gart. You can map these
To access it, ie, to program it, you would need to access the PCIe card MMIO regions, right? So that would be considered in PCI BAR space?
Yes, you need access to the MMIO aperture to configure the GPU. I was thinking you meant something akin to the framebuffer BAR, only for gart space, which is not the case.
resources wherever you want in the GPU's address space and then the memory controller takes care of the translation to off-board resources like gart pages. On-chip memory clients (display controllers, texture blocks, render blocks, etc.) write to internal GPU addresses. The GPU has its own direct connection to vram, so that's not an issue. For AGP, the GPU specifies aperture base and size, and you point it to the bus address of the gart aperture provided by the northbridge's AGP controller. For internal gart, the GPU has a page table stored in
I think we are just talking about the GART on the GPU, not the old AGP GART.
Ok. I just mentioned it for completeness.
either vram or uncached system memory depending on the asic. It provides a contiguous linear aperture to GPU clients and the memory controller translates the transactions to the backing pages via the pagetable.
So I think I misunderstood what is meant by 'huge gart'. That sounds like the linear address space provided by the GPU. And hooking up a lot of coherent memory (so System RAM) to that linear address space would be no different than what is currently being done. When you allocate memory using alloc_page(GFP_DMA32) and hook up that memory to the linear space you exhaust the same amount of ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same pool, except that doing it from the PCI API gets you the bus address right away.
In this case GPU clients refers to the hw blocks on the GPU; they are the ones that see the contiguous linear aperture. From the application's perspective, gart memory looks like any other pages.
Alex
On Tue, Jan 11, 2011 at 01:12:57PM -0500, Alex Deucher wrote:
On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device right? (We are talking about the discrete card GART, not the poor man AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
GART is not a PCI BAR; it's just a remapper for system pages. On radeon GPUs at least there is a memory controller with 3 programmable apertures: vram, internal gart, and agp gart. You can map these
To access it, ie, to program it, you would need to access the PCIe card MMIO regions, right? So that would be considered in PCI BAR space?
Yes, you need access to the MMIO aperture to configure the GPU. I was thinking you meant something akin to the framebuffer BAR, only for gart space, which is not the case.
Aaah, gotcha.
resources wherever you want in the GPU's address space and then the memory controller takes care of the translation to off-board resources like gart pages. On-chip memory clients (display controllers, texture blocks, render blocks, etc.) write to internal GPU addresses. The GPU has its own direct connection to vram, so that's not an issue. For AGP, the GPU specifies aperture base and size, and you point it to the bus address of the gart aperture provided by the northbridge's AGP controller. For internal gart, the GPU has a page table stored in
I think we are just talking about the GART on the GPU, not the old AGP GART.
Ok. I just mentioned it for completeness.
<nods>
either vram or uncached system memory depending on the asic. It provides a contiguous linear aperture to GPU clients and the memory controller translates the transactions to the backing pages via the pagetable.
So I think I misunderstood what is meant by 'huge gart'. That sounds like the linear address space provided by the GPU. And hooking up a lot of coherent memory (so System RAM) to that linear address space would be no different than what is currently being done. When you allocate memory using alloc_page(GFP_DMA32) and hook up that memory to the linear space you exhaust the same amount of ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same pool, except that doing it from the PCI API gets you the bus address right away.
In this case GPU clients refers to the hw blocks on the GPU; they are the ones that see the contiguous linear aperture. From the application's perspective, gart memory looks like any other pages.
<nods>. Those 'hw blocks' or 'gart memory' are in reality just pages received via 'alloc_page()' (before this patchset and also after this patchset). Oh wait, these 'hw blocks' or 'gart memory' can also refer to the VRAM memory, right? In which case that is not memory allocated via 'alloc_page' but using a different mechanism. Is TTM used then? If so, how do you stick those VRAM pages under its accounting rules? Or do the drivers use some other mechanism for that, dependent on each driver?
On Tue, Jan 11, 2011 at 1:28 PM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
On Tue, Jan 11, 2011 at 01:12:57PM -0500, Alex Deucher wrote:
On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk konrad.wilk@oracle.com wrote:
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
So the GART is in the PCI space in one of the BARs of the device right? (We are talking about the discrete card GART, not the poor man AMD IOMMU?) The PCI space is under the 4GB, so it would be considered coherent by definition.
GART is not a PCI BAR; it's just a remapper for system pages. On radeon GPUs at least there is a memory controller with 3 programmable apertures: vram, internal gart, and agp gart. You can map these
To access it, ie, to program it, you would need to access the PCIe card MMIO regions, right? So that would be considered in PCI BAR space?
Yes, you need access to the MMIO aperture to configure the GPU. I was thinking you meant something akin to the framebuffer BAR, only for gart space, which is not the case.
Aaah, gotcha.
resources wherever you want in the GPU's address space and then the memory controller takes care of the translation to off-board resources like gart pages. On-chip memory clients (display controllers, texture blocks, render blocks, etc.) write to internal GPU addresses. The GPU has its own direct connection to vram, so that's not an issue. For AGP, the GPU specifies aperture base and size, and you point it to the bus address of the gart aperture provided by the northbridge's AGP controller. For internal gart, the GPU has a page table stored in
I think we are just talking about the GART on the GPU, not the old AGP GART.
Ok. I just mentioned it for completeness.
<nods>
either vram or uncached system memory depending on the asic. It provides a contiguous linear aperture to GPU clients and the memory controller translates the transactions to the backing pages via the pagetable.
So I think I misunderstood what is meant by 'huge gart'. That sounds like the linear address space provided by the GPU. And hooking up a lot of coherent memory (so System RAM) to that linear address space would be no different than what is currently being done. When you allocate memory using alloc_page(GFP_DMA32) and hook up that memory to the linear space you exhaust the same amount of ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same pool, except that doing it from the PCI API gets you the bus address right away.
In this case GPU clients refers to the hw blocks on the GPU; they are the ones that see the contiguous linear aperture. From the application's perspective, gart memory looks like any other pages.
<nods>. Those 'hw blocks' or 'gart memory' are in reality just pages received via 'alloc_page()' (before this patchset and also after this patchset). Oh wait, these 'hw blocks' or 'gart memory' can also refer to the VRAM memory, right? In which case that is not memory allocated via 'alloc_page' but using a different mechanism. Is TTM used then? If so, how do you stick those VRAM pages under its accounting rules? Or do the drivers use some other mechanism for that, dependent on each driver?
"hw blocks" refers to the different sections of the GPU (texture fetchers, render targets, display controllers), not memory buffers. E.g., if you want to read a texture from vram or gart, you'd program the texture base address to the address of the texture in the GPU's address space. E.g., you might map 512 MB of vram at from 0x00000000 and a 512 MB gart aperture at 0x20000000 in the GPU's address space. If you have a texture at the start of vram, you'd program the texture base address to 0x0000000 or if it was at the start of the gart aperture, you'd program it to 0x20000000. To the GPU, the gart looks like a linear array, but to everything else (driver, apps), it's just pages. The driver manages vram access using ttm. The GPU has access to the entire amount of vram directly, but the CPU can only access it via the PCI framebuffer BAR. On systems with more vram than framebuffer BAR space, the CPU can only access buffers in the region covered by the BAR (usually the first 128 or 256 MB of vram depending on the BAR). For the CPU to access a buffer in vram, the GPU has to move it to the area covered by the BAR or to gart memory.
Alex
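For reference, the CPU-visible vram window described above is reached through the framebuffer BAR; a hedged sketch of mapping it (the BAR index is an assumption for illustration - real drivers know their own layout):

#include <linux/pci.h>
#include <linux/io.h>

static void __iomem *example_map_fb_bar(struct pci_dev *pdev)
{
	/* BAR 0 is only an assumption for this sketch. */
	resource_size_t base = pci_resource_start(pdev, 0);
	resource_size_t size = pci_resource_len(pdev, 0);

	/* Only 'size' bytes of vram are CPU-visible; buffers above that
	 * must be migrated into this window (or into gart) before the CPU
	 * can touch them, as described above. */
	return ioremap_wc(base, size);
}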
Hi, Konrad.
This discussion has become a bit lengthy. I'll filter out the sorted-out stuff, which leaves me with two remaining issues:
On 01/11/2011 04:55 PM, Konrad Rzeszutek Wilk wrote:
So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from 0->4GB. When you start allocating coherent memory from each guest (and yeah, say we use 2GB each), we end up with the first guest getting the 2GB, the second getting 1.7GB, and then the next two getting zil.
You still have GFP_KERNEL memory in each guest - the first one has 2GB left , then second 2.3, the next two have each 4GB.
From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
is the 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more guest (but without PCI passthrough devices).
On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
So host in this case refers to the Hypervisor and it does not care about the DMA or what - it does not have any device drivers(*) or such. The first guest (dom0) is the one that deals with the device drivers.
*: It has one: the serial port, but that is not really that important for this discussion.
Let's assume we're at the point where the hypervisor (or host) has exhausted the 0-4GB zone due to the guests' coherent memory allocations, and that the physical machine has 4GB of memory, all in the 0-4GB zone. Now if the hypervisor was running on a Linux kernel, there would be no more GFP_KERNEL memory available on the *host* (hypervisor), and the hypervisor would crash. Now I don't know much about Xen, but it might be that this is not a problem with Xen at all?
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
I need to be more specific. Let's assume we're on "bare metal", and we want to allocate 4GB of coherent memory. For most IOMMUs that would mean, as you previously stated, that we actually allocate GFP_DMA32 memory. But for some IOMMUs that would perhaps mean that we allocate *any* memory and set up a permanent DMA mapping in the IOMMU for the coherent pages. What if, in such a case, the IOMMU can only set up 2GB of coherent memory?
Or in short, can there *ever* be "bare metal" cases where the amount of coherent memory available is less than DMA32 memory available?
Thanks, Thomas
On Wed, Jan 12, 2011 at 10:12:14AM +0100, Thomas Hellstrom wrote:
Hi, Konrad.
This discussion has become a bit lengthy. I'll filter out the sorted-out stuff, which leaves me with two remaining issues:
<nods>
On 01/11/2011 04:55 PM, Konrad Rzeszutek Wilk wrote:
So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from 0->4GB. When you start allocating coherent memory from each guest (and yeah, say we use 2GB each), we end up with the first guest getting the 2GB, the second getting 1.7GB, and then the next two getting zil.
You still have GFP_KERNEL memory in each guest - the first one has 2GB left , then second 2.3, the next two have each 4GB.
From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
is the 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more guest (but without PCI passthrough devices).
On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
So host in this case refers to the Hypervisor and it does not care about the DMA or what - it does not have any device drivers(*) or such. The first guest (dom0) is the one that deals with the device drivers.
*: It has one: the serial port, but that is not really that important for this discussion.
Let's assume we're at the point where the hypervisor (or host) has exhausted the 0-4GB zone due to the guests' coherent memory allocations, and that the physical machine has 4GB of memory, all in the 0-4GB zone. Now if the hypervisor was running on a Linux kernel, there would be no more GFP_KERNEL memory available on the *host* (hypervisor), and the hypervisor would crash. Now I don't know much about Xen, but it might be that this is not a problem with Xen at all?
It will have no problem. It allocates at boot all the memory it needs and won't get bigger (or smaller) after that.
Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?
<scratches his head>
I need to be more specific. Let's assume we're on "bare metal", and we want to allocate 4GB of coherent memory. For most IOMMUs that would mean, as you previously stated, that we actually allocate GFP_DMA32 memory. But for some IOMMUs that would perhaps mean that we allocate *any* memory and set up a permanent DMA mapping in the IOMMU for the coherent pages. What if, in such a case, the IOMMU can only set up 2GB of coherent memory?
Or in short, can there *ever* be "bare metal" cases where the amount of coherent memory available is less than DMA32 memory available?
There is no such case where the amount of coherent memory is less than DMA32 memory. [Unless the IOMMU has some chipset problem where it can't map 2^31 -> 2^32 addresses, but that is not something we should worry about.]
On Wed, Jan 12, 2011 at 10:19:39AM -0500, Konrad Rzeszutek Wilk wrote:
On Wed, Jan 12, 2011 at 10:12:14AM +0100, Thomas Hellstrom wrote:
Hi, Konrad.
This discussion has become a bit lenghty. I'll filter out the sorted-out stuff, which leaves me with two remaining issues:
<nods>
ping?
... did I miss some extra discussion?
Konrad, Dave
Given our previous discussion on the list, I believe these patches shouldn't introduce any regressions in the non-Xen case; however, we should probably be prepared to back them out quickly if regressions do turn up...
Thanks, Thomas
On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
Attached is a set of patches that make it possible for drivers using TTM API (nouveau and radeon graphic drivers) to work under Xen. The explanation is a bit complex and I am not sure if I am explaining it that well..so if something is unclear please do ping me.
Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
- Cleaned up commit message (forgot to add my SoB).
Short explanation of problem: What we are hitting under Xen is that instead of programming the GART with the physical DMA address of the TTM page, we end up programming the bounce buffer DMA (bus) address!
Long explanation: The reason we end up doing this is that:
1). alloc_page with GFP_DMA32 does not allocate "real" (Machine Frame Numbers - MFN) under the 4GB in Xen. That is b/c if actually made the pages underneath 4GB available to the the Linux page allocator we would not be able to give those to other guest devices. This would mean if we tried to pass in a USB device to one guest and in another were running the Xorg server we wouldn't be able to do so as we would run out of pages under 4GB. So pages that we get from alloc_page have a PFN that is under 4GB but in reality the real physical address (MFN) is above 4GB. Ugh..
2). The backends for "struct ttm_backend_func" utilize the PCI API. When they get a page allocated via alloc_page, the use 'pci_map_page' and program the DMA (bus) address in the GART - which is correct. But then the calls that kick off the graphic driver to process the pages do not use the pci_page_sync_* calls. If the physical address of the page is the same as the DMA bus address returned from pci_map_page then there are no trouble. But if they are different: virt_to_phys(page_address(p)) != pci_map_page(p,..) then the graphic card fetches data from the DMA (bus) address (so the value returned from pci_map_page). The data however that the user wrote to (the page p) ends up being untouched. You are probably saying: "What? Nonsense, we stitch the page in ttm_bo_vm_fault using the PFN and .. and even if the virt_to_phys(page_address(p)) != pci_map_page(p) the GART ends up with the bus (DMA) address of the PFN!" That is true. But if you combine this with 1) where you end up with page that is above the dma_mask (even if you called it with GFP_DMA32) and then make a call on pci_map_page you would end up with a bounce buffer!
The problem above can be easily reproduced on bare-metal if you pass in "swiotlb=force iommu=soft".
There are two ways of fixing this:
1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is struct pcidev present), instead of alloc_page for GFP_DMA32. The 'dma_alloc_coherent' guarantees that the allocated page fits within the device dma_mask (or uses the default DMA32 if no device is passed in). This also guarantees that any subsequent call to the PCI API for this page will return the same DMA (bus) address as the first call (so pci_alloc_consistent, and then pci_map_page will give the same DMA bus address).
2). Use the pci_sync_range_* after sending a page to the graphics engine. If the bounce buffer is used then we end up copying the pages.
3). This one I really don't want to think about, but I should mention it. Alter the alloc_page and its backend to know about Xen. The pushback from the MM guys will be probably: "why not use the PCI API.."
So attached is a draft set of patches that use solution #1. I've tested this on baremetal (and Xen) on PCIe Radeon and nouveau cards with success.
- The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
with modifying the TTM API to pass in 'struct device' or 'struct pci_device' but figured it would make first sense to get your guys input before heading that route.
This patch-set is also available at:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v3
Konrad Rzeszutek Wilk (5):
  ttm: Introduce a placeholder for DMA (bus) addresses.
  ttm: Utilize the dma_addr_t array for pages that are to be in DMA32 pool.
  ttm: Expand (*populate) to support an array of DMA addresses.
  radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  nouveau/ttm/PCIe: Use dma_addr if TTM has set it.
And diffstat:
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |   31 +++++++++++++++++++-------
 drivers/gpu/drm/radeon/radeon.h         |    4 ++-
 drivers/gpu/drm/radeon/radeon_gart.c    |   36 ++++++++++++++++++++++--------
 drivers/gpu/drm/radeon/radeon_ttm.c     |    8 +++++-
 drivers/gpu/drm/ttm/ttm_agp_backend.c   |    3 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c    |   35 +++++++++++++++++++++++++-----
 drivers/gpu/drm/ttm/ttm_tt.c            |   12 +++++++--
 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c  |    3 +-
 include/drm/ttm/ttm_bo_driver.h         |    6 ++++-
 include/drm/ttm/ttm_page_alloc.h        |    8 +++++-
 10 files changed, 111 insertions(+), 35 deletions(-)
On Thu, Jan 27, 2011 at 10:28:45AM +0100, Thomas Hellstrom wrote:
Konrad, Dave
Given our previous discussion on the list, I believe these patches shouldn't introduce any regressions in the non-Xen case; however, we should probably be prepared to back them out quickly if that turns out to be the case...
Absolutely. I've rebased the patches with your review comments (though I am not seeing the indentation issues in the source code - not sure why?).
Stuck the patches on
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/ttm.pci-api.v4
Let me ask some of the radeon developers to take a look at the "radeon/ttm/PCIe: Use dma_addr if TTM has set it." patch.
Thank you very much for taking a look at the patches!
On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
- The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying with modifying the TTM API to pass in 'struct device' or 'struct pci_device' but figured it would make sense to first get your input before heading down that route.
It's worse than unsightly: it breaks TTM on PPC. See arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if NULL is passed in for the device, and most of its callers BUG in that case. The exception is dma_supported(), so a possible solution might be to use that to check whether dma_alloc_coherent can be used.
Dave, please prevent this change from entering mainline before there's a solution for this.
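A minimal sketch of that check (the helper name and the 32-bit mask are assumptions on my part, not code from the patches):

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Probe with dma_supported() before ever calling dma_alloc_coherent():
 * on PPC, get_dma_ops(NULL) is NULL and most other DMA calls would BUG(),
 * while dma_supported() simply returns 0 in that case. */
static bool ttm_example_can_use_coherent(struct device *dev)
{
	return dma_supported(dev, DMA_BIT_MASK(32)) != 0;
}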
On Mon, Mar 21, 2011 at 02:11:07PM +0100, Michel Dänzer wrote:
It's worse than unsightly: It breaks TTM on PPC. See arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if NULL is passed in for the device, and most of its callers BUG in that case. The exception being dma_supported(), so a possible solution might be to use that for checking if dma_alloc_coherent can be used.
Dave, please prevent this change from entering mainline before there's a solution for this.
We do have a fix for it:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5
Can you tell me if that works for you?
On Mon, 2011-03-21 at 19:18 -0400, Konrad Rzeszutek Wilk wrote:
We do have a fix for it:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5
Can you tell me if that works for you?
Yeah, http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=02bb... and http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=7333... fix the problem.
On Tue, Mar 22, 2011 at 02:13:19PM +0100, Michel Dänzer wrote:
Yeah, http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=02bb... and http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=7333... fix the problem.
Whew. Good! What kind of hardware do you have that triggered this? When I implemented this patchset I hadn't thought about PPC because, well, I didn't have the hardware, nor did I think there were any ATI/Nvidia cards that worked with it. Can you give me an idea of the type of hardware this is and where I could purchase it?
Much appreciated and thank you for testing this!
On Tue, 2011-03-22 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
Wheew. Good! What kind of hardware do you have that triggered this? When I implemented this patchset I hadn't thought about PPC b/c..well I didn't have the hardware nor did I think there were any ATI/Nvidia cards that worked with it. Can you give me an idea of type of hardware this is and where I could purchase it?
It's an Apple PowerBook with a Mobility Radeon 9700; to buy one new you'd have to go back in time a couple of years first. ;) You should be able to use more recent GPUs in a second-hand Apple G5 or a YDL PowerStation, which also have reasonably fast CPUs.